
Class QLoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

QLoRA (Quantized LoRA) adapter for parameter-efficient fine-tuning with 4-bit quantized base weights.

public class QLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → QLoRAAdapter<T>
Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable

Remarks

QLoRA extends the LoRA (Low-Rank Adaptation) technique by quantizing the base layer's weights to 4-bit precision while keeping the LoRA adapter matrices (A and B) in full precision. This achieves dramatic memory savings (typically 4x reduction) while maintaining training quality comparable to full 16-bit fine-tuning.

Key Features:

  • Base layer weights stored in 4-bit precision (INT4 or NF4)
  • LoRA matrices (A and B) remain in full precision for accurate gradient updates
  • Double quantization of the quantization constants (further memory savings)
  • Support for paged optimizers to handle memory spikes during training
  • Dequantization happens on-the-fly during the forward pass

Memory Savings: For a typical transformer layer with 1000x1000 weights:

  • Standard 16-bit: 2MB for weights
  • QLoRA 4-bit base: 0.5MB for base weights + full-precision LoRA matrices (e.g., 32KB for rank 8)
  • Total savings: ~75% memory reduction on the base weights
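These figures can be checked with simple arithmetic. The snippet below is illustrative only and is not part of the library API:

// Illustrative arithmetic for a 1000x1000 weight matrix (not library code).
const long weightCount = 1000L * 1000L;

long fp16Bytes = weightCount * 2;             // 2 bytes per weight  -> 2,000,000 B ≈ 2 MB
long int4Bytes = weightCount / 2;             // 4 bits per weight   ->   500,000 B ≈ 0.5 MB

// Rank-8 LoRA: A is 8x1000 and B is 1000x8, stored in 16-bit precision.
long loraBytes = (8 * 1000 + 1000 * 8) * 2;   // 32,000 B ≈ 32 KB

double savings = 1.0 - (double)int4Bytes / fp16Bytes;  // 0.75 -> ~75% less base-weight memory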

Quantization Types:

  • INT4: Uniform 4-bit integer quantization (-8 to 7)
  • NF4 (4-bit NormalFloat): Information-theoretically optimal for normally distributed weights
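To make the block-size idea concrete, here is a minimal sketch of blockwise INT4 absmax quantization in the spirit described above. It is illustrative only, uses hypothetical helper names, and does not reflect the library's internal storage format (assumes using System;):

// Minimal sketch of blockwise INT4 absmax quantization (illustrative only).
// Each block of `blockSize` weights is scaled by its own absolute maximum,
// then rounded to a signed 4-bit code in [-8, 7].
static (sbyte[] Codes, float[] Scales) QuantizeInt4(float[] weights, int blockSize = 64)
{
    int blocks = (weights.Length + blockSize - 1) / blockSize;
    var codes = new sbyte[weights.Length];
    var scales = new float[blocks];              // one full-precision constant per block

    for (int b = 0; b < blocks; b++)
    {
        int start = b * blockSize;
        int end = Math.Min(start + blockSize, weights.Length);

        float absMax = 1e-8f;
        for (int i = start; i < end; i++) absMax = Math.Max(absMax, Math.Abs(weights[i]));
        scales[b] = absMax / 7f;                 // map [-absMax, absMax] onto [-7, 7]

        for (int i = start; i < end; i++)
            codes[i] = (sbyte)Math.Clamp((int)Math.Round(weights[i] / scales[b]), -8, 7);
    }
    return (codes, scales);
}

// Dequantization is just code * scale for the block the weight belongs to.
static float Dequantize(sbyte code, float scale) => code * scale;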

For Beginners: QLoRA is an advanced technique that makes fine-tuning large models even more memory-efficient than standard LoRA. Here's how it works:

Imagine you have a huge model with millions of parameters:

  • Standard LoRA: Freezes the base model, trains small adapters (huge memory savings)
  • QLoRA: Does the same BUT also compresses the base model to 4-bit (even more savings!)

Think of it like storing a high-resolution image:

  • Original model: Full 16-bit floating point (2 bytes per number)
  • QLoRA base: Compressed to 4-bit (0.5 bytes per number)
  • LoRA adapters: Still full precision (for accurate learning)

The result: You can fine-tune models 4x larger on the same hardware, or use 4x less GPU memory!

When to use QLoRA vs Standard LoRA:

  • Use QLoRA when: GPU memory is very limited or the model is too large to fine-tune otherwise
  • Use Standard LoRA when: Memory is not a constraint and maximum accuracy is needed
  • Both achieve similar quality in practice; QLoRA just uses less memory

Trade-offs:

  • Pros: ~75% less base-weight memory, training quality comparable to 16-bit LoRA, faster inference after merging
  • Cons: Slightly slower forward pass (dequantization overhead), more complex implementation

Research Background: QLoRA was introduced in "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023). It enables fine-tuning of 65B-parameter models on a single 48GB GPU by combining:

  1. 4-bit NormalFloat (NF4) quantization optimized for normally distributed weights
  2. Double quantization to reduce the memory footprint of quantization constants
  3. Paged optimizers to handle memory spikes during gradient checkpointing
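Double quantization simply applies a second round of quantization to the per-block scaling constants themselves. The sketch below shows the idea with a plain 8-bit second level; the names are hypothetical and the library's actual scheme may differ:

// Illustrative sketch of double quantization (names and details are hypothetical).
// First-level quantization produces one float scale per weight block; those scales
// are themselves quantized to 8-bit codes, with one float scale per group of blocks.
static (byte[] Codes, float[] OuterScales) QuantizeScales(float[] blockScales, int groupSize = 256)
{
    int groups = (blockScales.Length + groupSize - 1) / groupSize;
    var codes = new byte[blockScales.Length];
    var outer = new float[groups];

    for (int g = 0; g < groups; g++)
    {
        int start = g * groupSize;
        int end = Math.Min(start + groupSize, blockScales.Length);

        float max = 1e-8f;
        for (int i = start; i < end; i++) max = Math.Max(max, blockScales[i]);  // scales are non-negative
        outer[g] = max / 255f;

        for (int i = start; i < end; i++)
            codes[i] = (byte)Math.Clamp((int)Math.Round(blockScales[i] / outer[g]), 0, 255);
    }
    return (codes, outer);
}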

Constructors

QLoRAAdapter(ILayer<T>, int, double, QuantizationType, bool, int, bool)

Initializes a new QLoRA adapter wrapping an existing Dense or FullyConnected layer.

public QLoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, QLoRAAdapter<T>.QuantizationType quantizationType = QuantizationType.NF4, bool useDoubleQuantization = true, int quantizationBlockSize = 64, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The Dense or FullyConnected layer to adapt with QLoRA.

rank int

The rank of the LoRA decomposition.

alpha double

The LoRA scaling factor (defaults to rank if negative).

quantizationType QLoRAAdapter<T>.QuantizationType

The type of 4-bit quantization to use (default: NF4).

useDoubleQuantization bool

Whether to use double quantization for constants (default: true).

quantizationBlockSize int

The block size for quantization (default: 64).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training (default: true, recommended for QLoRA).

Remarks

The constructor quantizes the base layer's weights immediately to save memory. LoRA matrices are initialized normally and remain in full precision.

For Beginners: This creates a QLoRA adapter that wraps your existing layer.

Parameters explained:

  • baseLayer: The layer you want to compress and adapt (e.g., a Dense layer)
  • rank: The size of the LoRA decomposition; a lower rank means fewer trainable parameters
  • alpha: How strong the LoRA corrections are
  • quantizationType: NF4 (recommended) or INT4 (simpler but less accurate)
  • useDoubleQuantization: true (recommended) saves extra 3-5% memory
  • quantizationBlockSize: 64 (recommended) balances accuracy and memory
  • freezeBaseLayer: true (recommended) - only train the LoRA adapter, not the base weights

After construction, the base layer's weights are immediately compressed to 4-bit, freeing up 75% of the memory they were using!
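For example, wrapping an existing layer might look like the following. Only the QLoRAAdapter constructor arguments follow the signature documented above; the DenseLayer construction is a hypothetical placeholder for whatever layer you actually use:

// Hypothetical example: wrap an existing layer with a QLoRA adapter.
// The DenseLayer constructor shown here is illustrative; substitute your real layer instance.
ILayer<float> baseLayer = new DenseLayer<float>(inputSize: 1024, outputSize: 1024);

var adapter = new QLoRAAdapter<float>(
    baseLayer,
    rank: 8,                                          // small rank => few trainable parameters
    alpha: 16,                                        // scaling factor for the LoRA correction
    quantizationType: QLoRAAdapter<float>.QuantizationType.NF4,
    useDoubleQuantization: true,
    quantizationBlockSize: 64,
    freezeBaseLayer: true);                           // train only the LoRA matrices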

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

ArgumentException

Thrown when the base layer doesn't have 1D input/output shapes or when block size is invalid.

Properties

BlockSize

Gets the quantization block size.

public int BlockSize { get; }

Property Value

int

Quantization

Gets the quantization type used for base layer weights.

public QLoRAAdapter<T>.QuantizationType Quantization { get; }

Property Value

QLoRAAdapter<T>.QuantizationType

UsesDoubleQuantization

Gets whether double quantization is enabled.

public bool UsesDoubleQuantization { get; }

Property Value

bool

Methods

Backward(Tensor<T>)

Performs the backward pass through both layers (only updates LoRA if base is frozen).

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

For QLoRA, the base layer is typically frozen (only LoRA is trained). The backward pass:

  1. Computes gradients for the LoRA layer (always)
  2. Skips base layer gradient computation (if frozen)
  3. Propagates input gradients back

For Beginners: This is where learning happens, but only for the LoRA adapter! Since the base layer is compressed and frozen, we only update the small LoRA matrices. This is what makes QLoRA so efficient - we're only training a tiny fraction of parameters.
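In standard LoRA notation this works out as follows (a sketch of the usual math, not necessarily the exact internal formulation). Let W be the dequantized base weights, A and B the LoRA matrices, x the input, g the gradient of the loss with respect to the output, and s = alpha / rank:

$$
y = W x + s\,B A x
$$
$$
\frac{\partial L}{\partial B} = s\, g\,(A x)^{\top},\qquad
\frac{\partial L}{\partial A} = s\, B^{\top} g\, x^{\top},\qquad
\frac{\partial L}{\partial x} = W^{\top} g + s\, A^{\top} B^{\top} g
$$

When the base layer is frozen, no gradient is accumulated for W; only the small A and B matrices are updated.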

Forward(Tensor<T>)

Performs the forward pass through both quantized base and LoRA layers.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Sum of dequantized base layer output and LoRA output.

Remarks

The forward pass:

  1. Dequantizes base layer weights (if not already cached)
  2. Computes the base layer output with the dequantized weights
  3. Computes the LoRA layer output (full precision)
  4. Returns the sum of both outputs

For Beginners: This is where we use the compressed model for prediction. The steps are:

  1. Decompress the base weights from 4-bit to full precision
  2. Run the input through the decompressed base layer
  3. Run the input through the LoRA adapter (always full precision)
  4. Add the results together

The decompression happens automatically - from the outside, it looks like a normal layer!
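In equation form (a sketch in standard LoRA notation, where W_q denotes the stored 4-bit weights and s = alpha / rank):

$$
y = \operatorname{dequant}(W_q)\,x + s\, B A x
$$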

MergeToOriginalLayer()

Merges the LoRA adaptation into the base layer and returns a quantized merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new DenseLayer with LoRA weights merged and quantized.

Remarks

This method:

  1. Dequantizes the base weights
  2. Computes the LoRA weight contribution
  3. Merges them together
  4. Creates a new layer with the merged weights
  5. Optionally re-quantizes for deployment

For Beginners: This "bakes in" your LoRA training into a single compressed layer. After training, you can: 1. Decompress the base weights 2. Add the LoRA corrections 3. Create a new layer with the improved weights 4. Optionally compress it again for deployment

The result is a single layer that includes all the improvements from training, ready to use in production!
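The merged weights correspond to (again a sketch in standard LoRA notation, not necessarily the exact internal formula, with W_q the stored 4-bit weights and s = alpha / rank):

$$
W_{\text{merged}} = \operatorname{dequant}(W_q) + s\, B A
$$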

Exceptions

InvalidOperationException

Thrown when the base layer type is not DenseLayer or FullyConnectedLayer.

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

Remarks

Clears cached dequantized weights and resets both base and LoRA layers.

For Beginners: This clears the adapter's memory, including any cached decompressed weights. Useful when starting a new batch or switching tasks.