Class QLoRAAdapter<T>
QLoRA (Quantized LoRA) adapter for parameter-efficient fine-tuning with 4-bit quantized base weights.
public class QLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → QLoRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Remarks
QLoRA extends the LoRA (Low-Rank Adaptation) technique by quantizing the base layer's weights to 4-bit precision while keeping the LoRA adapter matrices (A and B) in full precision. This achieves dramatic memory savings (typically 4x reduction) while maintaining training quality comparable to full 16-bit fine-tuning.
Key Features:
- Base layer weights stored in 4-bit precision (INT4 or NF4)
- LoRA matrices (A and B) remain in full precision for accurate gradient updates
- Double quantization of the quantization constants (further memory savings)
- Support for paged optimizers to handle memory spikes during training
- Dequantization happens on-the-fly during the forward pass
Memory Savings: For a typical transformer layer with 1000x1000 weights:
- Standard 16-bit: 2 MB for weights
- QLoRA 4-bit base: 0.5 MB for base weights + full-precision LoRA (e.g., 32 KB for rank 8)
- Total savings: ~75% memory reduction on base weights
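A quick sanity check of these figures (illustrative arithmetic only; it assumes 2 bytes per 16-bit value and 2 bytes per LoRA parameter, matching the example above):

using System;

const long rows = 1000, cols = 1000, rank = 8;
long fp16Bytes = rows * cols * 2;                  // 2,000,000 B ≈ 2 MB
long int4Bytes = rows * cols / 2;                  //   500,000 B ≈ 0.5 MB (two 4-bit weights per byte)
long loraBytes = (rows * rank + rank * cols) * 2;  //    32,000 B ≈ 32 KB for rank 8
Console.WriteLine($"16-bit: {fp16Bytes} B, QLoRA base + LoRA: {int4Bytes + loraBytes} B");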
Quantization Types:
- INT4: Uniform 4-bit integer quantization (-8 to 7)
- NF4 (4-bit NormalFloat): Information-theoretically optimal for normally distributed weights
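To make the two schemes concrete, here is a minimal sketch of block-wise 4-bit absmax quantization (a simplified illustration, not the library's implementation; the helper names are hypothetical):

using System;

// Each block of weights stores its 4-bit codes plus one full-precision scale.
static (byte[] codes, float[] scales) QuantizeInt4(float[] weights, int blockSize = 64)
{
    int blocks = (weights.Length + blockSize - 1) / blockSize;
    var codes = new byte[weights.Length];   // one 4-bit code per weight (stored unpacked here)
    var scales = new float[blocks];

    for (int b = 0; b < blocks; b++)
    {
        int start = b * blockSize;
        int end = Math.Min(start + blockSize, weights.Length);

        // Absmax scaling maps the block's range onto the signed 4-bit grid [-8, 7].
        float absMax = 1e-8f;
        for (int i = start; i < end; i++) absMax = Math.Max(absMax, Math.Abs(weights[i]));
        scales[b] = absMax / 7f;

        for (int i = start; i < end; i++)
        {
            int q = (int)Math.Round(weights[i] / scales[b]);
            codes[i] = (byte)(Math.Clamp(q, -8, 7) + 8);  // shift to 0..15 for storage
        }
    }
    return (codes, scales);
}

// Dequantization reverses the mapping. NF4 would instead index a fixed 16-entry
// codebook of normal-distribution quantiles and multiply by the block's absmax.
static float DequantizeInt4(byte code, float scale) => (code - 8) * scale;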
For Beginners: QLoRA is an advanced technique that makes fine-tuning large models even more memory-efficient than standard LoRA. Here's how it works:
Imagine you have a huge model with millions of parameters:
- Standard LoRA: Freezes the base model, trains small adapters (huge memory savings)
- QLoRA: Does the same BUT also compresses the base model to 4-bit (even more savings!)
Think of it like storing a high-resolution image:
- Original model: Full 16-bit floating point (2 bytes per number)
- QLoRA base: Compressed to 4-bit (0.5 bytes per number)
- LoRA adapters: Still full precision (for accurate learning)
The result: You can fine-tune models 4x larger on the same hardware, or use 4x less GPU memory!
When to use QLoRA vs Standard LoRA:
- Use QLoRA when: GPU memory is very limited or the model is too large to fine-tune otherwise
- Use Standard LoRA when: Memory is not a constraint and maximum accuracy is needed
- Both achieve similar quality in practice; QLoRA just uses less memory
Trade-offs:
- Pros: 75% less memory, same performance as 16-bit LoRA, faster inference after merging
- Cons: Slightly slower forward pass (dequantization overhead), more complex implementation
Research Background: QLoRA was introduced in "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023). It enables fine-tuning of 65B-parameter models on a single 48GB GPU by combining:
1. 4-bit NormalFloat (NF4) quantization optimized for normally distributed weights
2. Double quantization to reduce the memory footprint of the quantization constants
3. Paged optimizers to handle memory spikes during gradient checkpointing
Constructors
QLoRAAdapter(ILayer<T>, int, double, QuantizationType, bool, int, bool)
Initializes a new QLoRA adapter wrapping an existing Dense or FullyConnected layer.
public QLoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, QLoRAAdapter<T>.QuantizationType quantizationType = QuantizationType.NF4, bool useDoubleQuantization = true, int quantizationBlockSize = 64, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The Dense or FullyConnected layer to adapt with QLoRA.
rank (int): The rank of the LoRA decomposition.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
quantizationType (QLoRAAdapter<T>.QuantizationType): The type of 4-bit quantization to use (default: NF4).
useDoubleQuantization (bool): Whether to use double quantization for constants (default: true).
quantizationBlockSize (int): The block size for quantization (default: 64).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training (default: true, recommended for QLoRA).
Remarks
The constructor quantizes the base layer's weights immediately to save memory. LoRA matrices are initialized normally and remain in full precision.
For Beginners: This creates a QLoRA adapter that wraps your existing layer.
Parameters explained:
- baseLayer: The layer you want to compress and adapt (e.g., a Dense layer)
- rank: Controls how many trainable parameters the LoRA adapter uses (lower = more efficient)
- alpha: How strong the LoRA corrections are
- quantizationType: NF4 (recommended) or INT4 (simpler but less accurate)
- useDoubleQuantization: true (recommended) saves an extra 3-5% of memory
- quantizationBlockSize: 64 (recommended) balances accuracy and memory
- freezeBaseLayer: true (recommended) - only train the LoRA adapter, not the base weights
After construction, the base layer's weights are immediately compressed to 4-bit, freeing up 75% of the memory they were using!
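A usage sketch (the DenseLayer constructor shown here is assumed for illustration; check the actual layer's API):

// Hypothetical base layer; only the QLoRAAdapter constructor arguments below are documented here.
var baseLayer = new DenseLayer<float>(inputSize: 1024, outputSize: 1024);

// Wrap the layer: base weights are quantized to NF4 immediately, LoRA stays full precision.
var adapter = new QLoRAAdapter<float>(
    baseLayer,
    rank: 8,
    alpha: 16,
    quantizationType: QLoRAAdapter<float>.QuantizationType.NF4,
    useDoubleQuantization: true,
    quantizationBlockSize: 64,
    freezeBaseLayer: true);
// From here on, only the LoRA matrices receive gradient updates.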
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when the base layer doesn't have 1D input/output shapes or when block size is invalid.
Properties
BlockSize
Gets the quantization block size.
public int BlockSize { get; }
Property Value
- int
Quantization
Gets the quantization type used for base layer weights.
public QLoRAAdapter<T>.QuantizationType Quantization { get; }
Property Value
- QLoRAAdapter<T>.QuantizationType
UsesDoubleQuantization
Gets whether double quantization is enabled.
public bool UsesDoubleQuantization { get; }
Property Value
- bool
Methods
Backward(Tensor<T>)
Performs the backward pass through both layers (only updates LoRA if base is frozen).
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradientTensor<T>Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
For QLoRA, the base layer is typically frozen (only LoRA is trained). The backward pass:
1. Computes gradients for the LoRA layer (always)
2. Skips base layer gradient computation (if frozen)
3. Propagates input gradients back
For Beginners: This is where learning happens, but only for the LoRA adapter! Since the base layer is compressed and frozen, we only update the small LoRA matrices. This is what makes QLoRA so efficient - we're only training a tiny fraction of parameters.
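For reference, a toy-sized sketch of the gradient math with plain float arrays (illustrative only, not the library's implementation). Here scaling = alpha / rank and Wq is the dequantized base weight matrix; note that no gradient is ever produced for Wq itself:

static (float[,] gradA, float[,] gradB, float[] gradX) QLoraBackward(
    float[] x, float[] gradY, float[,] Wq, float[,] A, float[,] B, float scaling)
{
    int inF = x.Length, outF = gradY.Length, rank = A.GetLength(0);

    // Reuse the forward intermediate A * x.
    var ax = new float[rank];
    for (int r = 0; r < rank; r++)
        for (int i = 0; i < inF; i++)
            ax[r] += A[r, i] * x[i];

    // dL/dB = scaling * gradY (A x)^T
    var gradB = new float[outF, rank];
    for (int o = 0; o < outF; o++)
        for (int r = 0; r < rank; r++)
            gradB[o, r] = scaling * gradY[o] * ax[r];

    // dL/dA = scaling * (B^T gradY) x^T
    var btg = new float[rank];
    for (int r = 0; r < rank; r++)
        for (int o = 0; o < outF; o++)
            btg[r] += B[o, r] * gradY[o];
    var gradA = new float[rank, inF];
    for (int r = 0; r < rank; r++)
        for (int i = 0; i < inF; i++)
            gradA[r, i] = scaling * btg[r] * x[i];

    // dL/dx = Wq^T gradY + scaling * A^T (B^T gradY); Wq only routes the gradient back.
    var gradX = new float[inF];
    for (int i = 0; i < inF; i++)
    {
        for (int o = 0; o < outF; o++) gradX[i] += Wq[o, i] * gradY[o];
        for (int r = 0; r < rank; r++) gradX[i] += scaling * A[r, i] * btg[r];
    }
    return (gradA, gradB, gradX);
}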
Forward(Tensor<T>)
Performs the forward pass through both quantized base and LoRA layers.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
inputTensor<T>Input tensor.
Returns
- Tensor<T>
Sum of dequantized base layer output and LoRA output.
Remarks
The forward pass:
1. Dequantizes base layer weights (if not already cached)
2. Computes the base layer output with the dequantized weights
3. Computes the LoRA layer output (full precision)
4. Returns the sum of both outputs
For Beginners: This is where we use the compressed model for prediction. The steps are:
1. Decompress the base weights from 4-bit to full precision
2. Run the input through the decompressed base layer
3. Run the input through the LoRA adapter (always full precision)
4. Add the results together
The decompression happens automatically - from the outside, it looks like a normal layer!
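A toy-sized sketch of this flow with plain float arrays (illustrative only; the real layer operates on Tensor<T>, may cache the dequantized weights, and NF4 would use a codebook lookup instead of the INT4 dequantization shown here):

static float[] QLoraForward(
    float[] x,            // input vector, length = inFeatures
    byte[] wCodes,        // 4-bit codes for the frozen base weights, row-major [outFeatures, inFeatures]
    float[] blockScales,  // one scale per quantization block of the flattened weight array
    int blockSize,
    float[,] A,           // LoRA A: [rank, inFeatures], full precision
    float[,] B,           // LoRA B: [outFeatures, rank], full precision
    float scaling)        // alpha / rank
{
    int inF = x.Length, rank = A.GetLength(0), outF = B.GetLength(0);
    var y = new float[outF];

    // 1) Base path: dequantize each 4-bit weight on the fly and accumulate W_q * x.
    for (int o = 0; o < outF; o++)
        for (int i = 0; i < inF; i++)
        {
            int flat = o * inF + i;
            float w = (wCodes[flat] - 8) * blockScales[flat / blockSize];
            y[o] += w * x[i];
        }

    // 2) LoRA path: (alpha/rank) * B * (A * x), kept in full precision.
    var ax = new float[rank];
    for (int r = 0; r < rank; r++)
        for (int i = 0; i < inF; i++)
            ax[r] += A[r, i] * x[i];

    for (int o = 0; o < outF; o++)
        for (int r = 0; r < rank; r++)
            y[o] += scaling * B[o, r] * ax[r];

    return y; // sum of dequantized base output and LoRA output
}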
MergeToOriginalLayer()
Merges the LoRA adaptation into the base layer and returns a quantized merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new DenseLayer with LoRA weights merged and quantized.
Remarks
This method:
1. Dequantizes the base weights
2. Computes the LoRA weight contribution
3. Merges them together
4. Creates a new layer with the merged weights
5. Optionally re-quantizes for deployment
For Beginners: This "bakes in" your LoRA training into a single compressed layer. After training, you can:
1. Decompress the base weights
2. Add the LoRA corrections
3. Create a new layer with the improved weights
4. Optionally compress it again for deployment
The result is a single layer that includes all the improvements from training, ready to use in production!
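The core of the merge can be sketched as W_merged = dequantize(W_base) + (alpha / rank) * B * A (illustrative only; the actual method constructs a new DenseLayer from the merged weights):

static float[,] MergeWeights(float[,] dequantizedBase, float[,] A, float[,] B, float scaling)
{
    int outF = dequantizedBase.GetLength(0), inF = dequantizedBase.GetLength(1);
    int rank = A.GetLength(0);
    var merged = new float[outF, inF];

    for (int o = 0; o < outF; o++)
        for (int i = 0; i < inF; i++)
        {
            float delta = 0f;
            for (int r = 0; r < rank; r++)
                delta += B[o, r] * A[r, i];          // LoRA contribution (B * A)
            merged[o, i] = dequantizedBase[o, i] + scaling * delta;
        }
    return merged; // this matrix can optionally be re-quantized for deployment
}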
Exceptions
- InvalidOperationException
Thrown when the base layer type is not DenseLayer or FullyConnectedLayer.
ResetState()
Resets the internal state of the adapter.
public override void ResetState()
Remarks
Clears cached dequantized weights and resets both base and LoRA layers.
For Beginners: This clears the adapter's memory, including any cached decompressed weights. Useful when starting a new batch or switching tasks.