Class QALoRAAdapter<T>
Quantization-Aware LoRA (QA-LoRA) adapter that combines parameter-efficient fine-tuning with group-wise quantization awareness.
public class QALoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → QALoRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>
Remarks
QA-LoRA extends standard LoRA by being aware of quantization during training. This allows the adapter to learn compensations for quantization errors, resulting in better final accuracy compared to post-training quantization approaches. The key innovation is simulating quantization during the forward pass so that gradients account for quantization effects.
For Beginners: QA-LoRA solves a critical problem when deploying models to resource-constrained devices.
The Problem:
- Modern neural networks use high-precision numbers (32-bit floats)
- Mobile/edge devices need lower precision (4-bit or 8-bit integers) for speed and memory
- Converting after training (post-training quantization) often loses accuracy
QA-LoRA's Solution:
- Simulates low-precision during training (quantization-aware training)
- Learns to compensate for quantization errors
- Uses LoRA for parameter efficiency (only trains the adaptation, not full model)
- Applies group-wise quantization (groups of weights share scaling factors)
Key Concepts:
Quantization: Converting high-precision numbers to low-precision. Example: 32-bit float 0.7234 → 4-bit integer 11 (range 0-15).
Group-wise Quantization: Instead of one scale for all weights, weights are divided into groups, each with its own scale. This preserves more information. Example: 64 weights → 4 groups of 16 weights, each with its own scale.
Quantization-Aware Training: During training, simulate quantization in the forward pass (see the code sketch after this list):
- Convert weights to low-precision (quantize)
- Immediately convert back to high-precision (dequantize)
- Use these "quantized" values for computation
- Gradients learn to compensate for the quantization noise
Straight-Through Estimator (STE): During the backward pass, treat quantization as the identity
- Forward: y = quantize(x)
- Backward: ∂y/∂x ≈ 1 (gradient flows through unchanged)
- This allows gradients to update the full-precision weights
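The quantize → dequantize round trip can be illustrated on a plain array. The sketch below is not the library's internal implementation; it shows min-max fake quantization for a single group of weights, and a full group-wise pass would simply repeat this once per group of GroupSize weights.
```csharp
using System;
using System.Linq;

static class FakeQuantDemo
{
    // One scale per group: asymmetric min-max quantization to 2^bits integer levels.
    static double[] FakeQuantizeGroup(double[] group, int bits)
    {
        int levels = (1 << bits) - 1;                    // 4 bits → 15 (integer levels 0..15)
        double min = group.Min(), max = group.Max();
        double scale = (max - min) / levels;
        if (scale == 0) return (double[])group.Clone();  // constant group: nothing to quantize

        var result = new double[group.Length];
        for (int i = 0; i < group.Length; i++)
        {
            int q = (int)Math.Round((group[i] - min) / scale); // quantize: real → integer level
            result[i] = q * scale + min;                       // dequantize: back to real, now with rounding noise
        }
        return result;
    }

    static void Main()
    {
        double[] weights = { 0.0, 0.7234, -0.5, 1.0 };
        double[] noisy = FakeQuantizeGroup(weights, bits: 4);
        Console.WriteLine(string.Join(", ", noisy));     // values snapped to a 16-level grid over [-0.5, 1.0]
    }
}
```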
Parameters:
- QuantizationBits: How many bits to use (4-bit, 8-bit, etc.)
- GroupSize: How many weights per quantization group (e.g., 64, 128)
- Smaller GroupSize = more scales = better accuracy but more overhead
- Larger GroupSize = fewer scales = more efficient but less accurate
Example Workflow:
- Training: Forward pass uses simulated 4-bit quantization
- Gradients: Backward pass learns to work around quantization errors
- Deployment: Actually quantize the merged weights to 4-bit for inference
- Result: Much better accuracy than quantizing after training
Research Context:
- QLoRA (May 2023): Introduced efficient 4-bit quantization for LoRA
- QA-LoRA: Extends this with quantization-aware training for better results
- Typical improvement: 1-3% accuracy gain over post-training quantization
Use Cases:
- Deploying large language models on mobile devices
- Edge AI applications with strict memory constraints
- Reducing model size while maintaining accuracy
- Fine-tuning for deployment on specific hardware (TPUs, specialized accelerators)
Constructors
QALoRAAdapter(ILayer<T>, int, int, int, double, bool)
Initializes a new QA-LoRA adapter with quantization awareness.
public QALoRAAdapter(ILayer<T> baseLayer, int rank, int quantizationBits, int groupSize, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with QA-LoRA.
rank (int): The rank of the LoRA decomposition.
quantizationBits (int): Number of bits for quantization (e.g., 4, 8).
groupSize (int): Number of weights per quantization group.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates a QA-LoRA adapter that will train with quantization awareness.
Parameters:
- baseLayer: The layer you want to efficiently fine-tune
- rank: How much compression for LoRA (lower = fewer parameters)
- quantizationBits: Target precision for deployment (4 or 8 typically)
- groupSize: Granularity of quantization (64-128 recommended)
- alpha: How strong the LoRA effect is
- freezeBaseLayer: Whether to lock the original weights (usually true)
Example: new QALoRAAdapter<float>(myLayer, rank: 8, quantizationBits: 4, groupSize: 64) (see the full snippet after this list)
- Uses 8-rank LoRA for parameter efficiency
- Simulates 4-bit quantization during training
- Groups of 64 weights share scaling factors
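A minimal construction sketch follows; the base-layer variable and the GetPretrainedLayer helper are placeholders for your own code, not part of this API.
```csharp
// Obtain the layer you want to fine-tune from your own model (hypothetical helper, shown for illustration only).
ILayer<float> myLayer = GetPretrainedLayer();

var adapter = new QALoRAAdapter<float>(
    baseLayer: myLayer,
    rank: 8,                 // low-rank LoRA decomposition
    quantizationBits: 4,     // simulate 4-bit precision during training
    groupSize: 64,           // 64 weights share each quantization scale
    alpha: 16,               // LoRA scaling factor (pass a negative value to default to rank)
    freezeBaseLayer: true);  // train only the LoRA matrices
```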
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when quantizationBits or groupSize is invalid.
Properties
GroupSize
Gets or sets the group size for group-wise quantization.
public int GroupSize { get; set; }
Property Value
- int
Remarks
Group-wise quantization divides weights into groups, each with independent scaling factors. This preserves more dynamic range than using a single scale for all weights.
For Beginners: Imagine you have 1024 weights to quantize:
- GroupSize = 1024: One scale for all weights (simple but loses information)
- GroupSize = 128: Eight scales (1024/128 = 8 groups, better accuracy)
- GroupSize = 64: Sixteen scales (1024/64 = 16 groups, even better but more overhead)
Smaller groups mean each group's weights are more similar, so a single scale per group is more accurate. But you need to store more scales.
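As a back-of-the-envelope feel for that overhead, the arithmetic below (plain C#, no library calls) assumes the per-group scales are stored as 32-bit floats:
```csharp
int totalWeights = 1024;
int groupSize = 64;

int numGroups = totalWeights / groupSize;                     // 16 groups → 16 scales to store
double extraBitsPerWeight = numGroups * 32.0 / totalWeights;  // 0.5 extra bits per weight for the scales
```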
QuantizationBits
Gets or sets the number of bits used for quantization.
public int QuantizationBits { get; set; }
Property Value
- int
Remarks
Common values:
- 4 bits: Extremely memory-efficient, requires careful tuning
- 8 bits: Good balance of efficiency and accuracy
- 16 bits: Close to full precision, minimal savings
For Beginners: This controls how much compression you apply.
- 4-bit: 8x compression (32-bit → 4-bit), more aggressive
- 8-bit: 4x compression (32-bit → 8-bit), safer choice
Lower bits = smaller model but harder to maintain accuracy.
QuantizationEnabled
Gets or sets whether quantization simulation is enabled during forward/backward passes.
public bool QuantizationEnabled { get; set; }
Property Value
- bool
Remarks
Disabling quantization can be useful for:
- Initial warmup phases
- Evaluating full-precision performance
- Debugging training issues
For Beginners: This is like a toggle switch:
- Enabled: Simulate low-precision during training (quantization-aware)
- Disabled: Use full-precision (standard LoRA training)
You might start with it disabled for stability, then enable it partway through training.
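For example, a warmup schedule might look like the sketch below; TrainOneEpoch stands in for your own training loop and is not part of this API.
```csharp
adapter.QuantizationEnabled = false;     // warm up in full precision for stability
for (int epoch = 0; epoch < 2; epoch++)
    TrainOneEpoch(adapter);              // hypothetical training helper

adapter.QuantizationEnabled = true;      // switch on quantization-aware training
for (int epoch = 2; epoch < 10; epoch++)
    TrainOneEpoch(adapter);
```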
Methods
Backward(Tensor<T>)
Performs the backward pass through both layers, accounting for quantization in gradients.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradientTensor<T>Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass uses the Straight-Through Estimator (STE) for quantization:
- Forward: y = quantize(x)
- Backward: ∂L/∂x = ∂L/∂y (gradient passes through unchanged)
This allows gradients to flow to the full-precision weights despite quantization.
For Beginners: This is the tricky part of quantization-aware training!
The Problem:
- Quantization is a discontinuous operation (rounding)
- Discontinuous operations have zero or undefined gradients
- If gradients can't flow, we can't update weights, so training fails
The Solution (Straight-Through Estimator):
- Pretend quantization is the identity function during backprop
- Forward: actually quantize (add noise)
- Backward: pretend we didn't quantize (gradient flows through)
- This is mathematically "wrong" but works well in practice!
Why it works:
- The forward pass sees quantized values (learns to compensate)
- The backward pass updates full-precision weights (maintains precision)
- The network learns weights that work well when quantized
Example:
- Forward: weight = 0.7234 → quantize → 0.7333 (closest 4-bit value)
- Backward: gradient flows as if 0.7234 → 0.7234 (identity)
- Update: 0.7234 - learning_rate * gradient (updates the full-precision weight)
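The same example as plain arithmetic on a single weight; this does not call the library, it just makes the STE update concrete, assuming a 4-bit grid over [0, 1].
```csharp
double w = 0.7234;                            // full-precision weight
double step = 1.0 / 15.0;                     // 4-bit grid over [0, 1]: 16 levels, spacing 1/15

// Forward: quantize; this is the value the network actually "sees"
double wQuant = Math.Round(w / step) * step;  // level 11 → 11/15 ≈ 0.7333

// Backward (Straight-Through Estimator): pretend quantization was the identity
double upstreamGrad = 0.05;                   // example gradient arriving from the next layer
double gradW = upstreamGrad * 1.0;            // d(quantize(w))/dw ≈ 1

// Update: applied to the full-precision weight, not the quantized one
double learningRate = 0.01;
w -= learningRate * gradW;                    // 0.7234 - 0.01 * 0.05 = 0.7229
```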
Forward(Tensor<T>)
Performs the forward pass through both base and LoRA layers with quantization simulation.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
inputTensor<T>Input tensor.
Returns
- Tensor<T>
Sum of base layer output and quantized LoRA output.
Remarks
The forward pass with quantization awareness:
1. Compute base layer output (no quantization)
2. Get LoRA layer parameters
3. Simulate quantization: quantize → dequantize (if enabled)
4. Compute LoRA output with quantized parameters
5. Sum base + quantized LoRA outputs
For Beginners: This is where quantization-aware training happens!
Normal LoRA forward pass:
- base_output = base_layer(input)
- lora_output = lora_layer(input) // Uses full-precision weights
- return base_output + lora_output
QA-LoRA forward pass:
- base_output = base_layer(input)
- lora_weights_full = get_lora_weights() // Full precision
- lora_weights_quant = dequantize(quantize(lora_weights_full)) // Simulate quantization
- lora_output = compute_with_quantized_weights(input, lora_weights_quant)
- return base_output + lora_output
The key difference: We temporarily quantize and dequantize the LoRA weights, which adds noise. The gradients will learn to work despite this noise!
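The same flow as a C# skeleton. The helper names (GetLoRAWeights, Quantize, Dequantize, ComputeLoRAOutput, Add) are illustrative stand-ins, not the adapter's actual private members; only Forward and QuantizationEnabled are part of the documented surface.
```csharp
Tensor<T> ForwardSketch(Tensor<T> input)
{
    var baseOutput = baseLayer.Forward(input);                    // 1. full-precision base layer

    var loraWeights = GetLoRAWeights();                           // 2. full-precision LoRA parameters
    var effectiveWeights = QuantizationEnabled
        ? Dequantize(Quantize(loraWeights))                       // 3. fake-quantize: inject quantization noise
        : loraWeights;                                            //    (plain LoRA when quantization is disabled)

    var loraOutput = ComputeLoRAOutput(input, effectiveWeights);  // 4. B * A * scaling with the noisy weights
    return Add(baseOutput, loraOutput);                           // 5. sum, exactly as in standard LoRA
}
```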
GetQuantizationStats()
Gets statistics about quantization for the current LoRA parameters.
public (double averageError, double maxError, int numGroups) GetQuantizationStats()
Returns
- (double averageError, double maxError, int numGroups)
A tuple containing (average error, max error, number of groups).
Remarks
This method helps you understand the quantization quality:
- Average error: Mean absolute difference between full-precision and quantized values
- Max error: Worst-case difference in any parameter
- Number of groups: How many quantization groups are used
For Beginners: Use this to check how much information is lost to quantization.
Example output:
- Average error: 0.002 (most parameters within 0.002 of original)
- Max error: 0.015 (worst case is 0.015 away from original)
- Number of groups: 16 (using 16 different scales)
Lower errors mean better quantization. If errors are too high:
- Decrease group size (more scales, more accurate)
- Increase quantization bits (more precision)
- Adjust learning rate (help network adapt better)
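A quick way to monitor this during training, continuing from the adapter example above (the threshold and the reactions are just illustrative):
```csharp
var (averageError, maxError, numGroups) = adapter.GetQuantizationStats();
Console.WriteLine($"avg error: {averageError:F4}, max error: {maxError:F4}, groups: {numGroups}");

// Possible reaction if the worst-case error is too large for your task:
if (maxError > 0.05)
{
    adapter.GroupSize /= 2;               // more groups → more scales → tighter fit per group
    // or: adapter.QuantizationBits = 8;  // more precision per weight
}
```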
MergeToOriginalLayer()
Merges the LoRA adaptation into the base layer and returns a quantized merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with LoRA weights merged and quantized into the base layer's weights.
Remarks
The merging process for QA-LoRA:
1. Get the LoRA weight contribution (B * A * scaling)
2. Add LoRA weights to base layer weights
3. Apply actual quantization to the merged weights (for deployment)
4. Return a new layer with quantized merged weights
For Beginners: This is the final step - creating the deployment model!
Training vs. Deployment:
- During training: Simulate quantization, keep full-precision weights
- After training: Actually quantize and merge for deployment
What this method does:
- Merge: base_weights + lora_weights → full_precision_merged
- Quantize: full_precision_merged → quantized_weights (actually reduced to N bits)
- Create new layer: DenseLayer with quantized_weights
Result: A layer that's actually using N-bit precision (smaller, faster) instead of just simulating it!
Note: This implementation quantizes the merged parameter vector directly. For true deployment, you'd use a specialized quantized layer class that stores integer weights and performs integer arithmetic; this is a simplified version for demonstration.
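Typical end-of-training usage, continuing from the earlier adapter example (testInput and any persistence of the merged layer are left to your own code):
```csharp
// After quantization-aware fine-tuning is finished, build the deployment layer:
ILayer<float> deployLayer = adapter.MergeToOriginalLayer();

// deployLayer holds base + LoRA weights quantized to QuantizationBits precision
// and can replace the original layer for inference.
var prediction = deployLayer.Forward(testInput);
```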