Class QALoRAAdapter<T>
Quantization-Aware LoRA (QA-LoRA) adapter that combines parameter-efficient fine-tuning with group-wise quantization awareness.
public class QALoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → QALoRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>
Remarks
QA-LoRA extends standard LoRA by being aware of quantization during training. This allows the adapter to learn compensations for quantization errors, resulting in better final accuracy compared to post-training quantization approaches. The key innovation is simulating quantization during the forward pass so that gradients account for quantization effects.
For Beginners: QA-LoRA solves a critical problem when deploying models to resource-constrained devices.
The Problem:
- Modern neural networks use high-precision numbers (32-bit floats)
- Mobile/edge devices need lower precision (4-bit or 8-bit integers) for speed and memory
- Converting after training (post-training quantization) often loses accuracy
QA-LoRA's Solution:
- Simulates low-precision during training (quantization-aware training)
- Learns to compensate for quantization errors
- Uses LoRA for parameter efficiency (only trains the adaptation, not full model)
- Applies group-wise quantization (groups of weights share scaling factors)
Key Concepts:
Quantization: Converting high-precision numbers to low-precision. Example: 32-bit float 0.7234 → 4-bit integer 11 (range 0-15).
Group-wise Quantization: Instead of one scale for all weights, weights are divided into groups, each with its own scale. This preserves more information. Example: 64 weights → 4 groups of 16 weights, each with its own scale.
Quantization-Aware Training: During training, simulate quantization in the forward pass (see the code sketch after this list):
- Convert weights to low-precision (quantize)
- Immediately convert back to high-precision (dequantize)
- Use these "quantized" values for computation
- Gradients learn to compensate for the quantization noise
Straight-Through Estimator (STE): During the backward pass, treat quantization as the identity
- Forward: y = quantize(x)
- Backward: ∂y/∂x ≈ 1 (gradient flows through unchanged)
- This allows gradients to update the full-precision weights
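The quantize → dequantize round trip can be illustrated on a plain array. The sketch below is not the library's internal implementation; it shows min-max fake quantization for a single group of weights, and a full group-wise pass would simply repeat this once per group of GroupSize weights.
```csharp
using System;
using System.Linq;

static class FakeQuantDemo
{
    // One scale per group: asymmetric min-max quantization to 2^bits integer levels.
    static double[] FakeQuantizeGroup(double[] group, int bits)
    {
        int levels = (1 << bits) - 1;                    // 4 bits → 15 (integer levels 0..15)
        double min = group.Min(), max = group.Max();
        double scale = (max - min) / levels;
        if (scale == 0) return (double[])group.Clone();  // constant group: nothing to quantize

        var result = new double[group.Length];
        for (int i = 0; i < group.Length; i++)
        {
            int q = (int)Math.Round((group[i] - min) / scale); // quantize: real → integer level
            result[i] = q * scale + min;                       // dequantize: back to real, now with rounding noise
        }
        return result;
    }

    static void Main()
    {
        double[] weights = { 0.0, 0.7234, -0.5, 1.0 };
        double[] noisy = FakeQuantizeGroup(weights, bits: 4);
        Console.WriteLine(string.Join(", ", noisy));     // values snapped to a 16-level grid over [-0.5, 1.0]
    }
}
```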
Parameters:
- QuantizationBits: How many bits to use (4-bit, 8-bit, etc.)
- GroupSize: How many weights per quantization group (e.g., 64, 128)
- Smaller GroupSize = more scales = better accuracy but more overhead
- Larger GroupSize = fewer scales = more efficient but less accurate
Example Workflow:
- Training: Forward pass uses simulated 4-bit quantization
- Gradients: Backward pass learns to work around quantization errors
- Deployment: Actually quantize the merged weights to 4-bit for inference
- Result: Much better accuracy than quantizing after training
Research Context:
- QLoRA (May 2023): Introduced efficient 4-bit quantization for LoRA
- QA-LoRA: Extends this with quantization-aware training for better results
- Typical improvement: 1-3% accuracy gain over post-training quantization
Use Cases:
- Deploying large language models on mobile devices
- Edge AI applications with strict memory constraints
- Reducing model size while maintaining accuracy
- Fine-tuning for deployment on specific hardware (TPUs, specialized accelerators)
Constructors
QALoRAAdapter(ILayer<T>, int, int, int, double, bool)
Initializes a new QA-LoRA adapter with quantization awareness.
public QALoRAAdapter(ILayer<T> baseLayer, int rank, int quantizationBits, int groupSize, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with QA-LoRA.
rank (int): The rank of the LoRA decomposition.
quantizationBits (int): Number of bits for quantization (e.g., 4, 8).
groupSize (int): Number of weights per quantization group.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates a QA-LoRA adapter that will train with quantization awareness.
Parameters:
- baseLayer: The layer you want to efficiently fine-tune
- rank: How much compression for LoRA (lower = fewer parameters)
- quantizationBits: Target precision for deployment (4 or 8 typically)
- groupSize: Granularity of quantization (64-128 recommended)
- alpha: How strong the LoRA effect is
- freezeBaseLayer: Whether to lock the original weights (usually true)
Example: new QALoRAAdapter<float>(myLayer, rank: 8, quantizationBits: 4, groupSize: 64) (see the full snippet after this list)
- Uses 8-rank LoRA for parameter efficiency
- Simulates 4-bit quantization during training
- Groups of 64 weights share scaling factors
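A minimal construction sketch follows; the base-layer variable and the GetPretrainedLayer helper are placeholders for your own code, not part of this API.
```csharp
// Obtain the layer you want to fine-tune from your own model (hypothetical helper, shown for illustration only).
ILayer<float> myLayer = GetPretrainedLayer();

var adapter = new QALoRAAdapter<float>(
    baseLayer: myLayer,
    rank: 8,                 // low-rank LoRA decomposition
    quantizationBits: 4,     // simulate 4-bit precision during training
    groupSize: 64,           // 64 weights share each quantization scale
    alpha: 16,               // LoRA scaling factor (pass a negative value to default to rank)
    freezeBaseLayer: true);  // train only the LoRA matrices
```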
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when quantizationBits or groupSize is invalid.
Properties
GroupSize
Gets or sets the group size for group-wise quantization.
public int GroupSize { get; set; }
Property Value
- int
Remarks
Group-wise quantization divides weights into groups, each with independent scaling factors. This preserves more dynamic range than using a single scale for all weights.
For Beginners: Imagine you have 1024 weights to quantize:
- GroupSize = 1024: One scale for all weights (simple but loses information)
- GroupSize = 128: Eight scales (1024/128 = 8 groups, better accuracy)
- GroupSize = 64: Sixteen scales (1024/64 = 16 groups, even better but more overhead)
Smaller groups mean each group's weights are more similar, so a single scale per group is more accurate. But you need to store more scales.
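As a back-of-the-envelope feel for that overhead, the arithmetic below (plain C#, no library calls) assumes the per-group scales are stored as 32-bit floats:
```csharp
int totalWeights = 1024;
int groupSize = 64;

int numGroups = totalWeights / groupSize;                     // 16 groups → 16 scales to store
double extraBitsPerWeight = numGroups * 32.0 / totalWeights;  // 0.5 extra bits per weight for the scales
```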
QuantizationBits
Gets or sets the number of bits used for quantization.
public int QuantizationBits { get; set; }
Property Value
- int
Remarks
Common values:
- 4 bits: Extremely memory-efficient, requires careful tuning
- 8 bits: Good balance of efficiency and accuracy
- 16 bits: Close to full precision, minimal savings
For Beginners: This controls how much compression you apply.
- 4-bit: 8x compression (32-bit → 4-bit), more aggressive
- 8-bit: 4x compression (32-bit → 8-bit), safer choice
Lower bits = smaller model but harder to maintain accuracy.
QuantizationEnabled
Gets or sets whether quantization simulation is enabled during forward/backward passes.
public bool QuantizationEnabled { get; set; }
Property Value
- bool
Remarks
Disabling quantization can be useful for:
- Initial warmup phases
- Evaluating full-precision performance
- Debugging training issues
For Beginners: This is like a toggle switch:
- Enabled: Simulate low-precision during training (quantization-aware)
- Disabled: Use full-precision (standard LoRA training)
You might start with it disabled for stability, then enable it partway through training.
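For example, a warmup schedule might look like the sketch below; TrainOneEpoch stands in for your own training loop and is not part of this API.
```csharp
adapter.QuantizationEnabled = false;     // warm up in full precision for stability
for (int epoch = 0; epoch < 2; epoch++)
    TrainOneEpoch(adapter);              // hypothetical training helper

adapter.QuantizationEnabled = true;      // switch on quantization-aware training
for (int epoch = 2; epoch < 10; epoch++)
    TrainOneEpoch(adapter);
```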
Methods
Backward(Tensor<T>)
Performs the backward pass through both layers, accounting for quantization in gradients.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradientTensor<T>Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass uses the Straight-Through Estimator (STE) for quantization:
- Forward: y = quantize(x)
- Backward: ∂L/∂x = ∂L/∂y (gradient passes through unchanged)
This allows gradients to flow to the full-precision weights despite quantization.
For Beginners: This is the tricky part of quantization-aware training!
The Problem:
- Quantization is a discontinuous operation (rounding)
- Discontinuous operations have zero or undefined gradients
- If gradients can't flow, we can't update weights, so training fails
The Solution (Straight-Through Estimator):
- Pretend quantization is the identity function during backprop
- Forward: actually quantize (add noise)
- Backward: pretend we didn't quantize (gradient flows through)
- This is mathematically "wrong" but works well in practice!
Why it works:
- The forward pass sees quantized values (learns to compensate)
- The backward pass updates full-precision weights (maintains precision)
- The network learns weights that work well when quantized
Example:
- Forward: weight = 0.7234 → quantize → 0.7333 (closest 4-bit value)
- Backward: gradient flows as if 0.7234 → 0.7234 (identity)
- Update: 0.7234 - learning_rate * gradient (updates the full-precision weight)
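The same example as plain arithmetic on a single weight; this does not call the library, it just makes the STE update concrete, assuming a 4-bit grid over [0, 1].
```csharp
double w = 0.7234;                            // full-precision weight
double step = 1.0 / 15.0;                     // 4-bit grid over [0, 1]: 16 levels, spacing 1/15

// Forward: quantize; this is the value the network actually "sees"
double wQuant = Math.Round(w / step) * step;  // level 11 → 11/15 ≈ 0.7333

// Backward (Straight-Through Estimator): pretend quantization was the identity
double upstreamGrad = 0.05;                   // example gradient arriving from the next layer
double gradW = upstreamGrad * 1.0;            // d(quantize(w))/dw ≈ 1

// Update: applied to the full-precision weight, not the quantized one
double learningRate = 0.01;
w -= learningRate * gradW;                    // 0.7234 - 0.01 * 0.05 = 0.7229
```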
Forward(Tensor<T>)
Performs the forward pass through both base and LoRA layers with quantization simulation.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
inputTensor<T>Input tensor.
Returns
- Tensor<T>
Sum of base layer output and quantized LoRA output.
Remarks
The forward pass with quantization awareness:
1. Compute base layer output (no quantization)
2. Get LoRA layer parameters
3. Simulate quantization: quantize → dequantize (if enabled)
4. Compute LoRA output with quantized parameters
5. Sum base + quantized LoRA outputs
For Beginners: This is where quantization-aware training happens!
Normal LoRA forward pass:
- base_output = base_layer(input)
- lora_output = lora_layer(input) // Uses full-precision weights
- return base_output + lora_output
QA-LoRA forward pass:
- base_output = base_layer(input)
- lora_weights_full = get_lora_weights() // Full precision
- lora_weights_quant = dequantize(quantize(lora_weights_full)) // Simulate quantization
- lora_output = compute_with_quantized_weights(input, lora_weights_quant)
- return base_output + lora_output
The key difference: We temporarily quantize and dequantize the LoRA weights, which adds noise. The gradients will learn to work despite this noise!
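The same flow as a C# skeleton. The helper names (GetLoRAWeights, Quantize, Dequantize, ComputeLoRAOutput, Add) are illustrative stand-ins, not the adapter's actual private members; only Forward and QuantizationEnabled are part of the documented surface.
```csharp
Tensor<T> ForwardSketch(Tensor<T> input)
{
    var baseOutput = baseLayer.Forward(input);                    // 1. full-precision base layer

    var loraWeights = GetLoRAWeights();                           // 2. full-precision LoRA parameters
    var effectiveWeights = QuantizationEnabled
        ? Dequantize(Quantize(loraWeights))                       // 3. fake-quantize: inject quantization noise
        : loraWeights;                                            //    (plain LoRA when quantization is disabled)

    var loraOutput = ComputeLoRAOutput(input, effectiveWeights);  // 4. B * A * scaling with the noisy weights
    return Add(baseOutput, loraOutput);                           // 5. sum, exactly as in standard LoRA
}
```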
GetQuantizationStats()
Gets statistics about quantization for the current LoRA parameters.
public (double averageError, double maxError, int numGroups) GetQuantizationStats()
Returns
- (double averageError, double maxError, int numGroups)
A tuple containing (average error, max error, number of groups).
Remarks
This method helps you understand the quantization quality:
- Average error: Mean absolute difference between full-precision and quantized values
- Max error: Worst-case difference in any parameter
- Number of groups: How many quantization groups are used
For Beginners: Use this to check how much information is lost to quantization.
Example output:
- Average error: 0.002 (most parameters within 0.002 of original)
- Max error: 0.015 (worst case is 0.015 away from original)
- Number of groups: 16 (using 16 different scales)
Lower errors mean better quantization. If errors are too high:
- Decrease group size (more scales, more accurate)
- Increase quantization bits (more precision)
- Adjust learning rate (help network adapt better)
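A quick way to monitor this during training, continuing from the adapter example above (the threshold and the reactions are just illustrative):
```csharp
var (averageError, maxError, numGroups) = adapter.GetQuantizationStats();
Console.WriteLine($"avg error: {averageError:F4}, max error: {maxError:F4}, groups: {numGroups}");

// Possible reaction if the worst-case error is too large for your task:
if (maxError > 0.05)
{
    adapter.GroupSize /= 2;               // more groups → more scales → tighter fit per group
    // or: adapter.QuantizationBits = 8;  // more precision per weight
}
```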
MergeToOriginalLayer()
Merges the LoRA adaptation into the base layer and returns a quantized merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with LoRA weights merged and quantized into the base layer's weights.
Remarks
The merging process for QA-LoRA:
1. Get the LoRA weight contribution (B * A * scaling)
2. Add LoRA weights to base layer weights
3. Apply actual quantization to the merged weights (for deployment)
4. Return a new layer with quantized merged weights
For Beginners: This is the final step - creating the deployment model!
Training vs. Deployment:
- During training: Simulate quantization, keep full-precision weights
- After training: Actually quantize and merge for deployment
What this method does:
- Merge: base_weights + lora_weights → full_precision_merged
- Quantize: full_precision_merged → quantized_weights (actually reduced to N bits)
- Create new layer: DenseLayer with quantized_weights
Result: A layer that's actually using N-bit precision (smaller, faster) instead of just simulating it!
Note: This implementation quantizes the merged parameter vector directly. For true deployment, you'd use a specialized quantized layer class that stores integer weights and performs integer arithmetic; this is a simplified version for demonstration.
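Typical end-of-training usage, continuing from the earlier adapter example (testInput and any persistence of the merged layer are left to your own code):
```csharp
// After quantization-aware fine-tuning is finished, build the deployment layer:
ILayer<float> deployLayer = adapter.MergeToOriginalLayer();

// deployLayer holds base + LoRA weights quantized to QuantizationBits precision
// and can replace the original layer for inference.
var prediction = deployLayer.Forward(testInput);
```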