Class VeRAAdapter<T>
VeRA (Vector-based Random Matrix Adaptation) adapter - an extreme parameter-efficient variant of LoRA.
public class VeRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → VeRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Remarks
VeRA achieves 10x fewer trainable parameters than standard LoRA by:
- Using a single pair of random low-rank matrices (A and B) shared across ALL layers
- Freezing these shared matrices (they are never trained)
- Training only small scaling vectors (d and b) that are specific to each layer
The forward computation is: output = base_layer(input) + d * (B * A * input) * b where d and b are trainable vectors, and A and B are frozen shared matrices.
For Beginners: VeRA is an ultra-efficient version of LoRA for extreme memory constraints.
Think of the difference this way:
- Standard LoRA: Each layer has its own pair of small matrices (A and B) that are trained
- VeRA: ALL layers share the same random matrices (A and B) which are frozen. Only tiny scaling vectors are trained per layer.
Example parameter comparison for a 1000x1000 layer with rank=8:
- Full fine-tuning: 1,000,000 parameters
- Standard LoRA (rank=8): 16,000 parameters (98.4% reduction)
- VeRA (rank=8): ~1,000 parameters (99.9% reduction), more than 10x fewer than LoRA (worked through in the sketch below)!
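The arithmetic behind this comparison is easy to verify. The sketch below is purely illustrative and uses the per-layer trainable count documented under ParameterCount (outputSize + rank):

// Illustrative parameter-count arithmetic for a 1000x1000 layer at rank = 8
int inputSize = 1000, outputSize = 1000, rank = 8;
int fullFineTuning = inputSize * outputSize;              // 1,000,000 weights updated
int loraParams = rank * inputSize + outputSize * rank;    // A (8x1000) + B (1000x8) = 16,000
int veraParams = outputSize + rank;                       // two scaling vectors = 1,008
Console.WriteLine($"Full: {fullFineTuning}, LoRA: {loraParams}, VeRA: {veraParams}");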
Trade-offs:
- ✅ Extreme parameter efficiency (10x fewer than LoRA)
- ✅ Very low memory footprint
- ✅ Shared matrices reduce storage when adapting many layers
- ⚠️ Slightly less flexible than standard LoRA (shared random projection)
- ⚠️ Performance may be marginally lower than LoRA in some cases
When to use VeRA:
- Extreme memory constraints (mobile, edge devices)
- Fine-tuning many layers with limited resources
- Rapid prototyping with minimal parameter overhead
- When LoRA is still too expensive
Constructors
VeRAAdapter(ILayer<T>, int, double, bool)
Initializes a new VeRA adapter wrapping an existing layer.
public VeRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with VeRA.
rank (int): The rank of the low-rank decomposition (shared across all VeRA layers).
alpha (double): The scaling factor (defaults to rank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
Before creating any VeRA adapters, you must call InitializeSharedMatrices() once to set up the shared random matrices that all VeRA layers will use.
For Beginners: This creates a VeRA adapter for a layer. Unlike standard LoRA, you must initialize the shared random matrices first by calling:
VeRAAdapter<T>.InitializeSharedMatrices(inputSize, outputSize, rank);
This needs to be done once before creating any VeRA adapters.
Parameters:
- baseLayer: The layer you want to adapt
- rank: How much compression (lower = fewer parameters)
- alpha: How strong the VeRA adaptation is
- freezeBaseLayer: Whether to lock the original layer's weights (usually true)
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when rank is invalid or shared matrices are not initialized.
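A minimal construction sketch, assuming denseLayer is an existing ILayer<double> with matching dimensions (the 784/128 sizes and alpha value here are arbitrary placeholders):

// 1. Initialize the shared, frozen random matrices once for the target dimensions.
VeRAAdapter<double>.InitializeSharedMatrices(inputSize: 784, outputSize: 128, rank: 8);

// 2. Wrap an existing layer; only the per-layer scaling vectors remain trainable.
var adapter = new VeRAAdapter<double>(denseLayer, rank: 8, alpha: 16, freezeBaseLayer: true);

// Skipping step 1, or passing an invalid rank, throws ArgumentException as described above.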
Properties
AreSharedMatricesInitialized
Gets whether the shared matrices have been initialized.
public static bool AreSharedMatricesInitialized { get; }
Property Value
- bool
ParameterCount
Gets the total number of trainable parameters (only the scaling vectors d and b).
public override int ParameterCount { get; }
Property Value
- int
Remarks
VeRA only trains the scaling vectors, not the shared matrices. For a layer with outputSize and rank r, this is: outputSize + rank. This is typically 10x fewer parameters than standard LoRA.
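As a hedged example of the formula, assume adapter wraps a layer with outputSize = 128 at rank = 8:

// outputSize + rank = 128 + 8 = 136 trainable values
// (one scaling vector sized by the output dimension, one sized by the rank)
Console.WriteLine(adapter.ParameterCount);   // 136 for such a layer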
Methods
Backward(Tensor<T>)
Performs the backward pass through the VeRA adapter.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass computes gradients ONLY for the scaling vectors d and b. The shared matrices A and B remain frozen and are never updated.
For Beginners: This is where VeRA learns! During backpropagation:
1. Compute gradients for scaling vectors d and b (these are trained)
2. Shared matrices A and B are NOT updated (they stay frozen)
3. Pass gradients back to earlier layers
This is why VeRA is so efficient - we only train tiny scaling vectors!
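A hedged single-training-step sketch using only members documented on this page; adapter, inputBatch, and the ComputeLossGradient helper are hypothetical placeholders:

// One illustrative training step: only the scaling vectors d and b receive updates.
Tensor<double> output = adapter.Forward(inputBatch);            // base layer + VeRA path
Tensor<double> lossGradient = ComputeLossGradient(output);      // hypothetical loss-gradient helper
Tensor<double> inputGradient = adapter.Backward(lossGradient);  // gradients for d and b only; A and B stay frozen
adapter.UpdateParameters(0.01);                                 // apply the update with a learning rate of 0.01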
CreateLoRALayer(int, double)
Creates the LoRA layer required by the base class (not actually used, since VeRA doesn't use LoRALayer).
protected override LoRALayer<T> CreateLoRALayer(int rank, double alpha)
Parameters
rank (int): The rank of the low-rank decomposition.
alpha (double): The scaling factor.
Returns
- LoRALayer<T>
Remarks
VeRA doesn't use the standard LoRALayer, so this creates a dummy layer. The actual VeRA computation is handled in Forward() and Backward() methods.
Forward(Tensor<T>)
Performs the forward pass through the VeRA adapter.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): Input tensor.
Returns
- Tensor<T>
Sum of base layer output and VeRA output.
Remarks
The VeRA forward pass computes: output = base_layer(input) + d * (B * A * input) * b * scaling where d and b are trainable scaling vectors, A and B are frozen shared matrices, and scaling = alpha/rank.
For Beginners: This processes input through both the original layer and the VeRA adaptation:
1. Base layer processes the input (original behavior)
2. VeRA computes: input → A (shared) → b (scale) → B (shared) → d (scale)
3. The outputs are added together
The key difference from standard LoRA: A and B are shared and frozen, only d and b are trained!
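To make the ordering concrete, here is a plain-array sketch of just the VeRA path. It is not the library's implementation, only the math from the remark above with scaling = alpha / rank, and it assumes b scales the rank-sized intermediate while d scales the output, matching the sequence listed above:

// Conceptual VeRA path for a single input vector x (illustration only).
// A: rank x inputSize (frozen, shared)    B: outputSize x rank (frozen, shared)
// b: rank entries (trainable)             d: outputSize entries (trainable)
static double[] VeraPath(double[,] A, double[,] B, double[] b, double[] d,
                         double[] x, double alpha, int rank)
{
    double scaling = alpha / rank;
    var intermediate = new double[rank];              // A * x, then scaled elementwise by b
    for (int r = 0; r < rank; r++)
    {
        double sum = 0;
        for (int j = 0; j < x.Length; j++) sum += A[r, j] * x[j];
        intermediate[r] = b[r] * sum;
    }
    var result = new double[d.Length];                // B * intermediate, then scaled by d and alpha/rank
    for (int i = 0; i < d.Length; i++)
    {
        double sum = 0;
        for (int r = 0; r < rank; r++) sum += B[i, r] * intermediate[r];
        result[i] = d[i] * sum * scaling;
    }
    return result;                                    // added to the base layer's output
}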
GetParameters()
Gets the current parameters as a vector (scaling vectors only).
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing VeRA parameters (d and b vectors).
InitializeSharedMatrices(int, int, int, int?)
Initializes the shared random matrices used by all VeRA adapters.
public static void InitializeSharedMatrices(int inputSize, int outputSize, int rank, int? seed = null)
Parameters
inputSize (int): The input dimension for the layers.
outputSize (int): The output dimension for the layers.
rank (int): The rank of the low-rank decomposition.
seed (int?): Optional random seed for reproducibility.
Remarks
This method must be called once before creating any VeRA adapters. It initializes the shared matrices A and B with random values that are frozen (never trained).
The shared matrices are initialized with Gaussian random values similar to Kaiming initialization. Once initialized, they remain frozen and are shared across all VeRA adapters with matching dimensions.
For Beginners: Call this once at the start before creating any VeRA layers:
// Initialize shared random matrices (do this once)
VeRAAdapter<double>.InitializeSharedMatrices(inputSize: 784, outputSize: 128, rank: 8);

// Now create VeRA adapters (they will use the shared matrices)
var adapter1 = new VeRAAdapter<double>(layer1, rank: 8);
var adapter2 = new VeRAAdapter<double>(layer2, rank: 8);
All adapters share the same random A and B matrices, saving memory!
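If the frozen matrices need to be reproducible across runs (for example, so saved scaling vectors can be reloaded against the same A and B later), the optional seed parameter can be supplied; the value 42 below is arbitrary:

// Fix the random seed so the shared A and B matrices are identical on every run.
VeRAAdapter<double>.InitializeSharedMatrices(inputSize: 784, outputSize: 128, rank: 8, seed: 42);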
MergeToOriginalLayer()
Merges the VeRA adaptation into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with VeRA weights merged into the base layer's weights.
Remarks
This computes the full weight contribution from VeRA: W_vera = d * B * A * b * scaling, and adds it to the base layer's weights.
For Beginners: This "bakes in" the VeRA adaptation for deployment. After training, you can merge the adaptation into the original weights for faster inference. The merged layer will behave identically but without the VeRA overhead.
ResetSharedMatrices()
Resets the shared matrices (useful for testing or reinitializing).
public static void ResetSharedMatrices()
ResetState()
Resets the internal state of the VeRA adapter.
public override void ResetState()
SetParameters(Vector<T>)
Sets the layer parameters from a vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): Vector containing VeRA parameters.
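Because only the two small scaling vectors are trainable, per-layer checkpoints are tiny; a hedged round-trip sketch using GetParameters and SetParameters:

// Save the trained scaling vectors (d and b) for this adapter ...
Vector<double> veraCheckpoint = adapter.GetParameters();

// ... and later restore them into an adapter with the same dimensions and rank.
adapter.SetParameters(veraCheckpoint);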
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
VeRA only updates the scaling vectors d and b. The shared matrices A and B remain frozen.
UpdateParametersFromLayers()
Updates the parameter vector from the current layer states.
protected override void UpdateParametersFromLayers()
Remarks
VeRA overrides this to only copy scaling vectors (d and b), not the full LoRA layer parameters. This is called from the base constructor before scaling vectors are initialized, so we check for null and skip if not ready yet.