Class LoRAXSAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

LoRA-XS (Extremely Small) adapter for ultra-parameter-efficient fine-tuning using SVD with trainable scaling matrix.

public class LoRAXSAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> ← LoRAXSAdapter<T>

Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Remarks

LoRA-XS achieves extreme parameter efficiency by leveraging the SVD of the pretrained weights to create frozen orthonormal bases (the U and V matrices), with only a small r×r trainable matrix R positioned between them. This architecture reduces the parameter count to r² instead of 2dr (standard LoRA), achieving a 100x+ reduction while matching or exceeding full fine-tuning performance.

Architecture Comparison:

  • Standard LoRA: W' = W + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d) (2dr parameters)
  • LoRA-XS: W' = W + U_r Σ_r R V_r^T, where only R ∈ ℝ^(r×r) is trainable (r² parameters)
  • U_r and V_r are frozen orthonormal bases from the SVD of the pretrained W
  • Σ_r is the frozen diagonal matrix of the top-r singular values

Key Innovation: Instead of training both A and B matrices (standard LoRA), LoRA-XS:

  1. Computes the SVD of the pretrained weights: W = U Σ V^T
  2. Freezes U_r (top-r left singular vectors) and V_r^T (top-r right singular vectors)
  3. Freezes Σ_r (top-r singular values as a diagonal matrix)
  4. Trains only R (r×r mixing matrix), which interpolates between the frozen bases
  5. Keeps the parameter count independent of the hidden dimensions: only r² trainable parameters
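As a quick sanity check of these counts, the snippet below compares trainable-parameter totals for a square d×d layer. It is illustrative arithmetic only and uses nothing from the library:

using System;

int d = 4096, r = 16;
long standardLoRA = 2L * d * r;   // B (d×r) plus A (r×d): 131,072 trainable values
long loraXS = (long)r * r;        // only the r×r mixing matrix R: 256 trainable values
Console.WriteLine($"Standard LoRA: {standardLoRA:N0}, LoRA-XS: {loraXS:N0}, reduction: {standardLoRA / loraXS}x");
// Prints a 512x reduction for this configuration, consistent with the 100x+ claim above.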

Performance Metrics (from paper):

RoBERTa-large on GLUE (6 tasks):

  • LoRA-XS (rank 16): 88.03% avg accuracy, 24.6K parameters
  • Standard LoRA (rank 16): Similar accuracy, 100x more parameters
  • Full fine-tuning: 88.0% avg accuracy, ~125M parameters per task

LLaMA2-7B on Commonsense Reasoning:

  • LoRA-XS: 80.5% avg accuracy, 3.67M parameters
  • Standard LoRA: 77.6% avg accuracy, 56M parameters (15x more)

Mistral-7B on GSM8K (Math Reasoning):

  • LoRA-XS: 70.35% accuracy, 3.67M parameters
  • Standard LoRA: 67.70% accuracy, 168M parameters (46x more)

GPT-3 Personalization (1M models):

  • LoRA-XS: 96GB total storage
  • Standard LoRA: 144TB total storage (1500x reduction)

Mathematical Formulation: The forward pass computes:

output = (W + U_r Σ_r R V_r^T) * input
       = W * input + (U_r Σ_r) * (R * (V_r^T * input))

Where:

  • W is frozen pretrained weights
  • U_r ∈ ℝ^(d_out × r): frozen left singular vectors (orthonormal columns)
  • Σ_r ∈ ℝ^(r × r): frozen diagonal matrix of singular values
  • R ∈ ℝ^(r × r): trainable mixing matrix (only trainable component!)
  • V_r^T ∈ ℝ^(r × d_in): frozen right singular vectors (orthonormal rows)

Why This Works: The SVD provides an optimal orthonormal basis for representing weight updates. By freezing these bases and training only the mixing matrix R, LoRA-XS achieves:

  • Drastically fewer parameters (r² vs 2dr)
  • Better generalization (updates are constrained to the pretrained subspace)
  • Faster convergence (optimal basis from initialization)
  • No inference overhead (the update can be merged back into W)
  • Scalable personalization (parameter count independent of model size)

For Beginners: Think of LoRA-XS as "ultra-compressed LoRA".

Imagine you have a large language model with huge weight matrices (e.g., 4096×4096):

Standard LoRA (rank 8):

  • Creates two low-rank matrices: B (4096×8) and A (8×4096)
  • Total parameters: 4096×8 + 8×4096 = 65,536 parameters
  • Both matrices are trainable

LoRA-XS (rank 8):

  • Decomposes pretrained weights with SVD into U, Σ, V
  • Keeps top 8 singular vectors (U_8, Σ_8, V_8) FROZEN
  • Trains only R matrix: 8×8 = 64 parameters
  • Achieves similar or better performance with 1000x fewer parameters!

It's like having two fixed "coordinate systems" from the pretrained model, and you only train a small "rotation matrix" between them. The fixed coordinate systems capture the pretrained knowledge, while the rotation matrix adapts to your task.

Example workflow:

  1. Load pretrained model weights W
  2. Compute SVD: W = U Σ V^T
  3. Extract top-r components: U_r, Σ_r, V_r
  4. Create LoRA-XS adapter with these frozen bases
  5. Train only the tiny R matrix (64 params for rank 8)
  6. Deploy with merged weights: W' = W + U_r Σ_r R V_r^T
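
A minimal code sketch of this workflow follows. Only the members documented on this page (the constructor, InitializeFromSVD, Forward, Backward, UpdateParameters, and MergeToOriginalLayer) are taken from the API; the pretrained layer, the way its weight matrix is obtained, the training-data source, and the loss-gradient helper are hypothetical placeholders:

using AiDotNet.LoRA.Adapters;

// Hedged usage sketch; names marked "hypothetical" are not part of the documented API.
ILayer<float> baseLayer = LoadPretrainedLayer();                      // hypothetical helper
Matrix<float> pretrainedW = GetWeightMatrix(baseLayer);               // hypothetical helper, shape [outputSize, inputSize]

var adapter = new LoRAXSAdapter<float>(baseLayer, rank: 8);           // steps 1-4: wrap the layer
adapter.InitializeFromSVD(pretrainedW);                               // freeze U_r, Σ_r, V_r^T; R starts as identity

foreach (var (input, target) in trainingBatches)                      // hypothetical data source
{
    Tensor<float> output = adapter.Forward(input);                    // step 5: only the 8×8 R matrix learns
    Tensor<float> lossGradient = ComputeLossGradient(output, target); // hypothetical loss helper
    adapter.Backward(lossGradient);
    adapter.UpdateParameters(0.001f);
}

ILayer<float> deployLayer = adapter.MergeToOriginalLayer();           // step 6: W' = W + U_r Σ_r R V_r^T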

References:

  • Paper: "LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters"
  • arXiv: 2405.17604 (May 2024)
  • GitHub: MohammadrezaBanaei/LoRA-XS
  • Key Innovation: Parameter count O(r²) instead of O(dr), enabling extreme efficiency

Constructors

LoRAXSAdapter(ILayer<T>, int, double, bool)

Initializes a new LoRA-XS adapter wrapping an existing layer.

public LoRAXSAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with LoRA-XS.

rank int

The rank of the SVD decomposition (number of singular values to use).

alpha double

The LoRA scaling factor (defaults to rank if negative).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training (always true for LoRA-XS).

Remarks

This constructor creates a LoRA-XS adapter. After construction, you MUST call InitializeFromSVD to properly initialize the frozen bases and trainable R matrix. Without SVD initialization, the adapter cannot function as intended.

For Beginners: This creates a LoRA-XS adapter for your layer.

Important steps:

  1. Create the adapter with this constructor
  2. Call InitializeFromSVD with your pretrained weights
  3. Start training (only the tiny R matrix gets updated!)

The rank parameter determines the size:

  • rank = 4: Only 16 trainable parameters (4×4)
  • rank = 8: Only 64 trainable parameters (8×8)
  • rank = 16: Only 256 trainable parameters (16×16)

Compare this to standard LoRA which would have thousands or millions of parameters!
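
For example (a hedged sketch; denseLayer stands in for any existing ILayer<T> you want to adapt and is not part of this page's API):

var tiny  = new LoRAXSAdapter<double>(denseLayer, rank: 4);   // 4×4 = 16 trainable parameters once SVD init has run
var small = new LoRAXSAdapter<double>(denseLayer, rank: 8);   // 8×8 = 64 trainable parameters
var wide  = new LoRAXSAdapter<double>(denseLayer, rank: 16);  // 16×16 = 256 trainable parameters
// Each adapter must still have InitializeFromSVD called before training.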

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

Properties

FrozenSigma

Gets the frozen singular values.

public Vector<T>? FrozenSigma { get; }

Property Value

Vector<T>

FrozenU

Gets the frozen U matrix (left singular vectors).

public Matrix<T>? FrozenU { get; }

Property Value

Matrix<T>

FrozenVt

Gets the frozen V^T matrix (right singular vectors transposed).

public Matrix<T>? FrozenVt { get; }

Property Value

Matrix<T>

InitializedFromSVD

Gets whether this adapter was initialized from SVD.

public bool InitializedFromSVD { get; }

Property Value

bool

Remarks

Returns true if InitializeFromSVD was called successfully. Without SVD initialization, LoRA-XS loses its key advantages and effectively becomes a very limited random adapter.

ParameterCount

Gets the total number of trainable parameters (only r² for the R matrix).

public override int ParameterCount { get; }

Property Value

int

Remarks

LoRA-XS only trains the rank×rank R matrix, so ParameterCount returns rank². The frozen U, Σ, and V matrices are not trainable parameters and are not included in this count.

Note: this count is much smaller than the 2*d*r parameters of the underlying LoRA layer, so the base class's parameter packing is guarded by the UpdateParametersFromLayers() override (see below) to avoid a buffer overrun during construction.

TrainableR

Gets the trainable R matrix.

public Matrix<T> TrainableR { get; }

Property Value

Matrix<T>

Methods

Backward(Tensor<T>)

Performs the backward pass through the LoRA-XS adapter.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass computes gradients for the trainable R matrix and propagates gradients back.

Gradient computation:

dL/dR = (Σ_r * U_r^T * outputGrad) * (V_r^T * input)^T * scaling
dL/dinput = base_grad + V_r * R^T * Σ_r * U_r^T * outputGrad * scaling

Note: U, Σ, and V are frozen, so no gradients are computed for them.

For Beginners: This is backpropagation for LoRA-XS!

What happens:

  1. Gradients flow back from the next layer
  2. We compute how to adjust R matrix to reduce error (U, Σ, V are frozen so we don't compute gradients for them)
  3. We pass gradients back to the previous layer

The key: only R learns! This is why training is so efficient.
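
Because Σ_r is diagonal and the example below uses a single input column, dL/dR reduces to a scaled outer product of (Σ_r U_r^T outputGrad) and (V_r^T input). The following sketch demonstrates the formula above with plain arrays; all sizes and values are made up, and it does not use or mirror the library's internals:

using System;

static double[] MatVec(double[,] m, double[] v)
{
    int rows = m.GetLength(0), cols = m.GetLength(1);
    var result = new double[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += m[i, j] * v[j];
    return result;
}

double[,] Ut = { { 1, 0, 0 }, { 0, 1, 0 } };    // U_r^T, frozen (r=2, d_out=3)
double[] sigma = { 2.0, 0.5 };                   // diagonal of Σ_r, frozen
double[,] Vt = { { 1, 0, 0 }, { 0, 1, 0 } };     // V_r^T, frozen (r=2, d_in=3)
double[] x = { 0.3, -1.2, 0.7 };                 // input column
double[] g = { 0.05, -0.10, 0.20 };              // dL/doutput from the next layer
double scaling = 1.0;                            // alpha / rank

double[] left = MatVec(Ut, g);                                // U_r^T * outputGrad
for (int i = 0; i < left.Length; i++) left[i] *= sigma[i];    // Σ_r * U_r^T * outputGrad
double[] right = MatVec(Vt, x);                               // V_r^T * input

var dR = new double[left.Length, right.Length];               // r×r gradient for the trainable R matrix
for (int i = 0; i < left.Length; i++)
    for (int j = 0; j < right.Length; j++)
        dR[i, j] = scaling * left[i] * right[j];

Console.WriteLine($"dR = [[{dR[0, 0]}, {dR[0, 1]}], [{dR[1, 0]}, {dR[1, 1]}]]");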

Forward(Tensor<T>)

Performs the forward pass through the LoRA-XS adapter.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Sum of base layer output and LoRA-XS adaptation.

Remarks

The forward pass computes: output = base_layer(input) + U_r * Σ_r * R * V_r^T * input * scaling

Steps:

  1. x1 = V_r^T * input (project input onto frozen right singular vectors)
  2. x2 = R * x1 (apply trainable mixing matrix)
  3. x3 = Σ_r * x2 (scale by frozen singular values)
  4. x4 = U_r * x3 (project onto frozen left singular vectors)
  5. output = base_output + scaling * x4

For Beginners: This is how data flows through LoRA-XS:

  1. Run input through the original layer (base layer)
  2. Also run through LoRA-XS path:
    • Project input using V (fixed patterns from pretraining)
    • Mix with R matrix (the ONLY thing that's learning!)
    • Scale by Σ (importance weights, fixed)
    • Project back using U (fixed output patterns)
  3. Add the two results together

Think of it like: original output + small learned adjustment. The adjustment is constrained to the most important pretrained patterns!
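
The five numbered steps above can be traced with plain arrays for a tiny example. The sketch below is illustrative only (made-up sizes and values) and does not call the library:

using System;

static double[] MatVec(double[,] m, double[] v)
{
    int rows = m.GetLength(0), cols = m.GetLength(1);
    var result = new double[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += m[i, j] * v[j];
    return result;
}

double[,] U = { { 1, 0 }, { 0, 1 }, { 0, 0 } };   // frozen U_r (d_out=3, r=2)
double[] sigma = { 2.0, 0.5 };                     // frozen singular values (diagonal of Σ_r)
double[,] R = { { 1, 0 }, { 0, 1 } };              // trainable R, initialized to identity
double[,] Vt = { { 0, 1, 0 }, { 0, 0, 1 } };       // frozen V_r^T (r=2, d_in=3)
double[] input = { 0.3, -1.2, 0.7 };
double scaling = 1.0;                              // alpha / rank

double[] x1 = MatVec(Vt, input);                          // 1. project onto frozen right singular vectors
double[] x2 = MatVec(R, x1);                              // 2. apply the trainable mixing matrix
for (int i = 0; i < x2.Length; i++) x2[i] *= sigma[i];    // 3. scale by frozen singular values
double[] x4 = MatVec(U, x2);                              // 4. project onto frozen left singular vectors

// 5. output = base_output + scaling * x4 (base_output omitted here for brevity)
Console.WriteLine(string.Join(", ", Array.ConvertAll(x4, v => (scaling * v).ToString("F2"))));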

GetParameters()

Gets the current parameters as a vector (only R matrix elements).

public override Vector<T> GetParameters()

Returns

Vector<T>

Vector containing R matrix flattened row-major.

InitializeFromSVD(Matrix<T>, SvdAlgorithmType)

Initializes the adapter from SVD of pretrained weights.

public void InitializeFromSVD(Matrix<T> pretrainedWeights, SvdAlgorithmType svdAlgorithm = SvdAlgorithmType.GolubReinsch)

Parameters

pretrainedWeights Matrix<T>

The pretrained weight matrix to decompose. Shape: [outputSize, inputSize]

svdAlgorithm SvdAlgorithmType

The SVD algorithm to use (default: GolubReinsch).

Remarks

This method performs the core LoRA-XS initialization:

  1. Computes the full SVD: W = U Σ V^T
  2. Extracts the top-r components: U_r (outputSize × r), Σ_r (r diagonal values), V_r^T (r × inputSize)
  3. Freezes U_r, Σ_r, and V_r^T as orthonormal bases
  4. Initializes the trainable R matrix to identity (a neutral transformation)
  5. During training, only R is updated; U, Σ, and V remain frozen

For Beginners: This is where LoRA-XS gets initialized properly!

What happens:

  1. Takes your pretrained weights (e.g., from a language model layer)
  2. Uses SVD to find the top-r most important patterns (like finding main themes in data)
  3. Saves these patterns as frozen "coordinate systems" (U and V)
  4. Saves their importance scores (Σ, the singular values)
  5. Creates a small R matrix that will learn to adapt between these coordinates

After this, when you train:

  • The frozen patterns (U, Σ, V) don't change
  • Only the tiny R matrix learns
  • This is why you only train r² parameters instead of millions!

Example: For a 4096×4096 weight matrix with rank=8:

  • Freezes 4096×8 U matrix (32,768 values, but frozen)
  • Freezes 8 singular values
  • Freezes 8×4096 V^T matrix (32,768 values, but frozen)
  • Trains only 8×8 R matrix (64 parameters!)
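
A brief usage sketch (adapter and pretrainedW are assumed to already exist; the method, its SvdAlgorithmType.GolubReinsch default, and the InitializedFromSVD property come from this page):

adapter.InitializeFromSVD(pretrainedW, SvdAlgorithmType.GolubReinsch);   // decompose, freeze U_r/Σ_r/V_r^T, set R = I
Console.WriteLine(adapter.InitializedFromSVD);                           // true once initialization succeeded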

Exceptions

ArgumentNullException

Thrown when pretrainedWeights is null.

ArgumentException

Thrown when weight matrix dimensions don't match layer dimensions.

MergeToOriginalLayer()

Merges the LoRA-XS adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with LoRA-XS weights merged into base weights.

Remarks

Computes: W' = W + U_r * Σ_r * R * V_r^T * scaling This allows deployment without the adapter overhead.

For Beginners: This "bakes in" your LoRA-XS training.

After training the R matrix, you can merge it back into the original weights:

  • Original weights + learned adaptation = new merged weights
  • Deployed model runs at full speed (no adapter overhead)
  • You can discard the adapter structure after merging

This is one of the key advantages: ultra-efficient training, normal-speed inference!
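
A minimal sketch (the trained adapter is assumed to exist already; MergeToOriginalLayer is the documented call):

ILayer<float> deployLayer = adapter.MergeToOriginalLayer();   // bake W' = W + U_r Σ_r R V_r^T * scaling into the layer
// deployLayer can now replace the original layer; no adapter structure is needed at inference time.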

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

SetParameters(Vector<T>)

Sets the layer parameters from a vector (R matrix only).

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Vector containing R matrix elements.

UpdateParameters(T)

Updates the trainable R matrix using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.

Remarks

Only the R matrix is updated; U, Σ, and V remain frozen.

UpdateParametersFromLayers()

Overrides the base class parameter packing to prevent a buffer overrun during base construction.

protected override void UpdateParametersFromLayers()

Remarks

The base class constructor calls UpdateParametersFromLayers(), which attempts to pack _loraLayer.GetParameters() (size 2*d*r). However, LoRAXSAdapter's ParameterCount returns Rank*Rank, which is much smaller, and _trainableR is not yet initialized at that point. This override guards against that early call and delegates to UpdateParametersFromR once the R matrix is ready.