Class LoRAXSAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

LoRA-XS (Extremely Small) adapter for ultra-parameter-efficient fine-tuning using SVD with trainable scaling matrix.

public class LoRAXSAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> ← LoRAXSAdapter<T>

Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Remarks

LoRA-XS achieves extreme parameter efficiency by leveraging the SVD of the pretrained weights to create frozen orthonormal bases (the U and V matrices), with only a small r×r trainable matrix R positioned between them. This architecture reduces the parameter count to r² instead of 2dr (standard LoRA), achieving a 100x+ reduction while matching or exceeding full fine-tuning performance.

Architecture Comparison:

  • Standard LoRA: W' = W + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d) (2dr parameters)
  • LoRA-XS: W' = W + U_r Σ_r R V_r^T, where only R ∈ ℝ^(r×r) is trainable (r² parameters)
  • U_r and V_r are frozen orthonormal bases from the SVD of the pretrained W
  • Σ_r is the frozen diagonal matrix of the top-r singular values

Key Innovation: Instead of training both A and B matrices (standard LoRA), LoRA-XS:

  1. Computes the SVD of the pretrained weights: W = U Σ V^T
  2. Freezes U_r (top-r left singular vectors) and V_r^T (top-r right singular vectors)
  3. Freezes Σ_r (top-r singular values as a diagonal matrix)
  4. Trains only R (r×r mixing matrix), which interpolates between the frozen bases
  5. Keeps the parameter count independent of the hidden dimensions: only r² trainable parameters
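As a quick sanity check of these counts, the snippet below compares trainable-parameter totals for a square d×d layer. It is illustrative arithmetic only and uses nothing from the library:

using System;

int d = 4096, r = 16;
long standardLoRA = 2L * d * r;   // B (d×r) plus A (r×d): 131,072 trainable values
long loraXS = (long)r * r;        // only the r×r mixing matrix R: 256 trainable values
Console.WriteLine($"Standard LoRA: {standardLoRA:N0}, LoRA-XS: {loraXS:N0}, reduction: {standardLoRA / loraXS}x");
// Prints a 512x reduction for this configuration, consistent with the 100x+ claim above.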

Performance Metrics (from paper):

RoBERTa-large on GLUE (6 tasks):

  • LoRA-XS (rank 16): 88.03% avg accuracy, 24.6K parameters
  • Standard LoRA (rank 16): Similar accuracy, 100x more parameters
  • Full fine-tuning: 88.0% avg accuracy, ~125M parameters per task

LLaMA2-7B on Commonsense Reasoning:

  • LoRA-XS: 80.5% avg accuracy, 3.67M parameters
  • Standard LoRA: 77.6% avg accuracy, 56M parameters (15x more)

Mistral-7B on GSM8K (Math Reasoning):

  • LoRA-XS: 70.35% accuracy, 3.67M parameters
  • Standard LoRA: 67.70% accuracy, 168M parameters (46x more)

GPT-3 Personalization (1M models):

  • LoRA-XS: 96GB total storage
  • Standard LoRA: 144TB total storage (1500x reduction)

Mathematical Formulation: The forward pass computes:

output = (W + U_r Σ_r R V_r^T) * input
       = W * input + (U_r Σ_r) * (R * (V_r^T * input))

Where:

  • W is frozen pretrained weights
  • U_r ∈ ℝ^(d_out × r): frozen left singular vectors (orthonormal columns)
  • Σ_r ∈ ℝ^(r × r): frozen diagonal matrix of singular values
  • R ∈ ℝ^(r × r): trainable mixing matrix (only trainable component!)
  • V_r^T ∈ ℝ^(r × d_in): frozen right singular vectors (orthonormal rows)

Why This Works: The SVD provides an optimal orthonormal basis for representing weight updates. By freezing these bases and training only the mixing matrix R, LoRA-XS achieves:

  • Drastically fewer parameters (r² vs 2dr)
  • Better generalization (updates are constrained to the pretrained subspace)
  • Faster convergence (optimal basis from initialization)
  • No inference overhead (the update can be merged back into W)
  • Scalable personalization (parameter count independent of model size)

For Beginners: Think of LoRA-XS as "ultra-compressed LoRA".

Imagine you have a large language model with huge weight matrices (e.g., 4096×4096):

Standard LoRA (rank 8):

  • Creates two low-rank matrices: B (4096×8) and A (8×4096)
  • Total parameters: 4096×8 + 8×4096 = 65,536 parameters
  • Both matrices are trainable

LoRA-XS (rank 8):

  • Decomposes pretrained weights with SVD into U, Σ, V
  • Keeps top 8 singular vectors (U_8, Σ_8, V_8) FROZEN
  • Trains only R matrix: 8×8 = 64 parameters
  • Achieves similar or better performance with 1000x fewer parameters!

It's like having two fixed "coordinate systems" from the pretrained model, and you only train a small "rotation matrix" between them. The fixed coordinate systems capture the pretrained knowledge, while the rotation matrix adapts to your task.

Example workflow:

  1. Load pretrained model weights W
  2. Compute SVD: W = U Σ V^T
  3. Extract top-r components: U_r, Σ_r, V_r
  4. Create LoRA-XS adapter with these frozen bases
  5. Train only the tiny R matrix (64 params for rank 8)
  6. Deploy with merged weights: W' = W + U_r Σ_r R V_r^T
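
A minimal code sketch of this workflow follows. Only the members documented on this page (the constructor, InitializeFromSVD, Forward, Backward, UpdateParameters, and MergeToOriginalLayer) are taken from the API; the pretrained layer, the way its weight matrix is obtained, the training-data source, and the loss-gradient helper are hypothetical placeholders:

using AiDotNet.LoRA.Adapters;

// Hedged usage sketch; names marked "hypothetical" are not part of the documented API.
ILayer<float> baseLayer = LoadPretrainedLayer();                      // hypothetical helper
Matrix<float> pretrainedW = GetWeightMatrix(baseLayer);               // hypothetical helper, shape [outputSize, inputSize]

var adapter = new LoRAXSAdapter<float>(baseLayer, rank: 8);           // steps 1-4: wrap the layer
adapter.InitializeFromSVD(pretrainedW);                               // freeze U_r, Σ_r, V_r^T; R starts as identity

foreach (var (input, target) in trainingBatches)                      // hypothetical data source
{
    Tensor<float> output = adapter.Forward(input);                    // step 5: only the 8×8 R matrix learns
    Tensor<float> lossGradient = ComputeLossGradient(output, target); // hypothetical loss helper
    adapter.Backward(lossGradient);
    adapter.UpdateParameters(0.001f);
}

ILayer<float> deployLayer = adapter.MergeToOriginalLayer();           // step 6: W' = W + U_r Σ_r R V_r^T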

References:

  • Paper: "LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters"
  • arXiv: 2405.17604 (May 2024)
  • GitHub: MohammadrezaBanaei/LoRA-XS
  • Key Innovation: Parameter count O(r²) instead of O(dr), enabling extreme efficiency

Constructors

LoRAXSAdapter(ILayer<T>, int, double, bool)

Initializes a new LoRA-XS adapter wrapping an existing layer.

public LoRAXSAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with LoRA-XS.

rank int

The rank of the SVD decomposition (number of singular values to use).

alpha double

The LoRA scaling factor (defaults to rank if negative).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training (always true for LoRA-XS).

Remarks

This constructor creates a LoRA-XS adapter. After construction, you MUST call InitializeFromSVD to properly initialize the frozen bases and trainable R matrix. Without SVD initialization, the adapter cannot function as intended.

For Beginners: This creates a LoRA-XS adapter for your layer.

Important steps:

  1. Create the adapter with this constructor
  2. Call InitializeFromSVD with your pretrained weights
  3. Start training (only the tiny R matrix gets updated!)

The rank parameter determines the size:

  • rank = 4: Only 16 trainable parameters (4×4)
  • rank = 8: Only 64 trainable parameters (8×8)
  • rank = 16: Only 256 trainable parameters (16×16)

Compare this to standard LoRA which would have thousands or millions of parameters!
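
For example (a hedged sketch; denseLayer stands in for any existing ILayer<T> you want to adapt and is not part of this page's API):

var tiny  = new LoRAXSAdapter<double>(denseLayer, rank: 4);   // 4×4 = 16 trainable parameters once SVD init has run
var small = new LoRAXSAdapter<double>(denseLayer, rank: 8);   // 8×8 = 64 trainable parameters
var wide  = new LoRAXSAdapter<double>(denseLayer, rank: 16);  // 16×16 = 256 trainable parameters
// Each adapter must still have InitializeFromSVD called before training.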

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

Properties

FrozenSigma

Gets the frozen singular values.

public Vector<T>? FrozenSigma { get; }

Property Value

Vector<T>

FrozenU

Gets the frozen U matrix (left singular vectors).

public Matrix<T>? FrozenU { get; }

Property Value

Matrix<T>

FrozenVt

Gets the frozen V^T matrix (right singular vectors transposed).

public Matrix<T>? FrozenVt { get; }

Property Value

Matrix<T>

InitializedFromSVD

Gets whether this adapter was initialized from SVD.

public bool InitializedFromSVD { get; }

Property Value

bool

Remarks

Returns true if InitializeFromSVD was called successfully. Without SVD initialization, LoRA-XS loses its key advantages and effectively becomes a very limited random adapter.

ParameterCount

Gets the total number of trainable parameters (only r² for the R matrix).

public override int ParameterCount { get; }

Property Value

int

Remarks

LoRA-XS only trains the rank×rank R matrix, so ParameterCount returns rank². The frozen U, Σ, and V matrices are not trainable parameters and are not included in this count.

Note: this count is much smaller than the 2*d*r parameters of the underlying LoRA layer, so the base class's parameter packing is guarded by the UpdateParametersFromLayers() override (see below) to avoid a buffer overrun during construction.

TrainableR

Gets the trainable R matrix.

public Matrix<T> TrainableR { get; }

Property Value

Matrix<T>

Methods

Backward(Tensor<T>)

Performs the backward pass through the LoRA-XS adapter.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass computes gradients for the trainable R matrix and propagates gradients back.

Gradient computation:

dL/dR = (Σ_r * U_r^T * outputGrad) * (V_r^T * input)^T * scaling
dL/dinput = base_grad + V_r * R^T * Σ_r * U_r^T * outputGrad * scaling

Note: U, Σ, and V are frozen, so no gradients are computed for them.

For Beginners: This is backpropagation for LoRA-XS!

What happens:

  1. Gradients flow back from the next layer
  2. We compute how to adjust R matrix to reduce error (U, Σ, V are frozen so we don't compute gradients for them)
  3. We pass gradients back to the previous layer

The key: only R learns! This is why training is so efficient.
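
Because Σ_r is diagonal and the example below uses a single input column, dL/dR reduces to a scaled outer product of (Σ_r U_r^T outputGrad) and (V_r^T input). The following sketch demonstrates the formula above with plain arrays; all sizes and values are made up, and it does not use or mirror the library's internals:

using System;

static double[] MatVec(double[,] m, double[] v)
{
    int rows = m.GetLength(0), cols = m.GetLength(1);
    var result = new double[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += m[i, j] * v[j];
    return result;
}

double[,] Ut = { { 1, 0, 0 }, { 0, 1, 0 } };    // U_r^T, frozen (r=2, d_out=3)
double[] sigma = { 2.0, 0.5 };                   // diagonal of Σ_r, frozen
double[,] Vt = { { 1, 0, 0 }, { 0, 1, 0 } };     // V_r^T, frozen (r=2, d_in=3)
double[] x = { 0.3, -1.2, 0.7 };                 // input column
double[] g = { 0.05, -0.10, 0.20 };              // dL/doutput from the next layer
double scaling = 1.0;                            // alpha / rank

double[] left = MatVec(Ut, g);                                // U_r^T * outputGrad
for (int i = 0; i < left.Length; i++) left[i] *= sigma[i];    // Σ_r * U_r^T * outputGrad
double[] right = MatVec(Vt, x);                               // V_r^T * input

var dR = new double[left.Length, right.Length];               // r×r gradient for the trainable R matrix
for (int i = 0; i < left.Length; i++)
    for (int j = 0; j < right.Length; j++)
        dR[i, j] = scaling * left[i] * right[j];

Console.WriteLine($"dR = [[{dR[0, 0]}, {dR[0, 1]}], [{dR[1, 0]}, {dR[1, 1]}]]");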

Forward(Tensor<T>)

Performs the forward pass through the LoRA-XS adapter.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Sum of base layer output and LoRA-XS adaptation.

Remarks

The forward pass computes: output = base_layer(input) + U_r * Σ_r * R * V_r^T * input * scaling

Steps:

  1. x1 = V_r^T * input (project input onto frozen right singular vectors)
  2. x2 = R * x1 (apply trainable mixing matrix)
  3. x3 = Σ_r * x2 (scale by frozen singular values)
  4. x4 = U_r * x3 (project onto frozen left singular vectors)
  5. output = base_output + scaling * x4

For Beginners: This is how data flows through LoRA-XS:

  1. Run input through the original layer (base layer)
  2. Also run through LoRA-XS path:
    • Project input using V (fixed patterns from pretraining)
    • Mix with R matrix (the ONLY thing that's learning!)
    • Scale by Σ (importance weights, fixed)
    • Project back using U (fixed output patterns)
  3. Add the two results together

Think of it like: original output + small learned adjustment. The adjustment is constrained to the most important pretrained patterns!
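
The five numbered steps above can be traced with plain arrays for a tiny example. The sketch below is illustrative only (made-up sizes and values) and does not call the library:

using System;

static double[] MatVec(double[,] m, double[] v)
{
    int rows = m.GetLength(0), cols = m.GetLength(1);
    var result = new double[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += m[i, j] * v[j];
    return result;
}

double[,] U = { { 1, 0 }, { 0, 1 }, { 0, 0 } };   // frozen U_r (d_out=3, r=2)
double[] sigma = { 2.0, 0.5 };                     // frozen singular values (diagonal of Σ_r)
double[,] R = { { 1, 0 }, { 0, 1 } };              // trainable R, initialized to identity
double[,] Vt = { { 0, 1, 0 }, { 0, 0, 1 } };       // frozen V_r^T (r=2, d_in=3)
double[] input = { 0.3, -1.2, 0.7 };
double scaling = 1.0;                              // alpha / rank

double[] x1 = MatVec(Vt, input);                          // 1. project onto frozen right singular vectors
double[] x2 = MatVec(R, x1);                              // 2. apply the trainable mixing matrix
for (int i = 0; i < x2.Length; i++) x2[i] *= sigma[i];    // 3. scale by frozen singular values
double[] x4 = MatVec(U, x2);                              // 4. project onto frozen left singular vectors

// 5. output = base_output + scaling * x4 (base_output omitted here for brevity)
Console.WriteLine(string.Join(", ", Array.ConvertAll(x4, v => (scaling * v).ToString("F2"))));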

GetParameters()

Gets the current parameters as a vector (only R matrix elements).

public override Vector<T> GetParameters()

Returns

Vector<T>

Vector containing R matrix flattened row-major.

InitializeFromSVD(Matrix<T>, SvdAlgorithmType)

Initializes the adapter from SVD of pretrained weights.

public void InitializeFromSVD(Matrix<T> pretrainedWeights, SvdAlgorithmType svdAlgorithm = SvdAlgorithmType.GolubReinsch)

Parameters

pretrainedWeights Matrix<T>

The pretrained weight matrix to decompose. Shape: [outputSize, inputSize]

svdAlgorithm SvdAlgorithmType

The SVD algorithm to use (default: GolubReinsch).

Remarks

This method performs the core LoRA-XS initialization:

  1. Computes the full SVD: W = U Σ V^T
  2. Extracts the top-r components: U_r (outputSize × r), Σ_r (r diagonal values), V_r^T (r × inputSize)
  3. Freezes U_r, Σ_r, and V_r^T as orthonormal bases
  4. Initializes the trainable R matrix to identity (a neutral transformation)
  5. During training, only R is updated; U, Σ, and V remain frozen

For Beginners: This is where LoRA-XS gets initialized properly!

What happens:

  1. Takes your pretrained weights (e.g., from a language model layer)
  2. Uses SVD to find the top-r most important patterns (like finding main themes in data)
  3. Saves these patterns as frozen "coordinate systems" (U and V)
  4. Saves their importance scores (Σ, the singular values)
  5. Creates a small R matrix that will learn to adapt between these coordinates

After this, when you train:

  • The frozen patterns (U, Σ, V) don't change
  • Only the tiny R matrix learns
  • This is why you only train r² parameters instead of millions!

Example: For a 4096×4096 weight matrix with rank=8:

  • Freezes 4096×8 U matrix (32,768 values, but frozen)
  • Freezes 8 singular values
  • Freezes 8×4096 V^T matrix (32,768 values, but frozen)
  • Trains only 8×8 R matrix (64 parameters!)
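
A brief usage sketch (adapter and pretrainedW are assumed to already exist; the method, its SvdAlgorithmType.GolubReinsch default, and the InitializedFromSVD property come from this page):

adapter.InitializeFromSVD(pretrainedW, SvdAlgorithmType.GolubReinsch);   // decompose, freeze U_r/Σ_r/V_r^T, set R = I
Console.WriteLine(adapter.InitializedFromSVD);                           // true once initialization succeeded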

Exceptions

ArgumentNullException

Thrown when pretrainedWeights is null.

ArgumentException

Thrown when weight matrix dimensions don't match layer dimensions.

MergeToOriginalLayer()

Merges the LoRA-XS adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with LoRA-XS weights merged into base weights.

Remarks

Computes: W' = W + U_r * Σ_r * R * V_r^T * scaling This allows deployment without the adapter overhead.

For Beginners: This "bakes in" your LoRA-XS training.

After training the R matrix, you can merge it back into the original weights:

  • Original weights + learned adaptation = new merged weights
  • Deployed model runs at full speed (no adapter overhead)
  • You can discard the adapter structure after merging

This is one of the key advantages: ultra-efficient training, normal-speed inference!
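
A minimal sketch (the trained adapter is assumed to exist already; MergeToOriginalLayer is the documented call):

ILayer<float> deployLayer = adapter.MergeToOriginalLayer();   // bake W' = W + U_r Σ_r R V_r^T * scaling into the layer
// deployLayer can now replace the original layer; no adapter structure is needed at inference time.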

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

SetParameters(Vector<T>)

Sets the layer parameters from a vector (R matrix only).

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Vector containing R matrix elements.

UpdateParameters(T)

Updates the trainable R matrix using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.

Remarks

Only the R matrix is updated; U, Σ, and V remain frozen.

UpdateParametersFromLayers()

Overrides the base class parameter packing to prevent a buffer overrun during base construction.

protected override void UpdateParametersFromLayers()

Remarks

The base class constructor calls UpdateParametersFromLayers(), which attempts to pack _loraLayer.GetParameters() (size 2*d*r). However, LoRAXSAdapter's ParameterCount returns Rank*Rank, which is much smaller, and _trainableR is not yet initialized at that point. This override guards against that early call and delegates to UpdateParametersFromR once the R matrix is ready.