Class LoRAXSAdapter<T>
LoRA-XS (Extremely Small) adapter for ultra-parameter-efficient fine-tuning using SVD with trainable scaling matrix.
public class LoRAXSAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → LoRAXSAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>
- Inherited Members
Remarks
LoRA-XS achieves extreme parameter efficiency by leveraging SVD of pretrained weights to create frozen orthonormal bases (U and V matrices), with only a small r×r trainable matrix R positioned between them. This architecture reduces parameter count to r² instead of 2nr (standard LoRA), achieving 100x+ reduction while matching or exceeding full fine-tuning performance.
Architecture Comparison:
- Standard LoRA: W' = W + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d) (2dr parameters)
- LoRA-XS: W' = W + U_r Σ_r R V_r^T, where only R ∈ ℝ^(r×r) is trainable (r² parameters)
- U_r and V_r are frozen orthonormal bases from the SVD of the pretrained W
- Σ_r is the frozen diagonal matrix of the top-r singular values
Key Innovation: Instead of training both A and B matrices (standard LoRA), LoRA-XS:
1. Computes the SVD of the pretrained weights: W = U Σ V^T
2. Freezes U_r (top-r left singular vectors) and V_r^T (top-r right singular vectors)
3. Freezes Σ_r (top-r singular values as a diagonal matrix)
4. Trains only R, an r×r mixing matrix that interpolates between the frozen bases
5. Keeps the parameter count independent of the hidden dimensions: only r² trainable parameters
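A minimal C# sketch of the parameter-count arithmetic, using illustrative dimensions rather than values from any particular model:

// Illustrative parameter-count comparison for a dOut × dIn weight matrix.
int dOut = 4096, dIn = 4096, r = 16;
int standardLoRAParams = dOut * r + r * dIn;  // B (dOut×r) + A (r×dIn) = 131,072
int loraXsParams       = r * r;               // R (r×r)                = 256
Console.WriteLine(standardLoRAParams / loraXsParams);  // 512x fewer trainable parameters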
Performance Metrics (from paper):
RoBERTa-large on GLUE (6 tasks):
- LoRA-XS (rank 16): 88.03% avg accuracy, 24.6K parameters
- Standard LoRA (rank 16): Similar accuracy, 100x more parameters
- Full fine-tuning: 88.0% avg accuracy, ~125M parameters per task
LLaMA2-7B on Commonsense Reasoning:
- LoRA-XS: 80.5% avg accuracy, 3.67M parameters
- Standard LoRA: 77.6% avg accuracy, 56M parameters (15x more)
Mistral-7B on GSM8K (Math Reasoning):
- LoRA-XS: 70.35% accuracy, 3.67M parameters
- Standard LoRA: 67.70% accuracy, 168M parameters (46x more)
GPT-3 Personalization (1M models):
- LoRA-XS: 96GB total storage
- Standard LoRA: 144TB total storage (1500x reduction)
Mathematical Formulation: The forward pass computes:
output = (W + U_r Σ_r R V_r^T) * input = W * input + (U_r Σ_r) * (R * (V_r^T * input))
Where:
- W is frozen pretrained weights
- U_r ∈ ℝ^(d_out × r): frozen left singular vectors (orthonormal columns)
- Σ_r ∈ ℝ^(r × r): frozen diagonal matrix of singular values
- R ∈ ℝ^(r × r): trainable mixing matrix (only trainable component!)
- V_r^T ∈ ℝ^(r × d_in): frozen right singular vectors (orthonormal rows)
Why This Works: The SVD provides an optimal orthonormal basis for representing weight updates. By freezing these bases and training only the mixing matrix R, LoRA-XS achieves:
- Drastically fewer parameters (r² vs 2dr)
- Better generalization (constrained to the pretrained subspace)
- Faster convergence (optimal basis from initialization)
- No inference overhead (can be merged back into W)
- Scalable personalization (parameter count independent of model size)
For Beginners: Think of LoRA-XS as "ultra-compressed LoRA".
Imagine you have a large language model with huge weight matrices (e.g., 4096×4096):
Standard LoRA (rank 8):
- Creates two matrices: A (8×4096) and B (4096×8)
- Total parameters: 8×4096 + 4096×8 = 65,536
- Both matrices are trainable
LoRA-XS (rank 8):
- Decomposes pretrained weights with SVD into U, Σ, V
- Keeps top 8 singular vectors (U_8, Σ_8, V_8) FROZEN
- Trains only R matrix: 8×8 = 64 parameters
- Achieves similar or better performance with 1000x fewer parameters!
It's like having two fixed "coordinate systems" from the pretrained model, and you only train a small "rotation matrix" between them. The fixed coordinate systems capture the pretrained knowledge, while the rotation matrix adapts to your task.
Example workflow:
- Load pretrained model weights W
- Compute SVD: W = U Σ V^T
- Extract top-r components: U_r, Σ_r, V_r
- Create LoRA-XS adapter with these frozen bases
- Train only the tiny R matrix (64 params for rank 8)
- Deploy with merged weights: W' = W + U_r Σ_r R V_r^T
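A hedged usage sketch of this workflow; DenseLayer and LoadPretrainedWeights are hypothetical stand-ins, and only the LoRAXSAdapter calls follow the API documented on this page:

// Wrap a pretrained layer; only the 8×8 R matrix will be trainable.
ILayer<double> baseLayer = new DenseLayer<double>(inputSize: 4096, outputSize: 4096); // hypothetical layer type
var adapter = new LoRAXSAdapter<double>(baseLayer, rank: 8);

// Initialize the frozen U, Σ, and V^T bases from the pretrained weight matrix.
Matrix<double> pretrainedW = LoadPretrainedWeights(); // hypothetical helper returning [outputSize, inputSize]
adapter.InitializeFromSVD(pretrainedW);

// ... training loop: Forward / Backward / UpdateParameters update only R ...

// Bake the adaptation into the base weights for deployment.
ILayer<double> merged = adapter.MergeToOriginalLayer();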
References:
- Paper: "LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters"
- arXiv: 2405.17604 (May 2024)
- GitHub: MohammadrezaBanaei/LoRA-XS
- Key Innovation: Parameter count O(r²) instead of O(dr), enabling extreme efficiency
Constructors
LoRAXSAdapter(ILayer<T>, int, double, bool)
Initializes a new LoRA-XS adapter wrapping an existing layer.
public LoRAXSAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with LoRA-XS.
rank (int): The rank of the SVD decomposition (number of singular values to use).
alpha (double): The LoRA scaling factor (defaults to rank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training (always true for LoRA-XS).
Remarks
This constructor creates a LoRA-XS adapter. After construction, you MUST call InitializeFromSVD to properly initialize the frozen bases and trainable R matrix. Without SVD initialization, the adapter cannot function as intended.
For Beginners: This creates a LoRA-XS adapter for your layer.
Important steps:
- Create the adapter with this constructor
- Call InitializeFromSVD with your pretrained weights
- Start training (only the tiny R matrix gets updated!)
The rank parameter determines the size:
- rank = 4: Only 16 trainable parameters (4×4)
- rank = 8: Only 64 trainable parameters (8×8)
- rank = 16: Only 256 trainable parameters (16×16)
Compare this to standard LoRA which would have thousands or millions of parameters!
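A small sketch of the rank-to-parameter relationship; DenseLayer is again a hypothetical layer type, and the printed counts assume ParameterCount reports only the R matrix (rank²):

var adapter4  = new LoRAXSAdapter<float>(new DenseLayer<float>(4096, 4096), rank: 4);
var adapter16 = new LoRAXSAdapter<float>(new DenseLayer<float>(4096, 4096), rank: 16);
Console.WriteLine(adapter4.ParameterCount);   // expected: 16  (4×4)
Console.WriteLine(adapter16.ParameterCount);  // expected: 256 (16×16)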
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
Properties
FrozenSigma
Gets the frozen singular values.
public Vector<T>? FrozenSigma { get; }
Property Value
- Vector<T>
FrozenU
Gets the frozen U matrix (left singular vectors).
public Matrix<T>? FrozenU { get; }
Property Value
- Matrix<T>
FrozenVt
Gets the frozen V^T matrix (right singular vectors transposed).
public Matrix<T>? FrozenVt { get; }
Property Value
- Matrix<T>
InitializedFromSVD
Gets whether this adapter was initialized from SVD.
public bool InitializedFromSVD { get; }
Property Value
- bool
Remarks
Returns true if InitializeFromSVD was called successfully. Without SVD initialization, LoRA-XS loses its key advantages and effectively becomes a very limited random adapter.
ParameterCount
Gets the total number of trainable parameters (only r² for the R matrix).
public override int ParameterCount { get; }
Property Value
- int
Remarks
LoRA-XS trains only the rank×rank R matrix, so ParameterCount returns rank². The frozen U, Σ, and V matrices are not counted as trainable parameters. Note that this count is much smaller than the 2*d*r parameters of the underlying LoRA layer that the base constructor tries to pack; the UpdateParametersFromLayers override guards against that mismatch during base construction.
TrainableR
Gets the trainable R matrix.
public Matrix<T> TrainableR { get; }
Property Value
- Matrix<T>
Methods
Backward(Tensor<T>)
Performs the backward pass through the LoRA-XS adapter.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass computes gradients for the trainable R matrix and propagates gradients back.
Gradient computation:
dL/dR = (Σ_r * U_r^T * outputGrad) * (V_r^T * input)^T * scaling
dL/dinput = base_grad + V_r * R^T * Σ_r * U_r^T * outputGrad * scaling
Note: U, Σ, and V are frozen, so no gradients computed for them.
For Beginners: This is backpropagation for LoRA-XS!
What happens:
- Gradients flow back from the next layer
- We compute how to adjust the R matrix to reduce error (U, Σ, and V are frozen, so we don't compute gradients for them)
- We pass gradients back to the previous layer
The key: only R learns! This is why training is so efficient.
Forward(Tensor<T>)
Performs the forward pass through the LoRA-XS adapter.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): Input tensor.
Returns
- Tensor<T>
Sum of base layer output and LoRA-XS adaptation.
Remarks
The forward pass computes: output = base_layer(input) + U_r * Σ_r * R * V_r^T * input * scaling
Steps:
- x1 = V_r^T * input (project input onto frozen right singular vectors)
- x2 = R * x1 (apply trainable mixing matrix)
- x3 = Σ_r * x2 (scale by frozen singular values)
- x4 = U_r * x3 (project onto frozen left singular vectors)
- output = base_output + scaling * x4
For Beginners: This is how data flows through LoRA-XS:
- Run input through the original layer (base layer)
- Also run through LoRA-XS path:
- Project input using V (fixed patterns from pretraining)
- Mix with R matrix (the ONLY thing that's learning!)
- Scale by Σ (importance weights, fixed)
- Project back using U (fixed output patterns)
- Add the two results together
Think of it like: original output + a small learned adjustment. The adjustment is constrained to the most important pretrained patterns!
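The steps above can be sketched with plain arrays; this is a conceptual illustration of the LoRA-XS path, not the adapter's actual implementation (which operates on Tensor<T>):

// delta = scaling * U_r * Σ_r * R * V_r^T * input, computed right-to-left.
static double[] LoRAXsDelta(double[,] U, double[] sigma, double[,] R, double[,] Vt,
                            double[] input, double scaling)
{
    int r = sigma.Length, dOut = U.GetLength(0), dIn = Vt.GetLength(1);

    var x1 = new double[r];                       // x1 = V_r^T * input
    for (int i = 0; i < r; i++)
        for (int j = 0; j < dIn; j++)
            x1[i] += Vt[i, j] * input[j];

    var x2 = new double[r];                       // x2 = R * x1 (the only trainable step)
    for (int i = 0; i < r; i++)
        for (int j = 0; j < r; j++)
            x2[i] += R[i, j] * x1[j];

    var x3 = new double[r];                       // x3 = Σ_r * x2 (scale by singular values)
    for (int i = 0; i < r; i++)
        x3[i] = sigma[i] * x2[i];

    var delta = new double[dOut];                 // x4 = U_r * x3, then apply scaling
    for (int i = 0; i < dOut; i++)
    {
        for (int j = 0; j < r; j++)
            delta[i] += U[i, j] * x3[j];
        delta[i] *= scaling;
    }
    return delta;                                 // caller adds this to the base layer's output
}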
GetParameters()
Gets the current parameters as a vector (only R matrix elements).
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing R matrix flattened row-major.
InitializeFromSVD(Matrix<T>, SvdAlgorithmType)
Initializes the adapter from SVD of pretrained weights.
public void InitializeFromSVD(Matrix<T> pretrainedWeights, SvdAlgorithmType svdAlgorithm = SvdAlgorithmType.GolubReinsch)
Parameters
pretrainedWeights (Matrix<T>): The pretrained weight matrix to decompose. Shape: [outputSize, inputSize]
svdAlgorithm (SvdAlgorithmType): The SVD algorithm to use (default: GolubReinsch).
Remarks
This method performs the core LoRA-XS initialization:
1. Computes the full SVD: W = U Σ V^T
2. Extracts the top-r components: U_r (outputSize × r), Σ_r (r diagonal values), V_r^T (r × inputSize)
3. Freezes U_r, Σ_r, and V_r^T as orthonormal bases
4. Initializes the trainable R matrix to identity (a neutral transformation)
5. During training, only R is updated; U, Σ, and V remain frozen
For Beginners: This is where LoRA-XS gets initialized properly!
What happens:
- Takes your pretrained weights (e.g., from a language model layer)
- Uses SVD to find the top-r most important patterns (like finding main themes in data)
- Saves these patterns as frozen "coordinate systems" (U and V)
- Saves their importance scores (Σ, the singular values)
- Creates a small R matrix that will learn to adapt between these coordinates
After this, when you train:
- The frozen patterns (U, Σ, V) don't change
- Only the tiny R matrix learns
- This is why you only train r² parameters instead of millions!
Example: For a 4096×4096 weight matrix with rank=8:
- Freezes 4096×8 U matrix (32,768 values, but frozen)
- Freezes 8 singular values
- Freezes 8×4096 V^T matrix (32,768 values, but frozen)
- Trains only 8×8 R matrix (64 parameters!)
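A short sketch of calling this method and inspecting the result; the adapter and pretrainedW are assumed to come from the constructor example above:

adapter.InitializeFromSVD(pretrainedW, SvdAlgorithmType.GolubReinsch);

bool ready = adapter.InitializedFromSVD;       // true after successful initialization
Matrix<double>? U  = adapter.FrozenU;          // outputSize × rank, frozen
Vector<double>? s  = adapter.FrozenSigma;      // rank singular values, frozen
Matrix<double>? Vt = adapter.FrozenVt;         // rank × inputSize, frozen
Matrix<double>  R  = adapter.TrainableR;       // rank × rank, starts at identity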
Exceptions
- ArgumentNullException
Thrown when pretrainedWeights is null.
- ArgumentException
Thrown when weight matrix dimensions don't match layer dimensions.
MergeToOriginalLayer()
Merges the LoRA-XS adaptation into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with LoRA-XS weights merged into base weights.
Remarks
Computes: W' = W + U_r * Σ_r * R * V_r^T * scaling
This allows deployment without the adapter overhead.
For Beginners: This "bakes in" your LoRA-XS training.
After training the R matrix, you can merge it back into the original weights:
- Original weights + learned adaptation = new merged weights
- Deployed model runs at full speed (no adapter overhead)
- You can discard the adapter structure after merging
This is one of the key advantages: ultra-efficient training, normal-speed inference!
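A brief post-training sketch, assuming the merged layer exposes the same Forward(Tensor<T>) call as the adapter and that the input tensor comes from the surrounding pipeline:

ILayer<double> merged = adapter.MergeToOriginalLayer();
Tensor<double> output = merged.Forward(input);   // inference without adapter overhead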
ResetState()
Resets the internal state of the adapter.
public override void ResetState()
SetParameters(Vector<T>)
Sets the layer parameters from a vector (R matrix only).
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): Vector containing the R matrix elements.
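A quick sketch of the round trip between GetParameters and SetParameters; the vector length matches the flattened rank × rank R matrix:

Vector<double> rParams = adapter.GetParameters();   // length = rank * rank, row-major R
// ... persist, transmit, or modify the values externally ...
adapter.SetParameters(rParams);                     // writes the values back into R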
UpdateParameters(T)
Updates the trainable R matrix using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
Only the R matrix is updated; U, Σ, and V remain frozen.
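A sketch of one training step using the documented methods; ComputeLossGradient, batchInput, and batchTarget are hypothetical placeholders for the surrounding training loop:

Tensor<double> prediction = adapter.Forward(batchInput);
Tensor<double> outputGrad = ComputeLossGradient(prediction, batchTarget); // hypothetical helper
Tensor<double> inputGrad  = adapter.Backward(outputGrad);                 // accumulates gradients for R
adapter.UpdateParameters(0.001);                                          // only R changes; U, Σ, V stay frozen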
UpdateParametersFromLayers()
Overrides the base class parameter packing to prevent a buffer overrun during base-class construction.
protected override void UpdateParametersFromLayers()
Remarks
The base class constructor calls UpdateParametersFromLayers() which tries to pack _loraLayer.GetParameters() (size 2*d*r). However, LoRAXSAdapter's ParameterCount returns Rank*Rank (much smaller) before _trainableR is initialized. This override guards against that early call and delegates to UpdateParametersFromR once the R matrix is ready.