Class RoSAAdapter<T>
RoSA (Robust Adaptation) adapter for parameter-efficient fine-tuning with improved robustness to distribution shifts.
public class RoSAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance: LayerBase<T> → LoRAAdapterBase<T> → RoSAAdapter<T>
- Implements: ILoRAAdapter<T>, ILayer<T>
Remarks
RoSA (Robust Adaptation) extends standard LoRA by combining two complementary components:
1. Low-rank component (standard LoRA): captures common, structured patterns in adaptations
2. Sparse component: captures specific, rare, or outlier patterns that a low-rank factorization cannot represent
Mathematical Formulation: Given input x and pre-trained weights W, RoSA computes:
- Low-rank component: L = (alpha/rank) * B * A * x
- Sparse component: S = W_sparse * x (where W_sparse is highly sparse)
- Final output: y = W*x + L + S
The sparse component is maintained through magnitude-based pruning, keeping only the most significant weights and zeroing out the rest. This creates a sparse matrix that captures specific patterns while remaining parameter-efficient.
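The formulation above can be sketched numerically. The following NumPy snippet is an illustrative sketch of the math only (the library itself is C#); sizes, initial values, and the magnitude cutoff are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0

x = rng.standard_normal(d_in)
W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weights
A = rng.standard_normal((rank, d_in))     # LoRA down-projection
B = np.zeros((d_out, rank))               # LoRA up-projection (zero-initialized)
W_sparse = rng.standard_normal((d_out, d_in))
W_sparse[np.abs(W_sparse) < 1.0] = 0.0    # magnitude pruning keeps only large entries

L = (alpha / rank) * B @ (A @ x)          # low-rank component
S = W_sparse @ x                          # sparse component
y = W @ x + L + S                         # RoSA output: base + low-rank + sparse
```

Because B starts at zero, the low-rank term contributes nothing until training updates it, so the initial output is just the base plus the sparse term.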
Research Context: RoSA was introduced in January 2024 as a robust alternative to standard LoRA. The key insight is that low-rank approximations work well for common patterns but struggle with distribution shifts and rare patterns. By adding a sparse component, RoSA can capture outliers and domain-specific patterns without significantly increasing parameter count.
In experiments on domain adaptation tasks, RoSA showed:
- Better generalization to new domains (+5-10% over standard LoRA)
- Greater robustness to distribution shifts
- Ability to capture both global patterns (low-rank) and local exceptions (sparse)
- Only modest increase in parameters (typically 5-15% more than pure LoRA)
For Beginners: RoSA is like LoRA with a safety net for unusual cases.
Think of it this way:
- Low-rank LoRA is like learning general rules ("most images of cats have pointed ears")
- Sparse component is like remembering specific exceptions ("this one cat breed has round ears")
- Together they make a robust model that handles both common and rare cases
Why RoSA is more robust:
- Low-rank component: Efficient for common patterns across domains
- Sparse component: Handles outliers and domain-specific quirks
- Result: Better performance when test data differs from training data
When to use RoSA over standard LoRA:
- When you expect distribution shifts (train on news, test on social media)
- When your data has outliers or rare patterns that matter
- When you need robustness more than absolute parameter efficiency
- When adapting to multiple related but distinct domains
Trade-offs vs standard LoRA:
- More robust to distribution shifts
- Better handles rare patterns
- More flexible adaptation
- Slightly more parameters (sparse component adds ~5-15%)
- Slightly more computation (extra sparse matrix multiply)
- Requires tuning sparsity ratio
Reference: Nikdan, Tabesh, and Alistarh, "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", January 2024.
Constructors
RoSAAdapter(ILayer<T>, int, double, double, double, bool)
Initializes a new RoSA adapter wrapping an existing layer.
public RoSAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, double sparsityRatio = 0.95, double sparseThreshold = 0.01, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with RoSA.
rank (int): The rank of the low-rank LoRA decomposition.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
sparsityRatio (double): Target sparsity ratio (0.0 to 1.0, typically 0.9-0.99).
sparseThreshold (double): Magnitude threshold for pruning sparse weights (typically 0.001-0.1).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
The constructor initializes the RoSA adapter by:
1. Setting up the standard LoRA components (via the base constructor)
2. Initializing the sparse weight matrix (starts with small random values)
3. Applying initial pruning to enforce sparsity
For Beginners: This creates a RoSA adapter around your existing layer.
Parameters:
- baseLayer: The layer you want to fine-tune efficiently and robustly
- rank: How much compression for the low-rank component (lower = fewer parameters)
- alpha: Scaling factor for LoRA contribution (usually equals rank)
- sparsityRatio: How sparse the sparse component should be (0.95 = 95% zeros)
- sparseThreshold: Minimum importance for keeping a sparse weight (0.01 is typical)
- freezeBaseLayer: Usually true - we only train LoRA + sparse, not base weights
Example: For a 1000x1000 layer with rank=8 and sparsityRatio=0.95:
- Base layer: 1,000,000 parameters (frozen)
- LoRA: 16,000 parameters (8 * (1000 + 1000))
- Sparse: ~50,000 parameters (5% of 1,000,000)
- Total trainable: ~66,000 parameters (vs 1M for full fine-tuning!)
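The parameter arithmetic in the example above can be reproduced directly (a plain Python sketch; the layer dimensions are the ones used in the example):

```python
in_size = out_size = 1000
rank = 8
sparsity_ratio = 0.95

base = in_size * out_size                   # frozen base layer parameters
lora = rank * (in_size + out_size)          # LoRA A and B matrices
sparse = int((1 - sparsity_ratio) * base)   # expected non-zero sparse weights
trainable = lora + sparse
print(base, lora, sparse, trainable)        # 1000000 16000 50000 66000
```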
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when sparsityRatio is not between 0 and 1.
Properties
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
int
Remarks
RoSA parameters include:
- Base layer parameters (if not frozen)
- LoRA parameters (rank * (inputSize + outputSize))
- Non-zero sparse parameters (varies based on sparsity)
For parameter counting, we report the full sparse matrix size, but in practice only the non-zero elements need to be stored and updated.
SparseThreshold
Threshold for magnitude-based pruning of sparse weights. Weights with magnitude below this threshold are set to zero.
public double SparseThreshold { get; set; }
Property Value
double
Remarks
This threshold controls the sparsity of the sparse component. Lower values result in more non-zero weights (less sparse), higher values result in fewer non-zero weights (more sparse).
For Beginners: This is like a "minimum importance" cutoff. If a weight's importance is below this value, we zero it out to maintain sparsity. Typical values: 0.001 to 0.1
SparsityRatio
Target sparsity ratio (fraction of zeros in sparse component).
public double SparsityRatio { get; set; }
Property Value
double
Remarks
This value controls how sparse the sparse component should be.
- 0.0 = no sparsity (all weights can be non-zero)
- 0.5 = 50% of weights are zero
- 0.95 = 95% of weights are zero (very sparse)
- 0.99 = 99% of weights are zero (extremely sparse)
For Beginners: This is the target percentage of zeros we want. Higher values (like 0.95) mean fewer non-zero weights, which keeps the model efficient. Lower values mean more flexibility but more parameters.
Typical values:
- 0.90 (90% zeros): More flexible, for complex domains
- 0.95 (95% zeros): Good balance (recommended starting point)
- 0.99 (99% zeros): Very efficient, for simple adaptations
Methods
Backward(Tensor<T>)
Performs the backward pass through RoSA adapter.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradientTensor<T>Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass computes gradients for all three components:
1. LoRA component (via the LoRA layer's backward pass)
2. Sparse component (direct gradient computation)
3. Base layer (if not frozen)
Gradients are accumulated and input gradients are summed.
For Beginners: This is where RoSA learns from errors.
The backward pass tells each component how to improve:
- LoRA component: Update low-rank matrices A and B
- Sparse component: Update the sparse weight matrix
- Base layer: Update if not frozen (usually frozen)
After this, UpdateParameters() will apply the learning using these gradients. The sparse gradients will be pruned to maintain sparsity.
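The gradient computations described above can be sketched for the formulation y = W x + (alpha/rank) B A x + S x. This is an illustrative NumPy derivation, not the library's C# code; shapes and the upstream gradient g are made-up test values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0
scale = alpha / rank

x = rng.standard_normal(d_in)
g = rng.standard_normal(d_out)            # gradient flowing back from the next layer
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in))
B = rng.standard_normal((d_out, rank))
S = rng.standard_normal((d_out, d_in))

# Per-component parameter gradients
grad_B = scale * np.outer(g, A @ x)       # shape (d_out, rank)
grad_A = scale * np.outer(B.T @ g, x)     # shape (rank, d_in)
grad_S = np.outer(g, x)                   # shape (d_out, d_in); later pruned

# Input gradient: contributions from all three paths are summed
grad_x = W.T @ g + scale * A.T @ (B.T @ g) + S.T @ g
```

As a sanity check, the summed input gradient equals the gradient through the merged weight matrix (W + scale·B·A + S) transposed, applied to g.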
Forward(Tensor<T>)
Performs the forward pass through RoSA adapter.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
inputTensor<T>Input tensor.
Returns
- Tensor<T>
Output combining base layer, low-rank LoRA, and sparse components.
Remarks
The RoSA forward pass computes:
1. Base output: y_base = base_layer(input)
2. LoRA output: y_lora = lora_layer(input)
3. Sparse output: y_sparse = input @ sparse_weights^T
4. Final output: y = y_base + y_lora + y_sparse
For Beginners: This is where all three components work together.
Think of it as three parallel processing paths:
- Base layer: Original pre-trained knowledge (usually frozen)
- LoRA component: Low-rank corrections for common patterns
- Sparse component: Specific corrections for rare patterns
All three outputs are added together to get the final result. This combination gives RoSA its robustness: the low-rank handles common patterns efficiently, while sparse handles outliers.
GetParameters()
Gets the current parameters as a vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing all parameters (base if not frozen, LoRA, sparse).
GetSparsity()
Gets the current sparsity of the sparse component.
public double GetSparsity()
Returns
- double
The fraction of zeros in the sparse weight matrix (0.0 to 1.0).
Remarks
This method computes the actual sparsity by counting zero and near-zero elements. The result can be compared to SparsityRatio to see how well pruning is working.
For Beginners: This tells you what percentage of the sparse component is actually zero.
If you set SparsityRatio to 0.95, this should return close to 0.95 after pruning. If it's much lower, you might need to adjust the threshold or pruning frequency.
Example return values:
- 0.95 = 95% zeros (good for target of 0.95)
- 0.80 = 80% zeros (less sparse than target)
- 0.99 = 99% zeros (more sparse than target)
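The sparsity measurement itself is simple: count the fraction of (near-)zero entries. A small NumPy sketch of the idea (the tolerance eps is an assumed value, not the library's):

```python
import numpy as np

def sparsity(W, eps=1e-12):
    """Fraction of entries whose magnitude is at or below eps."""
    return float(np.mean(np.abs(W) <= eps))

W = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, -1.2, 0.0]])
print(sparsity(W))   # 0.75 -> 6 of the 8 entries are zero
```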
MergeToOriginalLayer()
Merges the RoSA adaptation into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with both LoRA and sparse weights merged into the base layer's weights.
Remarks
This method creates a final layer by merging both components:
- Merged weights: W' = W_base + W_lora + W_sparse, where W_lora = (alpha/rank) * B * A
For Beginners: This "bakes in" both the LoRA and sparse adaptations for deployment.
After training with RoSA, you can create a single efficient layer by:
- Computing the LoRA weight contribution (B * A)
- Adding the sparse weights
- Adding both to the base weights
- Creating a new layer with the merged weights
The result is a standard layer that has all the adaptations built in:
- Faster inference (no need for three separate computations)
- Simpler deployment (single layer instead of adapter)
- Same behavior as the RoSA adapter
- Compatible with any system (doesn't need RoSA support)
Trade-off: You lose the ability to adjust LoRA/sparse contributions separately, but gain inference speed and simplicity.
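The merge can be verified numerically: folding both adaptations into one dense matrix reproduces the adapter's output with a single matmul. An illustrative NumPy sketch (random test values, not the library's C# merge code):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0

W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in))
B = rng.standard_normal((d_out, rank))
S = rng.standard_normal((d_out, d_in))
S[np.abs(S) < 1.0] = 0.0                  # sparse component

# Fold both adaptations into a single dense weight matrix
W_merged = W + (alpha / rank) * B @ A + S

# The merged layer matches the three-path adapter output exactly
x = rng.standard_normal(d_in)
y_adapter = W @ x + (alpha / rank) * B @ (A @ x) + S @ x
```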
Exceptions
- InvalidOperationException
Thrown when the base layer type is not supported for merging.
PruneSparseWeights()
Prunes sparse weights based on magnitude to maintain target sparsity.
public void PruneSparseWeights()
Remarks
This method implements magnitude-based pruning:
1. Computes the magnitude of all sparse weights
2. Determines a threshold based on the target sparsity ratio
3. Sets weights below the threshold to zero
This ensures the sparse component maintains its sparsity during training.
For Beginners: This is like cleaning up the sparse component.
We keep only the most important weights:
- Look at all the weights and their magnitudes
- Sort them by importance (magnitude)
- Keep the top X% (based on sparsity ratio)
- Zero out the rest
Example with sparsity ratio 0.95:
- We have 1000 weights
- We want 95% zeros (950 zeros, 50 non-zeros)
- Keep the 50 largest magnitudes
- Set the other 950 to zero
This is called periodically during training to maintain sparsity.
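The keep-top-X% procedure above can be sketched as a small NumPy function (an illustrative implementation of magnitude pruning, not the library's C# code):

```python
import numpy as np

def prune_by_magnitude(W, sparsity_ratio):
    """Zero all but the largest-magnitude (1 - ratio) fraction of weights."""
    k = int(round((1 - sparsity_ratio) * W.size))   # number of weights to keep
    if k == 0:
        return np.zeros_like(W)
    flat = np.abs(W).ravel()
    threshold = np.partition(flat, -k)[-k]          # k-th largest magnitude
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.arange(1, 11, dtype=float)                   # magnitudes 1..10
pruned = prune_by_magnitude(W, 0.8)                 # keep the top 2 of 10
print(pruned)                                       # only 9.0 and 10.0 survive
```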
ResetState()
Resets the internal state of the adapter.
public override void ResetState()
SetParameters(Vector<T>)
Sets the layer parameters from a vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parametersVector<T>Vector containing all parameters.
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRateTThe learning rate for parameter updates.
Remarks
This method updates all trainable components:
1. LoRA layer (always)
2. Sparse weights (always, then pruned to maintain sparsity)
3. Base layer (only if not frozen)
For Beginners: This applies the learning from the backward pass.
For each component:
- Use the gradients to update parameters
- For sparse weights: update, then prune to maintain sparsity
- This ensures we're always learning while keeping the model efficient