Class RoSAAdapter<T>
RoSA (Robust Adaptation) adapter for parameter-efficient fine-tuning with improved robustness to distribution shifts.
public class RoSAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance: LayerBase<T> → LoRAAdapterBase<T> → RoSAAdapter<T>
- Implements: ILoRAAdapter<T>, ILayer<T>
Remarks
RoSA (Robust Adaptation) extends standard LoRA by combining two complementary components:
1. Low-rank component (standard LoRA): captures common, structured patterns in adaptations
2. Sparse component: captures specific, rare, or outlier patterns that a low-rank factorization cannot represent
Mathematical Formulation: Given input x and pre-trained weights W, RoSA computes:
- Low-rank component: L = (alpha/rank) * B * A * x
- Sparse component: S = W_sparse * x (where W_sparse is highly sparse)
- Final output: y = W*x + L + S
The sparse component is maintained through magnitude-based pruning, keeping only the most significant weights and zeroing out the rest. This creates a sparse matrix that captures specific patterns while remaining parameter-efficient.
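The formulation above can be sketched numerically. The following NumPy snippet is an illustrative sketch of the math only (the library itself is C#); sizes, initial values, and the magnitude cutoff are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0

x = rng.standard_normal(d_in)
W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weights
A = rng.standard_normal((rank, d_in))     # LoRA down-projection
B = np.zeros((d_out, rank))               # LoRA up-projection (zero-initialized)
W_sparse = rng.standard_normal((d_out, d_in))
W_sparse[np.abs(W_sparse) < 1.0] = 0.0    # magnitude pruning keeps only large entries

L = (alpha / rank) * B @ (A @ x)          # low-rank component
S = W_sparse @ x                          # sparse component
y = W @ x + L + S                         # RoSA output: base + low-rank + sparse
```

Because B starts at zero, the low-rank term contributes nothing until training updates it, so the initial output is just the base plus the sparse term.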
Research Context: RoSA was introduced in January 2024 as a robust alternative to standard LoRA. The key insight is that low-rank approximations work well for common patterns but struggle with distribution shifts and rare patterns. By adding a sparse component, RoSA can capture outliers and domain-specific patterns without significantly increasing parameter count.
In experiments on domain adaptation tasks, RoSA showed:
- Better generalization to new domains (+5-10% over standard LoRA)
- Greater robustness to distribution shifts
- Ability to capture both global patterns (low-rank) and local exceptions (sparse)
- Only modest increase in parameters (typically 5-15% more than pure LoRA)
For Beginners: RoSA is like LoRA with a safety net for unusual cases.
Think of it this way:
- Low-rank LoRA is like learning general rules ("most images of cats have pointed ears")
- Sparse component is like remembering specific exceptions ("this one cat breed has round ears")
- Together they make a robust model that handles both common and rare cases
Why RoSA is more robust:
- Low-rank component: Efficient for common patterns across domains
- Sparse component: Handles outliers and domain-specific quirks
- Result: Better performance when test data differs from training data
When to use RoSA over standard LoRA:
- When you expect distribution shifts (train on news, test on social media)
- When your data has outliers or rare patterns that matter
- When you need robustness more than absolute parameter efficiency
- When adapting to multiple related but distinct domains
Trade-offs vs standard LoRA:
- More robust to distribution shifts
- Better handles rare patterns
- More flexible adaptation
- Slightly more parameters (sparse component adds ~5-15%)
- Slightly more computation (extra sparse matrix multiply)
- Requires tuning sparsity ratio
Reference: Nikdan, Tabesh, and Alistarh, "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", January 2024.
Constructors
RoSAAdapter(ILayer<T>, int, double, double, double, bool)
Initializes a new RoSA adapter wrapping an existing layer.
public RoSAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, double sparsityRatio = 0.95, double sparseThreshold = 0.01, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with RoSA.
rank (int): The rank of the low-rank LoRA decomposition.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
sparsityRatio (double): Target sparsity ratio (0.0 to 1.0, typically 0.9-0.99).
sparseThreshold (double): Magnitude threshold for pruning sparse weights (typically 0.001-0.1).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
The constructor initializes the RoSA adapter by:
1. Setting up the standard LoRA components (via the base constructor)
2. Initializing the sparse weight matrix (starts with small random values)
3. Applying initial pruning to enforce sparsity
For Beginners: This creates a RoSA adapter around your existing layer.
Parameters:
- baseLayer: The layer you want to fine-tune efficiently and robustly
- rank: How much compression for the low-rank component (lower = fewer parameters)
- alpha: Scaling factor for LoRA contribution (usually equals rank)
- sparsityRatio: How sparse the sparse component should be (0.95 = 95% zeros)
- sparseThreshold: Minimum importance for keeping a sparse weight (0.01 is typical)
- freezeBaseLayer: Usually true - we only train LoRA + sparse, not base weights
Example: For a 1000x1000 layer with rank=8 and sparsityRatio=0.95:
- Base layer: 1,000,000 parameters (frozen)
- LoRA: 16,000 parameters (8 * (1000 + 1000))
- Sparse: ~50,000 parameters (5% of 1,000,000)
- Total trainable: ~66,000 parameters (vs 1M for full fine-tuning!)
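The parameter arithmetic in the example above can be reproduced directly (a plain Python sketch; the layer dimensions are the ones used in the example):

```python
in_size = out_size = 1000
rank = 8
sparsity_ratio = 0.95

base = in_size * out_size                   # frozen base layer parameters
lora = rank * (in_size + out_size)          # LoRA A and B matrices
sparse = int((1 - sparsity_ratio) * base)   # expected non-zero sparse weights
trainable = lora + sparse
print(base, lora, sparse, trainable)        # 1000000 16000 50000 66000
```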
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when sparsityRatio is not between 0 and 1.
Properties
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
int
Remarks
RoSA parameters include:
- Base layer parameters (if not frozen)
- LoRA parameters (rank * (inputSize + outputSize))
- Non-zero sparse parameters (varies based on sparsity)
For parameter counting, we report the full sparse matrix size, but in practice only the non-zero elements need to be stored and updated.
SparseThreshold
Threshold for magnitude-based pruning of sparse weights. Weights with magnitude below this threshold are set to zero.
public double SparseThreshold { get; set; }
Property Value
double
Remarks
This threshold controls the sparsity of the sparse component. Lower values result in more non-zero weights (less sparse), higher values result in fewer non-zero weights (more sparse).
For Beginners: This is like a "minimum importance" cutoff. If a weight's importance is below this value, we zero it out to maintain sparsity. Typical values: 0.001 to 0.1
SparsityRatio
Target sparsity ratio (fraction of zeros in sparse component).
public double SparsityRatio { get; set; }
Property Value
double
Remarks
This value controls how sparse the sparse component should be.
- 0.0 = no sparsity (all weights can be non-zero)
- 0.5 = 50% of weights are zero
- 0.95 = 95% of weights are zero (very sparse)
- 0.99 = 99% of weights are zero (extremely sparse)
For Beginners: This is the target percentage of zeros we want. Higher values (like 0.95) mean fewer non-zero weights, which keeps the model efficient. Lower values mean more flexibility but more parameters.
Typical values:
- 0.90 (90% zeros): More flexible, for complex domains
- 0.95 (95% zeros): Good balance (recommended starting point)
- 0.99 (99% zeros): Very efficient, for simple adaptations
Methods
Backward(Tensor<T>)
Performs the backward pass through RoSA adapter.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradientTensor<T>Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass computes gradients for all three components:
1. LoRA component (via the LoRA layer's backward pass)
2. Sparse component (direct gradient computation)
3. Base layer (if not frozen)
Gradients are accumulated and input gradients are summed.
For Beginners: This is where RoSA learns from errors.
The backward pass tells each component how to improve:
- LoRA component: Update low-rank matrices A and B
- Sparse component: Update the sparse weight matrix
- Base layer: Update if not frozen (usually frozen)
After this, UpdateParameters() will apply the learning using these gradients. The sparse gradients will be pruned to maintain sparsity.
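The gradient computations described above can be sketched for the formulation y = W x + (alpha/rank) B A x + S x. This is an illustrative NumPy derivation, not the library's C# code; shapes and the upstream gradient g are made-up test values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0
scale = alpha / rank

x = rng.standard_normal(d_in)
g = rng.standard_normal(d_out)            # gradient flowing back from the next layer
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in))
B = rng.standard_normal((d_out, rank))
S = rng.standard_normal((d_out, d_in))

# Per-component parameter gradients
grad_B = scale * np.outer(g, A @ x)       # shape (d_out, rank)
grad_A = scale * np.outer(B.T @ g, x)     # shape (rank, d_in)
grad_S = np.outer(g, x)                   # shape (d_out, d_in); later pruned

# Input gradient: contributions from all three paths are summed
grad_x = W.T @ g + scale * A.T @ (B.T @ g) + S.T @ g
```

As a sanity check, the summed input gradient equals the gradient through the merged weight matrix (W + scale·B·A + S) transposed, applied to g.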
Forward(Tensor<T>)
Performs the forward pass through RoSA adapter.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
inputTensor<T>Input tensor.
Returns
- Tensor<T>
Output combining base layer, low-rank LoRA, and sparse components.
Remarks
The RoSA forward pass computes:
1. Base output: y_base = base_layer(input)
2. LoRA output: y_lora = lora_layer(input)
3. Sparse output: y_sparse = input @ sparse_weights^T
4. Final output: y = y_base + y_lora + y_sparse
For Beginners: This is where all three components work together.
Think of it as three parallel processing paths:
- Base layer: Original pre-trained knowledge (usually frozen)
- LoRA component: Low-rank corrections for common patterns
- Sparse component: Specific corrections for rare patterns
All three outputs are added together to get the final result. This combination gives RoSA its robustness: the low-rank handles common patterns efficiently, while sparse handles outliers.
GetParameters()
Gets the current parameters as a vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing all parameters (base if not frozen, LoRA, sparse).
GetSparsity()
Gets the current sparsity of the sparse component.
public double GetSparsity()
Returns
- double
The fraction of zeros in the sparse weight matrix (0.0 to 1.0).
Remarks
This method computes the actual sparsity by counting zero and near-zero elements. The result can be compared to SparsityRatio to see how well pruning is working.
For Beginners: This tells you what percentage of the sparse component is actually zero.
If you set SparsityRatio to 0.95, this should return close to 0.95 after pruning. If it's much lower, you might need to adjust the threshold or pruning frequency.
Example return values:
- 0.95 = 95% zeros (good for target of 0.95)
- 0.80 = 80% zeros (less sparse than target)
- 0.99 = 99% zeros (more sparse than target)
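The sparsity measurement itself is simple: count the fraction of (near-)zero entries. A small NumPy sketch of the idea (the tolerance eps is an assumed value, not the library's):

```python
import numpy as np

def sparsity(W, eps=1e-12):
    """Fraction of entries whose magnitude is at or below eps."""
    return float(np.mean(np.abs(W) <= eps))

W = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, -1.2, 0.0]])
print(sparsity(W))   # 0.75 -> 6 of the 8 entries are zero
```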
MergeToOriginalLayer()
Merges the RoSA adaptation into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with both LoRA and sparse weights merged into the base layer's weights.
Remarks
This method creates a final layer by merging both components:
- Merged weights: W' = W_base + W_lora + W_sparse, where W_lora = (alpha/rank) * B * A
For Beginners: This "bakes in" both the LoRA and sparse adaptations for deployment.
After training with RoSA, you can create a single efficient layer by:
- Computing the LoRA weight contribution (B * A)
- Adding the sparse weights
- Adding both to the base weights
- Creating a new layer with the merged weights
The result is a standard layer that has all the adaptations built in:
- Faster inference (no need for three separate computations)
- Simpler deployment (single layer instead of adapter)
- Same behavior as the RoSA adapter
- Compatible with any system (doesn't need RoSA support)
Trade-off: You lose the ability to adjust LoRA/sparse contributions separately, but gain inference speed and simplicity.
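The merge can be verified numerically: folding both adaptations into one dense matrix reproduces the adapter's output with a single matmul. An illustrative NumPy sketch (random test values, not the library's C# merge code):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, rank, alpha = 4, 3, 2, 2.0

W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in))
B = rng.standard_normal((d_out, rank))
S = rng.standard_normal((d_out, d_in))
S[np.abs(S) < 1.0] = 0.0                  # sparse component

# Fold both adaptations into a single dense weight matrix
W_merged = W + (alpha / rank) * B @ A + S

# The merged layer matches the three-path adapter output exactly
x = rng.standard_normal(d_in)
y_adapter = W @ x + (alpha / rank) * B @ (A @ x) + S @ x
```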
Exceptions
- InvalidOperationException
Thrown when the base layer type is not supported for merging.
PruneSparseWeights()
Prunes sparse weights based on magnitude to maintain target sparsity.
public void PruneSparseWeights()
Remarks
This method implements magnitude-based pruning:
1. Computes the magnitude of all sparse weights
2. Determines a threshold based on the target sparsity ratio
3. Sets weights below the threshold to zero
This ensures the sparse component maintains its sparsity during training.
For Beginners: This is like cleaning up the sparse component.
We keep only the most important weights:
- Look at all the weights and their magnitudes
- Sort them by importance (magnitude)
- Keep the top X% (based on sparsity ratio)
- Zero out the rest
Example with sparsity ratio 0.95:
- We have 1000 weights
- We want 95% zeros (950 zeros, 50 non-zeros)
- Keep the 50 largest magnitudes
- Set the other 950 to zero
This is called periodically during training to maintain sparsity.
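The keep-top-X% procedure above can be sketched as a small NumPy function (an illustrative implementation of magnitude pruning, not the library's C# code):

```python
import numpy as np

def prune_by_magnitude(W, sparsity_ratio):
    """Zero all but the largest-magnitude (1 - ratio) fraction of weights."""
    k = int(round((1 - sparsity_ratio) * W.size))   # number of weights to keep
    if k == 0:
        return np.zeros_like(W)
    flat = np.abs(W).ravel()
    threshold = np.partition(flat, -k)[-k]          # k-th largest magnitude
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.arange(1, 11, dtype=float)                   # magnitudes 1..10
pruned = prune_by_magnitude(W, 0.8)                 # keep the top 2 of 10
print(pruned)                                       # only 9.0 and 10.0 survive
```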
ResetState()
Resets the internal state of the adapter.
public override void ResetState()
SetParameters(Vector<T>)
Sets the layer parameters from a vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parametersVector<T>Vector containing all parameters.
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRateTThe learning rate for parameter updates.
Remarks
This method updates all trainable components:
1. LoRA layer (always)
2. Sparse weights (always, then pruned to maintain sparsity)
3. Base layer (only if not frozen)
For Beginners: This applies the learning from the backward pass.
For each component:
- Use the gradients to update parameters
- For sparse weights: update, then prune to maintain sparsity
- This ensures we're always learning while keeping the model efficient