Class MoRAAdapter<T>
Implements the MoRA (High-Rank Updating for Parameter-Efficient Fine-Tuning) adapter.
public class MoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
Inheritance
LayerBase<T> → LoRAAdapterBase<T> → MoRAAdapter<T>
Implements
ILoRAAdapter<T>, ILayer<T>
Remarks
Paper Reference: "MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning" by Ting Jiang, Shaohan Huang, et al. (arXiv:2405.12130, May 2024)
MoRA addresses a fundamental limitation of LoRA: the low-rank constraint restricts the model's ability to learn and memorize new knowledge. While LoRA uses two rectangular matrices (A and B) to create low-rank updates, MoRA uses a single square matrix M combined with non-parameter-sharing operators to achieve high-rank updates while maintaining the same parameter count.
Key Innovations:
High-Rank Updates: Unlike LoRA's rank-r updates (r << d), MoRA achieves rank-r̂ updates where r̂ can equal the full dimension d, enabling the model to learn richer representations.
Square Matrix M: Instead of LoRA's rectangular A (r×d) and B (d×r) matrices, MoRA uses a single square matrix M (r×r). For a d×d weight and the same parameter budget as LoRA with rank r_LoRA, this gives r = √(2·d·r_LoRA), yielding a much higher effective rank for the same parameter count.
Non-Parameter-Sharing Operators: MoRA uses rotation, permutation, or other linear transformations that don't add trainable parameters but enable dimension compression and decompression around the square matrix M.
Input Compression / Output Decompression: The architecture is:
- Compress: Input (d) to Compressed (r) via rotation/permutation
- Transform: Compressed (r) to Transformed (r) via trainable matrix M
- Decompress: Transformed (r) to Output (d) via inverse rotation/permutation
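A conceptual sketch of this pipeline for a single input vector (plain arrays and a simple MatVec helper for illustration, not the library's actual tensor API):

```csharp
// Conceptual MoRA delta for one input vector x of dimension d.
// Rc (r x d) and Rd (d x r) are fixed operators; M (r x r) is trainable.
float[] MoRADelta(float[] x, float[,] Rc, float[,] M, float[,] Rd)
{
    float[] compressed  = MatVec(Rc, x);          // d -> r  (compress)
    float[] transformed = MatVec(M, compressed);  // r -> r  (trainable)
    return MatVec(Rd, transformed);               // r -> d  (decompress)
}

// Plain dense matrix-vector product used by the sketch above.
float[] MatVec(float[,] A, float[] v)
{
    int rows = A.GetLength(0), cols = A.GetLength(1);
    var result = new float[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += A[i, j] * v[j];
    return result;
}
```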
Architecture Comparison:
LoRA: W = W₀ + BA where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d)
- Parameters: 2dr
- Rank: r (low-rank constraint)
- Typical r: 8-64
MoRA: W = W₀ + R_d M R_c where M ∈ ℝ^(r×r), with fixed operators R_c ∈ ℝ^(r×d) (compress) and R_d ∈ ℝ^(d×r) (decompress)
- Parameters: r²
- Rank: min(r, d) (can be full-rank)
- For the same parameter count as LoRA with rank r_LoRA: r² = 2·d·r_LoRA, so r = √(2·d·r_LoRA)
- Example: LoRA with r_LoRA=8, d=1024 has 16,384 params and rank 8; MoRA with the same budget has r = √16,384 = 128 and rank 128 (16× higher!)
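The arithmetic in the example above can be checked directly (a small sketch; as in the paper, a non-square budget would be floored):

```csharp
// For a d x d weight, LoRA with rank rLoRA trains 2 * d * rLoRA parameters.
// MoRA spends the same budget on one square matrix of side sqrt(2 * d * rLoRA).
int EquivalentMoRARank(int d, int rLoRA)
{
    int loraParams = 2 * d * rLoRA;        // e.g., 2 * 1024 * 8 = 16,384
    return (int)Math.Sqrt(loraParams);     // e.g., sqrt(16,384) = 128
}
```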
Performance (from paper):
Compared to LoRA on various tasks:
- Memory-Intensive Tasks: MoRA significantly outperforms LoRA
  - Continual Pretraining: ~15% better perplexity
  - Instruction Tuning: ~8% better accuracy on knowledge-intensive QA
- Reasoning Tasks: MoRA performs comparably to LoRA
  - Mathematical Reasoning: similar performance (within 1-2%)
- Parameter Efficiency: same parameter count as LoRA
- Training Speed: slightly slower than LoRA due to rotation operations (≈5-10% overhead)
When to Use MoRA vs LoRA:
Use MoRA when:
- Task requires memorizing new facts or knowledge
- Domain adaptation with significant vocabulary changes
- Continual learning scenarios
- You need the model to "remember" rather than just "adapt"
Use LoRA when:
- Task is primarily reasoning or pattern recognition
- Minimal new knowledge acquisition needed
- Training speed is critical
- Standard parameter-efficient fine-tuning is sufficient
Implementation Details:
This implementation uses rotation matrices as the non-parameter-sharing operators:
- Compression R_c: Projects input from dimension d to dimension r
- Decompression R_d: Projects from dimension r back to dimension d
- These are generated using random orthogonal matrices (Gram-Schmidt orthogonalization)
- They remain fixed during training (non-trainable)
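A minimal sketch of generating such a fixed operator via Gram-Schmidt (illustrative only; the library may seed or structure these matrices differently):

```csharp
// Sketch: build an r x d matrix with orthonormal rows via Gram-Schmidt.
// The result is generated once and kept fixed (non-trainable) during training.
double[][] RandomOrthonormalRows(int r, int d, int seed = 42)
{
    var rng = new Random(seed);
    var rows = new double[r][];
    for (int i = 0; i < r; i++)
    {
        // Start from a random vector in [-1, 1)^d.
        var v = new double[d];
        for (int k = 0; k < d; k++) v[k] = rng.NextDouble() * 2 - 1;

        // Remove the components along all previously accepted rows.
        for (int j = 0; j < i; j++)
        {
            double dot = 0;
            for (int k = 0; k < d; k++) dot += v[k] * rows[j][k];
            for (int k = 0; k < d; k++) v[k] -= dot * rows[j][k];
        }

        // Normalize to unit length (random vectors are almost surely
        // linearly independent, so the norm is safely non-zero).
        double norm = 0;
        for (int k = 0; k < d; k++) norm += v[k] * v[k];
        norm = Math.Sqrt(norm);
        for (int k = 0; k < d; k++) v[k] /= norm;

        rows[i] = v;
    }
    return rows;
}
```

One natural choice is to use this matrix as R_c and its transpose as R_d: orthonormal rows make the transpose a right inverse (R_c R_cᵀ = I), so decompression undoes compression on the retained subspace.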
Alternative operators mentioned in the paper (not implemented here):
- RoPE-based rotations (Rotary Position Embeddings)
- Random permutations
- Structured rotations (e.g., Hadamard transforms)
For Beginners: MoRA is like an upgraded version of LoRA that can learn more complex changes to a model while using the same amount of memory.
Think of it like this:
- LoRA is like having 2 small notebooks to write changes (matrices A and B)
- MoRA is like having 1 square notebook plus a compression/decompression scheme
The key insight: By compressing the input, applying changes in compressed space, and then decompressing, MoRA can make higher-rank updates that capture more complex patterns. This is especially useful when you're teaching the model entirely new facts or concepts, not just adapting its existing knowledge.
Example: If you're fine-tuning a model to learn medical terminology, MoRA will be better at memorizing the new terms, while LoRA might be better at learning to reason about medical cases using existing knowledge.
Constructors
MoRAAdapter(ILayer<T>, int, double, bool)
public MoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = 1, bool freezeBaseLayer = true)
Parameters
- baseLayer (ILayer<T>): The layer to wrap and adapt.
- rank (int): The adaptation rank used to size the parameter budget (see SquareRank for the resulting dimension of M).
- alpha (double): Scaling factor applied to the adaptation. Defaults to 1.
- freezeBaseLayer (bool): Whether the base layer's weights stay fixed during training. Defaults to true.
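A minimal usage sketch (the surrounding variables and exact tensor construction are assumptions; only the MoRAAdapter calls mirror the members documented on this page):

```csharp
void FineTuneWithMoRA(ILayer<float> baseLayer, Tensor<float> input,
                      Tensor<float> outputGradient, float learningRate)
{
    // Wrap the (frozen) base layer; rank 8 is the LoRA-equivalent budget,
    // so the square matrix M will be much larger than 8 (see SquareRank).
    var adapter = new MoRAAdapter<float>(baseLayer, rank: 8, alpha: 1.0);

    var output = adapter.Forward(input);              // base(x) + MoRA delta
    var inputGradient = adapter.Backward(outputGradient);
    adapter.UpdateParameters(learningRate);           // trains only matrix M

    // After training, bake the adaptation into a single plain layer.
    ILayer<float> merged = adapter.MergeToOriginalLayer();
}
```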
Properties
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
- int
Remarks
If the base layer is frozen, this returns only the adapter's parameter count (the entries of the square matrix M). Otherwise, it returns the sum of base and adapter parameters.
SquareRank
Gets the effective rank of the MoRA adaptation.
public int SquareRank { get; }
Property Value
- int
Remarks
This is the dimension of the square matrix M, which determines the maximum rank of the updates MoRA can make. Unlike LoRA where this is typically 8-64, MoRA can achieve ranks of 128+ with the same parameter count.
Methods
Backward(Tensor<T>)
Performs the backward pass through both layers.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
- outputGradient (Tensor<T>): Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass propagates gradients through both the MoRA path and (if not frozen) the base layer. The input gradients from both paths are summed.
For Beginners: During learning, this figures out how to improve both parts:
- Always updates the MoRA matrix M (that's what we're training)
- Only updates the base layer if it's not frozen
- Combines the gradients from both paths to tell earlier layers how to improve
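A sketch of the MoRA-path gradients, written against the notation from the Remarks (illustrative only; the library's internals may differ):

```csharp
// Computes dL/dM and the MoRA-path input gradient for y = Rd * M * (Rc * x),
// given the incoming output gradient g. Reuses MatVec from the earlier
// pipeline sketch. Returns the input gradient; gradM feeds the update of M.
float[] MoRABackward(float[] x, float[] g,
                     float[,] Rc, float[,] M, float[,] Rd,
                     out float[,] gradM)
{
    int r = M.GetLength(0);
    float[] c  = MatVec(Rc, x);             // compressed input (length r)
    float[] gC = MatVec(Transpose(Rd), g);  // gradient pulled back to r-space

    // dL/dM is the outer product gC * c^T, an r x r matrix shaped like M.
    gradM = new float[r, r];
    for (int i = 0; i < r; i++)
        for (int j = 0; j < r; j++)
            gradM[i, j] = gC[i] * c[j];

    // Input gradient through the MoRA path: Rc^T * M^T * gC. The adapter
    // sums this with the base layer's input gradient before returning it.
    return MatVec(Transpose(Rc), MatVec(Transpose(M), gC));
}

// Plain transpose helper used above.
float[,] Transpose(float[,] A)
{
    int n = A.GetLength(0), m = A.GetLength(1);
    var T = new float[m, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            T[j, i] = A[i, j];
    return T;
}
```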
CreateLoRALayer(int, double)
Creates a minimal placeholder LoRA layer to satisfy base class requirements.
protected override LoRALayer<T> CreateLoRALayer(int rank, double alpha)
Parameters
Returns
- LoRALayer<T>
Remarks
IMPORTANT: MoRA does NOT use the standard LoRA layer architecture. This method creates a minimal LoRALayer with rank=1 only to satisfy the LoRAAdapterBase contract, but it is never used in MoRA's Forward, Backward, or UpdateParameters methods.
MoRA uses its own square matrix M combined with compression/decompression matrices instead of the standard A/B low-rank decomposition. The actual MoRA logic is implemented directly in the overridden methods using _matrixM, _compressionMatrix, and _decompressionMatrix.
This design choice maintains compatibility with LoRAAdapterBase while avoiding the overhead of a full-rank unused LoRA layer. Future refactoring could make the LoRA layer optional in LoRAAdapterBase or have MoRAAdapter extend LayerBase directly.
Forward(Tensor<T>)
Performs the forward pass through the base layer and the MoRA adaptation path.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
- input (Tensor<T>): Input tensor.
Returns
- Tensor<T>
Sum of the base layer output and the MoRA adaptation output.
Remarks
The forward pass computes: output = base_layer(input) + mora_delta(input), where mora_delta(input) = decompress(M · compress(input)).
For Beginners: This runs the input through both the original layer and the MoRA correction path, then adds their outputs together. The result is the original behavior plus the learned adaptation.
GetParameters()
Gets the current parameter values (base layer + MoRA matrix M).
public override Vector<T> GetParameters()
Returns
- Vector<T>
A cloned vector containing all parameters.
Remarks
Since MoRA does not use the standard LoRA layer architecture, this method overrides the base implementation to pack parameters from the base layer (if not frozen) and the square matrix M directly.
MergeToOriginalLayer()
Merges the MoRA adaptation into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with the adapter's update merged into the base layer's weights.
Remarks
This method must be implemented by derived classes to handle layer-type-specific merging logic. Each type of adapter (Dense, Convolutional, etc.) needs to know how to combine its LoRA weights with the base layer's weights.
For Beginners: This "bakes in" your LoRA adaptation to create a regular layer. After training with LoRA, you can merge the adaptation into the original weights for: - Faster inference (no need to compute LoRA separately) - Simpler deployment (single layer instead of two) - Compatibility with systems that don't support LoRA
Each layer type implements this differently because they have different internal structures.
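For a dense base layer, merging amounts to materializing the d×d update once and adding it to W₀ (a sketch, assuming the adapter applies its alpha scaling multiplicatively; the real code is layer-type specific):

```csharp
// Materialize deltaW = alpha * Rd * M * Rc as a full d x d matrix, then add
// it to the base weights. After this, inference needs no adapter at all.
float[,] MergeDense(float[,] W0, float[,] Rd, float[,] M, float[,] Rc, float alpha)
{
    float[,] deltaW = MatMul(Rd, MatMul(M, Rc));  // (d x r)(r x r)(r x d) = d x d
    int d = W0.GetLength(0);
    var merged = new float[d, d];
    for (int i = 0; i < d; i++)
        for (int j = 0; j < d; j++)
            merged[i, j] = W0[i, j] + alpha * deltaW[i, j];
    return merged;
}

// Plain dense matrix product used above.
float[,] MatMul(float[,] A, float[,] B)
{
    int n = A.GetLength(0), k = A.GetLength(1), m = B.GetLength(1);
    var C = new float[n, m];
    for (int i = 0; i < n; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < m; j++)
                C[i, j] += A[i, p] * B[p, j];
    return C;
}
```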
ResetState()
Resets the internal state of both the base layer and LoRA layer.
public override void ResetState()
Remarks
For Beginners: This clears the memory of both the base layer and the LoRA layer. It's useful when starting to process a completely new, unrelated batch of data.
SetParameters(Vector<T>)
Sets the parameter values (base layer + MoRA matrix M).
public override void SetParameters(Vector<T> parameters)
Parameters
- parameters (Vector<T>): Parameter vector to set.
Remarks
This method unpacks the parameter vector into the base layer (if not frozen) and the square matrix M. The parameter layout is:
- Base layer parameters (if !_freezeBaseLayer): [0 .. baseLayerParamCount)
- Matrix M parameters (row-major): [baseLayerParamCount .. ParameterCount)
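A sketch of that unpacking (field and helper names such as Slice are illustrative, not the actual Vector<T> API):

```csharp
// Unpack: optional base-layer block first, then M in row-major order.
int offset = 0;
if (!freezeBaseLayer)
{
    // Hypothetical slicing helper; the real Vector<T> API may differ.
    baseLayer.SetParameters(parameters.Slice(0, baseParamCount));
    offset = baseParamCount;
}
for (int i = 0; i < squareRank; i++)
    for (int j = 0; j < squareRank; j++)
        matrixM[i, j] = parameters[offset + i * squareRank + j];
```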
Exceptions
- ArgumentException
Thrown if parameter count doesn't match ParameterCount.
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
- learningRate (T): The learning rate for parameter updates.
UpdateParametersFromLayers()
Overrides the base parameter packing to use the MoRA matrix M instead of the placeholder LoRA layer. This ensures that the public parameter surface is consistent with ParameterCount.
protected override void UpdateParametersFromLayers()