Class MoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

Implements the MoRA (High-Rank Updating for Parameter-Efficient Fine-Tuning) adapter.

public class MoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → MoRAAdapter<T>

Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable

Remarks

Paper Reference: "MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning" by Ting Jiang, Shaohan Huang, et al. (arXiv:2405.12130, May 2024)

MoRA addresses a fundamental limitation of LoRA: the low-rank constraint restricts the model's ability to learn and memorize new knowledge. While LoRA uses two rectangular matrices (A and B) to create low-rank updates, MoRA uses a single square matrix M combined with non-parameter-sharing operators to achieve high-rank updates while maintaining the same parameter count.

Key Innovations:

  1. High-Rank Updates: Unlike LoRA's rank-r updates (r << d), MoRA achieves rank-r̂ updates where r̂ can equal the full dimension d, enabling the model to learn richer representations.

  2. Square Matrix M: Instead of LoRA's two rectangular matrices A and B, MoRA uses a single square matrix M (r̂×r̂). Choosing r̂ = √(2·d·r) gives the same parameter count as a rank-r LoRA while allowing a much higher effective rank.

  3. Non-Parameter-Sharing Operators: MoRA uses rotation, permutation, or other linear transformations that don't add trainable parameters but enable dimension compression and decompression around the square matrix M.

  4. Input Compression / Output Decompression: The architecture (sketched in code after this list) is:

    • Compress: Input (d) to Compressed (r̂) via rotation/permutation
    • Transform: Compressed (r̂) to Transformed (r̂) via trainable matrix M
    • Decompress: Transformed (r̂) to Output (d) via inverse rotation/permutation
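
A minimal C# sketch of this pipeline, using plain arrays; the names (MoraDelta, rc, rdInv, MatVec) are illustrative and not part of AiDotNet's API:

// Sketch only: compress (d -> r̂) with a fixed matrix rc, transform with the
// trainable square matrix m (r̂ -> r̂), then decompress (r̂ -> d) with rdInv.
static double[] MoraDelta(double[] x, double[,] rc, double[,] m, double[,] rdInv)
{
    double[] compressed = MatVec(rc, x);          // d -> r̂
    double[] transformed = MatVec(m, compressed); // r̂ -> r̂ (trainable)
    return MatVec(rdInv, transformed);            // r̂ -> d
}

static double[] MatVec(double[,] a, double[] v)
{
    int rows = a.GetLength(0), cols = a.GetLength(1);
    var result = new double[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            result[i] += a[i, j] * v[j];
    return result;
}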

Architecture Comparison:

LoRA: W = W₀ + BA where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d)

  • Parameters: 2dr
  • Rank: r (low-rank constraint)
  • Typical r: 8-64

MoRA: W = W₀ + R_d⁻¹ M R_c where M ∈ ℝ^(r̂×r̂)

  • Parameters: r̂²
  • Rank: min(r̂, d) (can be full-rank)
  • For the same parameter count as a rank-r LoRA: r̂ = √(2·d·r), so rank ≈ √(2·d·r)
  • Example: LoRA with r=8, d=1024 has 16,384 parameters and rank 8; MoRA with the same parameter count has r̂=128 and rank 128 (16× higher!)
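
The parity arithmetic is easy to verify in a few lines (variable names are illustrative):

// LoRA with rank r and dimension d uses 2*d*r parameters; MoRA's square
// matrix uses rHat*rHat. Equating the two gives rHat = sqrt(2*d*r).
int d = 1024, r = 8;
int loraParams = 2 * d * r;               // 16,384
int rHat = (int)Math.Sqrt(2.0 * d * r);   // 128
int moraParams = rHat * rHat;             // 16,384 (same budget, rank 128 vs. 8)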

Performance (from paper):

Compared to LoRA on various tasks:

  • Memory-Intensive Tasks: MoRA significantly outperforms LoRA
    • Continual Pretraining: ~15% better perplexity
    • Instruction Tuning: ~8% better accuracy on knowledge-intensive QA
  • Reasoning Tasks: MoRA performs comparably to LoRA
    • Mathematical Reasoning: Similar performance (within 1-2%)
  • Parameter Efficiency: Same parameter count as LoRA
  • Training Speed: Slightly slower than LoRA due to rotation operations (≈5-10% overhead)

When to Use MoRA vs LoRA:

Use MoRA when:

  • Task requires memorizing new facts or knowledge
  • Domain adaptation with significant vocabulary changes
  • Continual learning scenarios
  • You need the model to "remember" rather than just "adapt"

Use LoRA when:

  • Task is primarily reasoning or pattern recognition
  • Minimal new knowledge acquisition needed
  • Training speed is critical
  • Standard parameter-efficient fine-tuning is sufficient

Implementation Details:

This implementation uses rotation matrices as the non-parameter-sharing operators:

  • Compression R_c: Projects input from dimension d to dimension r
  • Decompression R_d: Projects from dimension r back to dimension d
  • These are generated using random orthogonal matrices (Gram-Schmidt orthogonalization)
  • They remain fixed during training (non-trainable)
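
A hedged sketch of how such a fixed orthogonal operator can be generated with Gram-Schmidt (an illustrative helper, not the library's actual code; assumes rows <= cols):

// Build a matrix with orthonormal rows: start from random vectors, subtract
// the projections onto rows already accepted, then normalize to unit length.
static double[][] RandomOrthogonalRows(int rows, int cols, Random rng)
{
    var q = new double[rows][];
    for (int i = 0; i < rows; i++)
    {
        var v = new double[cols];
        for (int j = 0; j < cols; j++) v[j] = rng.NextDouble() - 0.5;

        for (int k = 0; k < i; k++)            // remove components along q[k]
        {
            double dot = 0;
            for (int j = 0; j < cols; j++) dot += v[j] * q[k][j];
            for (int j = 0; j < cols; j++) v[j] -= dot * q[k][j];
        }

        double norm = 0;                       // normalize to unit length
        for (int j = 0; j < cols; j++) norm += v[j] * v[j];
        norm = Math.Sqrt(norm);
        for (int j = 0; j < cols; j++) v[j] /= norm;
        q[i] = v;
    }
    return q;
}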

Alternative operators mentioned in the paper (not implemented here):

  • RoPE-based rotations (Rotary Position Embeddings)
  • Random permutations
  • Structured rotations (e.g., Hadamard transforms)

For Beginners: MoRA is like an upgraded version of LoRA that can learn more complex changes to a model while using the same amount of memory.

Think of it like this:

  • LoRA is like having 2 small notebooks to write changes (matrices A and B)
  • MoRA is like having 1 square notebook plus a compression/decompression scheme

The key insight: By compressing the input, applying changes in compressed space, and then decompressing, MoRA can make higher-rank updates that capture more complex patterns. This is especially useful when you're teaching the model entirely new facts or concepts, not just adapting its existing knowledge.

Example: If you're fine-tuning a model to learn medical terminology, MoRA will be better at memorizing the new terms, while LoRA might be better at learning to reason about medical cases using existing knowledge.

Constructors

MoRAAdapter(ILayer<T>, int, double, bool)

public MoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = 1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to wrap with the MoRA adaptation.

rank int

The adaptation rank, following the same convention as LoRA's rank parameter.

alpha double

The scaling factor applied to the adaptation output.

freezeBaseLayer bool

Whether the base layer's weights are frozen during training.
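
A hedged usage sketch (assumes you already have an ILayer<float> from your model; the concrete layer type is up to you):

// Wrap an existing layer with MoRA. rank follows the LoRA convention, alpha
// scales the adaptation, and freezeBaseLayer keeps the original weights fixed.
ILayer<float> baseLayer = /* an existing layer from your model */;
var adapter = new MoRAAdapter<float>(baseLayer, rank: 8, alpha: 1.0, freezeBaseLayer: true);
Console.WriteLine(adapter.SquareRank);  // effective rank of the square matrix M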

Properties

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

Remarks

If the base layer is frozen, this returns only the adapter's own parameter count (the entries of matrix M). Otherwise, it returns the sum of base layer and adapter parameters.

SquareRank

Gets the effective rank of the MoRA adaptation.

public int SquareRank { get; }

Property Value

int

Remarks

This is the dimension of the square matrix M, which determines the maximum rank of the updates MoRA can make. Unlike LoRA where this is typically 8-64, MoRA can achieve ranks of 128+ with the same parameter count.

Methods

Backward(Tensor<T>)

Performs the backward pass through both layers.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass propagates gradients through both the MoRA path and (if not frozen) the base layer. The input gradients from both paths are summed.

For Beginners: During learning, this figures out how to improve both parts:

  • Always updates the MoRA matrix M (that's what we're training)
  • Only updates the base layer if it's not frozen
  • Combines the gradients from both paths to tell earlier layers how to improve

CreateLoRALayer(int, double)

Creates a minimal placeholder LoRA layer to satisfy base class requirements.

protected override LoRALayer<T> CreateLoRALayer(int rank, double alpha)

Parameters

rank int
alpha double

Returns

LoRALayer<T>

Remarks

IMPORTANT: MoRA does NOT use the standard LoRA layer architecture. This method creates a minimal LoRALayer with rank=1 only to satisfy the LoRAAdapterBase contract, but it is never used in MoRA's Forward, Backward, or UpdateParameters methods.

MoRA uses its own square matrix M combined with compression/decompression matrices instead of the standard A/B low-rank decomposition. The actual MoRA logic is implemented directly in the overridden methods using _matrixM, _compressionMatrix, and _decompressionMatrix.

This design choice maintains compatibility with LoRAAdapterBase while avoiding the overhead of a full-rank unused LoRA layer. Future refactoring could make the LoRA layer optional in LoRAAdapterBase or have MoRAAdapter extend LayerBase directly.

Forward(Tensor<T>)

Performs the forward pass through the base layer and the MoRA path.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Sum of the base layer output and the MoRA adaptation output.

Remarks

The forward pass computes: output = base_layer(input) + decompress(M · compress(input)). MoRA's own path replaces the standard lora_layer term here; the placeholder LoRA layer is not used (see CreateLoRALayer).

For Beginners: This runs the input through both the original layer and the MoRA correction path, then adds their outputs together. The result is the original behavior plus the learned adaptation.

GetParameters()

Gets the current parameter values (base layer + MoRA matrix M).

public override Vector<T> GetParameters()

Returns

Vector<T>

A cloned vector containing all parameters.

Remarks

Since MoRA does not use the standard LoRA layer architecture, this method overrides the base implementation to pack parameters from the base layer (if not frozen) and the square matrix M directly.

MergeToOriginalLayer()

Merges the MoRA adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with the MoRA update merged into the base layer's weights.

Remarks

This method must be implemented by derived classes to handle layer-type-specific merging logic. Each type of adapter (Dense, Convolutional, etc.) needs to know how to combine its LoRA weights with the base layer's weights.

For Beginners: This "bakes in" your LoRA adaptation to create a regular layer. After training with LoRA, you can merge the adaptation into the original weights for: - Faster inference (no need to compute LoRA separately) - Simpler deployment (single layer instead of two) - Compatibility with systems that don't support LoRA

Each layer type implements this differently because they have different internal structures.
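
Conceptually, merging folds MoRA's update into the base weights once: W_merged = W₀ + R_d⁻¹ M R_c. A hedged sketch with plain matrices (illustrative names; the real method also handles layer-type specifics such as bias terms):

// Fold the MoRA update into the base weights so inference needs one matmul.
static double[,] MergeWeights(double[,] w0, double[,] rdInv, double[,] m, double[,] rc)
{
    double[,] delta = MatMul(rdInv, MatMul(m, rc)); // (d x r̂)(r̂ x r̂)(r̂ x d) = d x d
    int n = w0.GetLength(0), k = w0.GetLength(1);
    var merged = new double[n, k];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++)
            merged[i, j] = w0[i, j] + delta[i, j];
    return merged;
}

static double[,] MatMul(double[,] a, double[,] b)
{
    int n = a.GetLength(0), p = a.GetLength(1), k = b.GetLength(1);
    var c = new double[n, k];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++)
            for (int t = 0; t < p; t++)
                c[i, j] += a[i, t] * b[t, j];
    return c;
}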

ResetState()

Resets the internal state of both the base layer and LoRA layer.

public override void ResetState()

Remarks

For Beginners: This clears the memory of both the base layer and the LoRA layer. It's useful when starting to process a completely new, unrelated batch of data.

SetParameters(Vector<T>)

Sets the parameter values (base layer + MoRA matrix M).

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Parameter vector to set.

Remarks

This method unpacks the parameter vector into the base layer (if not frozen) and the square matrix M. The parameter layout is:

  • Base layer parameters (if !_freezeBaseLayer): [0 .. baseLayerParamCount)
  • Matrix M parameters (row-major): [baseLayerParamCount .. ParameterCount)
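
A hedged sketch of the offset arithmetic this layout implies (hypothetical local names, not fields of the class):

// Offsets into the packed parameter vector, following the layout above.
int baseCount = freezeBaseLayer ? 0 : baseLayerParamCount;
int mStart = baseCount;                                 // matrix M begins here
int MIndex(int row, int col, int r) => mStart + row * r + col;  // row-major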

Exceptions

ArgumentException

Thrown if parameter count doesn't match ParameterCount.

UpdateParameters(T)

Updates parameters using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.

UpdateParametersFromLayers()

Overrides the base parameter packing to use the MoRA matrix M instead of the placeholder LoRA layer. This ensures that the public parameter surface is consistent with ParameterCount.

protected override void UpdateParametersFromLayers()