Class DoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

DoRA (Weight-Decomposed Low-Rank Adaptation) adapter for parameter-efficient fine-tuning with improved stability.

public class DoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → DoRAAdapter<T>
Implements
IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Remarks

DoRA (Weight-Decomposed LoRA) extends standard LoRA by decomposing pre-trained weights into magnitude and direction components, then applying LoRA only to the direction component. This decomposition leads to more stable training and better convergence compared to standard LoRA.

Mathematical Formulation: Given pre-trained weights W, DoRA decomposes them as:

  • W = m * d, where m is the magnitude (one scalar per output neuron) and d is the direction (a unit vector per neuron)
  • W' = m * normalize(d + LoRA_delta)
  • LoRA_delta = (alpha / rank) * B * A

This ensures that LoRA adaptations primarily affect the direction of weights, not their magnitude, which improves training stability and convergence.
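
As a concrete illustration of the decomposition, here is a minimal self-contained C# sketch for a single weight row, using plain arrays rather than AiDotNet's types (the delta values are made up for illustration):

using System;
using System.Linq;

// Decompose one weight row w into magnitude m and unit direction d.
double[] w = { 3.0, 4.0 };                       // original weight row
double m = Math.Sqrt(w.Sum(x => x * x));         // magnitude: ||w|| = 5
double[] d = w.Select(x => x / m).ToArray();     // direction: (0.6, 0.8)

// Apply an illustrative LoRA delta to the direction, renormalize,
// and scale by the magnitude: W' = m * normalize(d + LoRA_delta).
double[] delta = { 0.1, -0.2 };
double[] dPrime = d.Zip(delta, (a, b) => a + b).ToArray();
double norm = Math.Sqrt(dPrime.Sum(x => x * x));
double[] wAdapted = dPrime.Select(x => m * x / norm).ToArray();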

Research Context: DoRA was published in February 2024 and presented as an ICML 2024 Oral paper. In experiments on LLaMA-7B, DoRA achieved a +3.7% improvement over standard LoRA. The key insight is that separating magnitude from direction allows more stable gradient flow and better control over the adaptation process.

For Beginners: DoRA is an improved version of LoRA that works better in practice.

Think of neural network weights as arrows:

  • Each arrow has a length (magnitude) and a direction
  • Standard LoRA adjusts both length and direction at the same time
  • DoRA separates them: it learns the length and the direction independently, with LoRA adjusting only the direction
  • This makes training more stable and gives better results

Why this matters:

  • More stable training (fewer divergences and NaN errors)
  • Better final performance (+3.7% on LLaMA-7B)
  • Same parameter efficiency as standard LoRA
  • Slightly more computation (due to normalization), but worth it for the stability

When to use DoRA over standard LoRA:

  • When training stability is important (large models, complex tasks)
  • When you want the best possible fine-tuning results
  • When you have the computational budget for normalization overhead
  • When adapting very large pre-trained models (LLMs, large vision models)

Reference: "DoRA: Weight-Decomposed Low-Rank Adaptation", ICML 2024 (Oral), https://arxiv.org/abs/2402.09353

Constructors

DoRAAdapter(ILayer<T>, int, double, bool)

Initializes a new DoRA adapter wrapping an existing layer.

public DoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with DoRA.

rank int

The rank of the LoRA decomposition.

alpha double

The LoRA scaling factor (defaults to rank if negative).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training.

Remarks

The constructor initializes the DoRA adapter by:

  1. Setting up the standard LoRA components (via the base constructor)
  2. Decomposing the base layer's initial weights into magnitude and direction
  3. Initializing the magnitude gradients

For Beginners: This creates a DoRA adapter around your existing layer.

What happens during initialization:

  • The base class sets up standard LoRA (matrices A and B)
  • We then decompose the layer's weights into magnitude and direction
  • Each magnitude starts as the actual magnitude of the corresponding row of the original weights
  • During training, both the LoRA matrices and the magnitudes will be updated

Parameters:

  • baseLayer: The layer you want to fine-tune efficiently
  • rank: How much compression for LoRA (lower = fewer parameters)
  • alpha: Scaling factor for LoRA contribution
  • freezeBaseLayer: Usually true - we only train LoRA + magnitude, not base weights
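
A minimal usage sketch; DenseLayer<float> and its constructor arguments are hypothetical placeholders for whatever ILayer<T> implementation you are adapting:

using AiDotNet.LoRA.Adapters;

// Hypothetical dense layer with 512 inputs and 256 outputs.
ILayer<float> baseLayer = new DenseLayer<float>(inputSize: 512, outputSize: 256);

// Wrap it with DoRA: rank-8 decomposition, alpha defaults to the rank,
// and the base weights stay frozen (only LoRA matrices + magnitudes train).
var adapter = new DoRAAdapter<float>(baseLayer, rank: 8);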

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

Properties

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

Remarks

DoRA adds the magnitude parameters (one per output neuron) to the standard LoRA parameters. Total = (base layer parameters if not frozen) + LoRA parameters + magnitude parameters.
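
For example, assuming the usual LoRA factor shapes (A: rank × inputs, B: outputs × rank), a frozen 512-input, 256-output layer adapted at rank 8 would have:

  • LoRA A: 8 × 512 = 4,096 parameters
  • LoRA B: 256 × 8 = 2,048 parameters
  • Magnitude: 256 parameters (one per output neuron)
  • Total trainable: 6,400 parameters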

Methods

Backward(Tensor<T>)

Performs the backward pass through DoRA adapter.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass computes gradients for:

  1. The magnitude parameters (one per output neuron)
  2. The LoRA matrices A and B (via the LoRA layer's backward pass)
  3. The base layer weights (if not frozen)

The key challenge is computing how changes to magnitude and direction affect the loss, given that the direction is normalized during forward pass.

For Beginners: This is where DoRA learns during training.

Backward pass figures out how to improve three things:

  1. The magnitude of each output neuron's weights
  2. The LoRA matrices that adjust the direction
  3. The base layer weights (if we're training them too)

The math is subtle because of the normalization step: normalizing the direction couples all elements of a weight vector, so the gradients must account for that coupling.

For simplicity, this implementation computes approximate gradients that work well in practice. The exact gradients would require storing more intermediate values from the forward pass.
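
To make the magnitude part concrete: since each output is y_i = m_i * dot(input, d_norm_i), the magnitude gradient is the output gradient times that dot product. A minimal plain-C# sketch, mirroring the formulation above rather than the library's exact internals:

// Gradient of the loss w.r.t. one magnitude m_i for a single sample:
// y_i = m_i * dot(input, dNormRow)  =>  dL/dm_i = dL/dy_i * dot(input, dNormRow).
static double MagnitudeGradient(double outputGrad, double[] input, double[] dNormRow)
{
    double dot = 0.0;
    for (int k = 0; k < input.Length; k++)
        dot += input[k] * dNormRow[k];
    return outputGrad * dot; // accumulate over the batch in practice
}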

Forward(Tensor<T>)

Performs the forward pass through DoRA adapter.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Output combining base layer with DoRA-adapted weights.

Remarks

The DoRA forward pass:

  1. Gets the base layer weights W
  2. Computes the direction: d = W / ||W||
  3. Applies the LoRA weight delta to the direction: d' = d + LoRA_delta
  4. Normalizes the adapted direction: d_norm = d' / ||d'||
  5. Recomposes the weights: W' = m * d_norm
  6. Computes the output: y = input @ W'^T

For Beginners: This is where DoRA's magic happens during prediction.

Step by step:

  1. Get the original weights from the base layer
  2. Split into magnitude (stored) and direction (computed)
  3. Apply LoRA's correction to the direction (not the magnitude!)
  4. Normalize the new direction to keep it as a unit vector
  5. Multiply magnitude back in to get final weights
  6. Use these adjusted weights to compute the output

The key difference from standard LoRA:

  • Standard LoRA: output = base_output + lora_output
  • DoRA: output = input @ (m * normalize(d + LoRA_delta))^T

DoRA's approach gives more stable training because we control magnitude separately.
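
The recomposition in steps 1-5 can be sketched directly; this is a plain-array illustration of the formula, not the library's implementation (which operates on Tensor<T> and caches intermediates for the backward pass):

using System;

// W is [outputs, inputs]; m has one magnitude per output row;
// loraDelta is the weight-space LoRA contribution (alpha/rank) * B * A.
static double[,] RecomposeWeights(double[,] W, double[] m, double[,] loraDelta)
{
    int rows = W.GetLength(0), cols = W.GetLength(1);
    var adapted = new double[rows, cols];
    for (int i = 0; i < rows; i++)
    {
        // Step 2: direction of the original row, d = W_i / ||W_i||.
        double baseNorm = 0.0;
        for (int j = 0; j < cols; j++) baseNorm += W[i, j] * W[i, j];
        baseNorm = Math.Sqrt(baseNorm);

        // Step 3: d' = d + LoRA_delta_i.
        var dPrime = new double[cols];
        double primeNorm = 0.0;
        for (int j = 0; j < cols; j++)
        {
            dPrime[j] = W[i, j] / baseNorm + loraDelta[i, j];
            primeNorm += dPrime[j] * dPrime[j];
        }
        primeNorm = Math.Sqrt(primeNorm);

        // Steps 4-5: W'_i = m_i * d' / ||d'||.
        for (int j = 0; j < cols; j++)
            adapted[i, j] = m[i] * dPrime[j] / primeNorm;
    }
    return adapted;
}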

GetParameters()

Gets the current parameters as a vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

Vector containing all parameters (base if not frozen, LoRA, magnitude).

MergeToOriginalLayer()

Merges the DoRA adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with DoRA weights merged into the base layer's weights.

Remarks

This method creates a final layer with the DoRA adaptations baked in. The merged weights are: W' = m * normalize(d + LoRA_delta) where m is magnitude, d is base direction, and LoRA_delta is the LoRA contribution.

For Beginners: This "bakes in" your DoRA adaptation for deployment.

After training with DoRA, you probably want to deploy a simpler model without all the DoRA machinery. This method creates that simpler model by:

  1. Computing the final adapted direction (base + LoRA)
  2. Normalizing the direction
  3. Multiplying by magnitude to get final weights
  4. Creating a new layer with these merged weights

The result is a standard layer that behaves like your DoRA-adapted model but is faster to run because it doesn't need to do the decomposition at runtime.
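
A usage sketch, continuing the hypothetical adapter from the constructor example above:

// After training, bake the adaptation into an ordinary layer.
// The returned layer needs no DoRA machinery at inference time.
ILayer<float> deployable = adapter.MergeToOriginalLayer();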

Exceptions

InvalidOperationException

Thrown when the base layer type is not supported for merging.

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

SetParameters(Vector<T>)

Sets the layer parameters from a vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Vector containing all parameters.
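
A sketch of the round trip that GetParameters() and SetParameters(Vector<T>) enable, e.g. handing the trainable parameters to an external optimizer:

// Read out all trainable parameters as a flat vector, let an external
// optimizer mutate them, then write them back into the adapter.
Vector<float> flat = adapter.GetParameters();
// ... optimizer updates `flat` ...
adapter.SetParameters(flat);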

UpdateParameters(T)

Updates parameters using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.
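
A single hypothetical training step tying these methods together (inputBatch, targets, and ComputeLossGradient are placeholders, not AiDotNet APIs):

Tensor<float> output = adapter.Forward(inputBatch);
Tensor<float> gradOut = ComputeLossGradient(output, targets); // hypothetical
adapter.Backward(gradOut);        // accumulates LoRA + magnitude gradients
adapter.UpdateParameters(0.001f); // apply them with learning rate 0.001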