Class DoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

DoRA (Weight-Decomposed Low-Rank Adaptation) adapter for parameter-efficient fine-tuning with improved stability.

public class DoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → DoRAAdapter<T>
Implements
IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Remarks

DoRA (Weight-Decomposed LoRA) extends standard LoRA by decomposing pre-trained weights into magnitude and direction components, then applying LoRA only to the direction component. This decomposition leads to more stable training and better convergence compared to standard LoRA.

Mathematical Formulation: Given pre-trained weights W, DoRA decomposes them as:

  • W = m * d, where m is the magnitude (one scalar per output neuron) and d is the direction (a unit vector per neuron)
  • W' = m * normalize(d + LoRA_delta)
  • LoRA_delta = (alpha / rank) * B * A

This ensures that LoRA adaptations primarily affect the direction of weights, not their magnitude, which improves training stability and convergence.
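
As a concrete illustration of the decomposition, here is a minimal self-contained C# sketch for a single weight row, using plain arrays rather than AiDotNet's types (the delta values are made up for illustration):

using System;
using System.Linq;

// Decompose one weight row w into magnitude m and unit direction d.
double[] w = { 3.0, 4.0 };                       // original weight row
double m = Math.Sqrt(w.Sum(x => x * x));         // magnitude: ||w|| = 5
double[] d = w.Select(x => x / m).ToArray();     // direction: (0.6, 0.8)

// Apply an illustrative LoRA delta to the direction, renormalize,
// and scale by the magnitude: W' = m * normalize(d + LoRA_delta).
double[] delta = { 0.1, -0.2 };
double[] dPrime = d.Zip(delta, (a, b) => a + b).ToArray();
double norm = Math.Sqrt(dPrime.Sum(x => x * x));
double[] wAdapted = dPrime.Select(x => m * x / norm).ToArray();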

Research Context: DoRA was published in February 2024 and presented as an ICML 2024 Oral paper. In experiments on LLaMA-7B, DoRA achieved a +3.7% improvement over standard LoRA. The key insight is that separating magnitude from direction allows more stable gradient flow and better control over the adaptation process.

For Beginners: DoRA is an improved version of LoRA that works better in practice.

Think of neural network weights as arrows:

  • Each arrow has a length (magnitude) and a direction
  • Standard LoRA adjusts both length and direction at the same time
  • DoRA separates them: it learns the length and the direction independently, with LoRA adjusting only the direction
  • This makes training more stable and gives better results

Why this matters:

  • More stable training (fewer divergences and NaN errors)
  • Better final performance (+3.7% on LLaMA-7B)
  • Same parameter efficiency as standard LoRA
  • Slightly more computation (due to normalization), but worth it for the stability

When to use DoRA over standard LoRA:

  • When training stability is important (large models, complex tasks)
  • When you want the best possible fine-tuning results
  • When you have the computational budget for normalization overhead
  • When adapting very large pre-trained models (LLMs, large vision models)

Reference: "DoRA: Weight-Decomposed Low-Rank Adaptation", ICML 2024 (Oral), https://arxiv.org/abs/2402.09353

Constructors

DoRAAdapter(ILayer<T>, int, double, bool)

Initializes a new DoRA adapter wrapping an existing layer.

public DoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with DoRA.

rank int

The rank of the LoRA decomposition.

alpha double

The LoRA scaling factor (defaults to rank if negative).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training.

Remarks

The constructor initializes the DoRA adapter by:

  1. Setting up the standard LoRA components (via the base constructor)
  2. Decomposing the base layer's initial weights into magnitude and direction
  3. Initializing the magnitude gradients

For Beginners: This creates a DoRA adapter around your existing layer.

What happens during initialization:

  • The base class sets up standard LoRA (matrices A and B)
  • We then decompose the layer's weights into magnitude and direction
  • Each magnitude starts as the actual magnitude of the corresponding row of the original weights
  • During training, both the LoRA matrices and the magnitudes will be updated

Parameters:

  • baseLayer: The layer you want to fine-tune efficiently
  • rank: How much compression for LoRA (lower = fewer parameters)
  • alpha: Scaling factor for LoRA contribution
  • freezeBaseLayer: Usually true - we only train LoRA + magnitude, not base weights
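
A minimal usage sketch; DenseLayer<float> and its constructor arguments are hypothetical placeholders for whatever ILayer<T> implementation you are adapting:

using AiDotNet.LoRA.Adapters;

// Hypothetical dense layer with 512 inputs and 256 outputs.
ILayer<float> baseLayer = new DenseLayer<float>(inputSize: 512, outputSize: 256);

// Wrap it with DoRA: rank-8 decomposition, alpha defaults to the rank,
// and the base weights stay frozen (only LoRA matrices + magnitudes train).
var adapter = new DoRAAdapter<float>(baseLayer, rank: 8);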

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

Properties

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

Remarks

DoRA adds the magnitude parameters (one per output neuron) to the standard LoRA parameters. Total = (base layer parameters if not frozen) + LoRA parameters + magnitude parameters.
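
For example, assuming the usual LoRA factor shapes (A: rank × inputs, B: outputs × rank), a frozen 512-input, 256-output layer adapted at rank 8 would have:

  • LoRA A: 8 × 512 = 4,096 parameters
  • LoRA B: 256 × 8 = 2,048 parameters
  • Magnitude: 256 parameters (one per output neuron)
  • Total trainable: 6,400 parameters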

Methods

Backward(Tensor<T>)

Performs the backward pass through DoRA adapter.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass computes gradients for:

  1. The magnitude parameters (one per output neuron)
  2. The LoRA matrices A and B (via the LoRA layer's backward pass)
  3. The base layer weights (if not frozen)

The key challenge is computing how changes to magnitude and direction affect the loss, given that the direction is normalized during forward pass.

For Beginners: This is where DoRA learns during training.

Backward pass figures out how to improve three things:

  1. The magnitude of each output neuron's weights
  2. The LoRA matrices that adjust the direction
  3. The base layer weights (if we're training them too)

The math is subtle because of the normalization step: normalizing the direction couples all elements of a weight vector, so the gradients must account for that coupling.

For simplicity, this implementation computes approximate gradients that work well in practice. The exact gradients would require storing more intermediate values from the forward pass.
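
To make the magnitude part concrete: since each output is y_i = m_i * dot(input, d_norm_i), the magnitude gradient is the output gradient times that dot product. A minimal plain-C# sketch, mirroring the formulation above rather than the library's exact internals:

// Gradient of the loss w.r.t. one magnitude m_i for a single sample:
// y_i = m_i * dot(input, dNormRow)  =>  dL/dm_i = dL/dy_i * dot(input, dNormRow).
static double MagnitudeGradient(double outputGrad, double[] input, double[] dNormRow)
{
    double dot = 0.0;
    for (int k = 0; k < input.Length; k++)
        dot += input[k] * dNormRow[k];
    return outputGrad * dot; // accumulate over the batch in practice
}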

Forward(Tensor<T>)

Performs the forward pass through DoRA adapter.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Output combining base layer with DoRA-adapted weights.

Remarks

The DoRA forward pass:

  1. Gets the base layer weights W
  2. Computes the direction: d = W / ||W||
  3. Applies the LoRA weight delta to the direction: d' = d + LoRA_delta
  4. Normalizes the adapted direction: d_norm = d' / ||d'||
  5. Recomposes the weights: W' = m * d_norm
  6. Computes the output: y = input @ W'^T

For Beginners: This is where DoRA's magic happens during prediction.

Step by step:

  1. Get the original weights from the base layer
  2. Split into magnitude (stored) and direction (computed)
  3. Apply LoRA's correction to the direction (not the magnitude!)
  4. Normalize the new direction to keep it as a unit vector
  5. Multiply magnitude back in to get final weights
  6. Use these adjusted weights to compute the output

The key difference from standard LoRA:

  • Standard LoRA: output = base_output + lora_output
  • DoRA: output = input @ (m * normalize(d + LoRA_delta))^T

DoRA's approach gives more stable training because we control magnitude separately.
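
The recomposition in steps 1-5 can be sketched directly; this is a plain-array illustration of the formula, not the library's implementation (which operates on Tensor<T> and caches intermediates for the backward pass):

using System;

// W is [outputs, inputs]; m has one magnitude per output row;
// loraDelta is the weight-space LoRA contribution (alpha/rank) * B * A.
static double[,] RecomposeWeights(double[,] W, double[] m, double[,] loraDelta)
{
    int rows = W.GetLength(0), cols = W.GetLength(1);
    var adapted = new double[rows, cols];
    for (int i = 0; i < rows; i++)
    {
        // Step 2: direction of the original row, d = W_i / ||W_i||.
        double baseNorm = 0.0;
        for (int j = 0; j < cols; j++) baseNorm += W[i, j] * W[i, j];
        baseNorm = Math.Sqrt(baseNorm);

        // Step 3: d' = d + LoRA_delta_i.
        var dPrime = new double[cols];
        double primeNorm = 0.0;
        for (int j = 0; j < cols; j++)
        {
            dPrime[j] = W[i, j] / baseNorm + loraDelta[i, j];
            primeNorm += dPrime[j] * dPrime[j];
        }
        primeNorm = Math.Sqrt(primeNorm);

        // Steps 4-5: W'_i = m_i * d' / ||d'||.
        for (int j = 0; j < cols; j++)
            adapted[i, j] = m[i] * dPrime[j] / primeNorm;
    }
    return adapted;
}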

GetParameters()

Gets the current parameters as a vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

Vector containing all parameters (base if not frozen, LoRA, magnitude).

MergeToOriginalLayer()

Merges the DoRA adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with DoRA weights merged into the base layer's weights.

Remarks

This method creates a final layer with the DoRA adaptations baked in. The merged weights are: W' = m * normalize(d + LoRA_delta) where m is magnitude, d is base direction, and LoRA_delta is the LoRA contribution.

For Beginners: This "bakes in" your DoRA adaptation for deployment.

After training with DoRA, you probably want to deploy a simpler model without all the DoRA machinery. This method creates that simpler model by:

  1. Computing the final adapted direction (base + LoRA)
  2. Normalizing the direction
  3. Multiplying by magnitude to get final weights
  4. Creating a new layer with these merged weights

The result is a standard layer that behaves like your DoRA-adapted model but is faster to run because it doesn't need to do the decomposition at runtime.
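
A usage sketch, continuing the hypothetical adapter from the constructor example above:

// After training, bake the adaptation into an ordinary layer.
// The returned layer needs no DoRA machinery at inference time.
ILayer<float> deployable = adapter.MergeToOriginalLayer();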

Exceptions

InvalidOperationException

Thrown when the base layer type is not supported for merging.

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

SetParameters(Vector<T>)

Sets the layer parameters from a vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Vector containing all parameters.
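
A sketch of the round trip that GetParameters() and SetParameters(Vector<T>) enable, e.g. handing the trainable parameters to an external optimizer:

// Read out all trainable parameters as a flat vector, let an external
// optimizer mutate them, then write them back into the adapter.
Vector<float> flat = adapter.GetParameters();
// ... optimizer updates `flat` ...
adapter.SetParameters(flat);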

UpdateParameters(T)

Updates parameters using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.
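
A single hypothetical training step tying these methods together (inputBatch, targets, and ComputeLossGradient are placeholders, not AiDotNet APIs):

Tensor<float> output = adapter.Forward(inputBatch);
Tensor<float> gradOut = ComputeLossGradient(output, targets); // hypothetical
adapter.Backward(gradOut);        // accumulates LoRA + magnitude gradients
adapter.UpdateParameters(0.001f); // apply them with learning rate 0.001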