Class XLoRAAdapter<T>
X-LoRA (Mixture of LoRA Experts) adapter that uses multiple LoRA experts with learned routing.
public class XLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
Inheritance
- LayerBase<T> → LoRAAdapterBase<T> → XLoRAAdapter<T>
Implements
- ILoRAAdapter<T>
- ILayer<T>
Remarks
X-LoRA extends standard LoRA by using a mixture of experts approach:
- Multiple LoRA adapters ("experts") are applied to the same layer
- A gating network learns to weight each expert's contribution based on the input
- Different inputs may activate different experts, allowing for more flexible adaptation
- This provides greater capacity than a single LoRA adapter with the same total rank
The forward pass computes:
- base_output = base_layer(input)
- expert_output[i] = lora_expert[i](input) for each expert i
- gating_weights = softmax(gating_network(input))
- final_lora_output = sum(gating_weights[i] * expert_output[i])
- output = base_output + final_lora_output
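The same computation as a standalone numeric sketch, written with plain arrays rather than the library's Tensor<T> type; the delegate stand-ins for the base layer, experts, and gating network are illustrative only.

```csharp
using System;
using System.Linq;

static class XLoRASketch
{
    // Mixture-of-experts forward pass: base output plus the softmax-weighted
    // sum of every expert's output.
    public static double[] Forward(
        double[] input,
        Func<double[], double[]> baseLayer,
        Func<double[], double[]>[] experts,
        Func<double[], double[]> gatingNetwork)
    {
        double[] baseOutput = baseLayer(input);
        double[] gatingWeights = Softmax(gatingNetwork(input)); // one weight per expert, sums to 1

        double[] output = (double[])baseOutput.Clone();
        for (int i = 0; i < experts.Length; i++)
        {
            double[] expertOutput = experts[i](input);
            for (int j = 0; j < output.Length; j++)
                output[j] += gatingWeights[i] * expertOutput[j]; // weighted expert contribution
        }
        return output; // base_output + final_lora_output
    }

    private static double[] Softmax(double[] logits)
    {
        double max = logits.Max();
        double[] exp = logits.Select(v => Math.Exp(v - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(v => v / sum).ToArray();
    }
}
```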
For Beginners: X-LoRA is like having multiple specialists instead of one generalist.
Think of it like this:
- Standard LoRA: One adapter tries to handle all tasks
- X-LoRA: Multiple expert adapters, each specializing in different patterns
- A "gating network" decides which experts to use for each input
Real-world analogy: Instead of one doctor handling all patients, you have:
- Expert 1: Specializes in one type of pattern (e.g., cat images)
- Expert 2: Specializes in another pattern (e.g., dog images)
- Expert 3: Handles other cases
- Gating network: Looks at each input and decides which expert(s) to consult
Benefits:
- More capacity: Multiple experts can learn different aspects
- Better specialization: Each expert focuses on what it's good at
- Dynamic routing: Different inputs activate different experts
- Efficient: Only computes what's needed for each input
Example: For a 1000x1000 layer with 4 experts at rank=4 each (worked through in the sketch after this list):
- Total LoRA parameters: 4 * (4 * 1000 + 4 * 1000) = 32,000 parameters
- Gating network: ~1000 parameters
- Total: ~33,000 parameters (still 96.7% reduction from 1M!)
- But with more capacity than single rank=16 LoRA (32,000 params)
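A quick check of that arithmetic; the per-expert count is the standard LoRA formula rank * inputSize + rank * outputSize, and the gating network's exact parameter count depends on its implementation.

```csharp
// Parameter-count arithmetic for the example above.
int inputSize = 1000, outputSize = 1000;
int numberOfExperts = 4, expertRank = 4;

int perExpert = expertRank * inputSize + expertRank * outputSize; // 4*1000 + 4*1000 = 8,000
int allExperts = numberOfExperts * perExpert;                     // 4 * 8,000 = 32,000
int fullLayer = inputSize * outputSize;                           // 1,000,000

// Adding the gating network's ~1,000 parameters (per the remarks above) gives
// ~33,000 in total: roughly a 96.7% reduction versus the full layer.
Console.WriteLine($"Experts: {allExperts:N0}, full layer: {fullLayer:N0}");
```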
Trade-offs:
- More flexible: Experts specialize in different patterns
- Better performance: Often outperforms single LoRA at same parameter count
- Dynamic routing: Adapts to different inputs
- More complex: Requires training gating network
- Slightly slower: Must compute multiple experts and gating weights
Reference: "Mixture of LoRA Experts" (X-LoRA) https://arxiv.org/abs/2402.07148
Constructors
XLoRAAdapter(ILayer<T>, int, int, double, bool)
Initializes a new X-LoRA adapter with the specified parameters.
public XLoRAAdapter(ILayer<T> baseLayer, int numberOfExperts, int expertRank, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with X-LoRA.
numberOfExperts (int): The number of LoRA experts to create.
expertRank (int): The rank of each LoRA expert decomposition.
alpha (double): The LoRA scaling factor for experts (defaults to expertRank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates an X-LoRA adapter with multiple expert adapters.
Parameters:
- baseLayer: The layer you want to adapt (typically Dense or FullyConnected)
- numberOfExperts: How many specialist adapters to create (typically 2-8)
- expertRank: The rank for each expert (compression level)
- alpha: How strong each expert's adaptation is
- freezeBaseLayer: Whether to lock the original layer's weights (usually true)
The adapter will:
- Create multiple LoRA experts (all with the same rank)
- Create a gating network to route inputs to experts
- Learn to specialize each expert for different patterns
Common configurations:
- numberOfExperts=2, expertRank=8: Simple mixture for binary specialization
- numberOfExperts=4, expertRank=4: Balanced approach (4 specialists, 16 total rank)
- numberOfExperts=8, expertRank=2: Many specialists, each handling narrow patterns
Trade-off: More experts = more specialization but more parameters and computation.
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when numberOfExperts is invalid.
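A minimal construction sketch based on the signature above; obtaining the base layer to adapt (for example, a DenseLayer<float>) is assumed to happen elsewhere.

```csharp
// Builds the "balanced" configuration from the remarks above (4 experts, rank 4).
// The base layer to adapt is assumed to already exist.
static XLoRAAdapter<float> CreateAdapter(ILayer<float> baseLayer)
{
    var adapter = new XLoRAAdapter<float>(
        baseLayer,
        numberOfExperts: 4,
        expertRank: 4);   // alpha defaults to expertRank; base layer stays frozen by default

    Console.WriteLine(adapter.ParameterCount); // experts + gating network (base layer frozen)
    return adapter;
}
```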
Properties
Experts
Gets the array of LoRA expert layers.
public LoRALayer<T>[] Experts { get; }
Property Value
- LoRALayer<T>[]
Remarks
Returns a copy of the experts array to prevent external modification.
GatingNetwork
Gets the gating network used for routing.
public DenseLayer<T> GatingNetwork { get; }
Property Value
- DenseLayer<T>
NumberOfExperts
Gets the number of LoRA experts in this adapter.
public int NumberOfExperts { get; }
Property Value
- int
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
- int
Remarks
Includes parameters from:
- Base layer (if not frozen)
- All expert LoRA layers
- Gating network
Methods
Backward(Tensor<T>)
Performs the backward pass through the mixture of experts.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass propagates gradients through:
1. All expert LoRA layers (weighted by their gating weights)
2. The gating network (to learn better routing)
3. The base layer (if not frozen)
For Beginners: This is where all components learn to improve!
During backpropagation:
- Each expert receives gradients weighted by how much it was used
- Expert with weight 0.6 gets 60% of the gradient
- Expert with weight 0.1 gets 10% of the gradient
- The gating network learns to route inputs better
- If an expert's output helped, increase its weight next time
- If an expert's output hurt, decrease its weight
- The base layer updates if not frozen
This creates a feedback loop where:
- Experts specialize in patterns they're good at
- Gating network learns which expert to use for which input
- Together, they improve performance beyond single LoRA
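A plain-array sketch of the gradient weighting described above; the real implementation operates on Tensor<T> and also backpropagates through the gating network and, if unfrozen, the base layer.

```csharp
// Splits the incoming output gradient across experts in proportion to the
// gating weights recorded during the forward pass.
static double[][] SplitGradientAcrossExperts(double[] outputGradient, double[] gatingWeights)
{
    var expertGradients = new double[gatingWeights.Length][];
    for (int i = 0; i < gatingWeights.Length; i++)
    {
        expertGradients[i] = new double[outputGradient.Length];
        for (int j = 0; j < outputGradient.Length; j++)
            expertGradients[i][j] = gatingWeights[i] * outputGradient[j]; // weight 0.6 -> 60% of the gradient
    }
    return expertGradients;
}
```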
Forward(Tensor<T>)
Performs the forward pass using mixture of LoRA experts.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): Input tensor.
Returns
- Tensor<T>
Output combining base layer and weighted expert outputs.
Remarks
The forward pass:
1. Computes the base layer output
2. Computes gating weights from the gating network (these determine each expert's contribution)
3. Computes the output from each expert
4. Combines the expert outputs using the gating weights (weighted sum)
5. Returns base_output + weighted_expert_output
For Beginners: This is where the magic happens!
Process:
- Run input through base layer (original behavior)
- Run input through gating network to get expert weights
- Example: [0.6, 0.3, 0.1, 0.0] means mostly use expert 1, some expert 2
- Run input through all experts to get their opinions
- Combine expert outputs using weights (weighted average)
- Add combined expert output to base output
The gating weights ensure that:
- Relevant experts contribute more (high weights)
- Irrelevant experts contribute less (low weights)
- All weights sum to 1.0 (thanks to softmax in gating network)
GetLastGatingWeights()
Gets the gating weights from the last forward pass.
public Tensor<T>? GetLastGatingWeights()
Returns
- Tensor<T>?
Tensor containing gating weights for each sample and expert, or null if no forward pass has been performed yet.
Remarks
This is useful for analyzing which experts are being used for different inputs. The weights are per-sample probabilities summing to 1.0 across experts.
For Beginners: This shows you which experts the gating network chose for the last batch of inputs. High values mean that expert was important, low values mean it wasn't used much.
Example interpretation:
- Sample 1: [0.7, 0.2, 0.1, 0.0] -> Mostly expert 1, some expert 2
- Sample 2: [0.0, 0.1, 0.8, 0.1] -> Mostly expert 3
This helps you understand:
- Which experts specialize in which patterns
- Whether routing is working correctly
- If some experts are underutilized (might reduce number of experts)
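A sketch of that kind of analysis. It assumes the returned Tensor<float> exposes a Shape property (samples by experts) and a two-index indexer; adjust to the actual tensor API if it differs.

```csharp
// Averages each expert's gating weight over the last batch to spot
// underutilized experts. The Shape/indexer access is an assumption here.
static void ReportExpertUsage(XLoRAAdapter<float> adapter)
{
    Tensor<float>? weights = adapter.GetLastGatingWeights();
    if (weights is null) return; // no forward pass has been run yet

    int samples = weights.Shape[0];
    int experts = weights.Shape[1];
    for (int e = 0; e < experts; e++)
    {
        float total = 0f;
        for (int s = 0; s < samples; s++)
            total += weights[s, e];
        Console.WriteLine($"Expert {e}: average gating weight {total / samples:F3}");
    }
}
```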
GetParameters()
Gets the current parameters as a vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing parameters from all experts, gating network, and optionally base layer.
MergeToOriginalLayer()
Merges all LoRA expert adaptations into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with all expert adaptations merged into the base layer's weights.
Remarks
Since X-LoRA uses input-dependent gating, the merge averages all expert contributions. This provides a reasonable approximation but loses the dynamic routing capability. For deployment, consider keeping the full X-LoRA structure if dynamic routing is important.
For Beginners: This "bakes in" all expert adaptations to create a regular layer.
Important caveat: X-LoRA's strength is dynamic routing (different experts for different inputs). When we merge:
- We average all expert contributions (equal weighting)
- We lose the dynamic routing capability
- The result is a static layer that works okay but not as well as the full X-LoRA
Use this for:
- Simpler deployment when dynamic routing isn't critical
- Compatibility with systems that don't support X-LoRA
- Reducing inference complexity
DON'T use this if:
- Dynamic routing is important for your task
- Different inputs need very different adaptations
- You want maximum performance
Better approach for deployment: Keep the full X-LoRA structure and implement efficient inference.
Exceptions
- InvalidOperationException
Thrown when the base layer type is not DenseLayer or FullyConnectedLayer.
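A small deployment sketch; per the caveat above, the merged layer gives up dynamic routing.

```csharp
// Bakes the (averaged) expert adaptations into a plain layer for deployment.
// Throws InvalidOperationException unless the base layer is a DenseLayer or
// FullyConnectedLayer.
static ILayer<float> MergeForDeployment(XLoRAAdapter<float> adapter)
{
    ILayer<float> merged = adapter.MergeToOriginalLayer();
    return merged; // a static layer: no gating network at inference time
}
```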
ResetState()
Resets the internal state of the base layer, all experts, and the gating network.
public override void ResetState()
Remarks
For Beginners: This clears the memory of all components (base layer, all experts, and gating network). It's useful when starting to process a completely new, unrelated batch of data.
SetParameters(Vector<T>)
Sets the layer parameters from a vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): Vector containing parameters for all components.
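A small checkpoint sketch pairing SetParameters(Vector<T>) with GetParameters(); the saved vector only makes sense for an adapter with the same configuration (same base layer shape, number of experts, and expert rank).

```csharp
// Save and restore the adapter's trainable state as a flat parameter vector.
static Vector<float> SaveParameters(XLoRAAdapter<float> adapter)
    => adapter.GetParameters();

static void RestoreParameters(XLoRAAdapter<float> adapter, Vector<float> saved)
    => adapter.SetParameters(saved);
```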
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
Updates all experts, the gating network, and optionally the base layer.
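A sketch of one training step using the documented Forward, Backward, and UpdateParameters calls; how the loss gradient is computed from predictions and targets is left to caller-supplied code.

```csharp
// One training step: forward pass, loss gradient (supplied by the caller),
// backward pass, then parameter update for the experts, the gating network,
// and (if unfrozen) the base layer.
static void TrainingStep(
    XLoRAAdapter<float> adapter,
    Tensor<float> input,
    Func<Tensor<float>, Tensor<float>> lossGradient,
    float learningRate)
{
    Tensor<float> prediction = adapter.Forward(input);       // base + gated expert outputs
    Tensor<float> outputGradient = lossGradient(prediction); // dLoss/dOutput
    adapter.Backward(outputGradient);                        // routes gradients to all components
    adapter.UpdateParameters(learningRate);                  // applies the accumulated gradients
}
```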
UpdateParametersFromLayers()
Updates the parameter vector from the current layer states.
protected override void UpdateParametersFromLayers()