
Class XLoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

X-LoRA (Mixture of LoRA Experts) adapter that uses multiple LoRA experts with learned routing.

public class XLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → XLoRAAdapter<T>
Implements
ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable

Remarks

X-LoRA extends standard LoRA by using a mixture of experts approach:

  • Multiple LoRA adapters ("experts") are applied to the same layer
  • A gating network learns to weight each expert's contribution based on the input
  • Different inputs may activate different experts, allowing for more flexible adaptation
  • This provides greater capacity than a single LoRA adapter with the same total rank

The forward pass computes:

  1. base_output = base_layer(input)
  2. expert_output[i] = lora_expert[i](input) for each expert i
  3. gating_weights = softmax(gating_network(input))
  4. final_lora_output = sum(gating_weights[i] * expert_output[i])
  5. output = base_output + final_lora_output

For Beginners: X-LoRA is like having multiple specialists instead of one generalist.

Think of it like this:

  • Standard LoRA: One adapter tries to handle all tasks
  • X-LoRA: Multiple expert adapters, each specializing in different patterns
  • A "gating network" decides which experts to use for each input

Real-world analogy: Instead of one doctor handling all patients, you have:

  • Expert 1: Specializes in one type of pattern (e.g., cat images)
  • Expert 2: Specializes in another pattern (e.g., dog images)
  • Expert 3: Handles other cases
  • Gating network: Looks at each input and decides which expert(s) to consult

Benefits:

  • More capacity: Multiple experts can learn different aspects
  • Better specialization: Each expert focuses on what it's good at
  • Dynamic routing: Different inputs activate different experts
  • Efficient: Only computes what's needed for each input

Example: For a 1000x1000 layer with 4 experts at rank=4 each:

  • Total LoRA parameters: 4 * (4 * 1000 + 4 * 1000) = 32,000 parameters
  • Gating network: ~1000 parameters
  • Total: ~33,000 parameters (still 96.7% reduction from 1M!)
  • Same parameter budget as a single rank=16 LoRA (32,000 params), but with expert specialization and dynamic routing

Trade-offs:

  • More flexible: Experts specialize in different patterns
  • Better performance: Often outperforms single LoRA at same parameter count
  • Dynamic routing: Adapts to different inputs
  • More complex: Requires training gating network
  • Slightly slower: Must compute multiple experts and gating weights

Reference: "X-LoRA: Mixture of Low-Rank Adapter Experts" https://arxiv.org/abs/2402.07148
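A minimal end-to-end sketch is shown below. Only the XLoRAAdapter<T> members documented on this page are taken as given; the DenseLayer<double> constructor arguments and the input tensor are illustrative assumptions.

// Illustrative sketch: the DenseLayer<double> constructor shown here is an
// assumption; use whatever layer construction your codebase provides.
var baseLayer = new DenseLayer<double>(1000, 1000);

// 4 experts of rank 4; alpha defaults to expertRank; base layer stays frozen.
var adapter = new XLoRAAdapter<double>(baseLayer, numberOfExperts: 4, expertRank: 4);

// 'input' is an existing Tensor<double> batch (construction omitted).
Tensor<double> output = adapter.Forward(input);

// Per-sample routing weights from the last forward pass.
Tensor<double>? gating = adapter.GetLastGatingWeights();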

Constructors

XLoRAAdapter(ILayer<T>, int, int, double, bool)

Initializes a new X-LoRA adapter with the specified parameters.

public XLoRAAdapter(ILayer<T> baseLayer, int numberOfExperts, int expertRank, double alpha = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with X-LoRA.

numberOfExperts int

The number of LoRA experts to create.

expertRank int

The rank of each LoRA expert decomposition.

alpha double

The LoRA scaling factor for experts (defaults to expertRank if negative).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training.

Remarks

For Beginners: This creates an X-LoRA adapter with multiple expert adapters.

Parameters:

  • baseLayer: The layer you want to adapt (typically Dense or FullyConnected)
  • numberOfExperts: How many specialist adapters to create (typically 2-8)
  • expertRank: The rank for each expert (compression level)
  • alpha: How strong each expert's adaptation is
  • freezeBaseLayer: Whether to lock the original layer's weights (usually true)

The adapter will:

  1. Create multiple LoRA experts (all with the same rank)
  2. Create a gating network to route inputs to experts
  3. Learn to specialize each expert for different patterns

Common configurations:

  • numberOfExperts=2, expertRank=8: Simple mixture for binary specialization
  • numberOfExperts=4, expertRank=4: Balanced approach (4 specialists, 16 total rank)
  • numberOfExperts=8, expertRank=2: Many specialists, each handling narrow patterns

Trade-off: More experts = more specialization but more parameters and computation.
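The common configurations above map directly onto constructor calls. A sketch, assuming baseLayer is an already-constructed ILayer<double>:

// Binary specialization: 2 experts, rank 8 each.
var binary = new XLoRAAdapter<double>(baseLayer, numberOfExperts: 2, expertRank: 8);

// Balanced: 4 experts, rank 4 each (16 total rank).
var balanced = new XLoRAAdapter<double>(baseLayer, numberOfExperts: 4, expertRank: 4);

// Many narrow specialists: 8 experts, rank 2 each.
var narrow = new XLoRAAdapter<double>(baseLayer, numberOfExperts: 8, expertRank: 2);

// Optional: override the scaling factor and keep the base layer trainable.
var custom = new XLoRAAdapter<double>(baseLayer, 4, 4, alpha: 8.0, freezeBaseLayer: false);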

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

ArgumentException

Thrown when numberOfExperts is invalid.

Properties

Experts

Gets the array of LoRA expert layers.

public LoRALayer<T>[] Experts { get; }

Property Value

LoRALayer<T>[]

Remarks

Returns a copy of the experts array to prevent external modification.

GatingNetwork

Gets the gating network used for routing.

public DenseLayer<T> GatingNetwork { get; }

Property Value

DenseLayer<T>

NumberOfExperts

Gets the number of LoRA experts in this adapter.

public int NumberOfExperts { get; }

Property Value

int

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

Remarks

Includes parameters from:

  • Base layer (if not frozen)
  • All expert LoRA layers
  • Gating network
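A quick way to check the resulting budget against the class-level example, assuming adapter is an already-constructed XLoRAAdapter<double>:

// For 4 experts of rank 4 on a 1000x1000 frozen base layer, expect roughly
// 32,000 expert parameters plus the gating network's parameters.
Console.WriteLine($"Trainable parameters: {adapter.ParameterCount}");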

Methods

Backward(Tensor<T>)

Performs the backward pass through the mixture of experts.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass propagates gradients through:

  1. All expert LoRA layers (weighted by their gating weights)
  2. The gating network (to learn better routing)
  3. The base layer (if not frozen)

For Beginners: This is where all components learn to improve!

During backpropagation:

  1. Each expert receives gradients weighted by how much it was used
    • Expert with weight 0.6 gets 60% of the gradient
    • Expert with weight 0.1 gets 10% of the gradient
  2. The gating network learns to route inputs better
    • If an expert's output helped, increase its weight next time
    • If an expert's output hurt, decrease its weight
  3. The base layer updates if not frozen

This creates a feedback loop where:

  • Experts specialize in patterns they're good at
  • Gating network learns which expert to use for which input
  • Together, they improve performance beyond single LoRA
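A sketch of a single training step using only the members documented here; how the loss gradient is computed is outside this class and left as an assumption:

// 'adapter' is an XLoRAAdapter<double>, 'input' a Tensor<double> batch.
Tensor<double> output = adapter.Forward(input);

// 'outputGradient' (dLoss/dOutput) would come from your loss function;
// its computation is omitted here.
Tensor<double> inputGradient = adapter.Backward(outputGradient);

// Apply the accumulated updates to the experts, the gating network,
// and (if not frozen) the base layer.
adapter.UpdateParameters(0.001);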

Forward(Tensor<T>)

Performs the forward pass using mixture of LoRA experts.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor.

Returns

Tensor<T>

Output combining base layer and weighted expert outputs.

Remarks

The forward pass:

  1. Computes the base layer output
  2. Computes gating weights from the gating network (determines expert contributions)
  3. Computes the output of each expert
  4. Combines the expert outputs using the gating weights (weighted sum)
  5. Returns base_output + weighted_expert_output

For Beginners: This is where the magic happens!

Process:

  1. Run input through base layer (original behavior)
  2. Run input through gating network to get expert weights
    • Example: [0.6, 0.3, 0.1, 0.0] means mostly use expert 1, some expert 2
  3. Run input through all experts to get their opinions
  4. Combine expert outputs using weights (weighted average)
  5. Add combined expert output to base output

The gating weights ensure that:

  • Relevant experts contribute more (high weights)
  • Irrelevant experts contribute less (low weights)
  • All weights sum to 1.0 (thanks to softmax in gating network)
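Steps 2-4 reduce to a per-sample weighted sum of expert outputs. A standalone numeric sketch using plain arrays (not the library's Tensor<T> type):

// Gating weights for one sample (sum to 1.0 thanks to softmax).
double[] gatingWeights = { 0.6, 0.3, 0.1, 0.0 };

// One output element produced by each of the 4 experts for that sample.
double[] expertOutputs = { 1.2, -0.4, 0.9, 2.0 };

// Weighted sum: 0.6*1.2 + 0.3*(-0.4) + 0.1*0.9 + 0.0*2.0 = 0.69
double mixedExpertOutput = 0.0;
for (int i = 0; i < gatingWeights.Length; i++)
    mixedExpertOutput += gatingWeights[i] * expertOutputs[i];

// Step 5: add the mixed expert contribution to the base layer's output.
double baseOutput = 0.5;
double finalOutput = baseOutput + mixedExpertOutput;   // 1.19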

GetLastGatingWeights()

Gets the gating weights from the last forward pass.

public Tensor<T>? GetLastGatingWeights()

Returns

Tensor<T>

Tensor containing gating weights for each sample and expert, or null if no forward pass has been performed yet.

Remarks

This is useful for analyzing which experts are being used for different inputs. The weights are per-sample probabilities summing to 1.0 across experts.

For Beginners: This shows you which experts the gating network chose for the last batch of inputs. High values mean that expert was important, low values mean it wasn't used much.

Example interpretation:

  • Sample 1: [0.7, 0.2, 0.1, 0.0] -> Mostly expert 1, some expert 2
  • Sample 2: [0.0, 0.1, 0.8, 0.1] -> Mostly expert 3

This helps you understand:

  • Which experts specialize in which patterns
  • Whether routing is working correctly
  • If some experts are underutilized (might reduce number of experts)
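A sketch of inspecting the routing after a forward pass. The [sample, expert] indexer on Tensor<double> is an assumption made for illustration:

Tensor<double>? weights = adapter.GetLastGatingWeights();
if (weights != null)
{
    // Print the routing for the first sample in the batch
    // (4 = numberOfExperts used at construction).
    for (int expert = 0; expert < 4; expert++)
    {
        // Assumed indexer: weights[sampleIndex, expertIndex].
        Console.WriteLine($"Expert {expert}: weight = {weights[0, expert]:F2}");
    }
}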

GetParameters()

Gets the current parameters as a vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

Vector containing parameters from all experts, gating network, and optionally base layer.
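GetParameters pairs naturally with SetParameters(Vector<T>) for checkpoint-style snapshots; a sketch:

// Snapshot all trainable parameters (experts, gating network,
// and the base layer if it is not frozen).
Vector<double> snapshot = adapter.GetParameters();

// ... further training or experimentation ...

// Roll back to the earlier state.
adapter.SetParameters(snapshot);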

MergeToOriginalLayer()

Merges all LoRA expert adaptations into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with all expert adaptations merged into the base layer's weights.

Remarks

Since X-LoRA uses input-dependent gating, the merge averages all expert contributions. This provides a reasonable approximation but loses the dynamic routing capability. For deployment, consider keeping the full X-LoRA structure if dynamic routing is important.

For Beginners: This "bakes in" all expert adaptations to create a regular layer.

Important caveat: X-LoRA's strength is dynamic routing (different experts for different inputs). When we merge:

  1. We average all expert contributions (equal weighting)
  2. We lose the dynamic routing capability
  3. The result is a static layer that works okay but not as well as the full X-LoRA

Use this for:

  • Simpler deployment when dynamic routing isn't critical
  • Compatibility with systems that don't support X-LoRA
  • Reducing inference complexity

DON'T use this if:

  • Dynamic routing is important for your task
  • Different inputs need very different adaptations
  • You want maximum performance

Better approach for deployment: Keep the full X-LoRA structure and implement efficient inference.
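A deployment sketch, keeping the caveat above in mind: the merged layer averages the experts and has no dynamic routing. Calling Forward on the returned ILayer<T> is assumed to be available via the layer interface:

// Bake the averaged expert adaptations into a plain layer. Throws
// InvalidOperationException if the base layer is not a DenseLayer
// or FullyConnectedLayer.
ILayer<double> merged = adapter.MergeToOriginalLayer();

// The merged layer is static: no per-input expert routing remains.
Tensor<double> deployedOutput = merged.Forward(input);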

Exceptions

InvalidOperationException

Thrown when the base layer type is not DenseLayer or FullyConnectedLayer.

ResetState()

Resets the internal state of the base layer, all experts, and the gating network.

public override void ResetState()

Remarks

For Beginners: This clears the memory of all components (base layer, all experts, and gating network). It's useful when starting to process a completely new, unrelated batch of data.

SetParameters(Vector<T>)

Sets the layer parameters from a vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Vector containing parameters for all components.

UpdateParameters(T)

Updates parameters using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.

Remarks

Updates all experts, the gating network, and optionally the base layer.

UpdateParametersFromLayers()

Updates the parameter vector from the current layer states.

protected override void UpdateParametersFromLayers()