Class XLoRAAdapter<T>
X-LoRA (Mixture of LoRA Experts) adapter that uses multiple LoRA experts with learned routing.
public class XLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
Inheritance
- LayerBase<T> → LoRAAdapterBase<T> → XLoRAAdapter<T>
Implements
- ILoRAAdapter<T>
- ILayer<T>
Remarks
X-LoRA extends standard LoRA by using a mixture of experts approach:
- Multiple LoRA adapters ("experts") are applied to the same layer
- A gating network learns to weight each expert's contribution based on the input
- Different inputs may activate different experts, allowing for more flexible adaptation
- This provides greater capacity than a single LoRA adapter with the same total rank
The forward pass computes:
- base_output = base_layer(input)
- expert_output[i] = lora_expert[i](input) for each expert i
- gating_weights = softmax(gating_network(input))
- final_lora_output = sum(gating_weights[i] * expert_output[i])
- output = base_output + final_lora_output
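The same computation as a standalone numeric sketch, written with plain arrays rather than the library's Tensor<T> type; the delegate stand-ins for the base layer, experts, and gating network are illustrative only.

```csharp
using System;
using System.Linq;

static class XLoRASketch
{
    // Mixture-of-experts forward pass: base output plus the softmax-weighted
    // sum of every expert's output.
    public static double[] Forward(
        double[] input,
        Func<double[], double[]> baseLayer,
        Func<double[], double[]>[] experts,
        Func<double[], double[]> gatingNetwork)
    {
        double[] baseOutput = baseLayer(input);
        double[] gatingWeights = Softmax(gatingNetwork(input)); // one weight per expert, sums to 1

        double[] output = (double[])baseOutput.Clone();
        for (int i = 0; i < experts.Length; i++)
        {
            double[] expertOutput = experts[i](input);
            for (int j = 0; j < output.Length; j++)
                output[j] += gatingWeights[i] * expertOutput[j]; // weighted expert contribution
        }
        return output; // base_output + final_lora_output
    }

    private static double[] Softmax(double[] logits)
    {
        double max = logits.Max();
        double[] exp = logits.Select(v => Math.Exp(v - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(v => v / sum).ToArray();
    }
}
```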
For Beginners: X-LoRA is like having multiple specialists instead of one generalist.
Think of it like this:
- Standard LoRA: One adapter tries to handle all tasks
- X-LoRA: Multiple expert adapters, each specializing in different patterns
- A "gating network" decides which experts to use for each input
Real-world analogy: Instead of one doctor handling all patients, you have:
- Expert 1: Specializes in one type of pattern (e.g., cat images)
- Expert 2: Specializes in another pattern (e.g., dog images)
- Expert 3: Handles other cases
- Gating network: Looks at each input and decides which expert(s) to consult
Benefits:
- More capacity: Multiple experts can learn different aspects
- Better specialization: Each expert focuses on what it's good at
- Dynamic routing: Different inputs activate different experts
- Efficient: Only computes what's needed for each input
Example: For a 1000x1000 layer with 4 experts at rank=4 each (worked through in the sketch after this list):
- Total LoRA parameters: 4 * (4 * 1000 + 4 * 1000) = 32,000 parameters
- Gating network: ~1000 parameters
- Total: ~33,000 parameters (still 96.7% reduction from 1M!)
- But with more capacity than single rank=16 LoRA (32,000 params)
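A quick check of that arithmetic; the per-expert count is the standard LoRA formula rank * inputSize + rank * outputSize, and the gating network's exact parameter count depends on its implementation.

```csharp
// Parameter-count arithmetic for the example above.
int inputSize = 1000, outputSize = 1000;
int numberOfExperts = 4, expertRank = 4;

int perExpert = expertRank * inputSize + expertRank * outputSize; // 4*1000 + 4*1000 = 8,000
int allExperts = numberOfExperts * perExpert;                     // 4 * 8,000 = 32,000
int fullLayer = inputSize * outputSize;                           // 1,000,000

// Adding the gating network's ~1,000 parameters (per the remarks above) gives
// ~33,000 in total: roughly a 96.7% reduction versus the full layer.
Console.WriteLine($"Experts: {allExperts:N0}, full layer: {fullLayer:N0}");
```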
Trade-offs:
- More flexible: Experts specialize in different patterns
- Better performance: Often outperforms single LoRA at same parameter count
- Dynamic routing: Adapts to different inputs
- More complex: Requires training gating network
- Slightly slower: Must compute multiple experts and gating weights
Reference: "Mixture of LoRA Experts" (X-LoRA) https://arxiv.org/abs/2402.07148
Constructors
XLoRAAdapter(ILayer<T>, int, int, double, bool)
Initializes a new X-LoRA adapter with the specified parameters.
public XLoRAAdapter(ILayer<T> baseLayer, int numberOfExperts, int expertRank, double alpha = -1, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The layer to adapt with X-LoRA.
numberOfExperts (int): The number of LoRA experts to create.
expertRank (int): The rank of each LoRA expert decomposition.
alpha (double): The LoRA scaling factor for experts (defaults to expertRank if negative).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates an X-LoRA adapter with multiple expert adapters.
Parameters:
- baseLayer: The layer you want to adapt (typically Dense or FullyConnected)
- numberOfExperts: How many specialist adapters to create (typically 2-8)
- expertRank: The rank for each expert (compression level)
- alpha: How strong each expert's adaptation is
- freezeBaseLayer: Whether to lock the original layer's weights (usually true)
The adapter will:
- Create multiple LoRA experts (all with the same rank)
- Create a gating network to route inputs to experts
- Learn to specialize each expert for different patterns
Common configurations:
- numberOfExperts=2, expertRank=8: Simple mixture for binary specialization
- numberOfExperts=4, expertRank=4: Balanced approach (4 specialists, 16 total rank)
- numberOfExperts=8, expertRank=2: Many specialists, each handling narrow patterns
Trade-off: More experts = more specialization but more parameters and computation.
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when numberOfExperts is invalid.
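A minimal construction sketch based on the signature above; obtaining the base layer to adapt (for example, a DenseLayer<float>) is assumed to happen elsewhere.

```csharp
// Builds the "balanced" configuration from the remarks above (4 experts, rank 4).
// The base layer to adapt is assumed to already exist.
static XLoRAAdapter<float> CreateAdapter(ILayer<float> baseLayer)
{
    var adapter = new XLoRAAdapter<float>(
        baseLayer,
        numberOfExperts: 4,
        expertRank: 4);   // alpha defaults to expertRank; base layer stays frozen by default

    Console.WriteLine(adapter.ParameterCount); // experts + gating network (base layer frozen)
    return adapter;
}
```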
Properties
Experts
Gets the array of LoRA expert layers.
public LoRALayer<T>[] Experts { get; }
Property Value
- LoRALayer<T>[]
Remarks
Returns a copy of the experts array to prevent external modification.
GatingNetwork
Gets the gating network used for routing.
public DenseLayer<T> GatingNetwork { get; }
Property Value
- DenseLayer<T>
NumberOfExperts
Gets the number of LoRA experts in this adapter.
public int NumberOfExperts { get; }
Property Value
- int
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
- int
Remarks
Includes parameters from:
- Base layer (if not frozen)
- All expert LoRA layers
- Gating network
Methods
Backward(Tensor<T>)
Performs the backward pass through the mixture of experts.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): Gradient flowing back from the next layer.
Returns
- Tensor<T>
Gradient to pass to the previous layer.
Remarks
The backward pass propagates gradients through:
1. All expert LoRA layers (weighted by their gating weights)
2. The gating network (to learn better routing)
3. The base layer (if not frozen)
For Beginners: This is where all components learn to improve!
During backpropagation:
- Each expert receives gradients weighted by how much it was used
- Expert with weight 0.6 gets 60% of the gradient
- Expert with weight 0.1 gets 10% of the gradient
- The gating network learns to route inputs better
- If an expert's output helped, increase its weight next time
- If an expert's output hurt, decrease its weight
- The base layer updates if not frozen
This creates a feedback loop where:
- Experts specialize in patterns they're good at
- Gating network learns which expert to use for which input
- Together, they improve performance beyond single LoRA
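A plain-array sketch of the gradient weighting described above; the real implementation operates on Tensor<T> and also backpropagates through the gating network and, if unfrozen, the base layer.

```csharp
// Splits the incoming output gradient across experts in proportion to the
// gating weights recorded during the forward pass.
static double[][] SplitGradientAcrossExperts(double[] outputGradient, double[] gatingWeights)
{
    var expertGradients = new double[gatingWeights.Length][];
    for (int i = 0; i < gatingWeights.Length; i++)
    {
        expertGradients[i] = new double[outputGradient.Length];
        for (int j = 0; j < outputGradient.Length; j++)
            expertGradients[i][j] = gatingWeights[i] * outputGradient[j]; // weight 0.6 -> 60% of the gradient
    }
    return expertGradients;
}
```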
Forward(Tensor<T>)
Performs the forward pass using mixture of LoRA experts.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): Input tensor.
Returns
- Tensor<T>
Output combining base layer and weighted expert outputs.
Remarks
The forward pass:
1. Computes the base layer output
2. Computes gating weights from the gating network (these determine each expert's contribution)
3. Computes the output from each expert
4. Combines the expert outputs using the gating weights (weighted sum)
5. Returns base_output + weighted_expert_output
For Beginners: This is where the magic happens!
Process:
- Run input through base layer (original behavior)
- Run input through gating network to get expert weights
- Example: [0.6, 0.3, 0.1, 0.0] means mostly use expert 1, some expert 2
- Run input through all experts to get their opinions
- Combine expert outputs using weights (weighted average)
- Add combined expert output to base output
The gating weights ensure that:
- Relevant experts contribute more (high weights)
- Irrelevant experts contribute less (low weights)
- All weights sum to 1.0 (thanks to softmax in gating network)
GetLastGatingWeights()
Gets the gating weights from the last forward pass.
public Tensor<T>? GetLastGatingWeights()
Returns
- Tensor<T>?
Tensor containing gating weights for each sample and expert, or null if no forward pass has been performed yet.
Remarks
This is useful for analyzing which experts are being used for different inputs. The weights are per-sample probabilities summing to 1.0 across experts.
For Beginners: This shows you which experts the gating network chose for the last batch of inputs. High values mean that expert was important, low values mean it wasn't used much.
Example interpretation:
- Sample 1: [0.7, 0.2, 0.1, 0.0] -> Mostly expert 1, some expert 2
- Sample 2: [0.0, 0.1, 0.8, 0.1] -> Mostly expert 3
This helps you understand:
- Which experts specialize in which patterns
- Whether routing is working correctly
- If some experts are underutilized (might reduce number of experts)
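A sketch of that kind of analysis. It assumes the returned Tensor<float> exposes a Shape property (samples by experts) and a two-index indexer; adjust to the actual tensor API if it differs.

```csharp
// Averages each expert's gating weight over the last batch to spot
// underutilized experts. The Shape/indexer access is an assumption here.
static void ReportExpertUsage(XLoRAAdapter<float> adapter)
{
    Tensor<float>? weights = adapter.GetLastGatingWeights();
    if (weights is null) return; // no forward pass has been run yet

    int samples = weights.Shape[0];
    int experts = weights.Shape[1];
    for (int e = 0; e < experts; e++)
    {
        float total = 0f;
        for (int s = 0; s < samples; s++)
            total += weights[s, e];
        Console.WriteLine($"Expert {e}: average gating weight {total / samples:F3}");
    }
}
```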
GetParameters()
Gets the current parameters as a vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
Vector containing parameters from all experts, gating network, and optionally base layer.
MergeToOriginalLayer()
Merges all LoRA expert adaptations into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with all expert adaptations merged into the base layer's weights.
Remarks
Since X-LoRA uses input-dependent gating, the merge averages all expert contributions. This provides a reasonable approximation but loses the dynamic routing capability. For deployment, consider keeping the full X-LoRA structure if dynamic routing is important.
For Beginners: This "bakes in" all expert adaptations to create a regular layer.
Important caveat: X-LoRA's strength is dynamic routing (different experts for different inputs). When we merge:
- We average all expert contributions (equal weighting)
- We lose the dynamic routing capability
- The result is a static layer that works okay but not as well as the full X-LoRA
Use this for:
- Simpler deployment when dynamic routing isn't critical
- Compatibility with systems that don't support X-LoRA
- Reducing inference complexity
DON'T use this if:
- Dynamic routing is important for your task
- Different inputs need very different adaptations
- You want maximum performance
Better approach for deployment: Keep the full X-LoRA structure and implement efficient inference.
Exceptions
- InvalidOperationException
Thrown when the base layer type is not DenseLayer or FullyConnectedLayer.
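A small deployment sketch; per the caveat above, the merged layer gives up dynamic routing.

```csharp
// Bakes the (averaged) expert adaptations into a plain layer for deployment.
// Throws InvalidOperationException unless the base layer is a DenseLayer or
// FullyConnectedLayer.
static ILayer<float> MergeForDeployment(XLoRAAdapter<float> adapter)
{
    ILayer<float> merged = adapter.MergeToOriginalLayer();
    return merged; // a static layer: no gating network at inference time
}
```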
ResetState()
Resets the internal state of the base layer, all experts, and the gating network.
public override void ResetState()
Remarks
For Beginners: This clears the memory of all components (base layer, all experts, and gating network). It's useful when starting to process a completely new, unrelated batch of data.
SetParameters(Vector<T>)
Sets the layer parameters from a vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): Vector containing parameters for all components.
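A small checkpoint sketch pairing SetParameters(Vector<T>) with GetParameters(); the saved vector only makes sense for an adapter with the same configuration (same base layer shape, number of experts, and expert rank).

```csharp
// Save and restore the adapter's trainable state as a flat parameter vector.
static Vector<float> SaveParameters(XLoRAAdapter<float> adapter)
    => adapter.GetParameters();

static void RestoreParameters(XLoRAAdapter<float> adapter, Vector<float> saved)
    => adapter.SetParameters(saved);
```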
UpdateParameters(T)
Updates parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
Updates all experts, the gating network, and optionally the base layer.
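A sketch of one training step using the documented Forward, Backward, and UpdateParameters calls; how the loss gradient is computed from predictions and targets is left to caller-supplied code.

```csharp
// One training step: forward pass, loss gradient (supplied by the caller),
// backward pass, then parameter update for the experts, the gating network,
// and (if unfrozen) the base layer.
static void TrainingStep(
    XLoRAAdapter<float> adapter,
    Tensor<float> input,
    Func<Tensor<float>, Tensor<float>> lossGradient,
    float learningRate)
{
    Tensor<float> prediction = adapter.Forward(input);       // base + gated expert outputs
    Tensor<float> outputGradient = lossGradient(prediction); // dLoss/dOutput
    adapter.Backward(outputGradient);                        // routes gradients to all components
    adapter.UpdateParameters(learningRate);                  // applies the accumulated gradients
}
```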
UpdateParametersFromLayers()
Updates the parameter vector from the current layer states.
protected override void UpdateParametersFromLayers()