Class MixtureOfExpertsLayer<T>
- Namespace
- AiDotNet.NeuralNetworks.Layers
- Assembly
- AiDotNet.dll
Implements a Mixture-of-Experts (MoE) layer that routes inputs through multiple expert networks.
public class MixtureOfExpertsLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → MixtureOfExpertsLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Remarks
A Mixture-of-Experts layer contains multiple expert networks and a gating/routing network. For each input, the router determines how much weight to give each expert's output, allowing the model to specialize different experts for different types of inputs. This architecture enables models with very high capacity while remaining computationally efficient by activating only a subset of parameters per input.
For Beginners: Think of a Mixture-of-Experts as a team of specialists working together.
How it works:
- You have multiple "experts" (specialized neural networks)
- A "router" (gating network) decides which experts should handle each input
- Each expert processes the input independently
- The final output is a weighted combination of the experts' outputs
Why use MoE:
- Scalability: Add more experts to increase model capacity without proportionally increasing computation
- Specialization: Different experts learn to handle different types of inputs
- Efficiency: Only activate the most relevant experts for each input (sparse MoE)
Real-world analogy: Imagine you're running a hospital with specialists:
- A cardiologist (expert 1) handles heart problems
- A neurologist (expert 2) handles brain issues
- A pediatrician (expert 3) handles children's health
- A triage nurse (router) directs patients to the right specialist(s)
The router learns to send cardiac patients to the cardiologist, neurological cases to the neurologist, etc. This is more efficient than having one doctor handle everything, and allows each specialist to become highly skilled in their area.
Key Features:
- Support for any number of experts
- Learned routing via a dense gating network
- Softmax routing: All experts contribute with learned weights
- Top-K routing: Only the top K experts are activated per input
- Load balancing: Optional auxiliary loss to encourage balanced expert usage
Constructors
MixtureOfExpertsLayer(List<ILayer<T>>, ILayer<T>, int[], int[], int, IActivationFunction<T>?, bool, T?)
Initializes a new instance of the MixtureOfExpertsLayer<T> class.
public MixtureOfExpertsLayer(List<ILayer<T>> experts, ILayer<T> router, int[] inputShape, int[] outputShape, int topK = 0, IActivationFunction<T>? activationFunction = null, bool useLoadBalancing = false, T? loadBalancingWeight = default)
Parameters
experts (List<ILayer<T>>): The list of expert networks.
router (ILayer<T>): The routing/gating network.
inputShape (int[]): The shape of input tensors.
outputShape (int[]): The shape of output tensors.
topK (int): Number of experts to activate per input (0 = use all experts). Default is 0.
activationFunction (IActivationFunction<T>?): Optional activation function to apply after combining expert outputs.
useLoadBalancing (bool): Whether to compute the auxiliary load balancing loss during training. Default is false.
loadBalancingWeight (T?): The weight applied to the load balancing loss when it is enabled; see AuxiliaryLossWeight for tuning guidance.
Remarks
Creates a Mixture-of-Experts layer with the specified experts and router. All experts should have compatible input/output shapes. The router should output a tensor with one value per expert.
For Beginners: This creates a new MoE layer with your chosen experts and router.
To create an MoE layer:
- Create your expert networks (can be any ILayer<T>, often ExpertLayer<T> or DenseLayer<T>)
- Create a router (typically a DenseLayer that outputs numExperts values)
- Specify input/output shapes
- Optionally set topK for sparse routing
Example - MoE with 4 experts and Top-2 routing:
// Create 4 expert networks
var experts = new List<ILayer<float>>();
for (int i = 0; i < 4; i++)
{
    var expertLayers = new List<ILayer<float>>
    {
        new DenseLayer<float>(128, 256, new ReLUActivation<float>()),
        new DenseLayer<float>(256, 128, new ReLUActivation<float>())
    };
    experts.Add(new ExpertLayer<float>(expertLayers, new[] { 128 }, new[] { 128 }));
}

// Create router that outputs 4 scores (one per expert)
var router = new DenseLayer<float>(128, 4);

// Create MoE layer with Top-2 routing
var moe = new MixtureOfExpertsLayer<float>(
    experts, router,
    new[] { 128 }, new[] { 128 },
    topK: 2
);
Exceptions
- ArgumentException
Thrown when the experts list is empty or when topK is invalid.
Properties
AuxiliaryLossWeight
Gets or sets the weight for the auxiliary load balancing loss.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
The coefficient that determines how much the load balancing loss influences training.
Remarks
This weight is multiplied by the load balancing loss before adding it to the primary loss. Typical values range from 0.01 to 0.1. Higher values enforce stronger load balancing.
For Beginners: Controls the importance of load balancing.
Recommended starting value: 0.01
Tuning guidelines:
- If experts are very imbalanced: increase (e.g., 0.05 or 0.1)
- If primary task accuracy suffers: decrease (e.g., 0.005)
- Monitor both primary loss and expert usage statistics to find the right balance
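A minimal sketch of applying the recommendation above; the variable name 'moe' is illustrative and refers to a MixtureOfExpertsLayer<float> built as in the constructor example:

// Assumes 'moe' is a MixtureOfExpertsLayer<float> created as in the constructor example above.
moe.AuxiliaryLossWeight = 0.01f;   // recommended starting point; raise toward 0.05-0.1 if experts stay imbalanced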
NumExperts
Gets the number of experts in this MoE layer.
public int NumExperts { get; }
Property Value
- int
The count of expert networks.
Remarks
This is the total number of expert networks, regardless of how many are activated per input.
For Beginners: How many specialist networks are available.
Common configurations:
- Small models: 4-8 experts
- Medium models: 8-16 experts
- Large models: 32-128+ experts
The number of experts affects:
- Model capacity (more experts = more capacity)
- Memory usage (more experts = more memory)
- Specialization potential (more experts = more specialized roles)
ParameterCount
Gets the total number of trainable parameters in the layer.
public override int ParameterCount { get; }
Property Value
- int
The sum of the router's parameters and all experts' parameters.
Remarks
This includes all parameters from the router and all experts combined. This gives you the total model capacity and memory requirement for this layer.
For Beginners: The total count of all adjustable numbers in this layer.
This includes:
- All weights and biases in the router
- All weights and biases in all experts
For example, with:
- Router: 1000 parameters
- 8 experts with 5000 parameters each: 40,000 parameters
- Total: 41,000 parameters
More parameters = more capacity to learn, but also more memory needed. MoE shines because you can have huge capacity (many experts) but still only activate a fraction of them per input with sparse routing.
SupportsJitCompilation
Gets a value indicating whether this layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
true if both the router and all experts support JIT compilation; otherwise, false.
Remarks
JIT compilation for MoE uses TopKSoftmax for differentiable expert selection. The routing is performed by the router network, and the selected experts' outputs are weighted by the softmax-normalized routing scores.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
true if the router or any expert supports training; otherwise, false.
Remarks
The MoE layer supports training if either its router or any of its experts have trainable parameters.
For Beginners: This tells you if the MoE layer can learn from data.
The layer can learn if:
- The router can learn better routing decisions
- Any expert can improve its predictions
In almost all practical cases, this will be true since both the router and experts typically have trainable parameters.
UseAuxiliaryLoss
Gets or sets a value indicating whether to use the auxiliary load balancing loss.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
true to compute and apply load balancing loss during training; otherwise, false.
Remarks
When enabled, the layer computes a load balancing loss that encourages balanced expert usage. This loss is added to the primary task loss during training to prevent expert imbalance.
For Beginners: Turn load balancing on or off.
Enable this during training to ensure all experts are used roughly equally. Disable during inference/testing since load balancing is only needed during training.
Benefits of load balancing:
- Prevents expert collapse (all inputs routed to the same expert)
- Encourages specialization across different experts
- Improves overall model quality and generalization
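A hedged sketch of the typical training-versus-inference configuration; 'moe' and 'input' are placeholder names for a MixtureOfExpertsLayer<float> and a Tensor<float> batch prepared elsewhere:

// Assumes 'moe' is a MixtureOfExpertsLayer<float> built as in the constructor example,
// and 'input' is a Tensor<float> batch from the surrounding training pipeline.
moe.UseAuxiliaryLoss = true;                      // during training: encourage balanced expert usage
var output = moe.Forward(input);
float balanceLoss = moe.ComputeAuxiliaryLoss();   // only valid after a forward pass with the loss enabled
Console.WriteLine($"load balance loss: {balanceLoss}");

moe.UseAuxiliaryLoss = false;                     // during inference: load balancing is not needed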
Methods
Backward(Tensor<T>)
Performs the backward pass through the MoE layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to this layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to this layer's input.
Remarks
The backward pass:
1. Applies the derivative of the activation function
2. Computes gradients for each expert's contribution
3. Backpropagates through active experts
4. Computes gradients for the router
5. Backpropagates through the router
6. Returns the combined input gradient
For Beginners: This is where the MoE layer learns from its mistakes.
The backward pass works in reverse:
Receive Error Signal: Get information about how wrong the output was
- This comes from layers after this one (or from the loss function)
Activation Gradient: Account for the activation function
- If we applied ReLU, apply its derivative
- This adjusts the error signal appropriately
Expert Gradients: Calculate how each expert should improve
- Weight the error by how much each expert contributed
- Expert with weight 0.7 gets more of the blame/credit than one with 0.1
- Send these weighted errors back through each expert
Router Gradients: Calculate how routing should improve
- If expert 1 was useful, increase its future routing weight for similar inputs
- If expert 3 was harmful, decrease its future routing weight
- This helps the router make better decisions next time
Combine Input Gradients: Sum up gradients from router and experts
- This tells earlier layers how they should adjust
After backward pass completes, all components know how to improve, but haven't changed yet. The actual changes happen in UpdateParameters().
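A minimal single-step training sketch, assuming 'input' (a Tensor<float> batch) and 'outputGradient' (the loss gradient with respect to this layer's output) are produced elsewhere; both names are placeholders:

// Assumes 'moe' is a MixtureOfExpertsLayer<float>, and 'input' and 'outputGradient'
// (both Tensor<float>) come from the surrounding training pipeline.
var output = moe.Forward(input);                    // caches routing decisions and expert activations
var inputGradient = moe.Backward(outputGradient);   // must follow Forward; propagates error to earlier layers
moe.UpdateParameters(0.001f);                       // apply the accumulated updates with a small learning rate
moe.ResetState();                                   // clear cached values before the next batch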
Exceptions
- InvalidOperationException
Thrown when backward is called before forward.
Clone()
Creates a deep copy of this MoE layer.
public override LayerBase<T> Clone()
Returns
- LayerBase<T>
A new MixtureOfExpertsLayer with the same configuration and parameters.
Remarks
Creates an independent copy of this layer, including the router and all experts. Changes to the clone won't affect the original.
For Beginners: Makes an identical copy of the entire MoE layer.
The clone includes:
- A copy of the router
- Copies of all experts
- Same configuration (TopK, shapes, etc.)
- Same learned parameters
Useful for:
- Creating an ensemble of similar models
- Experimenting with different training approaches
- Saving checkpoints during training
- Implementing certain meta-learning algorithms
The clone is completely independent - training one won't affect the other.
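For instance, a checkpoint copy might look like the following; the cast is needed because Clone() returns LayerBase<T>, and 'moe' again refers to the earlier example:

// Assumes 'moe' is the MixtureOfExpertsLayer<float> from the constructor example.
var checkpoint = (MixtureOfExpertsLayer<float>)moe.Clone();
// Continuing to train 'moe' leaves 'checkpoint' untouched, so it can serve as a saved state.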
ComputeAuxiliaryLoss()
Computes the load balancing auxiliary loss based on expert usage from the last forward pass.
public T ComputeAuxiliaryLoss()
Returns
- T
The load balancing loss value.
Remarks
The load balancing loss encourages balanced expert usage by penalizing imbalanced routing. It is computed as the dot product, over all experts, of two per-expert fractions:
- Token fraction: Proportion of tokens (inputs) routed to this expert
- Probability mass fraction: Average routing probability for this expert
Loss = NumExperts * sum(token_fraction_i * prob_mass_fraction_i) for all experts i
This loss is minimized when all experts receive equal numbers of tokens and equal total probability mass, encouraging balanced utilization.
For Beginners: Calculates a penalty for imbalanced expert usage.
How it works:
Count Token Assignments:
- For each expert, count how many inputs chose it (with Top-K) or had non-zero weight
- Example with 8 inputs and 4 experts: [3, 2, 2, 1] tokens per expert
Calculate Probability Mass:
- For each expert, sum up its routing weights across all inputs
- Example: [0.4, 0.3, 0.2, 0.1] total probability per expert
Compute Load Balancing Loss:
- Convert counts to fractions: [3/8, 2/8, 2/8, 1/8] = [0.375, 0.25, 0.25, 0.125]
- Convert probabilities to fractions: [0.4, 0.3, 0.2, 0.1]
- Dot product: 0.375*0.4 + 0.25*0.3 + 0.25*0.2 + 0.125*0.1 = 0.2875
- Multiply by numExperts (4): 4 * 0.2875 = 1.15 is the load balancing loss
Why this works:
- If all experts are used equally, both fractions are [0.25, 0.25, 0.25, 0.25]
- Dot product: 4 * (0.25 * 0.25) = 0.25 (the minimum possible), so the loss is 4 * 0.25 = 1.0
- If imbalanced like [0.5, 0.3, 0.15, 0.05] × [0.6, 0.25, 0.1, 0.05]
- Dot product: 0.5*0.6 + 0.3*0.25 + 0.15*0.1 + 0.05*0.05 = 0.3925, so the loss is 4 * 0.3925 = 1.57 (a penalty for imbalance)
The loss is minimized when usage is perfectly balanced!
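The arithmetic above can be reproduced with a small standalone sketch (plain arrays, independent of the library's tensor types), following the formula Loss = NumExperts * sum(token_fraction_i * prob_mass_fraction_i):

// Standalone illustration of the load balancing loss using the numbers from the example above.
int numExperts = 4;
double[] tokenFraction = { 3 / 8.0, 2 / 8.0, 2 / 8.0, 1 / 8.0 };   // share of inputs routed to each expert
double[] probMassFraction = { 0.4, 0.3, 0.2, 0.1 };                // share of total routing probability per expert

double dotProduct = 0.0;
for (int i = 0; i < numExperts; i++)
{
    dotProduct += tokenFraction[i] * probMassFraction[i];
}

double loadBalanceLoss = numExperts * dotProduct;   // 4 * 0.2875 = 1.15; a perfectly balanced layer gives 1.0
Console.WriteLine(loadBalanceLoss);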
Exceptions
- InvalidOperationException
Thrown when called before a forward pass or when auxiliary loss is disabled.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer's computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to populate with input computation nodes.
Returns
- ComputationNode<T>
The output computation node representing the layer's operation.
Remarks
This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.
For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.
To support JIT compilation, a layer must:
- Implement this method to export its computation graph
- Set SupportsJitCompilation to true
- Use ComputationNode and TensorOperations to build the graph
All layers are required to implement this method, even if they set SupportsJitCompilation = false.
Forward(Tensor<T>)
Performs the forward pass through the MoE layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor.
Returns
- Tensor<T>
The output tensor after routing through experts and combining their outputs.
Remarks
The forward pass:
1. Routes the input through the gating network to get expert scores
2. Applies softmax to convert scores to routing probabilities
3. Optionally selects only top-K experts (sparse routing)
4. Passes input through selected experts
5. Combines expert outputs using routing weights
6. Applies the layer's activation function
For Beginners: This is where the MoE layer processes input data.
Step-by-step process:
Routing: The router looks at the input and scores each expert
- Input: data to process
- Output: a score for each expert (raw numbers)
Normalization: Convert scores to probabilities using softmax
- Scores: might be [2.1, -0.5, 1.3, 0.8]
- Weights: become [0.55, 0.04, 0.26, 0.15] (sum = 1.0)
Selection (if using Top-K): Keep only the best K experts
- With Top-2, keep experts with weights 0.55 and 0.26
- Set others to 0 and renormalize: [0.68, 0, 0.32, 0]
Expert Processing: Run input through selected experts
- Expert 1 produces output A
- Expert 3 produces output B
- Others are skipped (if using Top-K)
Combination: Mix expert outputs using weights
- Output = 0.68 * A + 0.32 * B
- This is the weighted average of expert predictions
Activation: Apply final transformation
- Usually identity (no change) or ReLU
The result is a smart combination of expert predictions, where each expert contributes based on its relevance to the specific input.
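The routing arithmetic in the steps above can be illustrated with a standalone sketch using the same score values (plain arrays and LINQ, independent of the library's tensor types; the expert outputs are made-up single numbers purely for illustration):

// using System; using System.Linq;
// Standalone illustration of softmax routing, Top-2 selection, and weighted combination.
double[] scores = { 2.1, -0.5, 1.3, 0.8 };                      // raw router scores, one per expert

// Softmax: convert scores to routing weights that sum to 1.
double[] expScores = scores.Select(Math.Exp).ToArray();
double sumExp = expScores.Sum();
double[] weights = expScores.Select(e => e / sumExp).ToArray(); // close to the [0.55, 0.04, 0.26, 0.15] example above

// Top-2: keep the two largest weights, zero the rest, then renormalize.
int[] top2 = weights
    .Select((w, i) => (Weight: w, Index: i))
    .OrderByDescending(t => t.Weight)
    .Take(2)
    .Select(t => t.Index)
    .ToArray();
double keptSum = top2.Sum(i => weights[i]);
double[] sparseWeights = new double[weights.Length];
foreach (int i in top2)
{
    sparseWeights[i] = weights[i] / keptSum;                    // close to the [0.68, 0, 0.32, 0] example above
}

// Combination: the output is the weighted sum of the selected experts' outputs.
double[] expertOutputs = { 1.0, -2.0, 0.5, 3.0 };               // placeholder expert outputs
double combined = sparseWeights.Zip(expertOutputs, (w, o) => w * o).Sum();
Console.WriteLine(combined);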
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass on GPU tensors by routing through experts. All computations stay GPU-resident for maximum performance.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): GPU tensor inputs (uses the first input).
Returns
- IGpuTensor<T>
GPU tensor output after routing through experts and combining outputs.
Remarks
The GPU forward pass (all operations GPU-resident):
1. Routes input through the router network (GPU)
2. Applies softmax to get routing probabilities (GPU)
3. Optionally applies Top-K selection (GPU)
4. Passes input through each expert (GPU)
5. Combines expert outputs using routing weights (GPU)
6. Applies activation function (GPU)
Only downloads to CPU in training mode for gradient caching.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about expert usage and load balancing.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including per-expert usage statistics, load balancing metrics, and routing weight distributions.
Remarks
This method provides detailed statistics about expert usage that can be used for monitoring training progress, debugging routing issues, and tuning load balancing parameters.
For Beginners: Gets a detailed report about how experts are being used.
The returned dictionary includes:
- expert_i_tokens: How many inputs were routed to expert i
- expert_i_prob_mass: Total routing weight for expert i across all inputs
- expert_i_avg_weight: Average routing weight when expert i is selected
- load_balance_loss: Current load balancing loss value
- usage_variance: Variance in expert usage (lower is better balanced)
- max_min_ratio: Ratio of most-used to least-used expert (1.0 is perfect)
Use this information to:
- Monitor whether experts are being used in a balanced way or whether some are overused
- Decide if you need to adjust the load balancing weight
- Detect expert collapse (all inputs routed to one expert)
- Track training health over time
Example output:
{
  "expert_0_tokens": "245",
  "expert_1_tokens": "198",
  "expert_2_tokens": "223",
  "expert_3_tokens": "234",
  "expert_0_prob_mass": "0.28",
  "expert_1_prob_mass": "0.22",
  ...
  "load_balance_loss": "0.253",
  "usage_variance": "0.0012",
  "max_min_ratio": "1.24"
}
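A short monitoring sketch, assuming 'moe' is a MixtureOfExpertsLayer<float> that has already run at least one forward pass with the auxiliary loss enabled:

// Assumes 'moe' has run a forward pass with the auxiliary loss enabled.
foreach (var entry in moe.GetAuxiliaryLossDiagnostics())
{
    Console.WriteLine($"{entry.Key} = {entry.Value}");
}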
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Gets all trainable parameters as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters from the router and all experts.
Remarks
Parameters are ordered as: [router parameters] [expert1 parameters] [expert2 parameters] ...
For Beginners: Collects all learned values into one list.
The returned vector contains:
- First, all parameters from the router
- Then, all parameters from expert 1
- Then, all parameters from expert 2
- And so on
This is useful for:
- Saving the entire MoE model to disk
- Implementing advanced optimization algorithms
- Analyzing the model's learned parameters
- Transferring knowledge to another model
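A sketch of an in-memory checkpoint using GetParameters together with SetParameters (documented further below); 'moe' is again assumed from the constructor example:

// Assumes 'moe' is a MixtureOfExpertsLayer<float> built as in the constructor example.
Vector<float> snapshot = moe.GetParameters();   // router parameters first, then each expert in order

// ... train further, experiment, etc. ...

moe.SetParameters(snapshot);                    // restore the earlier state; the vector length must match exactly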
ResetState()
Resets the internal state of the layer, clearing all cached values.
public override void ResetState()
Remarks
This clears cached values from forward/backward passes and resets the state of the router and all experts. Call this between training batches or when switching between training and inference.
For Beginners: Clears the layer's "short-term memory".
This resets:
- Cached inputs and outputs
- Routing weights and decisions
- Expert activations
- All temporary values used for learning
When to call this:
- Between different batches of training data
- When switching from training to testing mode
- Before processing a new, unrelated input
This ensures that information from one batch doesn't leak into the next batch, which could cause incorrect gradient calculations or predictions.
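For example, in a hypothetical batch loop where 'batches' is a placeholder collection of Tensor<float> inputs:

// Assumes 'moe' is a MixtureOfExpertsLayer<float> and 'batches' is an IEnumerable<Tensor<float>>.
foreach (var batch in batches)
{
    var output = moe.Forward(batch);
    // ... backward pass and parameter update for this batch ...
    moe.ResetState();   // clear cached routing decisions so nothing leaks into the next batch
}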
SetParameters(Vector<T>)
Sets all trainable parameters from a single vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing parameters for the router and all experts.
Remarks
Parameters should be in the same order as returned by GetParameters(): [router parameters] [expert1 parameters] [expert2 parameters] ...
For Beginners: Loads previously saved parameters back into the model.
This is the opposite of GetParameters():
- Takes a vector of all parameters
- Distributes them to the router and experts
- Must match the exact format returned by GetParameters()
Use this to:
- Load a saved model from disk
- Initialize with pre-trained parameters
- Implement custom optimization algorithms
If the parameter count doesn't match exactly, an error is thrown to prevent accidentally corrupting the model.
Exceptions
- ArgumentException
Thrown when the parameter count doesn't match.
UpdateParameters(T)
Updates all trainable parameters using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
This method updates parameters for both the router and all expert networks that support training.
For Beginners: This applies all the learned improvements to the router and experts.
After the backward pass calculated how everything should change:
- The router updates its weights to make better routing decisions
- Each expert updates its weights to make better predictions
- The learning rate controls how big these updates are
Learning rate guidelines:
- Too small: Learning is very slow but stable
- Too large: Learning is fast but might be unstable
- Just right: Balances speed and stability (often 0.001 to 0.01)
After calling this method, the MoE layer should perform slightly better than before.