Class MixtureOfExpertsLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Implements a Mixture-of-Experts (MoE) layer that routes inputs through multiple expert networks.

public class MixtureOfExpertsLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
object → LayerBase<T> → MixtureOfExpertsLayer<T>

Implements
ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Remarks

A Mixture-of-Experts layer contains multiple expert networks and a gating/routing network. For each input, the router determines how much weight to give each expert's output, allowing the model to specialize different experts for different types of inputs. This architecture enables models with very high capacity while remaining computationally efficient by activating only a subset of parameters per input.

For Beginners: Think of a Mixture-of-Experts as a team of specialists working together.

How it works:

  • You have multiple "experts" (specialized neural networks)
  • A "router" (gating network) decides which experts should handle each input
  • Each expert processes the input independently
  • The final output is a weighted combination of the experts' outputs

Why use MoE:

  • Scalability: Add more experts to increase model capacity without proportionally increasing computation
  • Specialization: Different experts learn to handle different types of inputs
  • Efficiency: Only activate the most relevant experts for each input (sparse MoE)

Real-world analogy: Imagine you're running a hospital with specialists:

  • A cardiologist (expert 1) handles heart problems
  • A neurologist (expert 2) handles brain issues
  • A pediatrician (expert 3) handles children's health
  • A triage nurse (router) directs patients to the right specialist(s)

The router learns to send cardiac patients to the cardiologist, neurological cases to the neurologist, etc. This is more efficient than having one doctor handle everything, and allows each specialist to become highly skilled in their area.

Key Features:

  • Support for any number of experts
  • Learned routing via a dense gating network
  • Softmax routing: All experts contribute with learned weights
  • Top-K routing: Only the top K experts are activated per input
  • Load balancing: Optional auxiliary loss to encourage balanced expert usage

Constructors

MixtureOfExpertsLayer(List<ILayer<T>>, ILayer<T>, int[], int[], int, IActivationFunction<T>?, bool, T?)

Initializes a new instance of the MixtureOfExpertsLayer<T> class.

public MixtureOfExpertsLayer(List<ILayer<T>> experts, ILayer<T> router, int[] inputShape, int[] outputShape, int topK = 0, IActivationFunction<T>? activationFunction = null, bool useLoadBalancing = false, T? loadBalancingWeight = default)

Parameters

experts List<ILayer<T>>

The list of expert networks.

router ILayer<T>

The routing/gating network.

inputShape int[]

The shape of input tensors.

outputShape int[]

The shape of output tensors.

topK int

Number of experts to activate per input (0 = use all experts). Default is 0.

activationFunction IActivationFunction<T>

Optional activation function to apply after combining expert outputs.

useLoadBalancing bool

Whether to compute the auxiliary load balancing loss that encourages balanced expert usage. Default is false.

loadBalancingWeight T

The weight applied to the load balancing loss when useLoadBalancing is enabled (see AuxiliaryLossWeight). Defaults to the numeric type's default value.

Remarks

Creates a Mixture-of-Experts layer with the specified experts and router. All experts should have compatible input/output shapes. The router should output a tensor with one value per expert.

For Beginners: This creates a new MoE layer with your chosen experts and router.

To create an MoE layer:

  1. Create your expert networks (can be any ILayer<T>, often ExpertLayer<T> or DenseLayer<T>)
  2. Create a router (typically a DenseLayer that outputs numExperts values)
  3. Specify input/output shapes
  4. Optionally set topK for sparse routing

Example - MoE with 4 experts and Top-2 routing:

// Create 4 expert networks
var experts = new List<ILayer<float>>();
for (int i = 0; i < 4; i++)
{
    var expertLayers = new List<ILayer<float>>
    {
        new DenseLayer<float>(128, 256, new ReLUActivation<float>()),
        new DenseLayer<float>(256, 128, new ReLUActivation<float>())
    };
    experts.Add(new ExpertLayer<float>(expertLayers, new[] { 128 }, new[] { 128 }));
}

// Create router that outputs 4 scores (one per expert)
var router = new DenseLayer<float>(128, 4);

// Create MoE layer with Top-2 routing
var moe = new MixtureOfExpertsLayer<float>(
    experts, router,
    new[] { 128 }, new[] { 128 },
    topK: 2
);
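
To enable load balancing at construction time, set the last two parameters as well. A sketch reusing the experts and router above (0.01f follows the recommended starting value from the AuxiliaryLossWeight remarks):

// Create MoE layer with Top-2 routing and load balancing enabled
var moeBalanced = new MixtureOfExpertsLayer<float>(
    experts, router,
    new[] { 128 }, new[] { 128 },
    topK: 2,
    useLoadBalancing: true,
    loadBalancingWeight: 0.01f
);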

Exceptions

ArgumentException

Thrown when the experts list is empty or when topK is invalid.

Properties

AuxiliaryLossWeight

Gets or sets the weight for the auxiliary load balancing loss.

public T AuxiliaryLossWeight { get; set; }

Property Value

T

The coefficient that determines how much the load balancing loss influences training.

Remarks

This weight is multiplied by the load balancing loss before adding it to the primary loss. Typical values range from 0.01 to 0.1. Higher values enforce stronger load balancing.

For Beginners: Controls the importance of load balancing.

Recommended starting value: 0.01

Tuning guidelines:

  • If experts are very imbalanced: increase (e.g., 0.05 or 0.1)
  • If primary task accuracy suffers: decrease (e.g., 0.005)
  • Monitor both primary loss and expert usage statistics to find the right balance
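
In code, that tuning might look like the following sketch (it assumes a MixtureOfExpertsLayer<float> named moe that has already run at least one forward pass, so usage statistics are available):

// Start at the recommended value
moe.AuxiliaryLossWeight = 0.01f;

// Later, inspect usage statistics (see GetAuxiliaryLossDiagnostics)
var stats = moe.GetAuxiliaryLossDiagnostics();
if (double.Parse(stats["max_min_ratio"]) > 2.0)
{
    // Experts are very imbalanced: enforce stronger balancing
    moe.AuxiliaryLossWeight = 0.05f;
}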

NumExperts

Gets the number of experts in this MoE layer.

public int NumExperts { get; }

Property Value

int

The count of expert networks.

Remarks

This is the total number of expert networks, regardless of how many are activated per input.

For Beginners: How many specialist networks are available.

Common configurations:

  • Small models: 4-8 experts
  • Medium models: 8-16 experts
  • Large models: 32-128+ experts

The number of experts affects:

  • Model capacity (more experts = more capacity)
  • Memory usage (more experts = more memory)
  • Specialization potential (more experts = more specialized roles)

ParameterCount

Gets the total number of trainable parameters in the layer.

public override int ParameterCount { get; }

Property Value

int

The sum of the router's parameters and all experts' parameters.

Remarks

This includes all parameters from the router and all experts combined. This gives you the total model capacity and memory requirement for this layer.

For Beginners: The total count of all adjustable numbers in this layer.

This includes:

  • All weights and biases in the router
  • All weights and biases in all experts

For example, with:

  • Router: 1000 parameters
  • 8 experts with 5000 parameters each: 40,000 parameters
  • Total: 41,000 parameters

More parameters = more capacity to learn, but also more memory needed. MoE shines because you can have huge capacity (many experts) but still only activate a fraction of them per input with sparse routing.
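
The worked example above as a quick check (standalone arithmetic, not a library call):

int routerParams = 1000;
int expertParams = 8 * 5000;               // 8 experts with 5000 parameters each
int total = routerParams + expertParams;   // 41,000 - what ParameterCount would report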

SupportsJitCompilation

Gets a value indicating whether this layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

true if both the router and all experts support JIT compilation; otherwise, false.

Remarks

JIT compilation for MoE uses TopKSoftmax for differentiable expert selection. The routing is performed by the router network, and the selected experts' outputs are weighted by the softmax-normalized routing scores.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

true if the router or any expert supports training; otherwise, false.

Remarks

The MoE layer supports training if either its router or any of its experts have trainable parameters.

For Beginners: This tells you if the MoE layer can learn from data.

The layer can learn if:

  • The router can learn better routing decisions
  • Any expert can improve its predictions

In almost all practical cases, this will be true since both the router and experts typically have trainable parameters.

UseAuxiliaryLoss

Gets or sets a value indicating whether to use the auxiliary load balancing loss.

public bool UseAuxiliaryLoss { get; set; }

Property Value

bool

true to compute and apply load balancing loss during training; otherwise, false.

Remarks

When enabled, the layer computes a load balancing loss that encourages balanced expert usage. This loss is added to the primary task loss during training to prevent expert imbalance.

For Beginners: Turn load balancing on or off.

Enable this during training to ensure all experts are used roughly equally. Disable during inference/testing since load balancing is only needed during training.

Benefits of load balancing:

  • Prevents expert collapse (all inputs routed to the same expert)
  • Encourages specialization across different experts
  • Improves overall model quality and generalization
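
A typical usage pattern, as a sketch (moe as in the constructor example):

moe.UseAuxiliaryLoss = true;   // training: encourage balanced routing
// ... run training epochs ...
moe.UseAuxiliaryLoss = false;  // inference: the balancing loss is no longer needed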

Methods

Backward(Tensor<T>)

Performs the backward pass through the MoE layer.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient of the loss with respect to this layer's output.

Returns

Tensor<T>

The gradient of the loss with respect to this layer's input.

Remarks

The backward pass:

  1. Applies the derivative of the activation function
  2. Computes gradients for each expert's contribution
  3. Backpropagates through active experts
  4. Computes gradients for the router
  5. Backpropagates through the router
  6. Returns the combined input gradient

For Beginners: This is where the MoE layer learns from its mistakes.

The backward pass works in reverse:

  1. Receive Error Signal: Get information about how wrong the output was

    • This comes from layers after this one (or from the loss function)
  2. Activation Gradient: Account for the activation function

    • If we applied ReLU, apply its derivative
    • This adjusts the error signal appropriately
  3. Expert Gradients: Calculate how each expert should improve

    • Weight the error by how much each expert contributed
    • Expert with weight 0.7 gets more of the blame/credit than one with 0.1
    • Send these weighted errors back through each expert
  4. Router Gradients: Calculate how routing should improve

    • If expert 1 was useful, increase its future routing weight for similar inputs
    • If expert 3 was harmful, decrease its future routing weight
    • This helps the router make better decisions next time
  5. Combine Input Gradients: Sum up gradients from router and experts

    • This tells earlier layers how they should adjust

After backward pass completes, all components know how to improve, but haven't changed yet. The actual changes happen in UpdateParameters().
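
Step 3 above can be illustrated with plain arrays. This sketch stands outside the library's Tensor<T> types and reuses the Top-2 routing weights from the Forward example:

// Each active expert receives the output error scaled by its routing weight
double[] outputGradient = { 0.2, -0.1, 0.4 };        // error signal from the next layer
double[] routingWeights = { 0.68, 0.0, 0.32, 0.0 };  // Top-2 weights (experts 1 and 3 active)

for (int e = 0; e < routingWeights.Length; e++)
{
    if (routingWeights[e] == 0.0) continue;          // skipped experts get no gradient

    double[] expertGradient = new double[outputGradient.Length];
    for (int i = 0; i < outputGradient.Length; i++)
    {
        expertGradient[i] = routingWeights[e] * outputGradient[i];
    }
    // expertGradient would then be backpropagated through expert e
}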

Exceptions

InvalidOperationException

Thrown when backward is called before forward.

Clone()

Creates a deep copy of this MoE layer.

public override LayerBase<T> Clone()

Returns

LayerBase<T>

A new MixtureOfExpertsLayer with the same configuration and parameters.

Remarks

Creates an independent copy of this layer, including the router and all experts. Changes to the clone won't affect the original.

For Beginners: Makes an identical copy of the entire MoE layer.

The clone includes:

  • A copy of the router
  • Copies of all experts
  • Same configuration (TopK, shapes, etc.)
  • Same learned parameters

Useful for:

  • Creating an ensemble of similar models
  • Experimenting with different training approaches
  • Saving checkpoints during training
  • Implementing certain meta-learning algorithms

The clone is completely independent - training one won't affect the other.
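
For example, a checkpoint-before-experiment pattern (a sketch; the cast is an assumption that holds because the clone is itself a MixtureOfExpertsLayer<T> returned as LayerBase<T>):

// Snapshot the whole layer before trying something risky
var checkpoint = (MixtureOfExpertsLayer<float>)moe.Clone();

// ... risky training experiment on moe ...

// If the experiment fails, continue from checkpoint instead of moe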

ComputeAuxiliaryLoss()

Computes the load balancing auxiliary loss based on expert usage from the last forward pass.

public T ComputeAuxiliaryLoss()

Returns

T

The load balancing loss value.

Remarks

The load balancing loss encourages balanced expert usage by penalizing imbalanced routing. It is computed as the dot product, taken across experts, of two per-expert fractions:

  • Token fraction: the proportion of tokens (inputs) routed to this expert
  • Probability mass fraction: the average routing probability assigned to this expert

Loss = NumExperts * sum(token_fraction_i * prob_mass_fraction_i) for all experts i

This loss is minimized when all experts receive equal numbers of tokens and equal total probability mass, encouraging balanced utilization.

For Beginners: Calculates a penalty for imbalanced expert usage.

How it works:

  1. Count Token Assignments:

    • For each expert, count how many inputs chose it (with Top-K) or had non-zero weight
    • Example with 8 inputs and 4 experts: [3, 2, 2, 1] tokens per expert
  2. Calculate Probability Mass:

    • For each expert, sum up its routing weights across all inputs
    • Example: [0.4, 0.3, 0.2, 0.1] total probability per expert
  3. Compute Load Balancing Loss:

    • Convert counts to fractions: [3/8, 2/8, 2/8, 1/8] = [0.375, 0.25, 0.25, 0.125]
    • Convert probabilities to fractions: [0.4, 0.3, 0.2, 0.1]
    • Dot product: 0.375*0.4 + 0.25*0.3 + 0.25*0.2 + 0.125*0.1 = 0.2875
    • Multiply by numExperts (4): load balancing loss = 4 * 0.2875 = 1.15

Why this works:

  • If all experts are used equally, both fraction vectors are [0.25, 0.25, 0.25, 0.25]
  • Dot product: 4 * (0.25 * 0.25) = 0.25, the minimum possible, so the loss is 4 * 0.25 = 1.0
  • If usage is imbalanced, say [0.5, 0.3, 0.15, 0.05] and [0.6, 0.25, 0.1, 0.05]
  • Dot product: 0.5*0.6 + 0.3*0.25 + 0.15*0.1 + 0.05*0.05 = 0.3925, so the loss is 4 * 0.3925 = 1.57 (a penalty for imbalance)

The loss is minimized when usage is perfectly balanced!
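
The same computation written out as standalone C# (plain arrays, numbers from the walkthrough above):

int numExperts = 4;
double[] tokenFraction    = { 0.375, 0.25, 0.25, 0.125 };  // [3, 2, 2, 1] tokens out of 8
double[] probMassFraction = { 0.4, 0.3, 0.2, 0.1 };        // normalized probability mass

double dot = 0.0;
for (int i = 0; i < numExperts; i++)
{
    dot += tokenFraction[i] * probMassFraction[i];         // accumulates to 0.2875
}

double loadBalanceLoss = numExperts * dot;                 // 4 * 0.2875 = 1.15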

Exceptions

InvalidOperationException

Thrown when called before a forward pass or when auxiliary loss is disabled.

ExportComputationGraph(List<ComputationNode<T>>)

Exports the layer's computation graph for JIT compilation.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to populate with input computation nodes.

Returns

ComputationNode<T>

The output computation node representing the layer's operation.

Remarks

This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.

For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.

To support JIT compilation, a layer must:

  1. Implement this method to export its computation graph
  2. Set SupportsJitCompilation to true
  3. Use ComputationNode and TensorOperations to build the graph

All layers are required to implement this method, even if they set SupportsJitCompilation = false.

Forward(Tensor<T>)

Performs the forward pass through the MoE layer.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor.

Returns

Tensor<T>

The output tensor after routing through experts and combining their outputs.

Remarks

The forward pass:

  1. Routes the input through the gating network to get expert scores
  2. Applies softmax to convert scores to routing probabilities
  3. Optionally selects only the top-K experts (sparse routing)
  4. Passes the input through the selected experts
  5. Combines expert outputs using the routing weights
  6. Applies the layer's activation function

For Beginners: This is where the MoE layer processes input data.

Step-by-step process:

  1. Routing: The router looks at the input and scores each expert

    • Input: data to process
    • Output: a score for each expert (raw numbers)
  2. Normalization: Convert scores to probabilities using softmax

    • Scores: might be [2.1, -0.5, 1.3, 0.8]
    • Weights: becomes [0.55, 0.04, 0.26, 0.15] (sum = 1.0)
  3. Selection (if using Top-K): Keep only the best K experts

    • With Top-2, keep experts with weights 0.55 and 0.26
    • Set others to 0 and renormalize: [0.68, 0, 0.32, 0]
  4. Expert Processing: Run input through selected experts

    • Expert 1 produces output A
    • Expert 3 produces output B
    • Others are skipped (if using Top-K)
  5. Combination: Mix expert outputs using weights

    • Output = 0.68 * A + 0.32 * B
    • This is the weighted average of expert predictions
  6. Activation: Apply final transformation

    • Usually identity (no change) or ReLU

The result is a smart combination of expert predictions, where each expert contributes based on its relevance to the specific input.
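
Steps 2 and 3 can be reproduced with plain C#. This standalone sketch uses the example scores above and ignores ties for simplicity:

double[] scores = { 2.1, -0.5, 1.3, 0.8 };  // raw router scores
int k = 2;

// Step 2: softmax normalization
double[] weights = new double[scores.Length];
double total = 0.0;
for (int i = 0; i < scores.Length; i++) { weights[i] = Math.Exp(scores[i]); total += weights[i]; }
for (int i = 0; i < weights.Length; i++) weights[i] /= total;  // ~ [0.56, 0.04, 0.25, 0.15]

// Step 3: keep the k largest weights, zero the rest, renormalize
double[] sorted = (double[])weights.Clone();
Array.Sort(sorted);                            // ascending order
double threshold = sorted[sorted.Length - k];  // smallest weight that survives
double kept = 0.0;
for (int i = 0; i < weights.Length; i++)
{
    if (weights[i] < threshold) weights[i] = 0.0;
    kept += weights[i];
}
for (int i = 0; i < weights.Length; i++) weights[i] /= kept;   // ~ [0.69, 0, 0.31, 0]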

ForwardGpu(params IGpuTensor<T>[])

Performs the forward pass on GPU tensors by routing through experts. All computations stay GPU-resident for maximum performance.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

GPU tensor inputs (uses first input).

Returns

IGpuTensor<T>

GPU tensor output after routing through experts and combining outputs.

Remarks

The GPU forward pass (all operations GPU-resident):

  1. Routes input through the router network (GPU)
  2. Applies softmax to get routing probabilities (GPU)
  3. Optionally applies Top-K selection (GPU)
  4. Passes input through each expert (GPU)
  5. Combines expert outputs using routing weights (GPU)
  6. Applies activation function (GPU)

Only downloads to CPU in training mode for gradient caching.

GetAuxiliaryLossDiagnostics()

Gets diagnostic information about expert usage and load balancing.

public Dictionary<string, string> GetAuxiliaryLossDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including per-expert usage statistics, load balancing metrics, and routing weight distributions.

Remarks

This method provides detailed statistics about expert usage that can be used for monitoring training progress, debugging routing issues, and tuning load balancing parameters.

For Beginners: Gets a detailed report about how experts are being used.

The returned dictionary includes:

  • expert_i_tokens: How many inputs were routed to expert i
  • expert_i_prob_mass: Total routing weight for expert i across all inputs
  • expert_i_avg_weight: Average routing weight when expert i is selected
  • load_balance_loss: Current load balancing loss value
  • usage_variance: Variance in expert usage (lower is better balanced)
  • max_min_ratio: Ratio of most-used to least-used expert (1.0 is perfect)

Use this information to:

  • Monitor whether expert usage is balanced or some experts are overused
  • Decide if you need to adjust the load balancing weight
  • Detect expert collapse (all inputs routed to one expert)
  • Track training health over time

Example output:

{
  "expert_0_tokens": "245",
  "expert_1_tokens": "198",
  "expert_2_tokens": "223",
  "expert_3_tokens": "234",
  "expert_0_prob_mass": "0.28",
  "expert_1_prob_mass": "0.22",
  ...
  "load_balance_loss": "0.253",
  "usage_variance": "0.0012",
  "max_min_ratio": "1.24"
}
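
Reading the report in code, as a sketch (moe as in the constructor example, after at least one forward pass):

foreach (var entry in moe.GetAuxiliaryLossDiagnostics())
{
    Console.WriteLine($"{entry.Key} = {entry.Value}");
}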

GetDiagnostics()

Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.

public override Dictionary<string, string> GetDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().

GetParameters()

Gets all trainable parameters as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters from the router and all experts.

Remarks

Parameters are ordered as: [router parameters] [expert1 parameters] [expert2 parameters] ...

For Beginners: Collects all learned values into one list.

The returned vector contains:

  • First, all parameters from the router
  • Then, all parameters from expert 1
  • Then, all parameters from expert 2
  • And so on

This is useful for:

  • Saving the entire MoE model to disk
  • Implementing advanced optimization algorithms
  • Analyzing the model's learned parameters
  • Transferring knowledge to another model

ResetState()

Resets the internal state of the layer, clearing all cached values.

public override void ResetState()

Remarks

This clears cached values from forward/backward passes and resets the state of the router and all experts. Call this between training batches or when switching between training and inference.

For Beginners: Clears the layer's "short-term memory".

This resets:

  • Cached inputs and outputs
  • Routing weights and decisions
  • Expert activations
  • All temporary values used for learning

When to call this:

  • Between different batches of training data
  • When switching from training to testing mode
  • Before processing a new, unrelated input

This ensures that information from one batch doesn't leak into the next batch, which could cause incorrect gradient calculations or predictions.

SetParameters(Vector<T>)

Sets all trainable parameters from a single vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

A vector containing parameters for the router and all experts.

Remarks

Parameters should be in the same order as returned by GetParameters(): [router parameters] [expert1 parameters] [expert2 parameters] ...

For Beginners: Loads previously saved parameters back into the model.

This is the opposite of GetParameters():

  • Takes a vector of all parameters
  • Distributes them to the router and experts
  • Must match the exact format returned by GetParameters()

Use this to:

  • Load a saved model from disk
  • Initialize with pre-trained parameters
  • Implement custom optimization algorithms

If the parameter count doesn't match exactly, an error is thrown to prevent accidentally corrupting the model.
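
A save/restore round trip, as a sketch:

// Snapshot all parameters, experiment, then restore the exact state
Vector<float> snapshot = moe.GetParameters();

// ... training or experimentation ...

moe.SetParameters(snapshot);  // the vector length must equal moe.ParameterCount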

Exceptions

ArgumentException

Thrown when the parameter count doesn't match.

UpdateParameters(T)

Updates all trainable parameters using the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate for parameter updates.

Remarks

This method updates parameters for both the router and all expert networks that support training.

For Beginners: This applies all the learned improvements to the router and experts.

After the backward pass calculated how everything should change:

  • The router updates its weights to make better routing decisions
  • Each expert updates its weights to make better predictions
  • The learning rate controls how big these updates are

Learning rate guidelines:

  • Too small: Learning is very slow but stable
  • Too large: Learning is fast but might be unstable
  • Just right: Balances speed and stability (often 0.001 to 0.01)

After calling this method, the MoE layer should perform slightly better than before.
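
Putting it together, one training iteration might look like this sketch. GetNextBatch, targets, and ComputeLossGradient are hypothetical placeholders for your data pipeline and loss function; only the moe calls come from this page:

Tensor<float> input = GetNextBatch();                           // hypothetical data loader
Tensor<float> output = moe.Forward(input);                      // forward pass
Tensor<float> lossGrad = ComputeLossGradient(output, targets);  // hypothetical loss gradient
moe.Backward(lossGrad);                                         // backward pass
moe.UpdateParameters(0.001f);                                   // apply updates with lr = 0.001
moe.ResetState();                                               // clear caches before the next batch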