Class MixtureOfExpertsBuilder<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

A builder class that helps create and configure Mixture-of-Experts layers with sensible defaults.

public class MixtureOfExpertsBuilder<T>

Type Parameters

T

The numeric type used for calculations (typically float or double).

Inheritance
object → MixtureOfExpertsBuilder<T>

Remarks

This builder simplifies the creation of Mixture-of-Experts layers by providing convenient methods with research-backed default values. It follows best practices from the MoE literature to ensure a good initial configuration for most use cases.

For Beginners: Think of this as a guided recipe for creating an MoE layer.

Instead of manually specifying every detail of your MoE layer (which experts to use, how to route between them, whether to use load balancing, etc.), this builder provides good default choices based on research and best practices.

It's like having a cooking recipe that says "preheat to 350°F" instead of making you figure out the right temperature yourself. You can still customize if needed, but the defaults work well for most cases.

Constructors

MixtureOfExpertsBuilder()

Initializes a new instance of the MixtureOfExpertsBuilder<T> class.

public MixtureOfExpertsBuilder()

Remarks

The builder is initialized with sensible default values based on research:

  • 4 experts (balance between capacity and computation)
  • Soft routing (all experts active, good for smaller models)
  • Load balancing enabled with weight 0.01 (prevents expert collapse)
  • ReLU activation for experts (standard, well-tested choice)
  • Identity activation for output (let downstream layers add non-linearity)

For Beginners: Creates a new MoE builder with smart default settings.

The defaults are chosen to work well in most situations:

  • Not too many experts (4): Fast training and inference
  • Not too few experts (4): Enough specialization capacity
  • Load balancing: Ensures all experts get used
  • ReLU activation: The most popular choice, works well in practice

These defaults are based on what researchers have found works best in practice. You can change any of these later if you have specific needs.
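
Example (a minimal sketch that relies entirely on the defaults; only the dimensions are supplied explicitly):

var defaultMoe = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 128)
    .Build();

The resulting layer uses 4 experts, soft routing, load balancing with weight 0.01, ReLU expert activations, and an Identity output activation.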

Methods

Build()

Builds the Mixture-of-Experts layer with the configured settings.

public MixtureOfExpertsLayer<T> Build()

Returns

MixtureOfExpertsLayer<T>

A configured MixtureOfExpertsLayer instance.

Remarks

This method creates all the expert networks, the routing network, and assembles them into a complete MoE layer. It uses the configuration specified via the builder methods, falling back to sensible defaults for any unspecified settings.

For Beginners: Creates the actual MoE layer with all your settings.

What happens when you call Build():

  1. Creates the routing network (decides which experts to use)
  2. Creates all the expert networks with your specified architecture
  3. Connects everything together into one MoE layer
  4. Initializes all parameters with good starting values

After calling Build(), you get a complete, ready-to-use MoE layer that you can:

  • Add to your neural network architecture
  • Train with your data
  • Use for inference

Example:

var moeLayer = new MixtureOfExpertsBuilder<float>()
    .WithExperts(8)
    .WithDimensions(256, 256)
    .WithTopK(2)
    .WithLoadBalancing(true, 0.01)
    .Build();

This creates an MoE layer with 8 experts, where each input uses only the top 2 experts, and load balancing ensures all experts get used equally during training.

WithDimensions(int, int)

Sets the input and output dimensions for the MoE layer.

public MixtureOfExpertsBuilder<T> WithDimensions(int inputDim, int outputDim)

Parameters

inputDim int

The input dimension.

outputDim int

The output dimension.

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

For transformer-style architectures, input and output dimensions are typically the same (to support residual connections). For bottleneck architectures, the output is smaller than the input (dimensionality reduction). For expansion architectures, the output is larger than the input (feature expansion).

For Beginners: Sets the size of data coming in and going out.

Common patterns:

  • Same size (128→128): Maintains dimensionality, easy to stack multiple MoE layers
  • Bottleneck (512→128): Compresses information, reduces computation in later layers
  • Expansion (128→512): Expands features, increases representational capacity

Most transformer-based models use the same input and output dimensions, which makes it easy to stack many MoE layers together.

Example: If your previous layer outputs 256 features and your next layer expects 256 features, use WithDimensions(256, 256).
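
The three patterns above as builder calls (a sketch with illustrative dimensions):

var sameSize = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(256, 256)   // maintains dimensionality, easy to stack
    .Build();

var bottleneck = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(512, 128)   // compresses features
    .Build();

var expansion = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 512)   // expands features
    .Build();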

WithExpertActivation(IActivationFunction<T>)

Sets the activation function for experts.

public MixtureOfExpertsBuilder<T> WithExpertActivation(IActivationFunction<T> activation)

Parameters

activation IActivationFunction<T>

The activation function to use in expert networks.

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Common choices: ReLU (default, fast and stable), GELU (used in transformers, smoother), Swish/SiLU (good performance, slightly more computation). ReLU is the safest default choice for most applications.

For Beginners: Sets what mathematical function experts use for non-linearity.

Popular choices:

  • ReLU (default): Fast, stable, works in most cases

    • Use when: You want safe, reliable performance
    • Used in: Most computer vision, many NLP models
  • GELU: Smoother than ReLU, used in modern transformers

    • Use when: Building transformer-based models
    • Used in: BERT, GPT, most modern language models
  • Swish/SiLU: Smooth and performs well

    • Use when: You want slightly better performance
    • Trade-off: A bit slower than ReLU
  • Tanh: Classic choice, outputs -1 to 1

    • Use when: You need bounded outputs
    • Used in: LSTMs, some older architectures

If unsure, stick with the default (ReLU). It's the most tested and reliable.
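
Example (a sketch; GELUActivation<float> is a placeholder name for whichever IActivationFunction<float> implementation provides GELU in your project):

// The concrete activation class below is hypothetical; substitute your own
// IActivationFunction<float> implementation.
var transformerMoe = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(256, 256)
    .WithExpertActivation(new GELUActivation<float>())   // GELU experts, transformer-style
    .Build();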

WithExpertHiddenDim(int)

Sets the hidden dimension for the expert networks (for 2-layer experts).

public MixtureOfExpertsBuilder<T> WithExpertHiddenDim(int hiddenDim)

Parameters

hiddenDim int

The hidden dimension for expert networks.

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Research often uses 4x the model dimension as the hidden dimension in MoE layers. For example, if your model uses 128-dimensional embeddings, use hiddenDim=512. This is known as the "feed-forward expansion factor" in transformer literature.

For Beginners: Sets how large the "middle" of each expert is.

Each expert is like a mini-network: Input → Hidden → Output. The hidden layer is where the expert does its "thinking."

Common practice: Make hidden dimension 4x the input dimension

  • Input 128 → Hidden 512 → Output 128
  • Input 256 → Hidden 1024 → Output 256

Why 4x?

  • Gives experts enough capacity to learn complex patterns
  • Based on extensive research (used in BERT, GPT, etc.)
  • Good balance between capacity and efficiency

You can go lower (2x) for smaller models or higher (8x) for more capacity.
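
Example (a sketch following the 4x rule for a 128-dimensional model):

var moeLayer = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 128)
    .WithExpertHiddenDim(512)   // 4 * 128, the common feed-forward expansion
    .Build();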

WithExperts(int)

Sets the number of expert networks in the MoE layer.

public MixtureOfExpertsBuilder<T> WithExperts(int numExperts)

Parameters

numExperts int

The number of experts (must be at least 2).

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Common values in research: 4-16 for small/medium models, 32-128 for large models. More experts = more capacity but also more computation and memory.

For Beginners: Sets how many specialist networks to create.

Guidelines:

  • 2-4 experts: Good for small models or limited compute
  • 4-8 experts: Sweet spot for most applications
  • 8-16 experts: For larger, more complex tasks
  • 16+ experts: For very large scale models (use with TopK for efficiency)

More experts allow more specialization, but:

  • Take longer to train
  • Use more memory
  • May need load balancing to prevent some being unused

Start with 4-8 and adjust based on your results.
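
Example (a sketch of the suggested starting point, making the expert count explicit):

var moeLayer = new MixtureOfExpertsBuilder<float>()
    .WithExperts(4)             // start here and adjust based on results
    .WithDimensions(256, 256)
    .Build();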

WithHiddenExpansion(int)

Sets the hidden dimension expansion factor for expert networks.

public MixtureOfExpertsBuilder<T> WithHiddenExpansion(int expansion)

Parameters

expansion int

The expansion factor (hidden dim = input dim * expansion).

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

The actual expert hidden dimension will be calculated as InputDim * expansion. Common values: 2-4 for moderate capacity, 4-8 for high capacity.
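
Example (a sketch; per the remarks above, both configurations produce the same 512-dimensional expert hidden layer for a 128-dimensional input):

var viaExpansion = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 128)
    .WithHiddenExpansion(4)      // hidden dim = 128 * 4 = 512
    .Build();

var viaExplicitDim = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 128)
    .WithExpertHiddenDim(512)    // the same hidden dim, stated directly
    .Build();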

WithIntermediateLayer(bool)

Configures whether experts should use an intermediate (hidden) layer.

public MixtureOfExpertsBuilder<T> WithIntermediateLayer(bool useIntermediateLayer)

Parameters

useIntermediateLayer bool

True to use 2-layer experts, false for single-layer experts.

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Two-layer experts (Input → Hidden → Output) provide more capacity and are standard in research. Single-layer experts (Input → Output) are faster and use less memory, suitable for simpler tasks. Default is true (two-layer) as this matches most research implementations.

For Beginners: Controls how complex each expert network is.

Two-layer experts (default: true):

  • Structure: Input → Hidden → Output
  • Pros: More capacity to learn complex patterns
  • Cons: Slower, uses more memory
  • Use when: You have a complex task or enough compute
  • Example: Input(128) → Hidden(512) → Output(128)

Single-layer experts (false):

  • Structure: Input → Output (direct connection)
  • Pros: Faster, less memory, easier to train
  • Cons: Less capacity for complex patterns
  • Use when: Simpler task or limited compute
  • Example: Input(128) → Output(128)

Rule of thumb:

  • Complex tasks (language, vision): Use two-layer (true)
  • Simple tasks (regression, small classification): Can use single-layer (false)
  • When unsure: Stick with default (true)
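
Example (a sketch of a lightweight configuration with single-layer experts for a simpler task):

var lightweightMoe = new MixtureOfExpertsBuilder<float>()
    .WithExperts(2)
    .WithDimensions(64, 64)
    .WithIntermediateLayer(false)   // direct Input → Output experts
    .Build();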

WithLoadBalancing(bool, double)

Configures load balancing to encourage even expert utilization.

public MixtureOfExpertsBuilder<T> WithLoadBalancing(bool enabled = true, double weight = 0.01)

Parameters

enabled bool

Whether to enable load balancing.

weight double

The weight for the load balancing loss (typically 0.01-0.1).

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Load balancing prevents "expert collapse" where all inputs are routed to a small subset of experts. The default weight of 0.01 is based on the Switch Transformer paper and works well in most cases. Increase to 0.05-0.1 if you observe severe imbalance, decrease to 0.001-0.005 if it hurts accuracy.

For Beginners: Ensures all experts get used roughly equally.

The Problem: Without load balancing, the router might send all inputs to just 1-2 experts, leaving others unused. This wastes capacity and prevents specialization.

The Solution: Load balancing adds a small penalty when experts are used unevenly, encouraging the router to spread inputs across all experts.

Weight Guidelines:

  • 0.01 (default): Gentle encouragement, rarely hurts accuracy
  • 0.05: Moderate encouragement, use if you see significant imbalance
  • 0.1: Strong encouragement, may slightly reduce accuracy but ensures balance
  • 0.001: Very gentle, use if load balancing seems to hurt performance

When to use:

  • Always use for training (enabled by default)
  • Disable for inference/testing (the builder does this automatically)

Monitoring: Check GetAuxiliaryLossDiagnostics() during training to see if experts are balanced. Ideally, all experts should be used 10-30% of the time (with 4 experts, each should get ~25%).
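
Example (a sketch: raising the weight from the default 0.01 to 0.05 after observing that a few experts receive most of the traffic):

var balancedMoe = new MixtureOfExpertsBuilder<float>()
    .WithExperts(8)
    .WithDimensions(256, 256)
    .WithLoadBalancing(enabled: true, weight: 0.05)   // moderate encouragement
    .Build();

During training, check the auxiliary-loss diagnostics mentioned above to confirm that each of the 8 experts handles roughly 1/8 of the inputs.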

WithOutputActivation(IActivationFunction<T>)

Sets the activation function for the MoE layer output.

public MixtureOfExpertsBuilder<T> WithOutputActivation(IActivationFunction<T> activation)

Parameters

activation IActivationFunction<T>

The activation function to apply after combining expert outputs.

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

The default is Identity (no activation), which is appropriate when the MoE layer is used as a drop-in replacement for a feed-forward layer in architectures with residual connections. Use ReLU or other activations if you want non-linearity at this point.

For Beginners: Sets what happens to the combined output of all experts.

Typical choices:

  • Identity (default): No change to the output

    • Use when: MoE is part of a residual block (most transformer architectures)
    • Reasoning: Downstream layers will add their own activations
  • ReLU: Applies non-linearity to the final output

    • Use when: MoE is a standalone layer without residual connections
    • Common in: Feed-forward networks, some CNN architectures

In most modern architectures (like transformers), you want Identity here because the architecture already has non-linearity elsewhere.
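
Example (a sketch for a standalone MoE layer without residual connections; ReLUActivation<float> is a placeholder name for your IActivationFunction<float> ReLU implementation):

// The concrete activation class below is hypothetical; substitute your own
// IActivationFunction<float> implementation. For a transformer-style
// residual block, omit this call and keep the default Identity.
var standaloneMoe = new MixtureOfExpertsBuilder<float>()
    .WithDimensions(128, 128)
    .WithOutputActivation(new ReLUActivation<float>())
    .Build();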

WithTopK(int)

Configures Top-K sparse routing.

public MixtureOfExpertsBuilder<T> WithTopK(int k)

Parameters

k int

The number of top experts to activate per input (0 = use all experts).

Returns

MixtureOfExpertsBuilder<T>

This builder instance for method chaining.

Remarks

Top-K routing dramatically improves efficiency by activating only K experts per input. Common values: K=1 or K=2 for large models, K=0 (all experts) for smaller models. Research shows K=2 often provides the best accuracy/efficiency tradeoff.

For Beginners: Controls how many experts process each input.

Options:

  • TopK = 0 (default): All experts process every input (soft routing)

    • Pros: Maximum quality, all experts contribute
    • Cons: Slower, uses more memory
    • Best for: Small models (4-8 experts), when quality is critical
  • TopK = 1: Only the best expert for each input

    • Pros: Very fast, minimal computation
    • Cons: Less capacity, experts must specialize strongly
    • Best for: Very large models (32+ experts), inference speed critical
  • TopK = 2 (recommended for large models): Top 2 experts per input

    • Pros: Good balance of quality and speed
    • Cons: Still more computation than TopK=1
    • Best for: Medium to large models (8-32 experts)

Example: With 8 experts and TopK=2, you use only 25% of the computation!

Rule of thumb:

  • 4-8 experts: Use TopK=0 (all)
  • 8-16 experts: Use TopK=2
  • 16+ experts: Use TopK=1 or TopK=2
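
Example (a sketch applying the rule of thumb: 16 experts with top-2 sparse routing, so each input activates 2 of the 16 experts, roughly 12.5% of the full computation):

var sparseMoe = new MixtureOfExpertsBuilder<float>()
    .WithExperts(16)
    .WithDimensions(512, 512)
    .WithTopK(2)
    .Build();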