Class MixtureOfExpertsNeuralNetwork<T>
- Namespace
- AiDotNet.NeuralNetworks
- Assembly
- AiDotNet.dll
Represents a Mixture-of-Experts (MoE) neural network that routes inputs through multiple specialist networks.
public class MixtureOfExpertsNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable
Type Parameters
T
The numeric type used for calculations (typically float or double).
- Inheritance
- object → NeuralNetworkBase<T> → MixtureOfExpertsNeuralNetwork<T>
Remarks
A Mixture-of-Experts neural network employs multiple expert networks and a gating mechanism to route inputs to the most appropriate experts. This architecture enables:
- Increased model capacity without proportional compute cost (sparse activation)
- Specialization of different experts on different aspects of the problem
- Improved scalability for large-scale problems
The architecture consists of:
- Multiple expert networks (which can be feed-forward, convolutional, etc.)
- A gating/routing network that learns to select appropriate experts
- An optional load-balancing loss to ensure all experts are utilized
For Beginners: Mixture-of-Experts is like having a team of specialists rather than one generalist.
Imagine you're running a hospital:
- Instead of one doctor handling everything, you have specialists (cardiologist, neurologist, etc.)
- A triage system (gating network) decides which specialist(s) should see each patient
- Each specialist only handles cases they're best suited for
In a MoE neural network:
- Multiple "expert" networks specialize in different patterns in your data
- A "gating network" learns to route each input to the best expert(s)
- Only a few experts process each input (sparse activation), making it efficient
- The final prediction combines the outputs from the selected experts
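To make the routing concrete, here is a minimal, framework-free sketch of Top-K gated combination with toy experts. It only illustrates the combination rule; the real MoE layer operates on Tensor<T> and learns both the gating network and the experts during training, and all numbers below are illustrative.
using System;
using System.Linq;

// Minimal illustration of Top-K gated mixture-of-experts routing.
static class MoeRoutingSketch
{
    static void Main()
    {
        // Three toy "experts": each transforms the input elementwise.
        Func<float[], float[]>[] experts =
        {
            x => x.Select(v => v * 2f).ToArray(),   // expert 0: doubles the input
            x => x.Select(v => v + 1f).ToArray(),   // expert 1: shifts the input
            x => x.Select(v => -v).ToArray(),       // expert 2: negates the input
        };

        float[] input = { 0.5f, -1.0f, 2.0f };

        // Gating scores (in the real network these come from a learned gating layer).
        float[] gateLogits = { 1.2f, 0.3f, -0.5f };
        float[] gateProbs = Softmax(gateLogits);

        // Top-K sparse routing: keep only the K highest-probability experts.
        const int topK = 2;
        var selected = gateProbs
            .Select((p, i) => (Prob: p, Index: i))
            .OrderByDescending(t => t.Prob)
            .Take(topK)
            .ToArray();

        // Renormalize the selected gate weights and combine the expert outputs.
        float norm = selected.Sum(t => t.Prob);
        float[] output = new float[input.Length];
        foreach (var (prob, index) in selected)
        {
            float weight = prob / norm;
            float[] expertOut = experts[index](input);
            for (int d = 0; d < output.Length; d++)
                output[d] += weight * expertOut[d];
        }

        Console.WriteLine($"Combined output: [{string.Join(", ", output)}]");
    }

    static float[] Softmax(float[] logits)
    {
        float max = logits.Max();
        float[] exps = logits.Select(l => (float)Math.Exp(l - max)).ToArray();
        float sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
The load-balancing loss mentioned above is not shown in this sketch; it adds a penalty when the gate routes most inputs to the same few experts.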
This model automatically implements IFullModel, allowing it to work with AiModelBuilder just like any other neural network in AiDotNet.
Key Features:
- Configurable number of expert networks
- Top-K sparse routing for computational efficiency
- Automatic load balancing to prevent expert collapse
- Integration with AiModelBuilder for easy training
- Full support for serialization and deserialization
Constructors
MixtureOfExpertsNeuralNetwork(MixtureOfExpertsOptions<T>, NeuralNetworkArchitecture<T>, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?, double)
Initializes a new instance of the MixtureOfExpertsNeuralNetwork class.
public MixtureOfExpertsNeuralNetwork(MixtureOfExpertsOptions<T> options, NeuralNetworkArchitecture<T> architecture, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null, double maxGradNorm = 1)
Parameters
options MixtureOfExpertsOptions<T>
Configuration options for the Mixture-of-Experts model.
architecture NeuralNetworkArchitecture<T>
The architecture defining the structure of the neural network.
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>
The optimization algorithm to use for training. If null, the Adam optimizer is used.
lossFunction ILossFunction<T>
The loss function to use for training. If null, an appropriate loss function is selected based on the task type.
maxGradNorm double
The maximum gradient norm for gradient clipping during training.
Examples
// Create options for 8 experts with Top-2 routing
var options = new MixtureOfExpertsOptions<float>
{
    NumExperts = 8,
    TopK = 2,
    InputDim = 128,
    OutputDim = 128,
    UseLoadBalancing = true,
    LoadBalancingWeight = 0.01
};

// Create architecture for classification
var architecture = new NeuralNetworkArchitecture<float>(
    inputType: InputType.OneDimensional,
    taskType: NeuralNetworkTaskType.MultiClassClassification,
    inputSize: 128,
    outputSize: 10
);

// Create the model
var model = new MixtureOfExpertsNeuralNetwork<float>(options, architecture);

// Use with AiModelBuilder (standard pattern).
// `trainingData` and `trainingLabels` are assumed to be Tensor<float> instances
// you have prepared elsewhere.
var builder = new AiModelBuilder<float, Tensor<float>, Tensor<float>>();
var result = builder.ConfigureModel(model).Build(trainingData, trainingLabels);
Remarks
This constructor creates a Mixture-of-Experts neural network based on the provided options and architecture. The options control MoE-specific parameters like number of experts, Top-K routing, and load balancing. The architecture defines the overall network structure including input/output dimensions and task type.
For Beginners: When creating a Mixture-of-Experts model, you provide two types of configuration:
Options (MixtureOfExpertsOptions): MoE-specific settings
- How many expert networks to create
- How many experts to activate per input (Top-K)
- Expert dimensions and architecture
- Load balancing settings
Architecture (NeuralNetworkArchitecture): General network settings
- What type of task (classification, regression, etc.)
- Input and output dimensions
- Any additional layers beyond the MoE layer
If you don't specify an optimizer or loss function, the model will choose sensible defaults based on your task type (e.g., CrossEntropy for classification, MSE for regression).
The model automatically integrates with AiModelBuilder, so you can train it using the standard AiDotNet pattern without any special handling.
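If you prefer to spell out the optional arguments, the documented defaults can be written explicitly. This sketch reuses the options and architecture variables from the example above:
// A null optimizer falls back to Adam, and a null loss function is chosen from the
// task type (e.g., cross-entropy for classification); gradients are clipped at norm 1.0.
var model = new MixtureOfExpertsNeuralNetwork<float>(
    options,
    architecture,
    optimizer: null,
    lossFunction: null,
    maxGradNorm: 1.0);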
Properties
SupportsTraining
Indicates whether this network supports training.
public override bool SupportsTraining { get; }
Property Value
bool
Remarks
Mixture-of-Experts networks fully support training through backpropagation, including gradient flow to both the expert networks and the gating network.
For Beginners: This indicates that the MoE network can learn from data.
The network learns:
- How to route inputs to appropriate experts (gating network)
- How each expert should process its specialized inputs (expert networks)
- How to balance usage across all experts (load balancing)
This property returns true, meaning the network is designed to be trained.
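For example, a generic training helper can check the flag before calling Train. Here trainInput and trainTarget are assumed to be pre-built Tensor<float> instances:
// SupportsTraining is always true for Mixture-of-Experts networks.
if (model.SupportsTraining)
{
    model.Train(trainInput, trainTarget);
}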
Methods
Backward(Tensor<T>)
Performs a backward pass through the network to calculate gradients.
public Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient Tensor<T>
The gradient of the loss with respect to the network's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the network's input.
Remarks
The backward pass propagates gradients backward through each layer, including the MoE layer which distributes gradients to the experts that were activated during the forward pass.
For Beginners: This is how the network learns from mistakes.
After making a prediction, we calculate how wrong it was (the error). This method works backward through the network, calculating how each part contributed to the error. This information is used to improve the network.
For MoE networks:
- Gradients flow back through the output layers
- Then through the MoE layer to the activated experts
- The gating network also learns which experts to select
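A minimal usage sketch, assuming input is a Tensor<float> and outputGradient is the gradient of the loss with respect to the network's output (produced by your loss function, which is not shown here):
// The forward pass records which experts were activated; Backward then sends
// gradients only to those experts and to the gating network.
Tensor<float> output = model.Forward(input);
Tensor<float> inputGradient = model.Backward(outputGradient);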
CreateNewInstance()
Creates a new instance of the MixtureOfExpertsNeuralNetwork with the same configuration as the current instance.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new MixtureOfExpertsNeuralNetwork instance with the same configuration.
Remarks
This method creates a new instance with the same architecture, options, optimizer, and loss function as the current instance. This is useful for model cloning, ensemble methods, or cross-validation scenarios.
For Beginners: This creates a fresh copy of your MoE network's blueprint.
The new network:
- Has the same number and configuration of experts
- Uses the same routing strategy (Top-K)
- Has the same load balancing settings
- BUT has newly initialized weights (no learned data)
Use cases:
- Testing the same model architecture on different data
- Creating ensemble models (multiple models voting on predictions)
- Cross-validation (training and testing on different data splits)
DeserializeNetworkSpecificData(BinaryReader)
Deserializes Mixture-of-Experts network-specific data from a binary reader.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader BinaryReader
The BinaryReader to read the data from.
Remarks
This method reads the MoE-specific configuration and state from a binary stream, restoring a previously saved model.
For Beginners: This loads a previously saved MoE model from a file.
It restores:
- All expert network weights
- Gating network weights
- Configuration settings
- Optimizer and loss function types
The loaded model is ready to use for predictions without retraining.
Forward(Tensor<T>)
Performs a forward pass through the network with the given input tensor.
public Tensor<T> Forward(Tensor<T> input)
Parameters
input Tensor<T>
The input tensor to process.
Returns
- Tensor<T>
The output tensor after processing through all layers.
Remarks
The forward pass sequentially processes the input through each layer of the network, with the MoE layer using sparse expert activation based on the gating network's decisions.
For Beginners: This method processes input through all the network's layers.
For MoE networks:
- Input goes through the MoE layer (which internally routes to experts)
- Then through any additional layers you added
- Final output is returned
Think of it like an assembly line where each station (layer) processes the data and passes it to the next station.
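A minimal sketch, assuming input is a Tensor<float> matching the configured input size. Forward is typically used inside custom training loops; for inference, Predict is the usual entry point:
// Runs the input through the MoE layer (sparse expert activation) and any
// additional layers, returning the final output tensor.
Tensor<float> output = model.Forward(input);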
GetModelMetadata()
Retrieves metadata about the Mixture-of-Experts neural network model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
A ModelMetadata object containing information about the network.
Remarks
This method collects and returns various pieces of information about the network's structure, including MoE-specific details like number of experts and routing strategy.
For Beginners: This provides a summary of your MoE network's configuration.
The metadata includes:
- How many expert networks you have
- How many experts are activated per input (Top-K)
- Expert dimensions and architecture details
- Load balancing settings
- Overall network structure
This is useful for documentation, debugging, or understanding model differences.
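A minimal sketch of retrieving the metadata for a constructed model:
// Returns a ModelMetadata<T> describing the network; inspect it (for example in a
// debugger) to see the number of experts, Top-K setting, and other details listed above.
var metadata = model.GetModelMetadata();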
InitializeLayers()
Initializes the layers of the neural network based on the provided architecture and options.
protected override void InitializeLayers()
Remarks
This method creates the Mixture-of-Experts layer based on the configuration options, and adds any additional layers specified in the architecture.
For Beginners: This method sets up the expert networks and gating mechanism.
The initialization process:
- Creates expert networks based on your options (number, dimensions, etc.)
- Creates the gating/routing network that learns to select experts
- Adds any additional layers you specified in the architecture
You don't need to call this manually - it's automatically called during construction.
Predict(Tensor<T>)
Makes a prediction using the Mixture-of-Experts network for the given input tensor.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input Tensor<T>
The input tensor to make a prediction for.
Returns
- Tensor<T>
The predicted output tensor.
Remarks
This method performs a forward pass through the network to generate a prediction. The gating network determines which experts to activate, and only those experts process the input for efficiency.
For Beginners: This method makes predictions using the trained Mixture-of-Experts model.
What happens during prediction:
- The input goes to the gating network
- The gating network selects the best experts for this input (Top-K)
- Only the selected experts process the input
- Expert outputs are combined using learned weights
- The final prediction is returned
This sparse activation (using only some experts) makes MoE much faster than running all experts for every input, while maintaining high-quality predictions.
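A minimal sketch, assuming input is a Tensor<float> with the shape the model was configured for (128 features in the earlier example):
// Routes the input through the Top-K selected experts and returns the combined prediction.
Tensor<float> prediction = model.Predict(input);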
SerializeNetworkSpecificData(BinaryWriter)
Serializes Mixture-of-Experts network-specific data to a binary writer.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer BinaryWriter
The BinaryWriter to write the data to.
Remarks
This method writes the MoE-specific configuration and state to a binary stream, allowing the model to be saved and loaded later.
For Beginners: This saves your trained MoE model to a file.
It records:
- All expert network weights
- Gating network weights
- Configuration settings
- Optimizer and loss function types
This allows you to:
- Save a trained model for later use
- Share models with others
- Deploy models to production
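A hypothetical save/load round-trip sketch. The class implements IModelSerializer, but its member names are not shown on this page, so the Serialize()/Deserialize(byte[]) calls below are assumptions used only to illustrate the flow; check the IModelSerializer documentation for the actual signatures:
// Hypothetical sketch: Serialize() and Deserialize(byte[]) are assumed IModelSerializer
// members; verify the real interface before using this.
byte[] saved = model.Serialize();
File.WriteAllBytes("moe-model.bin", saved);

// Later (or in another process): rebuild the model shell, then restore its state.
var restored = new MixtureOfExpertsNeuralNetwork<float>(options, architecture);
restored.Deserialize(File.ReadAllBytes("moe-model.bin"));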
Train(Tensor<T>, Tensor<T>)
Trains the Mixture-of-Experts network using the provided input and expected output.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input Tensor<T>
The input tensor for training.
expectedOutput Tensor<T>
The expected output tensor for the given input.
Remarks
This method performs one training iteration, including:
- Forward pass through the network
- Primary loss calculation (task-specific loss)
- Auxiliary loss calculation (load-balancing loss)
- Backward pass (backpropagation)
- Parameter update via optimizer
For Beginners: This is how the MoE network learns.
Training process:
- Show the network some input data and the correct output
- Network makes a prediction using selected experts
- Calculate two types of errors:
- Task error: How wrong the prediction was
- Balance error: Whether experts are being used evenly
- Adjust the network to reduce both errors
- Repeat many times with different examples
The load balancing ensures all experts contribute and don't collapse to using just one or two experts.
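A minimal manual training loop, assuming trainInput and trainTarget are pre-built Tensor<float> batches; in most applications you would let AiModelBuilder drive training instead:
// Each call performs one iteration: forward pass, task loss plus load-balancing loss,
// backpropagation, and a parameter update via the optimizer.
for (int epoch = 0; epoch < 100; epoch++)
{
    model.Train(trainInput, trainTarget);
}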
UpdateParameters(Vector<T>)
Updates the parameters of all layers in the network.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters Vector<T>
A vector containing all parameters for the network.
Remarks
This method distributes the parameters to each layer, including all expert networks within the MoE layer and the gating network.
For Beginners: After calculating how to improve the network, this method applies those improvements.
It distributes updated settings (parameters) to:
- All expert networks
- The gating network
- Any additional layers
This is called repeatedly during training to gradually improve accuracy.
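A minimal sketch, assuming flatParameters is a Vector<float> whose length matches the network's total parameter count (all experts, the gating network, and any additional layers); obtaining and flattening the parameters is not shown here:
// Distributes the flat parameter vector to each layer: the MoE layer's experts and
// gating network, plus any additional layers.
model.UpdateParameters(flatParameters);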