Class MixtureOfExpertsNeuralNetwork<T>
- Namespace
- AiDotNet.NeuralNetworks
- Assembly
- AiDotNet.dll
Represents a Mixture-of-Experts (MoE) neural network that routes inputs through multiple specialist networks.
public class MixtureOfExpertsNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable
Type Parameters
T
The numeric type used for calculations (typically float or double).
- Inheritance
- object → NeuralNetworkBase<T> → MixtureOfExpertsNeuralNetwork<T>
Remarks
A Mixture-of-Experts neural network employs multiple expert networks and a gating mechanism to route inputs to the most appropriate experts. This architecture enables:
- Increased model capacity without proportional compute cost (sparse activation)
- Specialization of different experts on different aspects of the problem
- Improved scalability for large-scale problems
The architecture consists of:
- Multiple expert networks (which can be feed-forward, convolutional, etc.)
- A gating/routing network that learns to select appropriate experts
- An optional load-balancing loss to ensure all experts are utilized
For Beginners: Mixture-of-Experts is like having a team of specialists rather than one generalist.
Imagine you're running a hospital:
- Instead of one doctor handling everything, you have specialists (cardiologist, neurologist, etc.)
- A triage system (gating network) decides which specialist(s) should see each patient
- Each specialist only handles cases they're best suited for
In a MoE neural network:
- Multiple "expert" networks specialize in different patterns in your data
- A "gating network" learns to route each input to the best expert(s)
- Only a few experts process each input (sparse activation), making it efficient
- The final prediction combines the outputs from the selected experts
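To make the routing concrete, here is a minimal, framework-free sketch of Top-K gated combination with toy experts. It only illustrates the combination rule; the real MoE layer operates on Tensor<T> and learns both the gating network and the experts during training, and all numbers below are illustrative.
using System;
using System.Linq;

// Minimal illustration of Top-K gated mixture-of-experts routing.
static class MoeRoutingSketch
{
    static void Main()
    {
        // Three toy "experts": each transforms the input elementwise.
        Func<float[], float[]>[] experts =
        {
            x => x.Select(v => v * 2f).ToArray(),   // expert 0: doubles the input
            x => x.Select(v => v + 1f).ToArray(),   // expert 1: shifts the input
            x => x.Select(v => -v).ToArray(),       // expert 2: negates the input
        };

        float[] input = { 0.5f, -1.0f, 2.0f };

        // Gating scores (in the real network these come from a learned gating layer).
        float[] gateLogits = { 1.2f, 0.3f, -0.5f };
        float[] gateProbs = Softmax(gateLogits);

        // Top-K sparse routing: keep only the K highest-probability experts.
        const int topK = 2;
        var selected = gateProbs
            .Select((p, i) => (Prob: p, Index: i))
            .OrderByDescending(t => t.Prob)
            .Take(topK)
            .ToArray();

        // Renormalize the selected gate weights and combine the expert outputs.
        float norm = selected.Sum(t => t.Prob);
        float[] output = new float[input.Length];
        foreach (var (prob, index) in selected)
        {
            float weight = prob / norm;
            float[] expertOut = experts[index](input);
            for (int d = 0; d < output.Length; d++)
                output[d] += weight * expertOut[d];
        }

        Console.WriteLine($"Combined output: [{string.Join(", ", output)}]");
    }

    static float[] Softmax(float[] logits)
    {
        float max = logits.Max();
        float[] exps = logits.Select(l => (float)Math.Exp(l - max)).ToArray();
        float sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
The load-balancing loss mentioned above is not shown in this sketch; it adds a penalty when the gate routes most inputs to the same few experts.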
This model automatically implements IFullModel, allowing it to work with AiModelBuilder just like any other neural network in AiDotNet.
Key Features:
- Configurable number of expert networks
- Top-K sparse routing for computational efficiency
- Automatic load balancing to prevent expert collapse
- Integration with AiModelBuilder for easy training
- Full support for serialization and deserialization
Constructors
MixtureOfExpertsNeuralNetwork(MixtureOfExpertsOptions<T>, NeuralNetworkArchitecture<T>, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?, double)
Initializes a new instance of the MixtureOfExpertsNeuralNetwork class.
public MixtureOfExpertsNeuralNetwork(MixtureOfExpertsOptions<T> options, NeuralNetworkArchitecture<T> architecture, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null, double maxGradNorm = 1)
Parameters
options MixtureOfExpertsOptions<T>
Configuration options for the Mixture-of-Experts model.
architecture NeuralNetworkArchitecture<T>
The architecture defining the structure of the neural network.
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>
The optimization algorithm to use for training. If null, the Adam optimizer is used.
lossFunction ILossFunction<T>
The loss function to use for training. If null, an appropriate loss function is selected based on the task type.
maxGradNorm double
The maximum gradient norm for gradient clipping during training.
Examples
// Create options for 8 experts with Top-2 routing
var options = new MixtureOfExpertsOptions<float>
{
    NumExperts = 8,
    TopK = 2,
    InputDim = 128,
    OutputDim = 128,
    UseLoadBalancing = true,
    LoadBalancingWeight = 0.01
};

// Create architecture for classification
var architecture = new NeuralNetworkArchitecture<float>(
    inputType: InputType.OneDimensional,
    taskType: NeuralNetworkTaskType.MultiClassClassification,
    inputSize: 128,
    outputSize: 10
);

// Create the model
var model = new MixtureOfExpertsNeuralNetwork<float>(options, architecture);

// Use with AiModelBuilder (standard pattern).
// `trainingData` and `trainingLabels` are assumed to be Tensor<float> instances
// you have prepared elsewhere.
var builder = new AiModelBuilder<float, Tensor<float>, Tensor<float>>();
var result = builder.ConfigureModel(model).Build(trainingData, trainingLabels);
Remarks
This constructor creates a Mixture-of-Experts neural network based on the provided options and architecture. The options control MoE-specific parameters like number of experts, Top-K routing, and load balancing. The architecture defines the overall network structure including input/output dimensions and task type.
For Beginners: When creating a Mixture-of-Experts model, you provide two types of configuration:
Options (MixtureOfExpertsOptions): MoE-specific settings
- How many expert networks to create
- How many experts to activate per input (Top-K)
- Expert dimensions and architecture
- Load balancing settings
Architecture (NeuralNetworkArchitecture): General network settings
- What type of task (classification, regression, etc.)
- Input and output dimensions
- Any additional layers beyond the MoE layer
If you don't specify an optimizer or loss function, the model will choose sensible defaults based on your task type (e.g., CrossEntropy for classification, MSE for regression).
The model automatically integrates with AiModelBuilder, so you can train it using the standard AiDotNet pattern without any special handling.
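If you prefer to spell out the optional arguments, the documented defaults can be written explicitly. This sketch reuses the options and architecture variables from the example above:
// A null optimizer falls back to Adam, and a null loss function is chosen from the
// task type (e.g., cross-entropy for classification); gradients are clipped at norm 1.0.
var model = new MixtureOfExpertsNeuralNetwork<float>(
    options,
    architecture,
    optimizer: null,
    lossFunction: null,
    maxGradNorm: 1.0);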
Properties
SupportsTraining
Indicates whether this network supports training.
public override bool SupportsTraining { get; }
Property Value
bool
Remarks
Mixture-of-Experts networks fully support training through backpropagation, including gradient flow to both the expert networks and the gating network.
For Beginners: This indicates that the MoE network can learn from data.
The network learns:
- How to route inputs to appropriate experts (gating network)
- How each expert should process its specialized inputs (expert networks)
- How to balance usage across all experts (load balancing)
This property returns true, meaning the network is designed to be trained.
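For example, a generic training helper can check the flag before calling Train. Here trainInput and trainTarget are assumed to be pre-built Tensor<float> instances:
// SupportsTraining is always true for Mixture-of-Experts networks.
if (model.SupportsTraining)
{
    model.Train(trainInput, trainTarget);
}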
Methods
Backward(Tensor<T>)
Performs a backward pass through the network to calculate gradients.
public Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient Tensor<T>
The gradient of the loss with respect to the network's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the network's input.
Remarks
The backward pass propagates gradients backward through each layer, including the MoE layer which distributes gradients to the experts that were activated during the forward pass.
For Beginners: This is how the network learns from mistakes.
After making a prediction, we calculate how wrong it was (the error). This method works backward through the network, calculating how each part contributed to the error. This information is used to improve the network.
For MoE networks:
- Gradients flow back through the output layers
- Then through the MoE layer to the activated experts
- The gating network also learns which experts to select
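A minimal usage sketch, assuming input is a Tensor<float> and outputGradient is the gradient of the loss with respect to the network's output (produced by your loss function, which is not shown here):
// The forward pass records which experts were activated; Backward then sends
// gradients only to those experts and to the gating network.
Tensor<float> output = model.Forward(input);
Tensor<float> inputGradient = model.Backward(outputGradient);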
CreateNewInstance()
Creates a new instance of the MixtureOfExpertsNeuralNetwork with the same configuration as the current instance.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new MixtureOfExpertsNeuralNetwork instance with the same configuration.
Remarks
This method creates a new instance with the same architecture, options, optimizer, and loss function as the current instance. This is useful for model cloning, ensemble methods, or cross-validation scenarios.
For Beginners: This creates a fresh copy of your MoE network's blueprint.
The new network:
- Has the same number and configuration of experts
- Uses the same routing strategy (Top-K)
- Has the same load balancing settings
- BUT has newly initialized weights (no learned data)
Use cases:
- Testing the same model architecture on different data
- Creating ensemble models (multiple models voting on predictions)
- Cross-validation (training and testing on different data splits)
DeserializeNetworkSpecificData(BinaryReader)
Deserializes Mixture-of-Experts network-specific data from a binary reader.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader BinaryReader
The BinaryReader to read the data from.
Remarks
This method reads the MoE-specific configuration and state from a binary stream, restoring a previously saved model.
For Beginners: This loads a previously saved MoE model from a file.
It restores:
- All expert network weights
- Gating network weights
- Configuration settings
- Optimizer and loss function types
The loaded model is ready to use for predictions without retraining.
Forward(Tensor<T>)
Performs a forward pass through the network with the given input tensor.
public Tensor<T> Forward(Tensor<T> input)
Parameters
input Tensor<T>
The input tensor to process.
Returns
- Tensor<T>
The output tensor after processing through all layers.
Remarks
The forward pass sequentially processes the input through each layer of the network, with the MoE layer using sparse expert activation based on the gating network's decisions.
For Beginners: This method processes input through all the network's layers.
For MoE networks:
- Input goes through the MoE layer (which internally routes to experts)
- Then through any additional layers you added
- Final output is returned
Think of it like an assembly line where each station (layer) processes the data and passes it to the next station.
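A minimal sketch, assuming input is a Tensor<float> matching the configured input size. Forward is typically used inside custom training loops; for inference, Predict is the usual entry point:
// Runs the input through the MoE layer (sparse expert activation) and any
// additional layers, returning the final output tensor.
Tensor<float> output = model.Forward(input);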
GetModelMetadata()
Retrieves metadata about the Mixture-of-Experts neural network model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
A ModelMetadata object containing information about the network.
Remarks
This method collects and returns various pieces of information about the network's structure, including MoE-specific details like number of experts and routing strategy.
For Beginners: This provides a summary of your MoE network's configuration.
The metadata includes:
- How many expert networks you have
- How many experts are activated per input (Top-K)
- Expert dimensions and architecture details
- Load balancing settings
- Overall network structure
This is useful for documentation, debugging, or understanding model differences.
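A minimal sketch of retrieving the metadata for a constructed model:
// Returns a ModelMetadata<T> describing the network; inspect it (for example in a
// debugger) to see the number of experts, Top-K setting, and other details listed above.
var metadata = model.GetModelMetadata();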
InitializeLayers()
Initializes the layers of the neural network based on the provided architecture and options.
protected override void InitializeLayers()
Remarks
This method creates the Mixture-of-Experts layer based on the configuration options, and adds any additional layers specified in the architecture.
For Beginners: This method sets up the expert networks and gating mechanism.
The initialization process:
- Creates expert networks based on your options (number, dimensions, etc.)
- Creates the gating/routing network that learns to select experts
- Adds any additional layers you specified in the architecture
You don't need to call this manually - it's automatically called during construction.
Predict(Tensor<T>)
Makes a prediction using the Mixture-of-Experts network for the given input tensor.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input Tensor<T>
The input tensor to make a prediction for.
Returns
- Tensor<T>
The predicted output tensor.
Remarks
This method performs a forward pass through the network to generate a prediction. The gating network determines which experts to activate, and only those experts process the input for efficiency.
For Beginners: This method makes predictions using the trained Mixture-of-Experts model.
What happens during prediction:
- The input goes to the gating network
- The gating network selects the best experts for this input (Top-K)
- Only the selected experts process the input
- Expert outputs are combined using learned weights
- The final prediction is returned
This sparse activation (using only some experts) makes MoE much faster than running all experts for every input, while maintaining high-quality predictions.
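A minimal sketch, assuming input is a Tensor<float> with the shape the model was configured for (128 features in the earlier example):
// Routes the input through the Top-K selected experts and returns the combined prediction.
Tensor<float> prediction = model.Predict(input);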
SerializeNetworkSpecificData(BinaryWriter)
Serializes Mixture-of-Experts network-specific data to a binary writer.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer BinaryWriter
The BinaryWriter to write the data to.
Remarks
This method writes the MoE-specific configuration and state to a binary stream, allowing the model to be saved and loaded later.
For Beginners: This saves your trained MoE model to a file.
It records:
- All expert network weights
- Gating network weights
- Configuration settings
- Optimizer and loss function types
This allows you to:
- Save a trained model for later use
- Share models with others
- Deploy models to production
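A hypothetical save/load round-trip sketch. The class implements IModelSerializer, but its member names are not shown on this page, so the Serialize()/Deserialize(byte[]) calls below are assumptions used only to illustrate the flow; check the IModelSerializer documentation for the actual signatures:
// Hypothetical sketch: Serialize() and Deserialize(byte[]) are assumed IModelSerializer
// members; verify the real interface before using this.
byte[] saved = model.Serialize();
File.WriteAllBytes("moe-model.bin", saved);

// Later (or in another process): rebuild the model shell, then restore its state.
var restored = new MixtureOfExpertsNeuralNetwork<float>(options, architecture);
restored.Deserialize(File.ReadAllBytes("moe-model.bin"));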
Train(Tensor<T>, Tensor<T>)
Trains the Mixture-of-Experts network using the provided input and expected output.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input Tensor<T>
The input tensor for training.
expectedOutput Tensor<T>
The expected output tensor for the given input.
Remarks
This method performs one training iteration, including:
- Forward pass through the network
- Primary loss calculation (task-specific loss)
- Auxiliary loss calculation (load-balancing loss)
- Backward pass (backpropagation)
- Parameter update via optimizer
For Beginners: This is how the MoE network learns.
Training process:
- Show the network some input data and the correct output
- Network makes a prediction using selected experts
- Calculate two types of errors:
- Task error: How wrong the prediction was
- Balance error: Whether experts are being used evenly
- Adjust the network to reduce both errors
- Repeat many times with different examples
The load balancing ensures all experts contribute and don't collapse to using just one or two experts.
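A minimal manual training loop, assuming trainInput and trainTarget are pre-built Tensor<float> batches; in most applications you would let AiModelBuilder drive training instead:
// Each call performs one iteration: forward pass, task loss plus load-balancing loss,
// backpropagation, and a parameter update via the optimizer.
for (int epoch = 0; epoch < 100; epoch++)
{
    model.Train(trainInput, trainTarget);
}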
UpdateParameters(Vector<T>)
Updates the parameters of all layers in the network.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters Vector<T>
A vector containing all parameters for the network.
Remarks
This method distributes the parameters to each layer, including all expert networks within the MoE layer and the gating network.
For Beginners: After calculating how to improve the network, this method applies those improvements.
It distributes updated settings (parameters) to:
- All expert networks
- The gating network
- Any additional layers
This is called repeatedly during training to gradually improve accuracy.
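A minimal sketch, assuming flatParameters is a Vector<float> whose length matches the network's total parameter count (all experts, the gating network, and any additional layers); obtaining and flattening the parameters is not shown here:
// Distributes the flat parameter vector to each layer: the MoE layer's experts and
// gating network, plus any additional layers.
model.UpdateParameters(flatParameters);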