Class ExpertLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents an expert module in a Mixture-of-Experts architecture, containing a sequence of layers.
public class ExpertLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → ExpertLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
- Inherited Members
Remarks
An Expert is a container for a sequence of neural network layers that are executed sequentially. In a Mixture-of-Experts (MoE) architecture, multiple experts process the same input, and their outputs are combined based on learned routing weights. Each expert can specialize in processing different types of inputs or patterns.
For Beginners: Think of an Expert as a mini neural network that specializes in a particular task.
In a Mixture-of-Experts system:
- You have multiple "experts" (mini-networks), each with their own layers
- Each expert learns to be good at handling certain types of inputs
- A routing mechanism decides which experts should process each input
- The final output combines the predictions from the selected experts
For example, in a language model:
- One expert might specialize in technical vocabulary
- Another might handle conversational language
- Another might focus on formal writing
- The router learns to send each input to the most appropriate expert(s)
This allows the model to scale to very large sizes while keeping computation efficient, since only a subset of experts are activated for each input.
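To make the combination step concrete, here is a minimal sketch of how an MoE layer might blend expert outputs. The arrays and weights below are illustrative stand-ins; the real ExpertLayer<T> operates on Tensor<T>, and the routing weights are learned by a separate router.

```csharp
// Outputs of two experts for the same input (illustrative values).
double[][] expertOutputs =
{
    new[] { 1.0, 2.0 },   // output of expert 0
    new[] { 3.0, 4.0 }    // output of expert 1
};
double[] routingWeights = { 0.75, 0.25 }; // produced by the router, sums to 1

// Weighted sum of the expert outputs, element by element.
var combined = new double[2];
for (int e = 0; e < expertOutputs.Length; e++)
    for (int i = 0; i < combined.Length; i++)
        combined[i] += routingWeights[e] * expertOutputs[e][i];

// combined is now { 1.5, 2.5 }: 0.75 * {1, 2} + 0.25 * {3, 4}
```

With sparse routing, weights for unselected experts are zero, so those experts are skipped entirely; that is what keeps computation efficient at scale.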
Constructors
ExpertLayer(List<ILayer<T>>, int[], int[], IActivationFunction<T>?)
Initializes a new instance of the ExpertLayer<T> class with the specified layers.
public ExpertLayer(List<ILayer<T>> layers, int[] inputShape, int[] outputShape, IActivationFunction<T>? activationFunction = null)
Parameters
layers (List<ILayer<T>>): The sequence of layers that make up this expert.
inputShape (int[]): The shape of the input tensor.
outputShape (int[]): The shape of the output tensor.
activationFunction (IActivationFunction<T>?): Optional activation function to apply after all layers (defaults to identity).
Remarks
This constructor creates an expert module from a sequence of layers. The layers are executed in the order provided during forward pass, and in reverse order during backpropagation. The input and output shapes should match the first layer's input shape and last layer's output shape.
For Beginners: This creates a new expert by chaining together multiple layers.
When creating an expert:
- Provide a list of layers in the order they should execute
- The first layer should accept your input shape
- The last layer should produce your desired output shape
- Each intermediate layer's output should match the next layer's expected input
For example, to create an expert that reduces dimensions:
var layers = new List<ILayer<float>>
{
    new DenseLayer<float>(100, 50, new ReLUActivation<float>()),
    new DenseLayer<float>(50, 25, new ReLUActivation<float>())
};
var expert = new ExpertLayer<float>(layers, new[] { 100 }, new[] { 25 });
Exceptions
- ArgumentException
Thrown when the layers list is empty.
Properties
ParameterCount
Gets the total number of trainable parameters across all layers in this expert.
public override int ParameterCount { get; }
Property Value
- int
The sum of parameter counts from all contained layers.
Remarks
This property calculates the total number of trainable parameters by summing the parameter counts of all layers in the expert's sequence.
For Beginners: This counts all the numbers that can be adjusted during training.
The total includes:
- Weights from all dense layers
- Biases from all layers that use them
- Any other learnable parameters in the layers
A higher parameter count means the expert can represent more complex patterns, but also requires more memory and computation.
SupportsGpuExecution
Gets a value indicating whether this expert supports GPU execution. Returns true if all contained layers support GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
true if all contained layers support GPU execution; otherwise, false.
SupportsJitCompilation
Gets whether this layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer can be JIT compiled, false otherwise.
Remarks
This property indicates whether the layer has implemented ExportComputationGraph() and can benefit from JIT compilation. All layers MUST implement this property.
For Beginners: JIT compilation can make inference 5-10x faster by converting the layer's operations into optimized native code.
Layers should return false if they:
- Have not yet implemented a working ExportComputationGraph()
- Use dynamic operations that change based on input data
- Are too simple to benefit from JIT compilation
When false, the layer will use the standard Forward() method instead.
SupportsTraining
Gets a value indicating whether this expert supports training through backpropagation.
public override bool SupportsTraining { get; }
Property Value
- bool
true if any of the contained layers support training; otherwise, false.
Remarks
An expert supports training if at least one of its layers has trainable parameters. This determines whether gradients will be computed during the backward pass.
For Beginners: This tells you whether this expert can learn from data.
An expert can learn if:
- At least one of its layers has adjustable parameters
- Those parameters can be updated during training
If all layers in an expert are fixed (like certain activation layers), the expert won't be trainable, but it can still process data.
Methods
Backward(Tensor<T>)
Calculates gradients by backpropagating through all layers in reverse order.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to this expert's output.
Returns
- Tensor<T>
The gradient of the loss with respect to this expert's input.
Remarks
This method performs the backward pass by propagating gradients through layers in reverse order. Each layer computes gradients for its parameters and passes the input gradient to the previous layer. The gradients are stored in each layer for the subsequent parameter update step.
For Beginners: This method helps all layers learn from their mistakes by passing error information backward.
The backward pass works in reverse:
- Start with information about how wrong the output was
- Apply the derivative of the expert's activation function
- Pass this error information to the last layer
- That layer calculates how to improve and passes error info to the previous layer
- Continue in reverse until reaching the first layer
- Return the gradient for the input (so earlier layers can learn too)
This is the core of how neural networks learn - each layer figures out how to adjust its parameters to reduce the error.
BackwardGpu(IGpuTensor<T>)
Computes the gradient of the loss with respect to the input on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- IGpuTensor<T>
The gradient of the loss with respect to the layer's input.
Clone()
Creates a deep copy of this expert, including all contained layers.
public override LayerBase<T> Clone()
Returns
- LayerBase<T>
A new Expert instance with the same configuration and parameters.
Remarks
This method creates a complete copy of the expert, including all layers and their parameters. The clone is independent of the original - changes to one won't affect the other.
For Beginners: This method creates an identical copy of the expert.
Cloning is useful when you want to:
- Experiment with different training approaches on the same starting point
- Create an ensemble of similar but independent experts
- Save a checkpoint while continuing to train
- Implement certain training algorithms that need multiple copies
The clone has:
- The same layer structure
- The same parameter values
- But is completely independent (changes to one don't affect the other)
It's like photocopying a document - you get an identical copy that you can modify without changing the original.
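A checkpointing use of Clone() might look like the sketch below. It assumes an `expert` built as in the constructor example; the cast is needed because Clone() is declared to return LayerBase<T>.

```csharp
// Hypothetical checkpoint sketch: snapshot an expert before risky training.
var checkpoint = (ExpertLayer<float>)expert.Clone();

// ...continue training `expert` with an aggressive learning rate...

// The checkpoint's parameters are unchanged, because the clone is a deep,
// independent copy of every layer and its parameters.
```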
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer's computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to populate with input computation nodes.
Returns
- ComputationNode<T>
The output computation node representing the layer's operation.
Remarks
This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.
For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.
To support JIT compilation, a layer must:
- Implement this method to export its computation graph
- Set SupportsJitCompilation to true
- Use ComputationNode and TensorOperations to build the graph
All layers are required to implement this method, even if they set SupportsJitCompilation = false.
Forward(Tensor<T>)
Processes the input data through all layers in sequence.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process.
Returns
- Tensor<T>
The output tensor after processing through all layers.
Remarks
This method performs the forward pass by sequentially passing the input through each layer. The output of each layer becomes the input to the next layer. After all layers have processed the data, the expert's activation function (if any) is applied to the final output.
For Beginners: This method runs the data through all the expert's layers in order.
The forward pass works like an assembly line:
- Start with the input data
- Pass it through the first layer
- Take that output and pass it to the second layer
- Continue until all layers have processed the data
- Apply the expert's activation function (if specified)
- Return the final result
Each layer transforms the data in some way, building up more complex representations as the data flows through the expert.
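The assembly line above can be sketched with the two-layer expert from the constructor example. The Tensor<float> construction shown here is an assumption for illustration; the key point is that the output shape follows the last layer.

```csharp
// Hypothetical forward-pass sketch: one 100-dimensional input flows
// through the 100 -> 50 -> 25 expert built in the constructor example.
var input = new Tensor<float>(new[] { 1, 100 }); // assumed shape-based constructor
Tensor<float> output = expert.Forward(input);
// output is expected to have shape [1, 25], matching the expert's outputShape
```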
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass on GPU tensors by chaining through all layers.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): GPU tensor inputs.
Returns
- IGpuTensor<T>
GPU tensor output after processing through all layers.
Remarks
This method executes the GPU forward pass by sequentially passing the input through each layer's ForwardGpu method. If any layer does not support GPU execution, the expert falls back to CPU execution.
GetParameters()
Gets all trainable parameters from all layers as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters from all layers, concatenated in layer order.
Remarks
This method extracts all trainable parameters from all layers and concatenates them into a single vector. The parameters are ordered by layer (first layer's parameters, then second layer's, etc.). This is useful for optimization algorithms that operate on all parameters at once.
For Beginners: This method collects all the learned values from every layer into one list.
The returned vector contains:
- All parameters from the first layer
- Then all parameters from the second layer
- And so on for all layers
This is useful for:
- Saving the expert's knowledge to disk
- Transferring learned parameters to another expert
- Advanced optimization techniques
- Analyzing what the expert has learned
You can think of it as packaging up everything the expert knows into one container.
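A small sketch of the concatenation order, assuming the members documented on this page:

```csharp
// Hypothetical sketch: the flat vector returned by GetParameters() is the
// per-layer parameter vectors concatenated in layer order, so its length
// should equal the expert's total ParameterCount.
Vector<float> parameters = expert.GetParameters();
int expected = expert.ParameterCount;
// parameters.Length == expected
```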
ResetState()
Resets the internal state of all layers, clearing any cached values.
public override void ResetState()
Remarks
This method calls ResetState() on all contained layers, clearing any cached values from forward/backward passes. This should be called between different training batches or when switching between training and inference modes.
For Beginners: This method clears the expert's "short-term memory".
During processing, layers remember:
- Recent inputs they processed
- Intermediate calculations
- Gradients from backpropagation
ResetState() clears all of this temporary information:
- Frees up memory
- Prevents information from one batch affecting another
- Prepares the expert for processing new data
Think of it like cleaning a whiteboard before starting a new problem - you want a fresh start without old information interfering.
When to call this:
- Between different training batches
- When switching from training to testing
- Before processing a completely new input
SetParameters(Vector<T>)
Sets all trainable parameters in all layers from a single vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters for all layers, concatenated in layer order.
Remarks
This method distributes parameters from a single vector to all layers. The parameters should be in the same order as returned by GetParameters() - first layer's parameters, then second layer's, etc. This is useful for loading pre-trained models or implementing advanced optimization algorithms.
For Beginners: This method loads previously saved knowledge back into all the layers.
When setting parameters:
- The vector must contain exactly the right number of parameters
- Parameters are distributed to layers in order (first layer first, etc.)
- Each layer receives its parameters and updates its weights and biases
This is the opposite of GetParameters() - instead of collecting knowledge, it distributes it.
Use cases:
- Loading a saved model from disk
- Transferring knowledge from one expert to another
- Initializing an expert with pre-trained parameters
- Implementing custom optimization algorithms
If the parameter count doesn't match, an error will be thrown to prevent corruption.
Exceptions
- ArgumentException
Thrown when the parameter vector has incorrect length.
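Because SetParameters() expects exactly the ordering produced by GetParameters(), the two methods form a natural save/restore pair. A hypothetical round trip, assuming an `expert` as built earlier:

```csharp
// Snapshot the current weights of every layer into one flat vector.
Vector<float> saved = expert.GetParameters();

// ...train or experiment, possibly degrading the model...

// Restore the snapshot exactly; throws ArgumentException if the length
// does not match the expert's ParameterCount.
expert.SetParameters(saved);
```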
UpdateParameters(T)
Updates all trainable parameters in all layers using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for parameter updates.
Remarks
This method updates the parameters of all layers that support training. The learning rate controls the step size of the updates - larger values make bigger changes but may cause instability, while smaller values make more gradual, stable updates.
For Beginners: This method applies all the learned improvements to every layer.
After the backward pass has calculated how each layer should change:
- This method actually makes those changes
- It goes through each layer in order
- Each layer updates its weights and biases
- The learning rate controls how big the changes are
Think of it like this:
- Small learning rate = careful, small adjustments (slower but safer)
- Large learning rate = bold, big adjustments (faster but riskier)
After calling this method, the expert should perform slightly better than before.
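The methods documented on this page combine into a basic training step. The sketch below is illustrative only: `batches`, `LossGradient`, and the learning rate of 0.01f are assumptions, not part of the AiDotNet API.

```csharp
// Hypothetical training loop tying together Forward, Backward,
// UpdateParameters, and ResetState as documented above.
foreach (var (input, target) in batches)               // assumed data source
{
    Tensor<float> prediction = expert.Forward(input);   // forward pass
    Tensor<float> grad = LossGradient(prediction, target); // assumed helper
    expert.Backward(grad);            // compute per-layer gradients
    expert.UpdateParameters(0.01f);   // apply updates with learning rate 0.01
    expert.ResetState();              // clear cached values between batches
}
```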