Class ExpertLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents an expert module in a Mixture-of-Experts architecture, containing a sequence of layers.
public class ExpertLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → ExpertLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
- Inherited Members
Remarks
An Expert is a container for a sequence of neural network layers that are executed sequentially. In a Mixture-of-Experts (MoE) architecture, multiple experts process the same input, and their outputs are combined based on learned routing weights. Each expert can specialize in processing different types of inputs or patterns.
For Beginners: Think of an Expert as a mini neural network that specializes in a particular task.
In a Mixture-of-Experts system:
- You have multiple "experts" (mini-networks), each with their own layers
- Each expert learns to be good at handling certain types of inputs
- A routing mechanism decides which experts should process each input
- The final output combines the predictions from the selected experts
For example, in a language model:
- One expert might specialize in technical vocabulary
- Another might handle conversational language
- Another might focus on formal writing
- The router learns to send each input to the most appropriate expert(s)
This allows the model to scale to very large sizes while keeping computation efficient, since only a subset of experts are activated for each input.
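To make the combination step concrete, here is a minimal sketch of how an MoE layer might blend expert outputs. The arrays and weights below are illustrative stand-ins; the real ExpertLayer<T> operates on Tensor<T>, and the routing weights are learned by a separate router.

```csharp
// Outputs of two experts for the same input (illustrative values).
double[][] expertOutputs =
{
    new[] { 1.0, 2.0 },   // output of expert 0
    new[] { 3.0, 4.0 }    // output of expert 1
};
double[] routingWeights = { 0.75, 0.25 }; // produced by the router, sums to 1

// Weighted sum of the expert outputs, element by element.
var combined = new double[2];
for (int e = 0; e < expertOutputs.Length; e++)
    for (int i = 0; i < combined.Length; i++)
        combined[i] += routingWeights[e] * expertOutputs[e][i];

// combined is now { 1.5, 2.5 }: 0.75 * {1, 2} + 0.25 * {3, 4}
```

With sparse routing, weights for unselected experts are zero, so those experts are skipped entirely; that is what keeps computation efficient at scale.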
Constructors
ExpertLayer(List<ILayer<T>>, int[], int[], IActivationFunction<T>?)
Initializes a new instance of the ExpertLayer<T> class with the specified layers.
public ExpertLayer(List<ILayer<T>> layers, int[] inputShape, int[] outputShape, IActivationFunction<T>? activationFunction = null)
Parameters
layers (List<ILayer<T>>): The sequence of layers that make up this expert.
inputShape (int[]): The shape of the input tensor.
outputShape (int[]): The shape of the output tensor.
activationFunction (IActivationFunction<T>?): Optional activation function to apply after all layers (defaults to identity).
Remarks
This constructor creates an expert module from a sequence of layers. The layers are executed in the order provided during forward pass, and in reverse order during backpropagation. The input and output shapes should match the first layer's input shape and last layer's output shape.
For Beginners: This creates a new expert by chaining together multiple layers.
When creating an expert:
- Provide a list of layers in the order they should execute
- The first layer should accept your input shape
- The last layer should produce your desired output shape
- Each intermediate layer's output should match the next layer's expected input
For example, to create an expert that reduces dimensions:
var layers = new List<ILayer<float>>
{
    new DenseLayer<float>(100, 50, new ReLUActivation<float>()),
    new DenseLayer<float>(50, 25, new ReLUActivation<float>())
};
var expert = new ExpertLayer<float>(layers, new[] { 100 }, new[] { 25 });
Exceptions
- ArgumentException
Thrown when the layers list is empty.
Properties
ParameterCount
Gets the total number of trainable parameters across all layers in this expert.
public override int ParameterCount { get; }
Property Value
- int
The sum of parameter counts from all contained layers.
Remarks
This property calculates the total number of trainable parameters by summing the parameter counts of all layers in the expert's sequence.
For Beginners: This counts all the numbers that can be adjusted during training.
The total includes:
- Weights from all dense layers
- Biases from all layers that use them
- Any other learnable parameters in the layers
A higher parameter count means the expert can represent more complex patterns, but also requires more memory and computation.
SupportsGpuExecution
Gets a value indicating whether this expert supports GPU execution. Returns true if all contained layers support GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
true if all contained layers support GPU execution; otherwise, false.
SupportsJitCompilation
Gets whether this layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer can be JIT compiled, false otherwise.
Remarks
This property indicates whether the layer has implemented ExportComputationGraph() and can benefit from JIT compilation. All layers MUST implement this property.
For Beginners: JIT compilation can make inference 5-10x faster by converting the layer's operations into optimized native code.
Layers should return false if they:
- Have not yet implemented a working ExportComputationGraph()
- Use dynamic operations that change based on input data
- Are too simple to benefit from JIT compilation
When false, the layer will use the standard Forward() method instead.
SupportsTraining
Gets a value indicating whether this expert supports training through backpropagation.
public override bool SupportsTraining { get; }
Property Value
- bool
true if any of the contained layers support training; otherwise, false.
Remarks
An expert supports training if at least one of its layers has trainable parameters. This determines whether gradients will be computed during the backward pass.
For Beginners: This tells you whether this expert can learn from data.
An expert can learn if:
- At least one of its layers has adjustable parameters
- Those parameters can be updated during training
If all layers in an expert are fixed (like certain activation layers), the expert won't be trainable, but it can still process data.
Methods
Backward(Tensor<T>)
Calculates gradients by backpropagating through all layers in reverse order.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to this expert's output.
Returns
- Tensor<T>
The gradient of the loss with respect to this expert's input.
Remarks
This method performs the backward pass by propagating gradients through layers in reverse order. Each layer computes gradients for its parameters and passes the input gradient to the previous layer. The gradients are stored in each layer for the subsequent parameter update step.
For Beginners: This method helps all layers learn from their mistakes by passing error information backward.
The backward pass works in reverse:
- Start with information about how wrong the output was
- Apply the derivative of the expert's activation function
- Pass this error information to the last layer
- That layer calculates how to improve and passes error info to the previous layer
- Continue in reverse until reaching the first layer
- Return the gradient for the input (so earlier layers can learn too)
This is the core of how neural networks learn - each layer figures out how to adjust its parameters to reduce the error.
BackwardGpu(IGpuTensor<T>)
Computes the gradient of the loss with respect to the input on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- IGpuTensor<T>
The gradient of the loss with respect to the layer's input.
Clone()
Creates a deep copy of this expert, including all contained layers.
public override LayerBase<T> Clone()
Returns
- LayerBase<T>
A new Expert instance with the same configuration and parameters.
Remarks
This method creates a complete copy of the expert, including all layers and their parameters. The clone is independent of the original - changes to one won't affect the other.
For Beginners: This method creates an identical copy of the expert.
Cloning is useful when you want to:
- Experiment with different training approaches on the same starting point
- Create an ensemble of similar but independent experts
- Save a checkpoint while continuing to train
- Implement certain training algorithms that need multiple copies
The clone has:
- The same layer structure
- The same parameter values
- But is completely independent (changes to one don't affect the other)
It's like photocopying a document - you get an identical copy that you can modify without changing the original.
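A checkpointing use of Clone() might look like the sketch below. It assumes an `expert` built as in the constructor example; the cast is needed because Clone() is declared to return LayerBase<T>.

```csharp
// Hypothetical checkpoint sketch: snapshot an expert before risky training.
var checkpoint = (ExpertLayer<float>)expert.Clone();

// ...continue training `expert` with an aggressive learning rate...

// The checkpoint's parameters are unchanged, because the clone is a deep,
// independent copy of every layer and its parameters.
```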
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer's computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to populate with input computation nodes.
Returns
- ComputationNode<T>
The output computation node representing the layer's operation.
Remarks
This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.
For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.
To support JIT compilation, a layer must:
- Implement this method to export its computation graph
- Set SupportsJitCompilation to true
- Use ComputationNode and TensorOperations to build the graph
All layers are required to implement this method, even if they set SupportsJitCompilation = false.
Forward(Tensor<T>)
Processes the input data through all layers in sequence.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process.
Returns
- Tensor<T>
The output tensor after processing through all layers.
Remarks
This method performs the forward pass by sequentially passing the input through each layer. The output of each layer becomes the input to the next layer. After all layers have processed the data, the expert's activation function (if any) is applied to the final output.
For Beginners: This method runs the data through all the expert's layers in order.
The forward pass works like an assembly line:
- Start with the input data
- Pass it through the first layer
- Take that output and pass it to the second layer
- Continue until all layers have processed the data
- Apply the expert's activation function (if specified)
- Return the final result
Each layer transforms the data in some way, building up more complex representations as the data flows through the expert.
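The assembly line above can be sketched with the two-layer expert from the constructor example. The Tensor<float> construction shown here is an assumption for illustration; the key point is that the output shape follows the last layer.

```csharp
// Hypothetical forward-pass sketch: one 100-dimensional input flows
// through the 100 -> 50 -> 25 expert built in the constructor example.
var input = new Tensor<float>(new[] { 1, 100 }); // assumed shape-based constructor
Tensor<float> output = expert.Forward(input);
// output is expected to have shape [1, 25], matching the expert's outputShape
```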
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass on GPU tensors by chaining through all layers.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): GPU tensor inputs.
Returns
- IGpuTensor<T>
GPU tensor output after processing through all layers.
Remarks
This method executes the GPU forward pass by sequentially passing the input through each layer's ForwardGpu method. If any layer does not support GPU execution, the expert falls back to CPU execution.
GetParameters()
Gets all trainable parameters from all layers as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters from all layers, concatenated in layer order.
Remarks
This method extracts all trainable parameters from all layers and concatenates them into a single vector. The parameters are ordered by layer (first layer's parameters, then second layer's, etc.). This is useful for optimization algorithms that operate on all parameters at once.
For Beginners: This method collects all the learned values from every layer into one list.
The returned vector contains:
- All parameters from the first layer
- Then all parameters from the second layer
- And so on for all layers
This is useful for:
- Saving the expert's knowledge to disk
- Transferring learned parameters to another expert
- Advanced optimization techniques
- Analyzing what the expert has learned
You can think of it as packaging up everything the expert knows into one container.
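A small sketch of the concatenation order, assuming the members documented on this page:

```csharp
// Hypothetical sketch: the flat vector returned by GetParameters() is the
// per-layer parameter vectors concatenated in layer order, so its length
// should equal the expert's total ParameterCount.
Vector<float> parameters = expert.GetParameters();
int expected = expert.ParameterCount;
// parameters.Length == expected
```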
ResetState()
Resets the internal state of all layers, clearing any cached values.
public override void ResetState()
Remarks
This method calls ResetState() on all contained layers, clearing any cached values from forward/backward passes. This should be called between different training batches or when switching between training and inference modes.
For Beginners: This method clears the expert's "short-term memory".
During processing, layers remember:
- Recent inputs they processed
- Intermediate calculations
- Gradients from backpropagation
ResetState() clears all of this temporary information:
- Frees up memory
- Prevents information from one batch affecting another
- Prepares the expert for processing new data
Think of it like cleaning a whiteboard before starting a new problem - you want a fresh start without old information interfering.
When to call this:
- Between different training batches
- When switching from training to testing
- Before processing a completely new input
SetParameters(Vector<T>)
Sets all trainable parameters in all layers from a single vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters for all layers, concatenated in layer order.
Remarks
This method distributes parameters from a single vector to all layers. The parameters should be in the same order as returned by GetParameters() - first layer's parameters, then second layer's, etc. This is useful for loading pre-trained models or implementing advanced optimization algorithms.
For Beginners: This method loads previously saved knowledge back into all the layers.
When setting parameters:
- The vector must contain exactly the right number of parameters
- Parameters are distributed to layers in order (first layer first, etc.)
- Each layer receives its parameters and updates its weights and biases
This is the opposite of GetParameters() - instead of collecting knowledge, it distributes it.
Use cases:
- Loading a saved model from disk
- Transferring knowledge from one expert to another
- Initializing an expert with pre-trained parameters
- Implementing custom optimization algorithms
If the parameter count doesn't match, an error will be thrown to prevent corruption.
Exceptions
- ArgumentException
Thrown when the parameter vector has incorrect length.
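Because SetParameters() expects exactly the ordering produced by GetParameters(), the two methods form a natural save/restore pair. A hypothetical round trip, assuming an `expert` as built earlier:

```csharp
// Snapshot the current weights of every layer into one flat vector.
Vector<float> saved = expert.GetParameters();

// ...train or experiment, possibly degrading the model...

// Restore the snapshot exactly; throws ArgumentException if the length
// does not match the expert's ParameterCount.
expert.SetParameters(saved);
```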
UpdateParameters(T)
Updates all trainable parameters in all layers using the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for parameter updates.
Remarks
This method updates the parameters of all layers that support training. The learning rate controls the step size of the updates - larger values make bigger changes but may cause instability, while smaller values make more gradual, stable updates.
For Beginners: This method applies all the learned improvements to every layer.
After the backward pass has calculated how each layer should change:
- This method actually makes those changes
- It goes through each layer in order
- Each layer updates its weights and biases
- The learning rate controls how big the changes are
Think of it like this:
- Small learning rate = careful, small adjustments (slower but safer)
- Large learning rate = bold, big adjustments (faster but riskier)
After calling this method, the expert should perform slightly better than before.
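The methods documented on this page combine into a basic training step. The sketch below is illustrative only: `batches`, `LossGradient`, and the learning rate of 0.01f are assumptions, not part of the AiDotNet API.

```csharp
// Hypothetical training loop tying together Forward, Backward,
// UpdateParameters, and ResetState as documented above.
foreach (var (input, target) in batches)               // assumed data source
{
    Tensor<float> prediction = expert.Forward(input);   // forward pass
    Tensor<float> grad = LossGradient(prediction, target); // assumed helper
    expert.Backward(grad);            // compute per-layer gradients
    expert.UpdateParameters(0.01f);   // apply updates with learning rate 0.01
    expert.ResetState();              // clear cached values between batches
}
```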