Class LayerNormalizationLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents a Layer Normalization layer that normalizes inputs across the feature dimension.
public class LayerNormalizationLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → LayerNormalizationLayer<T>
- Implements
- ILayer<T>
- IJitCompilable<T>
- IDiagnosticsProvider
- IWeightLoadable<T>
- IDisposable
Remarks
Layer Normalization is a technique used to normalize the inputs to a layer, which can help improve training stability and speed. Unlike Batch Normalization which normalizes across the batch dimension, Layer Normalization normalizes across the feature dimension independently for each sample. This makes it particularly useful for recurrent networks and when batch sizes are small. The layer learns scale (gamma) and shift (beta) parameters to allow the network to recover the original representation if needed.
For Beginners: This layer helps stabilize and speed up training by standardizing the data.
Think of Layer Normalization like standardizing test scores:
- It makes each sample's features have a mean of 0 and standard deviation of 1
- It does this independently for each sample (unlike Batch Normalization)
- It applies this normalization along the feature dimension
- After normalizing, it scales and shifts the values using learnable parameters
For example, in a sentiment analysis task, some input sentences might use very positive words while others use more neutral language. Layer Normalization helps the network focus on the relative importance of features within each sample rather than their absolute values.
This is particularly useful for:
- Recurrent neural networks
- Cases where batch sizes are small
- Making training more stable and faster
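The core computation can be illustrated in a few lines of plain C#. This is a minimal sketch of the math only (not the library's implementation), normalizing a single 4-feature sample with the initial gamma = 1 and beta = 0:

using System;
using System.Linq;

// One sample with 4 features.
double[] x = { 2.0, 4.0, 6.0, 8.0 };
double epsilon = 1e-5;                 // small value for numerical stability

double mean = x.Average();                                           // 5.0
double variance = x.Select(v => (v - mean) * (v - mean)).Average();  // 5.0
double std = Math.Sqrt(variance + epsilon);

double gamma = 1.0, beta = 0.0;                                      // initial (neutral) values
double[] normalized = x.Select(v => gamma * ((v - mean) / std) + beta).ToArray();
// `normalized` now has mean ~0 and standard deviation ~1 for this sample.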
Constructors
LayerNormalizationLayer(int, double)
Initializes a new instance of the LayerNormalizationLayer<T> class with the specified feature size and epsilon value.
public LayerNormalizationLayer(int featureSize, double epsilon = 1E-05)
Parameters
featureSize (int): The number of features in the input data.
epsilon (double): A small value added to the variance for numerical stability. Defaults to 1e-5.
Remarks
This constructor creates a new Layer Normalization layer with the specified feature size and epsilon value. The gamma parameters are initialized to 1.0 and the beta parameters are initialized to 0.0.
For Beginners: This creates a new Layer Normalization layer with specific settings.
When creating this layer, you specify:
- featureSize: How many features each sample has (like dimensions in your data)
- epsilon: A tiny safety value to prevent division by zero (usually you can use the default)
The layer automatically initializes with:
- Gamma values of 1.0 for each feature (neutral scaling)
- Beta values of 0.0 for each feature (no initial shifting)
For example, if your data has 128 features, you would use featureSize=128.
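In code, a minimal construction sketch looks like this (the constructor signature matches this page; the values are illustrative):

var layerNorm = new LayerNormalizationLayer<float>(featureSize: 128);          // default epsilon = 1e-5
var layerNormCustom = new LayerNormalizationLayer<float>(128, epsilon: 1e-6);  // custom epsilon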
Properties
ParameterCount
Gets the total number of trainable parameters in the layer.
public override int ParameterCount { get; }
Property Value
- int
The total number of trainable parameters (gamma and beta values combined).
Remarks
This property returns the total number of trainable parameters in the layer, which is the combined length of the gamma and beta vectors (2 × featureSize). This is useful for allocating parameter buffers, or for validating a vector before passing it to SetParameters(Vector<T>).
For Beginners: This property tells you how many learnable values the layer has.
The count:
- Includes one gamma (scaling) value and one beta (shifting) value per feature
- Equals 2 × featureSize
This is useful for:
- Checking how large the layer is
- Verifying that a parameter vector you want to load has the right length
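For example, assuming the count is simply the combined length of the gamma and beta vectors, a layer with 128 features should report 256 parameters:

var layer = new LayerNormalizationLayer<float>(featureSize: 128);
int count = layer.ParameterCount;   // expected: 128 gamma + 128 beta = 256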
SupportsGpuExecution
Indicates whether this layer supports GPU-resident execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
true if the layer can execute its forward and backward passes entirely on the GPU.
SupportsJitCompilation
Gets whether this layer normalization layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT if:
- Gamma (scale) and beta (shift) parameters are initialized
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with learnable parameters (gamma and beta)
Unlike batch normalization, layer normalization doesn't require running statistics, so it can be JIT compiled immediately after initialization. It works the same way during training and inference, computing mean and variance on the fly for each sample.
Once initialized, JIT compilation can provide significant speedup (5-10x) by optimizing the per-sample normalization, scaling, and shifting operations.
This is especially important for Transformers where layer norm is used extensively in every encoder and decoder block.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
true because this layer has trainable parameters (gamma and beta).
Remarks
This property indicates whether the layer can be trained through backpropagation. The LayerNormalizationLayer always returns true because it contains trainable scale and shift parameters.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has parameters that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process
The Layer Normalization layer always supports training because it has gamma (scale) and beta (shift) parameters that are learned during training.
Methods
Backward(Tensor<T>)
Performs the backward pass of the layer normalization layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backward pass of the layer normalization, which is used during training to propagate error gradients back through the network. It calculates the gradients for the gamma and beta parameters, and returns the gradient with respect to the input for further backpropagation.
For Beginners: This method is used during training to calculate how the layer's input and parameters should change to reduce errors.
During the backward pass:
- The layer receives information about how its output contributed to errors
- It calculates how the gamma and beta parameters should change to reduce errors
- It calculates how the input should change, which will be used by earlier layers
This backward computation is complex because changing the mean and standard deviation of a sample affects all features, creating interdependencies in the gradients.
The method will throw an error if you try to run it before performing a forward pass.
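A hypothetical single training step is sketched below; `input` and `lossGradient` are assumed to be existing Tensor<float> values of shape [batchSize, featureSize] obtained elsewhere:

var layer = new LayerNormalizationLayer<float>(featureSize: 64);

Tensor<float> output = layer.Forward(input);                  // forward pass must run first
Tensor<float> inputGradient = layer.Backward(lossGradient);   // gradient passed to earlier layers
layer.UpdateParameters(0.01f);                                // apply the gamma/beta updates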
Exceptions
- InvalidOperationException
Thrown when Forward has not been called before Backward.
BackwardGpu(IGpuTensor<T>)
Computes the gradient of the loss with respect to the input on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- IGpuTensor<T>
The gradient of the loss with respect to the layer's input.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer normalization layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): The list to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the layer normalization operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, features]
2. Creates constant nodes for gamma (scale) and beta (shift) parameters
3. Applies the layer normalization operation: gamma * ((x - mean) / sqrt(variance + epsilon)) + beta
4. Unlike batch normalization, layer norm computes statistics per sample (no running statistics needed)
For Beginners: This method builds a symbolic representation of layer normalization for JIT.
JIT compilation converts the layer normalization operation into optimized native code. Layer normalization:
- Computes mean and variance for each sample independently across features
- Normalizes: (x - mean) / sqrt(variance + epsilon)
- Scales and shifts: result * gamma + beta
- Works identically during training and inference (no batch dependency)
The symbolic graph allows the JIT compiler to:
- Optimize the per-sample normalization formula
- Fuse the scale and shift operations
- Generate SIMD-optimized code for better performance
This is particularly important for Transformers and RNNs where layer norm is critical. Typically provides 5-10x speedup compared to interpreted execution.
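A minimal export sketch follows; how the returned node is consumed by the JIT compiler is not shown here and depends on the rest of the library:

using System.Collections.Generic;

var layer = new LayerNormalizationLayer<float>(featureSize: 128);

var inputNodes = new List<ComputationNode<float>>();
ComputationNode<float> graph = layer.ExportComputationGraph(inputNodes);
// `inputNodes` now holds the symbolic input node; `graph` represents
// gamma * ((x - mean) / sqrt(variance + epsilon)) + beta.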
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer shape or parameters are not initialized.
Forward(Tensor<T>)
Performs the forward pass of the layer normalization layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to normalize. Shape should be [batchSize, featureSize].
Returns
- Tensor<T>
The normalized tensor with the same shape as the input.
Remarks
This method implements the forward pass of the layer normalization. It uses the Engine's accelerated LayerNorm operation to normalize each sample independently across the feature dimension.
For Beginners: This method normalizes your data as it passes through the layer.
During the forward pass:
- The layer calculates mean and variance for each sample using the Engine's accelerated LayerNorm operation
- It normalizes, scales, and shifts the data in a single optimized operation
- It stores the statistics for the backward pass
This is much faster than doing it manually for each sample.
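As a sketch, with the default gamma = 1 and beta = 0 each row of the output should come back standardized; `batch` is assumed to be an existing Tensor<float> of shape [batchSize, 16]:

var layer = new LayerNormalizationLayer<float>(featureSize: 16);
Tensor<float> normalized = layer.Forward(batch);
// Each row of `normalized` should have mean ~0 and standard deviation ~1
// until gamma and beta are changed by training.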
ForwardGpu(params IGpuTensor<T>[])
GPU-resident forward pass for layer normalization. Normalizes input across the feature dimension entirely on GPU.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensor(s) to normalize.
Returns
- IGpuTensor<T>
GPU-resident normalized output tensor.
GetBeta()
Gets the beta (shift) parameters of the layer normalization layer.
public Vector<T> GetBeta()
Returns
- Vector<T>
The beta vector used for shifting scaled values.
GetBetaTensor()
Gets the beta tensor for JIT compilation and internal use.
public Tensor<T> GetBetaTensor()
Returns
- Tensor<T>
The beta (shift) parameters as a tensor.
GetEpsilon()
Gets the epsilon value used for numerical stability.
public T GetEpsilon()
Returns
- T
The epsilon value.
GetGamma()
Gets the gamma (scale) parameters of the layer normalization layer.
public Vector<T> GetGamma()
Returns
- Vector<T>
The gamma vector used for scaling normalized values.
GetGammaTensor()
Gets the gamma tensor for JIT compilation and internal use.
public Tensor<T> GetGammaTensor()
Returns
- Tensor<T>
The gamma (scale) parameters as a tensor.
GetNormalizedShape()
Gets the normalized shape (feature size) of the layer.
public int[] GetNormalizedShape()
Returns
- int[]
The normalized shape array.
GetParameters()
Gets all trainable parameters of the layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters.
Remarks
This method returns all trainable parameters of the layer (gamma and beta) combined into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from the layer.
The parameters:
- Are the numbers that the neural network learns during training
- Include the gamma (scaling) and beta (shifting) values
- Are combined into a single long list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
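A snapshot-and-restore sketch using the documented methods (how the vector is persisted to disk is up to the caller):

var layer = new LayerNormalizationLayer<float>(featureSize: 32);

Vector<float> saved = layer.GetParameters();   // gamma and beta combined into one vector
// ... train the layer, or create a fresh layer with the same featureSize ...
layer.SetParameters(saved);                    // length must equal ParameterCount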
ResetState()
Resets the internal state of the layer.
public override void ResetState()
Remarks
This method resets the internal state of the layer, clearing cached values from forward and backward passes. This includes the last input, normalized values, mean, standard deviation, and gradients.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- All stored information about previous inputs is removed
- All calculated statistics (mean, standard deviation) are cleared
- All gradient information is cleared
- The layer is ready for new data without being influenced by previous data
This is important for:
- Processing a new, unrelated batch of data
- Preventing information from one batch affecting another
- Starting a new training episode
SetParameters(Vector<T>)
Sets the trainable parameters of the layer.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters to set.
Remarks
This method sets all the trainable parameters of the layer from a single vector of parameters. The parameters vector must have the correct length (ParameterCount elements) to match the combined gamma and beta values of the layer.
For Beginners: This method updates all the learnable values in the layer.
When setting parameters:
- The input must be a vector with the correct length
- The layer parses this vector to set all its internal parameters
- Throws an error if the input doesn't match the expected number of parameters
This is useful for:
- Loading a previously saved model
- Transferring parameters from another model
- Setting specific parameter values for testing
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the parameters of the layer using the calculated gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates the gamma and beta parameters of the layer based on the gradients calculated during the backward pass. The learning rate controls the size of the parameter updates.
For Beginners: This method updates the layer's internal values during training.
When updating parameters:
- The gamma (scaling) and beta (shifting) values are adjusted to reduce prediction errors
- The learning rate controls how big each update step is
- Smaller learning rates mean slower but more stable learning
- Larger learning rates mean faster but potentially unstable learning
This is how the layer "learns" from data over time, gradually improving its ability to normalize inputs in the most helpful way for the network.
The method will throw an error if you try to run it before performing a backward pass.
Exceptions
- InvalidOperationException
Thrown when Backward has not been called before UpdateParameters.