Class LayerNormalizationLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents a Layer Normalization layer that normalizes inputs across the feature dimension.
public class LayerNormalizationLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → LayerNormalizationLayer<T>
- Implements
- ILayer<T>
- IJitCompilable<T>
- IDiagnosticsProvider
- IWeightLoadable<T>
- IDisposable
Remarks
Layer Normalization is a technique used to normalize the inputs to a layer, which can help improve training stability and speed. Unlike Batch Normalization which normalizes across the batch dimension, Layer Normalization normalizes across the feature dimension independently for each sample. This makes it particularly useful for recurrent networks and when batch sizes are small. The layer learns scale (gamma) and shift (beta) parameters to allow the network to recover the original representation if needed.
For Beginners: This layer helps stabilize and speed up training by standardizing the data.
Think of Layer Normalization like standardizing test scores:
- It makes each sample's features have a mean of 0 and standard deviation of 1
- It does this independently for each sample (unlike Batch Normalization)
- It applies this normalization along the feature dimension
- After normalizing, it scales and shifts the values using learnable parameters
For example, in a sentiment analysis task, some input sentences might use very positive words while others use more neutral language. Layer Normalization helps the network focus on the relative importance of features within each sample rather than their absolute values.
This is particularly useful for:
- Recurrent neural networks
- Cases where batch sizes are small
- Making training more stable and faster
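The core computation can be illustrated in a few lines of plain C#. This is a minimal sketch of the math only (not the library's implementation), normalizing a single 4-feature sample with the initial gamma = 1 and beta = 0:

using System;
using System.Linq;

// One sample with 4 features.
double[] x = { 2.0, 4.0, 6.0, 8.0 };
double epsilon = 1e-5;                 // small value for numerical stability

double mean = x.Average();                                           // 5.0
double variance = x.Select(v => (v - mean) * (v - mean)).Average();  // 5.0
double std = Math.Sqrt(variance + epsilon);

double gamma = 1.0, beta = 0.0;                                      // initial (neutral) values
double[] normalized = x.Select(v => gamma * ((v - mean) / std) + beta).ToArray();
// `normalized` now has mean ~0 and standard deviation ~1 for this sample.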
Constructors
LayerNormalizationLayer(int, double)
Initializes a new instance of the LayerNormalizationLayer<T> class with the specified feature size and epsilon value.
public LayerNormalizationLayer(int featureSize, double epsilon = 1E-05)
Parameters
featureSize (int): The number of features in the input data.
epsilon (double): A small value added to the variance for numerical stability. Defaults to 1e-5.
Remarks
This constructor creates a new Layer Normalization layer with the specified feature size and epsilon value. The gamma parameters are initialized to 1.0 and the beta parameters are initialized to 0.0.
For Beginners: This creates a new Layer Normalization layer with specific settings.
When creating this layer, you specify:
- featureSize: How many features each sample has (like dimensions in your data)
- epsilon: A tiny safety value to prevent division by zero (usually you can use the default)
The layer automatically initializes with:
- Gamma values of 1.0 for each feature (neutral scaling)
- Beta values of 0.0 for each feature (no initial shifting)
For example, if your data has 128 features, you would use featureSize=128.
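In code, a minimal construction sketch looks like this (the constructor signature matches this page; the values are illustrative):

var layerNorm = new LayerNormalizationLayer<float>(featureSize: 128);          // default epsilon = 1e-5
var layerNormCustom = new LayerNormalizationLayer<float>(128, epsilon: 1e-6);  // custom epsilon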
Properties
ParameterCount
Gets the total number of trainable parameters in the layer.
public override int ParameterCount { get; }
Property Value
- int
The total number of trainable parameters (gamma and beta values combined).
Remarks
This property returns the total number of trainable parameters in the layer, which is the combined length of the gamma and beta vectors (2 × featureSize). This is useful for allocating parameter buffers, or for validating a vector before passing it to SetParameters(Vector<T>).
For Beginners: This property tells you how many learnable values the layer has.
The count:
- Includes one gamma (scaling) value and one beta (shifting) value per feature
- Equals 2 × featureSize
This is useful for:
- Checking how large the layer is
- Verifying that a parameter vector you want to load has the right length
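For example, assuming the count is simply the combined length of the gamma and beta vectors, a layer with 128 features should report 256 parameters:

var layer = new LayerNormalizationLayer<float>(featureSize: 128);
int count = layer.ParameterCount;   // expected: 128 gamma + 128 beta = 256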
SupportsGpuExecution
Indicates whether this layer supports GPU-resident execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
true if the layer can execute its forward and backward passes entirely on the GPU.
SupportsJitCompilation
Gets whether this layer normalization layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT if:
- Gamma (scale) and beta (shift) parameters are initialized
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with learnable parameters (gamma and beta)
Unlike batch normalization, layer normalization doesn't require running statistics, so it can be JIT compiled immediately after initialization. It works the same way during training and inference, computing mean and variance on the fly for each sample.
Once initialized, JIT compilation can provide significant speedup (5-10x) by optimizing the per-sample normalization, scaling, and shifting operations.
This is especially important for Transformers where layer norm is used extensively in every encoder and decoder block.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
true because this layer has trainable parameters (gamma and beta).
Remarks
This property indicates whether the layer can be trained through backpropagation. The LayerNormalizationLayer always returns true because it contains trainable scale and shift parameters.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has parameters that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process
The Layer Normalization layer always supports training because it has gamma (scale) and beta (shift) parameters that are learned during training.
Methods
Backward(Tensor<T>)
Performs the backward pass of the layer normalization layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backward pass of the layer normalization, which is used during training to propagate error gradients back through the network. It calculates the gradients for the gamma and beta parameters, and returns the gradient with respect to the input for further backpropagation.
For Beginners: This method is used during training to calculate how the layer's input and parameters should change to reduce errors.
During the backward pass:
- The layer receives information about how its output contributed to errors
- It calculates how the gamma and beta parameters should change to reduce errors
- It calculates how the input should change, which will be used by earlier layers
This backward computation is complex because changing the mean and standard deviation of a sample affects all features, creating interdependencies in the gradients.
The method will throw an error if you try to run it before performing a forward pass.
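A hypothetical single training step is sketched below; `input` and `lossGradient` are assumed to be existing Tensor<float> values of shape [batchSize, featureSize] obtained elsewhere:

var layer = new LayerNormalizationLayer<float>(featureSize: 64);

Tensor<float> output = layer.Forward(input);                  // forward pass must run first
Tensor<float> inputGradient = layer.Backward(lossGradient);   // gradient passed to earlier layers
layer.UpdateParameters(0.01f);                                // apply the gamma/beta updates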
Exceptions
- InvalidOperationException
Thrown when Forward has not been called before Backward.
BackwardGpu(IGpuTensor<T>)
Computes the gradient of the loss with respect to the input on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- IGpuTensor<T>
The gradient of the loss with respect to the layer's input.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer normalization layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): The list to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the layer normalization operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, features]
2. Creates constant nodes for gamma (scale) and beta (shift) parameters
3. Applies the layer normalization operation: gamma * ((x - mean) / sqrt(variance + epsilon)) + beta
4. Unlike batch normalization, layer norm computes statistics per sample (no running statistics needed)
For Beginners: This method builds a symbolic representation of layer normalization for JIT.
JIT compilation converts the layer normalization operation into optimized native code. Layer normalization:
- Computes mean and variance for each sample independently across features
- Normalizes: (x - mean) / sqrt(variance + epsilon)
- Scales and shifts: result * gamma + beta
- Works identically during training and inference (no batch dependency)
The symbolic graph allows the JIT compiler to:
- Optimize the per-sample normalization formula
- Fuse the scale and shift operations
- Generate SIMD-optimized code for better performance
This is particularly important for Transformers and RNNs where layer norm is critical. Typically provides 5-10x speedup compared to interpreted execution.
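A minimal export sketch follows; how the returned node is consumed by the JIT compiler is not shown here and depends on the rest of the library:

using System.Collections.Generic;

var layer = new LayerNormalizationLayer<float>(featureSize: 128);

var inputNodes = new List<ComputationNode<float>>();
ComputationNode<float> graph = layer.ExportComputationGraph(inputNodes);
// `inputNodes` now holds the symbolic input node; `graph` represents
// gamma * ((x - mean) / sqrt(variance + epsilon)) + beta.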
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer shape or parameters are not initialized.
Forward(Tensor<T>)
Performs the forward pass of the layer normalization layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to normalize. Shape should be [batchSize, featureSize].
Returns
- Tensor<T>
The normalized tensor with the same shape as the input.
Remarks
This method implements the forward pass of the layer normalization. It uses the Engine's accelerated LayerNorm operation to normalize each sample independently across the feature dimension.
For Beginners: This method normalizes your data as it passes through the layer.
During the forward pass:
- The layer calculates mean and variance for each sample using the Engine's accelerated LayerNorm operation
- It normalizes, scales, and shifts the data in a single optimized operation
- It stores the statistics for the backward pass
This is much faster than doing it manually for each sample.
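As a sketch, with the default gamma = 1 and beta = 0 each row of the output should come back standardized; `batch` is assumed to be an existing Tensor<float> of shape [batchSize, 16]:

var layer = new LayerNormalizationLayer<float>(featureSize: 16);
Tensor<float> normalized = layer.Forward(batch);
// Each row of `normalized` should have mean ~0 and standard deviation ~1
// until gamma and beta are changed by training.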
ForwardGpu(params IGpuTensor<T>[])
GPU-resident forward pass for layer normalization. Normalizes input across the feature dimension entirely on GPU.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensor(s) to normalize.
Returns
- IGpuTensor<T>
GPU-resident normalized output tensor.
GetBeta()
Gets the beta (shift) parameters of the layer normalization layer.
public Vector<T> GetBeta()
Returns
- Vector<T>
The beta vector used for shifting scaled values.
GetBetaTensor()
Gets the beta tensor for JIT compilation and internal use.
public Tensor<T> GetBetaTensor()
Returns
- Tensor<T>
The beta (shift) parameters as a tensor.
GetEpsilon()
Gets the epsilon value used for numerical stability.
public T GetEpsilon()
Returns
- T
The epsilon value.
GetGamma()
Gets the gamma (scale) parameters of the layer normalization layer.
public Vector<T> GetGamma()
Returns
- Vector<T>
The gamma vector used for scaling normalized values.
GetGammaTensor()
Gets the gamma tensor for JIT compilation and internal use.
public Tensor<T> GetGammaTensor()
Returns
- Tensor<T>
The gamma (scale) parameters as a tensor.
GetNormalizedShape()
Gets the normalized shape (feature size) of the layer.
public int[] GetNormalizedShape()
Returns
- int[]
The normalized shape array.
GetParameters()
Gets all trainable parameters of the layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters.
Remarks
This method returns all trainable parameters of the layer (gamma and beta) combined into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from the layer.
The parameters:
- Are the numbers that the neural network learns during training
- Include the gamma (scaling) and beta (shifting) values
- Are combined into a single long list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
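A snapshot-and-restore sketch using the documented methods (how the vector is persisted to disk is up to the caller):

var layer = new LayerNormalizationLayer<float>(featureSize: 32);

Vector<float> saved = layer.GetParameters();   // gamma and beta combined into one vector
// ... train the layer, or create a fresh layer with the same featureSize ...
layer.SetParameters(saved);                    // length must equal ParameterCount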
ResetState()
Resets the internal state of the layer.
public override void ResetState()
Remarks
This method resets the internal state of the layer, clearing cached values from forward and backward passes. This includes the last input, normalized values, mean, standard deviation, and gradients.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- All stored information about previous inputs is removed
- All calculated statistics (mean, standard deviation) are cleared
- All gradient information is cleared
- The layer is ready for new data without being influenced by previous data
This is important for:
- Processing a new, unrelated batch of data
- Preventing information from one batch affecting another
- Starting a new training episode
SetParameters(Vector<T>)
Sets the trainable parameters of the layer.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters to set.
Remarks
This method sets all the trainable parameters of the layer from a single vector of parameters. The parameters vector must have the correct length (ParameterCount elements) to match the combined gamma and beta values of the layer.
For Beginners: This method updates all the learnable values in the layer.
When setting parameters:
- The input must be a vector with the correct length
- The layer parses this vector to set all its internal parameters
- Throws an error if the input doesn't match the expected number of parameters
This is useful for:
- Loading a previously saved model
- Transferring parameters from another model
- Setting specific parameter values for testing
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the parameters of the layer using the calculated gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates the gamma and beta parameters of the layer based on the gradients calculated during the backward pass. The learning rate controls the size of the parameter updates.
For Beginners: This method updates the layer's internal values during training.
When updating parameters:
- The gamma (scaling) and beta (shifting) values are adjusted to reduce prediction errors
- The learning rate controls how big each update step is
- Smaller learning rates mean slower but more stable learning
- Larger learning rates mean faster but potentially unstable learning
This is how the layer "learns" from data over time, gradually improving its ability to normalize inputs in the most helpful way for the network.
The method will throw an error if you try to run it before performing a backward pass.
Exceptions
- InvalidOperationException
Thrown when Backward has not been called before UpdateParameters.