Class BatchNormalizationLayer<T>
- Namespace
- AiDotNet.NeuralNetworks.Layers
- Assembly
- AiDotNet.dll
Implements batch normalization for neural networks, which normalizes the inputs across a mini-batch.
public class BatchNormalizationLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for computations (e.g., float, double).
- Inheritance
- LayerBase<T>
- BatchNormalizationLayer<T>
- Implements
- ILayer<T>
- IJitCompilable<T>
- IDiagnosticsProvider
- IWeightLoadable<T>
- IDisposable
Remarks
Batch normalization helps stabilize and accelerate training by normalizing layer inputs. It works by normalizing each feature to have zero mean and unit variance across the batch, then applying learnable scale (gamma) and shift (beta) parameters.
Benefits include:
- Faster training convergence
- Reduced sensitivity to weight initialization
- Ability to use higher learning rates
- Acts as a form of regularization
For Beginners: Batch normalization is like standardizing test scores in a classroom.
Imagine a class where each student (input) has a raw test score. Batch normalization:
- Calculates the average score and how spread out the scores are
- Converts each score to show how many standard deviations it is from the average
- Applies adjustable scaling and shifting to the standardized scores
This helps neural networks learn more efficiently by:
- Keeping input values in a consistent range
- Reducing the "internal covariate shift" problem
- Making the network less sensitive to poor weight initialization
- Allowing higher learning rates without divergence
In practice, this means your network will typically train faster and perform better.
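The arithmetic can be illustrated with a small, library-independent example (plain C#; the values are made up purely for illustration):

using System;
using System.Linq;

// Illustrative only: batch-normalizing one feature across a batch of 4 samples.
double[] x = { 2.0, 4.0, 6.0, 8.0 };
double epsilon = 1e-5;

// 1. Batch statistics.
double mean = x.Average();                                            // 5.0
double variance = x.Select(v => (v - mean) * (v - mean)).Average();   // 5.0

// 2. Standardize: roughly zero mean, unit variance.
double[] normalized = x.Select(v => (v - mean) / Math.Sqrt(variance + epsilon)).ToArray();

// 3. Learnable scale (gamma) and shift (beta); example values.
double gamma = 1.5, beta = 0.5;
double[] y = normalized.Select(n => gamma * n + beta).ToArray();

Console.WriteLine(string.Join(", ", y));  // approximately -1.51, -0.17, 1.17, 2.51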
Constructors
BatchNormalizationLayer(int, double, double)
public BatchNormalizationLayer(int numFeatures, double epsilon = 1E-05, double momentum = 0.9)
Parameters
numFeatures (int): The number of features (inputs) to normalize.
epsilon (double): A small constant added to the variance for numerical stability. Default: 1e-05.
momentum (double): The momentum used when updating the running mean and variance. Default: 0.9.
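A minimal construction sketch, assuming only the constructor signature shown above (the namespace is AiDotNet.NeuralNetworks.Layers; surrounding model code is omitted):

using AiDotNet.NeuralNetworks.Layers;

// Layer for 64 features with the default epsilon (1e-05) and momentum (0.9).
var batchNorm = new BatchNormalizationLayer<double>(numFeatures: 64);

// Explicit epsilon and momentum.
var customBatchNorm = new BatchNormalizationLayer<float>(numFeatures: 128, epsilon: 1e-3, momentum: 0.99);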
Properties
ParameterCount
Gets the total number of trainable parameters in the batch normalization layer.
public override int ParameterCount { get; }
Property Value
- int
The total number of trainable parameters: one gamma (scale) and one beta (shift) value per feature, i.e., twice the number of features.
Remarks
This property reports how many trainable parameters the layer has:
- Gamma (scale) parameters: one per feature
- Beta (shift) parameters: one per feature
This is useful for optimization algorithms that need to size parameter buffers, or for code that saves and loads model weights.
For Beginners: This property tells you how many learnable values the layer has.
Batch normalization has two sets of learnable parameters:
- Gamma (scale): Controls how much to stretch or compress the normalized data
- Beta (shift): Controls how much to move the normalized data up or down
There is one gamma and one beta per feature, so the total count is twice the number of features. For example, with 3 features the count is 6, and GetParameters() returns them in the order:
[gamma1, gamma2, gamma3, beta1, beta2, beta3]
Knowing this count is useful for:
- Saving and loading models
- Advanced optimization algorithms that work with all parameters at once
- Regularization techniques that need to access all parameters
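A short sketch using only members documented on this page (ParameterCount, GetGamma, GetBeta); how individual values are read from the returned tensors is not shown:

using System;
using AiDotNet.NeuralNetworks.Layers;

var layer = new BatchNormalizationLayer<double>(numFeatures: 3);

// One gamma and one beta per feature: 2 * 3 = 6.
Console.WriteLine(layer.ParameterCount);   // 6

// The scale and shift tensors can also be read individually.
var gamma = layer.GetGamma();   // 3 scale values
var beta = layer.GetBeta();     // 3 shift values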
SupportsGpuExecution
Gets whether this layer has a GPU implementation.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this batch normalization layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
true if the layer parameters and running statistics are initialized; otherwise, false.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT if:
- Gamma (scale) and beta (shift) parameters are initialized
- Running mean and variance statistics are initialized (from training)
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with learnable parameters (gamma and beta)
- The model has been trained, so running statistics are available
Batch normalization during inference requires running statistics collected during training, so JIT compilation is only supported after the model has been trained at least once.
Once these conditions are met, JIT compilation can provide significant speedup (5-10x) by optimizing the normalization, scaling, and shifting operations.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
true if the layer has trainable parameters and supports backpropagation; otherwise, false.
Remarks
This property indicates whether the layer can be trained through backpropagation. Layers with trainable parameters such as weights and biases typically return true, while layers that only perform fixed transformations (like pooling or activation layers) typically return false.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has parameters that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process
A value of false means:
- The layer doesn't have any adjustable parameters
- It performs the same operation regardless of training
- It doesn't need to learn (but may still be useful)
Methods
Backward(Tensor<T>)
Performs the backward pass of batch normalization.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
The backward pass computes three types of gradients:
1. Gradients for the input (to pass to previous layers)
2. Gradients for gamma (scale parameter)
3. Gradients for beta (shift parameter)
This is a complex calculation that accounts for how each input affects:
- The normalized value directly
- The batch mean
- The batch variance
The implementation follows the chain rule of calculus to properly backpropagate through all operations in the forward pass.
For Beginners: This method calculates how the error gradients flow backward through this layer.
During backpropagation, this method:
- Checks that Forward() was called first
- Creates tensors to hold the gradients for inputs and parameters
- Calculates the inverse standard deviation (1/sqrt(variance + epsilon))
- For each feature:
- Sums the output gradients across the batch
- Sums the product of output gradients and normalized values
- Calculates gradients for gamma and beta parameters
- Calculates gradients for each input value
The calculation is complex because in batch normalization, each input affects:
- Its own normalized value directly
- The mean of the batch (which affects all normalized values)
- The variance of the batch (which affects all normalized values)
The formula accounts for all these dependencies using the chain rule of calculus.
This method stores the gradients for gamma and beta to use during parameter updates, and returns the gradient for the input to pass to previous layers.
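The per-feature gradient formulas implied by this description can be sketched in plain C#; this is illustrative math only, not the layer's internal code, and it assumes the normalized values and inverse standard deviation were cached by the forward pass:

using System.Linq;

// Gradients for a single feature across a batch of m samples.
// dy:     gradient of the loss w.r.t. this layer's output
// xhat:   normalized values cached from the forward pass
// invStd: 1 / sqrt(variance + epsilon), also from the forward pass
static (double dGamma, double dBeta, double[] dX) BackwardOneFeature(
    double[] dy, double[] xhat, double invStd, double gamma)
{
    int m = dy.Length;
    double dBeta = dy.Sum();                              // gradient for the shift parameter
    double dGamma = dy.Zip(xhat, (g, n) => g * n).Sum();  // gradient for the scale parameter

    // Input gradient: the direct path plus the paths through the
    // batch mean and variance, combined via the chain rule.
    double[] dX = new double[m];
    for (int i = 0; i < m; i++)
        dX[i] = gamma * invStd / m * (m * dy[i] - dBeta - xhat[i] * dGamma);

    return (dGamma, dBeta, dX);
}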
Exceptions
- InvalidOperationException
Thrown when backward is called before forward.
BackwardGpu(IGpuTensor<T>)
Performs the GPU-resident backward pass for the batch normalization layer, computing gradients for the input, gamma, and beta entirely on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The GPU-resident gradient from the next layer.
Returns
- IGpuTensor<T>
GPU-resident gradient to pass to the previous layer.
Exceptions
- InvalidOperationException
Thrown if ForwardGpu was not called first.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the batch normalization layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): The list to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the batch normalization operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, features]
2. Creates constant nodes for gamma (scale) and beta (shift) parameters
3. Uses running statistics (mean and variance) for inference mode
4. Applies the batch normalization operation: gamma * ((x - mean) / sqrt(variance + epsilon)) + beta
For Beginners: This method builds a symbolic representation of batch normalization for JIT.
JIT compilation converts the batch normalization operation into optimized native code. During inference (prediction), batch normalization uses:
- Running mean and variance collected during training (not batch statistics)
- Learned scale (gamma) and shift (beta) parameters
The symbolic graph allows the JIT compiler to:
- Optimize the normalization formula: (x - mean) / sqrt(variance + epsilon)
- Fuse the scale and shift operations: result * gamma + beta
- Generate SIMD-optimized code for better performance
This typically provides 5-10x speedup compared to interpreted execution.
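The fusion mentioned above works because at inference time everything except the input is constant; a hedged sketch of the idea (not the compiler's actual output):

using System;

// At inference time mean, variance, gamma, and beta are all constants, so
// y = gamma * ((x - mean) / sqrt(variance + epsilon)) + beta
// can be folded into a single multiply-add per element: y = x * scale + shift.
double gamma = 1.2, beta = 0.3, mean = 0.5, variance = 2.0, epsilon = 1e-5;

double scale = gamma / Math.Sqrt(variance + epsilon);
double shift = beta - mean * scale;

double x = 1.7;
double y = x * scale + shift;   // identical to the unfused formula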
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer shape or parameters are not initialized.
Forward(Tensor<T>)
Performs the forward pass of batch normalization.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor with shape [batchSize, featureSize].
Returns
- Tensor<T>
The normalized, scaled, and shifted output tensor.
Remarks
The forward pass performs these steps:
1. If in training mode:
- Compute the mean and variance of the current batch
- Update the running statistics for inference
- Normalize using the batch statistics
2. If in inference mode:
- Normalize using the running statistics collected during training
3. Apply the scale (gamma) and shift (beta) parameters
The normalization formula is: y = gamma * ((x - mean) / sqrt(variance + epsilon)) + beta
For Beginners: This method normalizes the input data and applies learned scaling and shifting.
During the forward pass, this method:
- Saves the input for later use in backpropagation
- If in training mode:
- Calculates the mean and variance of each feature across the batch
- Updates the running statistics for use during inference
- Normalizes the data using the batch statistics
- If in inference/testing mode:
- Uses the running statistics collected during training
- Applies the learned scale (gamma) and shift (beta) parameters
The normalization makes each feature have approximately zero mean and unit variance, while the scale and shift parameters allow the network to learn the optimal distribution for each feature.
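A compact, library-independent sketch of the two modes for a single feature, illustrating the formula above rather than the layer's actual implementation:

using System;
using System.Linq;

// Forward pass for a single feature; illustrative only.
static double[] ForwardOneFeature(
    double[] x, double gamma, double beta, double epsilon,
    bool training, double runningMean, double runningVariance)
{
    double mean, variance;
    if (training)
    {
        // Training: statistics of the current batch.
        mean = x.Average();
        variance = x.Select(v => (v - mean) * (v - mean)).Average();
        // (The layer also updates its running statistics here; see GetMomentum.)
    }
    else
    {
        // Inference: running statistics collected during training.
        mean = runningMean;
        variance = runningVariance;
    }

    double invStd = 1.0 / Math.Sqrt(variance + epsilon);
    return x.Select(v => gamma * (v - mean) * invStd + beta).ToArray();
}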
ForwardGpu(params IGpuTensor<T>[])
Performs the GPU-resident batch normalization forward pass.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensors.
Returns
- IGpuTensor<T>
GPU-resident output tensor with same shape as input.
Remarks
This method performs batch normalization entirely on GPU, avoiding CPU round-trips. The input and output tensors remain GPU-resident for chained GPU operations.
During training mode, running statistics (mean and variance) are updated on GPU and then downloaded back to CPU for persistence.
Exceptions
- InvalidOperationException
Thrown when GPU engine is not available.
GetBeta()
Gets the beta (shift) parameters of the batch normalization layer.
public Tensor<T> GetBeta()
Returns
- Tensor<T>
The beta tensor used for shifting scaled values.
GetEpsilon()
Gets the epsilon value used for numerical stability.
public T GetEpsilon()
Returns
- T
The epsilon value.
GetGamma()
Gets the gamma (scale) parameters of the batch normalization layer.
public Tensor<T> GetGamma()
Returns
- Tensor<T>
The gamma tensor used for scaling normalized values.
Remarks
Gamma is the learnable scale parameter of batch normalization, with one value per feature. It is applied after the input has been normalized: y = gamma * normalized + beta.
For Beginners: Gamma controls how much the normalized data is stretched or compressed for each feature.
Normalization gives each feature approximately zero mean and unit variance. Gamma then lets the network learn the best scale for each feature, and together with beta (the shift parameter) it allows the layer to recover whatever distribution works best for the layers that follow.
GetMomentum()
Gets the momentum value for running statistics.
public T GetMomentum()
Returns
- T
The momentum value.
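For intuition, running statistics are typically maintained as an exponential moving average controlled by this momentum; a common convention looks like the following (the layer's exact update formula may differ):

// Common exponential-moving-average convention (illustrative only).
double momentum = 0.9;
double runningMean = 0.0, runningVariance = 1.0;   // example starting values
double batchMean = 0.4, batchVariance = 1.3;       // statistics of the current batch

runningMean = momentum * runningMean + (1 - momentum) * batchMean;             // 0.04
runningVariance = momentum * runningVariance + (1 - momentum) * batchVariance; // 1.03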
GetParameters()
Gets all trainable parameters of the layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters.
Remarks
This method returns all trainable parameters of the layer as a single vector: the gamma (scale) values followed by the beta (shift) values. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from the layer.
The parameters:
- Are the numbers that the neural network learns during training
- Include weights, biases, and other learnable values
- Are combined into a single long list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
GetRunningMean()
Gets the running mean of the batch normalization layer.
public Tensor<T> GetRunningMean()
Returns
- Tensor<T>
The running mean tensor used during inference.
GetRunningVariance()
Gets the running variance of the batch normalization layer.
public Tensor<T> GetRunningVariance()
Returns
- Tensor<T>
The running variance tensor used during inference.
ResetState()
Resets the internal state of the batch normalization layer.
public override void ResetState()
Remarks
This method clears all cached values from the forward and backward passes, including:
- Last input tensor
- Last normalized values
- Last batch mean and variance
- Gradients for gamma and beta parameters
It does NOT reset the learned parameters (gamma and beta) or the running statistics (running mean and variance) used for inference.
This is typically called when starting a new training epoch or when switching between training and inference modes.
For Beginners: This method clears the layer's memory of previous calculations.
During training, the batch normalization layer keeps track of:
- The last input it processed
- The normalized values it calculated
- The mean and variance of the last batch
- The gradients for its parameters
This method clears all of these temporary values, which is useful when:
- Starting a new training epoch
- Switching between training and testing modes
- Ensuring the layer behaves deterministically
Important: This does NOT reset the learned parameters (gamma and beta) or the running statistics (running mean and variance) that are used during inference. It only clears temporary calculation values.
Think of it as clearing the layer's short-term memory while preserving its long-term learning.
SetParameters(Vector<T>)
Sets all trainable parameters of the batch normalization layer.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters (gamma and beta) concatenated together.
Remarks
This method expects a single vector containing all trainable parameters:
- First half: gamma (scale) parameters
- Second half: beta (shift) parameters
The length of the parameters vector must be exactly twice the feature size. This method is useful for loading pre-trained weights or setting parameters after optimization.
For Beginners: This method loads parameters into the layer from a single vector.
This is the counterpart to GetParameters() - it takes a vector containing all parameters and sets them in the layer. The vector must have the format:
[gamma1, gamma2, ..., gammaN, beta1, beta2, ..., betaN]
Where N is the number of features. The total length must be exactly 2*N.
This method is commonly used for:
- Loading pre-trained models
- Setting parameters after external optimization
- Implementing transfer learning
- Testing different parameter configurations
If the vector doesn't have the expected length, the method will throw an exception to prevent incorrect parameter assignments.
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the layer's parameters using the computed gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate for parameter updates.
Remarks
This method updates the gamma (scale) and beta (shift) parameters using gradient descent:
- gamma = gamma - learningRate * gammaGradient
- beta = beta - learningRate * betaGradient
The gradients are computed during the backward pass and represent how much each parameter should change to reduce the loss function.
For Beginners: This method updates the layer's learnable parameters during training.
After the backward pass calculates how each parameter affects the error, this method adjusts those parameters to reduce the error:
- It checks that the backward pass has been called first
- It updates the gamma (scale) parameters: gamma = gamma - learningRate * gammaGradient
- It updates the beta (shift) parameters: beta = beta - learningRate * betaGradient
The learning rate controls how big the updates are:
- A larger learning rate means bigger changes (faster learning but potentially unstable)
- A smaller learning rate means smaller changes (slower but more stable learning)
For example, if a particular gamma value is causing high error, its gradient will be large, and this method will adjust that parameter more significantly to reduce the error in the next forward pass.
This is the step where actual "learning" happens in the neural network.
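Spelled out per feature, the update is plain gradient descent (illustrative snippet, not the layer's internal code):

// Gradient-descent update for a single feature's parameters.
double learningRate = 0.01;
double gamma = 1.2, beta = 0.1;                    // current parameter values
double gammaGradient = 0.5, betaGradient = -0.2;   // computed by the backward pass

gamma -= learningRate * gammaGradient;   // 1.2 - 0.005 = 1.195
beta  -= learningRate * betaGradient;    // 0.1 - (-0.002) = 0.102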
Exceptions
- InvalidOperationException
Thrown when update is called before backward.
ZeroInitGamma()
Initializes gamma (scale) parameters to zero.
public void ZeroInitGamma()
Remarks
This is used for zero-init residual in ResNet, where the last BatchNorm in each residual block has gamma initialized to zero. This makes the residual blocks start as identity mappings, which can improve training.
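A usage sketch for the zero-init-residual pattern, assuming only the members documented on this page (wiring the layer into a residual block is up to the surrounding model code):

using AiDotNet.NeuralNetworks.Layers;

// The last batch normalization layer in a residual block.
var lastBatchNorm = new BatchNormalizationLayer<float>(numFeatures: 256);

// Zero-init residual: the block initially contributes nothing,
// so the residual path starts as an identity mapping.
lastBatchNorm.ZeroInitGamma();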