Class GatedLinearUnitLayer<T>
Namespace: AiDotNet.NeuralNetworks.Layers
Assembly: AiDotNet.dll
Represents a Gated Linear Unit (GLU) layer in a neural network that combines linear transformation with multiplicative gating.
public class GatedLinearUnitLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → GatedLinearUnitLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable
Remarks
A Gated Linear Unit (GLU) is a neural network layer that combines linear transformations with a gating mechanism. It applies two parallel linear transformations to the input: one produces a linear output, and the other produces a gate that controls how much of the linear output passes through. The final output is the element-wise product of the linear output and the activated gate. GLUs were introduced to help with vanishing gradient problems in deep networks and have been particularly effective in natural language processing and sequence modeling tasks.
For Beginners: A Gated Linear Unit is like a smart filter that controls how much information flows through.
Imagine water flowing through a pipe with an adjustable valve:
- The water is the input data
- One part of the layer (linear part) processes the water
- Another part (gate) controls how much processed water flows through
- Together they decide "what information is important to keep"
For example, in language processing:
- The linear transformation might extract features from words
- The gate might decide which features are relevant to the current context
- Their combination helps the network focus on important information
GLUs are particularly good at:
- Controlling information flow through the network
- Helping gradients flow during training (preventing vanishing gradients)
- Allowing the network to selectively use information
This selectivity is valuable in many tasks, especially those involving sequences like text or time-series data.
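In formula form, the layer computes output = (input × W_linear + b_linear) multiplied element-wise by sigmoid(input × W_gate + b_gate). A minimal usage sketch, assuming float as the numeric type and that you already have a Tensor<float> called inputBatch with shape [batchSize, 128] (how that tensor is constructed depends on the Tensor<T> API in your version):
// Create a GLU layer mapping 128 input features to 64 output features.
// The gate activation defaults to Sigmoid when not specified.
var glu = new GatedLinearUnitLayer<float>(inputDimension: 128, outputDimension: 64);
// 'inputBatch' is assumed to be a Tensor<float> of shape [batchSize, 128].
Tensor<float> output = glu.Forward(inputBatch);   // shape: [batchSize, 64]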
Constructors
GatedLinearUnitLayer(int, int, IActivationFunction<T>?)
Initializes a new instance of the GatedLinearUnitLayer<T> class with a scalar (element-wise) gate activation function.
public GatedLinearUnitLayer(int inputDimension, int outputDimension, IActivationFunction<T>? gateActivation = null)
Parameters
inputDimension (int): The number of input features.
outputDimension (int): The number of output features.
gateActivation (IActivationFunction<T>): The activation function to apply to the gating mechanism. Defaults to Sigmoid if not specified.
GatedLinearUnitLayer(int, int, IVectorActivationFunction<T>?)
Initializes a new instance of the GatedLinearUnitLayer<T> class with a vector activation function.
public GatedLinearUnitLayer(int inputDimension, int outputDimension, IVectorActivationFunction<T>? gateActivation = null)
Parameters
inputDimension (int): The number of input features.
outputDimension (int): The number of output features.
gateActivation (IVectorActivationFunction<T>): The vector activation function to apply to the gating mechanism. Defaults to Sigmoid if not specified.
Remarks
This constructor creates a new GLU layer with the specified input and output dimensions and vector gate activation function. The weights for both paths are initialized with small random values, and the biases are initialized to zero. Unlike the other constructor, this one accepts a vector activation function that operates on entire vectors rather than individual scalar values.
For Beginners: This is an alternative setup that uses a different kind of activation function for the gate.
This constructor is almost identical to the first one, but with one key difference:
- Regular activation: processes each gate value separately
- Vector activation: processes the entire gate vector together
Vector activations might be useful for specialized gating where gate values should influence each other. For most common use cases, the standard constructor with sigmoid activation works well.
The default is still sigmoid activation, which is usually the best choice for GLU layers because its 0-1 range makes it ideal for gating.
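A brief sketch contrasting the two overloads (the variable and method names below are illustrative placeholders; use whatever IVectorActivationFunction<T> implementation your project provides):
// Overload 1: scalar gate activation; omitting the argument (or passing null) uses the default Sigmoid.
var gluScalar = new GatedLinearUnitLayer<float>(128, 64);
// Overload 2: vector gate activation. 'vectorGate' stands in for any
// IVectorActivationFunction<float> implementation available to you.
GatedLinearUnitLayer<float> CreateVectorGatedGlu(IVectorActivationFunction<float> vectorGate) =>
    new GatedLinearUnitLayer<float>(128, 64, vectorGate);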
Properties
ParameterCount
Gets the total number of trainable parameters in this layer.
public override int ParameterCount { get; }
Property Value
- int
The sum of elements in all weight and bias tensors (linear weights, gate weights, linear bias, gate bias).
Remarks
This property returns the total count of learnable parameters across all four parameter tensors: linear weights, gate weights, linear biases, and gate biases.
For Beginners: This tells you how many numbers the layer can adjust during training. For a GLU layer with 100 inputs and 50 outputs, you would have:
- 5,000 linear weights (100 × 50)
- 5,000 gate weights (100 × 50)
- 50 linear biases
- 50 gate biases
- Total: 10,100 parameters
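The same arithmetic written out as a small sketch (the formula is 2 × inputDim × outputDim + 2 × outputDim):
int inputDim = 100, outputDim = 50;
int linearWeights = inputDim * outputDim;  // 5,000
int gateWeights   = inputDim * outputDim;  // 5,000
int linearBiases  = outputDim;             // 50
int gateBiases    = outputDim;             // 50
int total = linearWeights + gateWeights + linearBiases + gateBiases;  // 10,100 == ParameterCount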
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer can be JIT compiled, false otherwise.
Remarks
This property indicates whether the layer has implemented ExportComputationGraph() and can benefit from JIT compilation. All layers MUST implement this property.
For Beginners: JIT compilation can make inference 5-10x faster by converting the layer's operations into optimized native code.
Layers should return false if they:
- Have not yet implemented a working ExportComputationGraph()
- Use dynamic operations that change based on input data
- Are too simple to benefit from JIT compilation
When false, the layer will use the standard Forward() method instead.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
Always true, because GLU layers have trainable parameters (weights and biases for both paths).
Remarks
This property indicates that the GLU layer supports training through backpropagation. The layer has trainable parameters (weights and biases for both linear and gating paths) that are updated during the training process.
For Beginners: This property tells you that this layer can learn from data.
A value of true means:
- The layer adjusts its weights and biases during training
- It improves its performance as it sees more data
- It has parameters for both the linear and gating paths that adapt
GLU layers are powerful learning components because they can learn both what features to extract and which ones are important in context.
Methods
Backward(Tensor<T>)
Performs the backward pass of the GLU layer to compute gradients.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient tensor from the next layer. Shape: [batchSize, outputDimension].
Returns
- Tensor<T>
The gradient tensor to be passed to the previous layer. Shape: [batchSize, inputDimension].
Remarks
This method implements the backward pass (backpropagation) of the GLU layer. It computes the gradients of the loss with respect to the layer's weights, biases, and inputs. The computation accounts for the two paths (linear and gating) and their interaction through element-wise multiplication.
For Beginners: This is where the layer learns from its mistakes during training.
The backward pass is more complex in GLU layers because of the two paths:
1. First, compute gradients for both paths:
  - Linear path gradient: outputGradient × gate values
  - Gate path gradient: outputGradient × linear output
2. For the gate path, apply the activation derivative:
  - This accounts for how the activation affected the gates
3. Compute gradients for all parameters:
  - Linear weights: based on the input and the linear gradient
  - Gate weights: based on the input and the gate gradient
  - Linear biases: sum of the linear gradients
  - Gate biases: sum of the gate gradients
4. Compute the gradient for the input (to pass to the previous layer):
  - Combine the contributions from both paths
This process ensures that both paths learn appropriately based on their contribution to the final output.
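A minimal standalone sketch of these formulas for a single example and a sigmoid gate (plain C# arrays; illustrative only, not the layer's internal implementation):
// Backward pass of a GLU for one example. Inputs: x, the cached linear-path
// output 'linear' = x·Wl + bl, the activated gate 's' = Sigmoid(x·Wg + bg),
// and dY = dLoss/dOutput. Accumulates parameter gradients and returns dX.
static float[] GluBackward(
    float[] x, float[] linear, float[] s, float[] dY,
    float[,] wLinear, float[,] wGate,
    float[,] dWLinear, float[,] dWGate, float[] dbLinear, float[] dbGate)
{
    int inputDim = x.Length, outputDim = dY.Length;
    var dLinear = new float[outputDim];
    var dGate = new float[outputDim];
    for (int j = 0; j < outputDim; j++)
    {
        dLinear[j] = dY[j] * s[j];                          // linear-path gradient
        dGate[j]   = dY[j] * linear[j] * s[j] * (1 - s[j]); // gate-path gradient (sigmoid derivative)
        dbLinear[j] += dLinear[j];                          // bias gradients are sums over the batch
        dbGate[j]   += dGate[j];
    }
    var dX = new float[inputDim];
    for (int i = 0; i < inputDim; i++)
        for (int j = 0; j < outputDim; j++)
        {
            dWLinear[i, j] += x[i] * dLinear[j];
            dWGate[i, j]   += x[i] * dGate[j];
            dX[i] += dLinear[j] * wLinear[i, j] + dGate[j] * wGate[i, j];  // combine both paths
        }
    return dX;
}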
Exceptions
- InvalidOperationException
Thrown when Backward is called before Forward.
BackwardGpu(IGpuTensor<T>)
Performs the backward pass using GPU-resident tensors.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): GPU-resident gradient of the loss w.r.t. the output.
Returns
- IGpuTensor<T>
GPU-resident gradient of the loss w.r.t. input.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the layer's computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): The list to populate with input computation nodes.
Returns
- ComputationNode<T>
The output computation node representing the layer's operation.
Remarks
This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.
For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.
To support JIT compilation, a layer must:
- Implement this method to export its computation graph
- Set SupportsJitCompilation to true
- Use ComputationNode and TensorOperations to build the graph
All layers are required to implement this method, even if they set SupportsJitCompilation = false.
Forward(Tensor<T>)
Performs the forward pass of the GLU layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process. Shape: [batchSize, inputDimension].
Returns
- Tensor<T>
The output tensor after gated linear transformation. Shape: [batchSize, outputDimension].
Remarks
This method implements the forward pass of the GLU layer. It performs two parallel linear transformations on the input: one for the linear path and one for the gating path. The gating path output is passed through an activation function (typically sigmoid), and then the two outputs are multiplied element-wise. This gating mechanism allows the layer to selectively pass information through.
For Beginners: This is where the layer processes input data through both paths.
The forward pass works in these steps:
1. Linear path: transform the input using the linear weights and biases.
  - This creates features that might be useful.
2. Gate path: transform the input using the gate weights and biases.
  - This determines how important each feature is.
3. Apply the activation to the gate values (typically sigmoid).
  - This converts the gate values to be between 0 and 1.
4. Multiply the linear output by the activated gate values.
  - This lets important features pass through and blocks others.
The result is that the layer can learn both:
- What features to extract (linear path)
- Which features are important in each context (gate path)
This selective focus helps the network learn more effectively.
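The same computation for a single input vector, as a standalone C# sketch (illustrative only, not the layer's internal code):
// GLU forward pass for one input vector x:
// output[j] = (x · wLinear[:, j] + bLinear[j]) * Sigmoid(x · wGate[:, j] + bGate[j])
static float[] GluForward(float[] x, float[,] wLinear, float[] bLinear,
                          float[,] wGate, float[] bGate)
{
    int inputDim = x.Length, outputDim = bLinear.Length;
    var output = new float[outputDim];
    for (int j = 0; j < outputDim; j++)
    {
        float linear = bLinear[j];   // linear-path pre-activation
        float gate = bGate[j];       // gate-path pre-activation
        for (int i = 0; i < inputDim; i++)
        {
            linear += x[i] * wLinear[i, j];
            gate   += x[i] * wGate[i, j];
        }
        float s = 1f / (1f + MathF.Exp(-gate)); // sigmoid keeps each gate in (0, 1)
        output[j] = linear * s;                 // element-wise gating
    }
    return output;
}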
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass on GPU using FusedLinearGpu for efficient computation.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU input tensors.
Returns
- IGpuTensor<T>
The GPU output tensor.
GetParameters()
Gets all trainable parameters of the layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters.
Remarks
This method retrieves all trainable parameters of the GLU layer as a single vector. The parameters include weights and biases for both the linear and gating paths. The order is: linear weights, gate weights, linear biases, gate biases.
For Beginners: This method collects all the layer's learnable values into a single list.
The parameters include four sets of values:
- Linear weights: Main transformation parameters
- Gate weights: Selection mechanism parameters
- Linear biases: Baseline adjustments for features
- Gate biases: Default settings for gates
All these values are collected in a specific order into a single vector. This combined list is useful for:
- Saving a trained model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques
For a layer with 100 inputs and 50 outputs, this would return:
- 5,000 linear weight parameters (100 × 50)
- 5,000 gate weight parameters (100 × 50)
- 50 linear bias parameters
- 50 gate bias parameters
- Totaling 10,100 parameters
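A brief sketch of the round trip this enables, using only the members documented on this page ('glu' is an existing trained layer; persisting the Vector<float> to disk is not shown):
// Capture all parameters of a layer with 100 inputs and 50 outputs as one flat vector.
Vector<float> saved = glu.GetParameters();
// Later, rebuild a layer with the same dimensions and restore the values.
var restored = new GatedLinearUnitLayer<float>(inputDimension: 100, outputDimension: 50);
restored.SetParameters(saved);   // the vector length must equal restored.ParameterCount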
ResetState()
Resets the internal state of the layer.
public override void ResetState()
Remarks
This method resets the internal state of the GLU layer by clearing all cached values from forward and backward passes. This includes inputs, intermediate outputs, and gradients.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- The saved input is cleared
- The saved linear and gate outputs are cleared
- All calculated gradients are cleared
- The layer forgets previous calculations it performed
This is typically called:
- Between training batches to free up memory
- When switching from training to evaluation mode
- When starting to process completely new data
It's like wiping a whiteboard clean before starting a new calculation. Note that this doesn't affect the learned weights and biases, just the temporary working data.
SetParameters(Vector<T>)
Sets the trainable parameters of the layer from a single vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters to set.
Remarks
This method sets all trainable parameters of the GLU layer from a single vector. The parameters should be in the same order as produced by GetParameters: linear weights, gate weights, linear biases, gate biases.
For Beginners: This method updates all the layer's learnable values from a provided list.
When setting parameters:
- The input must be a vector with the exact right length
- The values are distributed to the correct parameters in order
- They must follow the same order used in GetParameters
This method is useful for:
- Restoring a saved model
- Loading pre-trained parameters
- Testing specific parameter configurations
The method verifies that the vector contains exactly the right number of parameters before applying them.
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the weights and biases for both paths using the calculated gradients and the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates all trainable parameters of the GLU layer based on the gradients calculated during the backward pass. The parameters include weights and biases for both the linear and gating paths. The learning rate determines the size of the parameter updates.
For Beginners: This method changes the weights and biases to improve future predictions.
After calculating how each parameter should change:
- All parameters are adjusted in the direction that reduces errors
- The learning rate controls how big these adjustments are
The updates apply to all four sets of parameters:
- Linear weights: For better feature extraction
- Gate weights: For better selection of important features
- Linear biases: For better baseline feature values
- Gate biases: For better default gate openness
Each parameter moves a small step in the direction that improves performance. The update subtracts the gradient (scaled by the learning rate), which means moving in the opposite direction of the gradient to minimize error.
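A hedged sketch of where this call sits in a single training step (computing 'outputGradient' from your loss function happens outside the layer and is not shown):
// One simplified training step for a GLU layer with float parameters.
Tensor<float> prediction = glu.Forward(inputBatch);            // forward pass (caches state)
Tensor<float> inputGradient = glu.Backward(outputGradient);    // backward pass (stores parameter gradients)
glu.UpdateParameters(0.01f);                                   // apply the updates with learning rate 0.01
glu.ResetState();                                              // optional: clear cached state between batches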
Exceptions
- InvalidOperationException
Thrown when UpdateParameters is called before Backward.