
Class GatedLinearUnitLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Represents a Gated Linear Unit (GLU) layer in a neural network that combines linear transformation with multiplicative gating.

public class GatedLinearUnitLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
object
LayerBase<T>
GatedLinearUnitLayer<T>
Implements
ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>, IDisposable

Remarks

A Gated Linear Unit (GLU) is a neural network layer that combines linear transformations with a gating mechanism. It applies two parallel linear transformations to the input: one produces a linear output, and the other produces a gate that controls how much of the linear output passes through. The final output is the element-wise product of the linear output and the activated gate. GLUs were introduced to help with vanishing gradient problems in deep networks and have been particularly effective in natural language processing and sequence modeling tasks.

For Beginners: A Gated Linear Unit is like a smart filter that controls how much information flows through.

Imagine water flowing through a pipe with an adjustable valve:

  • The water is the input data
  • One part of the layer (linear part) processes the water
  • Another part (gate) controls how much processed water flows through
  • Together they decide "what information is important to keep"

For example, in language processing:

  • The linear transformation might extract features from words
  • The gate might decide which features are relevant to the current context
  • Their combination helps the network focus on important information

GLUs are particularly good at:

  • Controlling information flow through the network
  • Helping gradients flow during training (preventing vanishing gradients)
  • Allowing the network to selectively use information

This selectivity is valuable in many tasks, especially those involving sequences like text or time-series data.
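To make the description concrete, here is a minimal, self-contained sketch of the GLU computation on plain C# arrays. It illustrates the formula described above (the linear output multiplied element-wise by a sigmoid-activated gate); it is not the library's internal implementation, and the class and method names are purely illustrative.

using System;

static class GluSketch
{
    // For each output feature j: y[j] = linear[j] * sigmoid(gate[j]).
    public static double[] Forward(double[] x, double[,] wLinear, double[] bLinear,
                                   double[,] wGate, double[] bGate)
    {
        int outputDim = bLinear.Length;
        var y = new double[outputDim];
        for (int j = 0; j < outputDim; j++)
        {
            double linear = bLinear[j];
            double gate = bGate[j];
            for (int i = 0; i < x.Length; i++)
            {
                linear += x[i] * wLinear[i, j]; // linear path: extract a feature
                gate   += x[i] * wGate[i, j];   // gate path: score how relevant that feature is
            }
            double g = 1.0 / (1.0 + Math.Exp(-gate)); // sigmoid squashes the gate into (0, 1)
            y[j] = linear * g;                        // the gate decides how much passes through
        }
        return y;
    }
}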

Constructors

GatedLinearUnitLayer(int, int, IActivationFunction<T>?)

Initializes a new instance of the GatedLinearUnitLayer<T> class with a scalar activation function.

public GatedLinearUnitLayer(int inputDimension, int outputDimension, IActivationFunction<T>? gateActivation = null)

Parameters

inputDimension int

The number of input features.

outputDimension int

The number of output features.

gateActivation IActivationFunction<T>

The activation function to apply to the gating mechanism. Defaults to Sigmoid if not specified.

GatedLinearUnitLayer(int, int, IVectorActivationFunction<T>?)

Initializes a new instance of the GatedLinearUnitLayer<T> class with a vector activation function.

public GatedLinearUnitLayer(int inputDimension, int outputDimension, IVectorActivationFunction<T>? gateActivation = null)

Parameters

inputDimension int

The number of input features.

outputDimension int

The number of output features.

gateActivation IVectorActivationFunction<T>

The vector activation function to apply to the gating mechanism. Defaults to Sigmoid if not specified.

Remarks

This constructor creates a new GLU layer with the specified input and output dimensions and vector gate activation function. The weights for both paths are initialized with small random values, and the biases are initialized to zero. Unlike the other constructor, this one accepts a vector activation function that operates on entire vectors rather than individual scalar values.

For Beginners: This is an alternative setup that uses a different kind of activation function for the gate.

This constructor is almost identical to the first one, but with one key difference:

  • Regular activation: processes each gate value separately
  • Vector activation: processes the entire gate vector together

Vector activations might be useful for specialized gating where gate values should influence each other. For most common use cases, the standard constructor with sigmoid activation works well.

The default is still sigmoid activation, which is usually the best choice for GLU layers because its 0-1 range makes it ideal for gating.
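As a rough usage sketch, creating a GLU layer looks like the lines below. The namespace import matches the one listed at the top of this page; the dimensions and the choice of float are arbitrary examples. Leaving the gate activation out keeps the sigmoid default; if the compiler reports an ambiguity between the two constructors, pass an explicitly typed null (for example (IActivationFunction<float>?)null) to pick the scalar-activation overload.

using AiDotNet.NeuralNetworks.Layers;

// A GLU layer that maps 128 input features to 64 gated output features,
// using the default sigmoid gate activation.
var glu = new GatedLinearUnitLayer<float>(inputDimension: 128, outputDimension: 64);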

Properties

ParameterCount

Gets the total number of trainable parameters in this layer.

public override int ParameterCount { get; }

Property Value

int

The total number of elements across all weight and bias tensors (linear weights, gate weights, linear biases, gate biases).

Remarks

This property returns the total count of learnable parameters across all four parameter tensors: linear weights, gate weights, linear biases, and gate biases.

For Beginners: This tells you how many numbers the layer can adjust during training. For a GLU layer with 100 inputs and 50 outputs, you would have:

  • 5,000 linear weights (100 × 50)
  • 5,000 gate weights (100 × 50)
  • 50 linear biases
  • 50 gate biases
  • Total: 10,100 parameters
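That arithmetic generalizes to any layer size. A one-line sketch (the helper name is illustrative, and the authoritative value always comes from ParameterCount itself):

// Two weight matrices of shape [inputDimension, outputDimension]
// plus two bias vectors of length outputDimension.
static int GluParameterCount(int inputDim, int outputDim)
    => 2 * inputDim * outputDim + 2 * outputDim;

// GluParameterCount(100, 50) == 10_100, matching the example above.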

SupportsGpuExecution

Gets a value indicating whether this layer supports GPU execution.

protected override bool SupportsGpuExecution { get; }

Property Value

bool

SupportsJitCompilation

Gets whether this layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

True if the layer can be JIT compiled, false otherwise.

Remarks

This property indicates whether the layer has implemented ExportComputationGraph() and can benefit from JIT compilation. All layers MUST implement this property.

For Beginners: JIT compilation can make inference 5-10x faster by converting the layer's operations into optimized native code.

Layers should return false if they:

  • Have not yet implemented a working ExportComputationGraph()
  • Use dynamic operations that change based on input data
  • Are too simple to benefit from JIT compilation

When false, the layer will use the standard Forward() method instead.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

Always true because GLU layers have trainable parameters (weights and biases for both paths).

Remarks

This property indicates that the GLU layer supports training through backpropagation. The layer has trainable parameters (weights and biases for both linear and gating paths) that are updated during the training process.

For Beginners: This property tells you that this layer can learn from data.

A value of true means:

  • The layer adjusts its weights and biases during training
  • It improves its performance as it sees more data
  • It has parameters for both the linear and gating paths that adapt

GLU layers are powerful learning components because they can learn both what features to extract and which ones are important in context.

Methods

Backward(Tensor<T>)

Performs the backward pass of the GLU layer to compute gradients.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient tensor from the next layer. Shape: [batchSize, outputDimension].

Returns

Tensor<T>

The gradient tensor to be passed to the previous layer. Shape: [batchSize, inputDimension].

Remarks

This method implements the backward pass (backpropagation) of the GLU layer. It computes the gradients of the loss with respect to the layer's weights, biases, and inputs. The computation accounts for the two paths (linear and gating) and their interaction through element-wise multiplication.

For Beginners: This is where the layer learns from its mistakes during training.

The backward pass is more complex in GLU layers because of the two paths:

  1. First, compute gradients for both paths:

    • Linear path gradient: outputGradient × gate values
    • Gate path gradient: outputGradient × linear output
  2. For the gate path, apply the activation derivative

    • This accounts for how the activation affected the gates
  3. Compute gradients for all parameters:

    • Linear weights: Based on input and linear gradient
    • Gate weights: Based on input and gate gradient
    • Linear biases: Sum of linear gradients
    • Gate biases: Sum of gate gradients
  4. Compute gradient for the input (to pass to previous layer):

    • Combine contributions from both paths

This process ensures that both paths learn appropriately based on their contribution to the final output.
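For readers who want the arithmetic spelled out, the sketch below computes those gradients for a single example on plain arrays, assuming a sigmoid gate. It mirrors the numbered steps above but is not the library's internal implementation; lin and g are the cached linear output and activated gate from the forward pass, and the method can be dropped into the same illustrative sketch class used earlier.

// Accumulates parameter gradients in dWLinear/dWGate/dBLinear/dBGate and
// returns the gradient with respect to the input.
public static double[] Backward(double[] x, double[] dOut, double[] lin, double[] g,
                                double[,] wLinear, double[,] wGate,
                                double[,] dWLinear, double[,] dWGate,
                                double[] dBLinear, double[] dBGate)
{
    int inputDim = x.Length, outputDim = dOut.Length;
    var dX = new double[inputDim];
    for (int j = 0; j < outputDim; j++)
    {
        double dLinear = dOut[j] * g[j];                       // step 1: linear path gradient
        double dGate   = dOut[j] * lin[j] * g[j] * (1 - g[j]); // steps 1-2: gate path times sigmoid derivative
        dBLinear[j] += dLinear;                                // step 3: bias gradients
        dBGate[j]   += dGate;
        for (int i = 0; i < inputDim; i++)
        {
            dWLinear[i, j] += x[i] * dLinear;                  // step 3: weight gradients
            dWGate[i, j]   += x[i] * dGate;
            dX[i] += wLinear[i, j] * dLinear + wGate[i, j] * dGate; // step 4: combine both paths
        }
    }
    return dX;
}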

Exceptions

InvalidOperationException

Thrown when backward is called before forward.

BackwardGpu(IGpuTensor<T>)

Performs the backward pass using GPU-resident tensors.

public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)

Parameters

outputGradient IGpuTensor<T>

GPU-resident gradient of the loss w.r.t. output.

Returns

IGpuTensor<T>

GPU-resident gradient of the loss w.r.t. input.

ExportComputationGraph(List<ComputationNode<T>>)

Exports the layer's computation graph for JIT compilation.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to populate with input computation nodes.

Returns

ComputationNode<T>

The output computation node representing the layer's operation.

Remarks

This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.

For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.

To support JIT compilation, a layer must:

  1. Implement this method to export its computation graph
  2. Set SupportsJitCompilation to true
  3. Use ComputationNode and TensorOperations to build the graph

All layers are required to implement this method, even if they set SupportsJitCompilation = false.

Forward(Tensor<T>)

Performs the forward pass of the GLU layer.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor to process. Shape: [batchSize, inputDimension].

Returns

Tensor<T>

The output tensor after gated linear transformation. Shape: [batchSize, outputDimension].

Remarks

This method implements the forward pass of the GLU layer. It performs two parallel linear transformations on the input: one for the linear path and one for the gating path. The gating path output is passed through an activation function (typically sigmoid), and then the two outputs are multiplied element-wise. This gating mechanism allows the layer to selectively pass information through.

For Beginners: This is where the layer processes input data through both paths.

The forward pass works in these steps:

  1. Linear Path: Transform the input using linear weights and biases
    • This creates features that might be useful
  2. Gate Path: Transform the input using gate weights and biases
    • This determines how important each feature is
  3. Apply activation to the gate values (typically sigmoid)
    • Converts gate values to be between 0 and 1
  4. Multiply the linear output by the activated gate values
    • This lets important features pass through and blocks others

The result is that the layer can learn both:

  • What features to extract (linear path)
  • Which features are important in each context (gate path)

This selective focus helps the network learn more effectively.
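In terms of the layer's own API, a forward call takes and returns tensors with the documented shapes. How the input tensor is built depends on your data pipeline, so the helper below is purely hypothetical; the Forward call itself is the method documented here.

// 'glu' is a GatedLinearUnitLayer<float> such as the one constructed earlier.
// BuildBatch() is a hypothetical stand-in for however your code produces a
// [batchSize, inputDimension] tensor.
Tensor<float> input  = BuildBatch();
Tensor<float> output = glu.Forward(input); // shape: [batchSize, outputDimension]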

ForwardGpu(params IGpuTensor<T>[])

Performs the forward pass on GPU using FusedLinearGpu for efficient computation.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

The GPU input tensors.

Returns

IGpuTensor<T>

The GPU output tensor.

GetParameters()

Gets all trainable parameters of the layer as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all trainable parameters.

Remarks

This method retrieves all trainable parameters of the GLU layer as a single vector. The parameters include weights and biases for both the linear and gating paths. The order is: linear weights, gate weights, linear biases, gate biases.

For Beginners: This method collects all the layer's learnable values into a single list.

The parameters include four sets of values:

  1. Linear weights: Main transformation parameters
  2. Gate weights: Selection mechanism parameters
  3. Linear biases: Baseline adjustments for features
  4. Gate biases: Default settings for gates

All these values are collected in a specific order into a single vector. This combined list is useful for:

  • Saving a trained model to disk
  • Loading parameters from a previously trained model
  • Advanced optimization techniques

For a layer with 100 inputs and 50 outputs, this would return:

  • 5,000 linear weight parameters (100 × 50)
  • 5,000 gate weight parameters (100 × 50)
  • 50 linear bias parameters
  • 50 gate bias parameters
  • Totaling 10,100 parameters
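A short sketch tying this back to ParameterCount; the construction mirrors the earlier example, and nothing is used here beyond the members documented on this page.

var glu = new GatedLinearUnitLayer<float>(inputDimension: 100, outputDimension: 50);

// One flat vector in the documented order:
// linear weights, gate weights, linear biases, gate biases.
Vector<float> parameters = glu.GetParameters();

// The vector holds glu.ParameterCount values: 2 * (100 * 50) + 2 * 50 = 10,100.
int expectedLength = glu.ParameterCount;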

ResetState()

Resets the internal state of the layer.

public override void ResetState()

Remarks

This method resets the internal state of the GLU layer by clearing all cached values from forward and backward passes. This includes inputs, intermediate outputs, and gradients.

For Beginners: This method clears the layer's memory to start fresh.

When resetting the state:

  • The saved input is cleared
  • The saved linear and gate outputs are cleared
  • All calculated gradients are cleared
  • The layer forgets previous calculations it performed

This is typically called:

  • Between training batches to free up memory
  • When switching from training to evaluation mode
  • When starting to process completely new data

It's like wiping a whiteboard clean before starting a new calculation. Note that this doesn't affect the learned weights and biases, just the temporary working data.

SetParameters(Vector<T>)

Sets the trainable parameters of the layer from a single vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

A vector containing all parameters to set.

Remarks

This method sets all trainable parameters of the GLU layer from a single vector. The parameters should be in the same order as produced by GetParameters: linear weights, gate weights, linear biases, gate biases.

For Beginners: This method updates all the layer's learnable values from a provided list.

When setting parameters:

  • The input must be a vector with the exact right length
  • The values are distributed to the correct parameters in order
  • They must follow the same order used in GetParameters

This method is useful for:

  • Restoring a saved model
  • Loading pre-trained parameters
  • Testing specific parameter configurations

The method verifies that the vector contains exactly the right number of parameters before applying them.
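As a hedged restore sketch: 'savedParameters' below stands for a vector previously returned by GetParameters on a layer with the same dimensions (how it was persisted and reloaded is outside the scope of this page).

// Recreate a layer with the same shape as the one whose parameters were saved...
var restored = new GatedLinearUnitLayer<float>(inputDimension: 100, outputDimension: 50);

// ...then load the captured vector back in. The order must match GetParameters:
// linear weights, gate weights, linear biases, gate biases.
restored.SetParameters(savedParameters); // throws ArgumentException if the length is wrong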

Exceptions

ArgumentException

Thrown when the parameters vector has incorrect length.

UpdateParameters(T)

Updates the weights and biases for both paths using the calculated gradients and the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate to use for the parameter updates.

Remarks

This method updates all trainable parameters of the GLU layer based on the gradients calculated during the backward pass. The parameters include weights and biases for both the linear and gating paths. The learning rate determines the size of the parameter updates.

For Beginners: This method changes the weights and biases to improve future predictions.

After calculating how each parameter should change:

  • All parameters are adjusted in the direction that reduces errors
  • The learning rate controls how big these adjustments are

The updates apply to all four sets of parameters:

  1. Linear weights: For better feature extraction
  2. Gate weights: For better selection of important features
  3. Linear biases: For better baseline feature values
  4. Gate biases: For better default gate openness

Each parameter takes a small step that reduces the error: the update subtracts the gradient scaled by the learning rate, so every value moves in the opposite direction of its gradient.
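Putting the documented calls together, one plain gradient-descent training step could look like the sketch below. 'input', 'targets', and ComputeLossGradient are hypothetical stand-ins for your data pipeline and loss function; Forward, Backward, and UpdateParameters are the members documented on this page.

// 1. Forward pass through both the linear and gate paths.
Tensor<float> output = glu.Forward(input);

// 2. Gradient of the loss with respect to the output (from your loss function).
Tensor<float> lossGradient = ComputeLossGradient(output, targets);

// 3. Backpropagate; this caches the gradients for all four parameter tensors.
glu.Backward(lossGradient);

// 4. Apply the cached gradients with a learning rate of 0.01.
glu.UpdateParameters(0.01f);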

Exceptions

InvalidOperationException

Thrown when update is called before backward.