Class AttentionLayer<T>
Namespace: AiDotNet.NeuralNetworks.Layers
Assembly: AiDotNet.dll
Represents an Attention Layer for focusing on relevant parts of input sequences.
public class AttentionLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations (e.g., float, double).
- Inheritance
- LayerBase<T> → AttentionLayer<T>
- Implements
- ILayer<T>
- IJitCompilable<T>
- IWeightLoadable<T>
- IDisposable
- IAuxiliaryLossLayer<T>
- IDiagnosticsProvider
Remarks
The Attention Layer is a mechanism that allows a neural network to focus on different parts of the input sequence when producing each element of the output sequence. It computes a weighted sum of the input sequence, where the weights (attention weights) are determined based on the relevance of each input element to the current output.
For Beginners: An Attention Layer helps the network focus on important parts of the input.
Think of it like reading a long document to answer a question:
- Instead of remembering every word, you focus on key sentences or phrases
- The attention mechanism does something similar for the neural network
- It helps the network decide which parts of the input are most relevant for the current task
Common applications include:
- Machine translation (focusing on relevant words when translating)
- Image captioning (focusing on relevant parts of an image when describing it)
- Speech recognition (focusing on important audio segments)
The key advantage is that it allows the network to handle long sequences more effectively by focusing on the most relevant parts rather than trying to remember everything.
Constructors
AttentionLayer(int, int, IActivationFunction<T>?)
Initializes a new instance of the AttentionLayer class with scalar activation.
public AttentionLayer(int inputSize, int attentionSize, IActivationFunction<T>? activation = null)
Parameters
inputSize (int): The size of the input features.
attentionSize (int): The size of the attention mechanism.
activation (IActivationFunction<T>?): The activation function to use. If null, SoftmaxActivation is used.
Remarks
This constructor creates an Attention Layer with scalar activation, allowing for element-wise application of the activation function.
For Beginners: This sets up the Attention Layer with its initial values, using a scalar activation function.
The scalar activation means the same function is applied to each element independently. This is useful when you want to treat each attention score separately.
AttentionLayer(int, int, IVectorActivationFunction<T>?)
Initializes a new instance of the AttentionLayer class with vector activation.
public AttentionLayer(int inputSize, int attentionSize, IVectorActivationFunction<T>? activation = null)
Parameters
inputSize (int): The size of the input features.
attentionSize (int): The size of the attention mechanism.
activation (IVectorActivationFunction<T>?): The vector activation function to use. If null, SoftmaxActivation is used.
Remarks
This constructor creates an Attention Layer with vector activation, allowing for operations on entire vectors or tensors.
For Beginners: This sets up the Attention Layer with its initial values, using a vector activation function.
The vector activation means the function is applied to the entire set of attention scores at once. This can be more efficient and allows for more complex interactions between attention scores.
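Example: a minimal construction sketch. Because both overloads default their activation argument to null, calling the constructor with only two arguments would be ambiguous between them; casting the null picks the overload. The using directive is an assumption based on the namespace above.

using AiDotNet.NeuralNetworks.Layers;

// Scalar overload: null falls back to SoftmaxActivation applied element-wise.
var scalarAttention = new AttentionLayer<float>(
    inputSize: 64,
    attentionSize: 32,
    activation: (IActivationFunction<float>?)null);

// Vector overload: null falls back to SoftmaxActivation applied over the
// whole vector of attention scores at once.
var vectorAttention = new AttentionLayer<float>(
    inputSize: 64,
    attentionSize: 32,
    activation: (IVectorActivationFunction<float>?)null);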
Properties
AuxiliaryLossWeight
Gets or sets the weight for attention entropy regularization. Default is 0.01. Higher values encourage more uniform attention distributions.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
ParameterCount
Gets the total number of trainable parameters in the layer.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This property calculates the total number of trainable parameters in the Attention Layer, which includes all the weights for query, key, and value transformations.
For Beginners: This tells you how many numbers the layer needs to learn.
It counts all the weights in the four transformation matrices (Wq, Wk, Wv, Wo). A higher number means the layer can potentially learn more complex patterns, but also requires more data and time to train effectively.
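Example: reading the count. The internal matrix shapes are not stated on this page, so the arithmetic in the comment is an assumption for illustration only.

var layer = new AttentionLayer<double>(
    inputSize: 64, attentionSize: 32,
    activation: (IActivationFunction<double>?)null);

// If Wq, Wk, Wv were each [64 x 32] and Wo were [32 x 64] (assumed shapes,
// not documented here), the total would be 3 * (64 * 32) + (32 * 64) = 8192.
Console.WriteLine(layer.ParameterCount);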
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this attention layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT compilation if the Query, Key, and Value projection weights are initialized.
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with projection weight matrices (Wq, Wk, Wv)
Attention layers require these projection matrices to transform the input into query, key, and value representations. Once initialized, JIT compilation can provide significant speedup (5-10x) by optimizing:
- Matrix multiplications for projections
- Attention score computation (Q @ K^T)
- Softmax activation
- Weighted sum of values (attention @ V)
This is especially important for Transformers where attention is computed many times in each forward pass (multiple layers, multiple heads).
SupportsTraining
Gets a value indicating whether this layer supports training through backpropagation.
public override bool SupportsTraining { get; }
Property Value
- bool
Remarks
This property indicates that the Attention Layer can be trained using backpropagation.
For Beginners: This tells you that the layer can learn and improve its performance over time.
When this is true, it means the layer can adjust its internal weights based on the errors it makes, allowing it to get better at its task as it sees more data.
UseAuxiliaryLoss
Gets or sets whether to use auxiliary loss (attention entropy regularization) during training. Default is false. Enable to prevent attention collapse.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
Methods
Backward(Tensor<T>)
Performs the backward pass of the attention mechanism.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backpropagation algorithm for the attention mechanism. It computes the gradients of the loss with respect to the layer's parameters and input.
For Beginners: This is how the layer learns from its mistakes.
The method takes the gradient of the error with respect to the layer's output and works backwards to figure out:
- How much each weight contributed to the error (stored in _dWq, _dWk, _dWv)
- How the input itself contributed to the error (the returned value)
This information is then used to update the weights and improve the layer's performance.
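Example: where Backward fits in a training step. Only the Forward, Backward, and UpdateParameters signatures come from this page; the data helpers below are hypothetical placeholders.

var layer = new AttentionLayer<float>(64, 32,
    (IActivationFunction<float>?)null);

Tensor<float> input = GetBatch();              // hypothetical data source
Tensor<float> output = layer.Forward(input);   // forward pass caches state

// dLoss/dOutput from your loss function (hypothetical helper).
Tensor<float> lossGradient = ComputeLossGradient(output);

// Backward stores the weight gradients (_dWq, _dWk, _dWv) internally and
// returns dLoss/dInput for the layer below.
Tensor<float> inputGradient = layer.Backward(lossGradient);

layer.UpdateParameters(0.01f);                 // apply gradients, lr = 0.01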
BackwardGpu(IGpuTensor<T>)
Performs the backward pass on GPU for the attention layer.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The GPU tensor containing the gradient of the loss with respect to the output.
Returns
- IGpuTensor<T>
The GPU tensor containing the gradient of the loss with respect to the input.
ComputeAuxiliaryLoss()
Computes the auxiliary loss for the AttentionLayer, which is attention entropy regularization.
public T ComputeAuxiliaryLoss()
Returns
- T
The attention entropy loss value.
Remarks
Attention entropy regularization prevents attention collapse by encouraging diverse attention patterns. It computes the entropy of the attention distribution: H = -Σ(p * log(p)). Lower entropy means more focused (peaky) attention; higher entropy means more distributed attention. We negate the entropy to create a loss that penalizes low entropy (collapsed attention).
For Beginners: This calculates a penalty when attention becomes too focused on just one or two positions.
Attention entropy regularization:
- Measures how "spread out" the attention weights are
- Penalizes attention that collapses to a single position
- Encourages the model to consider multiple relevant parts of the input
- Prevents the model from ignoring potentially important information
Why this is important:
- Prevents attention heads from becoming redundant or degenerate
- Improves model robustness and generalization
- Encourages learning diverse attention patterns
- Helps prevent overfitting to specific positions
Think of it like ensuring a student reads the entire textbook rather than just memorizing one page.
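To make the formula concrete, here is a standalone calculation of H = -Σ(p * log(p)) for a collapsed and a uniform attention row. This is plain C# with no dependency on the library:

using System;
using System.Linq;

class EntropyDemo
{
    // Entropy of a probability distribution; p * log(p) is taken as 0 when p = 0.
    static double Entropy(double[] p) =>
        -p.Where(x => x > 0).Sum(x => x * Math.Log(x));

    static void Main()
    {
        double[] collapsed = { 0.97, 0.01, 0.01, 0.01 }; // peaky attention
        double[] uniform   = { 0.25, 0.25, 0.25, 0.25 }; // spread-out attention

        Console.WriteLine(Entropy(collapsed)); // ≈ 0.168 nats (low entropy)
        Console.WriteLine(Entropy(uniform));   // ≈ 1.386 nats = ln(4), the maximum

        // Negating the entropy turns it into a loss: collapsed attention
        // (-0.168) yields a higher loss than distributed attention (-1.386).
        Console.WriteLine(-Entropy(collapsed));
        Console.WriteLine(-Entropy(uniform));
    }
}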
ExportComputationGraph(List<ComputationNode<T>>)
Exports the attention layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the attention operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, inputSize]
2. Creates constant nodes for the Query, Key, and Value projection weights
3. Projects the input to Q, K, V using matrix multiplication
4. Applies scaled dot-product attention: softmax((Q @ K^T) / sqrt(d_k)) @ V
5. Returns the attention output
For Beginners: This method builds a symbolic representation of attention for JIT.
JIT compilation converts the attention mechanism into optimized native code. Attention allows the model to focus on relevant parts of the input by:
- Creating Query (what we're looking for), Key (what we have), Value (what we return) projections
- Computing similarity scores between Query and all Keys
- Using softmax to convert scores to weights (focusing mechanism)
- Applying these weights to Values to get focused output
The symbolic graph allows the JIT compiler to:
- Optimize matrix multiplications using BLAS libraries
- Fuse softmax computation with scaling
- Generate efficient memory layouts for cache utilization
Attention is the core mechanism in Transformers and modern NLP models. JIT compilation provides 5-10x speedup by optimizing these operations.
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer parameters are not initialized.
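Example: a hedged usage sketch. The SupportsJitCompilation gate and the method signature come from this page; what you subsequently do with the returned ComputationNode<T> depends on the JIT compiler API, which is not documented here.

using System.Collections.Generic;

var layer = new AttentionLayer<float>(64, 32,
    (IActivationFunction<float>?)null);

if (layer.SupportsJitCompilation) // false until Wq, Wk, Wv are initialized
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> output = layer.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic input node; `output` is the root of
    // the attention graph, ready to hand to the JIT compiler.
}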
Forward(Tensor<T>)
Performs the forward pass of the attention mechanism.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to the layer.
Returns
- Tensor<T>
The output tensor after applying the attention mechanism.
Remarks
This method implements the core functionality of the attention mechanism. It transforms the input into query, key, and value representations, computes attention scores, applies scaling and activation, and produces the final output.
For Beginners: This is where the attention magic happens!
- The input is transformed into three different representations: Query (Q), Key (K), and Value (V).
- Attention scores are computed by comparing Q and K.
- These scores are scaled and activated (usually with softmax) to get attention weights.
- The final output is produced by applying these weights to V.
This process allows the layer to focus on different parts of the input as needed.
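For reference, the computation described above looks like this when written out with plain arrays. This is a conceptual sketch of scaled dot-product attention, softmax((Q @ K^T) / sqrt(d_k)) @ V, not the library's internal code (which also includes the learned Q/K/V projections):

using System;

static class ScaledDotProductAttention
{
    // q, k: [seqLen x dk], v: [seqLen x dv]; returns [seqLen x dv].
    public static double[,] Compute(double[,] q, double[,] k, double[,] v)
    {
        int seqLen = q.GetLength(0), dk = q.GetLength(1), dv = v.GetLength(1);
        double scale = 1.0 / Math.Sqrt(dk);
        var output = new double[seqLen, dv];

        for (int i = 0; i < seqLen; i++)
        {
            // Score each key against query i, scaled by 1/sqrt(dk).
            var scores = new double[seqLen];
            for (int j = 0; j < seqLen; j++)
            {
                double dot = 0;
                for (int d = 0; d < dk; d++) dot += q[i, d] * k[j, d];
                scores[j] = dot * scale;
            }

            // Softmax turns the scores into attention weights that sum to 1.
            double max = double.NegativeInfinity;
            foreach (double s in scores) max = Math.Max(max, s);
            double sum = 0;
            for (int j = 0; j < seqLen; j++)
            {
                scores[j] = Math.Exp(scores[j] - max);
                sum += scores[j];
            }

            // Output row i is the attention-weighted sum of the value rows.
            for (int j = 0; j < seqLen; j++)
            {
                double w = scores[j] / sum;
                for (int d = 0; d < dv; d++) output[i, d] += w * v[j, d];
            }
        }
        return output;
    }
}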
Forward(params Tensor<T>[])
Performs the forward pass of the attention mechanism with multiple inputs.
public override Tensor<T> Forward(params Tensor<T>[] inputs)
Parameters
inputs (Tensor<T>[]): An array of input tensors. Based on the number of inputs:
- One input: Standard forward pass with just the input tensor
- Two inputs: The first tensor is the query input; the second is either the key/value input or an attention mask
- Three inputs: The first tensor is the query input, the second is the key/value input, and the third is the attention mask
Returns
- Tensor<T>
The output tensor after applying the attention mechanism.
Remarks
This method extends the attention mechanism to support multiple input tensors, which is useful for implementing cross-attention (as used in transformer decoder layers) and masked attention.
For Beginners: This method allows the attention layer to handle more complex scenarios:
- With one input: It works just like the standard attention (self-attention)
- With two inputs: It can either:
  - Perform cross-attention (where the query comes from one source, and key/value from another)
  - Apply a mask to self-attention to control which parts of the input to focus on
- With three inputs: It performs masked cross-attention, which combines both features above
These capabilities are essential for transformer architectures, especially decoder layers that need to attend to both their own outputs and the encoder's outputs.
Exceptions
- ArgumentException
Thrown when the input array is empty.
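Example: the three calling conventions. How Tensor<T> instances are obtained is not covered on this page, so the helpers below are hypothetical placeholders; only the argument patterns are documented above.

Tensor<float> query = LoadDecoderInput();      // hypothetical helpers
Tensor<float> memory = LoadEncoderOutput();
Tensor<float> mask = BuildCausalMask();

var selfAttention = layer.Forward(query);              // one input
var crossAttention = layer.Forward(query, memory);     // two inputs (or query + mask)
var maskedCross = layer.Forward(query, memory, mask);  // three inputs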
ForwardGpu(params IGpuTensor<T>[])
Performs GPU-accelerated forward pass for the attention mechanism. All computations stay on GPU - no CPU roundtrips.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The input GPU tensors. Expects one tensor with shape [batch, seqLen, inputSize].
Returns
- IGpuTensor<T>
The output GPU tensor after applying the attention mechanism.
Exceptions
- ArgumentException
Thrown when no inputs are provided.
- InvalidOperationException
Thrown when engine is not a DirectGpuTensorEngine.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the attention regularization.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about attention patterns.
Remarks
This method provides insights into attention behavior, including:
- Attention entropy (a measure of distribution spread)
- Whether regularization is enabled
- The regularization weight
For Beginners: This gives you information to monitor attention pattern health.
The diagnostics include:
- Attention Entropy: How spread out the attention is (higher = more distributed)
- Entropy Weight: How much the regularization influences training
- Use Auxiliary Loss: Whether regularization is enabled
These values help you:
- Detect attention collapse (very low entropy)
- Monitor attention diversity during training
- Tune the entropy regularization weight
- Ensure attention heads are learning different patterns
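Example: enabling the regularizer and inspecting its diagnostics. The exact dictionary keys are not specified on this page, so the code simply prints whatever is returned.

using System;
using System.Collections.Generic;

layer.UseAuxiliaryLoss = true; // turn on entropy regularization

// ... run forward/backward passes during training ...

Dictionary<string, string> diagnostics = layer.GetAuxiliaryLossDiagnostics();
foreach (KeyValuePair<string, string> entry in diagnostics)
    Console.WriteLine($"{entry.Key}: {entry.Value}");

// Watch the entropy value over time: a reading near zero suggests the
// attention has collapsed onto one or two positions.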
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Retrieves the current parameters of the layer.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all the parameters of the layer.
Remarks
This method collects all the weights of the attention layer (Wq, Wk, Wv) into a single vector. It's useful for operations that need to work with all the layer's parameters at once, such as certain optimization algorithms or when saving the model's state.
For Beginners: This method gives you all the layer's learned values in one list.
It's like taking a snapshot of everything the layer has learned. This can be useful for saving the layer's current state or for advanced training techniques.
ResetState()
Resets the state of the attention layer.
public override void ResetState()
Remarks
This method resets the internal state of the attention layer. It clears the last input and attention weights, effectively preparing the layer for a new sequence or episode.
For Beginners: This is like clearing the layer's short-term memory.
In attention mechanisms, sometimes we want to start fresh, forgetting any previous inputs. This is especially useful when starting a new sequence or when you don't want the layer to consider past information anymore.
UpdateParameters(Vector<T>)
Updates the layer's parameters with the provided values.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing new parameter values.
Remarks
This method replaces the current values of the layer's weights with new values provided in the parameters vector. It's useful for setting the layer's state to a specific configuration, such as when loading a pre-trained model.
For Beginners: This allows you to directly set the layer's internal weights.
Instead of the layer learning these weights through training, you're providing them directly. This is often used when you want to use a pre-trained attention layer or set up the layer with specific initial values.
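Example: a snapshot/restore round trip using GetParameters together with UpdateParameters(Vector<T>). The vector presumably must match ParameterCount in length; this page does not state the failure behavior otherwise.

// Snapshot the learned weights as a single flat vector.
Vector<float> snapshot = layer.GetParameters();

// ... train further, or experiment with other weights ...

// Restore the layer to the snapshotted state.
layer.UpdateParameters(snapshot);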
UpdateParameters(T)
Updates the layer's parameters based on the computed gradients and a learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the update.
Remarks
This method applies the computed gradients to the layer's weights, scaled by the learning rate. This is typically called after the backward pass to adjust the layer's parameters.
For Beginners: This is how the layer actually improves its performance.
After figuring out how each weight contributed to the error (in the Backward method), this method adjusts those weights to reduce the error:
- Weights that contributed to large errors are changed more.
- The learning rate determines how big these changes are.
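The update described here is plain gradient descent. A standalone sketch of the rule applied to one weight matrix (illustrative only, not the library's internals):

// w ← w − learningRate · grad: weights that contributed more to the error
// (larger gradient magnitude) are adjusted more.
static void GradientDescentStep(double[,] w, double[,] grad, double learningRate)
{
    for (int i = 0; i < w.GetLength(0); i++)
        for (int j = 0; j < w.GetLength(1); j++)
            w[i, j] -= learningRate * grad[i, j];
}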