Class AttentionLayer<T>
Namespace: AiDotNet.NeuralNetworks.Layers
Assembly: AiDotNet.dll
Represents an Attention Layer for focusing on relevant parts of input sequences.
public class AttentionLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations (e.g., float, double).
- Inheritance
- LayerBase<T> → AttentionLayer<T>
- Implements
- ILayer<T>
- IJitCompilable<T>
- IWeightLoadable<T>
- IDisposable
- IAuxiliaryLossLayer<T>
- IDiagnosticsProvider
Remarks
The Attention Layer is a mechanism that allows a neural network to focus on different parts of the input sequence when producing each element of the output sequence. It computes a weighted sum of the input sequence, where the weights (attention weights) are determined based on the relevance of each input element to the current output.
For Beginners: An Attention Layer helps the network focus on important parts of the input.
Think of it like reading a long document to answer a question:
- Instead of remembering every word, you focus on key sentences or phrases
- The attention mechanism does something similar for the neural network
- It helps the network decide which parts of the input are most relevant for the current task
Common applications include:
- Machine translation (focusing on relevant words when translating)
- Image captioning (focusing on relevant parts of an image when describing it)
- Speech recognition (focusing on important audio segments)
The key advantage is that it allows the network to handle long sequences more effectively by focusing on the most relevant parts rather than trying to remember everything.
Constructors
AttentionLayer(int, int, IActivationFunction<T>?)
Initializes a new instance of the AttentionLayer class with scalar activation.
public AttentionLayer(int inputSize, int attentionSize, IActivationFunction<T>? activation = null)
Parameters
inputSize (int): The size of the input features.
attentionSize (int): The size of the attention mechanism.
activation (IActivationFunction<T>?): The activation function to use. If null, SoftmaxActivation is used.
Remarks
This constructor creates an Attention Layer with scalar activation, allowing for element-wise application of the activation function.
For Beginners: This sets up the Attention Layer with its initial values, using a scalar activation function.
The scalar activation means the same function is applied to each element independently. This is useful when you want to treat each attention score separately.
AttentionLayer(int, int, IVectorActivationFunction<T>?)
Initializes a new instance of the AttentionLayer class with vector activation.
public AttentionLayer(int inputSize, int attentionSize, IVectorActivationFunction<T>? activation = null)
Parameters
inputSize (int): The size of the input features.
attentionSize (int): The size of the attention mechanism.
activation (IVectorActivationFunction<T>?): The vector activation function to use. If null, SoftmaxActivation is used.
Remarks
This constructor creates an Attention Layer with vector activation, allowing for operations on entire vectors or tensors.
For Beginners: This sets up the Attention Layer with its initial values, using a vector activation function.
The vector activation means the function is applied to the entire set of attention scores at once. This can be more efficient and allows for more complex interactions between attention scores.
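Example: a minimal construction sketch. Because both overloads default their activation argument to null, calling the constructor with only two arguments would be ambiguous between them; casting the null picks the overload. The using directive is an assumption based on the namespace above.

using AiDotNet.NeuralNetworks.Layers;

// Scalar overload: null falls back to SoftmaxActivation applied element-wise.
var scalarAttention = new AttentionLayer<float>(
    inputSize: 64,
    attentionSize: 32,
    activation: (IActivationFunction<float>?)null);

// Vector overload: null falls back to SoftmaxActivation applied over the
// whole vector of attention scores at once.
var vectorAttention = new AttentionLayer<float>(
    inputSize: 64,
    attentionSize: 32,
    activation: (IVectorActivationFunction<float>?)null);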
Properties
AuxiliaryLossWeight
Gets or sets the weight for attention entropy regularization. Default is 0.01. Higher values encourage more uniform attention distributions.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
ParameterCount
Gets the total number of trainable parameters in the layer.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This property calculates the total number of trainable parameters in the Attention Layer, which includes all the weights for query, key, and value transformations.
For Beginners: This tells you how many numbers the layer needs to learn.
It counts all the weights in the four transformation matrices (Wq, Wk, Wv, Wo). A higher number means the layer can potentially learn more complex patterns, but also requires more data and time to train effectively.
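Example: reading the count. The internal matrix shapes are not stated on this page, so the arithmetic in the comment is an assumption for illustration only.

var layer = new AttentionLayer<double>(
    inputSize: 64, attentionSize: 32,
    activation: (IActivationFunction<double>?)null);

// If Wq, Wk, Wv were each [64 x 32] and Wo were [32 x 64] (assumed shapes,
// not documented here), the total would be 3 * (64 * 32) + (32 * 64) = 8192.
Console.WriteLine(layer.ParameterCount);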
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this attention layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT compilation if the Query, Key, and Value projection weights are initialized.
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with projection weight matrices (Wq, Wk, Wv)
Attention layers require these projection matrices to transform the input into query, key, and value representations. Once initialized, JIT compilation can provide significant speedup (5-10x) by optimizing:
- Matrix multiplications for projections
- Attention score computation (Q @ K^T)
- Softmax activation
- Weighted sum of values (attention @ V)
This is especially important for Transformers where attention is computed many times in each forward pass (multiple layers, multiple heads).
SupportsTraining
Gets a value indicating whether this layer supports training through backpropagation.
public override bool SupportsTraining { get; }
Property Value
- bool
Remarks
This property indicates that the Attention Layer can be trained using backpropagation.
For Beginners: This tells you that the layer can learn and improve its performance over time.
When this is true, it means the layer can adjust its internal weights based on the errors it makes, allowing it to get better at its task as it sees more data.
UseAuxiliaryLoss
Gets or sets whether to use auxiliary loss (attention entropy regularization) during training. Default is false. Enable to prevent attention collapse.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
Methods
Backward(Tensor<T>)
Performs the backward pass of the attention mechanism.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backpropagation algorithm for the attention mechanism. It computes the gradients of the loss with respect to the layer's parameters and input.
For Beginners: This is how the layer learns from its mistakes.
The method takes the gradient of the error with respect to the layer's output and works backwards to figure out:
- How much each weight contributed to the error (stored in _dWq, _dWk, _dWv)
- How the input itself contributed to the error (the returned value)
This information is then used to update the weights and improve the layer's performance.
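Example: where Backward fits in a training step. Only the Forward, Backward, and UpdateParameters signatures come from this page; the data helpers below are hypothetical placeholders.

var layer = new AttentionLayer<float>(64, 32,
    (IActivationFunction<float>?)null);

Tensor<float> input = GetBatch();              // hypothetical data source
Tensor<float> output = layer.Forward(input);   // forward pass caches state

// dLoss/dOutput from your loss function (hypothetical helper).
Tensor<float> lossGradient = ComputeLossGradient(output);

// Backward stores the weight gradients (_dWq, _dWk, _dWv) internally and
// returns dLoss/dInput for the layer below.
Tensor<float> inputGradient = layer.Backward(lossGradient);

layer.UpdateParameters(0.01f);                 // apply gradients, lr = 0.01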
BackwardGpu(IGpuTensor<T>)
Performs the backward pass on GPU for the attention layer.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The GPU tensor containing the gradient of the loss with respect to the output.
Returns
- IGpuTensor<T>
The GPU tensor containing the gradient of the loss with respect to the input.
ComputeAuxiliaryLoss()
Computes the auxiliary loss for the AttentionLayer, which is attention entropy regularization.
public T ComputeAuxiliaryLoss()
Returns
- T
The attention entropy loss value.
Remarks
Attention entropy regularization prevents attention collapse by encouraging diverse attention patterns. It computes the entropy of the attention distribution: H = -Σ(p * log(p)). Lower entropy means more focused (peaky) attention; higher entropy means more distributed attention. We negate the entropy to create a loss that penalizes low entropy (collapsed attention).
For Beginners: This calculates a penalty when attention becomes too focused on just one or two positions.
Attention entropy regularization:
- Measures how "spread out" the attention weights are
- Penalizes attention that collapses to a single position
- Encourages the model to consider multiple relevant parts of the input
- Prevents the model from ignoring potentially important information
Why this is important:
- Prevents attention heads from becoming redundant or degenerate
- Improves model robustness and generalization
- Encourages learning diverse attention patterns
- Helps prevent overfitting to specific positions
Think of it like ensuring a student reads the entire textbook rather than just memorizing one page.
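To make the formula concrete, here is a standalone calculation of H = -Σ(p * log(p)) for a collapsed and a uniform attention row. This is plain C# with no dependency on the library:

using System;
using System.Linq;

class EntropyDemo
{
    // Entropy of a probability distribution; p * log(p) is taken as 0 when p = 0.
    static double Entropy(double[] p) =>
        -p.Where(x => x > 0).Sum(x => x * Math.Log(x));

    static void Main()
    {
        double[] collapsed = { 0.97, 0.01, 0.01, 0.01 }; // peaky attention
        double[] uniform   = { 0.25, 0.25, 0.25, 0.25 }; // spread-out attention

        Console.WriteLine(Entropy(collapsed)); // ≈ 0.168 nats (low entropy)
        Console.WriteLine(Entropy(uniform));   // ≈ 1.386 nats = ln(4), the maximum

        // Negating the entropy turns it into a loss: collapsed attention
        // (-0.168) yields a higher loss than distributed attention (-1.386).
        Console.WriteLine(-Entropy(collapsed));
        Console.WriteLine(-Entropy(uniform));
    }
}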
ExportComputationGraph(List<ComputationNode<T>>)
Exports the attention layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the attention operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, inputSize]
2. Creates constant nodes for the Query, Key, and Value projection weights
3. Projects the input to Q, K, V using matrix multiplication
4. Applies scaled dot-product attention: softmax((Q @ K^T) / sqrt(d_k)) @ V
5. Returns the attention output
For Beginners: This method builds a symbolic representation of attention for JIT.
JIT compilation converts the attention mechanism into optimized native code. Attention allows the model to focus on relevant parts of the input by:
- Creating Query (what we're looking for), Key (what we have), Value (what we return) projections
- Computing similarity scores between Query and all Keys
- Using softmax to convert scores to weights (focusing mechanism)
- Applying these weights to Values to get focused output
The symbolic graph allows the JIT compiler to:
- Optimize matrix multiplications using BLAS libraries
- Fuse softmax computation with scaling
- Generate efficient memory layouts for cache utilization
Attention is the core mechanism in Transformers and modern NLP models. JIT compilation provides 5-10x speedup by optimizing these operations.
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer parameters are not initialized.
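Example: a hedged usage sketch. The SupportsJitCompilation gate and the method signature come from this page; what you subsequently do with the returned ComputationNode<T> depends on the JIT compiler API, which is not documented here.

using System.Collections.Generic;

var layer = new AttentionLayer<float>(64, 32,
    (IActivationFunction<float>?)null);

if (layer.SupportsJitCompilation) // false until Wq, Wk, Wv are initialized
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> output = layer.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic input node; `output` is the root of
    // the attention graph, ready to hand to the JIT compiler.
}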
Forward(Tensor<T>)
Performs the forward pass of the attention mechanism.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to the layer.
Returns
- Tensor<T>
The output tensor after applying the attention mechanism.
Remarks
This method implements the core functionality of the attention mechanism. It transforms the input into query, key, and value representations, computes attention scores, applies scaling and activation, and produces the final output.
For Beginners: This is where the attention magic happens!
- The input is transformed into three different representations: Query (Q), Key (K), and Value (V).
- Attention scores are computed by comparing Q and K.
- These scores are scaled and activated (usually with softmax) to get attention weights.
- The final output is produced by applying these weights to V.
This process allows the layer to focus on different parts of the input as needed.
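For reference, the computation described above looks like this when written out with plain arrays. This is a conceptual sketch of scaled dot-product attention, softmax((Q @ K^T) / sqrt(d_k)) @ V, not the library's internal code (which also includes the learned Q/K/V projections):

using System;

static class ScaledDotProductAttention
{
    // q, k: [seqLen x dk], v: [seqLen x dv]; returns [seqLen x dv].
    public static double[,] Compute(double[,] q, double[,] k, double[,] v)
    {
        int seqLen = q.GetLength(0), dk = q.GetLength(1), dv = v.GetLength(1);
        double scale = 1.0 / Math.Sqrt(dk);
        var output = new double[seqLen, dv];

        for (int i = 0; i < seqLen; i++)
        {
            // Score each key against query i, scaled by 1/sqrt(dk).
            var scores = new double[seqLen];
            for (int j = 0; j < seqLen; j++)
            {
                double dot = 0;
                for (int d = 0; d < dk; d++) dot += q[i, d] * k[j, d];
                scores[j] = dot * scale;
            }

            // Softmax turns the scores into attention weights that sum to 1.
            double max = double.NegativeInfinity;
            foreach (double s in scores) max = Math.Max(max, s);
            double sum = 0;
            for (int j = 0; j < seqLen; j++)
            {
                scores[j] = Math.Exp(scores[j] - max);
                sum += scores[j];
            }

            // Output row i is the attention-weighted sum of the value rows.
            for (int j = 0; j < seqLen; j++)
            {
                double w = scores[j] / sum;
                for (int d = 0; d < dv; d++) output[i, d] += w * v[j, d];
            }
        }
        return output;
    }
}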
Forward(params Tensor<T>[])
Performs the forward pass of the attention mechanism with multiple inputs.
public override Tensor<T> Forward(params Tensor<T>[] inputs)
Parameters
inputs (Tensor<T>[]): An array of input tensors. Based on the number of inputs:
- One input: Standard forward pass with just the input tensor
- Two inputs: The first tensor is the query input; the second is either the key/value input or an attention mask
- Three inputs: The first tensor is the query input, the second is the key/value input, and the third is the attention mask
Returns
- Tensor<T>
The output tensor after applying the attention mechanism.
Remarks
This method extends the attention mechanism to support multiple input tensors, which is useful for implementing cross-attention (as used in transformer decoder layers) and masked attention.
For Beginners: This method allows the attention layer to handle more complex scenarios:
- With one input: It works just like the standard attention (self-attention)
- With two inputs: It can either:
  - Perform cross-attention (where the query comes from one source, and key/value from another)
  - Apply a mask to self-attention to control which parts of the input to focus on
- With three inputs: It performs masked cross-attention, which combines both features above
These capabilities are essential for transformer architectures, especially decoder layers that need to attend to both their own outputs and the encoder's outputs.
Exceptions
- ArgumentException
Thrown when the input array is empty.
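Example: the three calling conventions. How Tensor<T> instances are obtained is not covered on this page, so the helpers below are hypothetical placeholders; only the argument patterns are documented above.

Tensor<float> query = LoadDecoderInput();      // hypothetical helpers
Tensor<float> memory = LoadEncoderOutput();
Tensor<float> mask = BuildCausalMask();

var selfAttention = layer.Forward(query);              // one input
var crossAttention = layer.Forward(query, memory);     // two inputs (or query + mask)
var maskedCross = layer.Forward(query, memory, mask);  // three inputs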
ForwardGpu(params IGpuTensor<T>[])
Performs GPU-accelerated forward pass for the attention mechanism. All computations stay on GPU - no CPU roundtrips.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The input GPU tensors. Expects one tensor with shape [batch, seqLen, inputSize].
Returns
- IGpuTensor<T>
The output GPU tensor after applying the attention mechanism.
Exceptions
- ArgumentException
Thrown when no inputs are provided.
- InvalidOperationException
Thrown when engine is not a DirectGpuTensorEngine.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the attention regularization.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about attention patterns.
Remarks
This method provides insights into attention behavior, including:
- Attention entropy (a measure of distribution spread)
- Whether regularization is enabled
- The regularization weight
For Beginners: This gives you information to monitor attention pattern health.
The diagnostics include:
- Attention Entropy: How spread out the attention is (higher = more distributed)
- Entropy Weight: How much the regularization influences training
- Use Auxiliary Loss: Whether regularization is enabled
These values help you:
- Detect attention collapse (very low entropy)
- Monitor attention diversity during training
- Tune the entropy regularization weight
- Ensure attention heads are learning different patterns
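Example: enabling the regularizer and inspecting its diagnostics. The exact dictionary keys are not specified on this page, so the code simply prints whatever is returned.

using System;
using System.Collections.Generic;

layer.UseAuxiliaryLoss = true; // turn on entropy regularization

// ... run forward/backward passes during training ...

Dictionary<string, string> diagnostics = layer.GetAuxiliaryLossDiagnostics();
foreach (KeyValuePair<string, string> entry in diagnostics)
    Console.WriteLine($"{entry.Key}: {entry.Value}");

// Watch the entropy value over time: a reading near zero suggests the
// attention has collapsed onto one or two positions.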
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Retrieves the current parameters of the layer.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all the parameters of the layer.
Remarks
This method collects all the weights of the attention layer (Wq, Wk, Wv) into a single vector. It's useful for operations that need to work with all the layer's parameters at once, such as certain optimization algorithms or when saving the model's state.
For Beginners: This method gives you all the layer's learned values in one list.
It's like taking a snapshot of everything the layer has learned. This can be useful for saving the layer's current state or for advanced training techniques.
ResetState()
Resets the state of the attention layer.
public override void ResetState()
Remarks
This method resets the internal state of the attention layer. It clears the last input and attention weights, effectively preparing the layer for a new sequence or episode.
For Beginners: This is like clearing the layer's short-term memory.
In attention mechanisms, sometimes we want to start fresh, forgetting any previous inputs. This is especially useful when starting a new sequence or when you don't want the layer to consider past information anymore.
UpdateParameters(Vector<T>)
Updates the layer's parameters with the provided values.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing new parameter values.
Remarks
This method replaces the current values of the layer's weights with new values provided in the parameters vector. It's useful for setting the layer's state to a specific configuration, such as when loading a pre-trained model.
For Beginners: This allows you to directly set the layer's internal weights.
Instead of the layer learning these weights through training, you're providing them directly. This is often used when you want to use a pre-trained attention layer or set up the layer with specific initial values.
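Example: a snapshot/restore round trip using GetParameters together with UpdateParameters(Vector<T>). The vector presumably must match ParameterCount in length; this page does not state the failure behavior otherwise.

// Snapshot the learned weights as a single flat vector.
Vector<float> snapshot = layer.GetParameters();

// ... train further, or experiment with other weights ...

// Restore the layer to the snapshotted state.
layer.UpdateParameters(snapshot);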
UpdateParameters(T)
Updates the layer's parameters based on the computed gradients and a learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the update.
Remarks
This method applies the computed gradients to the layer's weights, scaled by the learning rate. This is typically called after the backward pass to adjust the layer's parameters.
For Beginners: This is how the layer actually improves its performance.
After figuring out how each weight contributed to the error (in the Backward method), this method adjusts those weights to reduce the error:
- Weights that contributed to large errors are changed more.
- The learning rate determines how big these changes are.
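The update described here is plain gradient descent. A standalone sketch of the rule applied to one weight matrix (illustrative only, not the library's internals):

// w ← w − learningRate · grad: weights that contributed more to the error
// (larger gradient magnitude) are adjusted more.
static void GradientDescentStep(double[,] w, double[,] grad, double learningRate)
{
    for (int i = 0; i < w.GetLength(0); i++)
        for (int j = 0; j < w.GetLength(1); j++)
            w[i, j] -= learningRate * grad[i, j];
}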