Class SelfAttentionLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents a self-attention layer that allows a sequence to attend to itself, capturing relationships between elements.
public class SelfAttentionLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
  - LayerBase<T> → SelfAttentionLayer<T>
- Implements
  - ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Remarks
The SelfAttentionLayer implements the self-attention mechanism, a key component of transformer architectures. It allows each position in a sequence to attend to all positions within the same sequence, enabling the model to capture long-range dependencies and relationships. The layer uses the scaled dot-product attention mechanism with multiple attention heads, which allows it to focus on different aspects of the input simultaneously.
For Beginners: This layer helps a neural network understand relationships between different parts of a sequence.
Think of the SelfAttentionLayer like a group of spotlights at a theater performance:
- Each spotlight (attention head) can focus on different actors on stage
- For each actor, the spotlights decide which other actors are most relevant to them
- The spotlights assign importance scores to these relationships
- This helps the network understand who is interacting with whom, and how
For example, in a sentence like "The cat sat on the mat because it was tired":
- Traditional networks might struggle to figure out what "it" refers to
- Self-attention can learn that "it" has a strong relationship with "cat"
- This helps the network understand that the cat was tired, not the mat
Multi-head attention (using multiple "spotlights") allows the layer to focus on different types of relationships simultaneously, such as grammatical structure, semantic meaning, and contextual clues.
Self-attention is a cornerstone of modern natural language processing and has revolutionized how neural networks handle sequential data like text, time series, and even images.
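For reference, the scaled dot-product attention used by each head follows the standard Transformer formulation (stated here for intuition; the exact implementation details are internal to the layer):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value projections of the input, and d_k = embeddingDimension / headCount is the per-head dimension.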
Constructors
SelfAttentionLayer(int, int, int, IActivationFunction<T>?)
Initializes a new instance of the SelfAttentionLayer<T> class with a scalar activation function.
public SelfAttentionLayer(int sequenceLength, int embeddingDimension, int headCount = 8, IActivationFunction<T>? activationFunction = null)
Parameters
sequenceLength (int): The length of the input sequence.
embeddingDimension (int): The dimension of the input and output embeddings.
headCount (int): The number of attention heads. Defaults to 8.
activationFunction (IActivationFunction<T>): The activation function to apply to the output. Defaults to Identity if not specified.
Remarks
This constructor creates a new SelfAttentionLayer with the specified dimensions and a scalar activation function. It validates that the embedding dimension is divisible by the number of heads and initializes the weight matrices and bias vector with appropriate values. A scalar activation function is applied element-wise to each output embedding independently.
For Beginners: This creates a new self-attention layer for your neural network using a simple activation function.
When you create this layer, you specify:
- sequenceLength: How many items (like words) are in your sequence
- embeddingDimension: How many features each item has
- headCount: How many different "spotlights" the attention mechanism uses (default: 8)
- activationFunction: How to transform the output (defaults to Identity, which makes no changes)
For example, in a language model:
- sequenceLength might be 512 (the maximum number of words/tokens in a text)
- embeddingDimension might be 768 (the number of features per word/token)
- Using 8 attention heads lets the model focus on 8 different types of relationships
The embedding dimension must be divisible by the number of heads (e.g., 768 ÷ 8 = 96), so each head has the same dimension.
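As a minimal usage sketch (the dimensions below are illustrative and match the language-model example above):

// Create a self-attention layer for 512-token sequences with 768-dimensional
// embeddings and 8 attention heads (768 / 8 = 96 dimensions per head).
var attention = new SelfAttentionLayer<float>(
    sequenceLength: 512,
    embeddingDimension: 768,
    headCount: 8); // activationFunction is omitted, so the Identity default is used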
Exceptions
- ArgumentException
Thrown when the embedding dimension is not divisible by the number of heads.
SelfAttentionLayer(int, int, int, IVectorActivationFunction<T>?)
Initializes a new instance of the SelfAttentionLayer<T> class with a vector activation function.
public SelfAttentionLayer(int sequenceLength, int embeddingDimension, int headCount = 8, IVectorActivationFunction<T>? vectorActivationFunction = null)
Parameters
sequenceLength (int): The length of the input sequence.
embeddingDimension (int): The dimension of the input and output embeddings.
headCount (int): The number of attention heads. Defaults to 8.
vectorActivationFunction (IVectorActivationFunction<T>): The vector activation function to apply to the output. Defaults to Identity if not specified.
Remarks
This constructor creates a new SelfAttentionLayer with the specified dimensions and a vector activation function. It validates that the embedding dimension is divisible by the number of heads and initializes the weight tensors and bias tensor with appropriate values. A vector activation function is applied to the entire output vector at once, which allows for interactions between different output elements.
For Beginners: This creates a new self-attention layer for your neural network using an advanced activation function.
When you create this layer, you specify the same parameters as in the scalar version, but with a vector activation:
- sequenceLength: How many items are in your sequence
- embeddingDimension: How many features each item has
- headCount: How many different "spotlights" the attention mechanism uses
- vectorActivationFunction: How to transform the entire output as a group
A vector activation can consider relationships between different positions in the output, which might be useful for certain advanced applications.
This constructor works the same as the scalar version, but allows for more sophisticated activation patterns across the output sequence.
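A sketch of selecting this overload; any IVectorActivationFunction<float> implementation from your project can be substituted for the null placeholder:

// Explicitly typing the argument as IVectorActivationFunction<float>? selects
// this overload; passing null falls back to the Identity activation.
IVectorActivationFunction<float>? vectorActivation = null;
var attention = new SelfAttentionLayer<float>(512, 768, 8, vectorActivation);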
Exceptions
- ArgumentException
Thrown when the embedding dimension is not divisible by the number of heads.
Properties
AuxiliaryLossWeight
Gets or sets the weight for the attention sparsity auxiliary loss.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
Remarks
This weight controls how much attention sparsity regularization contributes to the total loss. Typical values range from 0.001 to 0.01.
For Beginners: This controls how much we encourage focused attention.
Common values:
- 0.005 (default): Balanced sparsity regularization
- 0.001-0.003: Light sparsity enforcement
- 0.008-0.01: Strong sparsity enforcement
Higher values encourage sharper, more focused attention patterns.
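For example, to enable light sparsity regularization (a sketch assuming an existing SelfAttentionLayer<float> instance named attention):

attention.UseAuxiliaryLoss = true;      // enable attention sparsity regularization
attention.AuxiliaryLossWeight = 0.002f; // light enforcement (default is 0.005)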
ParameterCount
Gets the total number of trainable parameters in this layer.
public override int ParameterCount { get; }
Property Value
- int
The total number of parameters: 3 weight matrices (Q, K, V) each of size [embeddingDimension × embeddingDimension], plus an output bias of size [embeddingDimension]. Total = 3 × E² + E = E × (3E + 1) where E is the embedding dimension.
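For instance, with the 768-dimensional embedding from the constructor example, the total is 3 × 768² + 768 = 1,770,240 trainable parameters.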
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this self-attention layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT if:
- Query, key, and value projection weights are initialized
- The layer has been properly configured with sequence length and embedding dimensions
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with projection weight matrices (query, key, value weights)
- The multi-head structure has been configured
Self-attention layers are computationally expensive because each position attends to all other positions in the sequence (O(n²) complexity). JIT compilation can provide significant speedup (5-10x) by optimizing:
- Parallel matrix multiplications for projections
- Multi-head attention score computation across heads
- Softmax operations for attention weights
- Weighted sums of values across all heads
This is especially critical for Transformers where self-attention is the bottleneck:
- BERT has 12-24 self-attention layers
- GPT-3 has 96 self-attention layers
- Vision Transformers process image patches as sequences
JIT compilation makes these models practical for production use.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
Always true for SelfAttentionLayer, indicating that the layer can be trained through backpropagation.
Remarks
This property indicates that the SelfAttentionLayer has trainable parameters (query, key, and value weights, as well as output biases) that can be optimized during the training process using backpropagation. The gradients of these parameters are calculated during the backward pass and used to update the parameters.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has values (weights and biases) that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process of the neural network
When you train a neural network containing this layer, it will automatically learn which relationships between sequence positions are important for your specific task.
UseAuxiliaryLoss
Gets or sets whether auxiliary loss (attention sparsity regularization) should be used during training.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
Remarks
Attention sparsity regularization encourages the attention mechanism to focus on relevant positions while ignoring irrelevant ones. This prevents attention from being too diffuse and improves interpretability.
For Beginners: This helps self-attention focus on what matters.
Self-attention works best when it's selective:
- Without regularization: Attention might spread too thin across all positions
- With regularization: Attention focuses on truly relevant relationships
This includes:
- Entropy regularization: Prevents overly uniform attention
- Sparsity penalties: Encourages sharp, focused attention patterns
This helps the model:
- Learn clearer, more interpretable attention patterns
- Focus computational resources on relevant relationships
- Improve robustness and generalization
Methods
Backward(Tensor<T>)
Performs the backward pass of the self-attention layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backward pass of the self-attention layer, which is used during training to propagate error gradients back through the network. It calculates the gradients of the loss with respect to the layer's parameters (query, key, and value weights, as well as output biases) and with respect to the layer's input. The calculation involves complex tensor operations that essentially reverse the computations done in the forward pass.
For Beginners: This method calculates how the layer's parameters should change to reduce errors.
During the backward pass:
- The layer receives error gradients indicating how the output should change
- It calculates how each of its internal components contributed to the error:
- How the query weights should change
- How the key weights should change
- How the value weights should change
- How the output biases should change
- It also calculates how the error should propagate back to the previous layer
This involves complex matrix mathematics, but the basic idea is:
- Finding which attention patterns led to errors
- Adjusting the weights to improve these patterns
- Sending appropriate feedback to the previous layer
The backward pass is what allows the self-attention mechanism to learn which relationships in the sequence are important for the specific task.
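Putting the forward and backward passes together, a single training step might look like the following sketch (input and lossGradient are assumed to be tensors of shape [batchSize, sequenceLength, embeddingDimension] obtained elsewhere in your training loop; the learning rate is illustrative):

// Forward pass: shape [batchSize, sequenceLength, embeddingDimension] in and out.
Tensor<float> output = attention.Forward(input);

// ... compute the loss and its gradient with respect to `output` ...

// Backward pass: accumulates weight/bias gradients and returns the input gradient.
Tensor<float> inputGradient = attention.Backward(lossGradient);

// Apply the accumulated gradients.
attention.UpdateParameters(0.001f);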
Exceptions
- InvalidOperationException
Thrown when backward is called before forward.
BackwardGpu(IGpuTensor<T>)
Performs the backward pass using GPU-resident tensors.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): GPU-resident gradient of the loss w.r.t. output.
Returns
- IGpuTensor<T>
GPU-resident gradient of the loss w.r.t. input.
ComputeAuxiliaryLoss()
Computes the attention sparsity auxiliary loss for this layer.
public T ComputeAuxiliaryLoss()
Returns
- T
The computed attention sparsity auxiliary loss.
Remarks
This method computes the attention sparsity auxiliary loss from the attention weights produced during the most recent forward pass. The loss combines entropy regularization, which discourages overly uniform attention, with an L1 sparsity penalty on the attention weights. It only contributes to training when UseAuxiliaryLoss is enabled, and its influence on the total loss is controlled by AuxiliaryLossWeight.
For Beginners: This method measures how focused the layer's attention patterns are.
The computed loss:
- Is larger when attention is spread thinly and uniformly across many positions
- Is smaller when attention concentrates on a few relevant positions
- Is added to the main training loss (scaled by AuxiliaryLossWeight) to encourage sharper, more interpretable attention
Use GetAuxiliaryLossDiagnostics() to inspect the individual entropy and sparsity components.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the self-attention layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the self-attention operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, sequenceLength, embeddingDimension]
2. Creates constant nodes for the query, key, and value projection weights
3. Projects the input to Q, K, and V using matrix multiplication (in self-attention, all three come from the same input)
4. Applies the multi-head scaled dot-product attention mechanism
5. Returns the attention output with residual connection and bias
For Beginners: This method builds a symbolic representation of self-attention for JIT.
JIT compilation converts multi-head self-attention into optimized native code. Self-attention allows each position in a sequence to attend to all positions, enabling the model to capture long-range dependencies and relationships within the sequence.
Multi-head attention uses multiple parallel attention mechanisms ("heads") that:
- Focus on different aspects of the input simultaneously
- Allow the model to capture diverse relationships (syntax, semantics, context)
- Improve the model's ability to understand complex patterns
The symbolic graph allows the JIT compiler to:
- Optimize parallel matrix multiplications across heads
- Fuse attention score computation and softmax
- Generate efficient memory layouts for multi-head processing
- Optimize the split and concatenation operations for heads
Self-attention is the core of Transformer architectures (BERT, GPT, Vision Transformers). JIT compilation provides 5-10x speedup by optimizing these complex operations.
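A sketch of how the export might be driven from calling code (the surrounding JIT pipeline is assumed; only the calls shown here come from this layer's API):

if (attention.SupportsJitCompilation)
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> outputNode = attention.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic input node; outputNode represents the
    // complete multi-head self-attention computation for the JIT compiler.
}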
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer parameters are not initialized.
Forward(Tensor<T>)
Performs the forward pass of the self-attention layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process, with shape [batchSize, sequenceLength, embeddingDimension].
Returns
- Tensor<T>
The output tensor after self-attention, with the same shape as the input.
Remarks
This method implements the forward pass of the self-attention layer. It transforms the input into queries, keys, and values, then computes attention scores between each position and all other positions. These scores are normalized using the softmax function and used to compute a weighted sum of the values. The result is transformed back to the original embedding dimension and passed through an activation function.
For Beginners: This method processes your sequence data through the self-attention mechanism.
During the forward pass:
- The input sequence is transformed into three different representations:
- Queries: What each position is looking for
- Keys: What each position has to offer
- Values: The actual content at each position
- For each position, attention scores are computed by comparing its query with all keys
- These scores are scaled and normalized to create attention weights
- Each position's output is a weighted sum of all values, based on the attention weights
- The result is transformed and passed through an activation function
Imagine a classroom where each student (position) asks a question (query) to the entire class. Other students offer answers (keys) and knowledge (values). Each student pays more attention to the most relevant answers and combines that knowledge to form their own understanding.
The multi-head mechanism allows this process to happen in parallel with different "perspectives" or types of questions.
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass using GPU-resident tensors.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensors.
Returns
- IGpuTensor<T>
A GPU-resident output tensor.
Remarks
This method performs the entire self-attention forward pass on the GPU without downloading intermediate results to CPU. All projections, attention computation, and bias addition remain GPU-resident for maximum performance.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the attention sparsity auxiliary loss.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about attention regularization.
Remarks
This method returns detailed diagnostics about attention sparsity regularization, including entropy loss, sparsity penalty, and configuration parameters. This information is useful for monitoring training progress and debugging attention patterns.
For Beginners: This provides information about how attention regularization is working.
The diagnostics include:
- Total entropy loss (how focused attention patterns are)
- Total sparsity loss (L1 penalty on attention weights)
- Weight applied to the regularization
- Whether regularization is enabled
- Number of attention heads
This helps you:
- Monitor if attention is becoming too diffuse or too sharp
- Debug issues with attention patterns
- Understand the impact of regularization on learning
You can use this information to adjust regularization weights for better results.
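For example, after a forward pass you might log the regularization diagnostics like this (a sketch; the exact dictionary keys depend on the implementation):

float auxiliaryLoss = attention.ComputeAuxiliaryLoss();
foreach (var entry in attention.GetAuxiliaryLossDiagnostics())
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}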
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Gets all trainable parameters of the self-attention layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters (query weights, key weights, value weights, and output biases).
Remarks
This method retrieves all trainable parameters of the self-attention layer as a single vector. The query weights are stored first, followed by the key weights, value weights, and finally the output biases. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from the self-attention layer.
The parameters:
- Are the weights and biases that the self-attention layer learns during training
- Control how the layer processes sequence information
- Are returned as a single list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
The query weights are stored first in the vector, followed by the key weights, value weights, and finally the output biases.
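A sketch of a simple parameter round-trip, e.g. for checkpointing (serialization of the vector itself is left out; see SetParameters(Vector<T>) below):

// Snapshot the current weights and biases.
Vector<float> parameters = attention.GetParameters();

// ... persist `parameters`, or copy them into another layer ...

// Restore them later; the vector length must match ParameterCount.
attention.SetParameters(parameters);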
ResetState()
Resets the internal state of the self-attention layer.
public override void ResetState()
Remarks
This method resets the internal state of the self-attention layer, including the cached inputs, outputs, attention scores from the forward pass, and the gradients from the backward pass. This is useful when starting to process a new batch of data.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- Stored inputs, outputs, and attention scores from previous calculations are cleared
- Calculated gradients for all weights and biases are cleared
- The layer forgets any information from previous batches
This is important for:
- Processing a new, unrelated batch of data
- Preventing information from one batch affecting another
- Managing memory usage efficiently
Since the self-attention layer caches quite a bit of information during the forward and backward passes, resetting the state helps prevent memory leaks and ensures each new sequence is processed independently.
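For example, between unrelated batches:

attention.ResetState(); // clears cached inputs, outputs, attention scores, and gradients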
SetParameters(Vector<T>)
Sets the trainable parameters of the self-attention layer.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters (query weights, key weights, value weights, and output biases) to set.
Remarks
This method sets the trainable parameters of the self-attention layer from a single vector. The vector should contain the query weight values first, followed by the key weight values, value weight values, and finally the output bias values. This is useful for loading saved model weights or for implementing optimization algorithms that operate on all parameters at once.
For Beginners: This method updates all the weights and biases in the self-attention layer.
When setting parameters:
- The input must be a vector with the correct total length
- The first part of the vector is used for the query weights
- The second part of the vector is used for the key weights
- The third part of the vector is used for the value weights
- The last part of the vector is used for the output biases
This is useful for:
- Loading a previously saved model
- Transferring parameters from another model
- Testing different parameter values
An error is thrown if the input vector doesn't have the expected number of parameters.
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the parameters of the self-attention layer using the calculated gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates the query weights, key weights, value weights, and output biases of the self-attention layer based on the gradients calculated during the backward pass. The learning rate controls the size of the parameter updates. This method should be called after the backward pass to apply the calculated updates.
For Beginners: This method updates the layer's internal values during training.
When updating parameters:
- The query weight values are adjusted based on their gradients
- The key weight values are adjusted based on their gradients
- The value weight values are adjusted based on their gradients
- The output bias values are adjusted based on their gradients
- The learning rate controls how big each update step is
These updates help the self-attention mechanism:
- Focus on more relevant relationships between positions
- Ignore irrelevant relationships
- Better understand the structure of your sequences
Smaller learning rates mean slower but more stable learning, while larger learning rates mean faster but potentially unstable learning.
Exceptions
- InvalidOperationException
Thrown when UpdateParameters is called before Backward.