Class TransformerEncoderLayer<T>
- Namespace
- AiDotNet.NeuralNetworks.Layers
- Assembly
- AiDotNet.dll
Represents a transformer encoder layer that processes sequences using self-attention and feed-forward networks.
public class TransformerEncoderLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> → TransformerEncoderLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Remarks
A transformer encoder layer is a fundamental building block of transformer-based models for sequence processing tasks. It consists of two main components: a self-attention mechanism that allows each position in a sequence to attend to all positions, and a feed-forward network that processes each position independently. Each component is followed by layer normalization and residual connections to facilitate training of deep networks.
For Beginners: This layer helps a neural network understand relationships between different elements in a sequence.
Think of it like a careful reader analyzing a paragraph:
- First, the reader looks at how each word relates to every other word (self-attention)
- Then, the reader processes this information to understand the meaning (feed-forward network)
For example, in the sentence "The animal didn't cross the street because it was too wide":
- The self-attention helps the network understand that "it" refers to "the street" (not "the animal")
- The feed-forward network processes this contextual information for each word
This architecture is powerful for tasks like understanding text, analyzing time series, or processing any data where the relationships between elements matter.
Constructors
TransformerEncoderLayer(int, int, int)
Initializes a new instance of the TransformerEncoderLayer<T> class.
public TransformerEncoderLayer(int embeddingSize, int numHeads, int feedForwardDim)
Parameters
embeddingSize (int): The size of the embeddings.
numHeads (int): The number of attention heads.
feedForwardDim (int): The dimension of the feed-forward network.
Remarks
This constructor creates a transformer encoder layer with the specified dimensions. It initializes the self-attention, layer normalization, and feed-forward sublayers with appropriate dimensions and activation functions.
For Beginners: This constructor creates a new transformer encoder layer with the specified settings.
The parameters you provide determine:
- embeddingSize: How rich the representation of each token is (more = more expressive)
- numHeads: How many different "perspectives" the attention mechanism can have
- feedForwardDim: How much processing capacity the feed-forward network has
These settings control the capacity, expressiveness, and computational requirements of the encoder. Typical values might be 512 for embedding size, 8 attention heads, and 2048 for the feed-forward dimension, similar to those used in the original transformer paper.
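As a minimal sketch, the following constructs an encoder layer with those typical dimensions, assuming float as the numeric type:

using AiDotNet.NeuralNetworks.Layers;

// A minimal sketch, assuming float as the numeric type T.
// These dimensions match the original transformer paper.
var encoder = new TransformerEncoderLayer<float>(
    embeddingSize: 512,
    numHeads: 8,
    feedForwardDim: 2048);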
Properties
AuxiliaryLossWeight
Gets or sets the weight for the auxiliary loss contribution.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
Remarks
This value determines how much the aggregated auxiliary losses contribute to the total loss. The default value of 0.005 provides a good balance between the main task and regularization.
For Beginners: This controls how much importance to give to the attention regularization.
The weight affects training:
- Higher values (e.g., 0.01) make the network prioritize better attention patterns more strongly
- Lower values (e.g., 0.001) make the regularization less important
- The default (0.005) works well for most transformer tasks
If your attention is collapsing (all heads learning the same thing), you might increase this value. If the main task is more important, you might decrease it.
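For instance, a hedged sketch of strengthening the regularization when attention heads collapse, assuming float for T and the encoder instance constructed above:

// Raise the auxiliary weight if all attention heads are learning
// the same patterns; lower it if the main task should dominate.
encoder.AuxiliaryLossWeight = 0.01f;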
ParameterCount
Gets the total number of trainable parameters in this layer.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This returns the sum of all parameters from sublayers: self-attention, layer norms, and feed-forward layers.
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this transformer encoder layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if all sublayers support JIT compilation.
Remarks
This property indicates whether the layer can be JIT compiled. As a composite layer, it supports JIT compilation only if all of its sublayers do:
- Multi-head self-attention layer
- Layer normalization layers
- Feed-forward layer
For Beginners: This tells you if this composite layer can use JIT compilation.
The transformer encoder layer can be JIT compiled if:
- All sublayers are properly initialized
- Each sublayer supports JIT compilation
Composite layer JIT optimization:
- Each sublayer can be independently JIT compiled
- Future optimization: fuse operations across sublayers
- Residual connections and layer norms are fast operations
The bottleneck in transformers is typically the attention mechanism (O(n²)), which benefits most from JIT compilation. The feed-forward networks are also computationally expensive (matrix multiplications).
BERT and other transformers stack 12-24 of these encoder layers, so optimizing each layer compounds into a significant speedup for the full model.
SupportsTraining
Gets a value indicating whether this layer supports training through backpropagation.
public override bool SupportsTraining { get; }
Property Value
- bool
true for this layer, as it contains trainable parameters.
Remarks
This property indicates whether the transformer encoder layer can be trained through backpropagation. Since this layer has trainable parameters in its sublayers, it supports training.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has internal values that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process
For this layer, the value is always true because it contains multiple sublayers with trainable parameters that need to be optimized during training.
UseAuxiliaryLoss
Gets or sets a value indicating whether auxiliary loss is enabled for this layer.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
Remarks
When enabled, the layer aggregates auxiliary losses from its sublayers, particularly the self-attention mechanism. This helps regularize attention patterns and prevent issues like attention collapse.
For Beginners: This setting controls whether the layer uses additional learning signals.
When enabled (true):
- The layer collects extra penalties from the self-attention mechanism
- This helps the attention heads learn diverse and focused patterns
- Training may be more stable and produce better results
When disabled (false):
- Only the main task loss is used for training
- This is the default setting
Methods
Backward(Tensor<T>)
Performs the backward pass of the transformer encoder layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backward pass of the transformer encoder layer, which is used during training to propagate error gradients back through the network. It computes gradients for each sublayer in reverse order of the forward pass, ensuring that residual connections are properly handled.
For Beginners: This method calculates how the layer's inputs should change to reduce errors.
During the backward pass, we go through the same steps as the forward pass, but in reverse order:
Final Layer Normalization:
- Compute how the normalization's input should change based on output errors
Feed-Forward Network:
- Determine how the feed-forward network's input should change
- Account for the residual connection by adding gradients
First Layer Normalization:
- Compute how the first normalization's input should change
Self-Attention:
- Determine how the self-attention's input should change
- Account for the residual connection
This reverse flow of gradients allows each component to learn how it contributed to any errors.
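As a hedged sketch of how Forward, Backward, and UpdateParameters fit together in one training step, assuming float for T, the encoder instance constructed earlier, and hypothetical helpers for batch loading and the loss gradient:

// GetBatch and ComputeLossGradient are hypothetical placeholders;
// only the encoder calls are part of this API.
Tensor<float> input = GetBatch();
Tensor<float> output = encoder.Forward(input);
Tensor<float> outputGradient = ComputeLossGradient(output);
Tensor<float> inputGradient = encoder.Backward(outputGradient);
encoder.UpdateParameters(0.001f); // learning rate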
BackwardGpu(IGpuTensor<T>)
Computes the gradient of the loss with respect to the input on the GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- IGpuTensor<T>
The gradient of the loss with respect to the layer's input.
ComputeAuxiliaryLoss()
Computes the auxiliary loss for this layer by aggregating losses from sublayers.
public T ComputeAuxiliaryLoss()
Returns
- T
The computed auxiliary loss value.
Remarks
This method computes the auxiliary loss by aggregating losses from sublayers that implement IAuxiliaryLossLayer. Currently, this includes the self-attention mechanism which provides attention entropy and head diversity regularization.
For Beginners: This method collects additional learning signals from the layer's components.
Auxiliary loss aggregation:
- Checks each sublayer to see if it has auxiliary losses
- Collects those losses and combines them
- Returns the total for use in training
Why this is useful:
- The self-attention mechanism can benefit from regularization to prevent all heads from learning the same patterns
- Aggregating losses at the encoder level provides a unified view of attention quality
- This helps the entire encoder learn better representations
Example: If the self-attention has an entropy loss (to keep attention focused) and a diversity loss (to prevent heads from being redundant), this method adds them together and returns the total.
The aggregated loss helps ensure:
- Attention heads learn diverse patterns
- Attention is focused rather than diffuse
- The encoder uses its capacity efficiently
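A hedged sketch of combining the auxiliary loss with the main task loss, assuming float for T, a mainLoss value computed elsewhere, and that the returned auxiliary loss is unweighted (whether ComputeAuxiliaryLoss applies AuxiliaryLossWeight internally should be verified):

encoder.UseAuxiliaryLoss = true;
float auxLoss = encoder.ComputeAuxiliaryLoss();
// Assumption: the returned value is unweighted, so the caller scales it.
float totalLoss = mainLoss + encoder.AuxiliaryLossWeight * auxLoss;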
ExportComputationGraph(List<ComputationNode<T>>)
Exports the transformer encoder layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): The list to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the transformer encoder operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node
2. Applies multi-head self-attention with a residual connection and layer normalization
3. Applies the feed-forward network with a residual connection and layer normalization
4. Returns the final output
For Beginners: This method builds a symbolic representation of a transformer encoder layer for JIT.
The transformer encoder layer is a composite layer combining:
- Multi-head self-attention (captures relationships between positions)
- Layer normalization (stabilizes training)
- Feed-forward network (processes each position independently)
- Residual connections (helps gradient flow in deep networks)
The forward pass:
- x' = LayerNorm(x + MultiHeadAttention(x))
- output = LayerNorm(x' + FeedForward(x'))
JIT optimization for composite layers:
- For now, composite layers note their structure but may delegate to sublayers
- Future optimization could fuse operations across sublayers
- Each sublayer (attention, feed-forward, norm) can be independently JIT compiled
This is the core building block of BERT, which stacks 12-24 of these encoder layers; GPT uses decoder layers instead.
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when sublayers are not initialized.
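A minimal call sketch, assuming float for T and an initialized encoder; the downstream JIT compilation pipeline is not shown:

using System.Collections.Generic;

if (encoder.SupportsJitCompilation)
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> graphOutput = encoder.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic input node created by the method;
    // graphOutput represents the complete encoder operation.
}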
Forward(Tensor<T>)
Performs the forward pass of the transformer encoder layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process.
Returns
- Tensor<T>
The output tensor after self-attention, the feed-forward network, residual connections, and layer normalization have been applied.
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass using GPU-resident tensors.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensors.
Returns
- IGpuTensor<T>
A GPU-resident output tensor.
Remarks
This method performs the entire transformer encoder forward pass on the GPU without downloading intermediate results to CPU. All sublayer operations (self-attention, layer normalization, feed-forward networks, residual connections) remain GPU-resident for maximum performance.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the auxiliary loss computation.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about the auxiliary loss.
Remarks
This method returns diagnostic information that can be used to monitor the auxiliary loss during training. The diagnostics include the total auxiliary loss, the weight applied to it, whether auxiliary loss is enabled, and detailed diagnostics from sublayers.
For Beginners: This method provides information to help you understand how the auxiliary loss is working.
The diagnostics show:
- TotalAuxiliaryLoss: The combined penalty from all sublayers
- AuxiliaryWeight: How much this penalty affects the overall training
- UseAuxiliaryLoss: Whether this penalty is currently enabled
- SelfAttentionDiagnostics: Detailed information from the self-attention mechanism
You can use this information to:
- Monitor if attention patterns are healthy (diverse and focused)
- Debug training issues related to attention
- Understand how the encoder is learning
Example: If you see that attention entropy is very high, it might mean attention is too diffuse. If head diversity is very low, it might mean all heads are learning the same thing and capacity is wasted.
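A brief sketch of dumping these diagnostics during training, assuming an existing encoder instance:

using System;

foreach (var entry in encoder.GetAuxiliaryLossDiagnostics())
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}
// Expect keys such as TotalAuxiliaryLoss, AuxiliaryWeight,
// UseAuxiliaryLoss, and SelfAttentionDiagnostics.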
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Gets all trainable parameters of the transformer encoder layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters from all sublayers.
Remarks
This method retrieves all trainable parameters from all sublayers of the transformer encoder layer and combines them into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from all parts of the encoder.
The parameters:
- Are the numbers that the neural network learns during training
- Include weights from attention mechanisms, normalization layers, and the feed-forward network
- Are combined into a single long list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
A transformer encoder layer typically has millions of parameters, all of which contribute to its ability to understand complex sequences.
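A sketch of snapshotting the learned parameters, assuming float for T; the actual persistence mechanism is left to the caller:

Vector<float> parameters = encoder.GetParameters();
Console.WriteLine($"Total parameters: {encoder.ParameterCount}");
// Persist the vector with your serialization of choice; restoring it
// would go through the corresponding weight-loading path (not shown here).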
ResetState()
Resets the internal state of the transformer encoder layer and all its sublayers.
public override void ResetState()
Remarks
This method resets the internal state of the transformer encoder layer and all its sublayers. It delegates the reset operation to each sublayer, ensuring that any cached state is cleared.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- All sublayers are reset to their initial condition
- Any cached information from previous processing is cleared
- The layer is ready to process new, unrelated sequences
This is important for:
- Processing a new, unrelated sequence
- Starting a new training episode
- Testing the layer with fresh inputs
Think of it like clearing your mind before starting a completely new task, ensuring no information from previous tasks affects your current thinking.
UpdateParameters(T)
Updates the parameters of all sublayers using the calculated gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for parameter updates.
Remarks
This method updates the parameters of all sublayers in the transformer encoder layer based on the gradients calculated during the backward pass. It delegates the update process to each sublayer, passing the learning rate.
For Beginners: This method adjusts all the internal values of the layer to improve its performance.
During parameter updates:
- The learning rate controls how big each adjustment is
- Every sublayer gets updated based on what was learned in the backward pass
- This helps the entire encoder layer gradually improve its performance
Think of it like fine-tuning all the components of the encoder based on feedback:
- The self-attention mechanism learns to focus on more relevant relationships
- The feed-forward network learns to better transform the information
- The normalization layers learn to keep values in the optimal range
UpdateParametersGpu(IGpuOptimizerConfig)
Updates layer parameters using GPU-resident optimizer.
public override void UpdateParametersGpu(IGpuOptimizerConfig config)
Parameters
config (IGpuOptimizerConfig): The GPU optimizer configuration.
Remarks
This method delegates to each sublayer's UpdateParametersGpu method. All sublayers (self-attention, layer norms, feed-forward) are updated.