Class TransformerEncoderLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Represents a transformer encoder layer that processes sequences using self-attention and feed-forward networks.

public class TransformerEncoderLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LayerBase<T>
TransformerEncoderLayer<T>
Implements
ILayer<T>
IJitCompilable<T>
IWeightLoadable<T>
IDisposable
IAuxiliaryLossLayer<T>
IDiagnosticsProvider

Remarks

A transformer encoder layer is a fundamental building block of transformer-based models for sequence processing tasks. It consists of two main components: a self-attention mechanism that allows each position in a sequence to attend to all positions, and a feed-forward network that processes each position independently. Each component is wrapped in a residual connection and followed by layer normalization, which makes deep stacks of these layers easier to train.

For Beginners: This layer helps a neural network understand relationships between different elements in a sequence.

Think of it like a careful reader analyzing a paragraph:

  • First, the reader looks at how each word relates to every other word (self-attention)
  • Then, the reader processes this information to understand the meaning (feed-forward network)

For example, in the sentence "The animal didn't cross the street because it was too wide":

  • The self-attention helps the network understand that "it" refers to "the street" (not "the animal")
  • The feed-forward network processes this contextual information for each word

This architecture is powerful for tasks like understanding text, analyzing time series, or processing any data where the relationships between elements matter.

Constructors

TransformerEncoderLayer(int, int, int)

Initializes a new instance of the TransformerEncoderLayer<T> class.

public TransformerEncoderLayer(int embeddingSize, int numHeads, int feedForwardDim)

Parameters

embeddingSize int

The size of the embeddings.

numHeads int

The number of attention heads.

feedForwardDim int

The dimension of the feed-forward network.

Remarks

This constructor creates a transformer encoder layer with the specified dimensions. It initializes the self-attention, layer normalization, and feed-forward sublayers with appropriate dimensions and activation functions.

For Beginners: This constructor creates a new transformer encoder layer with the specified settings.

The parameters you provide determine:

  • embeddingSize: How rich the representation of each token is (more = more expressive)
  • numHeads: How many different "perspectives" the attention mechanism can have
  • feedForwardDim: How much processing capacity the feed-forward network has

These settings control the capacity, expressiveness, and computational requirements of the encoder. Typical values might be 512 for embedding size, 8 attention heads, and 2048 for the feed-forward dimension, similar to those used in the original transformer paper.
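As a usage illustration, the following minimal sketch constructs a layer with the values mentioned above. It assumes the project references the AiDotNet package and picks float for T; both are illustrative choices. Only the constructor and properties documented on this page are used.

```csharp
using System;
using AiDotNet.NeuralNetworks.Layers;

// Settings mentioned in the remarks: 512-dim embeddings, 8 heads, 2048-dim feed-forward.
// "using var" disposes the layer at the end of the scope, since the class implements IDisposable.
using var layer = new TransformerEncoderLayer<float>(
    embeddingSize: 512,
    numHeads: 8,
    feedForwardDim: 2048);

// Read back two documented properties.
Console.WriteLine($"Trainable parameters: {layer.ParameterCount}");
Console.WriteLine($"Supports training:    {layer.SupportsTraining}");
```

In multi-head attention the embedding is usually split evenly across heads, so choosing an embeddingSize that is divisible by numHeads (here 512 / 8 = 64 per head) is a safe default. Whether this constructor enforces that is not stated on this page.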

Properties

AuxiliaryLossWeight

Gets or sets the weight for the auxiliary loss contribution.

public T AuxiliaryLossWeight { get; set; }

Property Value

T

Remarks

This value determines how much the aggregated auxiliary losses contribute to the total loss. The default value of 0.005 provides a good balance between the main task and regularization.

For Beginners: This controls how much importance to give to the attention regularization.

The weight affects training:

  • Higher values (e.g., 0.01) make the network prioritize better attention patterns more strongly
  • Lower values (e.g., 0.001) make the regularization less important
  • The default (0.005) works well for most transformer tasks

If your attention is collapsing (all heads learning the same thing), you might increase this value. If the main task is more important, you might decrease it.
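To make the effect of the weight concrete, here is a small self-contained sketch. It assumes the common rule total = main + weight × auxiliary; this page does not say how AiDotNet combines the terms, or whether ComputeAuxiliaryLoss() already applies the weight, so CombineLosses is a hypothetical helper and the loss values are made up.

```csharp
using System;

// Hypothetical combination rule (an assumption, not confirmed library behavior):
// total = main + weight * auxiliary when the auxiliary loss is enabled.
static double CombineLosses(double mainLoss, double auxiliaryLoss, double auxiliaryWeight, bool useAuxiliaryLoss)
    => useAuxiliaryLoss ? mainLoss + auxiliaryWeight * auxiliaryLoss : mainLoss;

double taskLoss = 2.0;   // made-up main task loss
double auxLoss = 0.8;    // made-up aggregated auxiliary loss

// Compare the three weights discussed in the remarks.
foreach (double weight in new[] { 0.001, 0.005, 0.01 })
{
    double total = CombineLosses(taskLoss, auxLoss, weight, useAuxiliaryLoss: true);
    Console.WriteLine($"weight={weight}: total={total:F4}");
}
```

Under this rule the auxiliary term stays a small correction to the main loss at all three weights, which is why modest changes to the weight mostly affect attention behavior rather than the main task.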

ParameterCount

Gets the total number of trainable parameters in this layer.

public override int ParameterCount { get; }

Property Value

int

Remarks

This returns the sum of all parameters from sublayers: self-attention, layer norms, and feed-forward layers.
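For a rough sense of scale, the sketch below counts parameters for a textbook encoder layer: four attention projections with biases, a two-layer feed-forward network with biases, and two layer norms with a scale and shift vector each. This layout is an assumption; the exact figure reported by ParameterCount depends on how AiDotNet's sublayers are implemented and may differ. EstimateEncoderLayerParameters is a hypothetical helper.

```csharp
using System;

// Hypothetical helper: parameter count for a textbook encoder layer (with biases).
static long EstimateEncoderLayerParameters(int embeddingSize, int feedForwardDim)
{
    long d = embeddingSize;
    long f = feedForwardDim;

    long attention = 4 * (d * d + d);             // Q, K, V, and output projections
    long feedForward = (d * f + f) + (f * d + d); // expand to f, then project back to d
    long layerNorms = 2 * (2 * d);                // two norms, each with scale and shift

    return attention + feedForward + layerNorms;
}

Console.WriteLine(EstimateEncoderLayerParameters(512, 2048)); // prints 3152384
```

Note that the number of heads does not appear in the estimate: in the standard formulation the heads partition the same projection matrices rather than adding new ones.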

SupportsGpuExecution

Gets a value indicating whether this layer supports GPU execution.

protected override bool SupportsGpuExecution { get; }

Property Value

bool

SupportsJitCompilation

Gets whether this transformer encoder layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

true if all sublayers support JIT compilation; otherwise, false.

Remarks

This property indicates whether the layer can be JIT compiled. As a composite layer, it supports JIT if all its sublayers support JIT:

  • Multi-head self-attention layer
  • Layer normalization layers
  • Feed-forward layer

For Beginners: This tells you if this composite layer can use JIT compilation.

The transformer encoder layer can be JIT compiled if:

  • All sublayers are properly initialized
  • Each sublayer supports JIT compilation

Composite layer JIT optimization:

  • Each sublayer can be independently JIT compiled
  • Future optimization: fuse operations across sublayers
  • Residual connections and layer norms are fast operations

The bottleneck in transformers is typically the attention mechanism, whose cost grows as O(n²) in the sequence length n, so it benefits most from JIT compilation. The feed-forward networks are also computationally expensive because they are dominated by matrix multiplications.

BERT and other transformers stack 12-24 of these encoder layers, so a speedup in each layer compounds into a significant speedup for the full model.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

true for this layer, as it contains trainable parameters.

Remarks

This property indicates whether the transformer encoder layer can be trained through backpropagation. Since this layer has trainable parameters in its sublayers, it supports training.

For Beginners: This property tells you if the layer can learn from data.

A value of true means:

  • The layer has internal values that can be adjusted during training
  • It will improve its performance as it sees more data
  • It participates in the learning process

For this layer, the value is always true because it contains multiple sublayers with trainable parameters that need to be optimized during training.

UseAuxiliaryLoss

Gets or sets a value indicating whether auxiliary loss is enabled for this layer.

public bool UseAuxiliaryLoss { get; set; }

Property Value

bool

Remarks

When enabled, the layer aggregates auxiliary losses from its sublayers, particularly the self-attention mechanism. This helps regularize attention patterns and prevent issues like attention collapse.

For Beginners: This setting controls whether the layer uses additional learning signals.

When enabled (true):

  • The layer collects extra penalties from the self-attention mechanism
  • This helps the attention heads learn diverse and focused patterns
  • Training may be more stable and produce better results

When disabled (false):

  • Only the main task loss is used for training
  • This is the default setting

Methods

Backward(Tensor<T>)

Performs the backward pass of the transformer encoder layer.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient of the loss with respect to the layer's output.

Returns

Tensor<T>

The gradient of the loss with respect to the layer's input.

Remarks

This method implements the backward pass of the transformer encoder layer, which is used during training to propagate error gradients back through the network. It computes gradients for each sublayer in reverse order of the forward pass, ensuring that residual connections are properly handled.

For Beginners: This method calculates how the layer's inputs should change to reduce errors.

During the backward pass, we go through the same steps as the forward pass, but in reverse order:

  1. Final Layer Normalization:

    • Compute how the normalization's input should change based on output errors
  2. Feed-Forward Network:

    • Determine how the feed-forward network's input should change
    • Account for the residual connection by adding gradients
  3. First Layer Normalization:

    • Compute how the first normalization's input should change
  4. Self-Attention:

    • Determine how the self-attention's input should change
    • Account for the residual connection

This reverse flow of gradients allows each component to learn how it contributed to any errors.
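The "adding gradients" step for a residual connection can be checked on a scalar toy problem. The sketch below is not the library's implementation: it uses tanh as a stand-in sublayer, omits layer normalization, and compares the analytic gradient of y = x + f(x) against a finite-difference estimate.

```csharp
using System;

// Toy sublayer standing in for attention or feed-forward: f(x) = tanh(x).
static double Sublayer(double x) => Math.Tanh(x);

// Gradient of the sublayer output with respect to its input, scaled by the upstream gradient.
static double SublayerBackward(double x, double upstream)
    => upstream * (1.0 - Math.Tanh(x) * Math.Tanh(x));

// Residual block forward pass: y = x + f(x).
static double ResidualForward(double x) => x + Sublayer(x);

// Residual block backward pass: the skip path passes the upstream gradient through
// unchanged, and the sublayer path adds its own contribution.
static double ResidualBackward(double x, double upstream)
    => upstream + SublayerBackward(x, upstream);

double x0 = 0.5, upstreamGradient = 1.0, step = 1e-6;
double analytic = ResidualBackward(x0, upstreamGradient);
double numeric = (ResidualForward(x0 + step) - ResidualForward(x0 - step)) / (2 * step);
Console.WriteLine($"analytic={analytic:F6}, numeric={numeric:F6}");
```

The two values agree to several decimal places, which is the scalar version of why the backward pass must sum the gradient from the skip path and the gradient from the sublayer.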

BackwardGpu(IGpuTensor<T>)

Computes the gradient of the loss with respect to the input on the GPU.

public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)

Parameters

outputGradient IGpuTensor<T>

The gradient of the loss with respect to the layer's output.

Returns

IGpuTensor<T>

The gradient of the loss with respect to the layer's input.

ComputeAuxiliaryLoss()

Computes the auxiliary loss for this layer by aggregating losses from sublayers.

public T ComputeAuxiliaryLoss()

Returns

T

The computed auxiliary loss value.

Remarks

This method computes the auxiliary loss by aggregating losses from sublayers that implement IAuxiliaryLossLayer. Currently, this includes the self-attention mechanism which provides attention entropy and head diversity regularization.

For Beginners: This method collects additional learning signals from the layer's components.

Auxiliary loss aggregation:

  • Checks each sublayer to see if it has auxiliary losses
  • Collects those losses and combines them
  • Returns the total for use in training

Why this is useful:

  • The self-attention mechanism can benefit from regularization to prevent all heads from learning the same patterns
  • Aggregating losses at the encoder level provides a unified view of attention quality
  • This helps the entire encoder learn better representations

Example: If the self-attention has an entropy loss (to keep attention focused) and a diversity loss (to prevent heads from being redundant), this method adds them together and returns the total.

The aggregated loss helps ensure:

  • Attention heads learn diverse patterns
  • Attention is focused rather than diffuse
  • The encoder uses its capacity efficiently
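To illustrate what "focused rather than diffuse" means numerically, the sketch below computes the Shannon entropy of a single attention row. It is a generic illustration: how AiDotNet defines its entropy and diversity terms is not specified on this page, and AttentionEntropy is a hypothetical helper.

```csharp
using System;
using System.Linq;

// Shannon entropy (in nats) of one attention row, i.e. a probability
// distribution over sequence positions. Zero weights contribute nothing.
static double AttentionEntropy(double[] weights)
    => -weights.Where(w => w > 0).Sum(w => w * Math.Log(w));

double[] focused = { 0.97, 0.01, 0.01, 0.01 }; // almost all weight on one position
double[] diffuse = { 0.25, 0.25, 0.25, 0.25 }; // weight spread evenly

Console.WriteLine($"focused: {AttentionEntropy(focused):F3}");
Console.WriteLine($"diffuse: {AttentionEntropy(diffuse):F3}"); // ln(4), the maximum for 4 positions
```

A focused row has entropy close to zero, while a uniform row reaches the maximum ln(n) for n positions, so an entropy penalty pushes attention toward fewer, more relevant positions.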

ExportComputationGraph(List<ComputationNode<T>>)

Exports the transformer encoder layer as a computation graph for JIT compilation.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to which the input node will be added.

Returns

ComputationNode<T>

The output computation node representing the transformer encoder operation.

Remarks

This method creates a symbolic computation graph for JIT compilation:

  1. Creates a symbolic input node
  2. Applies multi-head self-attention with residual connection and norm
  3. Applies feed-forward network with residual connection and norm
  4. Returns the final output

For Beginners: This method builds a symbolic representation of a transformer encoder layer for JIT.

The transformer encoder layer is a composite layer combining:

  • Multi-head self-attention (captures relationships between positions)
  • Layer normalization (stabilizes training)
  • Feed-forward network (processes each position independently)
  • Residual connections (helps gradient flow in deep networks)

The forward pass:

  1. x' = LayerNorm(x + MultiHeadAttention(x))
  2. output = LayerNorm(x' + FeedForward(x'))
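The two steps above can be sketched on a single position's feature vector. This is a simplified illustration under stated assumptions, not the library's code: the layer norm has no learned scale or shift, and the two sublayers are placeholder functions rather than real attention or feed-forward networks.

```csharp
using System;
using System.Linq;

// Layer normalization over one feature vector (no learned scale/shift in this sketch).
static double[] LayerNorm(double[] v, double epsilon = 1e-5)
{
    double mean = v.Average();
    double variance = v.Sum(e => (e - mean) * (e - mean)) / v.Length;
    double scale = 1.0 / Math.Sqrt(variance + epsilon);
    return v.Select(e => (e - mean) * scale).ToArray();
}

// One post-norm sub-block: LayerNorm(x + sublayer(x)).
static double[] AddAndNorm(double[] x, Func<double[], double[]> sublayer)
{
    double[] y = sublayer(x);
    return LayerNorm(x.Zip(y, (a, b) => a + b).ToArray());
}

// Placeholder sublayers (assumptions for illustration only).
Func<double[], double[]> selfAttention = v => v.Select(e => 0.5 * e).ToArray();
Func<double[], double[]> feedForward = v => v.Select(e => Math.Max(0.0, e)).ToArray();

double[] input = { 1.0, 2.0, 3.0, 4.0 };
double[] hidden = AddAndNorm(input, selfAttention);  // step 1: x' = LayerNorm(x + MultiHeadAttention(x))
double[] output = AddAndNorm(hidden, feedForward);   // step 2: output = LayerNorm(x' + FeedForward(x'))

Console.WriteLine(string.Join(", ", output.Select(e => e.ToString("F3"))));
```

Because normalization is applied after each residual sum, the output of every sub-block has approximately zero mean and unit variance across features, regardless of the scale of the sublayer's output.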

JIT optimization for composite layers:

  • For now, composite layers note their structure but may delegate to sublayers
  • Future optimization could fuse operations across sublayers
  • Each sublayer (attention, feed-forward, norm) can be independently JIT compiled

This is the core building block of encoder-only models such as BERT, which stacks 12-24 of these encoder layers. GPT-style models use decoder layers instead.

Exceptions

ArgumentNullException

Thrown when inputNodes is null.

InvalidOperationException

Thrown when sublayers are not initialized.

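Forward(Tensor<T>)

Performs the forward pass of the transformer encoder layer.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor to process.

Returns

Tensor<T>

The output tensor after self-attention, the feed-forward network, residual connections, and layer normalization.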

ForwardGpu(params IGpuTensor<T>[])

Performs the forward pass using GPU-resident tensors.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

The GPU-resident input tensors.

Returns

IGpuTensor<T>

A GPU-resident output tensor.

Remarks

This method performs the entire transformer encoder forward pass on the GPU without downloading intermediate results to CPU. All sublayer operations (self-attention, layer normalization, feed-forward networks, residual connections) remain GPU-resident for maximum performance.

GetAuxiliaryLossDiagnostics()

Gets diagnostic information about the auxiliary loss computation.

public Dictionary<string, string> GetAuxiliaryLossDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic information about the auxiliary loss.

Remarks

This method returns diagnostic information that can be used to monitor the auxiliary loss during training. The diagnostics include the total auxiliary loss, the weight applied to it, whether auxiliary loss is enabled, and detailed diagnostics from sublayers.

For Beginners: This method provides information to help you understand how the auxiliary loss is working.

The diagnostics show:

  • TotalAuxiliaryLoss: The combined penalty from all sublayers
  • AuxiliaryWeight: How much this penalty affects the overall training
  • UseAuxiliaryLoss: Whether this penalty is currently enabled
  • SelfAttentionDiagnostics: Detailed information from the self-attention mechanism

You can use this information to:

  • Monitor if attention patterns are healthy (diverse and focused)
  • Debug training issues related to attention
  • Understand how the encoder is learning

Example: If you see that attention entropy is very high, it might mean attention is too diffuse (spread almost evenly across all positions). If head diversity is very low, it might mean all heads are learning the same thing and capacity is wasted.
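Since the method returns a plain Dictionary<string, string>, logging it needs no library-specific code. The sketch below formats any such dictionary as aligned lines. FormatDiagnostics is a hypothetical helper, and the sample entries are made up; the names listed above are assumed here to be the dictionary keys, which this page does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Formats a diagnostics dictionary as aligned "key : value" lines, sorted by key.
static string FormatDiagnostics(IReadOnlyDictionary<string, string> diagnostics)
{
    if (diagnostics.Count == 0) return "(no diagnostics)";
    int width = diagnostics.Keys.Max(k => k.Length);
    return string.Join(Environment.NewLine,
        diagnostics.OrderBy(kv => kv.Key, StringComparer.Ordinal)
                   .Select(kv => $"{kv.Key.PadRight(width)} : {kv.Value}"));
}

// Made-up sample values; real entries would come from layer.GetAuxiliaryLossDiagnostics().
var sample = new Dictionary<string, string>
{
    ["TotalAuxiliaryLoss"] = "0.0123",
    ["AuxiliaryWeight"] = "0.005",
    ["UseAuxiliaryLoss"] = "True",
};

Console.WriteLine(FormatDiagnostics(sample));
```

Printing this once per epoch, or every few hundred steps, is usually enough to spot trends in the auxiliary loss without flooding the training log.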

GetDiagnostics()

Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.

public override Dictionary<string, string> GetDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().

GetParameters()

Gets all trainable parameters of the transformer encoder layer as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all trainable parameters from all sublayers.

Remarks

This method retrieves all trainable parameters from all sublayers of the transformer encoder layer and combines them into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.

For Beginners: This method collects all the learnable values from all parts of the encoder.

The parameters:

  • Are the numbers that the neural network learns during training
  • Include weights from attention mechanisms, normalization layers, and the feed-forward network
  • Are combined into a single long list (vector)

This is useful for:

  • Saving the model to disk
  • Loading parameters from a previously trained model
  • Advanced optimization techniques that need access to all parameters

A transformer encoder layer typically has millions of parameters, all of which contribute to its ability to understand complex sequences.

ResetState()

Resets the internal state of the transformer encoder layer and all its sublayers.

public override void ResetState()

Remarks

This method resets the internal state of the transformer encoder layer and all its sublayers. It delegates the reset operation to each sublayer, ensuring that any cached state is cleared.

For Beginners: This method clears the layer's memory to start fresh.

When resetting the state:

  • All sublayers are reset to their initial condition
  • Any cached information from previous processing is cleared
  • The layer is ready to process new, unrelated sequences

This is important for:

  • Processing a new, unrelated sequence
  • Starting a new training episode
  • Testing the layer with fresh inputs

Think of it like clearing your mind before starting a completely new task, ensuring no information from previous tasks affects your current thinking.

UpdateParameters(T)

Updates the parameters of all sublayers using the calculated gradients.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate to use for parameter updates.

Remarks

This method updates the parameters of all sublayers in the transformer encoder layer based on the gradients calculated during the backward pass. It delegates the update process to each sublayer, passing the learning rate.

For Beginners: This method adjusts all the internal values of the layer to improve its performance.

During parameter updates:

  • The learning rate controls how big each adjustment is
  • Every sublayer gets updated based on what was learned in the backward pass
  • This helps the entire encoder layer gradually improve its performance

Think of it like fine-tuning all the components of the encoder based on feedback:

  • The self-attention mechanism learns to focus on more relevant relationships
  • The feed-forward network learns to better transform the information
  • The normalization layers learn to keep values in the optimal range
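As a rough picture of what a learning-rate-based update does, the sketch below applies a plain gradient-descent step to an array of parameters. This is an assumption for illustration: the page only says each sublayer is updated with the given learning rate, not which update rule the sublayers use. GradientDescentStep is a hypothetical helper and the numbers are made up.

```csharp
using System;

// Plain gradient-descent step, applied in place: p <- p - learningRate * gradient.
static void GradientDescentStep(double[] parameters, double[] gradients, double learningRate)
{
    if (parameters.Length != gradients.Length)
        throw new ArgumentException("Parameter and gradient counts must match.");

    for (int i = 0; i < parameters.Length; i++)
        parameters[i] -= learningRate * gradients[i];
}

double[] weights = { 0.50, -0.20 };  // made-up parameter values
double[] grads = { 0.10, -0.40 };    // made-up gradients from a backward pass

GradientDescentStep(weights, grads, learningRate: 0.1);
Console.WriteLine(string.Join(", ", weights));
```

Each parameter moves against its gradient, and the learning rate scales the size of the move: a smaller rate gives smaller, more cautious adjustments.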

UpdateParametersGpu(IGpuOptimizerConfig)

Updates the layer's parameters using a GPU-resident optimizer.

public override void UpdateParametersGpu(IGpuOptimizerConfig config)

Parameters

config IGpuOptimizerConfig

The GPU optimizer configuration.

Remarks

This method delegates to each sublayer's UpdateParametersGpu method. All sublayers (self-attention, layer norms, feed-forward) are updated.