Class TransformerDecoderLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Represents a transformer decoder layer that processes sequences using self-attention, cross-attention, and feed-forward networks.

public class TransformerDecoderLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LayerBase<T>
TransformerDecoderLayer<T>

Implements
ILayer<T>
IJitCompilable<T>
IWeightLoadable<T>
IDisposable
IAuxiliaryLossLayer<T>
IDiagnosticsProvider

Remarks

A transformer decoder layer is a fundamental building block of transformer-based models for sequence-to-sequence tasks. It consists of three main components: a masked self-attention mechanism that processes the target sequence, a cross-attention mechanism that attends to the encoder's output, and a feed-forward network for additional transformation. Each component is followed by layer normalization and residual connections to facilitate training of deep networks.

For Beginners: This layer helps the network generate sequences while considering both what it has generated so far and input from another source.

Think of it like a writer who is translating a book:

  • First, the writer looks at what they've translated so far to maintain consistency (self-attention)
  • Then they look at the original text to understand what to translate next (cross-attention)
  • Finally, they process all this information to produce the next part of the translation (feed-forward network)

For example, in machine translation, the decoder generates each word of the target language by:

  • Looking at the words it has already generated (to maintain grammatical coherence)
  • Looking at the encoded source sentence (to understand what content to translate)
  • Combining this information to produce the most appropriate next word

This architecture is powerful for tasks like translation, summarization, and text generation.
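
Example

To make this concrete, here is a minimal usage sketch. Only members documented on this page are taken as given; the Tensor<T> constructor and the tensor shapes are illustrative assumptions.

using AiDotNet.NeuralNetworks.Layers;

// ffnActivation is named to select the scalar-activation constructor;
// null keeps the documented GELU default.
var decoder = new TransformerDecoderLayer<float>(
    embeddingSize: 512, numHeads: 8, feedForwardDim: 2048,
    sequenceLength: 128, ffnActivation: null);

// Hypothetical shapes: [batch, sequenceLength, embeddingSize].
// The Tensor<float> constructor used here is an assumption for illustration.
var decoderInput = new Tensor<float>(new[] { 1, 128, 512 });  // target so far
var encoderOutput = new Tensor<float>(new[] { 1, 128, 512 }); // from the encoder

// The decoder requires BOTH inputs; the single-input Forward overload throws.
Tensor<float> output = decoder.Forward(decoderInput, encoderOutput);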

Constructors

TransformerDecoderLayer(int, int, int, int, IActivationFunction<T>?, IEngine?)

Initializes a new instance of the TransformerDecoderLayer<T> class with scalar activation function.

public TransformerDecoderLayer(int embeddingSize = 512, int numHeads = 8, int feedForwardDim = 2048, int sequenceLength = 512, IActivationFunction<T>? ffnActivation = null, IEngine? engine = null)

Parameters

embeddingSize int

The size of the embeddings. Default is 512.

numHeads int

The number of attention heads. Default is 8.

feedForwardDim int

The dimension of the feed-forward network. Default is 2048.

sequenceLength int

The maximum sequence length. Default is 512.

ffnActivation IActivationFunction<T>

The activation function for the feed-forward network. Default is GELU.

engine IEngine

Remarks

This constructor creates a transformer decoder layer with the specified dimensions and a scalar activation function for the feed-forward network. It initializes all the sublayers needed for the transformer decoder architecture.

For Beginners: This constructor creates a new transformer decoder layer with standard settings.

The parameters you provide determine:

  • embeddingSize: How rich the representation of each token is (more = more expressive)
  • numHeads: How many different "perspectives" the attention mechanism can have
  • feedForwardDim: How much processing capacity the feed-forward network has
  • sequenceLength: The maximum number of tokens the model can process
  • ffnActivation: The mathematical function used in the feed-forward network

These settings control the capacity, expressiveness, and computational requirements of the decoder. The default values (512 embedding size, 8 attention heads, 2048 feed-forward dimension) match the base model in the original transformer paper ("Attention Is All You Need", Vaswani et al., 2017) and work well for many language tasks.
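
Example

A construction sketch with smaller, illustrative sizes; passing ffnActivation: null keeps the documented GELU default and selects this scalar-activation overload.

var smallDecoder = new TransformerDecoderLayer<float>(
    embeddingSize: 256,   // leaner token representation
    numHeads: 4,          // fewer attention "perspectives"
    feedForwardDim: 1024, // reduced feed-forward capacity
    sequenceLength: 256,  // shorter maximum sequence
    ffnActivation: null);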

TransformerDecoderLayer(int, int, int, int, IVectorActivationFunction<T>?, IEngine?)

Initializes a new instance of the TransformerDecoderLayer<T> class with vector activation function.

public TransformerDecoderLayer(int embeddingSize = 512, int numHeads = 8, int feedForwardDim = 2048, int sequenceLength = 512, IVectorActivationFunction<T>? ffnVectorActivation = null, IEngine? engine = null)

Parameters

embeddingSize int

The size of the embeddings. Default is 512.

numHeads int

The number of attention heads. Default is 8.

feedForwardDim int

The dimension of the feed-forward network. Default is 2048.

sequenceLength int

The maximum sequence length. Default is 512.

ffnVectorActivation IVectorActivationFunction<T>

The vector activation function for the feed-forward network. Default is GELU.

engine IEngine

Remarks

This constructor creates a transformer decoder layer with the specified dimensions and a vector activation function for the feed-forward network. It initializes all the sublayers needed for the transformer decoder architecture.

For Beginners: This constructor is similar to the previous one, but uses vector activations.

Vector activations:

  • Process entire groups of numbers at once, rather than one at a time
  • Can capture relationships between different elements
  • Allow for more complex transformations

This version is useful when you need more sophisticated processing that considers how different features relate to each other, rather than treating each feature independently.

Properties

AuxiliaryLossWeight

Gets or sets the weight for the auxiliary loss contribution.

public T AuxiliaryLossWeight { get; set; }

Property Value

T

Remarks

This value determines how much the aggregated auxiliary losses contribute to the total loss. The default value of 0.005 provides a good balance between the main task and regularization.

For Beginners: This controls how much importance to give to the attention regularization.

The weight affects training:

  • Higher values (e.g., 0.01) make the network prioritize better attention patterns more strongly
  • Lower values (e.g., 0.001) make the regularization less important
  • The default (0.005) works well for most transformer tasks

If your attention is collapsing (all heads learning the same thing), you might increase this value. If the main task is more important, you might decrease it.
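
Example

A sketch of strengthening the regularization on a float-typed layer; the value 0.01f is illustrative.

var decoder = new TransformerDecoderLayer<float>(ffnActivation: null);

decoder.UseAuxiliaryLoss = true;     // opt in (disabled by default)
decoder.AuxiliaryLossWeight = 0.01f; // stronger push against attention collapse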

ParameterCount

Gets the total number of trainable parameters in this layer.

public override int ParameterCount { get; }

Property Value

int

Remarks

This returns the sum of all parameters from sublayers: self-attention, cross-attention, layer norms, feed-forward layer, and feed-forward projection layer.

SupportsGpuExecution

Gets a value indicating whether this layer can execute on GPU.

protected override bool SupportsGpuExecution { get; }

Property Value

bool

SupportsJitCompilation

Gets whether this transformer decoder layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

True if all sublayers support JIT compilation.

Remarks

This property indicates whether the layer can be JIT compiled. As a composite layer, it supports JIT only if all of its sublayers do:

  • Masked self-attention layer
  • Cross-attention layer (attends to encoder output)
  • Layer normalization layers (3 total)
  • Feed-forward layer

For Beginners: This tells you if this composite layer can use JIT compilation.

The transformer decoder layer can be JIT compiled if:

  • All sublayers are properly initialized
  • Each sublayer supports JIT compilation

Composite layer JIT optimization:

  • Each sublayer can be independently JIT compiled
  • Future optimization: fuse operations across sublayers
  • Residual connections and layer norms are fast operations

The bottleneck in decoder layers:

  • Self-attention: O(n²) for target sequence
  • Cross-attention: O(n*m) where n=target length, m=source length
  • Feed-forward: matrix multiplications

All three benefit significantly from JIT compilation (a 5-10x speedup).

GPT models use a decoder-only architecture (no cross-attention, only self-attention). T5 and other sequence-to-sequence models use both encoder and decoder layers. GPT-3 stacks 96 decoder layers, making JIT optimization critical for performance.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

true for this layer, as it contains trainable parameters.

Remarks

This property indicates whether the transformer decoder layer can be trained through backpropagation. Since this layer has trainable parameters in its sublayers, it supports training.

For Beginners: This property tells you if the layer can learn from data.

A value of true means:

  • The layer has internal values that can be adjusted during training
  • It will improve its performance as it sees more data
  • It participates in the learning process

For this layer, the value is always true because it contains multiple sublayers with trainable parameters that need to be optimized during training.

UseAuxiliaryLoss

Gets or sets a value indicating whether auxiliary loss is enabled for this layer.

public bool UseAuxiliaryLoss { get; set; }

Property Value

bool

Remarks

When enabled, the layer aggregates auxiliary losses from its attention sublayers (both self-attention and cross-attention). This helps regularize attention patterns and prevents issues like attention collapse.

For Beginners: This setting controls whether the layer uses additional learning signals.

When enabled (true):

  • The layer collects extra penalties from both self-attention and cross-attention mechanisms
  • This helps the attention heads learn diverse and focused patterns
  • Training may be more stable and produce better results

When disabled (false):

  • Only the main task loss is used for training
  • This is the default setting

Methods

Backward(Tensor<T>)

Performs the backward pass of the transformer decoder layer.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient of the loss with respect to the layer's output.

Returns

Tensor<T>

The gradient of the loss with respect to the layer's input.

Remarks

This method implements the backward pass of the transformer decoder layer, which is used during training to propagate error gradients back through the network. It computes gradients for each sublayer in reverse order of the forward pass, ensuring that residual connections are properly handled.

For Beginners: This method calculates how the layer's inputs should change to reduce errors.

During the backward pass, we go through the same steps as the forward pass, but in reverse order:

  1. Final Layer Normalization:

    • Compute how the normalization's input should change based on output errors
  2. Feed-Forward Network:

    • Determine how the feed-forward network's input should change
    • Account for the residual connection by adding gradients
  3. Second Layer Normalization:

    • Compute how the second normalization's input should change
  4. Cross-Attention:

    • Determine how the cross-attention's inputs should change
    • Account for the residual connection
  5. First Layer Normalization:

    • Compute how the first normalization's input should change
  6. Self-Attention:

    • Determine how the self-attention's input should change
    • Account for the final residual connection

This reverse flow of gradients allows each component to learn how it contributed to any errors.
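
Example

A single manual training step (sketch). How the loss gradient is obtained depends on your loss function; ComputeLossGradient below is a hypothetical helper, not part of this API.

Tensor<float> output = decoder.Forward(decoderInput, encoderOutput);
Tensor<float> lossGradient = ComputeLossGradient(output, targets); // hypothetical

// Propagate gradients through all sublayers in reverse order, then update.
Tensor<float> inputGradient = decoder.Backward(lossGradient);
decoder.UpdateParameters(0.0001f); // learning rate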

BackwardGpu(IGpuTensor<T>)

Computes the gradient of the loss with respect to the decoder input on the GPU.

public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)

Parameters

outputGradient IGpuTensor<T>

The gradient of the loss with respect to the layer's output.

Returns

IGpuTensor<T>

The gradient of the loss with respect to the decoder input.

ComputeAuxiliaryLoss()

Computes the auxiliary loss for this layer by aggregating losses from sublayers.

public T ComputeAuxiliaryLoss()

Returns

T

The computed auxiliary loss value.

Remarks

This method computes the auxiliary loss by aggregating losses from sublayers that implement IAuxiliaryLossLayer. For the decoder layer, this includes both self-attention and cross-attention mechanisms, which provide attention entropy and head diversity regularization.

For Beginners: This method collects additional learning signals from the layer's components.

Auxiliary loss aggregation:

  • Checks each attention sublayer to see if it has auxiliary losses
  • Collects those losses from both self-attention and cross-attention
  • Combines them and returns the total for use in training

Why this is useful:

  • Both attention mechanisms benefit from regularization to prevent all heads from learning the same patterns
  • Self-attention regularization helps the decoder maintain coherent generation patterns
  • Cross-attention regularization helps the decoder focus on relevant parts of the source
  • Aggregating losses at the decoder level provides a unified view of attention quality

Example: If the self-attention has an entropy loss (to keep attention focused) and a diversity loss (to prevent heads from being redundant), and the cross-attention has similar losses, this method adds all of them together and returns the total.

The aggregated loss helps ensure:

  • Both attention mechanisms learn diverse patterns
  • Attention is focused rather than diffuse
  • The decoder uses its capacity efficiently for both understanding context and attending to source
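
Example

A sketch of folding the auxiliary loss into the total training loss. ComputeTaskLoss is a hypothetical helper, and this page does not state whether AuxiliaryLossWeight is already applied inside ComputeAuxiliaryLoss, so verify before scaling as done here.

decoder.UseAuxiliaryLoss = true;

float mainLoss = ComputeTaskLoss(output, targets); // hypothetical helper
float auxLoss = decoder.ComputeAuxiliaryLoss();    // aggregated entropy/diversity terms

float totalLoss = mainLoss + decoder.AuxiliaryLossWeight * auxLoss;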

ExportComputationGraph(List<ComputationNode<T>>)

Exports the transformer decoder layer as a computation graph for JIT compilation.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to which the input node will be added.

Returns

ComputationNode<T>

The output computation node representing the transformer decoder operation.

Remarks

This method creates a symbolic computation graph for JIT compilation:

  1. Creates a symbolic input node (decoder input)
  2. Applies masked self-attention with residual connection and norm
  3. Applies cross-attention to encoder output with residual and norm
  4. Applies feed-forward network with residual connection and norm
  5. Returns the final output

For Beginners: This method builds a symbolic representation of a transformer decoder layer for JIT.

The transformer decoder layer is a composite layer combining:

  • Masked self-attention (prevents looking ahead in target sequence)
  • Cross-attention (attends to encoder output, connects source and target)
  • Layer normalization (stabilizes training)
  • Feed-forward network (processes each position independently)
  • Residual connections (helps gradient flow in deep networks)

The forward pass:

  1. x' = LayerNorm(x + MaskedSelfAttention(x))
  2. x'' = LayerNorm(x' + CrossAttention(x', encoder_output))
  3. output = LayerNorm(x'' + FeedForward(x''))

JIT optimization for composite layers:

  • For now, composite layers note their structure but may delegate to sublayers
  • Future optimization could fuse operations across sublayers
  • Each sublayer (self-attention, cross-attention, feed-forward, norm) can be independently JIT compiled

This is the core building block of GPT (decoder-only) and encoder-decoder models like T5.
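
Example

A sketch of exporting the graph, guarded by the SupportsJitCompilation check documented above. How the resulting graph is compiled and executed is not covered on this page, and the namespace of ComputationNode<T> is assumed to already be imported.

using System.Collections.Generic;

if (decoder.SupportsJitCompilation)
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> outputNode = decoder.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic decoder-input node;
    // outputNode is the root of the decoder's computation graph.
}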

Exceptions

ArgumentNullException

Thrown when inputNodes is null.

InvalidOperationException

Thrown when sublayers are not initialized.

Forward(Tensor<T>)

Not supported for this layer. Use Forward(Tensor<T> input, Tensor<T> encoderOutput) instead.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor.

Returns

Tensor<T>

Never returns as this method throws an exception.

Remarks

This method is not supported for the transformer decoder layer, as it requires both a decoder input and an encoder output. Use the overloaded Forward method that accepts both inputs instead.

For Beginners: This method is a placeholder that always throws an error when called.

The transformer decoder needs two inputs:

  • The decoder's own input (what it has generated so far)
  • The encoder's output (information from the source sequence)

This method exists only to satisfy the base class contract and throws an exception if someone tries to use it. The correct method is the overload that accepts both inputs, as shown in the example below.
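
Example

A sketch contrasting the unsupported call with the correct one.

// Wrong: the single-input overload always throws InvalidOperationException.
// Tensor<float> bad = decoder.Forward(decoderInput);

// Right: supply both the decoder input and the encoder output.
Tensor<float> ok = decoder.Forward(decoderInput, encoderOutput);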

Exceptions

InvalidOperationException

Always thrown, as this method is not supported for this layer.

Forward(Tensor<T>, Tensor<T>)

Performs the forward pass of the transformer decoder layer.

public Tensor<T> Forward(Tensor<T> input, Tensor<T> encoderOutput)

Parameters

input Tensor<T>

The decoder input tensor.

encoderOutput Tensor<T>

The encoder output tensor.

Returns

Tensor<T>

The output tensor after processing through the transformer decoder layer.

Remarks

This method implements the forward pass of the transformer decoder layer. The decoder input first passes through the masked self-attention mechanism, followed by a residual connection and layer normalization. The result then passes through the cross-attention mechanism (attending to the encoder output), again followed by a residual connection and layer normalization. Finally, the output is processed by the feed-forward network, with one more residual connection and layer normalization.

For Beginners: This method processes the inputs through all components of the decoder layer.

The forward pass follows these steps:

  1. Self-Attention:

    • The decoder looks at its own input to understand the context of what it has generated so far
    • The result is added to the original input (residual connection)
    • Layer normalization is applied to stabilize the values
  2. Cross-Attention:

    • The decoder looks at the encoder output to gather information from the source sequence
    • The result is added to the output from step 1 (residual connection)
    • Layer normalization is applied again
  3. Feed-Forward Network:

    • The output from step 2 is processed through a feed-forward network
    • The result is added to the output from step 2 (residual connection)
    • A final layer normalization is applied

These steps allow the decoder to generate output that is coherent with both what it has generated so far and the information from the source sequence.
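
Example

A sketch of this method inside an autoregressive generation loop. Embed and SampleNextToken are hypothetical placeholders, and in a real model the decoder output would still need a projection to vocabulary logits before sampling.

int startTokenId = 1, endTokenId = 2, maxSteps = 128; // illustrative values
var generated = new List<int> { startTokenId };

for (int step = 0; step < maxSteps; step++)
{
    Tensor<float> decoderInput = Embed(generated); // hypothetical embedding step
    Tensor<float> output = decoder.Forward(decoderInput, encoderOutput);
    int next = SampleNextToken(output);            // hypothetical sampling step
    generated.Add(next);
    if (next == endTokenId) break;
}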

ForwardGpu(params IGpuTensor<T>[])

GPU-resident forward pass for the transformer decoder layer. Performs self-attention, cross-attention, and feed-forward operations entirely on GPU.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

Array containing [decoderInput, encoderOutput] GPU tensors.

Returns

IGpuTensor<T>

GPU-resident output tensor.

Remarks

This method performs the entire transformer decoder forward pass on the GPU without downloading intermediate results to CPU. All sublayer operations (self-attention, cross-attention, layer normalization, feed-forward networks, residual connections) remain GPU-resident for maximum performance.

Exceptions

ArgumentException

Thrown when inputs array doesn't contain exactly 2 tensors.

GetAuxiliaryLossDiagnostics()

Gets diagnostic information about the auxiliary loss computation.

public Dictionary<string, string> GetAuxiliaryLossDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic information about the auxiliary loss.

Remarks

This method returns diagnostic information that can be used to monitor the auxiliary loss during training. The diagnostics include the total auxiliary loss, the weight applied to it, whether auxiliary loss is enabled, and detailed diagnostics from both self-attention and cross-attention sublayers.

For Beginners: This method provides information to help you understand how the auxiliary loss is working.

The diagnostics show:

  • TotalAuxiliaryLoss: The combined penalty from all attention sublayers
  • AuxiliaryWeight: How much this penalty affects the overall training
  • UseAuxiliaryLoss: Whether this penalty is currently enabled
  • SelfAttentionDiagnostics: Detailed information from the self-attention mechanism
  • CrossAttentionDiagnostics: Detailed information from the cross-attention mechanism

You can use this information to:

  • Monitor if attention patterns are healthy (diverse and focused) in both mechanisms
  • Debug training issues related to attention
  • Understand how the decoder is learning both context and source information

Example: If you see that self-attention entropy is very low, it might mean the decoder isn't maintaining good coherence with its own generated output. If cross-attention diversity is low, it might mean all heads are looking at the same part of the source, wasting capacity.
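
Example

A sketch of logging the documented diagnostic keys during training.

Dictionary<string, string> diag = decoder.GetAuxiliaryLossDiagnostics();
foreach (var entry in diag)
    Console.WriteLine($"{entry.Key}: {entry.Value}");
// Expected keys per the remarks above: TotalAuxiliaryLoss, AuxiliaryWeight,
// UseAuxiliaryLoss, SelfAttentionDiagnostics, CrossAttentionDiagnostics.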

GetDiagnostics()

Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.

public override Dictionary<string, string> GetDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().

GetParameters()

Gets all trainable parameters of the transformer decoder layer as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all trainable parameters from all sublayers.

Remarks

This method retrieves all trainable parameters from all sublayers of the transformer decoder layer and combines them into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.

For Beginners: This method collects all the learnable values from all parts of the decoder.

The parameters:

  • Are the numbers that the neural network learns during training
  • Include weights from attention mechanisms, normalization layers, and the feed-forward network
  • Are combined into a single long list (vector)

This is useful for:

  • Saving the model to disk
  • Loading parameters from a previously trained model
  • Advanced optimization techniques that need access to all parameters

A transformer decoder layer typically has millions of parameters, all of which contribute to its ability to generate high-quality sequences.
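
Example

A sketch of snapshotting the learned values; how a Vector<T> is persisted to disk is not covered on this page, so the snapshot is only held in memory here.

Vector<float> snapshot = decoder.GetParameters();
Console.WriteLine($"Trainable parameters: {decoder.ParameterCount}");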

ResetState()

Resets the internal state of the transformer decoder layer and all its sublayers.

public override void ResetState()

Remarks

This method resets the internal state of the transformer decoder layer and all its sublayers. It clears the cached tensors from the forward pass and delegates the reset operation to each sublayer.

For Beginners: This method clears the layer's memory to start fresh.

When resetting the state:

  • All sublayers are reset to their initial condition
  • Stored inputs and outputs are cleared
  • The layer forgets all intermediate results from previous processing

This is important for:

  • Processing a new, unrelated sequence
  • Starting a new training episode
  • Testing the layer with fresh inputs

Think of it like clearing the entire team's mind before starting a completely new task, ensuring no residual information affects the processing of new inputs.
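
Example

A sketch of clearing cached state between unrelated sequences.

decoder.Forward(firstInput, firstEncoderOutput);
decoder.ResetState(); // drop cached tensors before processing a new sequence
decoder.Forward(secondInput, secondEncoderOutput);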

UpdateParameters(T)

Updates the parameters of all sublayers using the calculated gradients.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate to use for parameter updates.

Remarks

This method updates the parameters of all sublayers in the transformer decoder layer based on the gradients calculated during the backward pass. It delegates the update process to each sublayer, passing the learning rate.

For Beginners: This method adjusts all the internal values of the layer to improve its performance.

During parameter updates:

  • The learning rate controls how big each adjustment is
  • Every sublayer gets updated based on what was learned in the backward pass
  • This helps the entire decoder layer gradually improve its performance

Think of it like fine-tuning all the components of the decoder based on feedback:

  • The self-attention mechanism learns to focus on more relevant parts of what's been generated
  • The cross-attention mechanism learns to extract more useful information from the source
  • The feed-forward network learns to better transform this information into the next output

UpdateParametersGpu(IGpuOptimizerConfig)

Updates layer parameters using GPU-resident optimizer.

public override void UpdateParametersGpu(IGpuOptimizerConfig config)

Parameters

config IGpuOptimizerConfig

The GPU optimizer configuration.

Remarks

This method delegates to each sublayer's UpdateParametersGpu method. All sublayers (self-attention, cross-attention, layer norms, feed-forward) are updated.