Class EmbeddingLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Represents an embedding layer that converts discrete token indices into dense vector representations.

public class EmbeddingLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider, ITokenEmbedding<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance

LayerBase<T> → EmbeddingLayer<T>

Implements

ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider, ITokenEmbedding<T>

Remarks

An embedding layer maps discrete tokens (represented as indices) to continuous vector representations. This is particularly useful for natural language processing tasks where words or tokens need to be represented as dense vectors that capture semantic relationships. Each token is assigned a unique vector in a high-dimensional space, allowing the model to learn meaningful representations.

For Beginners: An embedding layer turns words or other symbols into lists of numbers that capture their meaning.

Imagine you have a dictionary where:

  • Each word has an ID number (like "cat" = 5, "dog" = 10)
  • The embedding layer gives each ID a unique "coordinate" in a multi-dimensional space
  • Words with similar meanings end up with similar coordinates

For example:

  • "Cat" might become [0.2, -0.5, 0.1, 0.8]
  • "Kitten" might become [0.25, -0.4, 0.15, 0.7]
  • "Computer" might become [-0.8, 0.2, 0.5, -0.3]

The embedding layer learns these representations during training, so that:

  • Similar words end up close to each other
  • Related concepts form clusters
  • The vectors capture meaningful semantic relationships

This allows neural networks to work with text and other discrete tokens in a way that captures their meaning and relationships.

Thread Safety: This layer is not thread-safe. Each layer instance maintains internal state during forward and backward passes. If you need concurrent execution, use separate layer instances per thread or synchronize access to shared instances.
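
Examples

A minimal usage sketch based only on members documented on this page (the constructor and ParameterCount); the vocabulary size and embedding dimension are arbitrary example values.

using System;
using AiDotNet.NeuralNetworks.Layers;

// Create an embedding table with 10,000 token slots of 300 values each.
var layer = new EmbeddingLayer<float>(vocabularySize: 10_000, embeddingDimension: 300);

// Every cell of the embedding matrix is trainable: 10,000 × 300 = 3,000,000 parameters.
Console.WriteLine(layer.ParameterCount);

Forward(Tensor<T>) then maps each token ID in an input tensor to its 300-value row, and Backward(Tensor<T>) together with UpdateParameters(T) adjusts those rows during training.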

Constructors

EmbeddingLayer(int, int)

Initializes a new instance of the EmbeddingLayer<T> class with the specified vocabulary size and embedding dimension.

public EmbeddingLayer(int vocabularySize, int embeddingDimension)

Parameters

vocabularySize int

The number of distinct tokens the layer can embed, i.e. the number of rows in the embedding matrix.

embeddingDimension int

The length of the vector assigned to each token, i.e. the number of columns in the embedding matrix.

Properties

AuxiliaryLossWeight

Gets or sets the weight for embedding regularization. Default is 0.0001. Controls L2 regularization strength on embedding weights.

public T AuxiliaryLossWeight { get; set; }

Property Value

T

InputMode

Gets or sets the input mode that determines how the layer interprets its input tensor.

public EmbeddingInputMode InputMode { get; set; }

Property Value

EmbeddingInputMode

ParameterCount

Gets the total number of trainable parameters in this layer.

public override int ParameterCount { get; }

Property Value

int

The number of elements in the embedding matrix (vocabulary size × embedding dimension).

Remarks

For Beginners: This counts the total number of adjustable values in the layer. For an embedding layer with a vocabulary size of 10,000 and an embedding dimension of 300, the parameter count is 10,000 × 300 = 3,000,000.

SupportsGpuExecution

Gets a value indicating whether this layer can execute on GPU.

protected override bool SupportsGpuExecution { get; }

Property Value

bool

true because embedding lookup has efficient GPU support.

SupportsJitCompilation

Gets a value indicating whether this layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

Always true because embedding lookup can be JIT compiled.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

Always true because this layer has trainable parameters (the embedding matrix).

Remarks

This property indicates that the embedding layer supports training through backpropagation. The layer has trainable embeddings that are updated during the training process.

For Beginners: This property tells you that this layer can learn from data.

A value of true means:

  • The layer can adjust its embeddings during training
  • It will improve its representations as it sees more data
  • It has parameters (the embedding matrix) that are updated to make better predictions

Unlike static word embeddings (like pre-trained word vectors), these embeddings adapt and improve specifically for your task during training.

UseAuxiliaryLoss

Gets or sets whether to use auxiliary loss (embedding regularization) during training. Default is false. Enable to prevent embeddings from becoming too large or collapsing.

public bool UseAuxiliaryLoss { get; set; }

Property Value

bool

Methods

Backward(Tensor<T>)

Performs the backward pass of the embedding layer, computing gradients for the embedding matrix.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient of the loss with respect to this layer's output, with the same shape as the output produced by the forward pass.

Returns

Tensor<T>

A zero-filled tensor with the same shape as the input, as gradients don't flow back to indices.

Remarks

This method implements the backward pass (backpropagation) of the embedding layer. It computes the gradients for the embedding matrix by accumulating the gradients from the output for each token index that was used in the forward pass. Since the input to the embedding layer is indices rather than computed values, no meaningful gradients can be computed for the input. Therefore, this method returns a zero-filled tensor with the same shape as the input.

For Beginners: This is where the embedding layer learns from its mistakes during training.

During the backward pass:

  1. For each token in the input sequence:
    • Look up which embedding was used (based on the token ID)
    • Add the corresponding gradient to that specific embedding
  2. Return a dummy gradient for the input (since we can't backpropagate through token IDs)

For example, if token ID 5 appears three times in different positions:

  • All three gradient contributions will be added together for embedding #5
  • This accumulates learning from all occurrences of that token

This is different from most layers because:

  • We only update the embeddings that were actually used in this batch
  • We don't pass meaningful gradients back to the input (the token IDs themselves don't change)

Exceptions

InvalidOperationException

Thrown when backward is called before forward.
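
Examples

The sketch below illustrates the accumulation rule described above with plain arrays standing in for the library's Tensor<T> type (illustration only, not the layer's actual code): every position that used a token adds its gradient to that token's row, and the gradient returned for the input indices is zero.

// Gradient accumulation for an embedding backward pass, sketched with plain arrays.
int vocabularySize = 20, embeddingDimension = 4;
int[] tokenIds = { 5, 7, 5, 5 };                                      // token 5 appears three times
var outputGradient = new float[tokenIds.Length, embeddingDimension];  // gradient from the next layer
var embeddingGradient = new float[vocabularySize, embeddingDimension];

for (int position = 0; position < tokenIds.Length; position++)
{
    for (int d = 0; d < embeddingDimension; d++)
    {
        // All occurrences of the same token accumulate into the same row.
        embeddingGradient[tokenIds[position], d] += outputGradient[position, d];
    }
}

// Nothing meaningful flows back to discrete token IDs, so the input gradient is all zeros.
var inputGradient = new float[tokenIds.Length];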

BackwardGpu(IGpuTensor<T>)

Performs GPU-resident backward pass for the embedding layer. Computes gradients for embeddings or projection weights entirely on GPU.

public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)

Parameters

outputGradient IGpuTensor<T>

GPU-resident gradient from the next layer.

Returns

IGpuTensor<T>

GPU-resident gradient to pass to the previous layer (zero for discrete embeddings).

Exceptions

InvalidOperationException

Thrown if ForwardGpu was not called first.

ComputeAuxiliaryLoss()

Computes the auxiliary loss for the EmbeddingLayer, which is embedding regularization.

public T ComputeAuxiliaryLoss()

Returns

T

The embedding regularization loss value.

Remarks

Embedding regularization prevents embedding vectors from becoming too large or too similar, which can lead to overfitting. It applies L2 regularization on the embedding weights: Loss = (1/2) * Σ||embedding||²

This regularization:

  • Prevents embeddings from growing unboundedly
  • Encourages smaller, more generalizable embedding values
  • Helps prevent overfitting to the training data
  • Promotes diverse embedding representations

For Beginners: This calculates a penalty for embeddings that become too large.

Embedding regularization:

  • Measures how large the embedding vectors are
  • Penalizes very large embedding values
  • Encourages the model to use smaller, more manageable numbers
  • Prevents the model from memorizing training data too closely

Why this is important:

  • Large embedding values can indicate overfitting
  • Regularization promotes better generalization to new data
  • Keeps embedding vectors at reasonable scales
  • Prevents embeddings from collapsing or diverging

Think of it like a referee that prevents embeddings from becoming too extreme, keeping them in a reasonable range for better model performance.
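
Examples

A minimal sketch that uses the documented UseAuxiliaryLoss and AuxiliaryLossWeight properties to enable the penalty, followed by an array-based illustration of the formula Loss = (1/2) * Σ||embedding||² (the array values are arbitrary and stand in for the layer's internal embedding matrix).

// Enable embedding regularization on the layer.
var layer = new EmbeddingLayer<float>(vocabularySize: 100, embeddingDimension: 8);
layer.UseAuxiliaryLoss = true;
layer.AuxiliaryLossWeight = 0.0001f;

// Illustration of the loss formula: half the sum of all squared embedding values.
float[,] embeddings = { { 0.2f, -0.5f }, { 0.25f, -0.4f } };
float loss = 0f;
foreach (float value in embeddings)
{
    loss += value * value;
}
loss *= 0.5f;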

ExportComputationGraph(List<ComputationNode<T>>)

Exports the embedding layer's forward pass as a JIT-compilable computation graph.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to populate with input computation nodes.

Returns

ComputationNode<T>

The output computation node representing the embedded vectors.

Remarks

This method builds a computation graph for the embedding lookup operation. The graph uses the embedding matrix as a constant and performs an EmbeddingLookup operation based on the input indices.

For Beginners: This creates an optimized version of the embedding lookup.

The computation graph:

  • Takes input indices (token IDs)
  • Looks up corresponding rows in the embedding matrix
  • Returns the embedding vectors for each token

The resulting graph can then be JIT-compiled for faster inference.

Forward(Tensor<T>)

Performs the forward pass of the embedding layer, converting token indices to vector representations.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor containing token indices. Any-rank tensors are supported:

  • 1D: [seqLen] - a single sequence
  • 2D: [batch, seqLen] - a batch of sequences (industry standard)
  • 3D: [batch, seqLen, 1] - compatible with the legacy format

Returns

Tensor<T>

The output tensor containing embedding vectors with the same leading dimensions plus embeddingDim.

Remarks

Industry Standard: Like PyTorch's nn.Embedding, this layer supports any-rank input tensors. Each index is looked up in the embedding table, and the output keeps the input's leading dimensions with an embedding dimension appended (for the legacy 3D format, the trailing singleton dimension is replaced by the embedding dimension).

For Beginners: This method looks up the vector for each token ID in your input.

The forward pass works like this:

  1. Take a sequence of token IDs as input (like [5, 10, 3])
  2. For each ID, look up its corresponding row in the embedding matrix
  3. Copy that row (the embedding vector) to the output

For example, with an input sequence [5, 10, 3]:

  • Look up row 5 in the embedding matrix -> output row 1
  • Look up row 10 in the embedding matrix -> output row 2
  • Look up row 3 in the embedding matrix -> output row 3

The result is a sequence of embedding vectors, one for each input token. This transforms your discrete tokens into continuous vectors that the neural network can process more effectively.
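
Examples

The row lookup described above, sketched with plain arrays standing in for the library's Tensor<T> type (illustration only): for each token ID, the matching row of the embedding matrix is copied to the output.

int embeddingDimension = 4;
var embeddingMatrix = new float[16, embeddingDimension];  // [vocabularySize, embeddingDimension]
int[] tokenIds = { 5, 10, 3 };

var output = new float[tokenIds.Length, embeddingDimension];
for (int position = 0; position < tokenIds.Length; position++)
{
    for (int d = 0; d < embeddingDimension; d++)
    {
        // Copy row tokenIds[position] of the embedding matrix into this output position.
        output[position, d] = embeddingMatrix[tokenIds[position], d];
    }
}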

ForwardGpu(params IGpuTensor<T>[])

Performs the forward pass of the embedding layer on GPU.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

The GPU-resident input tensor(s) containing token indices.

Returns

IGpuTensor<T>

A GPU-resident tensor containing the embedding vectors.

Remarks

This method performs embedding lookup entirely on GPU, keeping the output on GPU for subsequent GPU-accelerated operations. This eliminates CPU-GPU data transfers for intermediate results in deep networks.

For Beginners: This is the GPU-optimized version of embedding lookup. Instead of moving data between CPU and GPU, all computation stays on the GPU, making it much faster for large vocabularies and batch sizes.

Exceptions

ArgumentException

Thrown when no inputs are provided.

InvalidOperationException

Thrown when GPU engine is not available.

GetAuxiliaryLossDiagnostics()

Gets diagnostic information about the embedding regularization.

public Dictionary<string, string> GetAuxiliaryLossDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic information about embedding health.

Remarks

This method provides insights into embedding behavior, including:

  • Embedding regularization loss
  • Average embedding magnitude
  • Regularization weight

For Beginners: This gives you information to monitor embedding quality.

The diagnostics include:

  • Embedding Regularization Loss: Measure of embedding magnitude
  • Regularization Weight: How much the penalty influences training
  • Average Embedding Magnitude: Typical size of embedding vectors
  • Use Auxiliary Loss: Whether regularization is enabled

These values help you:

  • Monitor if embeddings are growing too large
  • Detect potential overfitting in embedding layer
  • Tune the regularization weight
  • Ensure embeddings remain at reasonable scales
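
Examples

A minimal sketch of reading the diagnostics; the exact dictionary keys are whatever the layer reports, so they are simply printed as-is.

using System;

var layer = new EmbeddingLayer<float>(vocabularySize: 100, embeddingDimension: 8);
layer.UseAuxiliaryLoss = true;

foreach (var entry in layer.GetAuxiliaryLossDiagnostics())
{
    // e.g. regularization loss, average embedding magnitude, regularization weight.
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}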

GetDiagnostics()

Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.

public override Dictionary<string, string> GetDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().

GetParameters()

Gets all trainable parameters of the layer as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all trainable parameters.

Remarks

This method retrieves all trainable parameters (the entire embedding matrix) as a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.

For Beginners: This method collects all the embedding values into a single list.

The parameters include:

  • All values from the embedding matrix, arranged in a single long list
  • Each embedding vector is placed one after another

This is useful for:

  • Saving the embeddings to disk
  • Loading pre-trained embeddings
  • Applying specific optimization techniques

For example, a vocabulary of 1,000 tokens with 100-dimensional embeddings would produce a vector of 100,000 values.

GetTokenEmbeddings(IReadOnlyList<int>)

Retrieves embeddings for the provided token IDs.

public Matrix<T> GetTokenEmbeddings(IReadOnlyList<int> tokenIds)

Parameters

tokenIds IReadOnlyList<int>

Token IDs to lookup.

Returns

Matrix<T>

A matrix where each row corresponds to a token embedding.
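
Examples

A minimal lookup sketch using only the documented constructor and GetTokenEmbeddings; the token IDs are arbitrary example values.

var layer = new EmbeddingLayer<float>(vocabularySize: 10_000, embeddingDimension: 300);

// One row per requested token, each row 300 values wide.
var rows = layer.GetTokenEmbeddings(new[] { 5, 10, 3 });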

ResetState()

Resets the internal state of the layer.

public override void ResetState()

Remarks

This method resets the internal state of the layer by clearing the cached input and embedding gradients from previous forward and backward passes. This is useful when starting to process a new batch of data or when implementing stateful recurrent networks.

For Beginners: This method clears the layer's memory to start fresh.

When resetting the state:

  • The saved input token IDs are cleared
  • The calculated gradients are cleared
  • The layer forgets previous calculations it performed

This is typically called:

  • Between training batches to free up memory
  • When switching from training to evaluation mode
  • When starting to process completely new data

It doesn't affect the learned embeddings themselves, just the temporary working data used during computation.

SetParameters(Vector<T>)

Sets the trainable parameters of the layer from a single vector.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

A vector containing all parameters to set.

Remarks

This method sets all trainable parameters (the entire embedding matrix) from a single vector. This is useful for loading saved model weights or pre-trained embeddings.

For Beginners: This method updates all embedding values from a provided list.

When setting parameters:

  • The input must be a vector with the exact right length
  • The values are distributed back to the embedding matrix
  • This allows loading previously trained or pre-trained embeddings

Use cases include:

  • Loading embeddings trained on another task
  • Initializing with pre-trained word vectors (like Word2Vec or GloVe)
  • Restoring a saved model

For example, you might initialize your embeddings with GloVe vectors that were pre-trained on a large corpus, giving your model a head start.

Exceptions

ArgumentException

Thrown when the parameters vector has incorrect length.
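
Examples

A minimal save/restore round trip using only GetParameters and SetParameters. Loading third-party vectors such as GloVe would instead require building a Vector<T> of length vocabularySize × embeddingDimension, which is not shown here because the Vector<T> construction API is outside this page.

var layer = new EmbeddingLayer<float>(vocabularySize: 1_000, embeddingDimension: 100);

// Snapshot the full embedding matrix as one flat vector of 100,000 values...
var saved = layer.GetParameters();

// ...and restore it later. A vector of any other length throws ArgumentException.
layer.SetParameters(saved);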

UpdateParameters(T)

Updates the embedding matrix using the calculated gradients and the specified learning rate.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate to use for the parameter updates.

Remarks

This method updates the embedding matrix based on the gradients calculated during the backward pass. Only the embeddings for tokens that appeared in the input during the forward pass will be updated. The learning rate determines the size of the parameter updates.

For Beginners: This method actually changes the embeddings to improve future predictions.

After figuring out how each embedding should change:

  • The embedding matrix is updated by subtracting the gradients
  • Each value is adjusted proportionally to its gradient
  • The learning rate controls how big these adjustments are

For example:

  • If embedding for token #5 has a gradient of [0.1, -0.2, 0.3]
  • With learning rate of 0.01
  • The embedding will change by [-0.001, 0.002, -0.003]

Only embeddings for tokens that appeared in the recent input batch will be updated. Frequently used tokens will get more updates over time.

Exceptions

InvalidOperationException

Thrown when update is called before backward.
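
Examples

The update rule described above, worked out with plain values (the current embedding values are arbitrary; the gradient and learning rate match the example in the remarks).

float learningRate = 0.01f;
float[] oldEmbedding = { 0.40f, 0.10f, -0.30f };  // current values for token #5 (arbitrary)
float[] gradient = { 0.1f, -0.2f, 0.3f };         // gradient accumulated during Backward

var newEmbedding = new float[oldEmbedding.Length];
for (int d = 0; d < oldEmbedding.Length; d++)
{
    // Changes by -0.001, +0.002, -0.003 respectively, matching the remarks above.
    newEmbedding[d] = oldEmbedding[d] - learningRate * gradient[d];
}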