Class EmbeddingLayer<T>
- Namespace
- AiDotNet.NeuralNetworks.Layers
- Assembly
- AiDotNet.dll
Represents an embedding layer that converts discrete token indices into dense vector representations.
public class EmbeddingLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider, ITokenEmbedding<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
- LayerBase<T> -> EmbeddingLayer<T>
- Implements
- ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider, ITokenEmbedding<T>
Remarks
An embedding layer maps discrete tokens (represented as indices) to continuous vector representations. This is particularly useful for natural language processing tasks where words or tokens need to be represented as dense vectors that capture semantic relationships. Each token is assigned a unique vector in a high-dimensional space, allowing the model to learn meaningful representations.
For Beginners: An embedding layer turns words or other symbols into lists of numbers that capture their meaning.
Imagine you have a dictionary where:
- Each word has an ID number (like "cat" = 5, "dog" = 10)
- The embedding layer gives each ID a unique "coordinate" in a multi-dimensional space
- Words with similar meanings end up with similar coordinates
For example:
- "Cat" might become [0.2, -0.5, 0.1, 0.8]
- "Kitten" might become [0.25, -0.4, 0.15, 0.7]
- "Computer" might become [-0.8, 0.2, 0.5, -0.3]
The embedding layer learns these representations during training, so that:
- Similar words end up close to each other
- Related concepts form clusters
- The vectors capture meaningful semantic relationships
This allows neural networks to work with text and other discrete tokens in a way that captures their meaning and relationships.
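A minimal usage sketch in C# follows. The tensor construction and indexer are assumptions about the Tensor<T> API and may need to be adapted to the actual AiDotNet tensor-creation methods:
using AiDotNet.NeuralNetworks.Layers;

// Sketch: map token IDs to dense vectors with an embedding layer.
var embedding = new EmbeddingLayer<float>(vocabularySize: 10_000, embeddingDimension: 128);

// Hypothetical tensor setup: a 1D tensor holding three token IDs, e.g. "the cat sat" -> [12, 5, 87].
var tokenIds = new Tensor<float>(new[] { 3 });   // assumed constructor taking a shape array
tokenIds[0] = 12f; tokenIds[1] = 5f; tokenIds[2] = 87f;

// Forward pass: each ID becomes a 128-dimensional vector; output shape is [3, 128].
Tensor<float> vectors = embedding.Forward(tokenIds);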
Thread Safety: This layer is not thread-safe. Each layer instance maintains internal state during forward and backward passes. If you need concurrent execution, use separate layer instances per thread or synchronize access to shared instances.
Constructors
EmbeddingLayer(int, int)
Initializes a new embedding layer with the specified vocabulary size and embedding dimension.
public EmbeddingLayer(int vocabularySize, int embeddingDimension)
Parameters
vocabularySize (int): The number of distinct tokens the layer can represent (the number of rows in the embedding matrix).
embeddingDimension (int): The length of each embedding vector (the number of columns in the embedding matrix).
Properties
AuxiliaryLossWeight
Gets or sets the weight for embedding regularization. Default is 0.0001. Controls L2 regularization strength on embedding weights.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
InputMode
public EmbeddingInputMode InputMode { get; set; }
Property Value
- EmbeddingInputMode
ParameterCount
Gets the total number of trainable parameters in this layer.
public override int ParameterCount { get; }
Property Value
- int
The number of elements in the embedding matrix (vocabulary size × embedding dimension).
Remarks
For Beginners: This counts the total number of adjustable values in the layer. For an embedding layer with 10,000 vocabulary size and 300 dimensions, the parameter count would be 10,000 × 300 = 3,000,000 parameters.
SupportsGpuExecution
Gets a value indicating whether this layer can execute on GPU.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
true because embedding lookup has efficient GPU support.
SupportsJitCompilation
Gets a value indicating whether this layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
Always true because embedding lookup can be JIT compiled.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
Always true because this layer has trainable parameters (the embedding matrix).
Remarks
This property indicates that the embedding layer supports training through backpropagation. The layer has trainable embeddings that are updated during the training process.
For Beginners: This property tells you that this layer can learn from data.
A value of true means:
- The layer can adjust its embeddings during training
- It will improve its representations as it sees more data
- It has parameters (the embedding matrix) that are updated to make better predictions
Unlike static word embeddings (like pre-trained word vectors), these embeddings adapt and improve specifically for your task during training.
UseAuxiliaryLoss
Gets or sets whether to use auxiliary loss (embedding regularization) during training. Default is false. Enable to prevent embeddings from becoming too large or collapsing.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
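A brief sketch of enabling the regularization, continuing the earlier example (T is float for this layer instance):
// Enable L2 regularization on the embedding matrix (0.0001 is the documented default weight).
embedding.UseAuxiliaryLoss = true;
embedding.AuxiliaryLossWeight = 0.0001f;

// During training, the regularization term can be read and folded into the total loss.
float auxLoss = embedding.ComputeAuxiliaryLoss();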
Methods
Backward(Tensor<T>)
Performs the backward pass of the embedding layer, computing gradients for the embedding matrix.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient tensor from the next layer. Shape: [sequenceLength, batchSize, embeddingDimension].
Returns
- Tensor<T>
A zero-filled tensor with the same shape as the input, as gradients don't flow back to indices.
Remarks
This method implements the backward pass (backpropagation) of the embedding layer. It computes the gradients for the embedding matrix by accumulating the gradients from the output for each token index that was used in the forward pass. Since the input to the embedding layer is indices rather than computed values, no meaningful gradients can be computed for the input. Therefore, this method returns a zero-filled tensor with the same shape as the input.
For Beginners: This is where the embedding layer learns from its mistakes during training.
During the backward pass:
- For each token in the input sequence:
  - Look up which embedding was used (based on the token ID)
  - Add the corresponding gradient to that specific embedding
- Return a dummy gradient for the input (since we can't backpropagate through token IDs)
For example, if token ID 5 appears three times in different positions:
- All three gradient contributions will be added together for embedding #5
- This accumulates learning from all occurrences of that token
This is different from most layers because:
- We only update the embeddings that were actually used in this batch
- We don't pass meaningful gradients back to the input (the token IDs themselves don't change)
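Conceptually, the accumulation described above behaves like the following plain-array sketch (illustrative only, not the layer's actual implementation):
// Illustrative sketch of embedding-gradient accumulation on plain arrays.
static float[,] AccumulateEmbeddingGradients(int[] tokenIds, float[,] outputGradient, int vocabularySize, int embeddingDim)
{
    var embeddingGrads = new float[vocabularySize, embeddingDim];   // starts at zero
    for (int t = 0; t < tokenIds.Length; t++)
    {
        int id = tokenIds[t];                                       // which embedding row was used
        for (int d = 0; d < embeddingDim; d++)
            embeddingGrads[id, d] += outputGradient[t, d];          // repeated tokens accumulate
    }
    return embeddingGrads;  // the gradient returned for the input indices themselves is all zeros
}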
Exceptions
- InvalidOperationException
Thrown when backward is called before forward.
BackwardGpu(IGpuTensor<T>)
Performs GPU-resident backward pass for the embedding layer. Computes gradients for embeddings or projection weights entirely on GPU.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): GPU-resident gradient from the next layer.
Returns
- IGpuTensor<T>
GPU-resident gradient to pass to the previous layer (zero for discrete embeddings).
Exceptions
- InvalidOperationException
Thrown if ForwardGpu was not called first.
ComputeAuxiliaryLoss()
Computes the auxiliary loss for the EmbeddingLayer, which is embedding regularization.
public T ComputeAuxiliaryLoss()
Returns
- T
The embedding regularization loss value.
Remarks
Embedding regularization prevents embedding vectors from becoming too large or too similar, which can lead to overfitting. It applies L2 regularization on the embedding weights: Loss = (1/2) * Σ||embedding||²
This regularization:
- Prevents embeddings from growing unboundedly
- Encourages smaller, more generalizable embedding values
- Helps prevent overfitting to the training data
- Promotes diverse embedding representations
For Beginners: This calculates a penalty for embeddings that become too large.
Embedding regularization:
- Measures how large the embedding vectors are
- Penalizes very large embedding values
- Encourages the model to use smaller, more manageable numbers
- Prevents the model from memorizing training data too closely
Why this is important:
- Large embedding values can indicate overfitting
- Regularization promotes better generalization to new data
- Keeps embedding vectors at reasonable scales
- Prevents embeddings from collapsing or diverging
Think of it like a referee that prevents embeddings from becoming too extreme, keeping them in a reasonable range for better model performance.
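A plain-array sketch of the penalty described above; whether the weight is applied inside ComputeAuxiliaryLoss or when combining losses is an implementation detail, and it is included here only for illustration:
// Sketch: weighted L2 penalty over the embedding matrix, loss = weight * (1/2) * sum of squared values.
static float EmbeddingL2Loss(float[,] embeddings, float weight)
{
    float sumOfSquares = 0f;
    foreach (float value in embeddings)
        sumOfSquares += value * value;
    return weight * 0.5f * sumOfSquares;
}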
ExportComputationGraph(List<ComputationNode<T>>)
Exports the embedding layer's forward pass as a JIT-compilable computation graph.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to populate with input computation nodes.
Returns
- ComputationNode<T>
The output computation node representing the embedded vectors.
Remarks
This method builds a computation graph for the embedding lookup operation. The graph uses the embedding matrix as a constant and performs an EmbeddingLookup operation based on the input indices.
For Beginners: This creates an optimized version of the embedding lookup.
The computation graph:
- Takes input indices (token IDs)
- Looks up corresponding rows in the embedding matrix
- Returns the embedding vectors for each token
This is JIT compiled for faster inference.
Forward(Tensor<T>)
Performs the forward pass of the embedding layer, converting token indices to vector representations.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor containing token indices. Supports any-rank tensors:
- 1D: [seqLen] - a single sequence
- 2D: [batch, seqLen] - a batch of sequences (industry standard)
- 3D: [batch, seqLen, 1] - compatible with the legacy format
Returns
- Tensor<T>
The output tensor containing embedding vectors with the same leading dimensions plus embeddingDim.
Remarks
Industry Standard: Like PyTorch's nn.Embedding, this layer supports any-rank input tensors. The indices in the last dimension(s) are looked up in the embedding table, and the result has the same shape with the last dimension replaced by the embedding dimension.
For Beginners: This method looks up the vector for each token ID in your input.
The forward pass works like this:
- Take a sequence of token IDs as input (like [5, 10, 3])
- For each ID, look up its corresponding row in the embedding matrix
- Copy that row (the embedding vector) to the output
For example, with an input sequence [5, 10, 3]:
- Look up row 5 in the embedding matrix -> output row 1
- Look up row 10 in the embedding matrix -> output row 2
- Look up row 3 in the embedding matrix -> output row 3
The result is a sequence of embedding vectors, one for each input token. This transforms your discrete tokens into continuous vectors that the neural network can process more effectively.
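A batched sketch, reusing the layer from the earlier example (tensor construction is again an assumption):
// Sketch: a batch of 2 sequences, each 4 tokens long (shape [2, 4]).
var batch = new Tensor<float>(new[] { 2, 4 });   // assumed constructor taking a shape array
// ... fill batch with token IDs ...

// Output shape: [2, 4, 128] - the leading dimensions are preserved and the embedding dimension is added.
Tensor<float> embedded = embedding.Forward(batch);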
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass of the embedding layer on GPU.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensor(s) containing token indices.
Returns
- IGpuTensor<T>
A GPU-resident tensor containing the embedding vectors.
Remarks
This method performs embedding lookup entirely on GPU, keeping the output on GPU for subsequent GPU-accelerated operations. This eliminates CPU-GPU data transfers for intermediate results in deep networks.
For Beginners: This is the GPU-optimized version of embedding lookup. Instead of moving data between CPU and GPU, all computation stays on the GPU, making it much faster for large vocabularies and batch sizes.
Exceptions
- ArgumentException
Thrown when no inputs are provided.
- InvalidOperationException
Thrown when GPU engine is not available.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the embedding regularization.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about embedding health.
Remarks
This method provides insights into embedding behavior, including:
- Embedding regularization loss
- Average embedding magnitude
- Regularization weight
For Beginners: This gives you information to monitor embedding quality.
The diagnostics include:
- Embedding Regularization Loss: Measure of embedding magnitude
- Regularization Weight: How much the penalty influences training
- Average Embedding Magnitude: Typical size of embedding vectors
- Use Auxiliary Loss: Whether regularization is enabled
These values help you:
- Monitor if embeddings are growing too large
- Detect potential overfitting in embedding layer
- Tune the regularization weight
- Ensure embeddings remain at reasonable scales
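A short sketch of reading the diagnostics during training, continuing the earlier example (the dictionary keys are the ones listed above):
// Sketch: periodically inspect embedding health.
Dictionary<string, string> diagnostics = embedding.GetAuxiliaryLossDiagnostics();
foreach (var entry in diagnostics)
    Console.WriteLine($"{entry.Key}: {entry.Value}");   // e.g. regularization loss, average magnitude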
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Gets all trainable parameters of the layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters.
Remarks
This method retrieves all trainable parameters (the entire embedding matrix) as a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the embedding values into a single list.
The parameters include:
- All values from the embedding matrix, arranged in a single long list
- Each embedding vector is placed one after another
This is useful for:
- Saving the embeddings to disk
- Loading pre-trained embeddings
- Applying specific optimization techniques
For example, a vocabulary of 1,000 tokens with 100-dimensional embeddings would produce a vector of 100,000 values.
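As a sketch of the example above, using only members documented on this page:
// Sketch: 1,000 tokens x 100 dimensions = 100,000 trainable values.
var layer = new EmbeddingLayer<float>(vocabularySize: 1_000, embeddingDimension: 100);
Console.WriteLine(layer.ParameterCount);          // 100000
Vector<float> parameters = layer.GetParameters(); // the same 100,000 values as one flat vector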
GetTokenEmbeddings(IReadOnlyList<int>)
Retrieves embeddings for the provided token IDs.
public Matrix<T> GetTokenEmbeddings(IReadOnlyList<int> tokenIds)
Parameters
tokenIds (IReadOnlyList<int>): The token IDs to look up.
Returns
- Matrix<T>
A matrix where each row corresponds to a token embedding.
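A usage sketch, reusing the embedding layer from the earlier examples:
// Sketch: fetch the current embedding vectors for specific token IDs.
var ids = new List<int> { 5, 10, 3 };
Matrix<float> rows = embedding.GetTokenEmbeddings(ids);
// rows has one row per requested ID, each with embeddingDimension columns.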
ResetState()
Resets the internal state of the layer.
public override void ResetState()
Remarks
This method resets the internal state of the layer by clearing the cached input and embedding gradients from previous forward and backward passes. This is useful when starting to process a new batch of data or when implementing stateful recurrent networks.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- The saved input token IDs are cleared
- The calculated gradients are cleared
- The layer forgets previous calculations it performed
This is typically called:
- Between training batches to free up memory
- When switching from training to evaluation mode
- When starting to process completely new data
It doesn't affect the learned embeddings themselves, just the temporary working data used during computation.
SetParameters(Vector<T>)
Sets the trainable parameters of the layer from a single vector.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters to set.
Remarks
This method sets all trainable parameters (the entire embedding matrix) from a single vector. This is useful for loading saved model weights or pre-trained embeddings.
For Beginners: This method updates all embedding values from a provided list.
When setting parameters:
- The input must be a vector with the exact right length
- The values are distributed back to the embedding matrix
- This allows loading previously trained or pre-trained embeddings
Use cases include:
- Loading embeddings trained on another task
- Initializing with pre-trained word vectors (like Word2Vec or GloVe)
- Restoring a saved model
For example, you might initialize your embeddings with GloVe vectors that were pre-trained on a large corpus, giving your model a head start.
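A sketch of that workflow; LoadPretrainedVector is a hypothetical helper that would return a Vector<float> with exactly ParameterCount values, ordered row by row:
// Sketch: initialize the layer with pre-trained vectors (e.g. GloVe).
Vector<float> pretrained = LoadPretrainedVector("glove.6B.100d.txt", layer.ParameterCount);  // hypothetical helper
layer.SetParameters(pretrained);   // throws ArgumentException if the length does not match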
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the embedding matrix using the calculated gradients and the specified learning rate.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates the embedding matrix based on the gradients calculated during the backward pass. Only the embeddings for tokens that appeared in the input during the forward pass will be updated. The learning rate determines the size of the parameter updates.
For Beginners: This method actually changes the embeddings to improve future predictions.
After figuring out how each embedding should change:
- The embedding matrix is updated by subtracting the gradients
- Each value is adjusted proportionally to its gradient
- The learning rate controls how big these adjustments are
For example:
- If embedding for token #5 has a gradient of [0.1, -0.2, 0.3]
- With learning rate of 0.01
- The embedding will change by [-0.001, 0.002, -0.003]
Only embeddings for tokens that appeared in the recent input batch will be updated. Frequently used tokens will get more updates over time.
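Putting the pieces together, one training step for the layer in isolation might look like this sketch (the upstream gradient outputGrad would come from the layers above):
// Sketch: one training step for the embedding layer.
Tensor<float> output = embedding.Forward(tokenIds);        // caches which IDs were used
Tensor<float> inputGrad = embedding.Backward(outputGrad);  // accumulates gradients for used embeddings
embedding.UpdateParameters(0.01f);                         // embedding -= 0.01 * gradient
// Calling UpdateParameters before Backward throws InvalidOperationException.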
Exceptions
- InvalidOperationException
Thrown when update is called before backward.