Class MemoryReadLayer<T>

Namespace
AiDotNet.NeuralNetworks.Layers
Assembly
AiDotNet.dll

Represents a layer that reads from a memory tensor using an attention mechanism.

public class MemoryReadLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
object ← LayerBase<T> ← MemoryReadLayer<T>

Implements
ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider

Remarks

The MemoryReadLayer implements a form of attention-based memory access. It computes attention scores between the input and memory tensors, using these scores to create a weighted sum of memory values. This approach allows the layer to selectively retrieve information from memory based on the current input. The layer consists of key weights (for attention computation), value weights (for transforming memory values), and output weights (for final processing).

For Beginners: This layer helps a neural network retrieve information from memory.

Think of it like searching for relevant information in a book:

  • You have a query (your current input)
  • You have a memory (like pages of a book)
  • The layer finds which parts of the memory are most relevant to your query
  • It then combines those relevant parts to produce an output

For example, if your input represents a question like "What's the capital of France?", the layer would look through memory to find information about France, give more attention to content about its capital, and then combine this information to produce the answer "Paris".

This is similar to how modern language models can retrieve and use stored information when answering questions.

Constructors

MemoryReadLayer(int, int, int, IActivationFunction<T>?)

Initializes a new instance of the MemoryReadLayer<T> class with the specified dimensions and a scalar activation function.

public MemoryReadLayer(int inputDimension, int memoryDimension, int outputDimension, IActivationFunction<T>? activationFunction = null)

Parameters

inputDimension int

The size of the input vector.

memoryDimension int

The size of each memory entry.

outputDimension int

The size of the output vector.

activationFunction IActivationFunction<T>

The activation function to apply after processing. Defaults to Identity if not specified.

Remarks

This constructor creates a MemoryReadLayer with the specified dimensions and activation function. The layer is initialized with random weights scaled according to the layer dimensions to facilitate stable training. The bias is initialized to zero.

For Beginners: This constructor sets up the layer with the necessary dimensions and activation function.

When creating a MemoryReadLayer, you need to specify:

  • inputDimension: The size of your query vector (e.g., 128 for a 128-feature query)
  • memoryDimension: The size of each memory entry (e.g., 256 for memory entries with 256 features)
  • outputDimension: The size of the output you want (e.g., 64 for a 64-feature result)
  • activationFunction: The function that processes the final output (optional)

The constructor creates weight matrices of the appropriate sizes and initializes them with small random values to start the learning process. The initialization scale is carefully chosen to prevent vanishing or exploding gradients during training.
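Example (sketch): a minimal construction following the signature above. ReLUActivation<float> is a hypothetical IActivationFunction<T> implementation used for illustration; substitute whichever activation your build provides.

// Create a layer for 128-feature queries over 256-feature memory
// entries, producing 64-feature outputs.
var readLayer = new MemoryReadLayer<float>(
    inputDimension: 128,
    memoryDimension: 256,
    outputDimension: 64,
    activationFunction: new ReLUActivation<float>());

// Omitting the activation falls back to Identity, per the remarks above.
var linearRead = new MemoryReadLayer<float>(128, 256, 64);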

MemoryReadLayer(int, int, int, IVectorActivationFunction<T>?)

Initializes a new instance of the MemoryReadLayer<T> class with the specified dimensions and a vector activation function.

public MemoryReadLayer(int inputDimension, int memoryDimension, int outputDimension, IVectorActivationFunction<T>? activationFunction = null)

Parameters

inputDimension int

The size of the input vector.

memoryDimension int

The size of each memory entry.

outputDimension int

The size of the output vector.

activationFunction IVectorActivationFunction<T>

The vector activation function to apply after processing. Defaults to Identity if not specified.

Remarks

This constructor creates a MemoryReadLayer with the specified dimensions and vector activation function. A vector activation function operates on entire vectors rather than individual elements. The layer is initialized with random weights scaled according to the layer dimensions to facilitate stable training. The bias is initialized to zero.

For Beginners: This constructor sets up the layer with the necessary dimensions and a vector-based activation function.

A vector activation function:

  • Operates on entire groups of numbers at once, rather than one at a time
  • Can capture relationships between different elements in the output
  • Defaults to the Identity function, which doesn't change the values

This constructor is useful when you need more complex activation patterns that consider the relationships between different outputs in your memory reading operation.
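Example (sketch): the same dimensions with a vector activation. SoftmaxActivation<float> is a hypothetical IVectorActivationFunction<T> implementation used purely for illustration.

// A vector activation can capture relationships between output elements.
var readLayer = new MemoryReadLayer<float>(
    128, 256, 64,
    activationFunction: new SoftmaxActivation<float>());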

Properties

AuxiliaryLossWeight

Gets or sets the weight for the auxiliary loss contribution.

public T AuxiliaryLossWeight { get; set; }

Property Value

T

Remarks

This value determines how much the attention sparsity loss contributes to the total loss. The default value of 0.005 provides a good balance between the main task and sparsity regularization.

For Beginners: This controls how much importance to give to the attention sparsity penalty.

The weight affects training:

  • Higher values (e.g., 0.01) make the network prioritize focused attention more strongly
  • Lower values (e.g., 0.001) make the sparsity penalty less important
  • The default (0.005) works well for most memory-augmented tasks

If your memory attention is too diffuse (spreading across too many locations), increase this value. If the main task is more important, you might decrease it.
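Example (sketch): enabling the sparsity penalty and raising its weight, using only the UseAuxiliaryLoss and AuxiliaryLossWeight properties documented on this page.

var readLayer = new MemoryReadLayer<double>(128, 256, 64);

// Opt in to the attention sparsity auxiliary loss.
readLayer.UseAuxiliaryLoss = true;

// Attention too diffuse? Push the weight above the 0.005 default.
readLayer.AuxiliaryLossWeight = 0.01;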

SupportsGpuExecution

Gets whether this layer has a GPU execution implementation for inference.

protected override bool SupportsGpuExecution { get; }

Property Value

bool

Remarks

Override this to return true when the layer implements ForwardGpu(params IGpuTensor<T>[]). The actual CanExecuteOnGpu property combines this with engine availability.

For Beginners: This flag indicates if the layer has GPU code for the forward pass. Set this to true in derived classes that implement ForwardGpu.

SupportsJitCompilation

Gets whether this layer supports JIT compilation.

public override bool SupportsJitCompilation { get; }

Property Value

bool

True if the layer can be JIT compiled, false otherwise.

Remarks

This property indicates whether the layer has implemented ExportComputationGraph() and can benefit from JIT compilation. All layers MUST implement this property.

For Beginners: JIT compilation can make inference 5-10x faster by converting the layer's operations into optimized native code.

Layers should return false if they:

  • Have not yet implemented a working ExportComputationGraph()
  • Use dynamic operations that change based on input data
  • Are too simple to benefit from JIT compilation

When false, the layer will use the standard Forward() method instead.

SupportsTraining

Gets a value indicating whether this layer supports training.

public override bool SupportsTraining { get; }

Property Value

bool

Always true because the MemoryReadLayer has trainable parameters.

Remarks

This property indicates that MemoryReadLayer can be trained through backpropagation. The layer has trainable parameters (weights and biases) that are updated during training to optimize the memory reading process.

For Beginners: This property tells you that this layer can learn from data.

A value of true means:

  • The layer has internal values (weights and biases) that change during training
  • It will improve its performance as it sees more data
  • It learns to better focus attention on relevant parts of memory

During training, the layer learns:

  • Which features in the input are important for querying memory
  • How to transform retrieved memory information
  • How to combine everything into a useful output

UseAuxiliaryLoss

Gets or sets a value indicating whether auxiliary loss is enabled for this layer.

public bool UseAuxiliaryLoss { get; set; }

Property Value

bool

Remarks

When enabled, the layer computes an attention sparsity auxiliary loss that encourages focused memory access. This helps prevent the layer from attending to too many memory locations at once, promoting more selective retrieval.

For Beginners: This setting controls whether the layer uses an additional learning signal.

When enabled (true):

  • The layer encourages focused attention on specific memory locations
  • This helps the network learn to be more selective about what information it retrieves
  • Training may be more stable and produce better memory access patterns

When disabled (false):

  • Only the main task loss is used for training
  • This is the default setting

Methods

Backward(Tensor<T>)

Performs the backward pass of the memory read layer.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

The gradient of the loss with respect to the layer's output.

Returns

Tensor<T>

The gradient of the loss with respect to the layer's inputs (both input and memory).

Remarks

This method implements the backward pass of the memory read layer, which is used during training to propagate error gradients back through the network. It computes the gradients of all weights and biases, as well as the gradients with respect to both the input and memory tensors. The computed weight and bias gradients are stored for later use in the parameter update step.

For Beginners: This method calculates how all parameters should change to reduce errors.

During the backward pass:

  • The layer receives gradients indicating how the output should change
  • It calculates how each weight, bias, and input value should change
  • These gradients are used later to update the parameters during training

The backward pass is complex because it needs to:

  • Calculate gradients for all weights (key, value, and output)
  • Calculate gradients for the bias
  • Calculate gradients for both the input and memory tensors
  • Handle the chain rule through the softmax attention mechanism

This is an implementation of backpropagation through an attention mechanism, which is a key component of many modern neural network architectures.

Exceptions

InvalidOperationException

Thrown when backward is called before forward.

ComputeAuxiliaryLoss()

Computes the auxiliary loss for this layer based on attention sparsity regularization.

public T ComputeAuxiliaryLoss()

Returns

T

The computed auxiliary loss value.

Remarks

This method computes an attention sparsity loss that encourages focused memory access patterns. The loss is the entropy of the attention weights: L = -Σ(p * log(p)), where p represents the attention probabilities. Lower entropy (more focused attention) results in a lower loss, which encourages the layer to attend to specific memory locations rather than spreading attention uniformly.

For Beginners: This method calculates a penalty for unfocused attention patterns.

Attention sparsity loss:

  • Measures how focused the attention is on specific memory locations
  • Lower values mean more focused attention (good)
  • Higher values mean attention is spread across many locations (less focused)

Why this is useful:

  • In most tasks, you want to retrieve specific relevant information from memory
  • Spreading attention too thin means you get a "blurry" mix of information
  • Focused attention means you get clear, specific information

Example: If you're answering "What is the capital of France?" from memory, you want focused attention on the entry about Paris, not a mix of all French cities.

Technical note: The loss is computed using entropy. Entropy measures how "spread out" a distribution is.

  • Low entropy = focused distribution (e.g., [0.9, 0.05, 0.05] - mostly on first item)
  • High entropy = spread-out distribution (e.g., [0.33, 0.33, 0.34] - spread evenly)

We use the entropy itself as the loss, so the network is penalized for high entropy (unfocused attention).
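Example (sketch): the entropy numbers behind the two distributions above, computed independently of the library (Math.Log is the natural logarithm).

// Entropy H(p) = -Σ p * log(p); this is the quantity used as the loss.
static double Entropy(double[] p)
{
    double h = 0.0;
    foreach (var pi in p)
        if (pi > 0) h -= pi * Math.Log(pi);
    return h;
}

double focused = Entropy(new[] { 0.9, 0.05, 0.05 });  // ≈ 0.39 (low loss)
double diffuse = Entropy(new[] { 0.33, 0.33, 0.34 }); // ≈ 1.10 (high loss)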

ExportComputationGraph(List<ComputationNode<T>>)

Exports the layer's computation graph for JIT compilation.

public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)

Parameters

inputNodes List<ComputationNode<T>>

List to populate with input computation nodes.

Returns

ComputationNode<T>

The output computation node representing the layer's operation.

Remarks

This method constructs a computation graph representation of the layer's forward pass that can be JIT compiled for faster inference. All layers MUST implement this method to support JIT compilation.

For Beginners: JIT (Just-In-Time) compilation converts the layer's operations into optimized native code for 5-10x faster inference.

To support JIT compilation, a layer must:

  1. Implement this method to export its computation graph
  2. Set SupportsJitCompilation to true
  3. Use ComputationNode and TensorOperations to build the graph

All layers are required to implement this method, even if they set SupportsJitCompilation = false.
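Example (sketch): gating graph export on the layer's own capability flag, using only the members documented on this page.

var inputNodes = new List<ComputationNode<float>>();
if (readLayer.SupportsJitCompilation)
{
    // Build the graph once, then hand the output node to whatever
    // JIT compiler your build of AiDotNet provides.
    ComputationNode<float> outputNode = readLayer.ExportComputationGraph(inputNodes);
}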

Forward(Tensor<T>)

Performs a forward pass using a default identity-like memory tensor.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor.

Returns

Tensor<T>

The output tensor produced using the default memory tensor.

Remarks

This overload provides a default identity-like memory tensor so the layer can be used in generic pipelines that only pass a single input tensor. For custom memory contents, use Forward(input, memory) instead.

For Beginners: This lets you use the layer without manually supplying a memory tensor.

The layer creates a simple "identity" memory that passes values through, which is useful for quick tests or when a pipeline only supports a single input.
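Example (sketch): the single-input form. BuildQueryBatch is a hypothetical helper standing in for however your pipeline produces tensors.

// Shape [batch, inputDimension], e.g. [32, 128].
Tensor<float> query = BuildQueryBatch();

// The layer supplies its own identity-like memory tensor.
Tensor<float> output = readLayer.Forward(query);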

Forward(Tensor<T>, Tensor<T>)

Performs the forward pass of the memory read layer with input and memory tensors.

public Tensor<T> Forward(Tensor<T> input, Tensor<T> memory)

Parameters

input Tensor<T>

The input tensor to process.

memory Tensor<T>

The memory tensor to read from.

Returns

Tensor<T>

The output tensor after memory reading and processing.

Remarks

This method implements the forward pass of the memory read layer. It computes attention scores between the input and memory, applies softmax to get attention weights, retrieves a weighted sum of memory values, applies transformations through the value and output weights, and finally adds the bias and applies the activation function.

For Beginners: This method performs the actual memory reading operation based on the input.

The forward pass works in these steps:

  1. Use the input to create query keys by applying the key weights
  2. Compare these keys with each memory entry to get attention scores
  3. Convert the scores to weights using softmax (making them sum to 1.0)
  4. Use these weights to create a weighted sum of memory values
  5. Transform this retrieved information through value and output weights
  6. Add bias and apply activation function for the final output

This is similar to how attention works in many modern AI systems: the input "attends" to relevant parts of memory, focusing more on what's important for the current task and less on irrelevant information.
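Example (sketch): an explicit memory read. BuildQueryBatch and BuildMemory are hypothetical helpers; the shapes follow the constructor dimensions.

Tensor<float> query  = BuildQueryBatch(); // [batch, inputDimension], e.g. [32, 128]
Tensor<float> memory = BuildMemory();     // [entries, memoryDimension], e.g. [512, 256]

// Attends over the memory entries and returns [batch, outputDimension].
Tensor<float> result = readLayer.Forward(query, memory);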

ForwardGpu(params IGpuTensor<T>[])

Performs the forward pass of the layer on GPU.

public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)

Parameters

inputs IGpuTensor<T>[]

The GPU-resident input tensor(s).

Returns

IGpuTensor<T>

The GPU-resident output tensor.

Remarks

This method performs the layer's forward computation entirely on GPU. The input and output tensors remain in GPU memory, avoiding expensive CPU-GPU transfers.

For Beginners: This is like Forward() but runs on the graphics card.

The key difference:

  • Forward() uses CPU tensors that may be copied to/from GPU
  • ForwardGpu() keeps everything on GPU the whole time

Override this in derived classes that support GPU acceleration.

Exceptions

NotSupportedException

Thrown when the layer does not support GPU execution.

GetAuxiliaryLossDiagnostics()

Gets diagnostic information about the auxiliary loss computation.

public Dictionary<string, string> GetAuxiliaryLossDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic information about the auxiliary loss.

Remarks

This method returns diagnostic information that can be used to monitor the auxiliary loss during training. The diagnostics include the total attention sparsity loss, the weight applied to it, and whether auxiliary loss is enabled.

For Beginners: This method provides information to help you understand how the auxiliary loss is working.

The diagnostics show:

  • TotalAttentionSparsityLoss: The computed penalty for unfocused attention
  • AttentionSparsityWeight: How much this penalty affects the overall training
  • UseAttentionSparsity: Whether this penalty is currently enabled

You can use this information to:

  • Monitor if attention is becoming more focused over time
  • Debug training issues related to memory access
  • Understand how the layer is learning to retrieve information

Example: If TotalAttentionSparsityLoss is decreasing during training, it means the layer is learning to be more focused in its memory access, which is typically a good sign. If it's staying high or increasing, it might mean the layer is having trouble learning which parts of memory are relevant.
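Example (sketch): logging the diagnostics during training. The key names come from the list above.

Dictionary<string, string> diag = readLayer.GetAuxiliaryLossDiagnostics();
Console.WriteLine(
    $"sparsity loss: {diag["TotalAttentionSparsityLoss"]}, " +
    $"weight: {diag["AttentionSparsityWeight"]}, " +
    $"enabled: {diag["UseAttentionSparsity"]}");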

GetDiagnostics()

Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.

public override Dictionary<string, string> GetDiagnostics()

Returns

Dictionary<string, string>

A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().

GetParameters()

Gets all trainable parameters from the memory read layer as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all trainable parameters.

Remarks

This method retrieves all trainable parameters from the layer as a single vector. It concatenates the key weights, value weights, output weights, and output bias into a single vector. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.

For Beginners: This method collects all the learnable values in the layer.

The parameters:

  • Are the numbers that the neural network learns during training
  • Include all the weights and biases from this layer
  • Are combined into a single long list (vector)

This is useful for:

  • Saving the model to disk
  • Loading parameters from a previously trained model
  • Advanced optimization techniques that need access to all parameters

The method carefully arranges all parameters in a specific order so they can be correctly restored later.
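Example (sketch): a snapshot/restore round trip using the documented parameter layout (key weights, value weights, output weights, then bias).

// Capture the current parameters as a single vector.
Vector<float> snapshot = readLayer.GetParameters();

// ... train further, or experiment with other values ...

// Restore the earlier state; throws ArgumentException on a length mismatch.
readLayer.SetParameters(snapshot);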

ResetState()

Resets the internal state of the memory read layer.

public override void ResetState()

Remarks

This method resets the internal state of the memory read layer, including the cached inputs, memory, outputs, attention scores, and all gradients. This is useful when starting to process a new sequence or batch of data, or when implementing stateful networks.

For Beginners: This method clears the layer's memory to start fresh.

When resetting the state:

  • Stored inputs, memory, outputs, and attention scores from previous processing are cleared
  • All calculated gradients are cleared
  • The layer forgets any information from previous data batches

This is important for:

  • Processing a new, unrelated batch of data
  • Ensuring clean state before a new training epoch
  • Preventing information from one batch from leaking into another

Resetting state helps ensure that each forward and backward pass is independent, which is important for correct behavior in many neural network architectures.
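Example (sketch): clearing state between unrelated batches. Here batches is a hypothetical sequence of (query, memory) tensor pairs.

foreach (var (query, memory) in batches)
{
    var output = readLayer.Forward(query, memory);
    // ... backward pass and parameter update during training ...
    readLayer.ResetState(); // the next batch starts from a clean slate
}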

SetParameters(Vector<T>)

Sets the trainable parameters for the memory read layer.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

A vector containing all parameters to set.

Remarks

This method sets all trainable parameters of the layer from a single vector. It extracts the appropriate portions of the input vector for each parameter (key weights, value weights, output weights, and output bias). This is useful for loading saved model weights or for implementing optimization algorithms that operate on all parameters at once.

For Beginners: This method updates all the learnable values in the layer.

When setting parameters:

  • The input must be a vector with the correct length
  • The method extracts portions for each weight matrix and bias vector
  • It places each value in its correct position

This is useful for:

  • Loading a previously saved model
  • Transferring parameters from another model
  • Testing different parameter values

An error is thrown if the input vector doesn't have the expected number of parameters, ensuring that all matrices and vectors maintain their correct dimensions.

Exceptions

ArgumentException

Thrown when the parameters vector has incorrect length.

UpdateParameters(T)

Updates the parameters of the memory read layer using the calculated gradients.

public override void UpdateParameters(T learningRate)

Parameters

learningRate T

The learning rate to use for the parameter updates.

Remarks

This method updates all trainable parameters of the layer (key weights, value weights, output weights, and output bias) based on the gradients calculated during the backward pass. The learning rate controls the size of the parameter updates. Each parameter is updated by subtracting the corresponding gradient multiplied by the learning rate.

For Beginners: This method updates all the layer's weights and biases during training.

After the backward pass calculates how parameters should change, this method:

  • Takes each weight matrix and bias vector
  • Subtracts the corresponding gradient scaled by the learning rate
  • This moves the parameters in the direction that reduces errors

The learning rate controls how big each update step is:

  • Smaller learning rates mean slower but more stable learning
  • Larger learning rates mean faster but potentially unstable learning

This is how the layer gradually improves its performance over many training iterations.
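Example (sketch): one full training step, respecting the required Forward → Backward → UpdateParameters order. ComputeLossGradient and target are hypothetical stand-ins for your loss computation.

Tensor<float> output = readLayer.Forward(query, memory);
Tensor<float> outputGradient = ComputeLossGradient(output, target);

readLayer.Backward(outputGradient); // caches weight and bias gradients
readLayer.UpdateParameters(0.01f);  // subtracts 0.01 × gradient from each parameter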

Exceptions

InvalidOperationException

Thrown when UpdateParameters is called before Backward.