Class SelfAttentionLayer<T>
- Namespace: AiDotNet.NeuralNetworks.Layers
- Assembly: AiDotNet.dll
Represents a self-attention layer that allows a sequence to attend to itself, capturing relationships between elements.
public class SelfAttentionLayer<T> : LayerBase<T>, ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
  - LayerBase<T> → SelfAttentionLayer<T>
- Implements
  - ILayer<T>, IJitCompilable<T>, IWeightLoadable<T>, IDisposable, IAuxiliaryLossLayer<T>, IDiagnosticsProvider
Remarks
The SelfAttentionLayer implements the self-attention mechanism, a key component of transformer architectures. It allows each position in a sequence to attend to all positions within the same sequence, enabling the model to capture long-range dependencies and relationships. The layer uses the scaled dot-product attention mechanism with multiple attention heads, which allows it to focus on different aspects of the input simultaneously.
For Beginners: This layer helps a neural network understand relationships between different parts of a sequence.
Think of the SelfAttentionLayer like a group of spotlights at a theater performance:
- Each spotlight (attention head) can focus on different actors on stage
- For each actor, the spotlights decide which other actors are most relevant to them
- The spotlights assign importance scores to these relationships
- This helps the network understand who is interacting with whom, and how
For example, in a sentence like "The cat sat on the mat because it was tired":
- Traditional networks might struggle to figure out what "it" refers to
- Self-attention can learn that "it" has a strong relationship with "cat"
- This helps the network understand that the cat was tired, not the mat
Multi-head attention (using multiple "spotlights") allows the layer to focus on different types of relationships simultaneously, such as grammatical structure, semantic meaning, and contextual clues.
Self-attention is a cornerstone of modern natural language processing and has revolutionized how neural networks handle sequential data like text, time series, and even images.
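For reference, the scaled dot-product attention used by each head follows the standard Transformer formulation (stated here for intuition; the exact implementation details are internal to the layer):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value projections of the input, and d_k = embeddingDimension / headCount is the per-head dimension.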
Constructors
SelfAttentionLayer(int, int, int, IActivationFunction<T>?)
Initializes a new instance of the SelfAttentionLayer<T> class with a scalar activation function.
public SelfAttentionLayer(int sequenceLength, int embeddingDimension, int headCount = 8, IActivationFunction<T>? activationFunction = null)
Parameters
sequenceLength (int): The length of the input sequence.
embeddingDimension (int): The dimension of the input and output embeddings.
headCount (int): The number of attention heads. Defaults to 8.
activationFunction (IActivationFunction<T>): The activation function to apply to the output. Defaults to Identity if not specified.
Remarks
This constructor creates a new SelfAttentionLayer with the specified dimensions and a scalar activation function. It validates that the embedding dimension is divisible by the number of heads and initializes the weight matrices and bias vector with appropriate values. A scalar activation function is applied element-wise to each output embedding independently.
For Beginners: This creates a new self-attention layer for your neural network using a simple activation function.
When you create this layer, you specify:
- sequenceLength: How many items (like words) are in your sequence
- embeddingDimension: How many features each item has
- headCount: How many different "spotlights" the attention mechanism uses (default: 8)
- activationFunction: How to transform the output (defaults to Identity, which makes no changes)
For example, in a language model:
- sequenceLength might be 512 (the maximum number of words/tokens in a text)
- embeddingDimension might be 768 (the number of features per word/token)
- Using 8 attention heads lets the model focus on 8 different types of relationships
The embedding dimension must be divisible by the number of heads (e.g., 768 ÷ 8 = 96), so each head has the same dimension.
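As a minimal usage sketch (the dimensions below are illustrative and match the language-model example above):

// Create a self-attention layer for 512-token sequences with 768-dimensional
// embeddings and 8 attention heads (768 / 8 = 96 dimensions per head).
var attention = new SelfAttentionLayer<float>(
    sequenceLength: 512,
    embeddingDimension: 768,
    headCount: 8); // activationFunction is omitted, so the Identity default is used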
Exceptions
- ArgumentException
Thrown when the embedding dimension is not divisible by the number of heads.
SelfAttentionLayer(int, int, int, IVectorActivationFunction<T>?)
Initializes a new instance of the SelfAttentionLayer<T> class with a vector activation function.
public SelfAttentionLayer(int sequenceLength, int embeddingDimension, int headCount = 8, IVectorActivationFunction<T>? vectorActivationFunction = null)
Parameters
sequenceLength (int): The length of the input sequence.
embeddingDimension (int): The dimension of the input and output embeddings.
headCount (int): The number of attention heads. Defaults to 8.
vectorActivationFunction (IVectorActivationFunction<T>): The vector activation function to apply to the output. Defaults to Identity if not specified.
Remarks
This constructor creates a new SelfAttentionLayer with the specified dimensions and a vector activation function. It validates that the embedding dimension is divisible by the number of heads and initializes the weight tensors and bias tensor with appropriate values. A vector activation function is applied to the entire output vector at once, which allows for interactions between different output elements.
For Beginners: This creates a new self-attention layer for your neural network using an advanced activation function.
When you create this layer, you specify the same parameters as in the scalar version, but with a vector activation:
- sequenceLength: How many items are in your sequence
- embeddingDimension: How many features each item has
- headCount: How many different "spotlights" the attention mechanism uses
- vectorActivationFunction: How to transform the entire output as a group
A vector activation can consider relationships between different positions in the output, which might be useful for certain advanced applications.
This constructor works the same as the scalar version, but allows for more sophisticated activation patterns across the output sequence.
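A sketch of selecting this overload; any IVectorActivationFunction<float> implementation from your project can be substituted for the null placeholder:

// Explicitly typing the argument as IVectorActivationFunction<float>? selects
// this overload; passing null falls back to the Identity activation.
IVectorActivationFunction<float>? vectorActivation = null;
var attention = new SelfAttentionLayer<float>(512, 768, 8, vectorActivation);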
Exceptions
- ArgumentException
Thrown when the embedding dimension is not divisible by the number of heads.
Properties
AuxiliaryLossWeight
Gets or sets the weight for the attention sparsity auxiliary loss.
public T AuxiliaryLossWeight { get; set; }
Property Value
- T
Remarks
This weight controls how much attention sparsity regularization contributes to the total loss. Typical values range from 0.001 to 0.01.
For Beginners: This controls how much we encourage focused attention.
Common values:
- 0.005 (default): Balanced sparsity regularization
- 0.001-0.003: Light sparsity enforcement
- 0.008-0.01: Strong sparsity enforcement
Higher values encourage sharper, more focused attention patterns.
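For example, to enable light sparsity regularization (a sketch assuming an existing SelfAttentionLayer<float> instance named attention):

attention.UseAuxiliaryLoss = true;      // enable attention sparsity regularization
attention.AuxiliaryLossWeight = 0.002f; // light enforcement (default is 0.005)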
ParameterCount
Gets the total number of trainable parameters in this layer.
public override int ParameterCount { get; }
Property Value
- int
The total number of parameters: 3 weight matrices (Q, K, V) each of size [embeddingDimension × embeddingDimension], plus an output bias of size [embeddingDimension]. Total = 3 × E² + E = E × (3E + 1) where E is the embedding dimension.
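For instance, with the 768-dimensional embedding from the constructor example, the total is 3 × 768² + 768 = 1,770,240 trainable parameters.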
SupportsGpuExecution
Gets a value indicating whether this layer supports GPU execution.
protected override bool SupportsGpuExecution { get; }
Property Value
- bool
SupportsJitCompilation
Gets whether this self-attention layer supports JIT compilation.
public override bool SupportsJitCompilation { get; }
Property Value
- bool
True if the layer parameters are initialized.
Remarks
This property indicates whether the layer can be JIT compiled. The layer supports JIT if:
- Query, key, and value projection weights are initialized
- The layer has been properly configured with sequence length and embedding dimensions
For Beginners: This tells you if this layer can use JIT compilation for faster inference.
The layer can be JIT compiled if:
- The layer has been initialized with projection weight matrices (query, key, value weights)
- The multi-head structure has been configured
Self-attention layers are computationally expensive because each position attends to all other positions in the sequence (O(n²) complexity). JIT compilation can provide significant speedup (5-10x) by optimizing:
- Parallel matrix multiplications for projections
- Multi-head attention score computation across heads
- Softmax operations for attention weights
- Weighted sums of values across all heads
This is especially critical for Transformers where self-attention is the bottleneck:
- BERT has 12-24 self-attention layers
- GPT-3 has 96 self-attention layers
- Vision Transformers process image patches as sequences
JIT compilation makes these models practical for production use.
SupportsTraining
Gets a value indicating whether this layer supports training.
public override bool SupportsTraining { get; }
Property Value
- bool
Always true for SelfAttentionLayer, indicating that the layer can be trained through backpropagation.
Remarks
This property indicates that the SelfAttentionLayer has trainable parameters (query, key, and value weights, as well as output biases) that can be optimized during the training process using backpropagation. The gradients of these parameters are calculated during the backward pass and used to update the parameters.
For Beginners: This property tells you if the layer can learn from data.
A value of true means:
- The layer has values (weights and biases) that can be adjusted during training
- It will improve its performance as it sees more data
- It participates in the learning process of the neural network
When you train a neural network containing this layer, it will automatically learn which relationships between sequence positions are important for your specific task.
UseAuxiliaryLoss
Gets or sets whether auxiliary loss (attention sparsity regularization) should be used during training.
public bool UseAuxiliaryLoss { get; set; }
Property Value
- bool
Remarks
Attention sparsity regularization encourages the attention mechanism to focus on relevant positions while ignoring irrelevant ones. This prevents attention from being too diffuse and improves interpretability.
For Beginners: This helps self-attention focus on what matters.
Self-attention works best when it's selective:
- Without regularization: Attention might spread too thin across all positions
- With regularization: Attention focuses on truly relevant relationships
This includes:
- Entropy regularization: Prevents overly uniform attention
- Sparsity penalties: Encourages sharp, focused attention patterns
This helps the model:
- Learn clearer, more interpretable attention patterns
- Focus computational resources on relevant relationships
- Improve robustness and generalization
Methods
Backward(Tensor<T>)
Performs the backward pass of the self-attention layer.
public override Tensor<T> Backward(Tensor<T> outputGradient)
Parameters
outputGradient (Tensor<T>): The gradient of the loss with respect to the layer's output.
Returns
- Tensor<T>
The gradient of the loss with respect to the layer's input.
Remarks
This method implements the backward pass of the self-attention layer, which is used during training to propagate error gradients back through the network. It calculates the gradients of the loss with respect to the layer's parameters (query, key, and value weights, as well as output biases) and with respect to the layer's input. The calculation involves complex tensor operations that essentially reverse the computations done in the forward pass.
For Beginners: This method calculates how the layer's parameters should change to reduce errors.
During the backward pass:
- The layer receives error gradients indicating how the output should change
- It calculates how each of its internal components contributed to the error:
- How the query weights should change
- How the key weights should change
- How the value weights should change
- How the output biases should change
- It also calculates how the error should propagate back to the previous layer
This involves complex matrix mathematics, but the basic idea is:
- Finding which attention patterns led to errors
- Adjusting the weights to improve these patterns
- Sending appropriate feedback to the previous layer
The backward pass is what allows the self-attention mechanism to learn which relationships in the sequence are important for the specific task.
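Putting the forward and backward passes together, a single training step might look like the following sketch (input and lossGradient are assumed to be tensors of shape [batchSize, sequenceLength, embeddingDimension] obtained elsewhere in your training loop; the learning rate is illustrative):

// Forward pass: shape [batchSize, sequenceLength, embeddingDimension] in and out.
Tensor<float> output = attention.Forward(input);

// ... compute the loss and its gradient with respect to `output` ...

// Backward pass: accumulates weight/bias gradients and returns the input gradient.
Tensor<float> inputGradient = attention.Backward(lossGradient);

// Apply the accumulated gradients.
attention.UpdateParameters(0.001f);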
Exceptions
- InvalidOperationException
Thrown when backward is called before forward.
BackwardGpu(IGpuTensor<T>)
Performs the backward pass using GPU-resident tensors.
public override IGpuTensor<T> BackwardGpu(IGpuTensor<T> outputGradient)
Parameters
outputGradient (IGpuTensor<T>): GPU-resident gradient of the loss w.r.t. output.
Returns
- IGpuTensor<T>
GPU-resident gradient of the loss w.r.t. input.
ComputeAuxiliaryLoss()
Computes the attention sparsity auxiliary loss for this layer.
public T ComputeAuxiliaryLoss()
Returns
- T
The computed attention sparsity auxiliary loss.
Remarks
This method computes the attention sparsity auxiliary loss from the attention weights produced during the most recent forward pass. The loss combines entropy regularization, which discourages overly uniform attention, with an L1 sparsity penalty on the attention weights. It only contributes to training when UseAuxiliaryLoss is enabled, and its influence on the total loss is controlled by AuxiliaryLossWeight.
For Beginners: This method measures how focused the layer's attention patterns are.
The computed loss:
- Is larger when attention is spread thinly and uniformly across many positions
- Is smaller when attention concentrates on a few relevant positions
- Is added to the main training loss (scaled by AuxiliaryLossWeight) to encourage sharper, more interpretable attention
Use GetAuxiliaryLossDiagnostics() to inspect the individual entropy and sparsity components.
ExportComputationGraph(List<ComputationNode<T>>)
Exports the self-attention layer as a computation graph for JIT compilation.
public override ComputationNode<T> ExportComputationGraph(List<ComputationNode<T>> inputNodes)
Parameters
inputNodes (List<ComputationNode<T>>): List to which the input node will be added.
Returns
- ComputationNode<T>
The output computation node representing the self-attention operation.
Remarks
This method creates a symbolic computation graph for JIT compilation:
1. Creates a symbolic input node with shape [batch=1, sequenceLength, embeddingDimension]
2. Creates constant nodes for the query, key, and value projection weights
3. Projects the input to Q, K, and V using matrix multiplication (in self-attention, all three come from the same input)
4. Applies the multi-head scaled dot-product attention mechanism
5. Returns the attention output with residual connection and bias
For Beginners: This method builds a symbolic representation of self-attention for JIT.
JIT compilation converts multi-head self-attention into optimized native code. Self-attention allows each position in a sequence to attend to all positions, enabling the model to capture long-range dependencies and relationships within the sequence.
Multi-head attention uses multiple parallel attention mechanisms ("heads") that:
- Focus on different aspects of the input simultaneously
- Allow the model to capture diverse relationships (syntax, semantics, context)
- Improve the model's ability to understand complex patterns
The symbolic graph allows the JIT compiler to:
- Optimize parallel matrix multiplications across heads
- Fuse attention score computation and softmax
- Generate efficient memory layouts for multi-head processing
- Optimize the split and concatenation operations for heads
Self-attention is the core of Transformer architectures (BERT, GPT, Vision Transformers). JIT compilation provides 5-10x speedup by optimizing these complex operations.
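A sketch of how the export might be driven from calling code (the surrounding JIT pipeline is assumed; only the calls shown here come from this layer's API):

if (attention.SupportsJitCompilation)
{
    var inputNodes = new List<ComputationNode<float>>();
    ComputationNode<float> outputNode = attention.ExportComputationGraph(inputNodes);
    // inputNodes now holds the symbolic input node; outputNode represents the
    // complete multi-head self-attention computation for the JIT compiler.
}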
Exceptions
- ArgumentNullException
Thrown when inputNodes is null.
- InvalidOperationException
Thrown when layer parameters are not initialized.
Forward(Tensor<T>)
Performs the forward pass of the self-attention layer.
public override Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>): The input tensor to process, with shape [batchSize, sequenceLength, embeddingDimension].
Returns
- Tensor<T>
The output tensor after self-attention, with the same shape as the input.
Remarks
This method implements the forward pass of the self-attention layer. It transforms the input into queries, keys, and values, then computes attention scores between each position and all other positions. These scores are normalized using the softmax function and used to compute a weighted sum of the values. The result is transformed back to the original embedding dimension and passed through an activation function.
For Beginners: This method processes your sequence data through the self-attention mechanism.
During the forward pass:
- The input sequence is transformed into three different representations:
- Queries: What each position is looking for
- Keys: What each position has to offer
- Values: The actual content at each position
- For each position, attention scores are computed by comparing its query with all keys
- These scores are scaled and normalized to create attention weights
- Each position's output is a weighted sum of all values, based on the attention weights
- The result is transformed and passed through an activation function
Imagine a classroom where each student (position) asks a question (query) to the entire class. Other students offer answers (keys) and knowledge (values). Each student pays more attention to the most relevant answers and combines that knowledge to form their own understanding.
The multi-head mechanism allows this process to happen in parallel with different "perspectives" or types of questions.
ForwardGpu(params IGpuTensor<T>[])
Performs the forward pass using GPU-resident tensors.
public override IGpuTensor<T> ForwardGpu(params IGpuTensor<T>[] inputs)
Parameters
inputs (IGpuTensor<T>[]): The GPU-resident input tensors.
Returns
- IGpuTensor<T>
A GPU-resident output tensor.
Remarks
This method performs the entire self-attention forward pass on the GPU without downloading intermediate results to CPU. All projections, attention computation, and bias addition remain GPU-resident for maximum performance.
GetAuxiliaryLossDiagnostics()
Gets diagnostic information about the attention sparsity auxiliary loss.
public Dictionary<string, string> GetAuxiliaryLossDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic information about attention regularization.
Remarks
This method returns detailed diagnostics about attention sparsity regularization, including entropy loss, sparsity penalty, and configuration parameters. This information is useful for monitoring training progress and debugging attention patterns.
For Beginners: This provides information about how attention regularization is working.
The diagnostics include:
- Total entropy loss (how focused attention patterns are)
- Total sparsity loss (L1 penalty on attention weights)
- Weight applied to the regularization
- Whether regularization is enabled
- Number of attention heads
This helps you:
- Monitor if attention is becoming too diffuse or too sharp
- Debug issues with attention patterns
- Understand the impact of regularization on learning
You can use this information to adjust regularization weights for better results.
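For example, after a forward pass you might log the regularization diagnostics like this (a sketch; the exact dictionary keys depend on the implementation):

float auxiliaryLoss = attention.ComputeAuxiliaryLoss();
foreach (var entry in attention.GetAuxiliaryLossDiagnostics())
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}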
GetDiagnostics()
Gets diagnostic information about this component's state and behavior. Overrides GetDiagnostics() to include auxiliary loss diagnostics.
public override Dictionary<string, string> GetDiagnostics()
Returns
- Dictionary<string, string>
A dictionary containing diagnostic metrics including both base layer diagnostics and auxiliary loss diagnostics from GetAuxiliaryLossDiagnostics().
GetParameters()
Gets all trainable parameters of the self-attention layer as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all trainable parameters (query weights, key weights, value weights, and output biases).
Remarks
This method retrieves all trainable parameters of the self-attention layer as a single vector. The query weights are stored first, followed by the key weights, value weights, and finally the output biases. This is useful for optimization algorithms that operate on all parameters at once, or for saving and loading model weights.
For Beginners: This method collects all the learnable values from the self-attention layer.
The parameters:
- Are the weights and biases that the self-attention layer learns during training
- Control how the layer processes sequence information
- Are returned as a single list (vector)
This is useful for:
- Saving the model to disk
- Loading parameters from a previously trained model
- Advanced optimization techniques that need access to all parameters
The query weights are stored first in the vector, followed by the key weights, value weights, and finally the output biases.
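A sketch of a simple parameter round-trip, e.g. for checkpointing (serialization of the vector itself is left out; see SetParameters(Vector<T>) below):

// Snapshot the current weights and biases.
Vector<float> parameters = attention.GetParameters();

// ... persist `parameters`, or copy them into another layer ...

// Restore them later; the vector length must match ParameterCount.
attention.SetParameters(parameters);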
ResetState()
Resets the internal state of the self-attention layer.
public override void ResetState()
Remarks
This method resets the internal state of the self-attention layer, including the cached inputs, outputs, attention scores from the forward pass, and the gradients from the backward pass. This is useful when starting to process a new batch of data.
For Beginners: This method clears the layer's memory to start fresh.
When resetting the state:
- Stored inputs, outputs, and attention scores from previous calculations are cleared
- Calculated gradients for all weights and biases are cleared
- The layer forgets any information from previous batches
This is important for:
- Processing a new, unrelated batch of data
- Preventing information from one batch affecting another
- Managing memory usage efficiently
Since the self-attention layer caches quite a bit of information during the forward and backward passes, resetting the state helps prevent memory leaks and ensures each new sequence is processed independently.
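For example, between unrelated batches:

attention.ResetState(); // clears cached inputs, outputs, attention scores, and gradients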
SetParameters(Vector<T>)
Sets the trainable parameters of the self-attention layer.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): A vector containing all parameters (query weights, key weights, value weights, and output biases) to set.
Remarks
This method sets the trainable parameters of the self-attention layer from a single vector. The vector should contain the query weight values first, followed by the key weight values, value weight values, and finally the output bias values. This is useful for loading saved model weights or for implementing optimization algorithms that operate on all parameters at once.
For Beginners: This method updates all the weights and biases in the self-attention layer.
When setting parameters:
- The input must be a vector with the correct total length
- The first part of the vector is used for the query weights
- The second part of the vector is used for the key weights
- The third part of the vector is used for the value weights
- The last part of the vector is used for the output biases
This is useful for:
- Loading a previously saved model
- Transferring parameters from another model
- Testing different parameter values
An error is thrown if the input vector doesn't have the expected number of parameters.
Exceptions
- ArgumentException
Thrown when the parameters vector has incorrect length.
UpdateParameters(T)
Updates the parameters of the self-attention layer using the calculated gradients.
public override void UpdateParameters(T learningRate)
Parameters
learningRate (T): The learning rate to use for the parameter updates.
Remarks
This method updates the query weights, key weights, value weights, and output biases of the self-attention layer based on the gradients calculated during the backward pass. The learning rate controls the size of the parameter updates. This method should be called after the backward pass to apply the calculated updates.
For Beginners: This method updates the layer's internal values during training.
When updating parameters:
- The query weight values are adjusted based on their gradients
- The key weight values are adjusted based on their gradients
- The value weight values are adjusted based on their gradients
- The output bias values are adjusted based on their gradients
- The learning rate controls how big each update step is
These updates help the self-attention mechanism:
- Focus on more relevant relationships between positions
- Ignore irrelevant relationships
- Better understand the structure of your sequences
Smaller learning rates mean slower but more stable learning, while larger learning rates mean faster but potentially unstable learning.
Exceptions
- InvalidOperationException
Thrown when UpdateParameters is called before Backward.