Class TensorOperations<T>

Namespace
AiDotNet.Autodiff
Assembly
AiDotNet.dll

Provides automatic differentiation support for tensor operations.

public static class TensorOperations<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
TensorOperations<T>

Remarks

TensorOperations is a helper class that integrates automatic differentiation with tensor operations. It records operations performed on tensors to an active GradientTape (if present) and creates the computation graph needed for backpropagation.

This class follows the opt-in pattern: tensor operations only record to the gradient tape when explicitly used within a GradientTape context. Outside of a GradientTape context, operations work normally without any overhead.

For Beginners: This class bridges regular tensor operations with automatic differentiation.

Think of it like adding a "recording mode" to your calculations:

  • When you're inside a GradientTape context, operations are recorded
  • The recording remembers how each value was computed
  • Later, you can "play it backwards" to compute gradients
  • When not recording, operations work exactly as before

This enables features like:

  • Automatic gradient computation for neural network training
  • Computing derivatives without writing manual backward passes
  • Building complex computational graphs automatically

Example usage:

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var y = TensorOperations<double>.Variable(parameterTensor, "y");
    tape.Watch(x);
    tape.Watch(y);

    var z = TensorOperations<double>.Add(x, y); // Recorded to tape
    var gradients = tape.Gradient(z, new[] { x, y });
}

Methods

Abs(ComputationNode<T>)

Computes the absolute value of each element in a computation node.

public static ComputationNode<T> Abs(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the absolute values.

Remarks

This method computes |x| for each element and records the operation. The backward function uses the sign of the original values for gradient computation.

For Beginners: This makes all values positive (removes the sign).

For absolute value (c = |a|):

  • The forward pass removes the sign of each element
  • The backward pass uses sign(a) to route gradients correctly
  • For positive values, gradient passes through unchanged
  • For negative values, gradient is negated

Note: At x = 0, the gradient is technically undefined, but we use 0 as a convention.

Add(ComputationNode<T>, ComputationNode<T>)

Performs element-wise addition of two computation nodes.

public static ComputationNode<T> Add(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The first node.

b ComputationNode<T>

The second node.

Returns

ComputationNode<T>

A new computation node containing the sum.

Remarks

This method performs element-wise addition and records the operation to any active GradientTape. The backward function distributes gradients equally to both inputs (since ∂(a+b)/∂a = 1 and ∂(a+b)/∂b = 1).

For Beginners: This adds two tensors together and remembers how to compute gradients.

For addition (c = a + b):

  • The forward pass computes the sum element-wise
  • The backward pass sends gradients to both inputs unchanged
  • This is because changing 'a' by 1 changes the sum by 1, same for 'b'

Example: If the gradient flowing back to c is [1, 2, 3], then both 'a' and 'b' receive [1, 2, 3]
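
Example usage (a minimal sketch following the tape pattern from the class-level example; aTensor and bTensor are assumed pre-existing Tensor<double> values of the same shape):

using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aTensor, "a");
    var b = TensorOperations<double>.Variable(bTensor, "b");
    tape.Watch(a);
    tape.Watch(b);

    // Forward: element-wise sum, recorded to the tape
    var c = TensorOperations<double>.Add(a, b);

    // Backward: both 'a' and 'b' receive the incoming gradient unchanged
    var gradients = tape.Gradient(c, new[] { a, b });
}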

AffineGrid(ComputationNode<T>, int, int)

Generates a sampling grid for spatial transformer networks using affine transformation matrices.

public static ComputationNode<T> AffineGrid(ComputationNode<T> theta, int outputHeight, int outputWidth)

Parameters

theta ComputationNode<T>

Affine transformation matrices of shape [batch, 2, 3]

outputHeight int

Height of the output grid

outputWidth int

Width of the output grid

Returns

ComputationNode<T>

Sampling grid of shape [batch, outputHeight, outputWidth, 2] with (x, y) coordinates

Remarks

This operation generates a grid of sampling coordinates for spatial transformations. The output grid starts as a regular grid in normalized coordinates [-1, 1], then each point is transformed using the affine matrix.

Forward pass:

  1. Generate the base grid in [-1, 1] normalized space.
  2. For each point (x_out, y_out) in output space:
     x_in = theta[0,0]*x_out + theta[0,1]*y_out + theta[0,2]
     y_in = theta[1,0]*x_out + theta[1,1]*y_out + theta[1,2]

Backward pass:

  • ∂L/∂theta[i,j] = sum over all grid points of (∂L/∂grid * ∂grid/∂theta)

For Beginners: This creates a map showing where each output pixel should sample from. The affine matrix controls rotation, scaling, translation, and shearing of the grid.

AnomalyScore(ComputationNode<T>, ComputationNode<T>)

Computes anomaly scores using reconstruction error or density estimation.

public static ComputationNode<T> AnomalyScore(ComputationNode<T> input, ComputationNode<T> reconstruction)

Parameters

input ComputationNode<T>

Input tensor.

reconstruction ComputationNode<T>

Reconstructed input (e.g., from autoencoder).

Returns

ComputationNode<T>

Anomaly scores (higher = more anomalous).

ApplyActivation(ComputationNode<T>, IActivationFunction<T>)

Applies a generic activation function (scalar or element-wise) with automatic differentiation.

public static ComputationNode<T> ApplyActivation(ComputationNode<T> input, IActivationFunction<T> activation)

Parameters

input ComputationNode<T>

The input computation node.

activation IActivationFunction<T>

The activation function to apply.

Returns

ComputationNode<T>

A new computation node with the activation applied.

Remarks

This method provides generic autodiff support for ANY activation function that implements IActivationFunction{T}. It works by applying the activation function element-wise during the forward pass, then using the activation's ComputeDerivative method during backpropagation.

This means ALL 39 built-in activation functions automatically work with autodiff, and only truly custom user-defined activations (that don't inherit from ActivationFunctionBase) would fail.
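
Example usage (a sketch; inputTensor is an assumed Tensor<double>, and myActivation stands for any existing IActivationFunction<double> implementation):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    tape.Watch(x);

    // Forward: applies the activation element-wise; backward: uses its ComputeDerivative method
    var activated = TensorOperations<double>.ApplyActivation(x, myActivation);
    var gradients = tape.Gradient(activated, new[] { x });
}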

AvgPool2D(ComputationNode<T>, int[], int[]?)

Performs 2D average pooling on a 4D tensor (batch, channels, height, width).

public static ComputationNode<T> AvgPool2D(ComputationNode<T> a, int[] poolSize, int[]? strides = null)

Parameters

a ComputationNode<T>

The input node with shape [batch, channels, height, width].

poolSize int[]

The size of the pooling window [poolH, poolW].

strides int[]

The stride for the pooling operation [strideH, strideW]. If null, uses poolSize.

Returns

ComputationNode<T>

A new computation node containing the average pooled result.

Remarks

This method performs average pooling over 2D spatial dimensions. The backward function distributes gradients equally across the pooling window.

For Beginners: AvgPool downsamples by taking the average value in each window.

For average pooling:

  • The forward pass slides a window and computes the average
  • This smoothly reduces spatial dimensions
  • The backward pass distributes gradients equally to all elements in the window
  • Each element gets gradient / pool_area

Used in:

  • CNNs for smoother downsampling than max pooling
  • Global average pooling (replacing fully connected layers)
  • Reducing overfitting
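
Example usage (a sketch; imageBatch is an assumed Tensor<double> with shape [batch, channels, height, width]):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(imageBatch, "x");
    tape.Watch(x);

    // A 2x2 window with stride 2 halves both spatial dimensions
    var pooled = TensorOperations<double>.AvgPool2D(x, poolSize: new[] { 2, 2 }, strides: new[] { 2, 2 });

    // During backpropagation, each input element in a window receives gradient / 4 (the pool area)
    var gradients = tape.Gradient(pooled, new[] { x });
}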

BatchMatrixMultiply(ComputationNode<T>, ComputationNode<T>)

Performs batched matrix multiplication of two 3D computation nodes.

public static ComputationNode<T> BatchMatrixMultiply(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The first 3D tensor with shape [Batch, M, K].

b ComputationNode<T>

The second 3D tensor with shape [Batch, K, N].

Returns

ComputationNode<T>

A computation node representing the batched matrix multiplication with shape [Batch, M, N].

Remarks

For 3D tensors, performs an independent matrix multiplication for each batch index: result[i] = a[i] @ b[i].

Gradient computation:

  • ∂(A·B)/∂A = gradOut·B^T (batch-wise)
  • ∂(A·B)/∂B = A^T·gradOut (batch-wise)

BatchNorm(ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?, Tensor<T>?, Tensor<T>?, bool, double)

Applies batch normalization to a computation node.

public static ComputationNode<T> BatchNorm(ComputationNode<T> a, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, Tensor<T>? runningMean = null, Tensor<T>? runningVar = null, bool training = true, double epsilon = 1E-05)

Parameters

a ComputationNode<T>

The input node with shape [batch, features].

gamma ComputationNode<T>

Optional scale parameter (learnable). If null, uses ones.

beta ComputationNode<T>

Optional shift parameter (learnable). If null, uses zeros.

runningMean Tensor<T>

Running mean for inference (not updated during this operation).

runningVar Tensor<T>

Running variance for inference (not updated during this operation).

training bool

Whether in training mode (uses batch statistics) or inference mode (uses running statistics).

epsilon double

Small constant for numerical stability. Default is 1e-5.

Returns

ComputationNode<T>

A new computation node containing the batch normalized result.

Remarks

Batch normalization normalizes inputs across the batch dimension. During training: Uses batch statistics (mean and variance computed from current batch). During inference: Uses running statistics (accumulated during training).

For Beginners: BatchNorm standardizes features across the batch.

For batch normalization:

  • Training mode: Uses current batch's mean and variance
  • Inference mode: Uses running mean/variance from training
  • Normalizes: (x - mean) / sqrt(variance)
  • Scales and shifts: result * gamma + beta

Benefits:

  • Stabilizes training (reduces internal covariate shift)
  • Allows higher learning rates
  • Acts as regularization

Used in:

  • CNNs (after convolutional layers)
  • Deep feedforward networks
  • GANs and many other architectures
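
Example usage (a sketch; featureBatch has shape [batch, features], gammaTensor/betaTensor have shape [features], and meanTensor/varTensor are assumed running statistics accumulated elsewhere):

var x = TensorOperations<double>.Variable(featureBatch, "x");
var gamma = TensorOperations<double>.Variable(gammaTensor, "gamma");   // learnable scale
var beta = TensorOperations<double>.Variable(betaTensor, "beta");      // learnable shift

// Training: normalize with the current batch's mean and variance
var trainOutput = TensorOperations<double>.BatchNorm(x, gamma, beta, training: true);

// Inference: reuse the running statistics collected during training
var evalOutput = TensorOperations<double>.BatchNorm(x, gamma, beta,
    runningMean: meanTensor, runningVar: varTensor, training: false);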

BentIdentity(ComputationNode<T>)

Applies the Bent Identity activation function element-wise.

public static ComputationNode<T> BentIdentity(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with BentIdentity applied.

Remarks

BentIdentity is defined as: f(x) = (sqrt(x² + 1) - 1) / 2 + x

The gradient is: x / (2 * sqrt(x² + 1)) + 1

For Beginners: BentIdentity is a smooth alternative to ReLU with non-zero gradient everywhere, preventing dead neurons during training.

Broadcast(ComputationNode<T>, int[])

Broadcasts a 1D tensor to a 2D tensor by tiling along the batch dimension.

public static ComputationNode<T> Broadcast(ComputationNode<T> a, int[] targetShape)

Parameters

a ComputationNode<T>

The input 1D tensor node with shape [N].

targetShape int[]

The target 2D shape [batchSize, N].

Returns

ComputationNode<T>

A new computation node with the broadcasted tensor.

Remarks

This operation broadcasts a 1D tensor (e.g., biases with shape [outputSize]) to a 2D tensor (e.g., [batchSize, outputSize]) by replicating values along the batch dimension. The backward pass correctly sums gradients along the broadcasted dimension.

For Beginners: Broadcasting is like copying a row multiple times to create a matrix.

For example, if you have biases [b1, b2, b3] and need to add them to a batch of outputs:

  • Input: [b1, b2, b3] (shape [3])
  • Target shape: [batchSize=2, 3]
  • Output: [[b1, b2, b3], [b1, b2, b3]] (each row is a copy)

During backpropagation, gradients from all rows are summed back to the original biases, because each bias contributed to all batch elements.
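
Example usage (a sketch; biasTensor is an assumed Tensor<double> with shape [10], and outputs is an assumed ComputationNode<double> with shape [32, 10] produced earlier in the graph):

using (var tape = new GradientTape<double>())
{
    var bias = TensorOperations<double>.Variable(biasTensor, "bias");
    tape.Watch(bias);

    // Tile the bias across the batch dimension so it matches the [32, 10] output
    var tiled = TensorOperations<double>.Broadcast(bias, new[] { 32, 10 });
    var withBias = TensorOperations<double>.Add(outputs, tiled);

    // Gradients flowing into 'tiled' are summed over the batch dimension back into 'bias'
    var gradients = tape.Gradient(withBias, new[] { bias });
}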

CELU(ComputationNode<T>, double)

Applies the CELU (Continuously Differentiable ELU) activation function element-wise.

public static ComputationNode<T> CELU(ComputationNode<T> a, double alpha = 1)

Parameters

a ComputationNode<T>

The input computation node.

alpha double

The alpha parameter controlling negative saturation. Default is 1.0.

Returns

ComputationNode<T>

A new computation node with CELU applied.

Remarks

CELU is defined as: max(0, x) + min(0, α * (exp(x/α) - 1))

The gradient is: 1 if x >= 0, otherwise exp(x/α)

For Beginners: CELU is an improved version of ELU that is continuously differentiable everywhere, which can help with optimization and training stability.

CRFForward(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?)

CRF forward algorithm for sequence labeling.

public static ComputationNode<T> CRFForward(ComputationNode<T> emissions, ComputationNode<T> transitions, ComputationNode<T>? startScores = null, ComputationNode<T>? endScores = null)

Parameters

emissions ComputationNode<T>

Emission scores [seq_len, num_tags].

transitions ComputationNode<T>

Transition matrix [num_tags, num_tags].

startScores ComputationNode<T>

Optional start scores [num_tags].

endScores ComputationNode<T>

Optional end scores [num_tags].

Returns

ComputationNode<T>

Log partition function (normalizer).

Remarks

Computes the log partition function using the forward-backward algorithm. This is differentiable and returns proper gradients for emissions, transitions, start scores, and end scores.

ComplexMatMul(ComputationNode<T>, ComputationNode<T>, string)

Performs complex matrix multiplication on tensors representing complex numbers as [real, imag] pairs.

public static ComputationNode<T> ComplexMatMul(ComputationNode<T> a, ComputationNode<T> b, string format = "split")

Parameters

a ComputationNode<T>

First complex matrix [batch, m, 2*k] where dimensions are [real, imag] interleaved or concatenated.

b ComputationNode<T>

Second complex matrix [batch, 2*k, n].

format string

Whether complex numbers are "interleaved" ([r,i,r,i,...]) or "split" ([r,r,...,i,i,...]).

Returns

ComputationNode<T>

Complex matrix product [batch, m, 2*n].

Remarks

Complex multiplication: (a + bi)(c + di) = (ac - bd) + (ad + bc)i

For Beginners: This multiplies matrices of complex numbers.

Complex numbers are represented as pairs of real numbers [real_part, imaginary_part]. This operation implements the full complex matrix multiplication formula.

Used in quantum computing layers where quantum gates are unitary matrices.

ComplexMultiply(ComputationNode<T>, ComputationNode<T>, string)

Performs element-wise complex multiplication.

public static ComputationNode<T> ComplexMultiply(ComputationNode<T> a, ComputationNode<T> b, string format = "split")

Parameters

a ComputationNode<T>

First complex tensor with last dimension of size 2*n.

b ComputationNode<T>

Second complex tensor with last dimension of size 2*n.

format string

Whether complex numbers are "split" ([r,r,...,i,i,...]).

Returns

ComputationNode<T>

Element-wise complex product.

Remarks

Complex multiplication: (a + bi)(c + di) = (ac - bd) + (ad + bc)i

Concat(List<ComputationNode<T>>, int)

Concatenates multiple computation nodes along a specified axis.

public static ComputationNode<T> Concat(List<ComputationNode<T>> nodes, int axis = 0)

Parameters

nodes List<ComputationNode<T>>

The list of nodes to concatenate.

axis int

The axis along which to concatenate. Default is 0.

Returns

ComputationNode<T>

A new computation node containing the concatenated result.

Remarks

This method concatenates tensors along the specified axis. All tensors must have the same shape except along the concatenation axis. The backward function splits the gradient and sends each portion to the corresponding input.

For Beginners: Concat stacks tensors together along a dimension.

For concatenation:

  • The forward pass combines multiple tensors into one larger tensor
  • The backward pass splits the gradient back to each input
  • Think of it like gluing arrays together end-to-end

Used in:

  • Skip connections (concatenating features from different layers)
  • Multi-input architectures
  • Feature fusion in neural networks
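
Example usage (a sketch; featuresA and featuresB are assumed Tensor<double> values with shapes [batch, 64] and [batch, 32]):

using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(featuresA, "a");
    var b = TensorOperations<double>.Variable(featuresB, "b");
    tape.Watch(a);
    tape.Watch(b);

    // Concatenate along the feature axis: [batch, 64] + [batch, 32] -> [batch, 96]
    var fused = TensorOperations<double>.Concat(new List<ComputationNode<double>> { a, b }, axis: 1);

    // The backward pass splits the gradient: the first 64 columns flow to 'a', the remaining 32 to 'b'
    var gradients = tape.Gradient(fused, new[] { a, b });
}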

Constant(Tensor<T>, string?)

Creates a constant computation node from a tensor value.

public static ComputationNode<T> Constant(Tensor<T> value, string? name = null)

Parameters

value Tensor<T>

The tensor value.

name string

Optional name for the node.

Returns

ComputationNode<T>

A computation node that doesn't require gradients.

Remarks

This method creates a constant node - a value that won't have gradients computed. Use this for constants, hyperparameters, or intermediate values you don't need gradients for.

For Beginners: This creates a value that won't be adjusted during training.

Use this for:

  • Constants (like pi, e, or fixed multipliers)
  • Hyperparameters that don't change during training
  • Any value you don't need gradients for (saves memory)

Conv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)

Performs 2D convolution on a 4D tensor (batch, channels, height, width).

public static ComputationNode<T> Conv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)

Parameters

input ComputationNode<T>

The input node with shape [batch, inChannels, height, width].

kernel ComputationNode<T>

The kernel/filter with shape [outChannels, inChannels, kernelH, kernelW].

bias ComputationNode<T>

Optional bias with shape [outChannels]. If null, no bias is added.

stride int[]

The stride [strideH, strideW]. Default is [1, 1].

padding int[]

The padding [padH, padW]. Default is [0, 0].

Returns

ComputationNode<T>

A new computation node containing the convolution result.

Remarks

This method performs 2D convolution, the fundamental operation in CNNs. Forward: Slides the kernel over the input computing dot products. Backward: Computes gradients for both input and kernel using transposed convolutions.

For Beginners: Conv2D is the core operation of convolutional neural networks.

For 2D convolution:

  • The kernel "slides" over the input, computing weighted sums
  • Each output position is a dot product of the kernel with input patch
  • Stride controls how far the kernel moves each step
  • Padding adds borders to control output size

Gradient computation:

  • Gradient w.r.t. input: "full" convolution with flipped kernel
  • Gradient w.r.t. kernel: cross-correlation between input and output gradient

Used in:

  • All CNNs (image classification, object detection, segmentation)
  • Feature extraction in vision models
  • Learning spatial hierarchies
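
Example usage (a sketch; imageBatch is [batch, inChannels, H, W], kernelTensor is [outChannels, inChannels, 3, 3], and biasTensor is [outChannels], all assumed Tensor<double> values):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(imageBatch, "x");
    var w = TensorOperations<double>.Variable(kernelTensor, "w");
    var b = TensorOperations<double>.Variable(biasTensor, "b");
    tape.Watch(w);
    tape.Watch(b);

    // A 3x3 convolution with stride 1 and padding 1 keeps the spatial size unchanged
    var y = TensorOperations<double>.Conv2D(x, w, b, stride: new[] { 1, 1 }, padding: new[] { 1, 1 });

    // Gradients for the kernel and bias, ready for an optimizer update
    var gradients = tape.Gradient(y, new[] { w, b });
}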

Conv3D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)

Performs 3D convolution on a 5D tensor (batch, channels, depth, height, width).

public static ComputationNode<T> Conv3D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)

Parameters

input ComputationNode<T>

The input node with shape [batch, inChannels, depth, height, width].

kernel ComputationNode<T>

The kernel/filter with shape [outChannels, inChannels, kernelD, kernelH, kernelW].

bias ComputationNode<T>

Optional bias with shape [outChannels]. If null, no bias is added.

stride int[]

The stride [strideD, strideH, strideW]. Default is [1, 1, 1].

padding int[]

The padding [padD, padH, padW]. Default is [0, 0, 0].

Returns

ComputationNode<T>

A new computation node containing the 3D convolution result.

Remarks

This method performs 3D convolution, the fundamental operation for volumetric data processing. Forward: Slides the kernel over the input computing dot products across all three spatial dimensions. Backward: Computes gradients for both input and kernel using transposed 3D convolutions.

For Beginners: Conv3D is the 3D extension of Conv2D for volumetric data.

For 3D convolution:

  • The kernel "slides" over depth, height, and width dimensions
  • Each output position is a dot product of the kernel with an input volume
  • Stride controls how far the kernel moves each step in each dimension
  • Padding adds borders to control output size

Gradient computation:

  • Gradient w.r.t. input: "full" 3D convolution with flipped kernel
  • Gradient w.r.t. kernel: 3D cross-correlation between input and output gradient

Used in:

  • 3D object recognition from voxel grids (VoxNet, VoxelCNN)
  • Medical image analysis (CT/MRI volumetric scans)
  • Video understanding (treating time as depth dimension)
  • Point cloud processing after voxelization

Exceptions

ArgumentException

Thrown when input or kernel have invalid dimensions.

ConvTranspose2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?, int[]?)

Performs 2D transposed convolution (deconvolution) on a 4D tensor.

public static ComputationNode<T> ConvTranspose2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? outputPadding = null)

Parameters

input ComputationNode<T>

The input node with shape [batch, inChannels, height, width].

kernel ComputationNode<T>

The kernel with shape [inChannels, outChannels, kernelH, kernelW] (note: reversed from Conv2D).

bias ComputationNode<T>

Optional bias with shape [outChannels]. If null, no bias is added.

stride int[]

The stride [strideH, strideW]. Default is [1, 1].

padding int[]

The padding [padH, padW]. Default is [0, 0].

outputPadding int[]

Output padding [outPadH, outPadW] for size adjustment. Default is [0, 0].

Returns

ComputationNode<T>

A new computation node containing the transposed convolution result.

Remarks

Transposed convolution (often called deconvolution) upsamples the input. It's the gradient of Conv2D with respect to its input, used as a forward operation.

For Beginners: ConvTranspose2D upsamples spatial dimensions.

For transposed convolution:

  • Inserts zeros between input elements according to stride
  • Applies regular convolution to the expanded input
  • Results in larger spatial dimensions (upsampling)

Used in:

  • Image generation (GANs, VAEs)
  • Semantic segmentation (U-Net decoder)
  • Super-resolution
  • Any task requiring upsampling

Crop(ComputationNode<T>, int[])

Crops a tensor by removing elements from the edges.

public static ComputationNode<T> Crop(ComputationNode<T> a, int[] cropping)

Parameters

a ComputationNode<T>

The input computation node.

cropping int[]

Array of [top, bottom, left, right] cropping amounts for 4D tensors.

Returns

ComputationNode<T>

A computation node representing the cropped tensor.

DeformableConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?, int[]?, int[]?, int[]?)

Performs 2D deformable convolution with learnable offsets and optional modulation mask.

public static ComputationNode<T> DeformableConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T> offset, ComputationNode<T>? mask = null, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? dilation = null)

Parameters

input ComputationNode<T>

Input tensor [batch, inChannels, height, width].

kernel ComputationNode<T>

Convolution kernel [outChannels, inChannels, kernelH, kernelW].

offset ComputationNode<T>

Spatial offsets [batch, 2*kernelH*kernelW, outH, outW].

mask ComputationNode<T>

Optional modulation mask [batch, kernelH*kernelW, outH, outW]. If null, uses uniform weights.

bias ComputationNode<T>

Optional bias [outChannels]. If null, no bias is added.

stride int[]

Stride [strideH, strideW]. Default is [1, 1].

padding int[]

Padding [padH, padW]. Default is [0, 0].

dilation int[]

Dilation [dilationH, dilationW]. Default is [1, 1].

Returns

ComputationNode<T>

Output tensor [batch, outChannels, outH, outW].

Remarks

Deformable convolution augments standard convolution with learnable 2D offsets for each sampling position in the kernel. This allows the network to adaptively adjust its receptive field based on the input, enabling better modeling of geometric transformations.

For Beginners: Standard convolution samples at fixed grid positions. Deformable convolution learns where to sample, allowing it to handle objects of various shapes and scales more effectively.

DepthwiseConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)

Performs depthwise 2D convolution where each input channel is convolved with its own set of filters.

public static ComputationNode<T> DepthwiseConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)

Parameters

input ComputationNode<T>

Input tensor of shape [batch, in_channels, height, width]

kernel ComputationNode<T>

Kernel tensor of shape [in_channels, multiplier, kernel_height, kernel_width]

bias ComputationNode<T>

Optional bias tensor of shape [in_channels * multiplier]

stride int[]

Stride for the convolution, defaults to [1, 1]

padding int[]

Padding for the convolution, defaults to [0, 0]

Returns

ComputationNode<T>

Output tensor of shape [batch, in_channels * multiplier, out_height, out_width]

Remarks

Depthwise convolution applies a separate filter to each input channel independently, with no mixing across channels. This is in contrast to standard convolution which mixes all input channels. Each input channel gets 'multiplier' filters applied to it, producing 'multiplier' output channels. The total output channels is in_channels * multiplier.

This operation is commonly used in MobileNets and other efficient architectures, often followed by a pointwise (1x1) convolution to mix channels. The combination dramatically reduces computational cost compared to standard convolution.

Forward pass computes the depthwise convolution by applying each filter only to its corresponding input channel. Backward pass computes gradients with respect to input, kernel, and bias.

DilatedConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?, int[]?)

Performs dilated (atrous) 2D convolution operation.

public static ComputationNode<T> DilatedConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? dilation = null)

Parameters

input ComputationNode<T>

The input tensor with shape [batch, channels, height, width].

kernel ComputationNode<T>

The convolution kernel with shape [out_channels, in_channels, kernel_height, kernel_width].

bias ComputationNode<T>

Optional bias tensor with shape [out_channels].

stride int[]

The stride for the convolution. Defaults to [1, 1].

padding int[]

The padding for the convolution. Defaults to [0, 0].

dilation int[]

The dilation rate for the convolution. Defaults to [1, 1].

Returns

ComputationNode<T>

A computation node representing the dilated convolution result.

Divide(ComputationNode<T>, ComputationNode<T>)

Performs element-wise division of two computation nodes.

public static ComputationNode<T> Divide(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The numerator node.

b ComputationNode<T>

The denominator node.

Returns

ComputationNode<T>

A new computation node containing the element-wise quotient.

Remarks

This method performs element-wise division and records the operation to any active GradientTape. The backward function uses the quotient rule: ∂(a/b)/∂a = 1/b and ∂(a/b)/∂b = -a/b².

For Beginners: This divides one tensor by another element-wise and tracks gradients.

For element-wise division (c = a / b):

  • The forward pass divides corresponding elements
  • The backward pass uses the quotient rule from calculus
  • Gradient to 'a' is: incoming gradient * (1/b)
  • Gradient to 'b' is: incoming gradient * (-a/b²)

Example: If a=[6,8], b=[2,4], then c=[3,2]. If the gradient to c is [1,1]:

  • 'a' receives [1/2, 1/4] = [0.5, 0.25]
  • 'b' receives [-6/4, -8/16] = [-1.5, -0.5]

ELU(ComputationNode<T>, double)

Applies the Exponential Linear Unit (ELU) activation function to a computation node.

public static ComputationNode<T> ELU(ComputationNode<T> a, double alpha = 1)

Parameters

a ComputationNode<T>

The input computation node.

alpha double

The alpha parameter controlling the negative saturation value. Default is 1.0.

Returns

ComputationNode<T>

A new computation node with ELU applied.

Remarks

ELU(x) = x if x > 0, alpha * (exp(x) - 1) otherwise. ELU helps prevent "dying neurons" and pushes mean activations closer to zero.

Gradient: d(ELU)/dx = 1 if x > 0, alpha * exp(x) = ELU(x) + alpha otherwise.

ElementwiseMultiply(ComputationNode<T>, ComputationNode<T>)

Performs element-wise multiplication of two computation nodes.

public static ComputationNode<T> ElementwiseMultiply(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The first node.

b ComputationNode<T>

The second node.

Returns

ComputationNode<T>

A new computation node containing the element-wise product.

Remarks

This method performs element-wise (Hadamard) multiplication and records the operation. The backward function uses the product rule: ∂(a*b)/∂a = b and ∂(a*b)/∂b = a.

For Beginners: This multiplies two tensors element-wise and tracks gradients.

For element-wise multiplication (c = a * b):

  • The forward pass multiplies corresponding elements
  • The backward pass uses the product rule from calculus
  • Gradient to 'a' is: incoming gradient * b's value
  • Gradient to 'b' is: incoming gradient * a's value

Example: If a=[2,3], b=[4,5], then c=[8,15]. If the gradient to c is [1,1]:

  • 'a' receives [1*4, 1*5] = [4, 5]
  • 'b' receives [1*2, 1*3] = [2, 3]

EmbeddingLookup(ComputationNode<T>, ComputationNode<T>)

Performs embedding lookup operation.

public static ComputationNode<T> EmbeddingLookup(ComputationNode<T> embeddings, ComputationNode<T> indices)

Parameters

embeddings ComputationNode<T>

The embedding matrix [vocab_size, embedding_dim].

indices ComputationNode<T>

The indices to lookup [batch_size, sequence_length].

Returns

ComputationNode<T>

The looked up embeddings [batch_size, sequence_length, embedding_dim].

Exp(ComputationNode<T>)

Computes the exponential function (e^x) for a computation node.

public static ComputationNode<T> Exp(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the exponential result.

Remarks

This method computes e raised to each element and records the operation. The backward function uses: ∂(e^a)/∂a = e^a.

For Beginners: This computes e^x for each element and tracks gradients.

For exponential (c = e^a):

  • The forward pass computes e^x for each element
  • The backward pass has a special property: the derivative equals the output!
  • Gradient to 'a' is: incoming gradient * e^a (which is just the output)

This is used in softmax, sigmoid, and many activation functions.

FakeQuantize(ComputationNode<T>, int, T?, T?, bool)

Performs fake quantization with Straight-Through Estimator (STE) for differentiable quantization.

public static ComputationNode<T> FakeQuantize(ComputationNode<T> input, int numBits = 8, T? scale = default, T? zeroPoint = default, bool symmetric = true)

Parameters

input ComputationNode<T>

The input tensor to quantize.

numBits int

Number of quantization bits (default: 8).

scale T

Scale factor (if null, computed from input range).

zeroPoint T

Zero point for asymmetric quantization (default: 0).

symmetric bool

Whether to use symmetric quantization (default: true).

Returns

ComputationNode<T>

Fake-quantized tensor (quantized forward, STE backward).

Remarks

Forward: output = round(input / scale) * scale (clipped to valid range)

Backward: gradient passes through unchanged (Straight-Through Estimator)

For Beginners: This simulates quantization during training while allowing gradients to flow back for optimization. The forward pass applies real quantization, but the backward pass pretends it didn't happen - this trick (STE) lets us train models that will be quantized for deployment.
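
Example usage (a sketch; weightTensor is an assumed Tensor<double> of layer weights):

var w = TensorOperations<double>.Variable(weightTensor, "w");

// Simulate 8-bit symmetric quantization in the forward pass; the scale is derived from the input range
var wQuantized = TensorOperations<double>.FakeQuantize(w, numBits: 8, symmetric: true);

// wQuantized replaces w for the rest of the forward pass; during backpropagation
// the gradient flows back to w unchanged (Straight-Through Estimator)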

GELU(ComputationNode<T>)

Applies the Gaussian Error Linear Unit (GELU) activation function.

public static ComputationNode<T> GELU(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with GELU applied.

Remarks

GELU(x) = x * Φ(x) where Φ is the standard Gaussian cumulative distribution function. Approximation: 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))

GELU is widely used in transformers (BERT, GPT) and modern architectures.

Gradient: d(GELU)/dx = Φ(x) + x * φ(x) where φ is the Gaussian PDF.

GRUCell(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)

GRU cell forward pass.

public static ComputationNode<T> GRUCell(ComputationNode<T> input, ComputationNode<T> hiddenState, ComputationNode<T> weightIH, ComputationNode<T> weightHH, ComputationNode<T> bias)

Parameters

input ComputationNode<T>

Input tensor [batch, input_dim].

hiddenState ComputationNode<T>

Previous hidden state [batch, hidden_dim].

weightIH ComputationNode<T>

Input-to-hidden weights [input_dim, 3*hidden_dim].

weightHH ComputationNode<T>

Hidden-to-hidden weights [hidden_dim, 3*hidden_dim].

bias ComputationNode<T>

Bias terms [3*hidden_dim].

Returns

ComputationNode<T>

New hidden state.

Gaussian(ComputationNode<T>)

Applies the Gaussian activation function element-wise: f(x) = exp(-x²).

public static ComputationNode<T> Gaussian(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with Gaussian applied.

Remarks

Gaussian is defined as: f(x) = exp(-x²)

The gradient is: -2x * exp(-x²)

For Beginners: Gaussian creates a bell-shaped response curve that is maximum at zero and approaches zero for large inputs in either direction. Useful for RBF networks and pattern recognition.

GraphConv(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)

Performs graph convolution operation for graph neural networks.

public static ComputationNode<T> GraphConv(ComputationNode<T> input, ComputationNode<T> adjacency, ComputationNode<T> weights, ComputationNode<T>? bias = null)

Parameters

input ComputationNode<T>

Input node features of shape [batch, numNodes, inputFeatures]

adjacency ComputationNode<T>

Adjacency matrix of shape [batch, numNodes, numNodes]

weights ComputationNode<T>

Weight matrix of shape [inputFeatures, outputFeatures]

bias ComputationNode<T>

Optional bias vector of shape [outputFeatures]

Returns

ComputationNode<T>

Output node features of shape [batch, numNodes, outputFeatures]

Remarks

This operation implements graph convolution: output = adjacency @ (input @ weights) + bias. It aggregates features from neighboring nodes according to the graph structure defined by the adjacency matrix.

Forward pass:

  1. Transform node features: X' = X @ W
  2. Aggregate via graph structure: output = A @ X'
  3. Add bias: output = output + b

Backward pass gradients:

  • ∂L/∂X = A^T @ (∂L/∂out) @ W^T
  • ∂L/∂W = X^T @ A^T @ (∂L/∂out)
  • ∂L/∂b = sum(∂L/∂out) across batch and nodes
  • ∂L/∂A = (∂L/∂out) @ (X @ W)^T

For Beginners: This operation helps neural networks learn from graph-structured data.

Think of it like spreading information through a social network:

  • Each person (node) has certain features
  • The adjacency matrix shows who is connected to whom
  • This operation lets each person's features be influenced by their connections
  • The weights control how features are transformed during this process
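
Example usage (a sketch; nodeFeatures is [batch, numNodes, inputFeatures], adjacencyTensor is [batch, numNodes, numNodes], and weightTensor is [inputFeatures, outputFeatures], all assumed Tensor<double> values):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(nodeFeatures, "x");
    var adjacency = TensorOperations<double>.Constant(adjacencyTensor);   // fixed graph structure, no gradients needed
    var w = TensorOperations<double>.Variable(weightTensor, "w");
    tape.Watch(w);

    // output = adjacency @ (input @ weights): each node aggregates its neighbours' transformed features
    var y = TensorOperations<double>.GraphConv(x, adjacency, w);
    var gradients = tape.Gradient(y, new[] { w });
}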

GridSample(ComputationNode<T>, ComputationNode<T>)

Samples input using bilinear interpolation at grid locations for spatial transformer networks.

public static ComputationNode<T> GridSample(ComputationNode<T> input, ComputationNode<T> grid)

Parameters

input ComputationNode<T>

Input tensor of shape [batch, height, width, channels]

grid ComputationNode<T>

Sampling grid of shape [batch, out_height, out_width, 2] with normalized coordinates in [-1, 1]

Returns

ComputationNode<T>

Sampled output of shape [batch, out_height, out_width, channels]

Remarks

This operation performs differentiable bilinear sampling from the input tensor using coordinates specified in the grid. Grid coordinates are in normalized [-1, 1] space where (-1, -1) is top-left and (1, 1) is bottom-right.

Forward pass:

  1. Convert normalized grid coordinates to input pixel coordinates
  2. For each sampling point, find the 4 nearest pixels
  3. Compute bilinear interpolation weights
  4. Interpolate: out = w00*v00 + w01*v01 + w10*v10 + w11*v11

Backward pass:

  • ∂L/∂input: Distribute gradients back to the 4 nearest pixels using the same weights
  • ∂L/∂grid: Compute how grid coordinates affect the sampling result

For Beginners: This samples from an image using smooth interpolation. Instead of reading exact pixels, it can sample from positions between pixels by blending nearby pixel values. This enables smooth transformations like rotation.
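
Example usage (a sketch combining AffineGrid and GridSample into a spatial transformer step; thetaTensor is [batch, 2, 3] and imageTensor is [batch, height, width, channels], both assumed Tensor<double> values):

var theta = TensorOperations<double>.Variable(thetaTensor, "theta");
var image = TensorOperations<double>.Variable(imageTensor, "image");

// Build a 32x32 sampling grid from the affine parameters: [batch, 32, 32, 2]
var grid = TensorOperations<double>.AffineGrid(theta, outputHeight: 32, outputWidth: 32);

// Bilinearly sample the input at the grid locations: [batch, 32, 32, channels]
var warped = TensorOperations<double>.GridSample(image, grid);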

GroupNorm(ComputationNode<T>, int, ComputationNode<T>?, ComputationNode<T>?, double)

Applies group normalization to a computation node.

public static ComputationNode<T> GroupNorm(ComputationNode<T> a, int numGroups, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, double epsilon = 1E-05)

Parameters

a ComputationNode<T>

The input node with shape [batch, channels, ...] where ... can be spatial dimensions.

numGroups int

The number of groups to divide channels into.

gamma ComputationNode<T>

Optional scale parameter per channel. If null, uses ones.

beta ComputationNode<T>

Optional shift parameter per channel. If null, uses zeros.

epsilon double

Small constant for numerical stability. Default is 1e-5.

Returns

ComputationNode<T>

A new computation node containing the group normalized result.

Remarks

Group normalization divides channels into groups and normalizes within each group. Unlike batch normalization, it doesn't depend on batch size, making it suitable for small batch sizes or generative models.

For Beginners: GroupNorm is an alternative to BatchNorm that works better when batch sizes are small.

For group normalization:

  • Divides channels into groups (e.g., 32 groups for 256 channels = 8 channels per group)
  • Normalizes each group independently: (x - mean) / sqrt(variance + epsilon)
  • Scales and shifts per channel: result * gamma + beta
  • Works the same during training and inference (no batch dependency)

Key advantages:

  • Works with batch size of 1 (unlike BatchNorm)
  • More stable for generative models (VAEs, GANs, diffusion models)
  • Used in modern architectures like Stable Diffusion VAE

Typical usage:

  • numGroups=32 for 256+ channels
  • numGroups=16 for 128 channels
  • numGroups=8 for 64 channels
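
Example usage (a sketch; featureMap is an assumed Tensor<double> with shape [batch, 64, H, W], and gammaTensor/betaTensor have shape [64]):

var x = TensorOperations<double>.Variable(featureMap, "x");

// 64 channels split into 8 groups of 8 channels each; gamma/beta default to ones/zeros
var normalized = TensorOperations<double>.GroupNorm(x, numGroups: 8);

// With learnable per-channel scale and shift
var gamma = TensorOperations<double>.Variable(gammaTensor, "gamma");
var beta = TensorOperations<double>.Variable(betaTensor, "beta");
var scaledAndShifted = TensorOperations<double>.GroupNorm(x, 8, gamma, beta);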

GumbelSoftmax(ComputationNode<T>, double, bool)

Applies Gumbel-Softmax for differentiable discrete sampling approximation.

public static ComputationNode<T> GumbelSoftmax(ComputationNode<T> logits, double temperature = 1, bool hard = false)

Parameters

logits ComputationNode<T>

The input logits.

temperature double

Temperature parameter controlling softness (default 1.0).

hard bool

Whether to use straight-through estimator for hard samples.

Returns

ComputationNode<T>

A computation node containing the soft/hard samples.

Remarks

Gumbel-Softmax provides a differentiable approximation to categorical sampling. As temperature approaches 0, outputs approach one-hot categorical samples. When hard=true, uses straight-through estimator for discrete outputs with gradient pass-through.

HardSigmoid(ComputationNode<T>)

Applies the Hard Sigmoid activation function element-wise: f(x) = clip((x + 3) / 6, 0, 1).

public static ComputationNode<T> HardSigmoid(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with HardSigmoid applied.

Remarks

HardSigmoid is a piecewise linear approximation of sigmoid that is computationally efficient. The gradient is 1/6 when -3 < x < 3, and 0 otherwise.

For Beginners: HardSigmoid uses straight lines instead of curves, making it faster to compute while still mapping inputs to the [0, 1] range. It's commonly used in mobile and embedded neural networks.

HardTanh(ComputationNode<T>)

Applies the Hard Tanh activation function element-wise: f(x) = clip(x, -1, 1).

public static ComputationNode<T> HardTanh(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with HardTanh applied.

Remarks

HardTanh is a piecewise linear approximation of tanh that is computationally efficient. The gradient is 1 when -1 < x < 1, and 0 otherwise.

For Beginners: HardTanh clips values to the range [-1, 1], passing through values in the middle range unchanged. It's faster than regular tanh and useful when you need bounded outputs.

HierarchicalSoftmax(ComputationNode<T>, ComputationNode<T>, int)

Applies the Hierarchical Softmax activation function for efficient large-vocabulary classification.

public static ComputationNode<T> HierarchicalSoftmax(ComputationNode<T> input, ComputationNode<T> nodeWeights, int numClasses)

Parameters

input ComputationNode<T>

The input computation node (2D: batch × inputDim).

nodeWeights ComputationNode<T>

The tree node weights (2D: treeDepth × inputDim).

numClasses int

Number of output classes.

Returns

ComputationNode<T>

A new computation node with HierarchicalSoftmax applied.

Remarks

Hierarchical Softmax organizes classes in a binary tree structure. Each node makes a binary decision using sigmoid, and the final probability is the product of probabilities along the path to each class.

Computational complexity is O(log N) instead of O(N) for standard softmax.

Gradient: Flows through sigmoid derivatives at each tree node.

HyperbolicLinear(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, double)

Hyperbolic linear transformation in the Poincare ball model.

public static ComputationNode<T> HyperbolicLinear(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? biases = null, double curvature = -1)

Parameters

input ComputationNode<T>

Input tensor [batchSize, inputFeatures].

weights ComputationNode<T>

Weight matrix in tangent space [outputFeatures, inputFeatures].

biases ComputationNode<T>

Bias points on Poincare ball [outputFeatures, inputFeatures].

curvature double

Negative curvature of hyperbolic space (default -1).

Returns

ComputationNode<T>

Output tensor [batchSize, outputFeatures] with Poincare distances.

Remarks

Performs hyperbolic linear transformation:

  1. Project input to Poincare ball
  2. For each output: exp_origin(weight) → Mobius add with input → Mobius add with bias → distance from origin

ISRU(ComputationNode<T>, double)

Applies the Inverse Square Root Unit (ISRU) activation function.

public static ComputationNode<T> ISRU(ComputationNode<T> a, double alpha = 1)

Parameters

a ComputationNode<T>

The input computation node.

alpha double

The scaling parameter (default 1.0).

Returns

ComputationNode<T>

A new computation node with ISRU applied.

Remarks

ISRU(x) = x / sqrt(1 + alpha * x²)

A smooth, bounded activation function that ranges from -1/sqrt(alpha) to 1/sqrt(alpha).

Gradient: d(ISRU)/dx = (1 + alpha * x²)^(-3/2)

LSTMCell(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)

LSTM cell forward pass.

public static (ComputationNode<T>, ComputationNode<T>) LSTMCell(ComputationNode<T> input, ComputationNode<T> hiddenState, ComputationNode<T> cellState, ComputationNode<T> weightIH, ComputationNode<T> weightHH, ComputationNode<T> bias)

Parameters

input ComputationNode<T>

Input tensor [batch, input_dim].

hiddenState ComputationNode<T>

Previous hidden state [batch, hidden_dim].

cellState ComputationNode<T>

Previous cell state [batch, hidden_dim].

weightIH ComputationNode<T>

Input-to-hidden weights [input_dim, 4*hidden_dim].

weightHH ComputationNode<T>

Hidden-to-hidden weights [hidden_dim, 4*hidden_dim].

bias ComputationNode<T>

Bias terms [4*hidden_dim].

Returns

(ComputationNode<T>, ComputationNode<T>)

Tuple of (new hidden state, new cell state).
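
Example usage (a sketch; xT0, xT1, h0, c0, weightIH, weightHH, and bias are assumed ComputationNode<double> values with the shapes listed above):

// One LSTM step: the returned tuple carries both states to the next time step
var (h1, c1) = TensorOperations<double>.LSTMCell(xT0, h0, c0, weightIH, weightHH, bias);

// Unrolling over a sequence feeds the new states back in at the next step
var (h2, c2) = TensorOperations<double>.LSTMCell(xT1, h1, c1, weightIH, weightHH, bias);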

LayerNorm(ComputationNode<T>, int[], ComputationNode<T>?, ComputationNode<T>?, double)

Applies layer normalization to a computation node.

public static ComputationNode<T> LayerNorm(ComputationNode<T> a, int[] normalizedShape, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, double epsilon = 1E-05)

Parameters

a ComputationNode<T>

The input node.

normalizedShape int[]

The shape over which to normalize (typically the feature dimensions).

gamma ComputationNode<T>

Optional scale parameter (learnable). If null, uses ones.

beta ComputationNode<T>

Optional shift parameter (learnable). If null, uses zeros.

epsilon double

Small constant for numerical stability. Default is 1e-5.

Returns

ComputationNode<T>

A new computation node containing the layer normalized result.

Remarks

Layer normalization normalizes inputs across the feature dimension for each sample independently. Formula: y = gamma * (x - mean) / sqrt(variance + epsilon) + beta. Unlike batch normalization, this doesn't depend on batch statistics.

For Beginners: LayerNorm standardizes features for each sample independently.

For layer normalization:

  • Computes mean and variance for each sample's features
  • Normalizes: (x - mean) / sqrt(variance)
  • Scales and shifts: result * gamma + beta
  • Works the same during training and inference (no batch dependency)

Used in:

  • Transformers (critical component)
  • RNNs (stabilizes training)
  • Any architecture needing sample-independent normalization
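
Example usage (a sketch; hiddenStates is an assumed Tensor<double> with shape [batch, seqLen, 768], and gammaTensor/betaTensor have shape [768]):

var x = TensorOperations<double>.Variable(hiddenStates, "x");
var gamma = TensorOperations<double>.Variable(gammaTensor, "gamma");
var beta = TensorOperations<double>.Variable(betaTensor, "beta");

// Normalize over the last (feature) dimension, as in a transformer block
var normalized = TensorOperations<double>.LayerNorm(x, new[] { 768 }, gamma, beta);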

LeakyReLU(ComputationNode<T>, double)

Applies the Leaky Rectified Linear Unit (LeakyReLU) activation function.

public static ComputationNode<T> LeakyReLU(ComputationNode<T> a, double alpha = 0.01)

Parameters

a ComputationNode<T>

The input computation node.

alpha double

The slope for negative values. Default is 0.01.

Returns

ComputationNode<T>

A new computation node with LeakyReLU applied.

Remarks

LeakyReLU(x) = x if x > 0, alpha * x otherwise. Unlike ReLU, LeakyReLU allows a small gradient for negative inputs, preventing dying neurons.

Gradient: d(LeakyReLU)/dx = 1 if x > 0, alpha otherwise.

LeakyStateUpdate(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, double)

Leaky state update for reservoir/echo state networks.

public static ComputationNode<T> LeakyStateUpdate(ComputationNode<T> prevState, ComputationNode<T> input, ComputationNode<T> weights, double leakingRate = 1)

Parameters

prevState ComputationNode<T>

Previous hidden state.

input ComputationNode<T>

Current input.

weights ComputationNode<T>

Reservoir weight matrix (can be frozen).

leakingRate double

Leaking rate (default 1.0 for full update).

Returns

ComputationNode<T>

New hidden state.

Remarks

Computes: new_state = (1 - leakingRate) * prevState + leakingRate * tanh(weights @ prevState + input)

LiSHT(ComputationNode<T>)

Applies the LiSHT (Linearly Scaled Hyperbolic Tangent) activation function element-wise.

public static ComputationNode<T> LiSHT(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with LiSHT applied.

Remarks

LiSHT is defined as: f(x) = x * tanh(x)

The gradient is: tanh(x) + x * (1 - tanh²(x))

For Beginners: LiSHT combines the input with its tanh, creating a smooth activation that preserves sign and helps prevent vanishing gradients.

LocallyConnectedConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?)

Performs locally connected 2D convolution where weights are NOT shared across spatial locations.

public static ComputationNode<T> LocallyConnectedConv2D(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? bias = null, int[]? stride = null)

Parameters

input ComputationNode<T>

Input tensor of shape [batch, in_channels, height, width]

weights ComputationNode<T>

Weight tensor of shape [out_h, out_w, out_channels, in_channels, kernel_h, kernel_w]

bias ComputationNode<T>

Optional bias tensor of shape [out_channels]

stride int[]

Stride for the convolution, defaults to [1, 1]

Returns

ComputationNode<T>

Output tensor of shape [batch, out_channels, out_h, out_w]

Remarks

Locally connected convolution is like regular convolution but uses different weights for each spatial output location. This increases parameters but allows position-specific feature detection.

Unlike Conv2D where weights are shared across all positions, LocallyConnectedConv2D uses unique weights for each (h,w) output position. This is useful when different regions have fundamentally different characteristics (e.g., face recognition where eyes/nose/mouth are at specific locations).

Forward pass applies position-specific filters at each output location. Backward pass computes gradients with respect to input, position-specific weights, and bias.

Log(ComputationNode<T>)

Computes the natural logarithm for a computation node.

public static ComputationNode<T> Log(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the logarithm result.

Remarks

This method computes the natural logarithm of each element and records the operation. The backward function uses: ∂(log(a))/∂a = 1/a.

For Beginners: This computes the natural log and tracks gradients.

For logarithm (c = log(a)):

  • The forward pass computes log for each element
  • The backward pass uses: gradient to 'a' is incoming gradient * (1/a)

Logarithms are used in loss functions like cross-entropy.

LogSoftmax(ComputationNode<T>, int)

Applies the Log-Softmax function for numerically stable cross-entropy loss computation.

public static ComputationNode<T> LogSoftmax(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node.

axis int

The axis along which to compute log-softmax (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with Log-Softmax applied.

Remarks

LogSoftmax(x) = log(softmax(x)) = x - log(sum(exp(x)))

More numerically stable than computing log(softmax(x)) separately.

Gradient: dL/dx_i = dL/dy_i - softmax(x)_i * sum_j(dL/dy_j) where y = LogSoftmax(x) and dL/dy is the incoming gradient.

LogSoftmin(ComputationNode<T>, int)

Applies the Log-Softmin function for numerically stable computation.

public static ComputationNode<T> LogSoftmin(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node.

axis int

The axis along which to compute log-softmin (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with Log-Softmin applied.

Remarks

LogSoftmin(x) = log(softmin(x)) = -x - log(sum(exp(-x)))

Combines log and softmin for numerical stability.

MatrixMultiply(ComputationNode<T>, ComputationNode<T>)

Performs matrix multiplication on two computation nodes.

public static ComputationNode<T> MatrixMultiply(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The left matrix (must be 2D).

b ComputationNode<T>

The right matrix (must be 2D).

Returns

ComputationNode<T>

A computation node representing the matrix product.

Remarks

Computes C = A·B where A has shape [m, n] and B has shape [n, p], resulting in C with shape [m, p].

Gradient computation:

  • ∂(A·B)/∂A = gradOut·B^T
  • ∂(A·B)/∂B = A^T·gradOut
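
Example usage (a sketch; aMatrix and bMatrix are assumed Tensor<double> values with shapes [m, n] and [n, p]):

using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aMatrix, "a");
    var b = TensorOperations<double>.Variable(bMatrix, "b");
    tape.Watch(a);
    tape.Watch(b);

    // c has shape [m, p]
    var c = TensorOperations<double>.MatrixMultiply(a, b);

    // The gradient for 'a' has shape [m, n] (gradOut·B^T); the gradient for 'b' has shape [n, p] (A^T·gradOut)
    var gradients = tape.Gradient(c, new[] { a, b });
}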

MatrixVectorMultiply(ComputationNode<T>, ComputationNode<T>)

Performs a matrix-vector multiplication (2D x 1D) by reshaping the vector into a column matrix.

public static ComputationNode<T> MatrixVectorMultiply(ComputationNode<T> matrix, ComputationNode<T> vector)

Parameters

matrix ComputationNode<T>

The left matrix (must be 2D).

vector ComputationNode<T>

The right vector (must be 1D).

Returns

ComputationNode<T>

A computation node representing the vector result.

MaxPool2D(ComputationNode<T>, int[], int[]?)

Performs 2D max pooling on a 4D tensor (batch, channels, height, width).

public static ComputationNode<T> MaxPool2D(ComputationNode<T> a, int[] poolSize, int[]? strides = null)

Parameters

a ComputationNode<T>

The input node with shape [batch, channels, height, width].

poolSize int[]

The size of the pooling window [poolH, poolW].

strides int[]

The stride for the pooling operation [strideH, strideW]. If null, uses poolSize.

Returns

ComputationNode<T>

A new computation node containing the max pooled result.

Remarks

This method performs max pooling over 2D spatial dimensions. During forward pass, it tracks which element was the max for routing gradients during backward pass.

For Beginners: MaxPool downsamples by taking the maximum value in each window.

For max pooling:

  • The forward pass slides a window and takes the max value in each position
  • This reduces spatial dimensions (downsampling)
  • The backward pass routes gradients only to the positions that were max
  • Other positions get zero gradient (they didn't contribute to the output)

Used in:

  • CNNs for translation invariance
  • Reducing spatial resolution
  • Building hierarchical features

MaxPool3D(ComputationNode<T>, int[], int[]?)

Performs 3D max pooling on a 5D tensor (batch, channels, depth, height, width).

public static ComputationNode<T> MaxPool3D(ComputationNode<T> input, int[] poolSize, int[]? strides = null)

Parameters

input ComputationNode<T>

The input node with shape [batch, channels, depth, height, width].

poolSize int[]

The size of the pooling window [poolD, poolH, poolW].

strides int[]

The stride for the pooling operation [strideD, strideH, strideW]. If null, uses poolSize.

Returns

ComputationNode<T>

A new computation node containing the max pooled result.

Remarks

This method performs max pooling over 3D spatial dimensions (depth, height, width). The backward function routes gradients only to the maximum values in each pooling window.

For Beginners: MaxPool3D downsamples volumetric data by taking the maximum value in each window.

For max pooling:

  • The forward pass slides a 3D window and takes the maximum
  • This reduces the spatial dimensions while preserving the strongest activations
  • The backward pass routes gradients only to where the max came from
  • Non-max elements get zero gradient

Used in:

  • Voxel-based 3D CNNs for shape classification
  • Medical image analysis (CT/MRI)
  • Video processing

Maxout(ComputationNode<T>, int)

Applies the Maxout activation function which takes maximum over groups of inputs.

public static ComputationNode<T> Maxout(ComputationNode<T> a, int numPieces = 2)

Parameters

a ComputationNode<T>

The input computation node (2D: batch × features).

numPieces int

Number of inputs per group (default 2).

Returns

ComputationNode<T>

A new computation node with Maxout applied.

Remarks

Maxout groups consecutive features and outputs the maximum from each group. Input features must be divisible by numPieces. Output shape: [batch, features / numPieces].

Gradient: Flows only to the maximum element in each group (sparse gradient).

Mean(ComputationNode<T>)

Computes the mean of elements in a computation node.

public static ComputationNode<T> Mean(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The computation node to compute mean of.

Returns

ComputationNode<T>

A computation node representing the mean (scalar).

Remarks

Computes the average of all elements in the tensor.

Gradient computation:

  • ∂(mean(A))/∂A = gradOut / count
  • Each element gets an equal share of the gradient, divided by the total count.

Mish(ComputationNode<T>)

Applies the Mish activation function.

public static ComputationNode<T> Mish(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with Mish applied.

Remarks

Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))

Mish is a smooth, self-regularizing activation function.

Gradient: d(Mish)/dx = tanh(softplus(x)) + x * sech²(softplus(x)) * sigmoid(x)

MobiusAdd(ComputationNode<T>, ComputationNode<T>, double)

Mobius addition in the Poincare ball model.

public static ComputationNode<T> MobiusAdd(ComputationNode<T> x, ComputationNode<T> y, double curvature = -1)

Parameters

x ComputationNode<T>

First point tensor [batchSize, dim] or [dim].

y ComputationNode<T>

Second point tensor with same shape as x.

curvature double

Negative curvature of hyperbolic space (default -1).

Returns

ComputationNode<T>

Result of Mobius addition x ⊕ y.

Remarks

Mobius addition is the hyperbolic analog of vector addition: x ⊕ y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)

MultiHeadAttention(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, int, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)

Applies multi-head attention mechanism.

public static ComputationNode<T> MultiHeadAttention(ComputationNode<T> query, ComputationNode<T> key, ComputationNode<T> value, int numHeads, ComputationNode<T> wQ, ComputationNode<T> wK, ComputationNode<T> wV, ComputationNode<T> wO)

Parameters

query ComputationNode<T>

Query tensor.

key ComputationNode<T>

Key tensor.

value ComputationNode<T>

Value tensor.

numHeads int

Number of attention heads.

wQ ComputationNode<T>

Query projection weights.

wK ComputationNode<T>

Key projection weights.

wV ComputationNode<T>

Value projection weights.

wO ComputationNode<T>

Output projection weights.

Returns

ComputationNode<T>

Multi-head attention output.

Negate(ComputationNode<T>)

Negates a computation node (computes -a).

public static ComputationNode<T> Negate(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the negated result.

Remarks

This method negates each element and records the operation. The backward function simply negates the incoming gradient.

For Beginners: This flips the sign of each element.

For negation (c = -a):

  • The forward pass flips signs (positive becomes negative, vice versa)
  • The backward pass also flips the gradient sign

Norm(ComputationNode<T>, int, bool, double)

Computes the L2 norm along a specified axis.

public static ComputationNode<T> Norm(ComputationNode<T> a, int axis = -1, bool keepDims = false, double epsilon = 1E-12)

Parameters

a ComputationNode<T>

The input node.

axis int

The axis along which to compute the norm. Default is -1 (last axis).

keepDims bool

Whether to keep the reduced dimensions. Default is false.

epsilon double

Small value for numerical stability. Default is 1e-12.

Returns

ComputationNode<T>

A new computation node containing the norm along the specified axis.

Remarks

This method computes the L2 (Euclidean) norm: sqrt(sum(x²)) along the specified axis. The gradient is computed as: ∂||x||/∂x = x / ||x||.

For Beginners: The norm measures the "length" of vectors.

For example, with axis=-1:

  • Input shape: [batch, features]
  • Output shape: [batch] (or [batch, 1] with keepDims=True)
  • Each output value is sqrt(sum of squares along that row)

This is commonly used in capsule networks to compute capsule lengths, and in normalization operations.
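
Example usage (a minimal sketch; capsuleTensor is assumed to be an existing Tensor<double> with shape [batch, features]):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(capsuleTensor, "capsules");
    tape.Watch(x);

    // L2 norm of each row: output shape [batch]
    var lengths = TensorOperations<double>.Norm(x, axis: -1);

    var loss = TensorOperations<double>.Mean(lengths);
    var gradients = tape.Gradient(loss, new[] { x }); // each row's gradient points along x / ||x||
}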

OctonionMatMul(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)

Performs octonion matrix multiplication for OctonionLinearLayer.

public static ComputationNode<T> OctonionMatMul(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? biases = null)

Parameters

input ComputationNode<T>

Input tensor with shape [batch, inputFeatures * 8] where each group of 8 represents an octonion.

weights ComputationNode<T>

Weight tensor with shape [outputFeatures, inputFeatures, 8] where last dimension is octonion components.

biases ComputationNode<T>

Optional bias tensor with shape [outputFeatures, 8].

Returns

ComputationNode<T>

Output tensor with shape [batch, outputFeatures * 8].

Remarks

Octonions are 8-dimensional numbers that generalize quaternions. They are non-associative but can capture more complex relationships in data. This operation performs: output[b, o] = sum_i(input[b, i] * weights[o, i]) + biases[o] where * is octonion multiplication.

For Beginners: This is like matrix multiplication but using 8-dimensional octonion numbers instead of regular numbers. Each octonion has 8 components: (scalar, e1, e2, e3, e4, e5, e6, e7).

PReLU(ComputationNode<T>, double)

Applies the Parametric Rectified Linear Unit (PReLU) activation function.

public static ComputationNode<T> PReLU(ComputationNode<T> a, double alpha = 0.01)

Parameters

a ComputationNode<T>

The input computation node.

alpha double

The slope for negative values (default 0.01).

Returns

ComputationNode<T>

A new computation node with PReLU applied.

Remarks

PReLU(x) = x if x > 0, alpha * x otherwise. Similar to LeakyReLU but alpha is typically learned during training.

Gradient: d(PReLU)/dx = 1 if x > 0, alpha otherwise.

Pad(ComputationNode<T>, int[,], T?)

Pads a tensor with a constant value along specified dimensions.

public static ComputationNode<T> Pad(ComputationNode<T> a, int[,] padWidth, T? value = default)

Parameters

a ComputationNode<T>

The input node.

padWidth int[,]

Padding width for each dimension as (before, after) pairs.

value T

The value to use for padding. Default is zero.

Returns

ComputationNode<T>

A new computation node containing the padded result.

Remarks

This method adds padding around the tensor. The backward function simply crops the gradient back to the original size (gradients for padding are zero).

For Beginners: Pad adds extra elements around a tensor.

For padding:

  • The forward pass adds border elements with a constant value
  • The backward pass removes those border gradients (they don't affect the original tensor)
  • Think of it like adding margins to an image

Used in:

  • Convolutional layers (to maintain spatial dimensions)
  • Handling variable-length sequences
  • Data augmentation

Pad(ComputationNode<T>, int[])

Pads a tensor with zeros along specified dimensions.

public static ComputationNode<T> Pad(ComputationNode<T> a, int[] padding)

Parameters

a ComputationNode<T>

The input computation node to pad.

padding int[]

Array specifying padding amount for each dimension (applied symmetrically on both sides).

Returns

ComputationNode<T>

A new computation node containing the padded tensor.

Remarks

This method pads the input tensor by adding zeros around each dimension. The padding array specifies how many zeros to add on BOTH sides of each dimension. For example, padding[1] = 2 means add 2 zeros on the left AND 2 zeros on the right of dimension 1.

The backward function for padding simply extracts the non-padded region from the output gradient, since ∂(pad(x))/∂x is an extraction operation that removes the padded regions.

For Beginners: Padding adds a border of zeros around your data.

For padding (output = pad(input, [p0, p1, ...])):

  • The forward pass creates a larger tensor and copies input to the center
  • Padding p on dimension d means: add p zeros on left, p zeros on right
  • The backward pass extracts the center region from the gradient (removes the padding)

This is commonly used in convolutional neural networks to preserve spatial dimensions.

Permute(ComputationNode<T>, params int[])

Permutes the dimensions of a computation node (general transpose).

public static ComputationNode<T> Permute(ComputationNode<T> a, params int[] axes)

Parameters

a ComputationNode<T>

The computation node to permute.

axes int[]

The new order of dimensions.

Returns

ComputationNode<T>

A computation node with permuted dimensions.

Remarks

Rearranges dimensions according to the axes array. Equivalent to Transpose but for N dimensions.

Gradient computation: ∂(Permute(A))/∂A = Permute(gradOut, inverseAxes)

PixelShuffle(ComputationNode<T>, int)

Performs pixel shuffle (depth-to-space) operation for sub-pixel convolution.

public static ComputationNode<T> PixelShuffle(ComputationNode<T> a, int upscaleFactor)

Parameters

a ComputationNode<T>

The input computation node with shape [batch, channels, height, width].

upscaleFactor int

The upscaling factor (r). Channels must be divisible by r².

Returns

ComputationNode<T>

A computation node with shape [batch, channels/(r²), height*r, width*r].

PoincareDistance(ComputationNode<T>, ComputationNode<T>, double)

Computes the Poincare ball distance between two points.

public static ComputationNode<T> PoincareDistance(ComputationNode<T> x, ComputationNode<T> y, double curvature = -1)

Parameters

x ComputationNode<T>

First point tensor [batchSize, dim] or [dim].

y ComputationNode<T>

Second point tensor with same shape as x.

curvature double

Negative curvature of hyperbolic space (default -1).

Returns

ComputationNode<T>

Distance tensor [batchSize] or scalar.

Remarks

The Poincare distance between points x and y is: d(x, y) = (2/sqrt(c)) * arctanh(sqrt(c) || -x ⊕ y ||)

PoincareExpMap(ComputationNode<T>, ComputationNode<T>, double)

Poincare ball exponential map from tangent space at a point.

public static ComputationNode<T> PoincareExpMap(ComputationNode<T> point, ComputationNode<T> tangent, double curvature = -1)

Parameters

point ComputationNode<T>

Base point on the Poincare ball [batchSize, dim] or [dim].

tangent ComputationNode<T>

Tangent vector at the point with same shape.

curvature double

Negative curvature of hyperbolic space (default -1).

Returns

ComputationNode<T>

Point on manifold after following geodesic.

Remarks

The exponential map takes a tangent vector at point p and returns the point reached by following the geodesic in that direction: exp_p(v) = p ⊕ (tanh(sqrt(c)||v||_p / 2) * v / (sqrt(c)||v||)) where ||v||_p = ||v|| * 2 / (1 - c||p||²) is the Poincare norm.

PoincareLogMap(ComputationNode<T>, ComputationNode<T>, double)

Poincare ball logarithmic map to tangent space at a point.

public static ComputationNode<T> PoincareLogMap(ComputationNode<T> point, ComputationNode<T> target, double curvature = -1)

Parameters

point ComputationNode<T>

Base point on the Poincare ball [batchSize, dim] or [dim].

target ComputationNode<T>

Target point on the Poincare ball with same shape.

curvature double

Negative curvature of hyperbolic space (default -1).

Returns

ComputationNode<T>

Tangent vector at point pointing towards target.

Remarks

The logarithmic map is the inverse of the exponential map: log_p(q) = (2 / (sqrt(c) * lambda_p)) * arctanh(sqrt(c) || -p ⊕ q ||) * (-p ⊕ q) / || -p ⊕ q ||

PoincareProject(ComputationNode<T>, double, double)

Projects a point onto the Poincare ball to ensure it stays inside the unit ball.

public static ComputationNode<T> PoincareProject(ComputationNode<T> point, double curvature = -1, double epsilon = 1E-05)

Parameters

point ComputationNode<T>

Input point tensor [batchSize, dim] or [dim].

curvature double

Negative curvature of hyperbolic space (default -1).

epsilon double

Small value for numerical stability.

Returns

ComputationNode<T>

Projected point on the Poincare ball.

Remarks

Projects points that are outside or on the boundary of the Poincare ball back inside by scaling to have norm slightly less than 1/sqrt(|c|).

Power(ComputationNode<T>, double)

Raises a computation node to a power.

public static ComputationNode<T> Power(ComputationNode<T> a, double exponent)

Parameters

a ComputationNode<T>

The base node.

exponent double

The exponent value.

Returns

ComputationNode<T>

A new computation node containing the power operation result.

Remarks

This method raises each element to a power and records the operation. The backward function uses the power rule: ∂(a^n)/∂a = n * a^(n-1).

For Beginners: This raises a tensor to a power and tracks gradients.

For power operation (c = a^n):

  • The forward pass raises each element to the power
  • The backward pass uses the power rule from calculus
  • Gradient to 'a' is: incoming gradient * n * a^(n-1)

Example: If a = [2, 3] and n = 2, then c = [4, 9]. If the gradient to c is [1, 1]:

  • 'a' receives [1 * 2 * 2^1, 1 * 2 * 3^1] = [4, 6]
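
Example usage (a minimal sketch; inputTensor is assumed to be an existing Tensor<double>):

using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(inputTensor, "a");
    tape.Watch(a);

    var cubed = TensorOperations<double>.Power(a, 3.0); // a^3 element-wise

    var loss = TensorOperations<double>.Sum(cubed);
    var gradients = tape.Gradient(loss, new[] { a });   // each element receives 3 * a^2
}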

RBFKernel(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)

Computes Gaussian Radial Basis Function (RBF) kernel activations.

public static ComputationNode<T> RBFKernel(ComputationNode<T> input, ComputationNode<T> centers, ComputationNode<T> epsilons)

Parameters

input ComputationNode<T>

Input tensor of shape [batch, inputSize]

centers ComputationNode<T>

Center points tensor of shape [numCenters, inputSize]

epsilons ComputationNode<T>

Width parameters tensor of shape [numCenters]

Returns

ComputationNode<T>

Output tensor of shape [batch, numCenters] containing RBF activations

Remarks

This operation implements the Gaussian RBF: f(r) = exp(-epsilon * r²) where r is the Euclidean distance between input and center.

Forward pass: For each input and center pair, computes:

  1. distance = sqrt(sum((input - center)²))
  2. output = exp(-epsilon * distance²)

Backward pass gradients:

  • ∂L/∂input = ∂L/∂output * (-2 * epsilon * distance) * (input - center) / distance
  • ∂L/∂centers = -∂L/∂input (opposite direction)
  • ∂L/∂epsilon = ∂L/∂output * (-distance²) * output

For Beginners: This operation creates "similarity scores" between inputs and centers. Each RBF neuron responds strongly (value near 1) when input is close to its center, and weakly (value near 0) when far away. The epsilon parameter controls how quickly the response decreases with distance.
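
Example usage (a minimal sketch; inputTensor, centerTensor, and epsilonTensor are assumed to be existing Tensor<double> values with shapes [batch, inputSize], [numCenters, inputSize], and [numCenters]):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var centers = TensorOperations<double>.Variable(centerTensor, "centers");
    var widths = TensorOperations<double>.Variable(epsilonTensor, "epsilons");
    tape.Watch(centers);
    tape.Watch(widths);

    // Similarity of each input to each center: output shape [batch, numCenters]
    var activations = TensorOperations<double>.RBFKernel(x, centers, widths);

    var loss = TensorOperations<double>.Mean(activations);
    var gradients = tape.Gradient(loss, new[] { centers, widths });
}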

RReLU(ComputationNode<T>, double, double, bool, int?)

Applies the Randomized Leaky ReLU (RReLU) activation function.

public static ComputationNode<T> RReLU(ComputationNode<T> a, double lower = 0.125, double upper = 0.333, bool isTraining = false, int? seed = null)

Parameters

a ComputationNode<T>

The input computation node.

lower double

Lower bound for alpha (default 1/8).

upper double

Upper bound for alpha (default 1/3).

isTraining bool

If true, samples random alpha; if false, uses average (default false for JIT).

seed int?

Optional random seed for reproducibility.

Returns

ComputationNode<T>

A new computation node with RReLU applied.

Remarks

RReLU(x) = x if x >= 0, alpha * x otherwise. During training, alpha is sampled uniformly from [lower, upper]. During inference (JIT default), alpha = (lower + upper) / 2.

Gradient: 1 for x >= 0, alpha for x < 0.

ReLU(ComputationNode<T>)

Computes the ReLU (Rectified Linear Unit) activation for a computation node.

public static ComputationNode<T> ReLU(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the ReLU result.

Remarks

This method computes ReLU (max(0, x)) and records the operation. The backward function uses: ∂ReLU(a)/∂a = 1 if a > 0, else 0.

For Beginners: ReLU is the most popular activation function in deep learning.

For ReLU (c = max(0, a)):

  • The forward pass keeps positive values, zeros out negative values
  • The backward pass: gradient flows through if input was positive, blocked if negative

ReLU is popular because:

  • Very fast to compute
  • Helps avoid vanishing gradients
  • Works well in practice for deep networks
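
Example usage (a minimal sketch; preActivationTensor is assumed to be an existing Tensor<double>):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(preActivationTensor, "x");
    tape.Watch(x);

    var activated = TensorOperations<double>.ReLU(x);  // negative values become 0

    var loss = TensorOperations<double>.Sum(activated);
    var gradients = tape.Gradient(loss, new[] { x });  // 1 where x > 0, 0 elsewhere
}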

ReduceLogVariance(ComputationNode<T>, int, double)

Computes the natural logarithm of variance along the specified axis.

public static ComputationNode<T> ReduceLogVariance(ComputationNode<T> input, int axis, double epsilon = 1E-08)

Parameters

input ComputationNode<T>

Input tensor of any shape

axis int

The axis along which to compute variance (must be specified)

epsilon double

Small constant for numerical stability (default: 1e-8)

Returns

ComputationNode<T>

Tensor with reduced shape containing log-variance values

Remarks

This operation computes log(variance + epsilon) along the specified axis. The output shape has the specified axis dimension removed from the input shape.

Forward pass: log(variance + epsilon) where variance = mean((x - mean(x))^2)

Backward pass uses chain rule: ∂L/∂x_i = ∂L/∂log_var * (1/variance) * (2/N) * (x_i - mean) where N is the size of the reduction axis.

For Beginners: This operation measures how spread out values are along an axis, then takes the logarithm. Commonly used in variational autoencoders and uncertainty estimation.

ReduceMax(ComputationNode<T>, int[]?, bool)

Reduces a tensor by computing the maximum value along specified axes.

public static ComputationNode<T> ReduceMax(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)

Parameters

a ComputationNode<T>

The input computation node.

axes int[]

The axes along which to compute the maximum. If null, reduces over all axes.

keepDims bool

Whether to keep the reduced dimensions with size 1.

Returns

ComputationNode<T>

A computation node representing the result of the reduce max operation.

ReduceMean(ComputationNode<T>, int[]?, bool)

Reduces a tensor by computing the mean value along specified axes.

public static ComputationNode<T> ReduceMean(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)

Parameters

a ComputationNode<T>

The input computation node.

axes int[]

The axes along which to compute the mean. If null, reduces over all axes.

keepDims bool

Whether to keep the reduced dimensions with size 1.

Returns

ComputationNode<T>

A computation node representing the result of the reduce mean operation.

Reshape(ComputationNode<T>, params int[])

Reshapes a computation node to a new shape.

public static ComputationNode<T> Reshape(ComputationNode<T> a, params int[] newShape)

Parameters

a ComputationNode<T>

The computation node to reshape.

newShape int[]

The new shape (must have same total number of elements).

Returns

ComputationNode<T>

A computation node with the new shape.

Remarks

Changes the shape of the tensor without changing the underlying data. The total number of elements must remain the same.

Gradient computation:

  • ∂(Reshape(A))/∂A = Reshape(gradOut, A.Shape)
  • Simply reshape the gradient back to the original shape.
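
Example usage (a minimal sketch; imageTensor is assumed to be an existing Tensor<double> with shape [32, 3, 28, 28]):

// Flatten feature maps before a fully connected layer: [32, 3, 28, 28] -> [32, 2352]
var x = TensorOperations<double>.Variable(imageTensor, "x");
var flat = TensorOperations<double>.Reshape(x, 32, 3 * 28 * 28);
// During backpropagation, the gradient of any downstream loss is reshaped back to [32, 3, 28, 28].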

SELU(ComputationNode<T>)

Applies the SELU (Scaled Exponential Linear Unit) activation function element-wise.

public static ComputationNode<T> SELU(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with SELU applied.

Remarks

SELU is defined as: λ * x if x > 0, otherwise λ * α * (e^x - 1), where λ ≈ 1.0507 and α ≈ 1.6733 are fixed constants chosen for self-normalization. The gradient is: λ if x > 0, otherwise λ * α * e^x.

For Beginners: SELU enables self-normalizing neural networks where activations converge to zero mean and unit variance, reducing the need for batch normalization.

SQRBF(ComputationNode<T>, double)

Applies the Squared Radial Basis Function (SQRBF) activation.

public static ComputationNode<T> SQRBF(ComputationNode<T> a, double beta = 1)

Parameters

a ComputationNode<T>

The input computation node.

beta double

The width parameter controlling the Gaussian bell curve (default 1.0).

Returns

ComputationNode<T>

A new computation node with SQRBF applied.

Remarks

SQRBF(x) = exp(-β * x²). A Gaussian bell-shaped activation with maximum at x = 0 and values approaching 0 as |x| increases.

Gradient: d(SQRBF)/dx = -2βx * exp(-β * x²)

ScaledDotProductAttention(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)

Computes scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V.

public static ComputationNode<T> ScaledDotProductAttention(ComputationNode<T> query, ComputationNode<T> key, ComputationNode<T> value, ComputationNode<T>? mask = null)

Parameters

query ComputationNode<T>

Query tensor [batch, seq_len_q, d_k].

key ComputationNode<T>

Key tensor [batch, seq_len_k, d_k].

value ComputationNode<T>

Value tensor [batch, seq_len_k, d_v].

mask ComputationNode<T>

Optional attention mask.

Returns

ComputationNode<T>

Attention output [batch, seq_len_q, d_v].

ScaledTanh(ComputationNode<T>, double)

Applies the Scaled Tanh activation function element-wise.

public static ComputationNode<T> ScaledTanh(ComputationNode<T> a, double beta = 1)

Parameters

a ComputationNode<T>

The input computation node.

beta double

The steepness parameter. Default is 1.0.

Returns

ComputationNode<T>

A new computation node with ScaledTanh applied.

Remarks

ScaledTanh is defined as: f(x) = (1 - exp(-βx)) / (1 + exp(-βx)). The gradient is: (β / 2) * (1 - f(x)²). When β = 2, this equals standard tanh.

For Beginners: ScaledTanh allows you to control the steepness of the tanh curve, which can be useful for tuning network behavior.

Sigmoid(ComputationNode<T>)

Computes the sigmoid function for a computation node.

public static ComputationNode<T> Sigmoid(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the sigmoid result.

Remarks

This method computes sigmoid (σ(x) = 1/(1+e^(-x))) and records the operation. The backward function uses: ∂σ(a)/∂a = σ(a) * (1 - σ(a)).

For Beginners: Sigmoid squashes values to be between 0 and 1.

For sigmoid (c = σ(a)):

  • The forward pass computes 1/(1+e^(-x)) for each element
  • The backward pass: gradient to 'a' is incoming gradient * output * (1 - output)

Sigmoid is used in binary classification and as a gate in LSTM networks.

Sign(ComputationNode<T>, double)

Computes the element-wise sign of a computation node, using a smooth surrogate gradient for the backward pass.

public static ComputationNode<T> Sign(ComputationNode<T> a, double surrogateBeta = 1)

Parameters

a ComputationNode<T>

The input node.

surrogateBeta double

Sharpness of the surrogate gradient (default 1.0).

Returns

ComputationNode<T>

A new computation node containing the sign of each element.

SinusoidalTimeEmbedding(ComputationNode<T>, int)

Creates sinusoidal time embeddings for diffusion models.

public static ComputationNode<T> SinusoidalTimeEmbedding(ComputationNode<T> timesteps, int embeddingDim)

Parameters

timesteps ComputationNode<T>

The timesteps to embed [batchSize] or [batchSize, 1].

embeddingDim int

The dimension of the output embeddings.

Returns

ComputationNode<T>

A computation node with sinusoidal embeddings [batchSize, embeddingDim].

Slice(ComputationNode<T>, int, int, int, int)

Extracts a slice from a tensor along a specified axis.

public static ComputationNode<T> Slice(ComputationNode<T> a, int start, int length, int step = 1, int axis = 0)

Parameters

a ComputationNode<T>

The input tensor to slice.

start int

The starting index along the specified axis.

length int

The number of elements to extract.

step int

The step size between elements (default 1).

axis int

The axis along which to slice (default 0).

Returns

ComputationNode<T>

A new computation node containing the sliced tensor.

Remarks

This operation extracts a portion of a tensor along a specified axis, starting at a given offset and continuing for a specified length. An optional step parameter allows for strided slicing (e.g., every 2nd element).

For Beginners: Think of this like taking a substring from a string.

For example, if you have a tensor [1, 2, 3, 4, 5, 6] and you slice with start=1, length=3:

  • You get [2, 3, 4]

With step=2 and start=0, length=3:

  • You get [1, 3, 5] (every 2nd element)

This is useful for extracting specific parts of data, like separating real and imaginary parts of complex numbers stored in interleaved format.
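
Example usage (a minimal sketch; interleavedTensor is assumed to be an existing Tensor<double> with 8 entries along axis 1, storing real and imaginary parts interleaved):

var x = TensorOperations<double>.Variable(interleavedTensor, "x");

// Every 2nd element along axis 1, starting at index 0 (real parts) and index 1 (imaginary parts)
var realPart = TensorOperations<double>.Slice(x, start: 0, length: 4, step: 2, axis: 1);
var imagPart = TensorOperations<double>.Slice(x, start: 1, length: 4, step: 2, axis: 1);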

SoftKNN(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, T?)

Performs a soft K-Nearest Neighbors operation for differentiable instance-based learning.

public static ComputationNode<T> SoftKNN(ComputationNode<T> input, ComputationNode<T> supportVectors, ComputationNode<T> labels, T? temperature = default)

Parameters

input ComputationNode<T>

The query input tensor.

supportVectors ComputationNode<T>

Matrix of support vectors (training points) [n_samples, n_features].

labels ComputationNode<T>

Labels for each support vector [n_samples] or [n_samples, n_outputs].

temperature T

Temperature for softmax attention (default: 1.0).

Returns

ComputationNode<T>

Attention-weighted sum of labels.

Remarks

Computes:

  • distances[i] = ||input - supportVectors[i]||²
  • weights = softmax(-distances / temperature)
  • output = Σ weights[i] * labels[i]

For Beginners: Instead of finding exactly k nearest neighbors, this computes attention weights for ALL neighbors based on distance. Closer neighbors get higher attention. This makes KNN differentiable and JIT-compilable.

SoftLocallyWeighted(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, T?)

Performs soft locally-weighted regression for differentiable instance-based learning.

public static ComputationNode<T> SoftLocallyWeighted(ComputationNode<T> input, ComputationNode<T> xTrain, ComputationNode<T> yTrain, T? bandwidth = default)

Parameters

input ComputationNode<T>

The query input tensor.

xTrain ComputationNode<T>

Training feature matrix [n_samples, n_features].

yTrain ComputationNode<T>

Training target values [n_samples] or [n_samples, n_outputs].

bandwidth T

Bandwidth parameter controlling locality (default: 1.0).

Returns

ComputationNode<T>

Attention-weighted prediction.

Remarks

Computes:

  • distances[i] = ||input - xTrain[i]||²
  • weights = softmax(-distances / bandwidth)
  • output = Σ weights[i] * yTrain[i]

For Beginners: This is similar to SoftKNN but specifically designed for regression with a bandwidth parameter that controls how local the weighting is. Smaller bandwidth = more local predictions.

SoftPlus(ComputationNode<T>)

Applies the SoftPlus activation function element-wise: f(x) = ln(1 + e^x).

public static ComputationNode<T> SoftPlus(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with SoftPlus applied.

Remarks

SoftPlus is a smooth approximation of ReLU. The gradient is the sigmoid function: d(SoftPlus)/dx = sigmoid(x) = 1 / (1 + e^(-x))

For Beginners: SoftPlus smoothly approaches 0 for negative inputs and approaches the input value for large positive inputs, similar to ReLU but without the sharp corner at x=0.

SoftSign(ComputationNode<T>)

Applies the SoftSign activation function element-wise: f(x) = x / (1 + |x|).

public static ComputationNode<T> SoftSign(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with SoftSign applied.

Remarks

SoftSign is an alternative to tanh with polynomial tails that approach ±1 more slowly. The gradient is: d(SoftSign)/dx = 1 / (1 + |x|)²

For Beginners: SoftSign maps inputs to (-1, 1) like tanh, but with a different shape. The slower saturation can help prevent vanishing gradients in deep networks.

SoftSplit(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, int, T, T?)

Performs a soft split operation for differentiable decision trees.

public static ComputationNode<T> SoftSplit(ComputationNode<T> input, ComputationNode<T> leftValue, ComputationNode<T> rightValue, int featureIndex, T threshold, T? temperature = default)

Parameters

input ComputationNode<T>

The input features tensor.

leftValue ComputationNode<T>

The value to return if going left.

rightValue ComputationNode<T>

The value to return if going right.

featureIndex int

The index of the feature to split on.

threshold T

The threshold value for the split.

temperature T

Temperature parameter controlling split sharpness (default: 1.0).

Returns

ComputationNode<T>

A weighted combination of left and right values based on soft split.

Remarks

Computes:

  • p_left = σ((threshold - x[featureIndex]) / temperature)
  • output = p_left * leftValue + (1 - p_left) * rightValue

For Beginners: This makes decision tree splits differentiable by using a smooth sigmoid function instead of a hard if-then-else. Lower temperature makes the split sharper (more like a hard decision), while higher temperature makes it softer.
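
Example usage (a minimal sketch; featureTensor, leftLeaf, and rightLeaf are assumed to be existing Tensor<double> values):

using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(featureTensor, "x");
    var left = TensorOperations<double>.Variable(leftLeaf, "left");
    var right = TensorOperations<double>.Variable(rightLeaf, "right");
    tape.Watch(left);
    tape.Watch(right);

    // Soft decision on feature 2 with threshold 0.5; a lower temperature gives a harder split
    var prediction = TensorOperations<double>.SoftSplit(x, left, right, featureIndex: 2, threshold: 0.5, temperature: 0.1);

    var gradients = tape.Gradient(prediction, new[] { left, right });
}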

Softmax(ComputationNode<T>, int)

Computes the softmax function for a computation node along a specified axis.

public static ComputationNode<T> Softmax(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input node.

axis int

The axis along which to compute softmax. Default is -1 (last axis).

Returns

ComputationNode<T>

A new computation node containing the softmax result.

Remarks

This method computes softmax (σ(x_i) = exp(x_i) / Σexp(x_j)) along the specified axis. Uses numerical stability trick: subtract max before exponentiating. The backward function uses: ∂softmax/∂x = softmax(x) * (grad - Σ(grad * softmax(x))).

For Beginners: Softmax converts a vector of numbers into probabilities.

For softmax:

  • The forward pass exponentiates each element, then normalizes so they sum to 1
  • The result is a probability distribution (all values between 0 and 1, summing to 1)
  • The backward pass is complex but efficient: uses the Jacobian of softmax

Softmax is crucial for:

  • Multi-class classification (final layer outputs)
  • Attention mechanisms (computing attention weights)
  • Anywhere you need to convert scores to probabilities
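
Example usage (a minimal sketch; logitsTensor is assumed to be an existing Tensor<double> with shape [batch, numClasses]):

using (var tape = new GradientTape<double>())
{
    var logits = TensorOperations<double>.Variable(logitsTensor, "logits");
    tape.Watch(logits);

    // Convert raw scores into per-row probability distributions
    var probs = TensorOperations<double>.Softmax(logits, axis: -1);

    var gradients = tape.Gradient(probs, new[] { logits });
}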

Softmin(ComputationNode<T>, int)

Applies the Softmin function, which assigns higher probability to lower values.

public static ComputationNode<T> Softmin(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node.

axis int

The axis along which to compute softmin (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with Softmin applied.

Remarks

Softmin(x) = softmax(-x) = exp(-x) / sum(exp(-x)). Useful when lower values should have higher probability, e.g., in attention over distances.

Gradient: Same Jacobian structure as softmax but with negated input.

Sparsemax(ComputationNode<T>, int)

Applies the Sparsemax activation function which projects onto the probability simplex.

public static ComputationNode<T> Sparsemax(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node (2D: batch × features).

axis int

Axis along which to apply (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with Sparsemax applied.

Remarks

Sparsemax produces sparse probability distributions where some outputs are exactly zero. Unlike softmax which always gives positive probabilities to all classes, sparsemax can assign exactly zero to low-scoring classes.

Gradient: For support set S (non-zero outputs): grad = upstream - mean(upstream[S])

SphericalSoftmax(ComputationNode<T>, int)

Applies the Spherical Softmax activation function.

public static ComputationNode<T> SphericalSoftmax(ComputationNode<T> a, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node (2D: batch × features).

axis int

Axis along which to apply (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with SphericalSoftmax applied.

Remarks

SphericalSoftmax(x) = softmax(x / ||x||₂). First L2-normalizes the input, then applies softmax. This improves numerical stability for inputs with varying magnitudes.

Gradient: Chain rule through L2 normalization and softmax.

Split(ComputationNode<T>, int, int)

Splits a tensor along a specified axis into multiple tensors.

public static List<ComputationNode<T>> Split(ComputationNode<T> a, int numSplits, int axis = 0)

Parameters

a ComputationNode<T>

The input computation node.

numSplits int

The number of splits to create.

axis int

The axis along which to split.

Returns

List<ComputationNode<T>>

A list of computation nodes representing the split tensors.

Sqrt(ComputationNode<T>)

Computes the square root for a computation node.

public static ComputationNode<T> Sqrt(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the square root result.

Remarks

This method computes the square root of each element and records the operation. The backward function uses: ∂(√a)/∂a = 1/(2√a).

For Beginners: This computes square root and tracks gradients.

For square root (c = √a):

  • The forward pass computes √x for each element
  • The backward pass: gradient to 'a' is incoming gradient * 1/(2√a)
  • Which simplifies to: incoming gradient / (2 * output)

Square(ComputationNode<T>)

Computes the element-wise square of the input (x²).

public static ComputationNode<T> Square(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the squared result.

Remarks

This method computes the square of each element (x²) and records the operation. The backward function uses: ∂(x²)/∂x = 2x.

For Beginners: Square is a common operation in neural networks.

For square (c = a²):

  • The forward pass computes a² for each element
  • The backward pass: gradient to 'a' is incoming gradient * 2a

This is more efficient than using Power(a, 2) and is frequently needed for operations like computing distances, norms, and variance.
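
Example usage (a minimal sketch of a mean squared error; predictions and targets are assumed to be existing Tensor<double> values with the same shape):

using (var tape = new GradientTape<double>())
{
    var pred = TensorOperations<double>.Variable(predictions, "pred");
    var target = TensorOperations<double>.Variable(targets, "target", requiresGradient: false);
    tape.Watch(pred);

    var diff = TensorOperations<double>.Subtract(pred, target);
    var mse = TensorOperations<double>.Mean(TensorOperations<double>.Square(diff));

    var gradients = tape.Gradient(mse, new[] { pred }); // 2 * (pred - target) / count
}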

Squash(ComputationNode<T>, double)

Computes the squashing function used in capsule networks: s(x) = ||x||² / (1 + ||x||²) * (x / ||x||).

public static ComputationNode<T> Squash(ComputationNode<T> a, double epsilon = 1E-07)

Parameters

a ComputationNode<T>

The input node representing capsule vectors.

epsilon double

Small value for numerical stability (default: 1e-7).

Returns

ComputationNode<T>

A new computation node containing the squashed result.

Remarks

This method computes the squashing nonlinearity used in capsule networks. The squashing function ensures that short vectors shrink to near zero length and long vectors shrink to a length slightly below 1.

For Beginners: Squashing is the activation function for capsule layers.

The squashing function:

  • Keeps the direction of the vector unchanged
  • Scales the length to be between 0 and 1
  • Short vectors get much shorter (near 0)
  • Long vectors approach length 1

This is crucial for capsule networks where the length represents the probability that the entity represented by the capsule exists, and the direction represents its properties.

Formula: s(v) = ||v||² / (1 + ||v||²) * (v / ||v||)

StraightThroughThreshold(ComputationNode<T>, double)

Applies a straight-through threshold for HTM-style sparse activations.

public static ComputationNode<T> StraightThroughThreshold(ComputationNode<T> input, double threshold)

Parameters

input ComputationNode<T>

The input activations.

threshold double

The threshold value.

Returns

ComputationNode<T>

Binary activations with straight-through gradients.

Remarks

  • Forward: output = (input > threshold) ? 1 : 0
  • Backward: gradients pass through unchanged (straight-through estimator)

Subtract(ComputationNode<T>, ComputationNode<T>)

Performs element-wise subtraction of two computation nodes.

public static ComputationNode<T> Subtract(ComputationNode<T> a, ComputationNode<T> b)

Parameters

a ComputationNode<T>

The node to subtract from.

b ComputationNode<T>

The node to subtract.

Returns

ComputationNode<T>

A new computation node containing the difference.

Remarks

This method performs element-wise subtraction and records the operation to any active GradientTape. The backward function sends gradient to 'a' unchanged and negated gradient to 'b' (since ∂(a-b)/∂a = 1 and ∂(a-b)/∂b = -1).

For Beginners: This subtracts one tensor from another and tracks gradients.

For subtraction (c = a - b):

  • The forward pass computes a minus b element-wise
  • The backward pass sends the gradient to 'a' unchanged
  • But sends the negative gradient to 'b'
  • This is because increasing 'b' by 1 decreases the result by 1

Example: If the gradient flowing to c is [1, 2, 3]:

  • 'a' receives [1, 2, 3]
  • 'b' receives [-1, -2, -3]

Sum(ComputationNode<T>, int[]?, bool)

Sums elements of a computation node along specified axes.

public static ComputationNode<T> Sum(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)

Parameters

a ComputationNode<T>

The computation node to sum.

axes int[]

The axes along which to sum. If null, sums all elements.

keepDims bool

Whether to keep the reduced dimensions with size 1. Default is false.

Returns

ComputationNode<T>

A computation node representing the sum.

Remarks

Reduces the tensor by summing along specified axes.

Gradient computation: the gradient is broadcast back to the original shape, as each element contributed equally to the sum.
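
Example usage (a minimal sketch; scoresTensor is assumed to be an existing Tensor<double> with shape [batch, features]):

var x = TensorOperations<double>.Variable(scoresTensor, "scores");

// Sum over the feature axis only: output shape [batch]
var rowTotals = TensorOperations<double>.Sum(x, new[] { 1 });

// Sum over all elements: scalar output
var grandTotal = TensorOperations<double>.Sum(x);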

SurrogateSpike(ComputationNode<T>, double, double)

Applies a surrogate spike function for spiking neural network JIT compilation.

public static ComputationNode<T> SurrogateSpike(ComputationNode<T> membranePotential, double threshold = 1, double surrogateBeta = 1)

Parameters

membranePotential ComputationNode<T>

The membrane potential input.

threshold double

The spike threshold (default 1.0).

surrogateBeta double

Sharpness of the surrogate gradient (default 1.0).

Returns

ComputationNode<T>

A computation node containing spike outputs with surrogate gradients.

Remarks

Uses the sigmoid surrogate for gradient computation while producing hard spikes in the forward pass.

  • Forward: spike = (potential > threshold) ? 1 : 0
  • Backward: uses the sigmoid derivative as the surrogate gradient

Swish(ComputationNode<T>)

Applies the Swish (SiLU) activation function.

public static ComputationNode<T> Swish(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input computation node.

Returns

ComputationNode<T>

A new computation node with Swish applied.

Remarks

Swish(x) = x * sigmoid(x) = x / (1 + exp(-x)). Also known as SiLU (Sigmoid Linear Unit). Used in EfficientNet and other modern architectures.

Gradient: d(Swish)/dx = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)) = Swish(x) + sigmoid(x) * (1 - Swish(x))

Tanh(ComputationNode<T>)

Computes the hyperbolic tangent (tanh) for a computation node.

public static ComputationNode<T> Tanh(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The input node.

Returns

ComputationNode<T>

A new computation node containing the tanh result.

Remarks

This method computes tanh of each element and records the operation. The backward function uses: ∂(tanh(a))/∂a = 1 - tanh²(a).

For Beginners: Tanh is a common activation function in neural networks.

For tanh (c = tanh(a)):

  • The forward pass computes tanh for each element (outputs between -1 and 1)
  • The backward pass: gradient to 'a' is incoming gradient * (1 - output²)

Tanh is popular because it's centered around 0 (unlike sigmoid which is 0 to 1).

TaylorSoftmax(ComputationNode<T>, int, int)

Applies the Taylor Softmax activation function using Taylor series approximation.

public static ComputationNode<T> TaylorSoftmax(ComputationNode<T> a, int order = 2, int axis = -1)

Parameters

a ComputationNode<T>

The input computation node (2D: batch × features).

order int

Order of Taylor series expansion (default 2).

axis int

Axis along which to apply (default -1, last axis).

Returns

ComputationNode<T>

A new computation node with TaylorSoftmax applied.

Remarks

TaylorSoftmax uses a Taylor series approximation of exp(x): exp(x) ≈ 1 + x + x²/2! + x³/3! + ... + xⁿ/n!, then normalizes like standard softmax. This can be more computationally efficient than standard softmax on some hardware.

Gradient: Similar to softmax but using polynomial derivatives.

ThresholdedReLU(ComputationNode<T>, double)

Applies the Thresholded Rectified Linear Unit activation function.

public static ComputationNode<T> ThresholdedReLU(ComputationNode<T> a, double threshold = 1)

Parameters

a ComputationNode<T>

The input computation node.

threshold double

The threshold value (default 1.0).

Returns

ComputationNode<T>

A new computation node with ThresholdedReLU applied.

Remarks

ThresholdedReLU(x) = x if x > threshold, 0 otherwise. Unlike standard ReLU which activates at 0, this activates at a configurable threshold.

Gradient: d(ThresholdedReLU)/dx = 1 if x > threshold, 0 otherwise.

TopKSoftmax(ComputationNode<T>, int)

Differentiable Top-K selection for mixture-of-experts routing.

public static ComputationNode<T> TopKSoftmax(ComputationNode<T> scores, int k)

Parameters

scores ComputationNode<T>

The routing scores for each expert.

k int

Number of experts to select.

Returns

ComputationNode<T>

Sparse routing weights with only top-K non-zero.

Remarks

Selects top-K values and normalizes them via softmax. Gradients flow only to the selected experts.

Transpose(ComputationNode<T>)

Transposes a 2D computation node (matrix).

public static ComputationNode<T> Transpose(ComputationNode<T> a)

Parameters

a ComputationNode<T>

The matrix to transpose (must be 2D).

Returns

ComputationNode<T>

A computation node representing the transposed matrix.

Remarks

For a 2D tensor, swaps rows and columns: if A has shape [m, n], result has shape [n, m].

Gradient computation: ∂(A^T)/∂A = gradOut^T (transpose the gradient back).

Upsample(ComputationNode<T>, int)

Upsamples a tensor using nearest neighbor interpolation. Supports tensors of any rank (at least 2D), treating the last two dimensions as height and width.

public static ComputationNode<T> Upsample(ComputationNode<T> a, int scale)

Parameters

a ComputationNode<T>

The input computation node with at least 2 dimensions.

scale int

The upsampling scale factor.

Returns

ComputationNode<T>

A computation node representing the upsampled tensor.

Upsample3D(ComputationNode<T>, int, int, int)

Performs 3D upsampling (nearest neighbor) on a 5D tensor.

public static ComputationNode<T> Upsample3D(ComputationNode<T> input, int scaleD, int scaleH, int scaleW)

Parameters

input ComputationNode<T>

The input node with shape [batch, channels, depth, height, width].

scaleD int

Scale factor for depth dimension.

scaleH int

Scale factor for height dimension.

scaleW int

Scale factor for width dimension.

Returns

ComputationNode<T>

A new computation node containing the upsampled result with shape [batch, channels, depth*scaleD, height*scaleH, width*scaleW].

Remarks

3D upsampling increases spatial resolution by repeating values. This is the inverse operation of 3D max pooling and is commonly used in 3D U-Net decoder paths.

For Beginners: Upsample3D makes volumetric data larger by repeating voxels. If you have a 4x4x4 volume and upsample by 2 in each dimension, you get an 8x8x8 volume where each original voxel is repeated 2x2x2 times.

Gradient: The gradient is computed by summing gradients that were distributed to repeated elements back to the original position.

Exceptions

ArgumentException

Thrown when input is not 5D.

Variable(Tensor<T>, string?, bool)

Creates a computation node from a tensor value.

public static ComputationNode<T> Variable(Tensor<T> value, string? name = null, bool requiresGradient = true)

Parameters

value Tensor<T>

The tensor value.

name string

Optional name for the node.

requiresGradient bool

Whether this node requires gradient computation.

Returns

ComputationNode<T>

A computation node wrapping the tensor.

Remarks

This method creates a leaf node in the computation graph - a node with no parents. Leaf nodes typically represent inputs or parameters that gradients will be computed with respect to.

For Beginners: This creates a starting point in your calculation graph.

Use this to wrap:

  • Model parameters (weights, biases) that need gradients
  • Input data that you want to compute gradients for
  • Constants (with requiresGradient=false)

The returned ComputationNode tracks the tensor's value and will accumulate gradients during backpropagation.