Class TensorOperations<T>
Provides automatic differentiation support for tensor operations.
public static class TensorOperations<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
TensorOperations<T>
- Inherited Members
Remarks
TensorOperations is a helper class that integrates automatic differentiation with tensor operations. It records operations performed on tensors to an active GradientTape (if present) and creates the computation graph needed for backpropagation.
This class follows the opt-in pattern: tensor operations only record to the gradient tape when explicitly used within a GradientTape context. Outside of a GradientTape context, operations work normally without any overhead.
For Beginners: This class bridges regular tensor operations with automatic differentiation.
Think of it like adding a "recording mode" to your calculations:
- When you're inside a GradientTape context, operations are recorded
- The recording remembers how each value was computed
- Later, you can "play it backwards" to compute gradients
- When not recording, operations work exactly as before
This enables features like:
- Automatic gradient computation for neural network training
- Computing derivatives without writing manual backward passes
- Building complex computational graphs automatically
Example usage:
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var y = TensorOperations<double>.Variable(parameterTensor, "y");
    tape.Watch(x);
    tape.Watch(y);
    var z = TensorOperations<double>.Add(x, y); // Recorded to tape
    var gradients = tape.Gradient(z, new[] { x, y });
}
Methods
Abs(ComputationNode<T>)
Computes the absolute value of each element in a computation node.
public static ComputationNode<T> Abs(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the absolute values.
Remarks
This method computes |x| for each element and records the operation. The backward function uses the sign of the original values for gradient computation.
For Beginners: This makes all values positive (removes the sign).
For absolute value (c = |a|):
- The forward pass removes the sign of each element
- The backward pass uses sign(a) to route gradients correctly
- For positive values, gradient passes through unchanged
- For negative values, gradient is negated
Note: At x = 0, the gradient is technically undefined, but we use 0 as a convention.
Add(ComputationNode<T>, ComputationNode<T>)
Performs element-wise addition of two computation nodes.
public static ComputationNode<T> Add(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The first node.
bComputationNode<T>The second node.
Returns
- ComputationNode<T>
A new computation node containing the sum.
Remarks
This method performs element-wise addition and records the operation to any active GradientTape. The backward function distributes gradients equally to both inputs (since ∂(a+b)/∂a = 1 and ∂(a+b)/∂b = 1).
For Beginners: This adds two tensors together and remembers how to compute gradients.
For addition (c = a + b):
- The forward pass computes the sum element-wise
- The backward pass sends gradients to both inputs unchanged
- This is because changing 'a' by 1 changes the sum by 1, same for 'b'
Example: If the gradient flowing back to c is [1, 2, 3], then both 'a' and 'b' receive [1, 2, 3]
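Example usage (an illustrative sketch following the class-level pattern above; aTensor and bTensor are assumed to be pre-built Tensor<double> values of the same shape):
using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aTensor, "a");
    var b = TensorOperations<double>.Variable(bTensor, "b");
    tape.Watch(a);
    tape.Watch(b);
    var c = TensorOperations<double>.Add(a, b);        // recorded: c = a + b
    var grads = tape.Gradient(c, new[] { a, b });      // both inputs receive the upstream gradient unchanged
}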
AffineGrid(ComputationNode<T>, int, int)
Generates a sampling grid for spatial transformer networks using affine transformation matrices.
public static ComputationNode<T> AffineGrid(ComputationNode<T> theta, int outputHeight, int outputWidth)
Parameters
thetaComputationNode<T>Affine transformation matrices of shape [batch, 2, 3]
outputHeightintHeight of the output grid
outputWidthintWidth of the output grid
Returns
- ComputationNode<T>
Sampling grid of shape [batch, outputHeight, outputWidth, 2] with (x, y) coordinates
Remarks
This operation generates a grid of sampling coordinates for spatial transformations. The output grid starts as a regular grid in normalized coordinates [-1, 1], then each point is transformed using the affine matrix.
Forward pass:
1. Generate the base grid in [-1, 1] normalized space.
2. For each point (x_out, y_out) in output space:
   x_in = theta[0,0]*x_out + theta[0,1]*y_out + theta[0,2]
   y_in = theta[1,0]*x_out + theta[1,1]*y_out + theta[1,2]
Backward pass:
- ∂L/∂theta[i,j] = sum over all grid points of (∂L/∂grid * ∂grid/∂theta)
For Beginners: This creates a map showing where each output pixel should sample from. The affine matrix controls rotation, scaling, translation, and shearing of the grid.
AnomalyScore(ComputationNode<T>, ComputationNode<T>)
Anomaly score computation using reconstruction error or density estimation.
public static ComputationNode<T> AnomalyScore(ComputationNode<T> input, ComputationNode<T> reconstruction)
Parameters
inputComputationNode<T>Input tensor.
reconstructionComputationNode<T>Reconstructed input (e.g., from autoencoder).
Returns
- ComputationNode<T>
Anomaly scores (higher = more anomalous).
ApplyActivation(ComputationNode<T>, IActivationFunction<T>)
Applies a generic activation function (scalar or element-wise) with automatic differentiation.
public static ComputationNode<T> ApplyActivation(ComputationNode<T> input, IActivationFunction<T> activation)
Parameters
inputComputationNode<T>The input computation node.
activationIActivationFunction<T>The activation function to apply.
Returns
- ComputationNode<T>
A new computation node with the activation applied.
Remarks
This method provides generic autodiff support for ANY activation function that implements IActivationFunction<T>. It works by applying the activation function element-wise during the forward pass, then using the activation's ComputeDerivative method during backpropagation.
This means ALL 39 built-in activation functions automatically work with autodiff, and only truly custom user-defined activations (that don't inherit from ActivationFunctionBase) would fail.
AvgPool2D(ComputationNode<T>, int[], int[]?)
Performs 2D average pooling on a 4D tensor (batch, channels, height, width).
public static ComputationNode<T> AvgPool2D(ComputationNode<T> a, int[] poolSize, int[]? strides = null)
Parameters
aComputationNode<T>The input node with shape [batch, channels, height, width].
poolSizeint[]The size of the pooling window [poolH, poolW].
stridesint[]The stride for the pooling operation [strideH, strideW]. If null, uses poolSize.
Returns
- ComputationNode<T>
A new computation node containing the average pooled result.
Remarks
This method performs average pooling over 2D spatial dimensions. The backward function distributes gradients equally across the pooling window.
For Beginners: AvgPool downsamples by taking the average value in each window.
For average pooling:
- The forward pass slides a window and computes the average
- This smoothly reduces spatial dimensions
- The backward pass distributes gradients equally to all elements in the window
- Each element gets gradient / pool_area
Used in:
- CNNs for smoother downsampling than max pooling
- Global average pooling (replacing fully connected layers)
- Reducing overfitting
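Example usage (an illustrative sketch; imageTensor is assumed to be an existing Tensor<double> with shape [batch, channels, height, width]):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(imageTensor, "x");
    tape.Watch(x);
    // 2x2 window with the default stride (equal to the pool size) halves height and width
    var pooled = TensorOperations<double>.AvgPool2D(x, new[] { 2, 2 });
    var grads = tape.Gradient(pooled, new[] { x });    // each input element receives gradient / 4
}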
BatchMatrixMultiply(ComputationNode<T>, ComputationNode<T>)
Performs batched matrix multiplication of two 3D computation nodes.
public static ComputationNode<T> BatchMatrixMultiply(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The first 3D tensor with shape [Batch, M, K].
bComputationNode<T>The second 3D tensor with shape [Batch, K, N].
Returns
- ComputationNode<T>
A computation node representing the batched matrix multiplication with shape [Batch, M, N].
Remarks
For 3D tensors, this performs an independent matrix multiplication for each batch index: result[i] = a[i] @ b[i].
Gradient computation:
- ∂(A·B)/∂A = gradOut·B^T (batch-wise)
- ∂(A·B)/∂B = A^T·gradOut (batch-wise)
BatchNorm(ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?, Tensor<T>?, Tensor<T>?, bool, double)
Applies batch normalization to a computation node.
public static ComputationNode<T> BatchNorm(ComputationNode<T> a, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, Tensor<T>? runningMean = null, Tensor<T>? runningVar = null, bool training = true, double epsilon = 1E-05)
Parameters
aComputationNode<T>The input node with shape [batch, features].
gammaComputationNode<T>Optional scale parameter (learnable). If null, uses ones.
betaComputationNode<T>Optional shift parameter (learnable). If null, uses zeros.
runningMeanTensor<T>Running mean for inference (not updated during this operation).
runningVarTensor<T>Running variance for inference (not updated during this operation).
trainingboolWhether in training mode (uses batch statistics) or inference mode (uses running statistics).
epsilondoubleSmall constant for numerical stability. Default is 1e-5.
Returns
- ComputationNode<T>
A new computation node containing the batch normalized result.
Remarks
Batch normalization normalizes inputs across the batch dimension. During training: Uses batch statistics (mean and variance computed from current batch). During inference: Uses running statistics (accumulated during training).
For Beginners: BatchNorm standardizes features across the batch.
For batch normalization:
- Training mode: Uses current batch's mean and variance
- Inference mode: Uses running mean/variance from training
- Normalizes: (x - mean) / sqrt(variance)
- Scales and shifts: result * gamma + beta
Benefits:
- Stabilizes training (reduces internal covariate shift)
- Allows higher learning rates
- Acts as regularization
Used in:
- CNNs (after convolutional layers)
- Deep feedforward networks
- GANs and many other architectures
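Example usage (an illustrative sketch; featuresTensor, gammaTensor, and betaTensor are assumed Tensor<double> values with shapes [batch, features], [features], and [features]):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(featuresTensor, "x");
    var gamma = TensorOperations<double>.Variable(gammaTensor, "gamma");
    var beta = TensorOperations<double>.Variable(betaTensor, "beta");
    tape.Watch(x);
    tape.Watch(gamma);
    tape.Watch(beta);
    // Training mode: normalizes with the current batch's mean and variance
    var y = TensorOperations<double>.BatchNorm(x, gamma, beta, training: true);
    var grads = tape.Gradient(y, new[] { x, gamma, beta });
}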
BentIdentity(ComputationNode<T>)
Applies the Bent Identity activation function element-wise.
public static ComputationNode<T> BentIdentity(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with BentIdentity applied.
Remarks
BentIdentity is defined as: f(x) = (sqrt(x² + 1) - 1) / 2 + x. The gradient is: x / (2 * sqrt(x² + 1)) + 1.
For Beginners: BentIdentity is a smooth alternative to ReLU with non-zero gradient everywhere, preventing dead neurons during training.
Broadcast(ComputationNode<T>, int[])
Broadcasts a 1D tensor to a 2D tensor by tiling along the batch dimension.
public static ComputationNode<T> Broadcast(ComputationNode<T> a, int[] targetShape)
Parameters
aComputationNode<T>The input 1D tensor node with shape [N].
targetShapeint[]The target 2D shape [batchSize, N].
Returns
- ComputationNode<T>
A new computation node with the broadcasted tensor.
Remarks
This operation broadcasts a 1D tensor (e.g., biases with shape [outputSize]) to a 2D tensor (e.g., [batchSize, outputSize]) by replicating values along the batch dimension. The backward pass correctly sums gradients along the broadcasted dimension.
For Beginners: Broadcasting is like copying a row multiple times to create a matrix.
For example, if you have biases [b1, b2, b3] and need to add them to a batch of outputs:
- Input: [b1, b2, b3] (shape [3])
- Target shape: [batchSize=2, 3]
- Output: [[b1, b2, b3], [b1, b2, b3]] (each row is a copy)
During backpropagation, gradients from all rows are summed back to the original biases, because each bias contributed to all batch elements.
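Example usage (an illustrative sketch of adding a bias vector to a batched output; biasTensor has shape [3] and outputTensor has shape [2, 3], both assumed to exist already):
using (var tape = new GradientTape<double>())
{
    var bias = TensorOperations<double>.Variable(biasTensor, "bias");
    var output = TensorOperations<double>.Variable(outputTensor, "output");
    tape.Watch(bias);
    var tiled = TensorOperations<double>.Broadcast(bias, new[] { 2, 3 });  // [3] -> [2, 3]
    var sum = TensorOperations<double>.Add(output, tiled);
    var grads = tape.Gradient(sum, new[] { bias });    // gradients are summed back over the batch dimension
}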
CELU(ComputationNode<T>, double)
Applies the CELU (Continuously Differentiable ELU) activation function element-wise.
public static ComputationNode<T> CELU(ComputationNode<T> a, double alpha = 1)
Parameters
aComputationNode<T>The input computation node.
alphadoubleThe alpha parameter controlling negative saturation. Default is 1.0.
Returns
- ComputationNode<T>
A new computation node with CELU applied.
Remarks
CELU is defined as: f(x) = max(0, x) + min(0, α * (exp(x/α) - 1)). The gradient is 1 if x >= 0, otherwise exp(x/α).
For Beginners: CELU is an improved version of ELU that is continuously differentiable everywhere, which can help with optimization and training stability.
CRFForward(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?)
CRF forward algorithm for sequence labeling.
public static ComputationNode<T> CRFForward(ComputationNode<T> emissions, ComputationNode<T> transitions, ComputationNode<T>? startScores = null, ComputationNode<T>? endScores = null)
Parameters
emissionsComputationNode<T>Emission scores [seq_len, num_tags].
transitionsComputationNode<T>Transition matrix [num_tags, num_tags].
startScoresComputationNode<T>Optional start scores [num_tags].
endScoresComputationNode<T>Optional end scores [num_tags].
Returns
- ComputationNode<T>
Log partition function (normalizer).
Remarks
Computes the log partition function using the forward-backward algorithm. This is differentiable and returns proper gradients for emissions, transitions, start scores, and end scores.
ComplexMatMul(ComputationNode<T>, ComputationNode<T>, string)
Performs complex matrix multiplication on tensors representing complex numbers as [real, imag] pairs.
public static ComputationNode<T> ComplexMatMul(ComputationNode<T> a, ComputationNode<T> b, string format = "split")
Parameters
aComputationNode<T>First complex matrix [batch, m, 2*k] where dimensions are [real, imag] interleaved or concatenated.
bComputationNode<T>Second complex matrix [batch, 2*k, n].
formatstringWhether complex numbers are "interleaved" ([r,i,r,i,...]) or "split" ([r,r,...,i,i,...]).
Returns
- ComputationNode<T>
Complex matrix product [batch, m, 2*n].
Remarks
Complex multiplication: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
For Beginners: This multiplies matrices of complex numbers.
Complex numbers are represented as pairs of real numbers [real_part, imaginary_part]. This operation implements the full complex matrix multiplication formula.
Used in quantum computing layers where quantum gates are unitary matrices.
ComplexMultiply(ComputationNode<T>, ComputationNode<T>, string)
Performs element-wise complex multiplication.
public static ComputationNode<T> ComplexMultiply(ComputationNode<T> a, ComputationNode<T> b, string format = "split")
Parameters
aComputationNode<T>First complex tensor with last dimension of size 2*n.
bComputationNode<T>Second complex tensor with last dimension of size 2*n.
formatstringThe layout of the complex numbers; default is "split" ([r,r,...,i,i,...]).
Returns
- ComputationNode<T>
Element-wise complex product.
Remarks
Complex multiplication: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
Concat(List<ComputationNode<T>>, int)
Concatenates multiple computation nodes along a specified axis.
public static ComputationNode<T> Concat(List<ComputationNode<T>> nodes, int axis = 0)
Parameters
nodesList<ComputationNode<T>>The list of nodes to concatenate.
axisintThe axis along which to concatenate. Default is 0.
Returns
- ComputationNode<T>
A new computation node containing the concatenated result.
Remarks
This method concatenates tensors along the specified axis. All tensors must have the same shape except along the concatenation axis. The backward function splits the gradient and sends each portion to the corresponding input.
For Beginners: Concat stacks tensors together along a dimension.
For concatenation:
- The forward pass combines multiple tensors into one larger tensor
- The backward pass splits the gradient back to each input
- Think of it like gluing arrays together end-to-end
Used in:
- Skip connections (concatenating features from different layers)
- Multi-input architectures
- Feature fusion in neural networks
Constant(Tensor<T>, string?)
Creates a constant computation node from a tensor value.
public static ComputationNode<T> Constant(Tensor<T> value, string? name = null)
Parameters
valueTensor<T>The tensor value.
namestringOptional name for the node.
Returns
- ComputationNode<T>
A computation node that doesn't require gradients.
Remarks
This method creates a constant node - a value that won't have gradients computed. Use this for constants, hyperparameters, or intermediate values you don't need gradients for.
For Beginners: This creates a value that won't be adjusted during training.
Use this for:
- Constants (like pi, e, or fixed multipliers)
- Hyperparameters that don't change during training
- Any value you don't need gradients for (saves memory)
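Example usage (an illustrative sketch mixing a trainable variable with a constant; scaleTensor and inputTensor are assumed pre-built Tensor<double> values):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var scale = TensorOperations<double>.Constant(scaleTensor, "scale");   // no gradient tracked for this node
    tape.Watch(x);
    var y = TensorOperations<double>.ElementwiseMultiply(x, scale);
    var grads = tape.Gradient(y, new[] { x });          // only x receives a gradient
}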
Conv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)
Performs 2D convolution on a 4D tensor (batch, channels, height, width).
public static ComputationNode<T> Conv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)
Parameters
inputComputationNode<T>The input node with shape [batch, inChannels, height, width].
kernelComputationNode<T>The kernel/filter with shape [outChannels, inChannels, kernelH, kernelW].
biasComputationNode<T>Optional bias with shape [outChannels]. If null, no bias is added.
strideint[]The stride [strideH, strideW]. Default is [1, 1].
paddingint[]The padding [padH, padW]. Default is [0, 0].
Returns
- ComputationNode<T>
A new computation node containing the convolution result.
Remarks
This method performs 2D convolution, the fundamental operation in CNNs. Forward: Slides the kernel over the input computing dot products. Backward: Computes gradients for both input and kernel using transposed convolutions.
For Beginners: Conv2D is the core operation of convolutional neural networks.
For 2D convolution:
- The kernel "slides" over the input, computing weighted sums
- Each output position is a dot product of the kernel with input patch
- Stride controls how far the kernel moves each step
- Padding adds borders to control output size
Gradient computation:
- Gradient w.r.t. input: "full" convolution with flipped kernel
- Gradient w.r.t. kernel: cross-correlation between input and output gradient
Used in:
- All CNNs (image classification, object detection, segmentation)
- Feature extraction in vision models
- Learning spatial hierarchies
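Example usage (an illustrative sketch; imageTensor has shape [batch, inChannels, height, width] and kernelTensor has shape [outChannels, inChannels, 3, 3], both assumed to exist):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(imageTensor, "x");
    var k = TensorOperations<double>.Variable(kernelTensor, "k");
    tape.Watch(x);
    tape.Watch(k);
    // 3x3 convolution with stride 1 and padding 1 keeps the spatial size unchanged
    var y = TensorOperations<double>.Conv2D(x, k, bias: null, stride: new[] { 1, 1 }, padding: new[] { 1, 1 });
    var grads = tape.Gradient(y, new[] { x, k });
}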
Conv3D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)
Performs 3D convolution on a 5D tensor (batch, channels, depth, height, width).
public static ComputationNode<T> Conv3D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)
Parameters
inputComputationNode<T>The input node with shape [batch, inChannels, depth, height, width].
kernelComputationNode<T>The kernel/filter with shape [outChannels, inChannels, kernelD, kernelH, kernelW].
biasComputationNode<T>Optional bias with shape [outChannels]. If null, no bias is added.
strideint[]The stride [strideD, strideH, strideW]. Default is [1, 1, 1].
paddingint[]The padding [padD, padH, padW]. Default is [0, 0, 0].
Returns
- ComputationNode<T>
A new computation node containing the 3D convolution result.
Remarks
This method performs 3D convolution, the fundamental operation for volumetric data processing. Forward: Slides the kernel over the input computing dot products across all three spatial dimensions. Backward: Computes gradients for both input and kernel using transposed 3D convolutions.
For Beginners: Conv3D is the 3D extension of Conv2D for volumetric data.
For 3D convolution:
- The kernel "slides" over depth, height, and width dimensions
- Each output position is a dot product of the kernel with an input volume
- Stride controls how far the kernel moves each step in each dimension
- Padding adds borders to control output size
Gradient computation:
- Gradient w.r.t. input: "full" 3D convolution with flipped kernel
- Gradient w.r.t. kernel: 3D cross-correlation between input and output gradient
Used in:
- 3D object recognition from voxel grids (VoxNet, VoxelCNN)
- Medical image analysis (CT/MRI volumetric scans)
- Video understanding (treating time as depth dimension)
- Point cloud processing after voxelization
Exceptions
- ArgumentException
Thrown when input or kernel have invalid dimensions.
ConvTranspose2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?, int[]?)
Performs 2D transposed convolution (deconvolution) on a 4D tensor.
public static ComputationNode<T> ConvTranspose2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? outputPadding = null)
Parameters
inputComputationNode<T>The input node with shape [batch, inChannels, height, width].
kernelComputationNode<T>The kernel with shape [inChannels, outChannels, kernelH, kernelW] (note: reversed from Conv2D).
biasComputationNode<T>Optional bias with shape [outChannels]. If null, no bias is added.
strideint[]The stride [strideH, strideW]. Default is [1, 1].
paddingint[]The padding [padH, padW]. Default is [0, 0].
outputPaddingint[]Output padding [outPadH, outPadW] for size adjustment. Default is [0, 0].
Returns
- ComputationNode<T>
A new computation node containing the transposed convolution result.
Remarks
Transposed convolution (often called deconvolution) upsamples the input. It's the gradient of Conv2D with respect to its input, used as a forward operation.
For Beginners: ConvTranspose2D upsamples spatial dimensions.
For transposed convolution:
- Inserts zeros between input elements according to stride
- Applies regular convolution to the expanded input
- Results in larger spatial dimensions (upsampling)
Used in:
- Image generation (GANs, VAEs)
- Semantic segmentation (U-Net decoder)
- Super-resolution
- Any task requiring upsampling
Crop(ComputationNode<T>, int[])
Crops a tensor by removing elements from the edges.
public static ComputationNode<T> Crop(ComputationNode<T> a, int[] cropping)
Parameters
aComputationNode<T>The input computation node.
croppingint[]Array of [top, bottom, left, right] cropping amounts for 4D tensors.
Returns
- ComputationNode<T>
A computation node representing the cropped tensor.
DeformableConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, ComputationNode<T>?, int[]?, int[]?, int[]?)
Performs 2D deformable convolution with learnable offsets and optional modulation mask.
public static ComputationNode<T> DeformableConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T> offset, ComputationNode<T>? mask = null, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? dilation = null)
Parameters
inputComputationNode<T>Input tensor [batch, inChannels, height, width].
kernelComputationNode<T>Convolution kernel [outChannels, inChannels, kernelH, kernelW].
offsetComputationNode<T>Spatial offsets [batch, 2*kernelH*kernelW, outH, outW].
maskComputationNode<T>Optional modulation mask [batch, kernelH*kernelW, outH, outW]. If null, uses uniform weights.
biasComputationNode<T>Optional bias [outChannels]. If null, no bias is added.
strideint[]Stride [strideH, strideW]. Default is [1, 1].
paddingint[]Padding [padH, padW]. Default is [0, 0].
dilationint[]Dilation [dilationH, dilationW]. Default is [1, 1].
Returns
- ComputationNode<T>
Output tensor [batch, outChannels, outH, outW].
Remarks
Deformable convolution augments standard convolution with learnable 2D offsets for each sampling position in the kernel. This allows the network to adaptively adjust its receptive field based on the input, enabling better modeling of geometric transformations.
For Beginners: Standard convolution samples at fixed grid positions. Deformable convolution learns where to sample, allowing it to handle objects of various shapes and scales more effectively.
DepthwiseConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?)
Performs depthwise 2D convolution where each input channel is convolved with its own set of filters.
public static ComputationNode<T> DepthwiseConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null)
Parameters
inputComputationNode<T>Input tensor of shape [batch, in_channels, height, width]
kernelComputationNode<T>Kernel tensor of shape [in_channels, multiplier, kernel_height, kernel_width]
biasComputationNode<T>Optional bias tensor of shape [in_channels * multiplier]
strideint[]Stride for the convolution, defaults to [1, 1]
paddingint[]Padding for the convolution, defaults to [0, 0]
Returns
- ComputationNode<T>
Output tensor of shape [batch, in_channels * multiplier, out_height, out_width]
Remarks
Depthwise convolution applies a separate filter to each input channel independently, with no mixing across channels. This is in contrast to standard convolution which mixes all input channels. Each input channel gets 'multiplier' filters applied to it, producing 'multiplier' output channels. The total output channels is in_channels * multiplier.
This operation is commonly used in MobileNets and other efficient architectures, often followed by a pointwise (1x1) convolution to mix channels. The combination dramatically reduces computational cost compared to standard convolution.
Forward pass computes the depthwise convolution by applying each filter only to its corresponding input channel. Backward pass computes gradients with respect to input, kernel, and bias.
DilatedConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?, int[]?, int[]?)
Performs dilated (atrous) 2D convolution operation.
public static ComputationNode<T> DilatedConv2D(ComputationNode<T> input, ComputationNode<T> kernel, ComputationNode<T>? bias = null, int[]? stride = null, int[]? padding = null, int[]? dilation = null)
Parameters
inputComputationNode<T>The input tensor with shape [batch, channels, height, width].
kernelComputationNode<T>The convolution kernel with shape [out_channels, in_channels, kernel_height, kernel_width].
biasComputationNode<T>Optional bias tensor with shape [out_channels].
strideint[]The stride for the convolution. Defaults to [1, 1].
paddingint[]The padding for the convolution. Defaults to [0, 0].
dilationint[]The dilation rate for the convolution. Defaults to [1, 1].
Returns
- ComputationNode<T>
A computation node representing the dilated convolution result.
Divide(ComputationNode<T>, ComputationNode<T>)
Performs element-wise division of two computation nodes.
public static ComputationNode<T> Divide(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The numerator node.
bComputationNode<T>The denominator node.
Returns
- ComputationNode<T>
A new computation node containing the element-wise quotient.
Remarks
This method performs element-wise division and records the operation to any active GradientTape. The backward function uses the quotient rule: ∂(a/b)/∂a = 1/b and ∂(a/b)/∂b = -a/b².
For Beginners: This divides one tensor by another element-wise and tracks gradients.
For element-wise division (c = a / b):
- The forward pass divides corresponding elements
- The backward pass uses the quotient rule from calculus
- Gradient to 'a' is: incoming gradient * (1/b)
- Gradient to 'b' is: incoming gradient * (-a/b²)
Example: If a=[6,8], b=[2,4], c=[3,2] If gradient to c is [1,1]:
- 'a' receives [1/2, 1/4] = [0.5, 0.25]
- 'b' receives [-6/4, -8/16] = [-1.5, -0.5]
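Example usage (an illustrative sketch; aTensor and bTensor are assumed same-shape Tensor<double> values with non-zero denominators):
using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aTensor, "a");
    var b = TensorOperations<double>.Variable(bTensor, "b");
    tape.Watch(a);
    tape.Watch(b);
    var c = TensorOperations<double>.Divide(a, b);
    var grads = tape.Gradient(c, new[] { a, b });       // a receives 1/b, b receives -a/b² (times the upstream gradient)
}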
ELU(ComputationNode<T>, double)
Applies the Exponential Linear Unit (ELU) activation function to a computation node.
public static ComputationNode<T> ELU(ComputationNode<T> a, double alpha = 1)
Parameters
aComputationNode<T>The input computation node.
alphadoubleThe alpha parameter controlling the negative saturation value. Default is 1.0.
Returns
- ComputationNode<T>
A new computation node with ELU applied.
Remarks
ELU(x) = x if x > 0, alpha * (exp(x) - 1) otherwise. ELU helps prevent "dying neurons" and pushes mean activations closer to zero.
Gradient: d(ELU)/dx = 1 if x > 0, alpha * exp(x) = ELU(x) + alpha otherwise.
ElementwiseMultiply(ComputationNode<T>, ComputationNode<T>)
Performs element-wise multiplication of two computation nodes.
public static ComputationNode<T> ElementwiseMultiply(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The first node.
bComputationNode<T>The second node.
Returns
- ComputationNode<T>
A new computation node containing the element-wise product.
Remarks
This method performs element-wise (Hadamard) multiplication and records the operation. The backward function uses the product rule: ∂(a*b)/∂a = b and ∂(a*b)/∂b = a.
For Beginners: This multiplies two tensors element-wise and tracks gradients.
For element-wise multiplication (c = a * b):
- The forward pass multiplies corresponding elements
- The backward pass uses the product rule from calculus
- Gradient to 'a' is: incoming gradient * b's value
- Gradient to 'b' is: incoming gradient * a's value
Example: If a=[2,3], b=[4,5], then c=[8,15]. If the gradient to c is [1,1]:
- 'a' receives [1*4, 1*5] = [4, 5]
- 'b' receives [1*2, 1*3] = [2, 3]
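Example usage (an illustrative sketch matching the values above; aTensor = [2,3] and bTensor = [4,5] are assumed pre-built Tensor<double> values):
using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aTensor, "a");
    var b = TensorOperations<double>.Variable(bTensor, "b");
    tape.Watch(a);
    tape.Watch(b);
    var c = TensorOperations<double>.ElementwiseMultiply(a, b);   // c = [8, 15]
    var grads = tape.Gradient(c, new[] { a, b });                 // a's gradient is b's values, b's gradient is a's values
}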
EmbeddingLookup(ComputationNode<T>, ComputationNode<T>)
Performs embedding lookup operation.
public static ComputationNode<T> EmbeddingLookup(ComputationNode<T> embeddings, ComputationNode<T> indices)
Parameters
embeddingsComputationNode<T>The embedding matrix [vocab_size, embedding_dim].
indicesComputationNode<T>The indices to lookup [batch_size, sequence_length].
Returns
- ComputationNode<T>
The looked up embeddings [batch_size, sequence_length, embedding_dim].
Exp(ComputationNode<T>)
Computes the exponential function (e^x) for a computation node.
public static ComputationNode<T> Exp(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the exponential result.
Remarks
This method computes e raised to each element and records the operation. The backward function uses: ∂(e^a)/∂a = e^a.
For Beginners: This computes e^x for each element and tracks gradients.
For exponential (c = e^a):
- The forward pass computes e^x for each element
- The backward pass has a special property: the derivative equals the output!
- Gradient to 'a' is: incoming gradient * e^a (which is just the output)
This is used in softmax, sigmoid, and many activation functions.
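Example usage (an illustrative sketch; xTensor is an assumed pre-built Tensor<double>):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(xTensor, "x");
    tape.Watch(x);
    var y = TensorOperations<double>.Exp(x);
    var grads = tape.Gradient(y, new[] { x });          // gradient equals e^x, i.e. the forward output itself
}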
FakeQuantize(ComputationNode<T>, int, T?, T?, bool)
Performs fake quantization with Straight-Through Estimator (STE) for differentiable quantization.
public static ComputationNode<T> FakeQuantize(ComputationNode<T> input, int numBits = 8, T? scale = default, T? zeroPoint = default, bool symmetric = true)
Parameters
inputComputationNode<T>The input tensor to quantize.
numBitsintNumber of quantization bits (default: 8).
scaleTScale factor (if null, computed from input range).
zeroPointTZero point for asymmetric quantization (default: 0).
symmetricboolWhether to use symmetric quantization (default: true).
Returns
- ComputationNode<T>
Fake-quantized tensor (quantized forward, STE backward).
Remarks
Forward: output = round(input / scale) * scale (clipped to the valid range).
Backward: the gradient passes through unchanged (Straight-Through Estimator).
For Beginners: This simulates quantization during training while allowing gradients to flow back for optimization. The forward pass applies real quantization, but the backward pass pretends it didn't happen - this trick (STE) lets us train models that will be quantized for deployment.
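Example usage (an illustrative sketch of quantization-aware training; weightTensor is an assumed Tensor<double> of layer weights):
using (var tape = new GradientTape<double>())
{
    var w = TensorOperations<double>.Variable(weightTensor, "w");
    tape.Watch(w);
    // Forward pass sees 8-bit quantized weights; backward pass treats the op as identity (STE)
    var wq = TensorOperations<double>.FakeQuantize(w, numBits: 8, symmetric: true);
    var grads = tape.Gradient(wq, new[] { w });         // gradient flows through unchanged
}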
GELU(ComputationNode<T>)
Applies the Gaussian Error Linear Unit (GELU) activation function.
public static ComputationNode<T> GELU(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with GELU applied.
Remarks
GELU(x) = x * Φ(x) where Φ is the standard Gaussian cumulative distribution function. Approximation: 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
GELU is widely used in transformers (BERT, GPT) and modern architectures.
Gradient: d(GELU)/dx = Φ(x) + x * φ(x) where φ is the Gaussian PDF.
GRUCell(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)
GRU cell forward pass.
public static ComputationNode<T> GRUCell(ComputationNode<T> input, ComputationNode<T> hiddenState, ComputationNode<T> weightIH, ComputationNode<T> weightHH, ComputationNode<T> bias)
Parameters
inputComputationNode<T>Input tensor [batch, input_dim].
hiddenStateComputationNode<T>Previous hidden state [batch, hidden_dim].
weightIHComputationNode<T>Input-to-hidden weights [input_dim, 3*hidden_dim].
weightHHComputationNode<T>Hidden-to-hidden weights [hidden_dim, 3*hidden_dim].
biasComputationNode<T>Bias terms [3*hidden_dim].
Returns
- ComputationNode<T>
New hidden state.
Gaussian(ComputationNode<T>)
Applies the Gaussian activation function element-wise: f(x) = exp(-x²).
public static ComputationNode<T> Gaussian(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with Gaussian applied.
Remarks
Gaussian is defined as: f(x) = exp(-x²). The gradient is: -2x * exp(-x²).
For Beginners: Gaussian creates a bell-shaped response curve that is maximum at zero and approaches zero for large inputs in either direction. Useful for RBF networks and pattern recognition.
GraphConv(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)
Performs graph convolution operation for graph neural networks.
public static ComputationNode<T> GraphConv(ComputationNode<T> input, ComputationNode<T> adjacency, ComputationNode<T> weights, ComputationNode<T>? bias = null)
Parameters
inputComputationNode<T>Input node features of shape [batch, numNodes, inputFeatures]
adjacencyComputationNode<T>Adjacency matrix of shape [batch, numNodes, numNodes]
weightsComputationNode<T>Weight matrix of shape [inputFeatures, outputFeatures]
biasComputationNode<T>Optional bias vector of shape [outputFeatures]
Returns
- ComputationNode<T>
Output node features of shape [batch, numNodes, outputFeatures]
Remarks
This operation implements graph convolution: output = adjacency @ (input @ weights) + bias. It aggregates features from neighboring nodes according to the graph structure defined by the adjacency matrix.
Forward pass:
1. Transform node features: X' = X @ W
2. Aggregate via graph structure: output = A @ X'
3. Add bias: output = output + b
Backward pass gradients:
- ∂L/∂X = A^T @ (∂L/∂out) @ W^T
- ∂L/∂W = X^T @ A^T @ (∂L/∂out)
- ∂L/∂b = sum(∂L/∂out) across batch and nodes
- ∂L/∂A = (∂L/∂out) @ (X @ W)^T
For Beginners: This operation helps neural networks learn from graph-structured data.
Think of it like spreading information through a social network:
- Each person (node) has certain features
- The adjacency matrix shows who is connected to whom
- This operation lets each person's features be influenced by their connections
- The weights control how features are transformed during this process
GridSample(ComputationNode<T>, ComputationNode<T>)
Samples input using bilinear interpolation at grid locations for spatial transformer networks.
public static ComputationNode<T> GridSample(ComputationNode<T> input, ComputationNode<T> grid)
Parameters
inputComputationNode<T>Input tensor of shape [batch, height, width, channels]
gridComputationNode<T>Sampling grid of shape [batch, out_height, out_width, 2] with normalized coordinates in [-1, 1]
Returns
- ComputationNode<T>
Sampled output of shape [batch, out_height, out_width, channels]
Remarks
This operation performs differentiable bilinear sampling from the input tensor using coordinates specified in the grid. Grid coordinates are in normalized [-1, 1] space where (-1, -1) is top-left and (1, 1) is bottom-right.
Forward pass:
1. Convert normalized grid coordinates to input pixel coordinates.
2. For each sampling point, find the 4 nearest pixels.
3. Compute bilinear interpolation weights.
4. Interpolate: out = w00*v00 + w01*v01 + w10*v10 + w11*v11
Backward pass:
- ∂L/∂input: Distribute gradients back to the 4 nearest pixels using the same weights.
- ∂L/∂grid: Compute how the grid coordinates affect the sampling result.
For Beginners: This samples from an image using smooth interpolation. Instead of reading exact pixels, it can sample from positions between pixels by blending nearby pixel values. This enables smooth transformations like rotation.
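Example usage (an illustrative sketch of a spatial transformer step combining AffineGrid and GridSample; thetaTensor has shape [batch, 2, 3] and imageTensor has shape [batch, height, width, channels], both assumed to exist):
using (var tape = new GradientTape<double>())
{
    var theta = TensorOperations<double>.Variable(thetaTensor, "theta");
    var image = TensorOperations<double>.Variable(imageTensor, "image");
    tape.Watch(theta);
    tape.Watch(image);
    var grid = TensorOperations<double>.AffineGrid(theta, 32, 32);   // [batch, 32, 32, 2] sampling coordinates
    var warped = TensorOperations<double>.GridSample(image, grid);   // bilinear sampling at those coordinates
    var grads = tape.Gradient(warped, new[] { theta, image });
}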
GroupNorm(ComputationNode<T>, int, ComputationNode<T>?, ComputationNode<T>?, double)
Applies group normalization to a computation node.
public static ComputationNode<T> GroupNorm(ComputationNode<T> a, int numGroups, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, double epsilon = 1E-05)
Parameters
aComputationNode<T>The input node with shape [batch, channels, ...] where ... can be spatial dimensions.
numGroupsintThe number of groups to divide channels into.
gammaComputationNode<T>Optional scale parameter per channel. If null, uses ones.
betaComputationNode<T>Optional shift parameter per channel. If null, uses zeros.
epsilondoubleSmall constant for numerical stability. Default is 1e-5.
Returns
- ComputationNode<T>
A new computation node containing the group normalized result.
Remarks
Group normalization divides channels into groups and normalizes within each group. Unlike batch normalization, it doesn't depend on batch size, making it suitable for small batch sizes or generative models.
For Beginners: GroupNorm is an alternative to BatchNorm that works better when batch sizes are small.
For group normalization:
- Divides channels into groups (e.g., 32 groups for 256 channels = 8 channels per group)
- Normalizes each group independently: (x - mean) / sqrt(variance + epsilon)
- Scales and shifts per channel: result * gamma + beta
- Works the same during training and inference (no batch dependency)
Key advantages:
- Works with batch size of 1 (unlike BatchNorm)
- More stable for generative models (VAEs, GANs, diffusion models)
- Used in modern architectures like Stable Diffusion VAE
Typical usage:
- numGroups=32 for 256+ channels
- numGroups=16 for 128 channels
- numGroups=8 for 64 channels
GumbelSoftmax(ComputationNode<T>, double, bool)
Applies Gumbel-Softmax for differentiable discrete sampling approximation.
public static ComputationNode<T> GumbelSoftmax(ComputationNode<T> logits, double temperature = 1, bool hard = false)
Parameters
logitsComputationNode<T>The input logits.
temperaturedoubleTemperature parameter controlling softness (default 1.0).
hardboolWhether to use straight-through estimator for hard samples.
Returns
- ComputationNode<T>
A computation node containing the soft/hard samples.
Remarks
Gumbel-Softmax provides a differentiable approximation to categorical sampling. As temperature approaches 0, outputs approach one-hot categorical samples. When hard=true, uses straight-through estimator for discrete outputs with gradient pass-through.
HardSigmoid(ComputationNode<T>)
Applies the Hard Sigmoid activation function element-wise: f(x) = clip((x + 3) / 6, 0, 1).
public static ComputationNode<T> HardSigmoid(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with HardSigmoid applied.
Remarks
HardSigmoid is a piecewise linear approximation of sigmoid that is computationally efficient. The gradient is 1/6 when -3 < x < 3, and 0 otherwise.
For Beginners: HardSigmoid uses straight lines instead of curves, making it faster to compute while still mapping inputs to the [0, 1] range. It's commonly used in mobile and embedded neural networks.
HardTanh(ComputationNode<T>)
Applies the Hard Tanh activation function element-wise: f(x) = clip(x, -1, 1).
public static ComputationNode<T> HardTanh(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with HardTanh applied.
Remarks
HardTanh is a piecewise linear approximation of tanh that is computationally efficient. The gradient is 1 when -1 < x < 1, and 0 otherwise.
For Beginners: HardTanh clips values to the range [-1, 1], passing through values in the middle range unchanged. It's faster than regular tanh and useful when you need bounded outputs.
HierarchicalSoftmax(ComputationNode<T>, ComputationNode<T>, int)
Applies the Hierarchical Softmax activation function for efficient large-vocabulary classification.
public static ComputationNode<T> HierarchicalSoftmax(ComputationNode<T> input, ComputationNode<T> nodeWeights, int numClasses)
Parameters
inputComputationNode<T>The input computation node (2D: batch × inputDim).
nodeWeightsComputationNode<T>The tree node weights (2D: treeDepth × inputDim).
numClassesintNumber of output classes.
Returns
- ComputationNode<T>
A new computation node with HierarchicalSoftmax applied.
Remarks
Hierarchical Softmax organizes classes in a binary tree structure. Each node makes a binary decision using sigmoid, and the final probability is the product of probabilities along the path to each class.
Computational complexity is O(log N) instead of O(N) for standard softmax.
Gradient: Flows through sigmoid derivatives at each tree node.
HyperbolicLinear(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, double)
Hyperbolic linear transformation in the Poincare ball model.
public static ComputationNode<T> HyperbolicLinear(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? biases = null, double curvature = -1)
Parameters
inputComputationNode<T>Input tensor [batchSize, inputFeatures].
weightsComputationNode<T>Weight matrix in tangent space [outputFeatures, inputFeatures].
biasesComputationNode<T>Bias points on Poincare ball [outputFeatures, inputFeatures].
curvaturedoubleNegative curvature of hyperbolic space (default -1).
Returns
- ComputationNode<T>
Output tensor [batchSize, outputFeatures] with Poincare distances.
Remarks
Performs hyperbolic linear transformation:
- Project input to Poincare ball
- For each output: exp_origin(weight) → Mobius add with input → Mobius add with bias → distance from origin
ISRU(ComputationNode<T>, double)
Applies the Inverse Square Root Unit (ISRU) activation function.
public static ComputationNode<T> ISRU(ComputationNode<T> a, double alpha = 1)
Parameters
aComputationNode<T>The input computation node.
alphadoubleThe scaling parameter (default 1.0).
Returns
- ComputationNode<T>
A new computation node with ISRU applied.
Remarks
ISRU(x) = x / sqrt(1 + alpha * x²). A smooth, bounded activation function that ranges from -1/sqrt(alpha) to 1/sqrt(alpha).
Gradient: d(ISRU)/dx = (1 + alpha * x²)^(-3/2)
LSTMCell(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)
LSTM cell forward pass.
public static (ComputationNode<T>, ComputationNode<T>) LSTMCell(ComputationNode<T> input, ComputationNode<T> hiddenState, ComputationNode<T> cellState, ComputationNode<T> weightIH, ComputationNode<T> weightHH, ComputationNode<T> bias)
Parameters
inputComputationNode<T>Input tensor [batch, input_dim].
hiddenStateComputationNode<T>Previous hidden state [batch, hidden_dim].
cellStateComputationNode<T>Previous cell state [batch, hidden_dim].
weightIHComputationNode<T>Input-to-hidden weights [input_dim, 4*hidden_dim].
weightHHComputationNode<T>Hidden-to-hidden weights [hidden_dim, 4*hidden_dim].
biasComputationNode<T>Bias terms [4*hidden_dim].
Returns
- (ComputationNode<T>, ComputationNode<T>)
Tuple of (new hidden state, new cell state).
LayerNorm(ComputationNode<T>, int[], ComputationNode<T>?, ComputationNode<T>?, double)
Applies layer normalization to a computation node.
public static ComputationNode<T> LayerNorm(ComputationNode<T> a, int[] normalizedShape, ComputationNode<T>? gamma = null, ComputationNode<T>? beta = null, double epsilon = 1E-05)
Parameters
aComputationNode<T>The input node.
normalizedShapeint[]The shape over which to normalize (typically the feature dimensions).
gammaComputationNode<T>Optional scale parameter (learnable). If null, uses ones.
betaComputationNode<T>Optional shift parameter (learnable). If null, uses zeros.
epsilondoubleSmall constant for numerical stability. Default is 1e-5.
Returns
- ComputationNode<T>
A new computation node containing the layer normalized result.
Remarks
Layer normalization normalizes inputs across the feature dimension for each sample independently. Formula: y = gamma * (x - mean) / sqrt(variance + epsilon) + beta Unlike batch normalization, this doesn't depend on batch statistics.
For Beginners: LayerNorm standardizes features for each sample independently.
For layer normalization:
- Computes mean and variance for each sample's features
- Normalizes: (x - mean) / sqrt(variance)
- Scales and shifts: result * gamma + beta
- Works the same during training and inference (no batch dependency)
Used in:
- Transformers (critical component)
- RNNs (stabilizes training)
- Any architecture needing sample-independent normalization
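Example usage (an illustrative sketch; hiddenTensor has shape [batch, 128], and gammaTensor and betaTensor have shape [128], all assumed to exist):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(hiddenTensor, "x");
    var gamma = TensorOperations<double>.Variable(gammaTensor, "gamma");
    var beta = TensorOperations<double>.Variable(betaTensor, "beta");
    tape.Watch(x);
    tape.Watch(gamma);
    tape.Watch(beta);
    // Normalize over the last (feature) dimension, assumed here to be of size 128
    var y = TensorOperations<double>.LayerNorm(x, new[] { 128 }, gamma, beta);
    var grads = tape.Gradient(y, new[] { x, gamma, beta });
}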
LeakyReLU(ComputationNode<T>, double)
Applies the Leaky Rectified Linear Unit (LeakyReLU) activation function.
public static ComputationNode<T> LeakyReLU(ComputationNode<T> a, double alpha = 0.01)
Parameters
aComputationNode<T>The input computation node.
alphadoubleThe slope for negative values. Default is 0.01.
Returns
- ComputationNode<T>
A new computation node with LeakyReLU applied.
Remarks
LeakyReLU(x) = x if x > 0, alpha * x otherwise. Unlike ReLU, LeakyReLU allows a small gradient for negative inputs, preventing dying neurons.
Gradient: d(LeakyReLU)/dx = 1 if x > 0, alpha otherwise.
LeakyStateUpdate(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, double)
Leaky state update for reservoir/echo state networks.
public static ComputationNode<T> LeakyStateUpdate(ComputationNode<T> prevState, ComputationNode<T> input, ComputationNode<T> weights, double leakingRate = 1)
Parameters
prevStateComputationNode<T>Previous hidden state.
inputComputationNode<T>Current input.
weightsComputationNode<T>Reservoir weight matrix (can be frozen).
leakingRatedoubleLeaking rate (default 1.0 for full update).
Returns
- ComputationNode<T>
New hidden state.
Remarks
Computes: new_state = (1 - leakingRate) * prevState + leakingRate * tanh(weights @ prevState + input)
LiSHT(ComputationNode<T>)
Applies the LiSHT (Linearly Scaled Hyperbolic Tangent) activation function element-wise.
public static ComputationNode<T> LiSHT(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with LiSHT applied.
Remarks
LiSHT is defined as: f(x) = x * tanh(x). The gradient is: tanh(x) + x * (1 - tanh²(x)).
For Beginners: LiSHT combines the input with its tanh, creating a smooth activation that preserves sign and helps prevent vanishing gradients.
LocallyConnectedConv2D(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?, int[]?)
Performs locally connected 2D convolution where weights are NOT shared across spatial locations.
public static ComputationNode<T> LocallyConnectedConv2D(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? bias = null, int[]? stride = null)
Parameters
inputComputationNode<T>Input tensor of shape [batch, in_channels, height, width]
weightsComputationNode<T>Weight tensor of shape [out_h, out_w, out_channels, in_channels, kernel_h, kernel_w]
biasComputationNode<T>Optional bias tensor of shape [out_channels]
strideint[]Stride for the convolution, defaults to [1, 1]
Returns
- ComputationNode<T>
Output tensor of shape [batch, out_channels, out_h, out_w]
Remarks
Locally connected convolution is like regular convolution but uses different weights for each spatial output location. This increases parameters but allows position-specific feature detection.
Unlike Conv2D where weights are shared across all positions, LocallyConnectedConv2D uses unique weights for each (h,w) output position. This is useful when different regions have fundamentally different characteristics (e.g., face recognition where eyes/nose/mouth are at specific locations).
Forward pass applies position-specific filters at each output location. Backward pass computes gradients with respect to input, position-specific weights, and bias.
Log(ComputationNode<T>)
Computes the natural logarithm for a computation node.
public static ComputationNode<T> Log(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the logarithm result.
Remarks
This method computes the natural logarithm of each element and records the operation. The backward function uses: ∂(log(a))/∂a = 1/a.
For Beginners: This computes the natural log and tracks gradients.
For logarithm (c = log(a)):
- The forward pass computes log for each element
- The backward pass uses: gradient to 'a' is incoming gradient * (1/a)
Logarithms are used in loss functions like cross-entropy.
LogSoftmax(ComputationNode<T>, int)
Applies the Log-Softmax function for numerically stable cross-entropy loss computation.
public static ComputationNode<T> LogSoftmax(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input computation node.
axisintThe axis along which to compute log-softmax (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with Log-Softmax applied.
Remarks
LogSoftmax(x) = log(softmax(x)) = x - log(sum(exp(x))). This is more numerically stable than computing log(softmax(x)) separately.
Gradient: dL/dx_i = dL/dy_i - softmax(x)_i * sum_j(dL/dy_j) where y = LogSoftmax(x) and dL/dy is the incoming gradient.
LogSoftmin(ComputationNode<T>, int)
Applies the Log-Softmin function for numerically stable computation.
public static ComputationNode<T> LogSoftmin(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input computation node.
axisintThe axis along which to compute log-softmin (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with Log-Softmin applied.
Remarks
LogSoftmin(x) = log(softmin(x)) = -x - log(sum(exp(-x))). Combines log and softmin for numerical stability.
MatrixMultiply(ComputationNode<T>, ComputationNode<T>)
Performs matrix multiplication on two computation nodes.
public static ComputationNode<T> MatrixMultiply(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The left matrix (must be 2D).
bComputationNode<T>The right matrix (must be 2D).
Returns
- ComputationNode<T>
A computation node representing the matrix product.
Remarks
Computes C = A·B where A has shape [m, n] and B has shape [n, p], resulting in C with shape [m, p].
Gradient computation:
- ∂(A·B)/∂A = gradOut·B^T
- ∂(A·B)/∂B = A^T·gradOut
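Example usage (an illustrative sketch of a dense-layer forward pass; inputTensor is [batch, inFeatures] and weightTensor is [inFeatures, outFeatures], both assumed to exist):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var w = TensorOperations<double>.Variable(weightTensor, "w");
    tape.Watch(x);
    tape.Watch(w);
    var y = TensorOperations<double>.MatrixMultiply(x, w);   // [batch, outFeatures]
    var grads = tape.Gradient(y, new[] { x, w });            // gradOut·W^T for x, X^T·gradOut for w
}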
MatrixVectorMultiply(ComputationNode<T>, ComputationNode<T>)
Performs a matrix-vector multiplication (2D x 1D) by reshaping the vector into a column matrix.
public static ComputationNode<T> MatrixVectorMultiply(ComputationNode<T> matrix, ComputationNode<T> vector)
Parameters
matrixComputationNode<T>The left matrix (must be 2D).
vectorComputationNode<T>The right vector (must be 1D).
Returns
- ComputationNode<T>
A computation node representing the vector result.
MaxPool2D(ComputationNode<T>, int[], int[]?)
Performs 2D max pooling on a 4D tensor (batch, channels, height, width).
public static ComputationNode<T> MaxPool2D(ComputationNode<T> a, int[] poolSize, int[]? strides = null)
Parameters
aComputationNode<T>The input node with shape [batch, channels, height, width].
poolSizeint[]The size of the pooling window [poolH, poolW].
stridesint[]The stride for the pooling operation [strideH, strideW]. If null, uses poolSize.
Returns
- ComputationNode<T>
A new computation node containing the max pooled result.
Remarks
This method performs max pooling over 2D spatial dimensions. During forward pass, it tracks which element was the max for routing gradients during backward pass.
For Beginners: MaxPool downsamples by taking the maximum value in each window.
For max pooling:
- The forward pass slides a window and takes the max value in each position
- This reduces spatial dimensions (downsampling)
- The backward pass routes gradients only to the positions that were max
- Other positions get zero gradient (they didn't contribute to the output)
Used in:
- CNNs for translation invariance
- Reducing spatial resolution
- Building hierarchical features
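Example usage (an illustrative sketch; featureMapTensor has shape [batch, channels, height, width] and is assumed to exist):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(featureMapTensor, "x");
    tape.Watch(x);
    var pooled = TensorOperations<double>.MaxPool2D(x, new[] { 2, 2 });   // halves height and width
    var grads = tape.Gradient(pooled, new[] { x });                       // gradient routed only to the max positions
}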
MaxPool3D(ComputationNode<T>, int[], int[]?)
Performs 3D max pooling on a 5D tensor (batch, channels, depth, height, width).
public static ComputationNode<T> MaxPool3D(ComputationNode<T> input, int[] poolSize, int[]? strides = null)
Parameters
inputComputationNode<T>The input node with shape [batch, channels, depth, height, width].
poolSizeint[]The size of the pooling window [poolD, poolH, poolW].
stridesint[]The stride for the pooling operation [strideD, strideH, strideW]. If null, uses poolSize.
Returns
- ComputationNode<T>
A new computation node containing the max pooled result.
Remarks
This method performs max pooling over 3D spatial dimensions (depth, height, width). The backward function routes gradients only to the maximum values in each pooling window.
For Beginners: MaxPool3D downsamples volumetric data by taking the maximum value in each window.
For max pooling:
- The forward pass slides a 3D window and takes the maximum
- This reduces the spatial dimensions while preserving the strongest activations
- The backward pass routes gradients only to where the max came from
- Non-max elements get zero gradient
Used in:
- Voxel-based 3D CNNs for shape classification
- Medical image analysis (CT/MRI)
- Video processing
Maxout(ComputationNode<T>, int)
Applies the Maxout activation function which takes maximum over groups of inputs.
public static ComputationNode<T> Maxout(ComputationNode<T> a, int numPieces = 2)
Parameters
aComputationNode<T>The input computation node (2D: batch × features).
numPiecesintNumber of inputs per group (default 2).
Returns
- ComputationNode<T>
A new computation node with Maxout applied.
Remarks
Maxout groups consecutive features and outputs the maximum from each group. Input features must be divisible by numPieces. Output shape: [batch, features / numPieces].
Gradient: Flows only to the maximum element in each group (sparse gradient).
Mean(ComputationNode<T>)
Computes the mean of elements in a computation node.
public static ComputationNode<T> Mean(ComputationNode<T> a)
Parameters
aComputationNode<T>The computation node to compute mean of.
Returns
- ComputationNode<T>
A computation node representing the mean (scalar).
Remarks
Computes the average of all elements in the tensor.
Gradient computation:
- ∂(mean(A))/∂A = gradOut / count
- Each element gets an equal share of the gradient, divided by the total count.
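Example usage (an illustrative sketch of a mean-style loss reduction; errorTensor is an assumed Tensor<double> of per-element errors):
using (var tape = new GradientTape<double>())
{
    var err = TensorOperations<double>.Variable(errorTensor, "err");
    tape.Watch(err);
    var loss = TensorOperations<double>.Mean(err);      // scalar
    var grads = tape.Gradient(loss, new[] { err });     // every element receives 1 / count
}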
Mish(ComputationNode<T>)
Applies the Mish activation function.
public static ComputationNode<T> Mish(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with Mish applied.
Remarks
Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))). Mish is a smooth, self-regularizing activation function.
Gradient: d(Mish)/dx = tanh(softplus(x)) + x * sech²(softplus(x)) * sigmoid(x)
MobiusAdd(ComputationNode<T>, ComputationNode<T>, double)
Mobius addition in the Poincare ball model.
public static ComputationNode<T> MobiusAdd(ComputationNode<T> x, ComputationNode<T> y, double curvature = -1)
Parameters
xComputationNode<T>First point tensor [batchSize, dim] or [dim].
yComputationNode<T>Second point tensor with same shape as x.
curvaturedoubleNegative curvature of hyperbolic space (default -1).
Returns
- ComputationNode<T>
Result of Mobius addition x ⊕ y.
Remarks
Mobius addition is the hyperbolic analog of vector addition: x ⊕ y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
MultiHeadAttention(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, int, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)
Applies multi-head attention mechanism.
public static ComputationNode<T> MultiHeadAttention(ComputationNode<T> query, ComputationNode<T> key, ComputationNode<T> value, int numHeads, ComputationNode<T> wQ, ComputationNode<T> wK, ComputationNode<T> wV, ComputationNode<T> wO)
Parameters
queryComputationNode<T>Query tensor.
keyComputationNode<T>Key tensor.
valueComputationNode<T>Value tensor.
numHeadsintNumber of attention heads.
wQComputationNode<T>Query projection weights.
wKComputationNode<T>Key projection weights.
wVComputationNode<T>Value projection weights.
wOComputationNode<T>Output projection weights.
Returns
- ComputationNode<T>
Multi-head attention output.
Negate(ComputationNode<T>)
Negates a computation node (computes -a).
public static ComputationNode<T> Negate(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the negated result.
Remarks
This method negates each element and records the operation. The backward function simply negates the incoming gradient.
For Beginners: This flips the sign of each element.
For negation (c = -a):
- The forward pass flips signs (positive becomes negative, vice versa)
- The backward pass also flips the gradient sign
Norm(ComputationNode<T>, int, bool, double)
Computes the L2 norm along a specified axis.
public static ComputationNode<T> Norm(ComputationNode<T> a, int axis = -1, bool keepDims = false, double epsilon = 1E-12)
Parameters
aComputationNode<T>The input node.
axisintThe axis along which to compute the norm. Default is -1 (last axis).
keepDimsboolWhether to keep the reduced dimensions. Default is false.
epsilondoubleSmall value for numerical stability. Default is 1e-12.
Returns
- ComputationNode<T>
A new computation node containing the norm along the specified axis.
Remarks
This method computes the L2 (Euclidean) norm: sqrt(sum(x²)) along the specified axis. The gradient is computed as: ∂||x||/∂x = x / ||x||.
For Beginners: The norm measures the "length" of vectors.
For example, with axis=-1:
- Input shape: [batch, features]
- Output shape: [batch] (or [batch, 1] with keepDims=true)
- Each output value is sqrt(sum of squares along that row)
This is commonly used in capsule networks to compute capsule lengths, and in normalization operations.
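Example usage (a minimal sketch for capsule lengths; capsuleTensor is assumed to be an existing [batch, capsuleDim] Tensor<double>):
using (var tape = new GradientTape<double>())
{
    var capsules = TensorOperations<double>.Variable(capsuleTensor, "capsules");
    tape.Watch(capsules);

    // L2 length of each capsule along the feature axis: [batch, capsuleDim] -> [batch]
    var lengths = TensorOperations<double>.Norm(capsules, axis: -1, keepDims: false);
    var loss = TensorOperations<double>.Mean(lengths); // placeholder loss for illustration
    var gradients = tape.Gradient(loss, new[] { capsules });
}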
OctonionMatMul(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)
Performs octonion matrix multiplication for OctonionLinearLayer.
public static ComputationNode<T> OctonionMatMul(ComputationNode<T> input, ComputationNode<T> weights, ComputationNode<T>? biases = null)
Parameters
inputComputationNode<T>Input tensor with shape [batch, inputFeatures * 8] where each group of 8 represents an octonion.
weightsComputationNode<T>Weight tensor with shape [outputFeatures, inputFeatures, 8] where last dimension is octonion components.
biasesComputationNode<T>Optional bias tensor with shape [outputFeatures, 8].
Returns
- ComputationNode<T>
Output tensor with shape [batch, outputFeatures * 8].
Remarks
Octonions are 8-dimensional numbers that generalize quaternions. They are non-associative but can capture more complex relationships in data. This operation performs: output[b, o] = sum_i(input[b, i] * weights[o, i]) + biases[o] where * is octonion multiplication.
For Beginners: This is like matrix multiplication but using 8-dimensional octonion numbers instead of regular numbers. Each octonion has 8 components: (scalar, e1, e2, e3, e4, e5, e6, e7).
PReLU(ComputationNode<T>, double)
Applies the Parametric Rectified Linear Unit (PReLU) activation function.
public static ComputationNode<T> PReLU(ComputationNode<T> a, double alpha = 0.01)
Parameters
aComputationNode<T>The input computation node.
alphadoubleThe slope for negative values (default 0.01).
Returns
- ComputationNode<T>
A new computation node with PReLU applied.
Remarks
PReLU(x) = x if x > 0, alpha * x otherwise. Similar to LeakyReLU but alpha is typically learned during training.
Gradient: d(PReLU)/dx = 1 if x > 0, alpha otherwise.
Pad(ComputationNode<T>, int[,], T?)
Pads a tensor with a constant value along specified dimensions.
public static ComputationNode<T> Pad(ComputationNode<T> a, int[,] padWidth, T? value = default)
Parameters
aComputationNode<T>The input node.
padWidthint[,]Padding width for each dimension as (before, after) pairs.
valueTThe value to use for padding. Default is zero.
Returns
- ComputationNode<T>
A new computation node containing the padded result.
Remarks
This method adds padding around the tensor. The backward function simply crops the gradient back to the original size (gradients for padding are zero).
For Beginners: Pad adds extra elements around a tensor.
For padding:
- The forward pass adds border elements with a constant value
- The backward pass removes those border gradients (they don't affect the original tensor)
- Think of it like adding margins to an image
Used in:
- Convolutional layers (to maintain spatial dimensions)
- Handling variable-length sequences
- Data augmentation
Pad(ComputationNode<T>, int[])
Pads a tensor with zeros along specified dimensions.
public static ComputationNode<T> Pad(ComputationNode<T> a, int[] padding)
Parameters
aComputationNode<T>The input computation node to pad.
paddingint[]Array specifying padding amount for each dimension (applied symmetrically on both sides).
Returns
- ComputationNode<T>
A new computation node containing the padded tensor.
Remarks
This method pads the input tensor by adding zeros around each dimension. The padding array specifies how many zeros to add on BOTH sides of each dimension. For example, padding[1] = 2 means add 2 zeros on the left AND 2 zeros on the right of dimension 1.
The backward function for padding simply extracts the non-padded region from the output gradient, since ∂(pad(x))/∂x is an extraction operation that removes the padded regions.
For Beginners: Padding adds a border of zeros around your data.
For padding (output = pad(input, [p0, p1, ...])):
- The forward pass creates a larger tensor and copies input to the center
- Padding p on dimension d means: add p zeros on left, p zeros on right
- The backward pass extracts the center region from the gradient (removes the padding)
This is commonly used in convolutional neural networks to preserve spatial dimensions.
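Example usage (a minimal sketch; featureMapTensor is assumed to be an existing [batch, channels, height, width] Tensor<double>):
var x = TensorOperations<double>.Variable(featureMapTensor, "x");
// Add one zero on each side of the two spatial dimensions: [b, c, h, w] -> [b, c, h+2, w+2]
var padded = TensorOperations<double>.Pad(x, new[] { 0, 0, 1, 1 });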
Permute(ComputationNode<T>, params int[])
Permutes the dimensions of a computation node (general transpose).
public static ComputationNode<T> Permute(ComputationNode<T> a, params int[] axes)
Parameters
aComputationNode<T>The computation node to permute.
axesint[]The new order of dimensions.
Returns
- ComputationNode<T>
A computation node with permuted dimensions.
Remarks
Rearranges dimensions according to the axes array. Equivalent to Transpose but for N dimensions.
Gradient computation: - ∂(Permute(A))/∂A = Permute(gradOut, inverseAxes)
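Example usage (a minimal sketch converting channels-first data to channels-last; imageTensor is assumed to be an existing [batch, channels, height, width] Tensor<double>):
var x = TensorOperations<double>.Variable(imageTensor, "x");
// Reorder [batch, channels, height, width] -> [batch, height, width, channels]
var nhwc = TensorOperations<double>.Permute(x, 0, 2, 3, 1);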
PixelShuffle(ComputationNode<T>, int)
Performs pixel shuffle (depth-to-space) operation for sub-pixel convolution.
public static ComputationNode<T> PixelShuffle(ComputationNode<T> a, int upscaleFactor)
Parameters
aComputationNode<T>The input computation node with shape [batch, channels, height, width].
upscaleFactorintThe upscaling factor (r). Channels must be divisible by r².
Returns
- ComputationNode<T>
A computation node with shape [batch, channels/(r²), height*r, width*r].
PoincareDistance(ComputationNode<T>, ComputationNode<T>, double)
Computes the Poincare ball distance between two points.
public static ComputationNode<T> PoincareDistance(ComputationNode<T> x, ComputationNode<T> y, double curvature = -1)
Parameters
xComputationNode<T>First point tensor [batchSize, dim] or [dim].
yComputationNode<T>Second point tensor with same shape as x.
curvaturedoubleNegative curvature of hyperbolic space (default -1).
Returns
- ComputationNode<T>
Distance tensor [batchSize] or scalar.
Remarks
The Poincare distance between points x and y is: d(x, y) = (2/sqrt(c)) * arctanh(sqrt(c) || -x ⊕ y ||)
PoincareExpMap(ComputationNode<T>, ComputationNode<T>, double)
Poincare ball exponential map from tangent space at a point.
public static ComputationNode<T> PoincareExpMap(ComputationNode<T> point, ComputationNode<T> tangent, double curvature = -1)
Parameters
pointComputationNode<T>Base point on the Poincare ball [batchSize, dim] or [dim].
tangentComputationNode<T>Tangent vector at the point with same shape.
curvaturedoubleNegative curvature of hyperbolic space (default -1).
Returns
- ComputationNode<T>
Point on manifold after following geodesic.
Remarks
The exponential map takes a tangent vector at point p and returns the point reached by following the geodesic in that direction: exp_p(v) = p ⊕ (tanh(sqrt(c)||v||_p / 2) * v / (sqrt(c)||v||)) where ||v||_p = ||v|| * 2 / (1 - c||p||²) is the Poincare norm.
PoincareLogMap(ComputationNode<T>, ComputationNode<T>, double)
Poincare ball logarithmic map to tangent space at a point.
public static ComputationNode<T> PoincareLogMap(ComputationNode<T> point, ComputationNode<T> target, double curvature = -1)
Parameters
pointComputationNode<T>Base point on the Poincare ball [batchSize, dim] or [dim].
targetComputationNode<T>Target point on the Poincare ball with same shape.
curvaturedoubleNegative curvature of hyperbolic space (default -1).
Returns
- ComputationNode<T>
Tangent vector at point pointing towards target.
Remarks
The logarithmic map is the inverse of the exponential map: log_p(q) = (2 / (sqrt(c) * lambda_p)) * arctanh(sqrt(c) || -p ⊕ q ||) * (-p ⊕ q) / || -p ⊕ q ||
PoincareProject(ComputationNode<T>, double, double)
Projects a point onto the Poincare ball to ensure it stays inside the unit ball.
public static ComputationNode<T> PoincareProject(ComputationNode<T> point, double curvature = -1, double epsilon = 1E-05)
Parameters
pointComputationNode<T>Input point tensor [batchSize, dim] or [dim].
curvaturedoubleNegative curvature of hyperbolic space (default -1).
epsilondoubleSmall value for numerical stability.
Returns
- ComputationNode<T>
Projected point on the Poincare ball.
Remarks
Projects points that are outside or on the boundary of the Poincare ball back inside by scaling to have norm slightly less than 1/sqrt(|c|).
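Example usage (a minimal sketch chaining the Poincare operations above; pointTensor and tangentTensor are assumed to be existing [batchSize, dim] Tensor<double> instances):
using (var tape = new GradientTape<double>())
{
    var p = TensorOperations<double>.Variable(pointTensor, "p");
    var v = TensorOperations<double>.Variable(tangentTensor, "v");
    tape.Watch(p);
    tape.Watch(v);

    // Follow the geodesic from p in direction v, keep the result inside the ball,
    // then measure the hyperbolic distance travelled.
    var q = TensorOperations<double>.PoincareExpMap(p, v);
    q = TensorOperations<double>.PoincareProject(q);
    var dist = TensorOperations<double>.PoincareDistance(p, q);
    var loss = TensorOperations<double>.Mean(dist); // placeholder loss for illustration
    var gradients = tape.Gradient(loss, new[] { p, v });
}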
Power(ComputationNode<T>, double)
Raises a computation node to a power.
public static ComputationNode<T> Power(ComputationNode<T> a, double exponent)
Parameters
aComputationNode<T>The base node.
exponentdoubleThe exponent value.
Returns
- ComputationNode<T>
A new computation node containing the power operation result.
Remarks
This method raises each element to a power and records the operation. The backward function uses the power rule: ∂(a^n)/∂a = n * a^(n-1).
For Beginners: This raises a tensor to a power and tracks gradients.
For power operation (c = a^n):
- The forward pass raises each element to the power
- The backward pass uses the power rule from calculus
- Gradient to 'a' is: incoming gradient * n * a^(n-1)
Example: If a=[2,3] and n=2, then c=[4,9]. If the gradient flowing to c is [1,1]:
- 'a' receives [1*2*2^1, 1*2*3^1] = [4, 6]
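Example usage (a minimal sketch reproducing the worked example; aTensor is assumed to be an existing Tensor<double> holding [2, 3]):
using (var tape = new GradientTape<double>())
{
    var a = TensorOperations<double>.Variable(aTensor, "a");
    tape.Watch(a);

    var c = TensorOperations<double>.Power(a, 2.0);  // c = [4, 9]
    var loss = TensorOperations<double>.Sum(c);      // a gradient of 1 flows to each element of c
    var gradients = tape.Gradient(loss, new[] { a }); // gradient w.r.t. a is [4, 6]
}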
RBFKernel(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>)
Computes Gaussian Radial Basis Function (RBF) kernel activations.
public static ComputationNode<T> RBFKernel(ComputationNode<T> input, ComputationNode<T> centers, ComputationNode<T> epsilons)
Parameters
inputComputationNode<T>Input tensor of shape [batch, inputSize]
centersComputationNode<T>Center points tensor of shape [numCenters, inputSize]
epsilonsComputationNode<T>Width parameters tensor of shape [numCenters]
Returns
- ComputationNode<T>
Output tensor of shape [batch, numCenters] containing RBF activations
Remarks
This operation implements the Gaussian RBF: f(r) = exp(-epsilon * r²) where r is the Euclidean distance between input and center.
Forward pass: for each input and center pair, computes:
1. distance = sqrt(sum((input - center)²))
2. output = exp(-epsilon * distance²)
Backward pass gradients:
- ∂L/∂input = ∂L/∂output * output * (-2 * epsilon) * (input - center)
- ∂L/∂centers = -∂L/∂input (the gradient flows in the opposite direction)
- ∂L/∂epsilon = ∂L/∂output * (-distance²) * output
For Beginners: This operation creates "similarity scores" between inputs and centers. Each RBF neuron responds strongly (value near 1) when input is close to its center, and weakly (value near 0) when far away. The epsilon parameter controls how quickly the response decreases with distance.
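Example usage (a minimal sketch; inputTensor, centersTensor and epsilonsTensor are assumed to be existing Tensor<double> instances with the shapes listed above):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(inputTensor, "x");
    var centers = TensorOperations<double>.Variable(centersTensor, "centers");
    var epsilons = TensorOperations<double>.Variable(epsilonsTensor, "epsilons");
    tape.Watch(centers);
    tape.Watch(epsilons);

    // [batch, numCenters] similarity scores; centers and widths are trainable
    var activations = TensorOperations<double>.RBFKernel(x, centers, epsilons);
    var loss = TensorOperations<double>.Mean(activations); // placeholder loss for illustration
    var gradients = tape.Gradient(loss, new[] { centers, epsilons });
}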
RReLU(ComputationNode<T>, double, double, bool, int?)
Applies the Randomized Leaky ReLU (RReLU) activation function.
public static ComputationNode<T> RReLU(ComputationNode<T> a, double lower = 0.125, double upper = 0.333, bool isTraining = false, int? seed = null)
Parameters
aComputationNode<T>The input computation node.
lowerdoubleLower bound for alpha (default 1/8).
upperdoubleUpper bound for alpha (default 1/3).
isTrainingboolIf true, samples a random alpha from [lower, upper]; if false, uses the average (lower + upper) / 2. Default is false, the deterministic mode used for inference and JIT compilation.
seedint?Optional random seed for reproducibility.
Returns
- ComputationNode<T>
A new computation node with RReLU applied.
Remarks
RReLU(x) = x if x >= 0, alpha * x otherwise. During training, alpha is sampled uniformly from [lower, upper]. During inference (JIT default), alpha = (lower + upper) / 2.
Gradient: 1 for x >= 0, alpha for x < 0.
ReLU(ComputationNode<T>)
Computes the ReLU (Rectified Linear Unit) activation for a computation node.
public static ComputationNode<T> ReLU(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the ReLU result.
Remarks
This method computes ReLU (max(0, x)) and records the operation. The backward function uses: ∂ReLU(a)/∂a = 1 if a > 0, else 0.
For Beginners: ReLU is the most popular activation function in deep learning.
For ReLU (c = max(0, a)):
- The forward pass keeps positive values, zeros out negative values
- The backward pass: gradient flows through if input was positive, blocked if negative
ReLU is popular because:
- Very fast to compute
- Helps avoid vanishing gradients
- Works well in practice for deep networks
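Example usage (a minimal sketch; preActivationTensor is assumed to be an existing Tensor<double>):
using (var tape = new GradientTape<double>())
{
    var x = TensorOperations<double>.Variable(preActivationTensor, "x");
    tape.Watch(x);

    var y = TensorOperations<double>.ReLU(x);
    var loss = TensorOperations<double>.Sum(y);
    // Gradient w.r.t. x is 1 where x > 0 and 0 elsewhere
    var gradients = tape.Gradient(loss, new[] { x });
}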
ReduceLogVariance(ComputationNode<T>, int, double)
Computes the natural logarithm of variance along the specified axis.
public static ComputationNode<T> ReduceLogVariance(ComputationNode<T> input, int axis, double epsilon = 1E-08)
Parameters
inputComputationNode<T>Input tensor of any shape
axisintThe axis along which to compute variance (must be specified)
epsilondoubleSmall constant for numerical stability (default: 1e-8)
Returns
- ComputationNode<T>
Tensor with reduced shape containing log-variance values
Remarks
This operation computes log(variance + epsilon) along the specified axis. The output shape has the specified axis dimension removed from the input shape.
Forward pass: log(variance + epsilon) where variance = mean((x - mean(x))^2)
Backward pass uses chain rule: ∂L/∂x_i = ∂L/∂log_var * (1/variance) * (2/N) * (x_i - mean) where N is the size of the reduction axis.
For Beginners: This operation measures how spread out values are along an axis, then takes the logarithm. Commonly used in variational autoencoders and uncertainty estimation.
ReduceMax(ComputationNode<T>, int[]?, bool)
Reduces a tensor by computing the maximum value along specified axes.
public static ComputationNode<T> ReduceMax(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)
Parameters
aComputationNode<T>The input computation node.
axesint[]The axes along which to compute the maximum. If null, reduces over all axes.
keepDimsboolWhether to keep the reduced dimensions with size 1.
Returns
- ComputationNode<T>
A computation node representing the result of the reduce max operation.
ReduceMean(ComputationNode<T>, int[]?, bool)
Reduces a tensor by computing the mean value along specified axes.
public static ComputationNode<T> ReduceMean(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)
Parameters
aComputationNode<T>The input computation node.
axesint[]The axes along which to compute the mean. If null, reduces over all axes.
keepDimsboolWhether to keep the reduced dimensions with size 1.
Returns
- ComputationNode<T>
A computation node representing the result of the reduce mean operation.
Reshape(ComputationNode<T>, params int[])
Reshapes a computation node to a new shape.
public static ComputationNode<T> Reshape(ComputationNode<T> a, params int[] newShape)
Parameters
aComputationNode<T>The computation node to reshape.
newShapeint[]The new shape (must have same total number of elements).
Returns
- ComputationNode<T>
A computation node with the new shape.
Remarks
Changes the shape of the tensor without changing the underlying data. The total number of elements must remain the same.
Gradient computation: - ∂(Reshape(A))/∂A = Reshape(gradOut, A.Shape) - Simply reshape the gradient back to the original shape.
SELU(ComputationNode<T>)
Applies the SELU (Scaled Exponential Linear Unit) activation function element-wise.
public static ComputationNode<T> SELU(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with SELU applied.
Remarks
SELU is defined as: λ * x if x > 0, otherwise λ * α * (e^x - 1) where λ ≈ 1.0507 and α ≈ 1.6733 are fixed constants for self-normalization. The gradient is: λ if x > 0, otherwise λ * α * e^x
For Beginners: SELU enables self-normalizing neural networks where activations converge to zero mean and unit variance, reducing the need for batch normalization.
SQRBF(ComputationNode<T>, double)
Applies the Squared Radial Basis Function (SQRBF) activation.
public static ComputationNode<T> SQRBF(ComputationNode<T> a, double beta = 1)
Parameters
aComputationNode<T>The input computation node.
betadoubleThe width parameter controlling the Gaussian bell curve (default 1.0).
Returns
- ComputationNode<T>
A new computation node with SQRBF applied.
Remarks
SQRBF(x) = exp(-β * x²) A Gaussian bell-shaped activation with maximum at x=0 and values approaching 0 as |x| increases.
Gradient: d(SQRBF)/dx = -2βx * exp(-β * x²)
ScaledDotProductAttention(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, ComputationNode<T>?)
Computes scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V.
public static ComputationNode<T> ScaledDotProductAttention(ComputationNode<T> query, ComputationNode<T> key, ComputationNode<T> value, ComputationNode<T>? mask = null)
Parameters
queryComputationNode<T>Query tensor [batch, seq_len_q, d_k].
keyComputationNode<T>Key tensor [batch, seq_len_k, d_k].
valueComputationNode<T>Value tensor [batch, seq_len_k, d_v].
maskComputationNode<T>Optional attention mask.
Returns
- ComputationNode<T>
Attention output [batch, seq_len_q, d_v].
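Example usage (a minimal sketch; queryTensor, keyTensor and valueTensor are assumed to be existing Tensor<double> instances with the shapes listed above):
var q = TensorOperations<double>.Variable(queryTensor, "q"); // [batch, seq_len_q, d_k]
var k = TensorOperations<double>.Variable(keyTensor, "k");   // [batch, seq_len_k, d_k]
var v = TensorOperations<double>.Variable(valueTensor, "v"); // [batch, seq_len_k, d_v]
// Unmasked attention: softmax(Q K^T / sqrt(d_k)) V -> [batch, seq_len_q, d_v]
var output = TensorOperations<double>.ScaledDotProductAttention(q, k, v);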
ScaledTanh(ComputationNode<T>, double)
Applies the Scaled Tanh activation function element-wise.
public static ComputationNode<T> ScaledTanh(ComputationNode<T> a, double beta = 1)
Parameters
aComputationNode<T>The input computation node.
betadoubleThe steepness parameter. Default is 1.0.
Returns
- ComputationNode<T>
A new computation node with ScaledTanh applied.
Remarks
ScaledTanh is defined as: f(x) = (1 - exp(-βx)) / (1 + exp(-βx)), which is equivalent to tanh(βx / 2). The gradient is: (β / 2) * (1 - f(x)²). When β = 2, this equals standard tanh.
For Beginners: ScaledTanh allows you to control the steepness of the tanh curve, which can be useful for tuning network behavior.
Sigmoid(ComputationNode<T>)
Computes the sigmoid function for a computation node.
public static ComputationNode<T> Sigmoid(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the sigmoid result.
Remarks
This method computes sigmoid (σ(x) = 1/(1+e^(-x))) and records the operation. The backward function uses: ∂σ(a)/∂a = σ(a) * (1 - σ(a)).
For Beginners: Sigmoid squashes values to be between 0 and 1.
For sigmoid (c = σ(a)):
- The forward pass computes 1/(1+e^(-x)) for each element
- The backward pass: gradient to 'a' is incoming gradient * output * (1 - output)
Sigmoid is used in binary classification and as a gate in LSTM networks.
Sign(ComputationNode<T>, double)
Computes the element-wise sign of the input, using a smooth surrogate for gradient computation.
public static ComputationNode<T> Sign(ComputationNode<T> a, double surrogateBeta = 1)
Parameters
aComputationNode<T>The input node.
surrogateBetadoubleSharpness of the surrogate gradient (default 1.0).
Returns
- ComputationNode<T>
A new computation node containing the sign of each element, with surrogate gradients used for backpropagation.
SinusoidalTimeEmbedding(ComputationNode<T>, int)
Creates sinusoidal time embeddings for diffusion models.
public static ComputationNode<T> SinusoidalTimeEmbedding(ComputationNode<T> timesteps, int embeddingDim)
Parameters
timestepsComputationNode<T>The timesteps to embed [batchSize] or [batchSize, 1].
embeddingDimintThe dimension of the output embeddings.
Returns
- ComputationNode<T>
A computation node with sinusoidal embeddings [batchSize, embeddingDim].
Slice(ComputationNode<T>, int, int, int, int)
Extracts a slice from a tensor along a specified axis.
public static ComputationNode<T> Slice(ComputationNode<T> a, int start, int length, int step = 1, int axis = 0)
Parameters
aComputationNode<T>The input tensor to slice.
startintThe starting index along the specified axis.
lengthintThe number of elements to extract.
stepintThe step size between elements (default 1).
axisintThe axis along which to slice (default 0).
Returns
- ComputationNode<T>
A new computation node containing the sliced tensor.
Remarks
This operation extracts a portion of a tensor along a specified axis, starting at a given offset and continuing for a specified length. An optional step parameter allows for strided slicing (e.g., every 2nd element).
For Beginners: Think of this like taking a substring from a string.
For example, if you have a tensor [1, 2, 3, 4, 5, 6] and you slice with start=1, length=3:
- You get [2, 3, 4]
With step=2 and start=0, length=3:
- You get [1, 3, 5] (every 2nd element)
This is useful for extracting specific parts of data, like separating real and imaginary parts of complex numbers stored in interleaved format.
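Example usage (a minimal sketch; interleavedTensor is assumed to be an existing [batch, 128] Tensor<double> storing 64 complex values per row with real and imaginary parts interleaved):
var z = TensorOperations<double>.Variable(interleavedTensor, "z"); // [batch, 128]: re0, im0, re1, im1, ...
// Every 2nd element starting at index 0 -> the 64 real parts
var real = TensorOperations<double>.Slice(z, start: 0, length: 64, step: 2, axis: 1);
// Every 2nd element starting at index 1 -> the 64 imaginary parts
var imag = TensorOperations<double>.Slice(z, start: 1, length: 64, step: 2, axis: 1);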
SoftKNN(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, T?)
Performs a soft K-Nearest Neighbors operation for differentiable instance-based learning.
public static ComputationNode<T> SoftKNN(ComputationNode<T> input, ComputationNode<T> supportVectors, ComputationNode<T> labels, T? temperature = default)
Parameters
inputComputationNode<T>The query input tensor.
supportVectorsComputationNode<T>Matrix of support vectors (training points) [n_samples, n_features].
labelsComputationNode<T>Labels for each support vector [n_samples] or [n_samples, n_outputs].
temperatureTTemperature for softmax attention (default: 1.0).
Returns
- ComputationNode<T>
Attention-weighted sum of labels.
Remarks
Computes: distances[i] = ||input - supportVectors[i]||² weights = softmax(-distances / temperature) output = Σ weights[i] * labels[i]
For Beginners: Instead of finding exactly k nearest neighbors, this computes attention weights for ALL neighbors based on distance. Closer neighbors get higher attention. This makes KNN differentiable and JIT-compilable.
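Example usage (a minimal sketch; queryTensor, supportTensor and labelTensor are assumed to be existing Tensor<double> instances with the shapes listed above):
var query = TensorOperations<double>.Variable(queryTensor, "query");
var support = TensorOperations<double>.Variable(supportTensor, "support"); // [n_samples, n_features]
var labels = TensorOperations<double>.Variable(labelTensor, "labels");     // [n_samples, n_outputs]
// Distance-weighted attention over all support points; lower temperature -> closer to hard KNN
var prediction = TensorOperations<double>.SoftKNN(query, support, labels, temperature: 0.5);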
SoftLocallyWeighted(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, T?)
Performs soft locally-weighted regression for differentiable instance-based learning.
public static ComputationNode<T> SoftLocallyWeighted(ComputationNode<T> input, ComputationNode<T> xTrain, ComputationNode<T> yTrain, T? bandwidth = default)
Parameters
inputComputationNode<T>The query input tensor.
xTrainComputationNode<T>Training feature matrix [n_samples, n_features].
yTrainComputationNode<T>Training target values [n_samples] or [n_samples, n_outputs].
bandwidthTBandwidth parameter controlling locality (default: 1.0).
Returns
- ComputationNode<T>
Attention-weighted prediction.
Remarks
Computes: distances[i] = ||input - xTrain[i]||² weights = softmax(-distances / bandwidth) output = Σ weights[i] * yTrain[i]
For Beginners: This is similar to SoftKNN but specifically designed for regression with a bandwidth parameter that controls how local the weighting is. Smaller bandwidth = more local predictions.
SoftPlus(ComputationNode<T>)
Applies the SoftPlus activation function element-wise: f(x) = ln(1 + e^x).
public static ComputationNode<T> SoftPlus(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with SoftPlus applied.
Remarks
SoftPlus is a smooth approximation of ReLU. The gradient is the sigmoid function: d(SoftPlus)/dx = sigmoid(x) = 1 / (1 + e^(-x))
For Beginners: SoftPlus smoothly approaches 0 for negative inputs and approaches the input value for large positive inputs, similar to ReLU but without the sharp corner at x=0.
SoftSign(ComputationNode<T>)
Applies the SoftSign activation function element-wise: f(x) = x / (1 + |x|).
public static ComputationNode<T> SoftSign(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with SoftSign applied.
Remarks
SoftSign is an alternative to tanh with polynomial tails that approach ±1 more slowly. The gradient is: d(SoftSign)/dx = 1 / (1 + |x|)²
For Beginners: SoftSign maps inputs to (-1, 1) like tanh, but with a different shape. The slower saturation can help prevent vanishing gradients in deep networks.
SoftSplit(ComputationNode<T>, ComputationNode<T>, ComputationNode<T>, int, T, T?)
Performs a soft split operation for differentiable decision trees.
public static ComputationNode<T> SoftSplit(ComputationNode<T> input, ComputationNode<T> leftValue, ComputationNode<T> rightValue, int featureIndex, T threshold, T? temperature = default)
Parameters
inputComputationNode<T>The input features tensor.
leftValueComputationNode<T>The value to return if going left.
rightValueComputationNode<T>The value to return if going right.
featureIndexintThe index of the feature to split on.
thresholdTThe threshold value for the split.
temperatureTTemperature parameter controlling split sharpness (default: 1.0).
Returns
- ComputationNode<T>
A weighted combination of left and right values based on soft split.
Remarks
Computes: p_left = σ((threshold - x[featureIndex]) / temperature) output = p_left * leftValue + (1 - p_left) * rightValue
For Beginners: This makes decision tree splits differentiable by using a smooth sigmoid function instead of a hard if-then-else. Lower temperature makes the split sharper (more like a hard decision), while higher temperature makes it softer.
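Example usage (a minimal sketch of one soft tree node; featureTensor, leftLeafTensor and rightLeafTensor are assumed to be existing Tensor<double> instances):
var x = TensorOperations<double>.Variable(featureTensor, "x");
var left = TensorOperations<double>.Variable(leftLeafTensor, "left");
var right = TensorOperations<double>.Variable(rightLeafTensor, "right");
// Soft decision on feature 3 against threshold 0.5; temperature 0.1 gives a fairly sharp split
var node = TensorOperations<double>.SoftSplit(x, left, right, featureIndex: 3, threshold: 0.5, temperature: 0.1);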
Softmax(ComputationNode<T>, int)
Computes the softmax function for a computation node along a specified axis.
public static ComputationNode<T> Softmax(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input node.
axisintThe axis along which to compute softmax. Default is -1 (last axis).
Returns
- ComputationNode<T>
A new computation node containing the softmax result.
Remarks
This method computes softmax (σ(x_i) = exp(x_i) / Σexp(x_j)) along the specified axis. Uses numerical stability trick: subtract max before exponentiating. The backward function uses: ∂softmax/∂x = softmax(x) * (grad - Σ(grad * softmax(x))).
For Beginners: Softmax converts a vector of numbers into probabilities.
For softmax:
- The forward pass exponentiates each element, then normalizes so they sum to 1
- The result is a probability distribution (all values between 0 and 1, summing to 1)
- The backward pass is complex but efficient: uses the Jacobian of softmax
Softmax is crucial for:
- Multi-class classification (final layer outputs)
- Attention mechanisms (computing attention weights)
- Anywhere you need to convert scores to probabilities
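Example usage (a minimal sketch; logitTensor is assumed to be an existing [batch, numClasses] Tensor<double>):
using (var tape = new GradientTape<double>())
{
    var logits = TensorOperations<double>.Variable(logitTensor, "logits"); // [batch, numClasses]
    tape.Watch(logits);

    // Convert raw scores to class probabilities along the last axis
    var probs = TensorOperations<double>.Softmax(logits, axis: -1);

    // A toy scalar objective so that gradients flow back through the softmax Jacobian
    var loss = TensorOperations<double>.Sum(TensorOperations<double>.Square(probs));
    var gradients = tape.Gradient(loss, new[] { logits });
}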
Softmin(ComputationNode<T>, int)
Applies the Softmin function, which assigns higher probability to lower values.
public static ComputationNode<T> Softmin(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input computation node.
axisintThe axis along which to compute softmin (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with Softmin applied.
Remarks
Softmin(x) = softmax(-x) = exp(-x) / sum(exp(-x)) Useful when lower values should have higher probability, e.g., in attention over distances.
Gradient: Same Jacobian structure as softmax but with negated input.
Sparsemax(ComputationNode<T>, int)
Applies the Sparsemax activation function which projects onto the probability simplex.
public static ComputationNode<T> Sparsemax(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input computation node (2D: batch × features).
axisintAxis along which to apply (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with Sparsemax applied.
Remarks
Sparsemax produces sparse probability distributions where some outputs are exactly zero. Unlike softmax which always gives positive probabilities to all classes, sparsemax can assign exactly zero to low-scoring classes.
Gradient: For support set S (non-zero outputs): grad = upstream - mean(upstream[S])
SphericalSoftmax(ComputationNode<T>, int)
Applies the Spherical Softmax activation function.
public static ComputationNode<T> SphericalSoftmax(ComputationNode<T> a, int axis = -1)
Parameters
aComputationNode<T>The input computation node (2D: batch × features).
axisintAxis along which to apply (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with SphericalSoftmax applied.
Remarks
SphericalSoftmax = softmax(x / ||x||₂) First L2-normalizes the input, then applies softmax. This improves numerical stability for inputs with varying magnitudes.
Gradient: Chain rule through L2 normalization and softmax.
Split(ComputationNode<T>, int, int)
Splits a tensor along a specified axis into multiple tensors.
public static List<ComputationNode<T>> Split(ComputationNode<T> a, int numSplits, int axis = 0)
Parameters
aComputationNode<T>The input computation node.
numSplitsintThe number of splits to create.
axisintThe axis along which to split.
Returns
- List<ComputationNode<T>>
A list of computation nodes representing the split tensors.
Sqrt(ComputationNode<T>)
Computes the square root for a computation node.
public static ComputationNode<T> Sqrt(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the square root result.
Remarks
This method computes the square root of each element and records the operation. The backward function uses: ∂(√a)/∂a = 1/(2√a).
For Beginners: This computes square root and tracks gradients.
For square root (c = √a):
- The forward pass computes √x for each element
- The backward pass: gradient to 'a' is incoming gradient * 1/(2√a)
- Which simplifies to: incoming gradient / (2 * output)
Square(ComputationNode<T>)
Computes the element-wise square of the input (x²).
public static ComputationNode<T> Square(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the squared result.
Remarks
This method computes the square of each element (x²) and records the operation. The backward function uses: ∂(x²)/∂x = 2x.
For Beginners: Square is a common operation in neural networks.
For square (c = a²):
- The forward pass computes a² for each element
- The backward pass: gradient to 'a' is incoming gradient * 2a
This is more efficient than using Power(a, 2) and is frequently needed for operations like computing distances, norms, and variance.
Squash(ComputationNode<T>, double)
Computes the squashing function used in capsule networks: s(x) = ||x||² / (1 + ||x||²) * (x / ||x||).
public static ComputationNode<T> Squash(ComputationNode<T> a, double epsilon = 1E-07)
Parameters
aComputationNode<T>The input node representing capsule vectors.
epsilondoubleSmall value for numerical stability (default: 1e-7).
Returns
- ComputationNode<T>
A new computation node containing the squashed result.
Remarks
This method computes the squashing nonlinearity used in capsule networks. The squashing function ensures that short vectors shrink to near zero length and long vectors shrink to a length slightly below 1.
For Beginners: Squashing is the activation function for capsule layers.
The squashing function:
- Keeps the direction of the vector unchanged
- Scales the length to be between 0 and 1
- Short vectors get much shorter (near 0)
- Long vectors approach length 1
This is crucial for capsule networks where the length represents the probability that the entity represented by the capsule exists, and the direction represents its properties.
Formula: s(v) = ||v||² / (1 + ||v||²) * (v / ||v||)
StraightThroughThreshold(ComputationNode<T>, double)
Applies a straight-through threshold for HTM-style sparse activations.
public static ComputationNode<T> StraightThroughThreshold(ComputationNode<T> input, double threshold)
Parameters
inputComputationNode<T>The input activations.
thresholddoubleThe threshold value.
Returns
- ComputationNode<T>
Binary activations with straight-through gradients.
Remarks
Forward: output = (input > threshold) ? 1 : 0 Backward: gradients pass through unchanged (straight-through estimator)
Subtract(ComputationNode<T>, ComputationNode<T>)
Performs element-wise subtraction of two computation nodes.
public static ComputationNode<T> Subtract(ComputationNode<T> a, ComputationNode<T> b)
Parameters
aComputationNode<T>The node to subtract from.
bComputationNode<T>The node to subtract.
Returns
- ComputationNode<T>
A new computation node containing the difference.
Remarks
This method performs element-wise subtraction and records the operation to any active GradientTape. The backward function sends gradient to 'a' unchanged and negated gradient to 'b' (since ∂(a-b)/∂a = 1 and ∂(a-b)/∂b = -1).
For Beginners: This subtracts one tensor from another and tracks gradients.
For subtraction (c = a - b):
- The forward pass computes a minus b element-wise
- The backward pass sends the gradient to 'a' unchanged
- But sends the negative gradient to 'b'
- This is because increasing 'b' by 1 decreases the result by 1
Example: If the gradient flowing to c is [1, 2, 3]:
- 'a' receives [1, 2, 3]
- 'b' receives [-1, -2, -3]
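Example usage (a minimal sketch of a mean-squared-error loss built from Subtract, Square and Mean; predictionTensor and targetTensor are assumed to be existing Tensor<double> instances of the same shape):
using (var tape = new GradientTape<double>())
{
    var predictions = TensorOperations<double>.Variable(predictionTensor, "predictions");
    var targets = TensorOperations<double>.Variable(targetTensor, "targets", requiresGradient: false);
    tape.Watch(predictions);

    // Mean squared error assembled from recorded primitives
    var error = TensorOperations<double>.Subtract(predictions, targets);
    var loss = TensorOperations<double>.Mean(TensorOperations<double>.Square(error));
    var gradients = tape.Gradient(loss, new[] { predictions });
}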
Sum(ComputationNode<T>, int[]?, bool)
Sums elements of a computation node along specified axes.
public static ComputationNode<T> Sum(ComputationNode<T> a, int[]? axes = null, bool keepDims = false)
Parameters
aComputationNode<T>The computation node to sum.
axesint[]The axes along which to sum. If null, sums all elements.
keepDimsboolWhether to keep the reduced dimensions with size 1. Default is false.
Returns
- ComputationNode<T>
A computation node representing the sum.
Remarks
Reduces the tensor by summing along specified axes.
Gradient computation: - The gradient is broadcast back to the original shape, as each element contributed equally to the sum.
SurrogateSpike(ComputationNode<T>, double, double)
Applies a surrogate spike function for spiking neural network JIT compilation.
public static ComputationNode<T> SurrogateSpike(ComputationNode<T> membranePotential, double threshold = 1, double surrogateBeta = 1)
Parameters
membranePotentialComputationNode<T>The membrane potential input.
thresholddoubleThe spike threshold (default 1.0).
surrogateBetadoubleSharpness of the surrogate gradient (default 1.0).
Returns
- ComputationNode<T>
A computation node containing spike outputs with surrogate gradients.
Remarks
Uses the sigmoid surrogate for gradient computation while producing hard spikes in forward pass. Forward: spike = (potential > threshold) ? 1 : 0 Backward: uses sigmoid derivative as surrogate gradient
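Example usage (a minimal sketch; membranePotentialTensor is assumed to be an existing Tensor<double> of membrane potentials):
var potential = TensorOperations<double>.Variable(membranePotentialTensor, "potential");
// Hard 0/1 spikes in the forward pass, smooth sigmoid-shaped gradients in the backward pass
var spikes = TensorOperations<double>.SurrogateSpike(potential, threshold: 1.0, surrogateBeta: 5.0);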
Swish(ComputationNode<T>)
Applies the Swish (SiLU) activation function.
public static ComputationNode<T> Swish(ComputationNode<T> a)
Parameters
aComputationNode<T>The input computation node.
Returns
- ComputationNode<T>
A new computation node with Swish applied.
Remarks
Swish(x) = x * sigmoid(x) = x / (1 + exp(-x)) Also known as SiLU (Sigmoid Linear Unit). Used in EfficientNet and other modern architectures.
Gradient: d(Swish)/dx = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)) = Swish(x) + sigmoid(x) * (1 - Swish(x))
Tanh(ComputationNode<T>)
Computes the hyperbolic tangent (tanh) for a computation node.
public static ComputationNode<T> Tanh(ComputationNode<T> a)
Parameters
aComputationNode<T>The input node.
Returns
- ComputationNode<T>
A new computation node containing the tanh result.
Remarks
This method computes tanh of each element and records the operation. The backward function uses: ∂(tanh(a))/∂a = 1 - tanh²(a).
For Beginners: Tanh is a common activation function in neural networks.
For tanh (c = tanh(a)):
- The forward pass computes tanh for each element (outputs between -1 and 1)
- The backward pass: gradient to 'a' is incoming gradient * (1 - output²)
Tanh is popular because it's centered around 0 (unlike sigmoid which is 0 to 1).
TaylorSoftmax(ComputationNode<T>, int, int)
Applies the Taylor Softmax activation function using Taylor series approximation.
public static ComputationNode<T> TaylorSoftmax(ComputationNode<T> a, int order = 2, int axis = -1)
Parameters
aComputationNode<T>The input computation node (2D: batch × features).
orderintOrder of Taylor series expansion (default 2).
axisintAxis along which to apply (default -1, last axis).
Returns
- ComputationNode<T>
A new computation node with TaylorSoftmax applied.
Remarks
TaylorSoftmax uses Taylor series approximation of exp(x): exp(x) ≈ 1 + x + x²/2! + x³/3! + ... + xⁿ/n! Then normalizes like standard softmax. More computationally efficient than standard softmax for some hardware.
Gradient: Similar to softmax but using polynomial derivatives.
ThresholdedReLU(ComputationNode<T>, double)
Applies the Thresholded Rectified Linear Unit activation function.
public static ComputationNode<T> ThresholdedReLU(ComputationNode<T> a, double threshold = 1)
Parameters
aComputationNode<T>The input computation node.
thresholddoubleThe threshold value (default 1.0).
Returns
- ComputationNode<T>
A new computation node with ThresholdedReLU applied.
Remarks
ThresholdedReLU(x) = x if x > threshold, 0 otherwise. Unlike standard ReLU which activates at 0, this activates at a configurable threshold.
Gradient: d(ThresholdedReLU)/dx = 1 if x > threshold, 0 otherwise.
TopKSoftmax(ComputationNode<T>, int)
Differentiable Top-K selection for mixture-of-experts routing.
public static ComputationNode<T> TopKSoftmax(ComputationNode<T> scores, int k)
Parameters
scoresComputationNode<T>The routing scores for each expert.
kintNumber of experts to select.
Returns
- ComputationNode<T>
Sparse routing weights with only top-K non-zero.
Remarks
Selects top-K values and normalizes them via softmax. Gradients flow only to the selected experts.
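Example usage (a minimal sketch of mixture-of-experts routing; routingScoreTensor is assumed to be an existing Tensor<double> with one score per expert):
var scores = TensorOperations<double>.Variable(routingScoreTensor, "scores"); // one score per expert
// Keep the 2 best experts and renormalize their weights; the rest get zero weight and zero gradient
var routingWeights = TensorOperations<double>.TopKSoftmax(scores, k: 2);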
Transpose(ComputationNode<T>)
Transposes a 2D computation node (matrix).
public static ComputationNode<T> Transpose(ComputationNode<T> a)
Parameters
aComputationNode<T>The matrix to transpose (must be 2D).
Returns
- ComputationNode<T>
A computation node representing the transposed matrix.
Remarks
For a 2D tensor, swaps rows and columns: if A has shape [m, n], result has shape [n, m].
Gradient computation: - ∂(A^T)/∂A = gradOut^T (transpose the gradient back)
Upsample(ComputationNode<T>, int)
Upsamples a tensor using nearest neighbor interpolation. Supports tensors of any rank (at least 2D), treating the last two dimensions as height and width.
public static ComputationNode<T> Upsample(ComputationNode<T> a, int scale)
Parameters
aComputationNode<T>The input computation node with at least 2 dimensions.
scaleintThe upsampling scale factor.
Returns
- ComputationNode<T>
A computation node representing the upsampled tensor.
Upsample3D(ComputationNode<T>, int, int, int)
Performs 3D upsampling (nearest neighbor) on a 5D tensor.
public static ComputationNode<T> Upsample3D(ComputationNode<T> input, int scaleD, int scaleH, int scaleW)
Parameters
inputComputationNode<T>The input node with shape [batch, channels, depth, height, width].
scaleDintScale factor for depth dimension.
scaleHintScale factor for height dimension.
scaleWintScale factor for width dimension.
Returns
- ComputationNode<T>
A new computation node containing the upsampled result with shape [batch, channels, depth*scaleD, height*scaleH, width*scaleW].
Remarks
3D upsampling increases spatial resolution by repeating values. This is the inverse operation of 3D max pooling and is commonly used in 3D U-Net decoder paths.
For Beginners: Upsample3D makes volumetric data larger by repeating voxels. If you have a 4x4x4 volume and upsample by 2 in each dimension, you get an 8x8x8 volume where each original voxel is repeated 2x2x2 times.
Gradient: The gradient is computed by summing gradients that were distributed to repeated elements back to the original position.
Exceptions
- ArgumentException
Thrown when input is not 5D.
Variable(Tensor<T>, string?, bool)
Creates a computation node from a tensor value.
public static ComputationNode<T> Variable(Tensor<T> value, string? name = null, bool requiresGradient = true)
Parameters
valueTensor<T>The tensor value.
namestringOptional name for the node.
requiresGradientboolWhether this node requires gradient computation.
Returns
- ComputationNode<T>
A computation node wrapping the tensor.
Remarks
This method creates a leaf node in the computation graph - a node with no parents. Leaf nodes typically represent inputs or parameters that gradients will be computed with respect to.
For Beginners: This creates a starting point in your calculation graph.
Use this to wrap:
- Model parameters (weights, biases) that need gradients
- Input data that you want to compute gradients for
- Constants (with requiresGradient=false)
The returned ComputationNode tracks the tensor's value and will accumulate gradients during backpropagation.