Class LayerHelper<T>

Namespace
AiDotNet.Helpers
Assembly
AiDotNet.dll

Provides helper methods for creating various neural network layer configurations.

public static class LayerHelper<T>

Type Parameters

T

The numeric type used for calculations (typically float or double).

Inheritance
object
LayerHelper<T>

Remarks

This class contains factory methods that create pre-configured sets of neural network layers for common architectures like standard feed-forward networks, CNNs, ResNets, and more.
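
For example, the factory methods are called on the closed generic type. A minimal sketch, assuming float as the numeric type and an already-configured architecture object (its construction is not shown here):

using AiDotNet.Helpers;

// `architecture` is assumed to be a pre-built NeuralNetworkArchitecture<float>.
var layers = LayerHelper<float>.CreateDefaultFeedForwardLayers(
    architecture,
    hiddenLayerCount: 2,   // documented default
    hiddenLayerSize: 64);  // documented default

The returned IEnumerable<ILayer<float>> can then be handed to whichever network type accepts a custom layer collection.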

Methods

CreateDefaultABINetLayers(int, int, int, int, int, int)

Creates default ABINet (Autonomous, Bidirectional, Iterative) layers.

public static IEnumerable<ILayer<T>> CreateDefaultABINetLayers(int imageWidth = 128, int imageHeight = 32, int visionDim = 512, int languageDim = 512, int numIterations = 3, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 128).

imageHeight int

Input image height (default: 32).

visionDim int

Vision encoder dimension (default: 512).

languageDim int

Language model dimension (default: 512).

numIterations int

Number of refinement iterations (default: 3).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an ABINet model.
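
A usage sketch relying on the documented defaults (float numeric type assumed; the variable name is illustrative):

using AiDotNet.Helpers;

// 128x32 text-line images, 3 refinement iterations, 95-character charset.
var abinetLayers = LayerHelper<float>.CreateDefaultABINetLayers(
    imageWidth: 128,
    imageHeight: 32,
    numIterations: 3,
    charsetSize: 95);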

CreateDefaultAnimateDiffLayers(int, int, int, int, int)

Creates layers for an AnimateDiff motion module that adds temporal coherence.

public static IEnumerable<ILayer<T>> CreateDefaultAnimateDiffLayers(int inputChannels = 320, int inputHeight = 64, int inputWidth = 64, int numLayers = 8, int numFrames = 16)

Parameters

inputChannels int

Number of input feature channels (default: 320).

inputHeight int

Input feature height (default: 64).

inputWidth int

Input feature width (default: 64).

numLayers int

Number of motion transformer layers (default: 8).

numFrames int

Number of video frames (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers for motion modeling.

Remarks

For Beginners: AnimateDiff is a motion module that plugs into existing image generation models (like Stable Diffusion) to create animated videos. It learns temporal dynamics from video data.

Architecture (based on the paper):

  1. Input features come from the base image model
  2. Temporal attention layers model motion across frames
  3. Cross-attention with motion context enables coherent animation
  4. Output features blend back into the base model

The motion module is designed to be inserted at multiple points in the U-Net.

Reference: "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models" https://arxiv.org/abs/2307.04725

CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T>)

Creates a default set of attention-based layers for transformer-style architectures.

public static IEnumerable<ILayer<T>> CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an attention-based neural network.

Remarks

For Beginners: Attention mechanisms allow neural networks to focus on specific parts of the input that are most relevant for a given task. Similar to how humans pay attention to specific details in a conversation, these layers help the network "pay attention" to important parts of the data. Transformers use this mechanism to process sequences (like text) very effectively.

CreateDefaultAudioGenLayers(int, int, int, int, int, int, int, int, double)

Creates default AudioGen layers for text-to-audio generation.

public static IEnumerable<ILayer<T>> CreateDefaultAudioGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 1024, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for T5-base).

lmHiddenDim int

Language model hidden dimension (default: 1536).

numLmLayers int

Number of language model transformer layers (default: 24).

numHeads int

Number of attention heads (default: 16).

numCodebooks int

Number of EnCodec codebooks (default: 4).

codebookSize int

Size of each codebook vocabulary (default: 1024).

maxTextLength int

Maximum text sequence length (default: 256).

maxAudioTokens int

Maximum number of audio tokens at roughly 50 tokens per second (default: 1500, about 30 seconds of audio).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an AudioGen model.

Remarks

AudioGen is a text-to-audio generation model that uses a transformer language model operating over EnCodec audio codes. Unlike MusicGen, it focuses on general audio and environmental sounds rather than music.

  • T5-based text encoder for conditioning
  • Transformer decoder generating audio codes autoregressively
  • EnCodec neural audio codec for audio reconstruction

Reference: "AudioGen: Textually Guided Audio Generation" by Kreuk et al., 2022

CreateDefaultAudioLDMLayers(int, int, int, int, int[]?, int, int, int, double)

Creates default AudioLDM layers for text-to-audio generation using latent diffusion.

public static IEnumerable<ILayer<T>> CreateDefaultAudioLDMLayers(int textHiddenDim = 768, int latentDim = 8, int unetChannels = 256, int numResBlocks = 2, int[]? attentionResolutions = null, int numHeads = 8, int numMels = 64, int maxTextLength = 77, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for CLAP).

latentDim int

Latent space dimension (default: 8).

unetChannels int

U-Net base channels (default: 256).

numResBlocks int

Number of residual blocks per level (default: 2).

attentionResolutions int[]

Resolutions at which to apply attention (default: [4, 2, 1]).

numHeads int

Number of attention heads (default: 8).

numMels int

Number of mel spectrogram channels (default: 64).

maxTextLength int

Maximum text sequence length (default: 77).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an AudioLDM model.

Remarks

AudioLDM uses latent diffusion for text-to-audio generation:

  • CLAP text encoder for conditioning
  • VAE to encode/decode mel spectrograms to latent space
  • U-Net for denoising in latent space
  • HiFi-GAN vocoder for waveform generation

Reference: "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Liu et al., 2023

CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T>)

Creates a default autoencoder neural network architecture.

public static IEnumerable<ILayer<T>> CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an autoencoder neural network.

Remarks

For Beginners: An autoencoder is a type of neural network that learns to compress data into a smaller representation and then reconstruct it back to the original form. Think of it like learning to create a thumbnail of an image and then expanding it back to full size. The network has two main parts: an encoder that compresses the data and a decoder that reconstructs it.
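
A minimal sketch (float assumed; `architecture` is a pre-built NeuralNetworkArchitecture<float> whose construction is not shown):

using AiDotNet.Helpers;

// Encoder and decoder sizing comes from the architecture configuration.
var autoEncoderLayers =
    LayerHelper<float>.CreateDefaultAutoEncoderLayers(architecture);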

CreateDefaultBGELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a BGE (BAAI General Embedding) model.

public static IEnumerable<ILayer<T>> CreateDefaultBGELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a Bayesian neural network (Bayes-by-Backprop style).

public static IEnumerable<ILayer<T>> CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

Returns

IEnumerable<ILayer<T>>

Remarks

This mirrors the library's default dense+activation patterns, but uses Bayesian dense layers so the network can express epistemic uncertainty through weight distributions.

CreateDefaultBlip2Layers(int, int, int, int, int, int, int, int, int, int, int, int)

Creates default layers for a BLIP-2 neural network.

public static IEnumerable<ILayer<T>> CreateDefaultBlip2Layers(int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 30522, int embeddingDimension = 256, int qformerHiddenDim = 768, int visionHiddenDim = 1408, int lmHiddenDim = 2560, int numQformerLayers = 12, int numHeads = 12, int numLmDecoderLayers = 6, int maxSequenceLength = 32)

Parameters

imageSize int
channels int
patchSize int
vocabularySize int
embeddingDimension int
qformerHiddenDim int
visionHiddenDim int
lmHiddenDim int
numQformerLayers int
numHeads int
numLmDecoderLayers int
maxSequenceLength int

Returns

IEnumerable<ILayer<T>>

CreateDefaultByteTrackLayers(int, int, int, int, int)

Creates default layers for ByteTrack multi-object tracking.

public static IEnumerable<ILayer<T>> CreateDefaultByteTrackLayers(int inputChannels = 3, int inputHeight = 800, int inputWidth = 1440, int numFeatures = 256, int numClasses = 1)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numClasses int

Returns

IEnumerable<ILayer<T>>

CreateDefaultCNNLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates a Convolutional Neural Network (CNN) with configurable layers.

public static IEnumerable<ILayer<T>> CreateDefaultCNNLayers(NeuralNetworkArchitecture<T> architecture, int convLayerCount = 2, int filterCount = 32, int kernelSize = 3, int denseLayerCount = 1, int denseLayerSize = 64, int outputSize = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

convLayerCount int

Number of convolutional layers (default: 2).

filterCount int

Number of filters in each convolutional layer (default: 32).

kernelSize int

Size of the convolutional kernel (default: 3).

denseLayerCount int

Number of dense layers after convolutional layers (default: 1).

denseLayerSize int

Number of neurons in each dense layer (default: 64).

outputSize int

Number of output neurons (default: 1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CNN.

Remarks

For Beginners: A Convolutional Neural Network (CNN) is specialized for processing grid-like data, such as images. Instead of connecting every input to every neuron (which would be inefficient for images), CNNs use filters that scan across the image to detect features like edges, textures, and shapes.

Key components in this CNN:

  • Convolutional layers: Detect features in the input using filters
  • Pooling layers: Reduce the size of the data while keeping important information
  • Flatten layer: Converts the multi-dimensional data to a flat vector
  • Dense layers: Process the extracted features to make predictions
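
A sketch showing the configurable knobs with their documented defaults (float assumed; `architecture` pre-built as above):

using AiDotNet.Helpers;

var cnnLayers = LayerHelper<float>.CreateDefaultCNNLayers(
    architecture,
    convLayerCount: 2,    // number of convolutional layers
    filterCount: 32,      // filters per convolutional layer
    kernelSize: 3,        // 3x3 kernels
    denseLayerCount: 1,   // dense layers after flattening
    denseLayerSize: 64,
    outputSize: 1);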

CreateDefaultCRAFTLayers(int, int, int)

Creates default CRAFT layers for character-level text detection.

public static IEnumerable<ILayer<T>> CreateDefaultCRAFTLayers(int imageSize = 768, int backboneChannels = 512, int upscaleChannels = 256)

Parameters

imageSize int

Input image size (default: 768).

backboneChannels int

Backbone output channels (default: 512).

upscaleChannels int

Upscale network channels (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CRAFT model.

Remarks

Reference: "Character Region Awareness for Text Detection" (CVPR 2019)

CreateDefaultCRNNLayers(int, int, int, int, int, int)

Creates default CRNN layers for sequence text recognition.

public static IEnumerable<ILayer<T>> CreateDefaultCRNNLayers(int imageWidth = 128, int imageHeight = 32, int cnnChannels = 512, int rnnHiddenSize = 256, int rnnLayers = 2, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 128).

imageHeight int

Input image height (default: 32).

cnnChannels int

CNN output channels (default: 512).

rnnHiddenSize int

RNN hidden size (default: 256).

rnnLayers int

Number of RNN layers (default: 2).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CRNN model.

Remarks

Reference: "An End-to-End Trainable Neural Network for Image-based Sequence Recognition" (TPAMI 2017)

CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default capsule network architecture.

public static IEnumerable<ILayer<T>> CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a capsule network.

Remarks

For Beginners: A capsule network is an advanced type of neural network that tries to better understand spatial relationships in data. Unlike traditional networks that just detect features, capsule networks also track the position, orientation, and size of features. Think of it like the difference between recognizing a face by just its parts (eyes, nose, mouth) versus understanding how those parts relate to each other in 3D space.

The network consists of special "capsule" layers that group neurons together to represent entities and their properties, allowing the network to better understand complex structures in data.

CreateDefaultClipLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for CLIP-style multimodal networks.

public static IEnumerable<ILayer<T>> CreateDefaultClipLayers(NeuralNetworkArchitecture<T> architecture, int projectionDim = 512)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

projectionDim int

The projection dimension for embeddings (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of projection layers for CLIP fine-tuning.

Remarks

CLIP uses pre-trained ONNX encoders for most of its work, but these layers provide optional projection heads for fine-tuning or feature extraction.

For Beginners: CLIP has two main parts: an image encoder and a text encoder. These pre-trained encoders are loaded from ONNX files. The projection layers here are optional additions that can:

  • Adapt the embeddings for specific tasks
  • Allow fine-tuning on new domains
  • Match embedding dimensions between different model variants

If you're just using CLIP for inference (getting embeddings), you typically don't need these layers. They're useful when you want to adapt CLIP for a specific task.
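
A sketch of creating the optional projection heads (float assumed; `architecture` pre-built):

using AiDotNet.Helpers;

// Optional projection heads for fine-tuning; not needed for plain inference.
var clipProjections = LayerHelper<float>.CreateDefaultClipLayers(
    architecture,
    projectionDim: 512);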

CreateDefaultCogVideoLayers(int, int, int, int, int, int)

Creates layers for a CogVideo text-to-video generation model.

public static IEnumerable<ILayer<T>> CreateDefaultCogVideoLayers(int inputChannels = 4, int inputHeight = 32, int inputWidth = 32, int embedDim = 1024, int numLayers = 24, int numFrames = 16)

Parameters

inputChannels int

Number of input channels for latent (default: 4).

inputHeight int

Input latent height (default: 32).

inputWidth int

Input latent width (default: 32).

embedDim int

Embedding dimension (default: 1024).

numLayers int

Number of transformer layers (default: 24).

numFrames int

Number of video frames to generate (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video generation.

Remarks

For Beginners: CogVideo generates videos from text descriptions. It works in the latent space (compressed representation) and uses a diffusion-based approach to iteratively refine noise into coherent video.

Architecture (based on the CogVideoX paper):

  1. Text encoder processes the input prompt
  2. Latent space diffusion model generates video frames
  3. VAE decoder converts latent to pixel space

This creates the denoising U-Net backbone that refines latent codes.

Reference: "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" https://arxiv.org/abs/2408.06072

CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a ColBERT (Contextualized Late Interaction over BERT) model.

public static IEnumerable<ILayer<T>> CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 128, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultCutieLayers(int, int, int, int)

Creates layers for a Cutie video object segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultCutieLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height (default: 480).

inputWidth int

Input frame width (default: 854).

numFeatures int

Feature dimension (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video object segmentation.

Remarks

For Beginners: Cutie is designed for semi-supervised video object segmentation (VOS). Given a mask for an object in the first frame, it tracks and segments that object throughout the entire video with high accuracy.

Architecture:

  1. Image encoder (ResNet-like backbone) extracts features
  2. Object encoder processes mask with features
  3. Memory attention matches current frame to stored memories
  4. Mask decoder produces segmentation output

Reference: "Putting the Object Back into Video Object Segmentation" https://arxiv.org/abs/2310.12982

CreateDefaultDBNetLayers(int, int, int)

Creates default layers for DBNet text detection model.

public static IEnumerable<ILayer<T>> CreateDefaultDBNetLayers(int imageSize = 640, int backboneChannels = 256, int innerChannels = 256)

Parameters

imageSize int

Input image size (default: 640).

backboneChannels int

Backbone output channels (default: 256).

innerChannels int

FPN inner channels (default: 256).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for DBNet.

Remarks

DBNet uses a ResNet backbone with FPN for multi-scale features, followed by probability and threshold prediction heads.

Reference: "Real-time Scene Text Detection with Differentiable Binarization" (AAAI 2020)

CreateDefaultDIFRINTLayers(int, int, int, int, int)

Creates default layers for DIFRINT video stabilization.

public static IEnumerable<ILayer<T>> CreateDefaultDIFRINTLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 64, int numIterations = 3)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numIterations int

Returns

IEnumerable<ILayer<T>>

CreateDefaultDNCLayers(NeuralNetworkArchitecture<T>, int, int, int, int)

Creates a default Differentiable Neural Computer (DNC) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultDNCLayers(NeuralNetworkArchitecture<T> architecture, int controllerSize, int memoryWordSize, int readHeads, int interfaceSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

controllerSize int

The size of the controller network.

memoryWordSize int

The size of each memory word.

readHeads int

The number of read heads.

interfaceSize int

The size of the interface between controller and memory.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Differentiable Neural Computer.

Remarks

For Beginners: A Differentiable Neural Computer (DNC) is like a neural network with a built-in memory system. Traditional neural networks process information and then forget it, but a DNC can store information in its "memory" and retrieve it later when needed. This makes DNCs good at tasks that require remembering information over time, like answering questions about a story or navigating through complex environments.

CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default Deep Belief Network (DBN) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Belief Network.

Remarks

For Beginners: A Deep Belief Network is a type of neural network that learns to recognize patterns in data by building multiple layers that each specialize in finding specific features. It works by training each layer one at a time (called "pre-training"), which helps the network learn more effectively, especially when you don't have a lot of labeled training data.

CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T>)

Creates default layers for a Deep Boltzmann Machine (DBM).

public static IEnumerable<ILayer<T>> CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Boltzmann Machine.

Remarks

For Beginners: A Deep Boltzmann Machine is a type of neural network that learns to recognize patterns in data without supervision. It's made up of multiple layers of "hidden units" that learn to represent features of the input data. DBMs are particularly good at learning complex patterns and can be used for tasks like feature learning, dimensionality reduction, and generating new data similar to the training set.

CreateDefaultDeepOperatorNetworkLayers(int, int, int, int, int)

Creates default layers for a Deep Operator Network (DeepONet).

public static (IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers) CreateDefaultDeepOperatorNetworkLayers(int branchInputSize, int trunkInputSize, int outputSize = 1, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

branchInputSize int

Size of the branch network input (function samples).

trunkInputSize int

Size of the trunk network input (query locations).

outputSize int

Number of output components (default: 1 for scalar operators). For multi-output operators, each output component uses hiddenLayerSize basis functions, so the final layer outputs hiddenLayerSize * outputSize values that are reshaped and summed.

hiddenLayerCount int

Number of hidden layers in each sub-network (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer, and the number of basis functions per output component (default: 64).

Returns

(IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers)

A tuple of (branchLayers, trunkLayers) for the DeepONet architecture.

Remarks

For Beginners: DeepONet learns operators - functions that take functions as input. For example, an operator might take a temperature distribution as input and output the resulting heat flow. The branch network encodes the input function, while the trunk network handles where you want to evaluate the output.

Architecture: Branch encodes input function, Trunk encodes query location. Output = sum(Branch * Trunk) + bias, allowing learning of complex operators.

Multi-output handling: For operators with multiple output components (e.g., velocity with x,y,z components), set outputSize to the number of components. Each component gets its own set of basis functions. The branch and trunk networks output hiddenLayerSize * outputSize values, which are grouped as [component1_basis1..p, component2_basis1..p, ...] where p = hiddenLayerSize.
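
A sketch deconstructing the returned tuple. The branch and trunk input sizes here are illustrative assumptions (100 samples of the input function, 2D query coordinates):

using AiDotNet.Helpers;

var (branchLayers, trunkLayers) =
    LayerHelper<float>.CreateDefaultDeepOperatorNetworkLayers(
        branchInputSize: 100,  // illustrative: 100 function samples
        trunkInputSize: 2,     // illustrative: (x, t) query coordinates
        outputSize: 1,         // scalar operator
        hiddenLayerCount: 3,
        hiddenLayerSize: 64);  // also the basis-function count per output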

CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default Deep Q-Network (DQN) with pre-configured layers for reinforcement learning.

public static IEnumerable<ILayer<T>> CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Q-Network.

Remarks

For Beginners: A Deep Q-Network is a type of neural network used in reinforcement learning, which is how computers learn to make decisions by trying different actions and receiving rewards. Think of it like teaching a dog new tricks with treats. The network learns which actions (like moving left or right in a game) will lead to the highest rewards over time.

CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for the Deep Ritz Method network.

public static IEnumerable<ILayer<T>> CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 50).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Ritz network.

Remarks

For Beginners: The Deep Ritz Method solves PDEs by minimizing an energy functional instead of directly enforcing the PDE. This is based on the Ritz method from calculus of variations. The network learns the function that minimizes the energy.

Similar architecture to VPINN but used with energy-based loss functions. Tanh activation provides smooth second derivatives needed for energy computations.

CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T>, DenseNetConfiguration)

Creates default layers for a DenseNet network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T> architecture, DenseNetConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration DenseNetConfiguration

The DenseNet-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DenseNet network.

Remarks

For Beginners: DenseNet (Densely Connected Convolutional Network) connects each layer to every other layer in a feed-forward fashion. This creates strong gradient flow and feature reuse, enabling very deep networks with fewer parameters.

The DenseNet architecture consists of:

  • Stem: Initial 7x7 conv with stride 2, followed by 3x3 max pooling
  • Dense Blocks: Multiple dense blocks with transition layers between them
  • Transition Layers: 1x1 conv for channel reduction followed by 2x2 avg pooling
  • Classification Head: Global average pooling followed by a dense layer

CreateDefaultDepthAnythingV2Layers(int, int, int, int, int)

Creates default layers for Depth Anything V2 monocular depth estimation model.

public static IEnumerable<ILayer<T>> CreateDefaultDepthAnythingV2Layers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 768, int numEncoderBlocks = 12)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 480).

inputWidth int

Input width (default: 640).

numFeatures int

Number of feature channels (default: 768 for Base).

numEncoderBlocks int

Number of encoder transformer blocks (default: 12).

Returns

IEnumerable<ILayer<T>>

An enumerable of layers configured for Depth Anything V2.

Remarks

For Beginners: Depth Anything V2 estimates depth maps from single images. Given an RGB image, it predicts the relative distance of each pixel from the camera.

Architecture:

  • ViT-based encoder with DINOv2 initialization
  • Multi-scale decoder for dense prediction
  • Depth prediction head

Reference: "Depth Anything V2" https://arxiv.org/abs/2406.09414

CreateDefaultDessurtLayers(int, int, int, int, int, int)

Creates default Dessurt (self-supervised document transformer) layers.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDessurtLayers(int encoderDim = 768, int decoderDim = 768, int encoderLayers = 12, int decoderLayers = 6, int numHeads = 12, int vocabSize = 50265)

Parameters

encoderDim int

Encoder dimension (default: 768).

decoderDim int

Decoder dimension (default: 768).

encoderLayers int

Number of encoder layers (default: 12).

decoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 50265).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Encoder and decoder layers for a Dessurt model.

CreateDefaultDiTLayers(int, int, int, int, int, int)

Creates default DiT (Document Image Transformer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultDiTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int patchSize = 16, int imageSize = 224, int numClasses = 16)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

patchSize int

Patch size for ViT (default: 16).

imageSize int

Input image size (default: 224).

numClasses int

Number of output classes (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DiT model.

CreateDefaultDocBankLayers(int, int, int, int)

Creates default layers for DocBank page segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultDocBankLayers(int imageSize = 1024, int backboneChannels = 256, int numClasses = 13, int hiddenDim = 256)

Parameters

imageSize int

Input image size (default: 1024).

backboneChannels int

Backbone output channels (default: 256).

numClasses int

Number of segmentation classes (default: 13).

hiddenDim int

Hidden dimension for segmentation head (default: 256).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for DocBank.

Remarks

DocBank uses a ResNet backbone with FPN for semantic segmentation.

Reference: "DocBank: A Benchmark Dataset for Document Layout Analysis" (COLING 2020)

CreateDefaultDocFormerLayers(int, int, int, int, int, int, int)

Creates default DocFormer layers for document understanding with shared spatial encodings.

public static IEnumerable<ILayer<T>> CreateDefaultDocFormerLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int spatialDim = 128, int numClasses = 16)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522).

imageSize int

Input image size (default: 224).

spatialDim int

Spatial embedding dimension (default: 128).

numClasses int

Number of output classes (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocFormer model.

Remarks

DocFormer uses shared spatial encodings across text, visual, and layout modalities.

Reference: "DocFormer: End-to-End Transformer for Document Understanding" (ICCV 2021)

CreateDefaultDocGCNLayers(int, int, int, int)

Creates default DocGCN (Document Graph Convolutional Network) layers.

public static IEnumerable<ILayer<T>> CreateDefaultDocGCNLayers(int inputDim = 768, int hiddenDim = 256, int numGCNLayers = 3, int numClasses = 7)

Parameters

inputDim int

Input feature dimension (default: 768).

hiddenDim int

Hidden dimension (default: 256).

numGCNLayers int

Number of GCN layers (default: 3).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocGCN model.

CreateDefaultDocOwlLayers(int, int, int, int, int, int)

Creates default DocOwl (mPLUG-DocOwl) layers for document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultDocOwlLayers(int visionDim = 1024, int textDim = 4096, int visionLayers = 24, int textLayers = 32, int numHeads = 16, int vocabSize = 32000)

Parameters

visionDim int

Vision encoder dimension (default: 1024).

textDim int

Text encoder dimension (default: 4096).

visionLayers int

Number of vision layers (default: 24).

textLayers int

Number of text layers (default: 32).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 32000).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocOwl model.

CreateDefaultDonutLayers(int, int, int, int, int[]?, int[]?, int, int, int, int, int, int, int, int)

Creates default Donut layers for OCR-free document understanding.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDonutLayers(int imageHeight = 1920, int imageWidth = 2560, int inputChannels = 3, int embedDim = 128, int[]? depths = null, int[]? numHeads = null, int windowSize = 10, int patchSize = 4, int mlpRatio = 4, int decoderHiddenDim = 1024, int numDecoderLayers = 4, int decoderHeads = 16, int vocabSize = 57522, int maxGenerationLength = 768)

Parameters

imageHeight int

Input image height (default: 1920 for donut-base).

imageWidth int

Input image width (default: 2560 for donut-base).

inputChannels int

Number of input channels (default: 3 for RGB).

embedDim int

Initial embedding dimension (default: 128 for Swin-B).

depths int[]

Depths of each Swin stage (default: {2,2,14,2} for donut-base).

numHeads int[]

Attention heads per stage (default: {4,8,16,32}).

windowSize int

Window size for attention (default: 10 for donut-base).

patchSize int

Initial patch size (default: 4).

mlpRatio int

MLP expansion ratio (default: 4).

decoderHiddenDim int

Decoder hidden dimension (default: 1024).

numDecoderLayers int

Number of decoder layers (default: 4).

decoderHeads int

Number of decoder attention heads (default: 16).

vocabSize int

Vocabulary size (default: 57522).

maxGenerationLength int

Maximum output sequence length (default: 768).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

A tuple of (EncoderLayers, DecoderLayers) forming a Donut architecture.

Remarks

Donut (Document Understanding Transformer) is an OCR-free end-to-end model:

  • Swin Transformer-B encoder with hierarchical stages for image features
  • BART-style decoder for text generation
  • Direct pixel-to-text conversion without explicit OCR

For Beginners: This creates a model that can "read" documents directly from pixels without needing a separate OCR step. The encoder extracts visual features at multiple scales using the Swin Transformer architecture, while the decoder generates text autoregressively.

Default Configuration (donut-base):

  • Input: 2560×1920 RGB images
  • Encoder: Swin-B with depths {2,2,14,2}, 128 initial dim, window size 10
  • Decoder: 4-layer BART-style with 1024 hidden dim

Reference: "OCR-free Document Understanding Transformer" (ECCV 2022)

CreateDefaultEASTLayers(int, int, int, string)

Creates default EAST (Efficient and Accurate Scene Text Detector) layers.

public static IEnumerable<ILayer<T>> CreateDefaultEASTLayers(int imageSize = 512, int backboneChannels = 512, int featureChannels = 128, string geometryType = "RBOX")

Parameters

imageSize int

Input image size (default: 512).

backboneChannels int

Backbone output channels (default: 512).

featureChannels int

Feature map channels (default: 128).

geometryType string

Geometry output type: RBOX or QUAD (default: RBOX).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an EAST model.

CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int[]?)

Creates default ECAPA-TDNN layers for spoken language identification.

public static IEnumerable<ILayer<T>> CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 192, int numLanguages = 20, int[]? dilations = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

numMels int

Number of mel filterbank channels (default: 80).

tdnnChannels int

Number of TDNN channels (default: 1024).

embeddingDimension int

Embedding dimension (default: 192).

numLanguages int

Number of languages to classify (default: 20).

dilations int[]

Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an ECAPA-TDNN language identifier.

Remarks

ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation TDNN) is a state-of-the-art architecture for speaker and language recognition using:

  • SE-Res2Net blocks with channel attention
  • Multi-layer feature aggregation (MFA)
  • Attentive statistics pooling

CreateDefaultEDVRLayers(int, int, int, int, int, int, int)

Creates default layers for EDVR video restoration.

public static IEnumerable<ILayer<T>> CreateDefaultEDVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numFrames = 5, int numGroups = 8, int numBlocks = 5)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numFrames int
numGroups int
numBlocks int

Returns

IEnumerable<ILayer<T>>

CreateDefaultELMLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for an Extreme Learning Machine (ELM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultELMLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerSize int

The size of the hidden layer.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an Extreme Learning Machine.

Remarks

For Beginners: An Extreme Learning Machine (ELM) is a simplified neural network where only the output layer weights are trained. The hidden layer weights are randomly initialized and never updated. This makes ELMs very fast to train compared to traditional neural networks, while still providing good performance for many tasks. Think of it as a "shortcut" approach to neural network training.

ELMs work by projecting the input data into a higher-dimensional space using random weights, then finding the best output weights to solve the problem. They're particularly useful when you need a quick solution and don't have time for extensive training.

CreateDefaultESNLayers(int, int, int, double, double)

Creates a default Echo State Network (ESN) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultESNLayers(int inputSize, int outputSize, int reservoirSize, double spectralRadius = 0.9, double sparsity = 0.1)

Parameters

inputSize int

The size of the input layer.

outputSize int

The size of the output layer.

reservoirSize int

The size of the reservoir (hidden layer).

spectralRadius double

Controls the stability of the reservoir dynamics (default: 0.9).

sparsity double

The connection sparsity in the reservoir (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an Echo State Network.

Remarks

For Beginners: An Echo State Network is a special type of recurrent neural network where most of the connections between neurons are fixed (not trained). Only the connections from the hidden layer to the output are trained. Think of it like having a pool of water (the reservoir) that you disturb with input signals, and then you learn to read the ripple patterns to predict outputs. This makes ESNs very fast to train compared to other recurrent networks.
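
A sketch with illustrative input, output, and reservoir sizes (float assumed); only the reservoir-to-output weights are trained:

using AiDotNet.Helpers;

var esnLayers = LayerHelper<float>.CreateDefaultESNLayers(
    inputSize: 8,         // illustrative
    outputSize: 1,        // illustrative
    reservoirSize: 500,   // illustrative
    spectralRadius: 0.9,  // documented default
    sparsity: 0.1);       // documented default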

CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T>, EfficientNetConfiguration)

Creates default layers for an EfficientNet network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T> architecture, EfficientNetConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration EfficientNetConfiguration

The EfficientNet-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an EfficientNet network.

Remarks

For Beginners: EfficientNet uses compound scaling to balance network depth, width, and resolution. Each variant (B0-B7) represents a different scale factor, achieving excellent accuracy with fewer parameters than previous architectures.

CreateDefaultFLAVRLayers(int, int, int, int, int)

Creates default layers for FLAVR frame interpolation.

public static IEnumerable<ILayer<T>> CreateDefaultFLAVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numInputFrames = 4)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numInputFrames int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFastDVDNetLayers(int, int, int, int, int)

Creates default layers for FastDVDNet video denoising.

public static IEnumerable<ILayer<T>> CreateDefaultFastDVDNetLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 32, int numInputFrames = 5)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numInputFrames int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a FastText model.

public static IEnumerable<ILayer<T>> CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int bucketSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

bucketSize int

The number of buckets for n-gram hashing.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a FastText model.

Remarks

For Beginners: FastText improves on Word2Vec by considering sub-word information (character n-grams). It represents words as the sum of their n-gram embeddings.

CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a feed-forward neural network.

public static IEnumerable<ILayer<T>> CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 2).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a feed-forward neural network.

Remarks

For Beginners: This method builds a basic feed-forward neural network. Think of it as a series of connected layers where information flows from the input, through "hidden" processing layers, to the output.

Key components:

  • Input layer: Receives the raw data
  • Hidden layers: Process and transform the data, learning patterns
  • Output layer: Produces the final prediction or classification

The network automatically adjusts for different types of tasks (like classification or regression) by choosing appropriate activation functions for the output layer.

CreateDefaultFlowFormerLayers(int, int, int, int, int)

Creates default layers for FlowFormer optical flow estimation.

public static IEnumerable<ILayer<T>> CreateDefaultFlowFormerLayers(int inputChannels = 3, int inputHeight = 448, int inputWidth = 1024, int embedDim = 256, int numLayers = 6)

Parameters

inputChannels int
inputHeight int
inputWidth int
embedDim int
numLayers int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T>, int[], int, int, int)

Creates default layers for a Fourier Neural Operator (FNO).

public static IEnumerable<ILayer<T>> CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T> architecture, int[] spatialDimensions, int numFourierLayers = 4, int hiddenChannels = 64, int numModes = 12)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

spatialDimensions int[]

Dimensions of the spatial domain (e.g., [64, 64] for 2D grid, [32] for 1D). This determines the FFT size for spectral operations.

numFourierLayers int

Number of Fourier layers (default: 4).

hiddenChannels int

Number of hidden channels/width (default: 64).

numModes int

Number of Fourier modes to retain (default: 12). Lower = smoother, higher = more detail. Should be less than or equal to the smallest spatial dimension.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Fourier Neural Operator.

Remarks

For Beginners: Fourier Neural Operators learn mappings between function spaces by operating in frequency domain. They're especially powerful for PDEs because many physical phenomena have simple representations in frequency space.

Architecture:

  1. Lifting layer: Projects input to higher-dimensional channel space
  2. Fourier layers: Apply spectral convolution (FFT → learnable weights → IFFT) + local linear transform
  3. Projection layers: Map back to output dimension

Key FNO Properties:

  • Resolution-invariant: Train at one resolution, evaluate at another
  • Global receptive field through spectral operations
  • Efficient for smooth functions (low-frequency dominated)

Note: For full FNO functionality with training, use the FourierNeuralOperator<T> class directly, which provides a complete neural operator implementation.

Exceptions

ArgumentNullException

Thrown when spatialDimensions is null.

ArgumentException

Thrown when spatialDimensions is empty.
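
A sketch for a 2D problem on a 64×64 grid (float assumed; `architecture` pre-built):

using AiDotNet.Helpers;

var fnoLayers = LayerHelper<float>.CreateDefaultFourierNeuralOperatorLayers(
    architecture,
    spatialDimensions: new[] { 64, 64 },  // 2D grid; determines the FFT size
    numFourierLayers: 4,
    hiddenChannels: 64,
    numModes: 12);  // keep <= the smallest spatial dimension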

CreateDefaultFrameInterpolationLayers(int, int, int, int)

Creates layers for a frame interpolation model (FILM/RIFE-style).

public static IEnumerable<ILayer<T>> CreateDefaultFrameInterpolationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

numFeatures int

Number of feature channels (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers for frame interpolation.

Remarks

For Beginners: Frame interpolation creates new frames between existing ones to make video smoother (e.g., 30fps to 60fps). The model learns to "imagine" what the in-between frames should look like based on the surrounding frames.

Architecture:

  1. Feature pyramid extracts multi-scale features
  2. Flow estimation predicts motion
  3. Synthesis network generates interpolated frames

CreateDefaultGNNLayers(NeuralNetworkArchitecture<T>)

Creates default layers for a Graph Neural Network (GNN).

public static IEnumerable<ILayer<T>> CreateDefaultGNNLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Graph Neural Network.

Remarks

For Beginners: Graph Neural Networks (GNNs) are specialized neural networks designed to work with graph-structured data, where information is represented as nodes (points) connected by edges (lines). Examples include social networks, molecular structures, or road networks.

Unlike standard neural networks that process individual data points independently, GNNs can understand relationships between data points. They work by passing information between connected nodes, allowing each node to "learn" from its neighbors. This makes GNNs powerful for tasks where relationships between entities matter, such as recommending friends on social media, predicting protein interactions, or analyzing traffic patterns.

CreateDefaultGRULayers(NeuralNetworkArchitecture<T>)

Creates a default Gated Recurrent Unit (GRU) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultGRULayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GRU-based processing.

Remarks

For Beginners: A GRU (Gated Recurrent Unit) is a type of recurrent neural network that's especially good at learning patterns in sequences of data, like text or time series. It's similar to LSTM but with a simpler structure, making it faster to train while still capturing long-term dependencies in data.

This method automatically configures appropriate GRU layers based on your task type, with sensible defaults for hidden layer sizes and activation functions.

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultGenreClassifierLayers(int, int, int, int, int, double)

Creates default genre classification layers.

public static IEnumerable<ILayer<T>> CreateDefaultGenreClassifierLayers(int numMels = 128, int hiddenDim = 256, int numClasses = 10, int maxFrames = 1000, int numAttentionLayers = 4, double dropoutRate = 0.3)

Parameters

numMels int

Number of mel spectrogram bins (default: 128).

hiddenDim int

Hidden layer dimension (default: 256).

numClasses int

Number of genre classes (default: 10).

maxFrames int

Maximum input frames (default: 1000).

numAttentionLayers int

Number of attention layers (default: 4).

dropoutRate double

Dropout rate (default: 0.3).

Returns

IEnumerable<ILayer<T>>

A collection of layers for genre classification.

Remarks

Audio classification architecture with:

  • Mel spectrogram feature extraction
  • Transformer encoder for temporal modeling
  • Global average pooling
  • Classification head with softmax output

CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a GloVe (Global Vectors) model.

public static IEnumerable<ILayer<T>> CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a GloVe model.

Remarks

For Beginners: GloVe creates word embeddings by learning from the co-occurrence statistics of words. It uses two sets of embeddings and two sets of biases.

Note: The layers returned by this method are not intended to be used as a sequential feed-forward stack. They represent the four components (W, W_tilde, b, b_tilde) required for the GloVe model's custom forward pass.

CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates default layers for a Graph Attention Network (GAT).

public static IEnumerable<ILayer<T>> CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T> architecture, int numHeads = 8, int numLayers = 2, double dropoutRate = 0.6)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

numHeads int

Number of attention heads per layer (default: 8).

numLayers int

Number of GAT layers (default: 2).

dropoutRate double

Dropout rate for attention coefficients (default: 0.6).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GAT processing.

Remarks

For Beginners: GAT uses attention mechanisms to learn which neighbors are most important for each node, allowing dynamic weighting of neighbor contributions.

CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T>, int, int, int, double)

Creates default layers for a Graph Classification model.

public static IEnumerable<ILayer<T>> CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 128, int numGnnLayers = 3, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

embeddingDim int

Graph embedding dimension (default: 128).

numGnnLayers int

Number of GNN layers (default: 3).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for graph classification.

Remarks

For Beginners: Graph classification predicts labels for entire graphs. This architecture uses multiple GCN layers followed by pooling and classification.

CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Graph Generation model (VGAE encoder).

public static IEnumerable<ILayer<T>> CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 32, int numEncoderLayers = 2)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 32).

numEncoderLayers int

Number of encoder GNN layers (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for graph generation encoder.

Remarks

For Beginners: Graph generation models learn to create new graph structures. This encoder uses GCN layers to map node features to a latent space.

CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T>, int, int, bool, double)

Creates default layers for a Graph Isomorphism Network (GIN).

public static IEnumerable<ILayer<T>> CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T> architecture, int mlpHiddenDim = 64, int numLayers = 5, bool learnEpsilon = true, double initialEpsilon = 0)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

mlpHiddenDim int

Hidden dimension for MLP within GIN layers (default: 64).

numLayers int

Number of GIN layers (default: 5).

learnEpsilon bool

Whether to learn epsilon parameter (default: true).

initialEpsilon double

Initial value for epsilon (default: 0.0).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GIN processing.

Remarks

For Beginners: GIN is provably as powerful as the Weisfeiler-Lehman graph isomorphism test, making it optimal for distinguishing graph structures.

CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T>, SAGEAggregatorType, int, bool)

Creates default layers for a GraphSAGE (Graph Sample and Aggregate) Network.

public static IEnumerable<ILayer<T>> CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T> architecture, SAGEAggregatorType aggregatorType = SAGEAggregatorType.Mean, int numLayers = 2, bool normalize = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

aggregatorType SAGEAggregatorType

The type of aggregation function (default: Mean).

numLayers int

Number of GraphSAGE layers (default: 2).

normalize bool

Whether to apply L2 normalization (default: true).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GraphSAGE processing.

Remarks

For Beginners: GraphSAGE learns to aggregate neighbor information for inductive learning. It can generalize to new, unseen nodes by learning aggregation functions.

CreateDefaultHTMLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates a default Hierarchical Temporal Memory (HTM) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultHTMLayers(NeuralNetworkArchitecture<T> architecture, int columnCount, int cellsPerColumn, double sparsityThreshold)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

columnCount int

The number of columns in the HTM network.

cellsPerColumn int

The number of cells per column.

sparsityThreshold double

The sparsity threshold for the spatial pooler.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for HTM processing.

Remarks

For Beginners: Hierarchical Temporal Memory (HTM) is a machine learning technology that mimics certain structural and algorithmic properties of the neocortex (the part of the brain responsible for higher-order thinking). HTM is particularly good at learning patterns in sequential data and making predictions.

Key HTM concepts:

  • Columns: Vertical arrangements of cells that work together
  • Cells: The basic processing units (like neurons)
  • Sparsity: Only a small percentage of cells are active at any time, which helps with learning
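
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder for constructing a NeuralNetworkArchitecture<double>; this overload has no defaults, so the column/cell/sparsity values below are illustrative, not library defaults):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// 2048 columns of 32 cells each, with 2% spatial-pooler sparsity (illustrative values).
IEnumerable<ILayer<double>> htmLayers = LayerHelper<double>.CreateDefaultHTMLayers(
    architecture,
    columnCount: 2048,
    cellsPerColumn: 32,
    sparsityThreshold: 0.02);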

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Hamiltonian Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Hamiltonian neural network.

Remarks

For Beginners: Hamiltonian Neural Networks (HNNs) learn the energy function (Hamiltonian) of a physical system. The network takes a state vector [q, p] (positions and momenta) as input and outputs a scalar energy value.

Key design choices:

  • Uses Tanh activation in hidden layers for smooth, bounded outputs that help with gradient computation
  • Output layer has linear activation since the Hamiltonian can be any real number
  • Architecture is designed for computing gradients (∂H/∂q, ∂H/∂p) to derive dynamics

The network structure enables Hamilton's equations:

  • dq/dt = ∂H/∂p (velocity from momentum gradient)
  • dp/dt = -∂H/∂q (force from position gradient)

This guarantees energy conservation by construction.
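
For example, for a one-dimensional harmonic oscillator (a standard textbook system, not specific to this library) with H(q, p) = p²/(2m) + ½kq², Hamilton's equations give:

  • dq/dt = ∂H/∂p = p/m
  • dp/dt = -∂H/∂q = -kq

which is exactly simple harmonic motion, with the energy H conserved along every trajectory.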

CreateDefaultInfographicVQALayers(int, int, int, int, int, int, int, int)

Creates default InfographicVQA layers for infographic understanding.

public static IEnumerable<ILayer<T>> CreateDefaultInfographicVQALayers(int imageSize = 1024, int visionDim = 768, int textDim = 768, int fusionDim = 768, int visionLayers = 12, int fusionLayers = 6, int numHeads = 12, int vocabSize = 30522)

Parameters

imageSize int

Input image size (default: 1024).

visionDim int

Vision encoder dimension (default: 768).

textDim int

Text encoder dimension (default: 768).

fusionDim int

Fusion dimension (default: 768).

visionLayers int

Number of vision layers (default: 12).

fusionLayers int

Number of fusion layers (default: 6).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an InfographicVQA model.

CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for an Instructor/E5 (Instruction-Tuned) embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultInternVideo2Layers(int, int, int, int, int, int)

Creates layers for an InternVideo2-style video understanding model.

public static IEnumerable<ILayer<T>> CreateDefaultInternVideo2Layers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numEncoderLayers = 12, int patchSize = 14)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

embedDim int

Embedding dimension (default: 768).

numEncoderLayers int

Number of transformer encoder layers (default: 12).

patchSize int

Patch size for video tokenization (default: 14).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video understanding.

Remarks

For Beginners: InternVideo2 understands video content by encoding frames into embeddings that capture both spatial (what's in each frame) and temporal (how things change over time) information. It can be used for:

  • Video classification (identifying what's happening)
  • Video-text retrieval (finding videos matching descriptions)
  • Video question answering

Architecture (based on the paper):

  1. Patch embedding converts video frames into tokens
  2. Spatial attention processes within-frame relationships
  3. Temporal attention processes across-frame relationships
  4. FFN layers add non-linearity and expressiveness
  5. Projection maps to a shared video-text embedding space

Reference: "InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding" https://arxiv.org/abs/2403.15377

CreateDefaultLSMLayers(NeuralNetworkArchitecture<T>, int, double, double, double, double)

Creates a default configuration of layers for a Liquid State Machine (LSM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultLSMLayers(NeuralNetworkArchitecture<T> architecture, int reservoirSize = 100, double connectionProbability = 0.1, double spectralRadius = 0.9, double inputScaling = 1, double leakingRate = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

reservoirSize int

The size of the reservoir (number of neurons in the reservoir layer). Default is 100.

connectionProbability double

The probability of connection between neurons in the reservoir. Default is 0.1 (10%).

spectralRadius double

Controls the stability of the reservoir dynamics. Default is 0.9.

inputScaling double

Scaling factor for input connections. Default is 1.0.

leakingRate double

Controls how quickly the reservoir responds to new inputs. Default is 1.0.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Liquid State Machine.

Remarks

For Beginners: A Liquid State Machine is a special type of neural network inspired by how the brain processes information. The key component is the "reservoir" - imagine it as a pool of randomly connected neurons that create complex patterns when input is fed into them.

  • The reservoirSize is how many neurons are in this pool
  • The connectionProbability determines how densely connected these neurons are
  • The spectralRadius affects how stable the patterns in the reservoir are
  • The inputScaling controls how strongly the input affects the reservoir
  • The leakingRate determines how quickly the reservoir responds to new information

LSMs are particularly good at processing time-dependent data like speech or video.
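
A minimal call-site sketch, using the documented defaults except for a larger reservoir and a slower leaking rate (BuildArchitecture is a hypothetical placeholder for constructing the architecture):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> lsmLayers = LayerHelper<double>.CreateDefaultLSMLayers(
    architecture,
    reservoirSize: 200,          // more neurons in the reservoir pool
    connectionProbability: 0.1,  // 10% chance any two reservoir neurons connect
    spectralRadius: 0.9,         // keeps reservoir dynamics stable
    inputScaling: 1.0,
    leakingRate: 0.3);           // reservoir responds more slowly to new inputs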

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input shape is not specified or input/output size is not positive.

CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a Long Short-Term Memory (LSTM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for an LSTM neural network.

Remarks

For Beginners: LSTM (Long Short-Term Memory) networks are a special kind of neural network designed to remember information for long periods of time. Think of them like a person with a good memory who can recall things from the past to make decisions in the present.

LSTMs are particularly useful for:

  • Text prediction (like autocomplete on your phone)
  • Speech recognition
  • Time series forecasting (like stock prices or weather)
  • Any task where the order of data matters

Key terms explained:

  • Hidden Size: How much information the network can remember at once (bigger = more memory)
  • Layers: How many processing steps the data goes through (more layers = more complex patterns)
  • Activation Function: How neurons decide whether to fire (like Tanh or Sigmoid)
  • Recurrent Activation: Special activation function used for the memory gates

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input shape is not specified or input/output size is not positive.

CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Lagrangian Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Lagrangian neural network.

Remarks

For Beginners: Lagrangian Neural Networks (LNNs) learn the Lagrangian function L(q, q̇) of a physical system. The Lagrangian is typically L = T - V (kinetic minus potential energy).

Key design choices:

  • Uses Tanh activation in hidden layers for the smooth derivatives needed in the Euler-Lagrange equations
  • Output is scalar (the Lagrangian value)
  • Structure supports computing second derivatives for equations of motion

The Euler-Lagrange equation:

  d/dt(∂L/∂q̇) = ∂L/∂q

This gives the equations of motion while automatically respecting conservation laws.
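
For example, for a one-dimensional harmonic oscillator with L(q, q̇) = ½mq̇² - ½kq² (a standard textbook case, not specific to this library):

  • ∂L/∂q̇ = mq̇, so d/dt(∂L/∂q̇) = mq̈
  • ∂L/∂q = -kq

and the Euler-Lagrange equation yields mq̈ = -kq, the familiar equation of simple harmonic motion.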

CreateDefaultLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates a standard feed-forward neural network with configurable hidden layers.

public static IEnumerable<ILayer<T>> CreateDefaultLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 1, int hiddenLayerSize = 64, int outputSize = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 1).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

outputSize int

Number of output neurons (default: 1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a feed-forward neural network.

Remarks

For Beginners: A feed-forward neural network is the simplest type of neural network where information flows in one direction from input to output. Think of it as an assembly line where each layer processes the data and passes it to the next layer.

This method creates:

  • An input layer that takes your data
  • One or more hidden layers that learn patterns in your data
  • An output layer that produces the final prediction
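
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder; the parameter names come from this API and the argument values are illustrative):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// Two hidden layers of 128 neurons each, with a single output neuron.
IEnumerable<ILayer<double>> layers = LayerHelper<double>.CreateDefaultLayers(
    architecture,
    hiddenLayerCount: 2,
    hiddenLayerSize: 128,
    outputSize: 1);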

CreateDefaultLayoutGraphLayers(int, int, int, int)

Creates default LayoutGraph layers for graph-based layout analysis.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutGraphLayers(int inputDim = 768, int hiddenDim = 256, int numGraphLayers = 4, int numClasses = 7)

Parameters

inputDim int

Input feature dimension (default: 768).

hiddenDim int

Hidden dimension (default: 256).

numGraphLayers int

Number of graph layers (default: 4).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutGraph model.

CreateDefaultLayoutLMLayers(int, int, int, int, int, int)

Creates default LayoutLM (v1) layers for document understanding with layout-aware pre-training.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int maxSequenceLength = 512, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768 for BERT-base).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522 for BERT).

maxSequenceLength int

Maximum sequence length (default: 512).

numClasses int

Number of output classes (default: 7 for FUNSD).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLM model.

Remarks

LayoutLM v1 combines BERT text embeddings with 2D position embeddings to jointly model text and layout. Unlike v2/v3, it does NOT use visual features.

Reference: "LayoutLM: Pre-training of Text and Layout for Document Image Understanding" (KDD 2020) https://arxiv.org/abs/1912.13318

CreateDefaultLayoutLMv2Layers(int, int, int, int, int, int, int)

Creates default LayoutLMv2 layers for document understanding with visual features.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv2Layers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768 for BERT-base).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522 for BERT).

imageSize int

Input image size (default: 224).

visualBackboneChannels int

Visual backbone output channels (default: 256).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLMv2 model.

Remarks

LayoutLMv2 extends LayoutLM by adding visual features from a ResNeXt-FPN backbone, enabling the model to understand documents through text, layout, AND image features.

Key components:

  • Visual backbone (ResNeXt-101 with FPN)
  • Text encoder (BERT-base)
  • Spatial-aware self-attention mechanism

Reference: "LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding" (ACL 2021) https://arxiv.org/abs/2012.14740

CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int)

Creates default LayoutLMv3 layers for document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 50265, int imageSize = 224, int patchSize = 16, int numClasses = 17)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenDim int

Hidden dimension size (default: 768 from paper).

numLayers int

Number of transformer layers (default: 12 from paper).

numHeads int

Number of attention heads (default: 12 from paper).

vocabSize int

Vocabulary size (default: 50265 for RoBERTa tokenizer).

imageSize int

Input image size (default: 224).

patchSize int

Vision patch size (default: 16).

numClasses int

Number of output classes (default: 17 for layout detection).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLMv3 architecture.

Remarks

LayoutLMv3 uses unified multimodal pre-training with:

  • Text embedding layer (RoBERTa-style)
  • Image patch embedding (ViT-style)
  • Transformer encoder with spatial-aware self-attention
  • Classification head for layout detection or other tasks

Reference: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" (ICCV 2022)

CreateDefaultLayoutXLMLayers(int, int, int, int, int, int, int)

Creates default LayoutXLM layers for multilingual document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutXLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 250002, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 250002 for XLM-RoBERTa).

imageSize int

Input image size (default: 224).

visualBackboneChannels int

Visual backbone channels (default: 256).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutXLM model.

CreateDefaultLiLTLayers(int, int, int, int, int, int)

Creates default LiLT (Language-Independent Layout Transformer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultLiLTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int layoutDim = 768, int vocabSize = 30522, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

layoutDim int

Layout embedding dimension (default: 768).

vocabSize int

Vocabulary size (default: 30522).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LiLT model.

CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T>, int, int, int, double)

Creates default layers for a Link Prediction model encoder.

public static IEnumerable<ILayer<T>> CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 32, int numLayers = 2, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

embeddingDim int

Node embedding dimension (default: 32).

numLayers int

Number of GNN layers (default: 2).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for link prediction.

Remarks

For Beginners: Link prediction predicts whether edges should exist between nodes. This encoder learns node embeddings that can be combined to score potential edges.

CreateDefaultMATCHALayers(int, int, int, int, int, int, int)

Creates default MATCHA (chart understanding) layers.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultMATCHALayers(int encoderDim = 1536, int decoderDim = 1536, int encoderLayers = 18, int decoderLayers = 18, int numHeads = 24, int vocabSize = 50265, int maxPatchesPerImage = 4096)

Parameters

encoderDim int

Encoder dimension (default: 1536).

decoderDim int

Decoder dimension (default: 1536).

encoderLayers int

Number of encoder layers (default: 18).

decoderLayers int

Number of decoder layers (default: 18).

numHeads int

Number of attention heads (default: 24).

vocabSize int

Vocabulary size (default: 50265).

maxPatchesPerImage int

Maximum patches per image (default: 4096).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Encoder and decoder layers for a MATCHA model.

CreateDefaultMRLLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a Matryoshka Representation Learning (MRL) model.

public static IEnumerable<ILayer<T>> CreateDefaultMRLLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int maxEmbeddingDimension = 1536, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
maxEmbeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T>, int, int)

Creates a default Memory Network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T> architecture, int memorySize, int embeddingSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

memorySize int

The size of the memory component (number of memory slots).

embeddingSize int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Memory Network.

Remarks

For Beginners: A Memory Network is a type of neural network that has an explicit memory component. Think of it like a notebook that the network can write to and read from while processing information. This makes it particularly good at tasks that require remembering context from earlier in a sequence, such as answering questions about a story or maintaining a conversation.

The memory size parameter controls how many "pages" are in the notebook, while the embedding size determines how detailed each "note" can be.
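
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder; memorySize and embeddingSize have no defaults, so the values below are illustrative):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> memoryLayers = LayerHelper<double>.CreateDefaultMemoryNetworkLayers(
    architecture,
    memorySize: 64,       // 64 memory slots ("pages" in the notebook)
    embeddingSize: 128);  // each "note" is a 128-dimensional vector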

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T>, int, int[]?, int[]?, int[]?, int, bool, double, bool)

Creates default layers for a MeshCNN architecture for mesh classification/segmentation.

public static IEnumerable<ILayer<T>> CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 5, int[]? convChannels = null, int[]? poolTargets = null, int[]? fcSizes = null, int numNeighbors = 4, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = false)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

inputFeatures int

Number of input features per edge. Default is 5.

convChannels int[]

Channel sizes for each edge convolution block.

poolTargets int[]

Target edge counts after each pooling operation.

fcSizes int[]

Sizes of fully connected layers before output.

numNeighbors int

Number of neighboring edges per edge. Default is 4.

useBatchNorm bool

Whether to use batch normalization. Default is true.

dropoutRate double

Dropout rate for regularization. Default is 0.5.

useGlobalAveragePooling bool

Whether to use global average pooling. Default is false (max pooling).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for mesh processing.

Remarks

For Beginners: MeshCNN processes 3D mesh data by learning from edge features.

The architecture consists of:

  • Edge convolution blocks: Learn patterns from edge neighborhoods
  • Mesh pooling: Simplify the mesh by removing less important edges
  • Global pooling: Aggregate all edge features into a fixed-size vector
  • Fully connected layers: Map aggregated features to class predictions

Applications include:

  • 3D shape classification from mesh data
  • Mesh segmentation (labeling different parts)
  • Learning from CAD models and 3D scans

Exceptions

InvalidOperationException

Thrown when the architecture has invalid output size.

CreateDefaultMiDaSLayers(int, int, int, int, int)

Creates default layers for MiDaS depth estimation.

public static IEnumerable<ILayer<T>> CreateDefaultMiDaSLayers(int inputChannels = 3, int inputHeight = 384, int inputWidth = 384, int embedDim = 768, int numEncoderLayers = 12)

Parameters

inputChannels int
inputHeight int
inputWidth int
embedDim int
numEncoderLayers int

Returns

IEnumerable<ILayer<T>>

CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T>, MobileNetV2Configuration)

Creates default layers for a MobileNetV2 network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV2Configuration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration MobileNetV2Configuration

The MobileNetV2-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MobileNetV2 network.

Remarks

For Beginners: MobileNetV2 is designed for efficient mobile inference, using inverted residual blocks with linear bottlenecks to achieve high accuracy with low computational cost.

CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T>, MobileNetV3Configuration)

Creates default layers for a MobileNetV3 network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV3Configuration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration MobileNetV3Configuration

The MobileNetV3-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MobileNetV3 network.

Remarks

For Beginners: MobileNetV3 builds on MobileNetV2 with additional optimizations including squeeze-and-excitation blocks and hard-swish activation for improved accuracy and efficiency.

CreateDefaultMusicGenLayers(int, int, int, int, int, int, int, int, double)

Creates default MusicGen layers for text-to-music generation.

public static IEnumerable<ILayer<T>> CreateDefaultMusicGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 2048, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for T5-base).

lmHiddenDim int

Language model hidden dimension (default: 1536).

numLmLayers int

Number of language model transformer layers (default: 24).

numHeads int

Number of attention heads (default: 16).

numCodebooks int

Number of EnCodec codebooks (default: 4).

codebookSize int

Size of each codebook vocabulary (default: 2048).

maxTextLength int

Maximum text sequence length (default: 256).

maxAudioTokens int

Maximum number of audio tokens, at roughly 50 tokens per second (default: 1500, about 30 seconds of audio).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MusicGen model.

Remarks

MusicGen is Meta's text-to-music generation model that uses a single-stage transformer language model operating over EnCodec audio codes. Key features:

  • Delay pattern for codebook interleaving (reduces sequence length)
  • T5-based text encoder for conditioning
  • Transformer decoder generating audio codes autoregressively
  • EnCodec neural audio codec for high-quality audio reconstruction

Reference: "Simple and Controllable Music Generation" by Copet et al., 2023

CreateDefaultNTMLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates a default configuration of layers for a Neural Turing Machine (NTM).

public static IEnumerable<ILayer<T>> CreateDefaultNTMLayers(NeuralNetworkArchitecture<T> architecture, int memorySize = 128, int memoryVectorSize = 20, int controllerSize = 100)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

memorySize int

The number of memory locations (default: 128).

memoryVectorSize int

The size of each memory vector (default: 20).

controllerSize int

The size of the controller network (default: 100).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Neural Turing Machine.

Remarks

For Beginners: A Neural Turing Machine (NTM) is a type of neural network that has an external memory component, similar to how computers have RAM. The network learns to read from and write to this memory, which helps it solve tasks that require remembering information over long periods.

  • memorySize: How many "slots" are in the memory (like pages in a notebook)
  • memoryVectorSize: How much information each memory slot can hold
  • controllerSize: How complex the "brain" of the network is that decides what to read/write
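
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder for constructing the architecture):

NeuralNetworkArchitecture<float> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<float>> ntmLayers = LayerHelper<float>.CreateDefaultNTMLayers(
    architecture,
    memorySize: 128,       // 128 memory locations
    memoryVectorSize: 20,  // each location stores a 20-element vector
    controllerSize: 100);  // controller network width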

Exceptions

ArgumentNullException

Thrown when architecture is null.

ArgumentException

Thrown when memory parameters are not positive.

CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a standard neural network.

public static IEnumerable<ILayer<T>> CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a standard neural network.

Remarks

For Beginners: This method creates the basic building blocks (layers) of a neural network. Think of layers as a series of connected processing units that transform your input data step by step until it produces the desired output. The complexity parameter in the architecture determines how many layers and neurons your network will have: Simple networks have fewer layers, while Deep networks have more layers for handling more complex problems.

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input size or output size is not positive.

CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates default layers for a Node Classification model.

public static IEnumerable<ILayer<T>> CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int numLayers = 2, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

numLayers int

Number of GNN layers (default: 2).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for node classification.

Remarks

For Beginners: Node classification predicts labels for individual nodes in a graph. This architecture uses GCN layers with dropout for semi-supervised learning on graphs.
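
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> nodeLayers = LayerHelper<double>.CreateDefaultNodeClassificationLayers(
    architecture,
    hiddenDim: 64,
    numLayers: 2,
    dropoutRate: 0.5);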

CreateDefaultNougatLayers(int, int, int, int, int, int, int, int)

Creates default Nougat layers for academic document understanding.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultNougatLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 10, int numHeads = 16, int vocabSize = 50000, int imageSize = 896, int patchSize = 16, int maxSequenceLength = 4096)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 10).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

imageSize int

Input image size (default: 896).

patchSize int

Patch size (default: 16).

maxSequenceLength int

Maximum sequence length (default: 4096).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "Nougat: Neural Optical Understanding for Academic Documents" (arXiv 2023)

CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T>)

Creates default layers for an occupancy detection neural network without temporal data.

public static IEnumerable<ILayer<T>> CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a non-temporal occupancy detection network.

Remarks

For Beginners: This method builds a simpler neural network for detecting occupancy (whether a space is occupied by people) using data from a single point in time, rather than a sequence of time points. It uses standard Dense layers (also called fully connected layers) to process the input features.

Non-temporal data means the model makes predictions based only on current data points without considering how values have changed over time. For example, using the current temperature, humidity, and CO2 levels to predict occupancy without looking at historical values.

CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for an occupancy detection neural network with temporal data.

public static IEnumerable<ILayer<T>> CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T> architecture, int historyWindowSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

historyWindowSize int

The number of time steps to consider in the temporal data (how many past observations to include).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a temporal occupancy detection network.

Remarks

For Beginners: This method builds a neural network specifically designed to detect occupancy (whether a space is occupied by people) using data that changes over time. It uses special layer types like LSTM (Long Short-Term Memory) that can "remember" patterns in sequential data, and attention mechanisms that help the network focus on the most important time steps in the data sequence.

Temporal data refers to data collected over time, where the sequence and patterns across time points are important for making predictions. For example, sensor readings collected every minute over several hours would be temporal data.

CreateDefaultOpticalFlowLayers(int, int, int, int)

Creates layers for an optical flow estimation model (RAFT-style).

public static IEnumerable<ILayer<T>> CreateDefaultOpticalFlowLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int hiddenDim = 192)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

hiddenDim int

Hidden dimension for flow estimation (default: 192).

Returns

IEnumerable<ILayer<T>>

A collection of layers for optical flow estimation.

Remarks

For Beginners: Optical flow tells you how each pixel moves between two frames. This is useful for motion analysis, video editing, and as input to other models. The output is a 2-channel tensor showing horizontal and vertical motion.

Architecture:

  1. Feature encoder extracts features from both frames
  2. Correlation volume computes matching scores
  3. Iterative refinement improves the flow estimate

CreateDefaultPICKLayers(int, int, int, int, int, int)

Creates default PICK layers for key information extraction.

public static IEnumerable<ILayer<T>> CreateDefaultPICKLayers(int hiddenDim = 256, int numGcnLayers = 2, int numHeads = 8, int vocabSize = 30522, int numEntityTypes = 14, int maxSequenceLength = 512)

Parameters

hiddenDim int

Hidden dimension (default: 256).

numGcnLayers int

Number of GCN layers (default: 2).

numHeads int

Number of attention heads (default: 8).

vocabSize int

Vocabulary size (default: 30522).

numEntityTypes int

Number of entity types (default: 14).

maxSequenceLength int

Maximum sequence length (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PICK model.

Remarks

Reference: "PICK: Processing Key Information Extraction" (ICPR 2020)

CreateDefaultPINNLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Physics-Informed Neural Network (PINN).

public static IEnumerable<ILayer<T>> CreateDefaultPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 32).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PINN.

Remarks

For Beginners: Physics-Informed Neural Networks (PINNs) solve PDEs by training a neural network to minimize the PDE residual at collocation points. The network learns the solution function u(x,t) while respecting the physics (PDE, boundary conditions, and initial conditions).

Tanh activation is used for the smooth derivatives needed when computing PDE residuals; multiple hidden layers capture complex solution behavior; and the output layer is linear since PDE solutions can take any real value.
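
A minimal call-site sketch using the documented defaults of 4 hidden layers with 32 neurons each (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> pinnLayers = LayerHelper<double>.CreateDefaultPINNLayers(
    architecture,
    hiddenLayerCount: 4,
    hiddenLayerSize: 32);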

CreateDefaultPSENetLayers(int, int, int, int)

Creates default PSENet (Progressive Scale Expansion Network) layers.

public static IEnumerable<ILayer<T>> CreateDefaultPSENetLayers(int imageSize = 640, int backboneChannels = 256, int featureChannels = 256, int numKernels = 7)

Parameters

imageSize int

Input image size (default: 640).

backboneChannels int

Backbone channels (default: 256).

featureChannels int

Feature channels (default: 256).

numKernels int

Number of scale kernels (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PSENet model.

CreateDefaultPix2StructLayers(int, int, int, int, int, int, int, int)

Creates default Pix2Struct layers for screenshot parsing.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultPix2StructLayers(int hiddenDim = 1024, int numEncoderLayers = 18, int numDecoderLayers = 18, int numHeads = 16, int vocabSize = 50000, int patchSize = 16, int maxPatches = 4096, int maxSequenceLength = 1024)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 18).

numDecoderLayers int

Number of decoder layers (default: 18).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

patchSize int

Patch size (default: 16).

maxPatches int

Maximum patches (default: 4096).

maxSequenceLength int

Maximum sequence length (default: 1024).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "Pix2Struct: Screenshot Parsing as Pretraining" (ICML 2023)

CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T>, int)

Creates a default configuration of layers for a Quantum Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T> architecture, int numQubits = 4)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

numQubits int

The number of qubits to use in quantum layers (default: 4).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Quantum Neural Network.

Remarks

For Beginners: A Quantum Neural Network combines quantum computing concepts with neural networks. Think of qubits as special units that can exist in multiple states at once (unlike regular bits that are either 0 or 1). This gives quantum networks potential advantages for certain problems. The numQubits parameter controls how many of these special quantum units are used in each quantum layer.

Exceptions

ArgumentNullException

Thrown when architecture is null.

ArgumentException

Thrown when numQubits is not positive.

CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T>, int, IRadialBasisFunction<T>?)

Creates a default Radial Basis Function (RBF) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 0, IRadialBasisFunction<T>? rbfFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenSize int

The size of the hidden layer. If set to 0 or negative, a default size will be calculated.

rbfFunction IRadialBasisFunction<T>

The radial basis function to use. If null, a default Gaussian RBF will be used.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for RBF network processing.

Remarks

For Beginners: A Radial Basis Function (RBF) Network is a special type of neural network that uses "distance" to make predictions. Instead of gradually learning patterns through weights like standard neural networks, RBF networks measure how similar or different an input is from known examples.

Think of it like this: if you want to identify a fruit, you might compare how similar it looks to fruits you already know. An RBF network works in a similar way - it has "reference points" and measures how close new data is to these points.

RBF networks are particularly good at function approximation, pattern recognition, and time series prediction.
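
A minimal call-site sketch showing the documented fallback behavior (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// hiddenSize of 0 lets the helper calculate a default size; a null rbfFunction
// falls back to the default Gaussian RBF.
IEnumerable<ILayer<double>> rbfLayers = LayerHelper<double>.CreateDefaultRBFNetworkLayers(
    architecture,
    hiddenSize: 0,
    rbfFunction: null);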

CreateDefaultRNNLayers(NeuralNetworkArchitecture<T>)

Creates a default Recurrent Neural Network (RNN) layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultRNNLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for RNN-based processing.

Remarks

For Beginners: A Recurrent Neural Network (RNN) is designed to work with sequential data by maintaining a form of "memory" of previous inputs. Unlike standard neural networks, RNNs can use their internal state to process sequences of inputs, making them ideal for tasks like text analysis, speech recognition, or time series prediction.

This method automatically configures appropriate RNN layers with sensible defaults, including hidden layer sizes and activation functions.

CreateDefaultRVMLayers(int, int, int, int)

Creates default layers for RVM (Robust Video Matting).

public static IEnumerable<ILayer<T>> CreateDefaultRVMLayers(int inputChannels = 3, int inputHeight = 512, int inputWidth = 512, int numFeatures = 32)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int

Returns

IEnumerable<ILayer<T>>

CreateDefaultResNetLayers(NeuralNetworkArchitecture<T>, int, int)

Creates a Residual Neural Network (ResNet) with configurable blocks.

public static IEnumerable<ILayer<T>> CreateDefaultResNetLayers(NeuralNetworkArchitecture<T> architecture, int blockCount = 3, int blockSize = 2)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

blockCount int

Number of residual blocks (default: 3).

blockSize int

Number of convolutional layers in each block (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a ResNet.

Remarks

For Beginners: A Residual Network (ResNet) is designed to solve the "vanishing gradient problem" that occurs when training very deep networks. It does this by adding "skip connections" that allow information to bypass some layers.

Think of it like this: In a traditional network, each layer must learn everything from scratch. In a ResNet, each layer only needs to learn the "difference" (or residual) between its input and the desired output, which is often easier to learn.

Key components:

  • Initial convolutional layer: Processes the raw input
  • Residual blocks: Groups of layers with skip connections
  • Global pooling: Reduces the spatial dimensions to a single value per feature map
  • Final dense layer: Makes the prediction based on the extracted features

CreateDefaultSAM2Layers(int, int, int, int)

Creates all SAM2 layers for backward compatibility.

[Obsolete("Use individual SAM2 factory methods (CreateSAM2ImageEncoderLayers, etc.) for proper multi-branch architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSAM2Layers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int

Returns

IEnumerable<ILayer<T>>

Remarks

Warning: This method returns layers from multiple branches that cannot be chained sequentially. Use the individual factory methods (CreateSAM2ImageEncoderLayers, CreateSAM2PromptEncoderLayers, CreateSAM2MemoryLayers, CreateSAM2MaskDecoderLayers) for proper multi-branch handling.

CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for an SGPT (Sentence GPT) decoder-only embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 50257, int embeddingDimension = 768, int maxSequenceLength = 1024, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a SPLADE (Sparse Lexical and Expansion Model) embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSVTRLayers(int, int, int, int, int, int)

Creates default SVTR (Scene Text Visual Transformer Recognizer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultSVTRLayers(int imageWidth = 256, int imageHeight = 64, int hiddenDim = 192, int numLayers = 8, int numHeads = 6, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 256).

imageHeight int

Input image height (default: 64).

hiddenDim int

Hidden dimension (default: 192).

numLayers int

Number of transformer layers (default: 8).

numHeads int

Number of attention heads (default: 6).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an SVTR model.

CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a Siamese neural network using a Transformer-based encoder.

public static IEnumerable<ILayer<T>> CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary (default: 30522).

embeddingDimension int

The dimension of the embedding vectors (default: 768).

maxSequenceLength int

The maximum length of input sequences (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Siamese encoder.

Remarks

For Beginners: A Siamese Network uses two identical "twin" networks to process different inputs. This method sets up the structure for one of those twins, typically using a Transformer encoder to turn text into an embedding (a point in a vector space) that can be compared to others.

CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a SimCSE (Simple Contrastive Learning of Sentence Embeddings) model.

public static IEnumerable<ILayer<T>> CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSlowFastLayers(int, int, int, int, int, int, int)

Creates all SlowFast layers for backward compatibility (returns only the slow pathway).

[Obsolete("Use individual SlowFast factory methods for proper dual-pathway architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSlowFastLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numClasses = 400, int slowChannels = 64, int fastChannels = 8, int alpha = 8)

Parameters

inputChannels int
inputHeight int
inputWidth int
numClasses int
slowChannels int
fastChannels int
alpha int

Returns

IEnumerable<ILayer<T>>

Remarks

Warning: SlowFast is a dual-pathway architecture that cannot be represented as a single sequential layer list. Use the individual factory methods:

  • CreateSlowFastSlowPathwayLayers
  • CreateSlowFastFastPathwayLayers
  • CreateSlowFastFusionLayers

CreateDefaultSourceSeparationLayers(int, int, int, int, double)

Creates default music source separation layers (U-Net style).

public static IEnumerable<ILayer<T>> CreateDefaultSourceSeparationLayers(int numMels = 513, int baseChannels = 32, int numSources = 4, int maxFrames = 512, double dropoutRate = 0.1)

Parameters

numMels int

Number of spectrogram frequency bins (default: 513 for STFT with 1024 window).

baseChannels int

Base channel count for U-Net (default: 32).

numSources int

Number of output sources (default: 4 for vocals, drums, bass, other).

maxFrames int

Maximum time frames (default: 512).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers for music source separation.

Remarks

U-Net inspired architecture for source separation with:

  • Encoder path with downsampling
  • Bottleneck with attention
  • Decoder path with upsampling and skip connections
  • Multi-source mask prediction

Reference: "Open-Unmix - A Reference Implementation for Music Source Separation"

CreateDefaultSpeakerEmbeddingLayers(int, int, int, int, int, double)

Creates default speaker embedding layers for speaker verification and identification.

public static IEnumerable<ILayer<T>> CreateDefaultSpeakerEmbeddingLayers(int numMels = 80, int hiddenDim = 512, int embeddingDim = 256, int numLayers = 3, int maxFrames = 500, double dropoutRate = 0.1)

Parameters

numMels int

Number of mel spectrogram bins (default: 80).

hiddenDim int

Hidden layer dimension (default: 512).

embeddingDim int

Output embedding dimension (default: 256).

numLayers int

Number of LSTM-like layers (default: 3).

maxFrames int

Maximum input frames (default: 500).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers for speaker embedding extraction.

Remarks

ECAPA-TDNN inspired architecture for speaker embedding with:

  • Frame-level feature extraction with attention
  • Temporal context aggregation
  • Attentive statistics pooling
  • Speaker embedding projection

Reference: "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN"

CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T>, SpikingNeuronType, double, double, bool, bool)

Creates default layers for a Spiking Neural Network (SNN).

public static IEnumerable<ILayer<T>> CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T> architecture, SpikingNeuronType neuronType = SpikingNeuronType.LeakyIntegrateAndFire, double tau = 10, double refractoryPeriod = 2, bool useLayerNormalization = false, bool useOutputConversion = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

neuronType SpikingNeuronType

The type of spiking neuron to use.

tau double

The membrane time constant that controls how quickly neurons respond to inputs.

refractoryPeriod double

The period after firing during which a neuron cannot fire again.

useLayerNormalization bool

Whether to use layer normalization to stabilize training.

useOutputConversion bool

Whether to convert spike outputs to continuous values.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Spiking Neural Network.

Remarks

For Beginners: Spiking Neural Networks (SNNs) are a type of neural network that more closely mimics how real neurons in the brain work. Unlike traditional neural networks that use continuous values, SNNs use "spikes" (binary on/off signals) to communicate between neurons. This makes them more biologically realistic and potentially more energy-efficient for certain tasks.

The tau parameter controls how quickly a neuron "forgets" previous inputs - larger values make the neuron remember inputs for longer. The refractory period is like a "rest time" after a neuron fires, during which it cannot fire again, similar to how real neurons behave.
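
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> snnLayers = LayerHelper<double>.CreateDefaultSpikingLayers(
    architecture,
    neuronType: SpikingNeuronType.LeakyIntegrateAndFire,
    tau: 10,                      // membrane time constant
    refractoryPeriod: 2,          // rest time after a spike
    useLayerNormalization: false,
    useOutputConversion: true);   // convert spikes back to continuous values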

CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T>, int, int, int[]?, double[]?, int[]?, bool, double, bool)

Creates the default layer sequence for a SpiralNet mesh neural network.

public static IEnumerable<ILayer<T>> CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 3, int spiralLength = 9, int[]? convChannels = null, double[]? poolRatios = null, int[]? fcSizes = null, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

inputFeatures int

Number of input features per vertex (default: 3 for coordinates).

spiralLength int

Length of spiral sequences for convolutions.

convChannels int[]

Channel sizes for each spiral convolution block.

poolRatios double[]

Pooling ratios for mesh simplification at each level.

fcSizes int[]

Sizes of fully connected layers before output.

useBatchNorm bool

Whether to use batch normalization after convolutions.

dropoutRate double

Dropout rate for fully connected layers.

useGlobalAveragePooling bool

Whether to use global average (true) or max (false) pooling.

Returns

IEnumerable<ILayer<T>>

An enumerable of layers forming the SpiralNet architecture.

Remarks

For Beginners: This method builds the default layer stack for SpiralNet++.

Architecture pattern:

  • Multiple spiral convolution blocks (SpiralConv + optional BatchNorm)
  • Global pooling to aggregate vertex features
  • Fully connected layers for classification

Applications:

  • 3D face recognition and reconstruction
  • Human body shape analysis
  • Medical mesh analysis

Exceptions

InvalidOperationException

Thrown when the architecture has invalid output size.

CreateDefaultStableAudioLayers(int, int, int, int, int, int, int, double)

Creates default Stable Audio layers for text-to-audio generation.

public static IEnumerable<ILayer<T>> CreateDefaultStableAudioLayers(int textHiddenDim = 768, int latentDim = 64, int ditHiddenDim = 1024, int numDitBlocks = 24, int numHeads = 16, int maxTextLength = 512, int maxAudioLength = 2048, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768).

latentDim int

Latent space dimension (default: 64).

ditHiddenDim int

DiT hidden dimension (default: 1024).

numDitBlocks int

Number of DiT transformer blocks (default: 24).

numHeads int

Number of attention heads (default: 16).

maxTextLength int

Maximum text sequence length (default: 512).

maxAudioLength int

Maximum audio latent sequence length (default: 2048).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Stable Audio model.

Remarks

Stable Audio by Stability AI uses a Diffusion Transformer (DiT) architecture:

  • T5-based text encoder for conditioning
  • Variational autoencoder for audio latent compression
  • DiT (Diffusion Transformer) for denoising in latent space
  • Supports variable-length audio generation with timing conditioning

Reference: "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion" by Evans et al., 2024

CreateDefaultTRIELayers(int, int, int, int, int, int)

Creates default TRIE (Text Reading and Information Extraction) layers.

public static IEnumerable<ILayer<T>> CreateDefaultTRIELayers(int imageSize = 512, int visualDim = 256, int textDim = 256, int graphDim = 256, int numEntityTypes = 10, int maxEntities = 100)

Parameters

imageSize int

Input image size (default: 512).

visualDim int

Visual encoder dimension (default: 256).

textDim int

Text encoder dimension (default: 256).

graphDim int

Graph dimension (default: 256).

numEntityTypes int

Number of entity types (default: 10).

maxEntities int

Maximum entities (default: 100).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TRIE model.

CreateDefaultTableTransformerLayers(int, int, int, int, int, int, int)

Creates default layers for TableTransformer model.

public static IEnumerable<ILayer<T>> CreateDefaultTableTransformerLayers(int imageSize = 800, int hiddenDim = 256, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numQueries = 100, int numStructureClasses = 7)

Parameters

imageSize int

Input image size (default: 800).

hiddenDim int

Transformer hidden dimension (default: 256).

numEncoderLayers int

Number of encoder layers (default: 6).

numDecoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 8).

numQueries int

Number of object queries (default: 100).

numStructureClasses int

Number of structure classes (default: 7).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for TableTransformer.

Remarks

TableTransformer uses a DETR-style architecture with a ResNet backbone.

Reference: "PubTables-1M: Towards Comprehensive Table Extraction" (CVPR 2022)

CreateDefaultTimeSformerLayers(int, int, int, int, int, int, int)

Creates default layers for TimeSformer video classification.

public static IEnumerable<ILayer<T>> CreateDefaultTimeSformerLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numLayers = 12, int patchSize = 16, int numClasses = 400)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

embedDim int

Embedding dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

patchSize int

Patch size (default: 16).

numClasses int

Number of action classes (default: 400 for Kinetics).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TimeSformer model.

CreateDefaultTrOCRLayers(int, int, int, int, int, int, int, int, int, int)

Creates default layers for TrOCR text recognition model.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultTrOCRLayers(int imageSize = 384, int patchSize = 16, int encoderHiddenDim = 768, int decoderHiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 6, int numEncoderHeads = 12, int numDecoderHeads = 12, int vocabSize = 50265, int maxSequenceLength = 128)

Parameters

imageSize int

Input image size (default: 384).

patchSize int

ViT patch size (default: 16).

encoderHiddenDim int

Encoder hidden dimension (default: 768).

decoderHiddenDim int

Decoder hidden dimension (default: 768).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 6).

numEncoderHeads int

Number of encoder heads (default: 12).

numDecoderHeads int

Number of decoder heads (default: 12).

vocabSize int

Vocabulary size (default: 50265).

maxSequenceLength int

Maximum sequence length (default: 128).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

TrOCR uses a Vision Transformer (ViT) encoder and a Transformer decoder.

Reference: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" (AAAI 2022)

CreateDefaultTransformerLayers(TransformerArchitecture<T>)

Creates a default Transformer neural network with pre-configured encoder and decoder layers.

public static IEnumerable<ILayer<T>> CreateDefaultTransformerLayers(TransformerArchitecture<T> architecture)

Parameters

architecture TransformerArchitecture<T>

The transformer architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Transformer neural network.

Remarks

For Beginners: A Transformer is a powerful type of neural network especially good at processing sequences like text or time series data. Unlike older networks, Transformers can look at all parts of the input at once (using "attention") rather than processing it step by step. This makes them excellent for tasks like translation, text generation, and understanding language.

Key concepts:

  • Attention: Allows the model to focus on relevant parts of the input regardless of position
  • Multi-head attention: Lets the model focus on different aspects of the input simultaneously
  • Encoder: Processes the input sequence
  • Decoder: Generates the output sequence
  • Positional encoding: Helps the model understand the order of elements in a sequence

CreateDefaultTtsLayers(int, int, int, int, int, int, int, int, int, double)

Creates default TTS (Text-to-Speech) layers for speech synthesis.

public static IEnumerable<ILayer<T>> CreateDefaultTtsLayers(int textHiddenDim = 256, int audioHiddenDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numMels = 80, int maxTextLength = 512, int maxMelFrames = 1000, int vocabSize = 148, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 256).

audioHiddenDim int

Audio decoder hidden dimension (default: 512).

numEncoderLayers int

Number of encoder transformer layers (default: 6).

numDecoderLayers int

Number of decoder transformer layers (default: 6).

numHeads int

Number of attention heads (default: 8).

numMels int

Number of mel spectrogram bins (default: 80).

maxTextLength int

Maximum input text length (default: 512).

maxMelFrames int

Maximum mel spectrogram frames (default: 1000).

vocabSize int

Phoneme/character vocabulary size (default: 148).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TTS encoder-decoder architecture.

Remarks

TTS architecture with:

  • Character/phoneme embedding with positional encoding
  • Transformer encoder for text representation
  • Transformer decoder with cross-attention for mel generation
  • Post-net convolutional refinement (simulated with dense layers)

Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2)

CreateDefaultUDOPLayers(int, int, int, int, int, int, int)

Creates default UDOP layers for unified document processing.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultUDOPLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 16, int vocabSize = 50000, int imageSize = 224, int maxSequenceLength = 2048)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 12).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

imageSize int

Input image size (default: 224).

maxSequenceLength int

Maximum sequence length (default: 2048).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "UDOP: Unifying Vision, Text, and Layout" (CVPR 2023)

CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a 3D U-Net architecture for volumetric segmentation.

public static IEnumerable<ILayer<T>> CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numEncoderBlocks = 4, int baseFilters = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

voxelResolution int

The resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.

numEncoderBlocks int

The number of encoder blocks. Default is 4.

baseFilters int

The number of filters in the first convolutional layer. Doubles with each block. Default is 32.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for 3D volumetric segmentation.

Remarks

For Beginners: A 3D U-Net is like a specialized 3D image processor that can identify different parts of a 3D volume (like organs in a CT scan or objects in a point cloud).

The U-shape architecture:

  • Encoder: Progressively downsamples to capture context (like zooming out)
  • Bottleneck: Smallest representation capturing global features
  • Decoder: Progressively upsamples to restore resolution (like zooming in)
  • Skip connections: Link encoder to decoder to preserve fine details

Applications include:

  • 3D semantic segmentation of point clouds
  • Medical image segmentation (organs, tumors in CT/MRI)
  • Part segmentation of 3D shapes

Exceptions

InvalidOperationException

Thrown when the architecture has invalid dimensions.

CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Universal Differential Equation (UDE) network.

public static IEnumerable<ILayer<T>> CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 2).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 32).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a UDE neural network component.

Remarks

For Beginners: Universal Differential Equations combine known physics with neural networks. The neural network learns the unknown parts of the dynamics while known physics equations are added explicitly. This is perfect for scientific applications where you know some of the physics but not all of it.

The network takes [state, time] as input and outputs the learned correction to the dynamics. It uses Tanh activations for the smooth derivatives needed in ODE integration, and a linear (identity) output activation since corrections can be positive or negative.
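Example: a minimal sketch of wiring the factory into application code (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the hidden-layer values are illustrative):

static void BuildUdeCorrection(NeuralNetworkArchitecture<float> architecture)
{
    // Slightly wider and deeper than the 2 x 32 defaults, e.g. for stiffer dynamics.
    var layers = LayerHelper<float>.CreateDefaultUniversalDELayers(
        architecture, hiddenLayerCount: 3, hiddenLayerSize: 64);
    // ... pass 'layers' to the ODE integration wrapper ...
}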

CreateDefaultVAELayers(NeuralNetworkArchitecture<T>, int)

Creates a default Variational Autoencoder (VAE) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultVAELayers(NeuralNetworkArchitecture<T> architecture, int latentSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

latentSize int

The size of the latent space dimension.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Variational Autoencoder.

Remarks

For Beginners: A Variational Autoencoder (VAE) is a type of neural network that learns to compress data into a smaller representation (encoding) and then reconstruct it back (decoding). What makes VAEs special is that they create a "fuzzy" compressed representation rather than an exact one, which helps the network learn meaningful patterns in your data. This makes VAEs excellent for generating new data similar to your training examples.

The latent space is the compressed representation where your data exists in a simplified form. Think of it as a "creative space" where the network understands the essential features of your data.
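Example: a minimal sketch (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the latent size is illustrative):

static void BuildVae(NeuralNetworkArchitecture<float> architecture)
{
    // A 32-dimensional latent space: a common starting point that balances
    // compression against reconstruction quality (tune for your data).
    var layers = LayerHelper<float>.CreateDefaultVAELayers(architecture, latentSize: 32);
    // ... add 'layers' to the network ...
}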

CreateDefaultVGGLayers(NeuralNetworkArchitecture<T>, VGGConfiguration)

Creates layers for a VGG network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultVGGLayers(NeuralNetworkArchitecture<T> architecture, VGGConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration VGGConfiguration

The VGG-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VGG network.

Remarks

For Beginners: VGG networks are deep convolutional neural networks known for their simplicity and effectiveness. They use stacks of 3x3 convolutions followed by max pooling to progressively extract higher-level features from images.

The VGG architecture consists of:

  • 5 convolutional blocks with an increasing number of filters (64 -> 128 -> 256 -> 512 -> 512)
  • Max pooling after each block to reduce spatial dimensions by half
  • Optional batch normalization after each convolution (in _BN variants)
  • 3 fully connected layers (4096 -> 4096 -> numClasses)
  • Dropout regularization in the fully connected layers

CreateDefaultVRTLayers(int, int, int, int, int, int, int)

Creates layers for a VRT (Video Restoration Transformer) model.

public static IEnumerable<ILayer<T>> CreateDefaultVRTLayers(int inputChannels = 3, int inputHeight = 64, int inputWidth = 64, int embedDim = 120, int numFrames = 6, int numBlocks = 8, int scaleFactor = 4)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

embedDim int

Embedding dimension (default: 120).

numFrames int

Number of temporal frames (default: 6).

numBlocks int

Number of transformer blocks (default: 8).

scaleFactor int

Upscaling factor for super-resolution. Supported values: 1, 2, or 4 (default: 4).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video restoration.

Remarks

For Beginners: VRT (Video Restoration Transformer) is a powerful model for:

  • Video super-resolution (increasing video resolution)
  • Video deblurring (removing motion blur)
  • Video denoising (removing noise from videos)

It uses attention mechanisms to leverage both spatial and temporal information from multiple video frames to produce high-quality restored frames.

Architecture (based on the paper):

  1. Shallow feature extraction from input frames
  2. Temporal mutual self-attention (TMSA) blocks
  3. Deep feature extraction with parallel warping
  4. Reconstruction module for output

Reference: "VRT: A Video Restoration Transformer" https://arxiv.org/abs/2201.12288

Exceptions

ArgumentException

Thrown when scaleFactor is not 1, 2, or 4.
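Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported; the values below are illustrative):

// scaleFactor must be 1, 2, or 4; any other value (e.g. 3) throws the
// ArgumentException described above.
var vrtLayers = LayerHelper<float>.CreateDefaultVRTLayers(
    inputHeight: 64,
    inputWidth: 64,
    scaleFactor: 4);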

CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Variational Physics-Informed Neural Network (VPINN).

public static IEnumerable<ILayer<T>> CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 50).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VPINN.

Remarks

For Beginners: Variational PINNs solve PDEs using the weak (variational) form instead of the strong form. This is similar to Finite Element Methods but uses neural networks, and it is often more stable for complex PDEs than standard PINNs.

The network uses Tanh activation throughout for the smooth derivatives needed in the variational formulation, and a linear output layer since PDE solutions can take any real value.

CreateDefaultVideoMAELayers(int, int, int, int, int, int)

Creates default layers for VideoMAE (Video Masked Autoencoder) action recognition model.

public static IEnumerable<ILayer<T>> CreateDefaultVideoMAELayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numFeatures = 768, int numClasses = 400, int tubeletSize = 2)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

numFeatures int

Number of feature channels (default: 768).

numClasses int

Number of action classes (default: 400 for Kinetics).

tubeletSize int

Temporal size of each tube (default: 2).

Returns

IEnumerable<ILayer<T>>

An enumerable of layers configured for VideoMAE.

Remarks

For Beginners: VideoMAE is a self-supervised learning model that learns video representations by masking and reconstructing video patches. It's used for action recognition and video understanding tasks.

Architecture:

  • 3D patch embedding (spatiotemporal)
  • Transformer encoder blocks
  • Classification head for action recognition
  • Decoder for masked reconstruction during pretraining

Reference: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training" https://arxiv.org/abs/2203.12602

CreateDefaultVideoStabilizationLayers(int, int, int)

Creates layers for a video stabilization model (StabNet-style).

public static IEnumerable<ILayer<T>> CreateDefaultVideoStabilizationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

Returns

IEnumerable<ILayer<T>>

A collection of layers for video stabilization.

Remarks

For Beginners: Video stabilization removes camera shake. The model predicts how to warp each frame to align with a smooth camera path, similar to what smartphone cameras do in real time.

Architecture:

  1. Feature encoder processes input frames
  2. Motion estimator predicts camera motion
  3. Smoother learns the smooth target path
  4. Warper transforms frames to match smooth path

CreateDefaultVideoSuperResolutionLayers(int, int, int, int, int, int, bool)

Creates layers for a video super-resolution model (Real-ESRGAN/BasicVSR++ style).

public static IEnumerable<ILayer<T>> CreateDefaultVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64, int numResBlocks = 16, int scaleFactor = 2, bool useTemporalConsistency = true)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input video height.

inputWidth int

Input video width.

numFeatures int

Number of feature channels (default: 64).

numResBlocks int

Number of residual blocks (default: 16).

scaleFactor int

Upscaling factor (default: 2).

useTemporalConsistency bool

Whether to add temporal aggregation layer (default: true).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video super-resolution.

Remarks

For Beginners: Super-resolution models increase video resolution. This architecture uses residual blocks (skip connections) to preserve details while learning to add new ones. The upsampling at the end increases the spatial size by the scale factor.

Architecture overview:

  1. Initial convolution to extract features
  2. Multiple residual blocks for deep feature learning
  3. Temporal aggregation for video consistency (optional)
  4. Pixel shuffle upsampling for resolution increase
  5. Final convolution for output reconstruction
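Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported; the values below are illustrative):

// 4x upscaling with temporal aggregation enabled (the default); the other
// parameters keep their documented defaults.
var vsrLayers = LayerHelper<float>.CreateDefaultVideoSuperResolutionLayers(
    scaleFactor: 4,
    useTemporalConsistency: true);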

CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T>, int, int, int, int[]?)

Creates default VoxLingua107 layers for 107-language identification.

public static IEnumerable<ILayer<T>> CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 256, int[]? dilations = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

numMels int

Number of mel filterbank channels (default: 80).

tdnnChannels int

Number of TDNN channels (default: 1024).

embeddingDimension int

Embedding dimension (default: 256).

dilations int[]

Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VoxLingua107 language identifier.

Remarks

VoxLingua107 uses an ECAPA-TDNN architecture trained on 107 languages from the VoxLingua107 dataset (YouTube speech samples).

CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a Voxel-based 3D Convolutional Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numConvBlocks = 3, int baseFilters = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

voxelResolution int

The resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.

numConvBlocks int

The number of convolutional blocks (each block has Conv3D + MaxPool3D). Default is 3.

baseFilters int

The number of filters in the first convolutional layer. Doubles with each block. Default is 32.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for voxel-based 3D classification.

Remarks

For Beginners: A Voxel CNN is like a 3D version of a regular image classifier. Instead of looking at a 2D image, it examines a 3D grid of "blocks" (voxels) to understand 3D shapes. This is like how Minecraft represents the world - each block is either filled or empty, and the pattern of blocks creates recognizable objects.

The architecture follows a standard pattern:

  • Multiple Conv3D + MaxPool3D blocks to extract hierarchical 3D features
  • Each block doubles the number of filters while halving the spatial resolution
  • Global average pooling to aggregate spatial information
  • Dense output layer for classification

Applications include:

  • Recognizing 3D objects from voxelized point clouds (e.g., ModelNet40)
  • Medical image analysis (CT, MRI volumetric scans)
  • Spatial occupancy prediction from depth sensors

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, double)

Creates default Wav2Vec2 layers for spoken language identification.

public static IEnumerable<ILayer<T>> CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 768, int numLayers = 12, int numAttentionHeads = 12, int intermediateSize = 3072, int numLanguages = 20, double dropoutRate = 0.1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenSize int

Hidden size of transformer (default: 768).

numLayers int

Number of transformer layers (default: 12).

numAttentionHeads int

Number of attention heads (default: 12).

intermediateSize int

Feed-forward intermediate size (default: 3072).

numLanguages int

Number of languages to classify (default: 20).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Wav2Vec2 language identifier.

Remarks

Wav2Vec2-LID uses Meta's self-supervised speech representation model:

  • 7-layer CNN feature encoder processing raw waveform
  • Transformer encoder for contextual representations
  • Classification head for language prediction

CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, double)

Creates default layers for Whisper-style speech recognition models.

public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int numMels = 80, int modelDimension = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int feedForwardDim = 2048, int vocabularySize = 51865, int maxSequenceLength = 1500, double dropoutRate = 0.1)

Parameters

numMels int

Number of mel spectrogram bins (default: 80).

modelDimension int

Hidden dimension of the model (default: 512).

numEncoderLayers int

Number of encoder layers (default: 6).

numDecoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 8).

feedForwardDim int

Feed-forward dimension (default: 2048).

vocabularySize int

Output vocabulary size (default: 51865).

maxSequenceLength int

Maximum sequence length (default: 1500).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Whisper-style ASR model.

Remarks

For Beginners: Whisper is an encoder-decoder transformer for speech recognition.

The architecture consists of:

  1. Audio encoder: Converts mel spectrograms to hidden representations
    • Convolutional layers to process spectrogram
    • Transformer encoder layers with self-attention
  2. Text decoder: Generates text tokens autoregressively
    • Embedding layer for text tokens
    • Transformer decoder layers with self-attention
    • Output projection to vocabulary

This creates a trainable model structure from scratch. The decoder layers expect encoder outputs to be provided during the forward pass (as implemented in WhisperModel<T>). For inference with pre-trained weights, use the ONNX-based WhisperModel.CreateAsync() method instead.
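Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported). Because this class also declares a second CreateDefaultWhisperLayers overload in which every parameter has a default, a bare CreateDefaultWhisperLayers() call would be ambiguous; naming a parameter unique to one overload resolves it:

// 'modelDimension' exists only in this overload, so naming it selects the
// variant documented above.
var whisperLayers = LayerHelper<float>.CreateDefaultWhisperLayers(
    modelDimension: 512,
    numEncoderLayers: 6,
    numDecoderLayers: 6);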

CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, int, double)

Creates default Whisper layers for automatic speech recognition.

public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int modelDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int ffDim = 2048, int numMels = 80, int maxFrames = 3000, int maxTokens = 448, int vocabSize = 51865, double dropoutRate = 0)

Parameters

modelDim int

Model hidden dimension (default: 512 for Base).

numEncoderLayers int

Number of encoder transformer layers (default: 6 for Base).

numDecoderLayers int

Number of decoder transformer layers (default: 6 for Base).

numHeads int

Number of attention heads (default: 8 for Base).

ffDim int

Feed-forward hidden dimension (default: 2048 for Base).

numMels int

Number of mel spectrogram bins (default: 80).

maxFrames int

Maximum mel spectrogram frames (default: 3000 for 30s audio).

maxTokens int

Maximum output token sequence length (default: 448).

vocabSize int

Whisper vocabulary size (default: 51865).

dropoutRate double

Dropout rate (default: 0.0 for inference-optimized).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Whisper encoder-decoder architecture.

Remarks

Whisper is OpenAI's state-of-the-art automatic speech recognition model with:

  • Mel spectrogram audio preprocessing (80 bins, 16kHz)
  • Convolutional stem for initial audio feature extraction
  • Transformer encoder for audio representation learning
  • Transformer decoder with cross-attention for text generation
  • Support for 99+ languages and translation to English

Reference: "Robust Speech Recognition via Large-Scale Weak Supervision" by Radford et al., 2022

CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Word2Vec model (Skip-Gram or CBOW).

public static IEnumerable<ILayer<T>> CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Word2Vec model.

Remarks

For Beginners: Word2Vec learns to represent words as vectors of numbers (embeddings) such that words with similar meanings are close to each other.
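Example: a minimal sketch (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the sizes are illustrative):

static void BuildWord2Vec(NeuralNetworkArchitecture<float> architecture)
{
    // Illustrative sizes: a 10,000-word vocabulary mapped to
    // 300-dimensional embeddings (a common Word2Vec setting).
    var layers = LayerHelper<float>.CreateDefaultWord2VecLayers(
        architecture, vocabSize: 10_000, embeddingDimension: 300);
    // ... add 'layers' to the network ...
}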

CreateDefaultXMemLayers(int, int, int, int)

Creates layers for an XMem long-term video object segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultXMemLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height (default: 480).

inputWidth int

Input frame width (default: 854).

numFeatures int

Feature dimension (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers for long-term video object segmentation.

Remarks

For Beginners: XMem is designed for tracking objects in very long videos using a three-tier memory system inspired by human memory:

  • Sensory memory: Very recent frames (high detail, fast to forget)
  • Working memory: Important recent frames (moderate detail)
  • Long-term memory: Key historical frames (compressed, permanent)

Reference: "XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model" https://arxiv.org/abs/2207.07115

CreateSAM2ImageEncoderLayers(int, int, int, int)

Creates the image encoder layers for SAM2 (Segment Anything Model 2).

public static IEnumerable<ILayer<T>> CreateSAM2ImageEncoderLayers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 1024).

inputWidth int

Input width (default: 1024).

numFeatures int

Number of output feature channels (default: 256).

Returns

IEnumerable<ILayer<T>>

Image encoder layers that downsample input to feature maps.

Remarks

For Beginners: This creates the image encoder part of SAM2, which processes input images into feature maps. The output has shape [numFeatures, H/16, W/16].

Note: SAM2 is a multi-branch architecture. Use separate factory methods (see the sketch below):

  • CreateSAM2ImageEncoderLayers: Image feature extraction (this method)
  • CreateSAM2PromptEncoderLayers: Point/box/mask prompt encoding
  • CreateSAM2MemoryLayers: Temporal memory attention
  • CreateSAM2MaskDecoderLayers: Mask prediction head
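Example: a minimal sketch building each branch with its documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Build each SAM2 branch. The branches are combined at run time by the
// model, not chained sequentially.
var imageEncoder  = LayerHelper<float>.CreateSAM2ImageEncoderLayers();   // [256, 64, 64] from a 1024x1024 input
var promptEncoder = LayerHelper<float>.CreateSAM2PromptEncoderLayers();  // point/box/mask prompt features
var memory        = LayerHelper<float>.CreateSAM2MemoryLayers();         // temporal memory attention
var maskDecoder   = LayerHelper<float>.CreateSAM2MaskDecoderLayers();    // shared refinement before the heads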

CreateSAM2IoUHead(int, int, int, int)

Creates the IoU (Intersection over Union) prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2IoUHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

numMaskCandidates int

Number of mask candidates (default: 4).

Returns

IEnumerable<ILayer<T>>

IoU prediction layers. Output shape: [numMaskCandidates]

Remarks

For Beginners: This head predicts the quality (IoU score) for each mask candidate. Higher scores indicate better masks. Used to select the best mask from candidates.

CreateSAM2MaskDecoderLayers(int, int, int)

Creates the shared mask decoder refinement layers for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2MaskDecoderLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Shared refinement layers that process fused features.

Remarks

For Beginners: These layers refine the combined image and prompt features before branching into separate prediction heads. Output shape: [numFeatures, h, w]

Usage: Apply these layers first, then branch to the three separate heads (see the sketch below):

  • CreateSAM2MaskHead: Produces mask candidates
  • CreateSAM2IoUHead: Predicts mask quality scores
  • CreateSAM2OcclusionHead: Predicts occlusion
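A minimal sketch of that structure with the documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Shared refinement first, then the three prediction heads in parallel.
var shared        = LayerHelper<float>.CreateSAM2MaskDecoderLayers();
var maskHead      = LayerHelper<float>.CreateSAM2MaskHead(numMaskCandidates: 4);  // [4, h, w] candidate masks
var iouHead       = LayerHelper<float>.CreateSAM2IoUHead(numMaskCandidates: 4);   // [4] quality scores
var occlusionHead = LayerHelper<float>.CreateSAM2OcclusionHead();                 // [1] occlusion score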

CreateSAM2MaskHead(int, int, int, int)

Creates the mask prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2MaskHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

numMaskCandidates int

Number of mask candidates to output (default: 4).

Returns

IEnumerable<ILayer<T>>

Mask prediction layers. Output shape: [numMaskCandidates, h, w]

Remarks

For Beginners: This head produces multiple candidate segmentation masks. Each candidate is a probability map indicating object presence at each pixel.

CreateSAM2MemoryLayers(int, int, int)

Creates the memory attention layers for SAM2 temporal consistency.

public static IEnumerable<ILayer<T>> CreateSAM2MemoryLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Memory attention layers for video object tracking.

Remarks

For Beginners: Memory layers help SAM2 track objects across video frames by maintaining a memory of past segmentations and matching them to new frames.

CreateSAM2OcclusionHead(int, int, int)

Creates the occlusion prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2OcclusionHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Occlusion prediction layers. Output shape: [1]

Remarks

For Beginners: This head predicts whether the tracked object is occluded (hidden by other objects). A high score indicates the object may be temporarily invisible.

CreateSAM2PromptEncoderLayers(int, int, int)

Creates the prompt encoder layers for SAM2 (point, box, and mask prompts).

public static IEnumerable<ILayer<T>> CreateSAM2PromptEncoderLayers(int numFeatures = 256, int maskHeight = 256, int maskWidth = 256)

Parameters

numFeatures int

Number of output feature channels (default: 256).

maskHeight int

Height of mask prompt input (default: 256).

maskWidth int

Width of mask prompt input (default: 256).

Returns

IEnumerable<ILayer<T>>

Prompt encoder layers for different prompt types.

Remarks

For Beginners: SAM2 accepts different types of prompts to tell it what to segment:

  • Points: Click on the object (x, y coordinates)
  • Boxes: Draw a bounding box (x1, y1, x2, y2)
  • Masks: Provide an initial mask estimate

Usage: These layers are applied to prompt inputs separately, then combined with image features in the mask decoder. They are NOT chained sequentially with the image encoder.

CreateSimpleVideoSuperResolutionLayers(int, int, int, int)

Creates a simple super-resolution architecture for testing and lightweight use.

public static IEnumerable<ILayer<T>> CreateSimpleVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int scaleFactor = 2)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input video height.

inputWidth int

Input video width.

scaleFactor int

Upscaling factor (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers for simple super-resolution.

Remarks

For Beginners: This is a smaller, faster model that trades quality for speed. Good for real-time applications or when GPU memory is limited.

CreateSlowFastFastPathwayLayers(int, int, int, int)

Creates the fast pathway layers for SlowFast video recognition.

public static IEnumerable<ILayer<T>> CreateSlowFastFastPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int fastChannels = 8)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

fastChannels int

Base channel count for fast pathway (default: 8).

Returns

IEnumerable<ILayer<T>>

Fast pathway layers that process more frames at lower capacity.

Remarks

For Beginners: The fast pathway processes video at a high frame rate (e.g., 32 fps) but with lower channel capacity (1/8 of the slow pathway). It captures motion and temporal dynamics. Output shape: [fastChannels * 8, H/16, W/16]

CreateSlowFastFusionLayers(int, int, int, int, int)

Creates the fusion and classification layers for SlowFast.

public static IEnumerable<ILayer<T>> CreateSlowFastFusionLayers(int slowChannels = 64, int fastChannels = 8, int featureHeight = 14, int featureWidth = 14, int numClasses = 400)

Parameters

slowChannels int

Base channel count for slow pathway (default: 64).

fastChannels int

Base channel count for fast pathway (default: 8).

featureHeight int

Height of feature maps after pathways (default: 14).

featureWidth int

Width of feature maps after pathways (default: 14).

numClasses int

Number of action classes (default: 400 for Kinetics).

Returns

IEnumerable<ILayer<T>>

Fusion layers that combine pathways and classify actions.

Remarks

For Beginners: This fuses the slow and fast pathway features (after concatenation) and produces the final action classification. The SlowFast model should (see the sketch below):

  1. Run the slow pathway on subsampled frames
  2. Run the fast pathway on all frames
  3. Concatenate the outputs along the channel dimension
  4. Apply these fusion layers
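A minimal sketch assembling the three layer groups with their documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Assemble the three SlowFast layer groups. Frame subsampling and the
// channel-wise concatenation happen in the forward pass, not here.
var slowPathway = LayerHelper<float>.CreateSlowFastSlowPathwayLayers(slowChannels: 64);
var fastPathway = LayerHelper<float>.CreateSlowFastFastPathwayLayers(fastChannels: 8);
var fusion      = LayerHelper<float>.CreateSlowFastFusionLayers(
    slowChannels: 64, fastChannels: 8, numClasses: 400);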

CreateSlowFastSlowPathwayLayers(int, int, int, int)

Creates the slow pathway layers for SlowFast video recognition.

public static IEnumerable<ILayer<T>> CreateSlowFastSlowPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int slowChannels = 64)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

slowChannels int

Base channel count for slow pathway (default: 64).

Returns

IEnumerable<ILayer<T>>

Slow pathway layers that process fewer frames at higher capacity.

Remarks

For Beginners: The slow pathway processes video at a low frame rate (e.g., 4 fps) but with high channel capacity. It captures spatial semantics and appearance features. Output shape: [slowChannels * 8, H/16, W/16]

Note: SlowFast is a dual-pathway architecture. Use separate factory methods:

  • CreateSlowFastSlowPathwayLayers: Low frame rate, high capacity (this method)
  • CreateSlowFastFastPathwayLayers: High frame rate, low capacity
  • CreateSlowFastFusionLayers: Combines pathways for classification