Class LayerHelper<T>

Namespace
AiDotNet.Helpers
Assembly
AiDotNet.dll

Provides helper methods for creating various neural network layer configurations.

public static class LayerHelper<T>

Type Parameters

T

The numeric type used for calculations (typically float or double).

Inheritance
object
LayerHelper<T>

Remarks

This class contains factory methods that create pre-configured sets of neural network layers for common architectures like standard feed-forward networks, CNNs, ResNets, and more.
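
For example, the factory methods are called on the closed generic type. A minimal sketch, assuming float as the numeric type and an already-configured architecture object (its construction is not shown here):

using AiDotNet.Helpers;

// `architecture` is assumed to be a pre-built NeuralNetworkArchitecture<float>.
var layers = LayerHelper<float>.CreateDefaultFeedForwardLayers(
    architecture,
    hiddenLayerCount: 2,   // documented default
    hiddenLayerSize: 64);  // documented default

The returned IEnumerable<ILayer<float>> can then be handed to whichever network type accepts a custom layer collection.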

Methods

CreateDefaultABINetLayers(int, int, int, int, int, int)

Creates default ABINet (Autonomous, Bidirectional, Iterative) layers.

public static IEnumerable<ILayer<T>> CreateDefaultABINetLayers(int imageWidth = 128, int imageHeight = 32, int visionDim = 512, int languageDim = 512, int numIterations = 3, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 128).

imageHeight int

Input image height (default: 32).

visionDim int

Vision encoder dimension (default: 512).

languageDim int

Language model dimension (default: 512).

numIterations int

Number of refinement iterations (default: 3).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an ABINet model.
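
A usage sketch relying on the documented defaults (float numeric type assumed; the variable name is illustrative):

using AiDotNet.Helpers;

// 128x32 text-line images, 3 refinement iterations, 95-character charset.
var abinetLayers = LayerHelper<float>.CreateDefaultABINetLayers(
    imageWidth: 128,
    imageHeight: 32,
    numIterations: 3,
    charsetSize: 95);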

CreateDefaultAnimateDiffLayers(int, int, int, int, int)

Creates layers for an AnimateDiff motion module that adds temporal coherence.

public static IEnumerable<ILayer<T>> CreateDefaultAnimateDiffLayers(int inputChannels = 320, int inputHeight = 64, int inputWidth = 64, int numLayers = 8, int numFrames = 16)

Parameters

inputChannels int

Number of input feature channels (default: 320).

inputHeight int

Input feature height (default: 64).

inputWidth int

Input feature width (default: 64).

numLayers int

Number of motion transformer layers (default: 8).

numFrames int

Number of video frames (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers for motion modeling.

Remarks

For Beginners: AnimateDiff is a motion module that plugs into existing image generation models (like Stable Diffusion) to create animated videos. It learns temporal dynamics from video data.

Architecture (based on the paper):

  1. Input features come from the base image model
  2. Temporal attention layers model motion across frames
  3. Cross-attention with motion context enables coherent animation
  4. Output features blend back into the base model

The motion module is designed to be inserted at multiple points in the U-Net.

Reference: "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models" https://arxiv.org/abs/2307.04725

CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T>)

Creates a default set of attention-based layers for transformer-style architectures.

public static IEnumerable<ILayer<T>> CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an attention-based neural network.

Remarks

For Beginners: Attention mechanisms allow neural networks to focus on specific parts of the input that are most relevant for a given task. Similar to how humans pay attention to specific details in a conversation, these layers help the network "pay attention" to important parts of the data. Transformers use this mechanism to process sequences (like text) very effectively.

CreateDefaultAudioGenLayers(int, int, int, int, int, int, int, int, double)

Creates default AudioGen layers for text-to-audio generation.

public static IEnumerable<ILayer<T>> CreateDefaultAudioGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 1024, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for T5-base).

lmHiddenDim int

Language model hidden dimension (default: 1536).

numLmLayers int

Number of language model transformer layers (default: 24).

numHeads int

Number of attention heads (default: 16).

numCodebooks int

Number of EnCodec codebooks (default: 4).

codebookSize int

Size of each codebook vocabulary (default: 1024).

maxTextLength int

Maximum text sequence length (default: 256).

maxAudioTokens int

Maximum number of audio tokens at roughly 50 tokens per second (default: 1500, about 30 seconds of audio).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an AudioGen model.

Remarks

AudioGen is a text-to-audio generation model that uses a transformer language model operating over EnCodec audio codes. Unlike MusicGen, it focuses on general audio and environmental sounds rather than music.

  • T5-based text encoder for conditioning
  • Transformer decoder generating audio codes autoregressively
  • EnCodec neural audio codec for audio reconstruction

Reference: "AudioGen: Textually Guided Audio Generation" by Kreuk et al., 2022

CreateDefaultAudioLDMLayers(int, int, int, int, int[]?, int, int, int, double)

Creates default AudioLDM layers for text-to-audio generation using latent diffusion.

public static IEnumerable<ILayer<T>> CreateDefaultAudioLDMLayers(int textHiddenDim = 768, int latentDim = 8, int unetChannels = 256, int numResBlocks = 2, int[]? attentionResolutions = null, int numHeads = 8, int numMels = 64, int maxTextLength = 77, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for CLAP).

latentDim int

Latent space dimension (default: 8).

unetChannels int

U-Net base channels (default: 256).

numResBlocks int

Number of residual blocks per level (default: 2).

attentionResolutions int[]

Resolutions at which to apply attention (default: [4, 2, 1]).

numHeads int

Number of attention heads (default: 8).

numMels int

Number of mel spectrogram channels (default: 64).

maxTextLength int

Maximum text sequence length (default: 77).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an AudioLDM model.

Remarks

AudioLDM uses latent diffusion for text-to-audio generation:

  • CLAP text encoder for conditioning
  • VAE to encode/decode mel spectrograms to latent space
  • U-Net for denoising in latent space
  • HiFi-GAN vocoder for waveform generation

Reference: "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Liu et al., 2023

CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T>)

Creates a default autoencoder neural network architecture.

public static IEnumerable<ILayer<T>> CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an autoencoder neural network.

Remarks

For Beginners: An autoencoder is a type of neural network that learns to compress data into a smaller representation and then reconstruct it back to the original form. Think of it like learning to create a thumbnail of an image and then expanding it back to full size. The network has two main parts: an encoder that compresses the data and a decoder that reconstructs it.
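
A minimal sketch (float assumed; `architecture` is a pre-built NeuralNetworkArchitecture<float> whose construction is not shown):

using AiDotNet.Helpers;

// Encoder and decoder sizing comes from the architecture configuration.
var autoEncoderLayers =
    LayerHelper<float>.CreateDefaultAutoEncoderLayers(architecture);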

CreateDefaultBGELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a BGE (BAAI General Embedding) model.

public static IEnumerable<ILayer<T>> CreateDefaultBGELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a Bayesian neural network (Bayes-by-Backprop style).

public static IEnumerable<ILayer<T>> CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

Returns

IEnumerable<ILayer<T>>

Remarks

This mirrors the library's default dense+activation patterns, but uses Bayesian dense layers so the network can express epistemic uncertainty through weight distributions.

CreateDefaultBlip2Layers(int, int, int, int, int, int, int, int, int, int, int, int)

Creates default layers for a BLIP-2 neural network.

public static IEnumerable<ILayer<T>> CreateDefaultBlip2Layers(int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 30522, int embeddingDimension = 256, int qformerHiddenDim = 768, int visionHiddenDim = 1408, int lmHiddenDim = 2560, int numQformerLayers = 12, int numHeads = 12, int numLmDecoderLayers = 6, int maxSequenceLength = 32)

Parameters

imageSize int
channels int
patchSize int
vocabularySize int
embeddingDimension int
qformerHiddenDim int
visionHiddenDim int
lmHiddenDim int
numQformerLayers int
numHeads int
numLmDecoderLayers int
maxSequenceLength int

Returns

IEnumerable<ILayer<T>>

CreateDefaultByteTrackLayers(int, int, int, int, int)

Creates default layers for ByteTrack multi-object tracking.

public static IEnumerable<ILayer<T>> CreateDefaultByteTrackLayers(int inputChannels = 3, int inputHeight = 800, int inputWidth = 1440, int numFeatures = 256, int numClasses = 1)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numClasses int

Returns

IEnumerable<ILayer<T>>

CreateDefaultCNNLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates a Convolutional Neural Network (CNN) with configurable layers.

public static IEnumerable<ILayer<T>> CreateDefaultCNNLayers(NeuralNetworkArchitecture<T> architecture, int convLayerCount = 2, int filterCount = 32, int kernelSize = 3, int denseLayerCount = 1, int denseLayerSize = 64, int outputSize = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

convLayerCount int

Number of convolutional layers (default: 2).

filterCount int

Number of filters in each convolutional layer (default: 32).

kernelSize int

Size of the convolutional kernel (default: 3).

denseLayerCount int

Number of dense layers after convolutional layers (default: 1).

denseLayerSize int

Number of neurons in each dense layer (default: 64).

outputSize int

Number of output neurons (default: 1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CNN.

Remarks

For Beginners: A Convolutional Neural Network (CNN) is specialized for processing grid-like data, such as images. Instead of connecting every input to every neuron (which would be inefficient for images), CNNs use filters that scan across the image to detect features like edges, textures, and shapes.

Key components in this CNN:

  • Convolutional layers: Detect features in the input using filters
  • Pooling layers: Reduce the size of the data while keeping important information
  • Flatten layer: Converts the multi-dimensional data to a flat vector
  • Dense layers: Process the extracted features to make predictions
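
A sketch showing the configurable knobs with their documented defaults (float assumed; `architecture` pre-built as above):

using AiDotNet.Helpers;

var cnnLayers = LayerHelper<float>.CreateDefaultCNNLayers(
    architecture,
    convLayerCount: 2,    // number of convolutional layers
    filterCount: 32,      // filters per convolutional layer
    kernelSize: 3,        // 3x3 kernels
    denseLayerCount: 1,   // dense layers after flattening
    denseLayerSize: 64,
    outputSize: 1);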

CreateDefaultCRAFTLayers(int, int, int)

Creates default CRAFT layers for character-level text detection.

public static IEnumerable<ILayer<T>> CreateDefaultCRAFTLayers(int imageSize = 768, int backboneChannels = 512, int upscaleChannels = 256)

Parameters

imageSize int

Input image size (default: 768).

backboneChannels int

Backbone output channels (default: 512).

upscaleChannels int

Upscale network channels (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CRAFT model.

Remarks

Reference: "Character Region Awareness for Text Detection" (CVPR 2019)

CreateDefaultCRNNLayers(int, int, int, int, int, int)

Creates default CRNN layers for sequence text recognition.

public static IEnumerable<ILayer<T>> CreateDefaultCRNNLayers(int imageWidth = 128, int imageHeight = 32, int cnnChannels = 512, int rnnHiddenSize = 256, int rnnLayers = 2, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 128).

imageHeight int

Input image height (default: 32).

cnnChannels int

CNN output channels (default: 512).

rnnHiddenSize int

RNN hidden size (default: 256).

rnnLayers int

Number of RNN layers (default: 2).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a CRNN model.

Remarks

Reference: "An End-to-End Trainable Neural Network for Image-based Sequence Recognition" (TPAMI 2017)

CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default capsule network architecture.

public static IEnumerable<ILayer<T>> CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a capsule network.

Remarks

For Beginners: A capsule network is an advanced type of neural network that tries to better understand spatial relationships in data. Unlike traditional networks that just detect features, capsule networks also track the position, orientation, and size of features. Think of it like the difference between recognizing a face by just its parts (eyes, nose, mouth) versus understanding how those parts relate to each other in 3D space.

The network consists of special "capsule" layers that group neurons together to represent entities and their properties, allowing the network to better understand complex structures in data.

CreateDefaultClipLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for CLIP-style multimodal networks.

public static IEnumerable<ILayer<T>> CreateDefaultClipLayers(NeuralNetworkArchitecture<T> architecture, int projectionDim = 512)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

projectionDim int

The projection dimension for embeddings (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of projection layers for CLIP fine-tuning.

Remarks

CLIP uses pre-trained ONNX encoders for most of its work, but these layers provide optional projection heads for fine-tuning or feature extraction.

For Beginners: CLIP has two main parts: an image encoder and a text encoder. These pre-trained encoders are loaded from ONNX files. The projection layers here are optional additions that can:

  • Adapt the embeddings for specific tasks
  • Allow fine-tuning on new domains
  • Match embedding dimensions between different model variants

If you're just using CLIP for inference (getting embeddings), you typically don't need these layers. They're useful when you want to adapt CLIP for a specific task.
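
A sketch of creating the optional projection heads (float assumed; `architecture` pre-built):

using AiDotNet.Helpers;

// Optional projection heads for fine-tuning; not needed for plain inference.
var clipProjections = LayerHelper<float>.CreateDefaultClipLayers(
    architecture,
    projectionDim: 512);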

CreateDefaultCogVideoLayers(int, int, int, int, int, int)

Creates layers for a CogVideo text-to-video generation model.

public static IEnumerable<ILayer<T>> CreateDefaultCogVideoLayers(int inputChannels = 4, int inputHeight = 32, int inputWidth = 32, int embedDim = 1024, int numLayers = 24, int numFrames = 16)

Parameters

inputChannels int

Number of input channels for latent (default: 4).

inputHeight int

Input latent height (default: 32).

inputWidth int

Input latent width (default: 32).

embedDim int

Embedding dimension (default: 1024).

numLayers int

Number of transformer layers (default: 24).

numFrames int

Number of video frames to generate (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video generation.

Remarks

For Beginners: CogVideo generates videos from text descriptions. It works in the latent space (compressed representation) and uses a diffusion-based approach to iteratively refine noise into coherent video.

Architecture (based on the CogVideoX paper):

  1. Text encoder processes the input prompt
  2. Latent space diffusion model generates video frames
  3. VAE decoder converts latent to pixel space

This creates the denoising U-Net backbone that refines latent codes.

Reference: "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" https://arxiv.org/abs/2408.06072

CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a ColBERT (Contextualized Late Interaction over BERT) model.

public static IEnumerable<ILayer<T>> CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 128, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultCutieLayers(int, int, int, int)

Creates layers for a Cutie video object segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultCutieLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height (default: 480).

inputWidth int

Input frame width (default: 854).

numFeatures int

Feature dimension (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video object segmentation.

Remarks

For Beginners: Cutie is designed for semi-supervised video object segmentation (VOS). Given a mask for an object in the first frame, it tracks and segments that object throughout the entire video with high accuracy.

Architecture:

  1. Image encoder (ResNet-like backbone) extracts features
  2. Object encoder processes mask with features
  3. Memory attention matches current frame to stored memories
  4. Mask decoder produces segmentation output

Reference: "Putting the Object Back into Video Object Segmentation" https://arxiv.org/abs/2310.12982

CreateDefaultDBNetLayers(int, int, int)

Creates default layers for DBNet text detection model.

public static IEnumerable<ILayer<T>> CreateDefaultDBNetLayers(int imageSize = 640, int backboneChannels = 256, int innerChannels = 256)

Parameters

imageSize int

Input image size (default: 640).

backboneChannels int

Backbone output channels (default: 256).

innerChannels int

FPN inner channels (default: 256).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for DBNet.

Remarks

DBNet uses a ResNet backbone with FPN for multi-scale features, followed by probability and threshold prediction heads.

Reference: "Real-time Scene Text Detection with Differentiable Binarization" (AAAI 2020)

CreateDefaultDIFRINTLayers(int, int, int, int, int)

Creates default layers for DIFRINT video stabilization.

public static IEnumerable<ILayer<T>> CreateDefaultDIFRINTLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 64, int numIterations = 3)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numIterations int

Returns

IEnumerable<ILayer<T>>

CreateDefaultDNCLayers(NeuralNetworkArchitecture<T>, int, int, int, int)

Creates a default Differentiable Neural Computer (DNC) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultDNCLayers(NeuralNetworkArchitecture<T> architecture, int controllerSize, int memoryWordSize, int readHeads, int interfaceSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

controllerSize int

The size of the controller network.

memoryWordSize int

The size of each memory word.

readHeads int

The number of read heads.

interfaceSize int

The size of the interface between controller and memory.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Differentiable Neural Computer.

Remarks

For Beginners: A Differentiable Neural Computer (DNC) is like a neural network with a built-in memory system. Traditional neural networks process information and then forget it, but a DNC can store information in its "memory" and retrieve it later when needed. This makes DNCs good at tasks that require remembering information over time, like answering questions about a story or navigating through complex environments.

CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default Deep Belief Network (DBN) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Belief Network.

Remarks

For Beginners: A Deep Belief Network is a type of neural network that learns to recognize patterns in data by building multiple layers that each specialize in finding specific features. It works by training each layer one at a time (called "pre-training"), which helps the network learn more effectively, especially when you don't have a lot of labeled training data.

CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T>)

Creates default layers for a Deep Boltzmann Machine (DBM).

public static IEnumerable<ILayer<T>> CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Boltzmann Machine.

Remarks

For Beginners: A Deep Boltzmann Machine is a type of neural network that learns to recognize patterns in data without supervision. It's made up of multiple layers of "hidden units" that learn to represent features of the input data. DBMs are particularly good at learning complex patterns and can be used for tasks like feature learning, dimensionality reduction, and generating new data similar to the training set.

CreateDefaultDeepOperatorNetworkLayers(int, int, int, int, int)

Creates default layers for a Deep Operator Network (DeepONet).

public static (IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers) CreateDefaultDeepOperatorNetworkLayers(int branchInputSize, int trunkInputSize, int outputSize = 1, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

branchInputSize int

Size of the branch network input (function samples).

trunkInputSize int

Size of the trunk network input (query locations).

outputSize int

Number of output components (default: 1 for scalar operators). For multi-output operators, each output component uses hiddenLayerSize basis functions, so the final layer outputs hiddenLayerSize * outputSize values that are reshaped and summed.

hiddenLayerCount int

Number of hidden layers in each sub-network (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer, and the number of basis functions per output component (default: 64).

Returns

(IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers)

A tuple of (branchLayers, trunkLayers) for the DeepONet architecture.

Remarks

For Beginners: DeepONet learns operators - functions that take functions as input. For example, an operator might take a temperature distribution as input and output the resulting heat flow. The branch network encodes the input function, while the trunk network handles where you want to evaluate the output.

Architecture: Branch encodes input function, Trunk encodes query location. Output = sum(Branch * Trunk) + bias, allowing learning of complex operators.

Multi-output handling: For operators with multiple output components (e.g., velocity with x,y,z components), set outputSize to the number of components. Each component gets its own set of basis functions. The branch and trunk networks output hiddenLayerSize * outputSize values, which are grouped as [component1_basis1..p, component2_basis1..p, ...] where p = hiddenLayerSize.
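
A sketch deconstructing the returned tuple. The branch and trunk input sizes here are illustrative assumptions (100 samples of the input function, 2D query coordinates):

using AiDotNet.Helpers;

var (branchLayers, trunkLayers) =
    LayerHelper<float>.CreateDefaultDeepOperatorNetworkLayers(
        branchInputSize: 100,  // illustrative: 100 function samples
        trunkInputSize: 2,     // illustrative: (x, t) query coordinates
        outputSize: 1,         // scalar operator
        hiddenLayerCount: 3,
        hiddenLayerSize: 64);  // also the basis-function count per output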

CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default Deep Q-Network (DQN) with pre-configured layers for reinforcement learning.

public static IEnumerable<ILayer<T>> CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Q-Network.

Remarks

For Beginners: A Deep Q-Network is a type of neural network used in reinforcement learning, which is how computers learn to make decisions by trying different actions and receiving rewards. Think of it like teaching a dog new tricks with treats. The network learns which actions (like moving left or right in a game) will lead to the highest rewards over time.

CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for the Deep Ritz Method network.

public static IEnumerable<ILayer<T>> CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 50).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Deep Ritz network.

Remarks

For Beginners: The Deep Ritz Method solves PDEs by minimizing an energy functional instead of directly enforcing the PDE. This is based on the Ritz method from calculus of variations. The network learns the function that minimizes the energy.

Similar architecture to VPINN but used with energy-based loss functions. Tanh activation provides smooth second derivatives needed for energy computations.

CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T>, DenseNetConfiguration)

Creates default layers for a DenseNet network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T> architecture, DenseNetConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration DenseNetConfiguration

The DenseNet-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DenseNet network.

Remarks

For Beginners: DenseNet (Densely Connected Convolutional Network) connects each layer to every other layer in a feed-forward fashion. This creates strong gradient flow and feature reuse, enabling very deep networks with fewer parameters.

The DenseNet architecture consists of:

  • Stem: Initial 7x7 conv with stride 2, followed by 3x3 max pooling
  • Dense Blocks: Multiple dense blocks with transition layers between them
  • Transition Layers: 1x1 conv for channel reduction followed by 2x2 avg pooling
  • Classification Head: Global average pooling followed by a dense layer

CreateDefaultDepthAnythingV2Layers(int, int, int, int, int)

Creates default layers for Depth Anything V2 monocular depth estimation model.

public static IEnumerable<ILayer<T>> CreateDefaultDepthAnythingV2Layers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 768, int numEncoderBlocks = 12)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 480).

inputWidth int

Input width (default: 640).

numFeatures int

Number of feature channels (default: 768 for Base).

numEncoderBlocks int

Number of encoder transformer blocks (default: 12).

Returns

IEnumerable<ILayer<T>>

An enumerable of layers configured for Depth Anything V2.

Remarks

For Beginners: Depth Anything V2 estimates depth maps from single images. Given an RGB image, it predicts the relative distance of each pixel from the camera.

Architecture:

  • ViT-based encoder with DINOv2 initialization
  • Multi-scale decoder for dense prediction
  • Depth prediction head

Reference: "Depth Anything V2" https://arxiv.org/abs/2406.09414

CreateDefaultDessurtLayers(int, int, int, int, int, int)

Creates default Dessurt (self-supervised document transformer) layers.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDessurtLayers(int encoderDim = 768, int decoderDim = 768, int encoderLayers = 12, int decoderLayers = 6, int numHeads = 12, int vocabSize = 50265)

Parameters

encoderDim int

Encoder dimension (default: 768).

decoderDim int

Decoder dimension (default: 768).

encoderLayers int

Number of encoder layers (default: 12).

decoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 50265).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Encoder and decoder layers for a Dessurt model.

CreateDefaultDiTLayers(int, int, int, int, int, int)

Creates default DiT (Document Image Transformer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultDiTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int patchSize = 16, int imageSize = 224, int numClasses = 16)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

patchSize int

Patch size for ViT (default: 16).

imageSize int

Input image size (default: 224).

numClasses int

Number of output classes (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DiT model.

CreateDefaultDocBankLayers(int, int, int, int)

Creates default layers for DocBank page segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultDocBankLayers(int imageSize = 1024, int backboneChannels = 256, int numClasses = 13, int hiddenDim = 256)

Parameters

imageSize int

Input image size (default: 1024).

backboneChannels int

Backbone output channels (default: 256).

numClasses int

Number of segmentation classes (default: 13).

hiddenDim int

Hidden dimension for segmentation head (default: 256).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for DocBank.

Remarks

DocBank uses a ResNet backbone with FPN for semantic segmentation.

Reference: "DocBank: A Benchmark Dataset for Document Layout Analysis" (COLING 2020)

CreateDefaultDocFormerLayers(int, int, int, int, int, int, int)

Creates default DocFormer layers for document understanding with shared spatial encodings.

public static IEnumerable<ILayer<T>> CreateDefaultDocFormerLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int spatialDim = 128, int numClasses = 16)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522).

imageSize int

Input image size (default: 224).

spatialDim int

Spatial embedding dimension (default: 128).

numClasses int

Number of output classes (default: 16).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocFormer model.

Remarks

DocFormer uses shared spatial encodings across text, visual, and layout modalities.

Reference: "DocFormer: End-to-End Transformer for Document Understanding" (ICCV 2021)

CreateDefaultDocGCNLayers(int, int, int, int)

Creates default DocGCN (Document Graph Convolutional Network) layers.

public static IEnumerable<ILayer<T>> CreateDefaultDocGCNLayers(int inputDim = 768, int hiddenDim = 256, int numGCNLayers = 3, int numClasses = 7)

Parameters

inputDim int

Input feature dimension (default: 768).

hiddenDim int

Hidden dimension (default: 256).

numGCNLayers int

Number of GCN layers (default: 3).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocGCN model.

CreateDefaultDocOwlLayers(int, int, int, int, int, int)

Creates default DocOwl (mPLUG-DocOwl) layers for document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultDocOwlLayers(int visionDim = 1024, int textDim = 4096, int visionLayers = 24, int textLayers = 32, int numHeads = 16, int vocabSize = 32000)

Parameters

visionDim int

Vision encoder dimension (default: 1024).

textDim int

Text encoder dimension (default: 4096).

visionLayers int

Number of vision layers (default: 24).

textLayers int

Number of text layers (default: 32).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 32000).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a DocOwl model.

CreateDefaultDonutLayers(int, int, int, int, int[]?, int[]?, int, int, int, int, int, int, int, int)

Creates default Donut layers for OCR-free document understanding.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDonutLayers(int imageHeight = 1920, int imageWidth = 2560, int inputChannels = 3, int embedDim = 128, int[]? depths = null, int[]? numHeads = null, int windowSize = 10, int patchSize = 4, int mlpRatio = 4, int decoderHiddenDim = 1024, int numDecoderLayers = 4, int decoderHeads = 16, int vocabSize = 57522, int maxGenerationLength = 768)

Parameters

imageHeight int

Input image height (default: 1920 for donut-base).

imageWidth int

Input image width (default: 2560 for donut-base).

inputChannels int

Number of input channels (default: 3 for RGB).

embedDim int

Initial embedding dimension (default: 128 for Swin-B).

depths int[]

Depths of each Swin stage (default: {2,2,14,2} for donut-base).

numHeads int[]

Attention heads per stage (default: {4,8,16,32}).

windowSize int

Window size for attention (default: 10 for donut-base).

patchSize int

Initial patch size (default: 4).

mlpRatio int

MLP expansion ratio (default: 4).

decoderHiddenDim int

Decoder hidden dimension (default: 1024).

numDecoderLayers int

Number of decoder layers (default: 4).

decoderHeads int

Number of decoder attention heads (default: 16).

vocabSize int

Vocabulary size (default: 57522).

maxGenerationLength int

Maximum output sequence length (default: 768).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

A tuple of (EncoderLayers, DecoderLayers) forming a Donut architecture.

Remarks

Donut (Document Understanding Transformer) is an OCR-free end-to-end model:

  • Swin Transformer-B encoder with hierarchical stages for image features
  • BART-style decoder for text generation
  • Direct pixel-to-text conversion without explicit OCR

For Beginners: This creates a model that can "read" documents directly from pixels without needing a separate OCR step. The encoder extracts visual features at multiple scales using the Swin Transformer architecture, while the decoder generates text autoregressively.

Default Configuration (donut-base):

  • Input: 2560×1920 RGB images
  • Encoder: Swin-B with depths {2,2,14,2}, 128 initial dim, window size 10
  • Decoder: 4-layer BART-style with 1024 hidden dim

Reference: "OCR-free Document Understanding Transformer" (ECCV 2022)

CreateDefaultEASTLayers(int, int, int, string)

Creates default EAST (Efficient and Accurate Scene Text Detector) layers.

public static IEnumerable<ILayer<T>> CreateDefaultEASTLayers(int imageSize = 512, int backboneChannels = 512, int featureChannels = 128, string geometryType = "RBOX")

Parameters

imageSize int

Input image size (default: 512).

backboneChannels int

Backbone output channels (default: 512).

featureChannels int

Feature map channels (default: 128).

geometryType string

Geometry output type: RBOX or QUAD (default: RBOX).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an EAST model.

CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int[]?)

Creates default ECAPA-TDNN layers for spoken language identification.

public static IEnumerable<ILayer<T>> CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 192, int numLanguages = 20, int[]? dilations = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

numMels int

Number of mel filterbank channels (default: 80).

tdnnChannels int

Number of TDNN channels (default: 1024).

embeddingDimension int

Embedding dimension (default: 192).

numLanguages int

Number of languages to classify (default: 20).

dilations int[]

Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an ECAPA-TDNN language identifier.

Remarks

ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation TDNN) is a state-of-the-art architecture for speaker and language recognition using:

  • SE-Res2Net blocks with channel attention
  • Multi-layer feature aggregation (MFA)
  • Attentive statistics pooling

CreateDefaultEDVRLayers(int, int, int, int, int, int, int)

Creates default layers for EDVR video restoration.

public static IEnumerable<ILayer<T>> CreateDefaultEDVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numFrames = 5, int numGroups = 8, int numBlocks = 5)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numFrames int
numGroups int
numBlocks int

Returns

IEnumerable<ILayer<T>>

CreateDefaultELMLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for an Extreme Learning Machine (ELM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultELMLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerSize int

The size of the hidden layer.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an Extreme Learning Machine.

Remarks

For Beginners: An Extreme Learning Machine (ELM) is a simplified neural network where only the output layer weights are trained. The hidden layer weights are randomly initialized and never updated. This makes ELMs very fast to train compared to traditional neural networks, while still providing good performance for many tasks. Think of it as a "shortcut" approach to neural network training.

ELMs work by projecting the input data into a higher-dimensional space using random weights, then finding the best output weights to solve the problem. They're particularly useful when you need a quick solution and don't have time for extensive training.

CreateDefaultESNLayers(int, int, int, double, double)

Creates a default Echo State Network (ESN) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultESNLayers(int inputSize, int outputSize, int reservoirSize, double spectralRadius = 0.9, double sparsity = 0.1)

Parameters

inputSize int

The size of the input layer.

outputSize int

The size of the output layer.

reservoirSize int

The size of the reservoir (hidden layer).

spectralRadius double

Controls the stability of the reservoir dynamics (default: 0.9).

sparsity double

The connection sparsity in the reservoir (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an Echo State Network.

Remarks

For Beginners: An Echo State Network is a special type of recurrent neural network where most of the connections between neurons are fixed (not trained). Only the connections from the hidden layer to the output are trained. Think of it like having a pool of water (the reservoir) that you disturb with input signals, and then you learn to read the ripple patterns to predict outputs. This makes ESNs very fast to train compared to other recurrent networks.
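
A sketch with illustrative input, output, and reservoir sizes (float assumed); only the reservoir-to-output weights are trained:

using AiDotNet.Helpers;

var esnLayers = LayerHelper<float>.CreateDefaultESNLayers(
    inputSize: 8,         // illustrative
    outputSize: 1,        // illustrative
    reservoirSize: 500,   // illustrative
    spectralRadius: 0.9,  // documented default
    sparsity: 0.1);       // documented default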

CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T>, EfficientNetConfiguration)

Creates default layers for an EfficientNet network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T> architecture, EfficientNetConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration EfficientNetConfiguration

The EfficientNet-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an EfficientNet network.

Remarks

For Beginners: EfficientNet uses compound scaling to balance network depth, width, and resolution. Each variant (B0-B7) represents a different scale factor, achieving excellent accuracy with fewer parameters than previous architectures.

CreateDefaultFLAVRLayers(int, int, int, int, int)

Creates default layers for FLAVR frame interpolation.

public static IEnumerable<ILayer<T>> CreateDefaultFLAVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numInputFrames = 4)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numInputFrames int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFastDVDNetLayers(int, int, int, int, int)

Creates default layers for FastDVDNet video denoising.

public static IEnumerable<ILayer<T>> CreateDefaultFastDVDNetLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 32, int numInputFrames = 5)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int
numInputFrames int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a FastText model.

public static IEnumerable<ILayer<T>> CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int bucketSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

bucketSize int

The number of buckets for n-gram hashing.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a FastText model.

Remarks

For Beginners: FastText improves on Word2Vec by considering sub-word information (character n-grams). It represents words as the sum of their n-gram embeddings.

CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a feed-forward neural network.

public static IEnumerable<ILayer<T>> CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 2).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a feed-forward neural network.

Remarks

For Beginners: This method builds a basic feed-forward neural network. Think of it as a series of connected layers where information flows from the input, through "hidden" processing layers, to the output.

Key components:

  • Input layer: Receives the raw data
  • Hidden layers: Process and transform the data, learning patterns
  • Output layer: Produces the final prediction or classification

The network automatically adjusts for different types of tasks (like classification or regression) by choosing appropriate activation functions for the output layer.

CreateDefaultFlowFormerLayers(int, int, int, int, int)

Creates default layers for FlowFormer optical flow estimation.

public static IEnumerable<ILayer<T>> CreateDefaultFlowFormerLayers(int inputChannels = 3, int inputHeight = 448, int inputWidth = 1024, int embedDim = 256, int numLayers = 6)

Parameters

inputChannels int
inputHeight int
inputWidth int
embedDim int
numLayers int

Returns

IEnumerable<ILayer<T>>

CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T>, int[], int, int, int)

Creates default layers for a Fourier Neural Operator (FNO).

public static IEnumerable<ILayer<T>> CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T> architecture, int[] spatialDimensions, int numFourierLayers = 4, int hiddenChannels = 64, int numModes = 12)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

spatialDimensions int[]

Dimensions of the spatial domain (e.g., [64, 64] for 2D grid, [32] for 1D). This determines the FFT size for spectral operations.

numFourierLayers int

Number of Fourier layers (default: 4).

hiddenChannels int

Number of hidden channels/width (default: 64).

numModes int

Number of Fourier modes to retain (default: 12). Lower = smoother, higher = more detail. Should be less than or equal to the smallest spatial dimension.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Fourier Neural Operator.

Remarks

For Beginners: Fourier Neural Operators learn mappings between function spaces by operating in frequency domain. They're especially powerful for PDEs because many physical phenomena have simple representations in frequency space.

Architecture:

  1. Lifting layer: Projects input to higher-dimensional channel space
  2. Fourier layers: Apply spectral convolution (FFT → learnable weights → IFFT) + local linear transform
  3. Projection layers: Map back to output dimension

Key FNO Properties:

  • Resolution-invariant: Train at one resolution, evaluate at another
  • Global receptive field through spectral operations
  • Efficient for smooth functions (low-frequency dominated)

Note: For full FNO functionality with training, use the FourierNeuralOperator<T> class directly, which provides a complete neural operator implementation.

Exceptions

ArgumentNullException

Thrown when spatialDimensions is null.

ArgumentException

Thrown when spatialDimensions is empty.
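
A sketch for a 2D problem on a 64×64 grid (float assumed; `architecture` pre-built):

using AiDotNet.Helpers;

var fnoLayers = LayerHelper<float>.CreateDefaultFourierNeuralOperatorLayers(
    architecture,
    spatialDimensions: new[] { 64, 64 },  // 2D grid; determines the FFT size
    numFourierLayers: 4,
    hiddenChannels: 64,
    numModes: 12);  // keep <= the smallest spatial dimension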

CreateDefaultFrameInterpolationLayers(int, int, int, int)

Creates layers for a frame interpolation model (FILM/RIFE-style).

public static IEnumerable<ILayer<T>> CreateDefaultFrameInterpolationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

numFeatures int

Number of feature channels (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers for frame interpolation.

Remarks

For Beginners: Frame interpolation creates new frames between existing ones to make video smoother (e.g., 30fps to 60fps). The model learns to "imagine" what the in-between frames should look like based on the surrounding frames.

Architecture:

  1. Feature pyramid extracts multi-scale features
  2. Flow estimation predicts motion
  3. Synthesis network generates interpolated frames

CreateDefaultGNNLayers(NeuralNetworkArchitecture<T>)

Creates default layers for a Graph Neural Network (GNN).

public static IEnumerable<ILayer<T>> CreateDefaultGNNLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Graph Neural Network.

Remarks

For Beginners: Graph Neural Networks (GNNs) are specialized neural networks designed to work with graph-structured data, where information is represented as nodes (points) connected by edges (lines). Examples include social networks, molecular structures, or road networks.

Unlike standard neural networks that process individual data points independently, GNNs can understand relationships between data points. They work by passing information between connected nodes, allowing each node to "learn" from its neighbors. This makes GNNs powerful for tasks where relationships between entities matter, such as recommending friends on social media, predicting protein interactions, or analyzing traffic patterns.

CreateDefaultGRULayers(NeuralNetworkArchitecture<T>)

Creates a default Gated Recurrent Unit (GRU) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultGRULayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GRU-based processing.

Remarks

For Beginners: A GRU (Gated Recurrent Unit) is a type of recurrent neural network that's especially good at learning patterns in sequences of data, like text or time series. It's similar to LSTM but with a simpler structure, making it faster to train while still capturing long-term dependencies in data.

This method automatically configures appropriate GRU layers based on your task type, with sensible defaults for hidden layer sizes and activation functions.

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultGenreClassifierLayers(int, int, int, int, int, double)

Creates default genre classification layers.

public static IEnumerable<ILayer<T>> CreateDefaultGenreClassifierLayers(int numMels = 128, int hiddenDim = 256, int numClasses = 10, int maxFrames = 1000, int numAttentionLayers = 4, double dropoutRate = 0.3)

Parameters

numMels int

Number of mel spectrogram bins (default: 128).

hiddenDim int

Hidden layer dimension (default: 256).

numClasses int

Number of genre classes (default: 10).

maxFrames int

Maximum input frames (default: 1000).

numAttentionLayers int

Number of attention layers (default: 4).

dropoutRate double

Dropout rate (default: 0.3).

Returns

IEnumerable<ILayer<T>>

A collection of layers for genre classification.

Remarks

Audio classification architecture with:

  • Mel spectrogram feature extraction
  • Transformer encoder for temporal modeling
  • Global average pooling
  • Classification head with softmax output

CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a GloVe (Global Vectors) model.

public static IEnumerable<ILayer<T>> CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a GloVe model.

Remarks

For Beginners: GloVe creates word embeddings by learning from the co-occurrence statistics of words. It uses two sets of embeddings and two sets of biases.

Note: The layers returned by this method are not intended to be used as a sequential feed-forward stack. They represent the four components (W, W_tilde, b, b_tilde) required for the GloVe model's custom forward pass.

CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates default layers for a Graph Attention Network (GAT).

public static IEnumerable<ILayer<T>> CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T> architecture, int numHeads = 8, int numLayers = 2, double dropoutRate = 0.6)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

numHeads int

Number of attention heads per layer (default: 8).

numLayers int

Number of GAT layers (default: 2).

dropoutRate double

Dropout rate for attention coefficients (default: 0.6).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GAT processing.

Remarks

For Beginners: GAT uses attention mechanisms to learn which neighbors are most important for each node, allowing dynamic weighting of neighbor contributions.

CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T>, int, int, int, double)

Creates default layers for a Graph Classification model.

public static IEnumerable<ILayer<T>> CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 128, int numGnnLayers = 3, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

embeddingDim int

Graph embedding dimension (default: 128).

numGnnLayers int

Number of GNN layers (default: 3).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for graph classification.

Remarks

For Beginners: Graph classification predicts labels for entire graphs. This architecture uses multiple GCN layers followed by pooling and classification.

CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Graph Generation model (VGAE encoder).

public static IEnumerable<ILayer<T>> CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 32, int numEncoderLayers = 2)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 32).

numEncoderLayers int

Number of encoder GNN layers (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for graph generation encoder.

Remarks

For Beginners: Graph generation models learn to create new graph structures. This encoder uses GCN layers to map node features to a latent space.

CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T>, int, int, bool, double)

Creates default layers for a Graph Isomorphism Network (GIN).

public static IEnumerable<ILayer<T>> CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T> architecture, int mlpHiddenDim = 64, int numLayers = 5, bool learnEpsilon = true, double initialEpsilon = 0)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

mlpHiddenDim int

Hidden dimension for MLP within GIN layers (default: 64).

numLayers int

Number of GIN layers (default: 5).

learnEpsilon bool

Whether to learn epsilon parameter (default: true).

initialEpsilon double

Initial value for epsilon (default: 0.0).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GIN processing.

Remarks

For Beginners: GIN is provably as powerful as the Weisfeiler-Lehman graph isomorphism test, making it optimal for distinguishing graph structures.

CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T>, SAGEAggregatorType, int, bool)

Creates default layers for a GraphSAGE (Graph Sample and Aggregate) Network.

public static IEnumerable<ILayer<T>> CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T> architecture, SAGEAggregatorType aggregatorType = SAGEAggregatorType.Mean, int numLayers = 2, bool normalize = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

aggregatorType SAGEAggregatorType

The type of aggregation function (default: Mean).

numLayers int

Number of GraphSAGE layers (default: 2).

normalize bool

Whether to apply L2 normalization (default: true).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for GraphSAGE processing.

Remarks

For Beginners: GraphSAGE learns to aggregate neighbor information for inductive learning. It can generalize to new, unseen nodes by learning aggregation functions.

CreateDefaultHTMLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates a default Hierarchical Temporal Memory (HTM) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultHTMLayers(NeuralNetworkArchitecture<T> architecture, int columnCount, int cellsPerColumn, double sparsityThreshold)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

columnCount int

The number of columns in the HTM network.

cellsPerColumn int

The number of cells per column.

sparsityThreshold double

The sparsity threshold for the spatial pooler.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for HTM processing.

Remarks

For Beginners: Hierarchical Temporal Memory (HTM) is a machine learning technology that mimics certain structural and algorithmic properties of the neocortex (the part of the brain responsible for higher-order thinking). HTM is particularly good at learning patterns in sequential data and making predictions.

Key HTM concepts:

  • Columns: Vertical arrangements of cells that work together
  • Cells: The basic processing units (like neurons)
  • Sparsity: Only a small percentage of cells are active at any time, which helps with learning
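
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder for constructing a NeuralNetworkArchitecture<double>; this overload has no defaults, so the column/cell/sparsity values below are illustrative, not library defaults):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// 2048 columns of 32 cells each, with 2% spatial-pooler sparsity (illustrative values).
IEnumerable<ILayer<double>> htmLayers = LayerHelper<double>.CreateDefaultHTMLayers(
    architecture,
    columnCount: 2048,
    cellsPerColumn: 32,
    sparsityThreshold: 0.02);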

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Hamiltonian Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Hamiltonian neural network.

Remarks

For Beginners: Hamiltonian Neural Networks (HNNs) learn the energy function (Hamiltonian) of a physical system. The network takes a state vector [q, p] (positions and momenta) as input and outputs a scalar energy value.

Key design choices:

  • Uses Tanh activation in hidden layers for smooth, bounded outputs that help with gradient computation
  • Output layer has linear activation since the Hamiltonian can be any real number
  • Architecture is designed for computing gradients (∂H/∂q, ∂H/∂p) to derive dynamics

The network structure enables Hamilton's equations:

  • dq/dt = ∂H/∂p (velocity from momentum gradient)
  • dp/dt = -∂H/∂q (force from position gradient)

This guarantees energy conservation by construction.
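
For example, for a one-dimensional harmonic oscillator (a standard textbook system, not specific to this library) with H(q, p) = p²/(2m) + ½kq², Hamilton's equations give:

  • dq/dt = ∂H/∂p = p/m
  • dp/dt = -∂H/∂q = -kq

which is exactly simple harmonic motion, with the energy H conserved along every trajectory.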

CreateDefaultInfographicVQALayers(int, int, int, int, int, int, int, int)

Creates default InfographicVQA layers for infographic understanding.

public static IEnumerable<ILayer<T>> CreateDefaultInfographicVQALayers(int imageSize = 1024, int visionDim = 768, int textDim = 768, int fusionDim = 768, int visionLayers = 12, int fusionLayers = 6, int numHeads = 12, int vocabSize = 30522)

Parameters

imageSize int

Input image size (default: 1024).

visionDim int

Vision encoder dimension (default: 768).

textDim int

Text encoder dimension (default: 768).

fusionDim int

Fusion dimension (default: 768).

visionLayers int

Number of vision layers (default: 12).

fusionLayers int

Number of fusion layers (default: 6).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an InfographicVQA model.

CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for an Instructor/E5 (Instruction-Tuned) embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultInternVideo2Layers(int, int, int, int, int, int)

Creates layers for an InternVideo2-style video understanding model.

public static IEnumerable<ILayer<T>> CreateDefaultInternVideo2Layers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numEncoderLayers = 12, int patchSize = 14)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

embedDim int

Embedding dimension (default: 768).

numEncoderLayers int

Number of transformer encoder layers (default: 12).

patchSize int

Patch size for video tokenization (default: 14).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video understanding.

Remarks

For Beginners: InternVideo2 understands video content by encoding frames into embeddings that capture both spatial (what's in each frame) and temporal (how things change over time) information. It can be used for:

  • Video classification (identifying what's happening)
  • Video-text retrieval (finding videos matching descriptions)
  • Video question answering

Architecture (based on the paper):

  1. Patch embedding converts video frames into tokens
  2. Spatial attention processes within-frame relationships
  3. Temporal attention processes across-frame relationships
  4. FFN layers add non-linearity and expressiveness
  5. Projection maps to a shared video-text embedding space

Reference: "InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding" https://arxiv.org/abs/2403.15377

CreateDefaultLSMLayers(NeuralNetworkArchitecture<T>, int, double, double, double, double)

Creates a default configuration of layers for a Liquid State Machine (LSM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultLSMLayers(NeuralNetworkArchitecture<T> architecture, int reservoirSize = 100, double connectionProbability = 0.1, double spectralRadius = 0.9, double inputScaling = 1, double leakingRate = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

reservoirSize int

The size of the reservoir (number of neurons in the reservoir layer). Default is 100.

connectionProbability double

The probability of connection between neurons in the reservoir. Default is 0.1 (10%).

spectralRadius double

Controls the stability of the reservoir dynamics. Default is 0.9.

inputScaling double

Scaling factor for input connections. Default is 1.0.

leakingRate double

Controls how quickly the reservoir responds to new inputs. Default is 1.0.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Liquid State Machine.

Remarks

For Beginners: A Liquid State Machine is a special type of neural network inspired by how the brain processes information. The key component is the "reservoir" - imagine it as a pool of randomly connected neurons that create complex patterns when input is fed into them.

  • The reservoirSize is how many neurons are in this pool
  • The connectionProbability determines how densely connected these neurons are
  • The spectralRadius affects how stable the patterns in the reservoir are
  • The inputScaling controls how strongly the input affects the reservoir
  • The leakingRate determines how quickly the reservoir responds to new information

LSMs are particularly good at processing time-dependent data like speech or video.
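
A minimal call-site sketch, using the documented defaults except for a larger reservoir and a slower leaking rate (BuildArchitecture is a hypothetical placeholder for constructing the architecture):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> lsmLayers = LayerHelper<double>.CreateDefaultLSMLayers(
    architecture,
    reservoirSize: 200,          // more neurons in the reservoir pool
    connectionProbability: 0.1,  // 10% chance any two reservoir neurons connect
    spectralRadius: 0.9,         // keeps reservoir dynamics stable
    inputScaling: 1.0,
    leakingRate: 0.3);           // reservoir responds more slowly to new inputs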

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input shape is not specified or input/output size is not positive.

CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a Long Short-Term Memory (LSTM) neural network.

public static IEnumerable<ILayer<T>> CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for an LSTM neural network.

Remarks

For Beginners: LSTM (Long Short-Term Memory) networks are a special kind of neural network designed to remember information for long periods of time. Think of them like a person with a good memory who can recall things from the past to make decisions in the present.

LSTMs are particularly useful for:

  • Text prediction (like autocomplete on your phone)
  • Speech recognition
  • Time series forecasting (like stock prices or weather)
  • Any task where the order of data matters

Key terms explained:

  • Hidden Size: How much information the network can remember at once (bigger = more memory)
  • Layers: How many processing steps the data goes through (more layers = more complex patterns)
  • Activation Function: How neurons decide whether to fire (like Tanh or Sigmoid)
  • Recurrent Activation: Special activation function used for the memory gates

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input shape is not specified or input/output size is not positive.

CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Lagrangian Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

hiddenLayerCount int

Number of hidden layers (default: 3).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Lagrangian neural network.

Remarks

For Beginners: Lagrangian Neural Networks (LNNs) learn the Lagrangian function L(q, q̇) of a physical system. The Lagrangian is typically L = T - V (kinetic minus potential energy).

Key design choices:

  • Uses Tanh activation in hidden layers for the smooth derivatives needed in the Euler-Lagrange equations
  • Output is scalar (the Lagrangian value)
  • Structure supports computing second derivatives for equations of motion

The Euler-Lagrange equation:

  d/dt(∂L/∂q̇) = ∂L/∂q

This gives the equations of motion while automatically respecting conservation laws.
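
For example, for a one-dimensional harmonic oscillator with L(q, q̇) = ½mq̇² - ½kq² (a standard textbook case, not specific to this library):

  • ∂L/∂q̇ = mq̇, so d/dt(∂L/∂q̇) = mq̈
  • ∂L/∂q = -kq

and the Euler-Lagrange equation yields mq̈ = -kq, the familiar equation of simple harmonic motion.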

CreateDefaultLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates a standard feed-forward neural network with configurable hidden layers.

public static IEnumerable<ILayer<T>> CreateDefaultLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 1, int hiddenLayerSize = 64, int outputSize = 1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 1).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 64).

outputSize int

Number of output neurons (default: 1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a feed-forward neural network.

Remarks

For Beginners: A feed-forward neural network is the simplest type of neural network where information flows in one direction from input to output. Think of it as an assembly line where each layer processes the data and passes it to the next layer.

This method creates:

  • An input layer that takes your data
  • One or more hidden layers that learn patterns in your data
  • An output layer that produces the final prediction
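
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder; the parameter names come from this API and the argument values are illustrative):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// Two hidden layers of 128 neurons each, with a single output neuron.
IEnumerable<ILayer<double>> layers = LayerHelper<double>.CreateDefaultLayers(
    architecture,
    hiddenLayerCount: 2,
    hiddenLayerSize: 128,
    outputSize: 1);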

CreateDefaultLayoutGraphLayers(int, int, int, int)

Creates default LayoutGraph layers for graph-based layout analysis.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutGraphLayers(int inputDim = 768, int hiddenDim = 256, int numGraphLayers = 4, int numClasses = 7)

Parameters

inputDim int

Input feature dimension (default: 768).

hiddenDim int

Hidden dimension (default: 256).

numGraphLayers int

Number of graph layers (default: 4).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutGraph model.

CreateDefaultLayoutLMLayers(int, int, int, int, int, int)

Creates default LayoutLM (v1) layers for document understanding with layout-aware pre-training.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int maxSequenceLength = 512, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768 for BERT-base).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522 for BERT).

maxSequenceLength int

Maximum sequence length (default: 512).

numClasses int

Number of output classes (default: 7 for FUNSD).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLM model.

Remarks

LayoutLM v1 combines BERT text embeddings with 2D position embeddings to jointly model text and layout. Unlike v2/v3, it does NOT use visual features.

Reference: "LayoutLM: Pre-training of Text and Layout for Document Image Understanding" (KDD 2020) https://arxiv.org/abs/1912.13318

CreateDefaultLayoutLMv2Layers(int, int, int, int, int, int, int)

Creates default LayoutLMv2 layers for document understanding with visual features.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv2Layers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768 for BERT-base).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 30522 for BERT).

imageSize int

Input image size (default: 224).

visualBackboneChannels int

Visual backbone output channels (default: 256).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLMv2 model.

Remarks

LayoutLMv2 extends LayoutLM by adding visual features from a ResNeXt-FPN backbone, enabling the model to understand documents through text, layout, AND image features.

Key components:

  • Visual backbone (ResNeXt-101 with FPN)
  • Text encoder (BERT-base)
  • Spatial-aware self-attention mechanism

Reference: "LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding" (ACL 2021) https://arxiv.org/abs/2012.14740

CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int)

Creates default LayoutLMv3 layers for document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 50265, int imageSize = 224, int patchSize = 16, int numClasses = 17)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenDim int

Hidden dimension size (default: 768 from paper).

numLayers int

Number of transformer layers (default: 12 from paper).

numHeads int

Number of attention heads (default: 12 from paper).

vocabSize int

Vocabulary size (default: 50265 for RoBERTa tokenizer).

imageSize int

Input image size (default: 224).

patchSize int

Vision patch size (default: 16).

numClasses int

Number of output classes (default: 17 for layout detection).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutLMv3 architecture.

Remarks

LayoutLMv3 uses unified multimodal pre-training with:

  • Text embedding layer (RoBERTa-style)
  • Image patch embedding (ViT-style)
  • Transformer encoder with spatial-aware self-attention
  • Classification head for layout detection or other tasks

Reference: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" (ICCV 2022)

CreateDefaultLayoutXLMLayers(int, int, int, int, int, int, int)

Creates default LayoutXLM layers for multilingual document understanding.

public static IEnumerable<ILayer<T>> CreateDefaultLayoutXLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 250002, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

vocabSize int

Vocabulary size (default: 250002 for XLM-RoBERTa).

imageSize int

Input image size (default: 224).

visualBackboneChannels int

Visual backbone channels (default: 256).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LayoutXLM model.

CreateDefaultLiLTLayers(int, int, int, int, int, int)

Creates default LiLT (Language-Independent Layout Transformer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultLiLTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int layoutDim = 768, int vocabSize = 30522, int numClasses = 7)

Parameters

hiddenDim int

Hidden dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

numHeads int

Number of attention heads (default: 12).

layoutDim int

Layout embedding dimension (default: 768).

vocabSize int

Vocabulary size (default: 30522).

numClasses int

Number of output classes (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a LiLT model.

CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T>, int, int, int, double)

Creates default layers for a Link Prediction model encoder.

public static IEnumerable<ILayer<T>> CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 32, int numLayers = 2, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

embeddingDim int

Node embedding dimension (default: 32).

numLayers int

Number of GNN layers (default: 2).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for link prediction.

Remarks

For Beginners: Link prediction predicts whether edges should exist between nodes. This encoder learns node embeddings that can be combined to score potential edges.

CreateDefaultMATCHALayers(int, int, int, int, int, int, int)

Creates default MATCHA (chart understanding) layers.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultMATCHALayers(int encoderDim = 1536, int decoderDim = 1536, int encoderLayers = 18, int decoderLayers = 18, int numHeads = 24, int vocabSize = 50265, int maxPatchesPerImage = 4096)

Parameters

encoderDim int

Encoder dimension (default: 1536).

decoderDim int

Decoder dimension (default: 1536).

encoderLayers int

Number of encoder layers (default: 18).

decoderLayers int

Number of decoder layers (default: 18).

numHeads int

Number of attention heads (default: 24).

vocabSize int

Vocabulary size (default: 50265).

maxPatchesPerImage int

Maximum patches per image (default: 4096).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Encoder and decoder layers for a MATCHA model.

CreateDefaultMRLLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a Matryoshka Representation Learning (MRL) model.

public static IEnumerable<ILayer<T>> CreateDefaultMRLLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int maxEmbeddingDimension = 1536, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
maxEmbeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T>, int, int)

Creates a default Memory Network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T> architecture, int memorySize, int embeddingSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

memorySize int

The size of the memory component (number of memory slots).

embeddingSize int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Memory Network.

Remarks

For Beginners: A Memory Network is a type of neural network that has an explicit memory component. Think of it like a notebook that the network can write to and read from while processing information. This makes it particularly good at tasks that require remembering context from earlier in a sequence, such as answering questions about a story or maintaining a conversation.

The memory size parameter controls how many "pages" are in the notebook, while the embedding size determines how detailed each "note" can be.
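
A minimal call-site sketch (BuildArchitecture is a hypothetical placeholder; memorySize and embeddingSize have no defaults, so the values below are illustrative):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> memoryLayers = LayerHelper<double>.CreateDefaultMemoryNetworkLayers(
    architecture,
    memorySize: 64,       // 64 memory slots ("pages" in the notebook)
    embeddingSize: 128);  // each "note" is a 128-dimensional vector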

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T>, int, int[]?, int[]?, int[]?, int, bool, double, bool)

Creates default layers for a MeshCNN architecture for mesh classification/segmentation.

public static IEnumerable<ILayer<T>> CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 5, int[]? convChannels = null, int[]? poolTargets = null, int[]? fcSizes = null, int numNeighbors = 4, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = false)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

inputFeatures int

Number of input features per edge. Default is 5.

convChannels int[]

Channel sizes for each edge convolution block.

poolTargets int[]

Target edge counts after each pooling operation.

fcSizes int[]

Sizes of fully connected layers before output.

numNeighbors int

Number of neighboring edges per edge. Default is 4.

useBatchNorm bool

Whether to use batch normalization. Default is true.

dropoutRate double

Dropout rate for regularization. Default is 0.5.

useGlobalAveragePooling bool

Whether to use global average pooling. Default is false (max pooling).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for mesh processing.

Remarks

For Beginners: MeshCNN processes 3D mesh data by learning from edge features.

The architecture consists of:

  • Edge convolution blocks: Learn patterns from edge neighborhoods
  • Mesh pooling: Simplify the mesh by removing less important edges
  • Global pooling: Aggregate all edge features into a fixed-size vector
  • Fully connected layers: Map aggregated features to class predictions

Applications include:

  • 3D shape classification from mesh data
  • Mesh segmentation (labeling different parts)
  • Learning from CAD models and 3D scans

Exceptions

InvalidOperationException

Thrown when the architecture has invalid output size.

CreateDefaultMiDaSLayers(int, int, int, int, int)

Creates default layers for MiDaS depth estimation.

public static IEnumerable<ILayer<T>> CreateDefaultMiDaSLayers(int inputChannels = 3, int inputHeight = 384, int inputWidth = 384, int embedDim = 768, int numEncoderLayers = 12)

Parameters

inputChannels int
inputHeight int
inputWidth int
embedDim int
numEncoderLayers int

Returns

IEnumerable<ILayer<T>>

CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T>, MobileNetV2Configuration)

Creates default layers for a MobileNetV2 network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV2Configuration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration MobileNetV2Configuration

The MobileNetV2-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MobileNetV2 network.

Remarks

For Beginners: MobileNetV2 is designed for efficient mobile inference, using inverted residual blocks with linear bottlenecks to achieve high accuracy with low computational cost.

CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T>, MobileNetV3Configuration)

Creates default layers for a MobileNetV3 network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV3Configuration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration MobileNetV3Configuration

The MobileNetV3-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MobileNetV3 network.

Remarks

For Beginners: MobileNetV3 builds on MobileNetV2 with additional optimizations including squeeze-and-excitation blocks and hard-swish activation for improved accuracy and efficiency.

CreateDefaultMusicGenLayers(int, int, int, int, int, int, int, int, double)

Creates default MusicGen layers for text-to-music generation.

public static IEnumerable<ILayer<T>> CreateDefaultMusicGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 2048, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768 for T5-base).

lmHiddenDim int

Language model hidden dimension (default: 1536).

numLmLayers int

Number of language model transformer layers (default: 24).

numHeads int

Number of attention heads (default: 16).

numCodebooks int

Number of EnCodec codebooks (default: 4).

codebookSize int

Size of each codebook vocabulary (default: 2048).

maxTextLength int

Maximum text sequence length (default: 256).

maxAudioTokens int

Maximum number of audio tokens, at roughly 50 tokens per second (default: 1500, about 30 seconds of audio).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a MusicGen model.

Remarks

MusicGen is Meta's text-to-music generation model that uses a single-stage transformer language model operating over EnCodec audio codes. Key features:

  • Delay pattern for codebook interleaving (reduces sequence length)
  • T5-based text encoder for conditioning
  • Transformer decoder generating audio codes autoregressively
  • EnCodec neural audio codec for high-quality audio reconstruction

Reference: "Simple and Controllable Music Generation" by Copet et al., 2023

CreateDefaultNTMLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates a default configuration of layers for a Neural Turing Machine (NTM).

public static IEnumerable<ILayer<T>> CreateDefaultNTMLayers(NeuralNetworkArchitecture<T> architecture, int memorySize = 128, int memoryVectorSize = 20, int controllerSize = 100)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

memorySize int

The number of memory locations (default: 128).

memoryVectorSize int

The size of each memory vector (default: 20).

controllerSize int

The size of the controller network (default: 100).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Neural Turing Machine.

Remarks

For Beginners: A Neural Turing Machine (NTM) is a type of neural network that has an external memory component, similar to how computers have RAM. The network learns to read from and write to this memory, which helps it solve tasks that require remembering information over long periods.

  • memorySize: How many "slots" are in the memory (like pages in a notebook)
  • memoryVectorSize: How much information each memory slot can hold
  • controllerSize: How complex the "brain" of the network is that decides what to read/write
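
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder for constructing the architecture):

NeuralNetworkArchitecture<float> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<float>> ntmLayers = LayerHelper<float>.CreateDefaultNTMLayers(
    architecture,
    memorySize: 128,       // 128 memory locations
    memoryVectorSize: 20,  // each location stores a 20-element vector
    controllerSize: 100);  // controller network width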

Exceptions

ArgumentNullException

Thrown when architecture is null.

ArgumentException

Thrown when memory parameters are not positive.

CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T>)

Creates a default configuration of layers for a standard neural network.

public static IEnumerable<ILayer<T>> CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a standard neural network.

Remarks

For Beginners: This method creates the basic building blocks (layers) of a neural network. Think of layers as a series of connected processing units that transform your input data step by step until it produces the desired output. The complexity parameter in the architecture determines how many layers and neurons your network will have: Simple networks have fewer layers, while Deep networks have more layers for handling more complex problems.

Exceptions

ArgumentNullException

Thrown when architecture is null.

InvalidOperationException

Thrown when input size or output size is not positive.

CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T>, int, int, double)

Creates default layers for a Node Classification model.

public static IEnumerable<ILayer<T>> CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int numLayers = 2, double dropoutRate = 0.5)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenDim int

Hidden dimension size (default: 64).

numLayers int

Number of GNN layers (default: 2).

dropoutRate double

Dropout rate for regularization (default: 0.5).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for node classification.

Remarks

For Beginners: Node classification predicts labels for individual nodes in a graph. This architecture uses GCN layers with dropout for semi-supervised learning on graphs.
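
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> nodeLayers = LayerHelper<double>.CreateDefaultNodeClassificationLayers(
    architecture,
    hiddenDim: 64,
    numLayers: 2,
    dropoutRate: 0.5);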

CreateDefaultNougatLayers(int, int, int, int, int, int, int, int)

Creates default Nougat layers for academic document understanding.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultNougatLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 10, int numHeads = 16, int vocabSize = 50000, int imageSize = 896, int patchSize = 16, int maxSequenceLength = 4096)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 10).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

imageSize int

Input image size (default: 896).

patchSize int

Patch size (default: 16).

maxSequenceLength int

Maximum sequence length (default: 4096).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "Nougat: Neural Optical Understanding for Academic Documents" (arXiv 2023)

CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T>)

Creates default layers for an occupancy detection neural network without temporal data.

public static IEnumerable<ILayer<T>> CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a non-temporal occupancy detection network.

Remarks

For Beginners: This method builds a simpler neural network for detecting occupancy (whether a space is occupied by people) using data from a single point in time, rather than a sequence of time points. It uses standard Dense layers (also called fully connected layers) to process the input features.

Non-temporal data means the model makes predictions based only on current data points without considering how values have changed over time. For example, using the current temperature, humidity, and CO2 levels to predict occupancy without looking at historical values.

CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T>, int)

Creates default layers for an occupancy detection neural network with temporal data.

public static IEnumerable<ILayer<T>> CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T> architecture, int historyWindowSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration that defines input and output shapes.

historyWindowSize int

The number of time steps to consider in the temporal data (how many past observations to include).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a temporal occupancy detection network.

Remarks

For Beginners: This method builds a neural network specifically designed to detect occupancy (whether a space is occupied by people) using data that changes over time. It uses special layer types like LSTM (Long Short-Term Memory) that can "remember" patterns in sequential data, and attention mechanisms that help the network focus on the most important time steps in the data sequence.

Temporal data refers to data collected over time, where the sequence and patterns across time points are important for making predictions. For example, sensor readings collected every minute over several hours would be temporal data.

CreateDefaultOpticalFlowLayers(int, int, int, int)

Creates layers for an optical flow estimation model (RAFT-style).

public static IEnumerable<ILayer<T>> CreateDefaultOpticalFlowLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int hiddenDim = 192)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

hiddenDim int

Hidden dimension for flow estimation (default: 192).

Returns

IEnumerable<ILayer<T>>

A collection of layers for optical flow estimation.

Remarks

For Beginners: Optical flow tells you how each pixel moves between two frames. This is useful for motion analysis, video editing, and as input to other models. The output is a 2-channel tensor showing horizontal and vertical motion.

Architecture:

  1. Feature encoder extracts features from both frames
  2. Correlation volume computes matching scores
  3. Iterative refinement improves the flow estimate

CreateDefaultPICKLayers(int, int, int, int, int, int)

Creates default PICK layers for key information extraction.

public static IEnumerable<ILayer<T>> CreateDefaultPICKLayers(int hiddenDim = 256, int numGcnLayers = 2, int numHeads = 8, int vocabSize = 30522, int numEntityTypes = 14, int maxSequenceLength = 512)

Parameters

hiddenDim int

Hidden dimension (default: 256).

numGcnLayers int

Number of GCN layers (default: 2).

numHeads int

Number of attention heads (default: 8).

vocabSize int

Vocabulary size (default: 30522).

numEntityTypes int

Number of entity types (default: 14).

maxSequenceLength int

Maximum sequence length (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PICK model.

Remarks

Reference: "PICK: Processing Key Information Extraction" (ICPR 2020)

CreateDefaultPINNLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Physics-Informed Neural Network (PINN).

public static IEnumerable<ILayer<T>> CreateDefaultPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 32).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PINN.

Remarks

For Beginners: Physics-Informed Neural Networks (PINNs) solve PDEs by training a neural network to minimize the PDE residual at collocation points. The network learns the solution function u(x,t) while respecting the physics (PDE, boundary conditions, and initial conditions).

Tanh activation is used for the smooth derivatives needed when computing PDE residuals; multiple hidden layers capture complex solution behavior; and the output layer is linear since PDE solutions can take any real value.
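
A minimal call-site sketch using the documented defaults of 4 hidden layers with 32 neurons each (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> pinnLayers = LayerHelper<double>.CreateDefaultPINNLayers(
    architecture,
    hiddenLayerCount: 4,
    hiddenLayerSize: 32);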

CreateDefaultPSENetLayers(int, int, int, int)

Creates default PSENet (Progressive Scale Expansion Network) layers.

public static IEnumerable<ILayer<T>> CreateDefaultPSENetLayers(int imageSize = 640, int backboneChannels = 256, int featureChannels = 256, int numKernels = 7)

Parameters

imageSize int

Input image size (default: 640).

backboneChannels int

Backbone channels (default: 256).

featureChannels int

Feature channels (default: 256).

numKernels int

Number of scale kernels (default: 7).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a PSENet model.

CreateDefaultPix2StructLayers(int, int, int, int, int, int, int, int)

Creates default Pix2Struct layers for screenshot parsing.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultPix2StructLayers(int hiddenDim = 1024, int numEncoderLayers = 18, int numDecoderLayers = 18, int numHeads = 16, int vocabSize = 50000, int patchSize = 16, int maxPatches = 4096, int maxSequenceLength = 1024)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 18).

numDecoderLayers int

Number of decoder layers (default: 18).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

patchSize int

Patch size (default: 16).

maxPatches int

Maximum patches (default: 4096).

maxSequenceLength int

Maximum sequence length (default: 1024).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "Pix2Struct: Screenshot Parsing as Pretraining" (ICML 2023)

CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T>, int)

Creates a default configuration of layers for a Quantum Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T> architecture, int numQubits = 4)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

numQubits int

The number of qubits to use in quantum layers (default: 4).

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for a Quantum Neural Network.

Remarks

For Beginners: A Quantum Neural Network combines quantum computing concepts with neural networks. Think of qubits as special units that can exist in multiple states at once (unlike regular bits that are either 0 or 1). This gives quantum networks potential advantages for certain problems. The numQubits parameter controls how many of these special quantum units are used in each quantum layer.

Exceptions

ArgumentNullException

Thrown when architecture is null.

ArgumentException

Thrown when numQubits is not positive.

CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T>, int, IRadialBasisFunction<T>?)

Creates a default Radial Basis Function (RBF) neural network layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 0, IRadialBasisFunction<T>? rbfFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

hiddenSize int

The size of the hidden layer. If set to 0 or negative, a default size will be calculated.

rbfFunction IRadialBasisFunction<T>

The radial basis function to use. If null, a default Gaussian RBF will be used.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for RBF network processing.

Remarks

For Beginners: A Radial Basis Function (RBF) Network is a special type of neural network that uses "distance" to make predictions. Instead of gradually learning patterns through weights like standard neural networks, RBF networks measure how similar or different an input is from known examples.

Think of it like this: if you want to identify a fruit, you might compare how similar it looks to fruits you already know. An RBF network works in a similar way - it has "reference points" and measures how close new data is to these points.

RBF networks are particularly good at function approximation, pattern recognition, and time series prediction.
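
A minimal call-site sketch showing the documented fallback behavior (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

// hiddenSize of 0 lets the helper calculate a default size; a null rbfFunction
// falls back to the default Gaussian RBF.
IEnumerable<ILayer<double>> rbfLayers = LayerHelper<double>.CreateDefaultRBFNetworkLayers(
    architecture,
    hiddenSize: 0,
    rbfFunction: null);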

CreateDefaultRNNLayers(NeuralNetworkArchitecture<T>)

Creates a default Recurrent Neural Network (RNN) layer configuration.

public static IEnumerable<ILayer<T>> CreateDefaultRNNLayers(NeuralNetworkArchitecture<T> architecture)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for RNN-based processing.

Remarks

For Beginners: A Recurrent Neural Network (RNN) is designed to work with sequential data by maintaining a form of "memory" of previous inputs. Unlike standard neural networks, RNNs can use their internal state to process sequences of inputs, making them ideal for tasks like text analysis, speech recognition, or time series prediction.

This method automatically configures appropriate RNN layers with sensible defaults, including hidden layer sizes and activation functions.

CreateDefaultRVMLayers(int, int, int, int)

Creates default layers for RVM (Robust Video Matting).

public static IEnumerable<ILayer<T>> CreateDefaultRVMLayers(int inputChannels = 3, int inputHeight = 512, int inputWidth = 512, int numFeatures = 32)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int

Returns

IEnumerable<ILayer<T>>

CreateDefaultResNetLayers(NeuralNetworkArchitecture<T>, int, int)

Creates a Residual Neural Network (ResNet) with configurable blocks.

public static IEnumerable<ILayer<T>> CreateDefaultResNetLayers(NeuralNetworkArchitecture<T> architecture, int blockCount = 3, int blockSize = 2)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

blockCount int

Number of residual blocks (default: 3).

blockSize int

Number of convolutional layers in each block (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a ResNet.

Remarks

For Beginners: A Residual Network (ResNet) is designed to solve the "vanishing gradient problem" that occurs when training very deep networks. It does this by adding "skip connections" that allow information to bypass some layers.

Think of it like this: In a traditional network, each layer must learn everything from scratch. In a ResNet, each layer only needs to learn the "difference" (or residual) between its input and the desired output, which is often easier to learn.

Key components:

  • Initial convolutional layer: Processes the raw input
  • Residual blocks: Groups of layers with skip connections
  • Global pooling: Reduces the spatial dimensions to a single value per feature map
  • Final dense layer: Makes the prediction based on the extracted features

CreateDefaultSAM2Layers(int, int, int, int)

Creates all SAM2 layers for backward compatibility.

[Obsolete("Use individual SAM2 factory methods (CreateSAM2ImageEncoderLayers, etc.) for proper multi-branch architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSAM2Layers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)

Parameters

inputChannels int
inputHeight int
inputWidth int
numFeatures int

Returns

IEnumerable<ILayer<T>>

Remarks

Warning: This method returns layers from multiple branches that cannot be chained sequentially. Use the individual factory methods (CreateSAM2ImageEncoderLayers, CreateSAM2PromptEncoderLayers, CreateSAM2MemoryLayers, CreateSAM2MaskDecoderLayers) for proper multi-branch handling.

CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for an SGPT (Sentence GPT) decoder-only embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 50257, int embeddingDimension = 768, int maxSequenceLength = 1024, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a SPLADE (Sparse Lexical and Expansion Model) embedding model.

public static IEnumerable<ILayer<T>> CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSVTRLayers(int, int, int, int, int, int)

Creates default SVTR (Scene Text Visual Transformer Recognizer) layers.

public static IEnumerable<ILayer<T>> CreateDefaultSVTRLayers(int imageWidth = 256, int imageHeight = 64, int hiddenDim = 192, int numLayers = 8, int numHeads = 6, int charsetSize = 95)

Parameters

imageWidth int

Input image width (default: 256).

imageHeight int

Input image height (default: 64).

hiddenDim int

Hidden dimension (default: 192).

numLayers int

Number of transformer layers (default: 8).

numHeads int

Number of attention heads (default: 6).

charsetSize int

Character set size (default: 95).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming an SVTR model.

CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a Siamese neural network using a Transformer-based encoder.

public static IEnumerable<ILayer<T>> CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary (default: 30522).

embeddingDimension int

The dimension of the embedding vectors (default: 768).

maxSequenceLength int

The maximum length of input sequences (default: 512).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Siamese encoder.

Remarks

For Beginners: A Siamese Network uses two identical "twin" networks to process different inputs. This method sets up the structure for one of those twins, typically using a Transformer encoder to turn text into an embedding (a point in a vector space) that can be compared to others.

CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)

Creates default layers for a SimCSE (Simple Contrastive Learning of Sentence Embeddings) model.

public static IEnumerable<ILayer<T>> CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)

Parameters

architecture NeuralNetworkArchitecture<T>
vocabSize int
embeddingDimension int
maxSequenceLength int
numLayers int
numHeads int
feedForwardDim int

Returns

IEnumerable<ILayer<T>>

CreateDefaultSlowFastLayers(int, int, int, int, int, int, int)

Creates all SlowFast layers for backward compatibility (returns only the slow pathway).

[Obsolete("Use individual SlowFast factory methods for proper dual-pathway architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSlowFastLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numClasses = 400, int slowChannels = 64, int fastChannels = 8, int alpha = 8)

Parameters

inputChannels int
inputHeight int
inputWidth int
numClasses int
slowChannels int
fastChannels int
alpha int

Returns

IEnumerable<ILayer<T>>

Remarks

Warning: SlowFast is a dual-pathway architecture that cannot be represented as a single sequential layer list. Use the individual factory methods:

  • CreateSlowFastSlowPathwayLayers
  • CreateSlowFastFastPathwayLayers
  • CreateSlowFastFusionLayers

CreateDefaultSourceSeparationLayers(int, int, int, int, double)

Creates default music source separation layers (U-Net style).

public static IEnumerable<ILayer<T>> CreateDefaultSourceSeparationLayers(int numMels = 513, int baseChannels = 32, int numSources = 4, int maxFrames = 512, double dropoutRate = 0.1)

Parameters

numMels int

Number of spectrogram frequency bins (default: 513 for STFT with 1024 window).

baseChannels int

Base channel count for U-Net (default: 32).

numSources int

Number of output sources (default: 4 for vocals, drums, bass, other).

maxFrames int

Maximum time frames (default: 512).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers for music source separation.

Remarks

U-Net inspired architecture for source separation with:

  • Encoder path with downsampling
  • Bottleneck with attention
  • Decoder path with upsampling and skip connections
  • Multi-source mask prediction

Reference: "Open-Unmix - A Reference Implementation for Music Source Separation"

CreateDefaultSpeakerEmbeddingLayers(int, int, int, int, int, double)

Creates default speaker embedding layers for speaker verification and identification.

public static IEnumerable<ILayer<T>> CreateDefaultSpeakerEmbeddingLayers(int numMels = 80, int hiddenDim = 512, int embeddingDim = 256, int numLayers = 3, int maxFrames = 500, double dropoutRate = 0.1)

Parameters

numMels int

Number of mel spectrogram bins (default: 80).

hiddenDim int

Hidden layer dimension (default: 512).

embeddingDim int

Output embedding dimension (default: 256).

numLayers int

Number of LSTM-like layers (default: 3).

maxFrames int

Maximum input frames (default: 500).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers for speaker embedding extraction.

Remarks

ECAPA-TDNN inspired architecture for speaker embedding with:

  • Frame-level feature extraction with attention
  • Temporal context aggregation
  • Attentive statistics pooling
  • Speaker embedding projection

Reference: "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN"

CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T>, SpikingNeuronType, double, double, bool, bool)

Creates default layers for a Spiking Neural Network (SNN).

public static IEnumerable<ILayer<T>> CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T> architecture, SpikingNeuronType neuronType = SpikingNeuronType.LeakyIntegrateAndFire, double tau = 10, double refractoryPeriod = 2, bool useLayerNormalization = false, bool useOutputConversion = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

neuronType SpikingNeuronType

The type of spiking neuron to use.

tau double

The membrane time constant that controls how quickly neurons respond to inputs.

refractoryPeriod double

The period after firing during which a neuron cannot fire again.

useLayerNormalization bool

Whether to use layer normalization to stabilize training.

useOutputConversion bool

Whether to convert spike outputs to continuous values.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Spiking Neural Network.

Remarks

For Beginners: Spiking Neural Networks (SNNs) are a type of neural network that more closely mimics how real neurons in the brain work. Unlike traditional neural networks that use continuous values, SNNs use "spikes" (binary on/off signals) to communicate between neurons. This makes them more biologically realistic and potentially more energy-efficient for certain tasks.

The tau parameter controls how quickly a neuron "forgets" previous inputs - larger values make the neuron remember inputs for longer. The refractory period is like a "rest time" after a neuron fires, during which it cannot fire again, similar to how real neurons behave.
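
A minimal call-site sketch using the documented defaults (BuildArchitecture is a hypothetical placeholder):

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper

IEnumerable<ILayer<double>> snnLayers = LayerHelper<double>.CreateDefaultSpikingLayers(
    architecture,
    neuronType: SpikingNeuronType.LeakyIntegrateAndFire,
    tau: 10,                      // membrane time constant
    refractoryPeriod: 2,          // rest time after a spike
    useLayerNormalization: false,
    useOutputConversion: true);   // convert spikes back to continuous values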

CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T>, int, int, int[]?, double[]?, int[]?, bool, double, bool)

Creates the default layer sequence for a SpiralNet mesh neural network.

public static IEnumerable<ILayer<T>> CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 3, int spiralLength = 9, int[]? convChannels = null, double[]? poolRatios = null, int[]? fcSizes = null, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = true)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

inputFeatures int

Number of input features per vertex (default: 3 for coordinates).

spiralLength int

Length of spiral sequences for convolutions.

convChannels int[]

Channel sizes for each spiral convolution block.

poolRatios double[]

Pooling ratios for mesh simplification at each level.

fcSizes int[]

Sizes of fully connected layers before output.

useBatchNorm bool

Whether to use batch normalization after convolutions.

dropoutRate double

Dropout rate for fully connected layers.

useGlobalAveragePooling bool

Whether to use global average (true) or max (false) pooling.

Returns

IEnumerable<ILayer<T>>

An enumerable of layers forming the SpiralNet architecture.

Remarks

For Beginners: This method builds the default layer stack for SpiralNet++.

Architecture pattern:

  • Multiple spiral convolution blocks (SpiralConv + optional BatchNorm)
  • Global pooling to aggregate vertex features
  • Fully connected layers for classification

Applications:

  • 3D face recognition and reconstruction
  • Human body shape analysis
  • Medical mesh analysis

Exceptions

InvalidOperationException

Thrown when the architecture has invalid output size.

CreateDefaultStableAudioLayers(int, int, int, int, int, int, int, double)

Creates default Stable Audio layers for text-to-audio generation.

public static IEnumerable<ILayer<T>> CreateDefaultStableAudioLayers(int textHiddenDim = 768, int latentDim = 64, int ditHiddenDim = 1024, int numDitBlocks = 24, int numHeads = 16, int maxTextLength = 512, int maxAudioLength = 2048, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 768).

latentDim int

Latent space dimension (default: 64).

ditHiddenDim int

DiT hidden dimension (default: 1024).

numDitBlocks int

Number of DiT transformer blocks (default: 24).

numHeads int

Number of attention heads (default: 16).

maxTextLength int

Maximum text sequence length (default: 512).

maxAudioLength int

Maximum audio latent sequence length (default: 2048).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Stable Audio model.

Remarks

Stable Audio by Stability AI uses a Diffusion Transformer (DiT) architecture:

  • T5-based text encoder for conditioning
  • Variational autoencoder for audio latent compression
  • DiT (Diffusion Transformer) for denoising in latent space
  • Supports variable-length audio generation with timing conditioning

Reference: "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion" by Evans et al., 2024

CreateDefaultTRIELayers(int, int, int, int, int, int)

Creates default TRIE (Text Reading and Information Extraction) layers.

public static IEnumerable<ILayer<T>> CreateDefaultTRIELayers(int imageSize = 512, int visualDim = 256, int textDim = 256, int graphDim = 256, int numEntityTypes = 10, int maxEntities = 100)

Parameters

imageSize int

Input image size (default: 512).

visualDim int

Visual encoder dimension (default: 256).

textDim int

Text encoder dimension (default: 256).

graphDim int

Graph dimension (default: 256).

numEntityTypes int

Number of entity types (default: 10).

maxEntities int

Maximum entities (default: 100).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TRIE model.

CreateDefaultTableTransformerLayers(int, int, int, int, int, int, int)

Creates default layers for TableTransformer model.

public static IEnumerable<ILayer<T>> CreateDefaultTableTransformerLayers(int imageSize = 800, int hiddenDim = 256, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numQueries = 100, int numStructureClasses = 7)

Parameters

imageSize int

Input image size (default: 800).

hiddenDim int

Transformer hidden dimension (default: 256).

numEncoderLayers int

Number of encoder layers (default: 6).

numDecoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 8).

numQueries int

Number of object queries (default: 100).

numStructureClasses int

Number of structure classes (default: 7).

Returns

IEnumerable<ILayer<T>>

Enumerable of layers for TableTransformer.

Remarks

TableTransformer uses a DETR-style architecture with a ResNet backbone.

Reference: "PubTables-1M: Towards Comprehensive Table Extraction" (CVPR 2022)

CreateDefaultTimeSformerLayers(int, int, int, int, int, int, int)

Creates default layers for TimeSformer video classification.

public static IEnumerable<ILayer<T>> CreateDefaultTimeSformerLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numLayers = 12, int patchSize = 16, int numClasses = 400)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

embedDim int

Embedding dimension (default: 768).

numLayers int

Number of transformer layers (default: 12).

patchSize int

Patch size (default: 16).

numClasses int

Number of action classes (default: 400 for Kinetics).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TimeSformer model.

CreateDefaultTrOCRLayers(int, int, int, int, int, int, int, int, int, int)

Creates default layers for TrOCR text recognition model.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultTrOCRLayers(int imageSize = 384, int patchSize = 16, int encoderHiddenDim = 768, int decoderHiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 6, int numEncoderHeads = 12, int numDecoderHeads = 12, int vocabSize = 50265, int maxSequenceLength = 128)

Parameters

imageSize int

Input image size (default: 384).

patchSize int

ViT patch size (default: 16).

encoderHiddenDim int

Encoder hidden dimension (default: 768).

decoderHiddenDim int

Decoder hidden dimension (default: 768).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 6).

numEncoderHeads int

Number of encoder heads (default: 12).

numDecoderHeads int

Number of decoder heads (default: 12).

vocabSize int

Vocabulary size (default: 50265).

maxSequenceLength int

Maximum sequence length (default: 128).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

TrOCR uses a Vision Transformer (ViT) encoder and a Transformer decoder.

Reference: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" (AAAI 2022)

CreateDefaultTransformerLayers(TransformerArchitecture<T>)

Creates a default Transformer neural network with pre-configured encoder and decoder layers.

public static IEnumerable<ILayer<T>> CreateDefaultTransformerLayers(TransformerArchitecture<T> architecture)

Parameters

architecture TransformerArchitecture<T>

The transformer architecture configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Transformer neural network.

Remarks

For Beginners: A Transformer is a powerful type of neural network especially good at processing sequences like text or time series data. Unlike older networks, Transformers can look at all parts of the input at once (using "attention") rather than processing it step by step. This makes them excellent for tasks like translation, text generation, and understanding language.

Key concepts:

  • Attention: Allows the model to focus on relevant parts of the input regardless of position
  • Multi-head attention: Lets the model focus on different aspects of the input simultaneously
  • Encoder: Processes the input sequence
  • Decoder: Generates the output sequence
  • Positional encoding: Helps the model understand the order of elements in a sequence

CreateDefaultTtsLayers(int, int, int, int, int, int, int, int, int, double)

Creates default TTS (Text-to-Speech) layers for speech synthesis.

public static IEnumerable<ILayer<T>> CreateDefaultTtsLayers(int textHiddenDim = 256, int audioHiddenDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numMels = 80, int maxTextLength = 512, int maxMelFrames = 1000, int vocabSize = 148, double dropoutRate = 0.1)

Parameters

textHiddenDim int

Text encoder hidden dimension (default: 256).

audioHiddenDim int

Audio decoder hidden dimension (default: 512).

numEncoderLayers int

Number of encoder transformer layers (default: 6).

numDecoderLayers int

Number of decoder transformer layers (default: 6).

numHeads int

Number of attention heads (default: 8).

numMels int

Number of mel spectrogram bins (default: 80).

maxTextLength int

Maximum input text length (default: 512).

maxMelFrames int

Maximum mel spectrogram frames (default: 1000).

vocabSize int

Phoneme/character vocabulary size (default: 148).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a TTS encoder-decoder architecture.

Remarks

TTS architecture with:

  • Character/phoneme embedding with positional encoding
  • Transformer encoder for text representation
  • Transformer decoder with cross-attention for mel generation
  • Post-net convolutional refinement (simulated with dense layers)

Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2)

CreateDefaultUDOPLayers(int, int, int, int, int, int, int)

Creates default UDOP layers for unified document processing.

public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultUDOPLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 16, int vocabSize = 50000, int imageSize = 224, int maxSequenceLength = 2048)

Parameters

hiddenDim int

Hidden dimension (default: 1024).

numEncoderLayers int

Number of encoder layers (default: 12).

numDecoderLayers int

Number of decoder layers (default: 12).

numHeads int

Number of attention heads (default: 16).

vocabSize int

Vocabulary size (default: 50000).

imageSize int

Input image size (default: 224).

maxSequenceLength int

Maximum sequence length (default: 2048).

Returns

(IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)

Tuple of encoder and decoder layers.

Remarks

Reference: "UDOP: Unifying Vision, Text, and Layout" (CVPR 2023)

CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a 3D U-Net architecture for volumetric segmentation.

public static IEnumerable<ILayer<T>> CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numEncoderBlocks = 4, int baseFilters = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

voxelResolution int

The resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.

numEncoderBlocks int

The number of encoder blocks. Default is 4.

baseFilters int

The number of filters in the first convolutional layer. Doubles with each block. Default is 32.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for 3D volumetric segmentation.

Remarks

For Beginners: A 3D U-Net is like a specialized 3D image processor that can identify different parts of a 3D volume (like organs in a CT scan or objects in a point cloud).

The U-shape architecture:

  • Encoder: Progressively downsamples to capture context (like zooming out)
  • Bottleneck: Smallest representation capturing global features
  • Decoder: Progressively upsamples to restore resolution (like zooming in)
  • Skip connections: Link encoder to decoder to preserve fine details

Applications include:

  • 3D semantic segmentation of point clouds
  • Medical image segmentation (organs, tumors in CT/MRI)
  • Part segmentation of 3D shapes

Exceptions

InvalidOperationException

Thrown when the architecture has invalid dimensions.

CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Universal Differential Equation (UDE) network.

public static IEnumerable<ILayer<T>> CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 2).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 32).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a UDE neural network component.

Remarks

For Beginners: Universal Differential Equations combine known physics with neural networks. The neural network learns the unknown parts of the dynamics while known physics equations are added explicitly. This is perfect for scientific applications where you know some of the physics but not all of it.

The network takes [state, time] as input and outputs the learned correction to the dynamics. It uses Tanh activations for the smooth derivatives needed in ODE integration, and a linear (identity) output activation since corrections can be positive or negative.
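Example: a minimal sketch of wiring the factory into application code (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the hidden-layer values are illustrative):

static void BuildUdeCorrection(NeuralNetworkArchitecture<float> architecture)
{
    // Slightly wider and deeper than the 2 x 32 defaults, e.g. for stiffer dynamics.
    var layers = LayerHelper<float>.CreateDefaultUniversalDELayers(
        architecture, hiddenLayerCount: 3, hiddenLayerSize: 64);
    // ... pass 'layers' to the ODE integration wrapper ...
}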

CreateDefaultVAELayers(NeuralNetworkArchitecture<T>, int)

Creates a default Variational Autoencoder (VAE) with pre-configured layers.

public static IEnumerable<ILayer<T>> CreateDefaultVAELayers(NeuralNetworkArchitecture<T> architecture, int latentSize)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

latentSize int

The size of the latent space dimension.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Variational Autoencoder.

Remarks

For Beginners: A Variational Autoencoder (VAE) is a type of neural network that learns to compress data into a smaller representation (encoding) and then reconstruct it back (decoding). What makes VAEs special is that they create a "fuzzy" compressed representation rather than an exact one, which helps the network learn meaningful patterns in your data. This makes VAEs excellent for generating new data similar to your training examples.

The latent space is the compressed representation where your data exists in a simplified form. Think of it as a "creative space" where the network understands the essential features of your data.
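Example: a minimal sketch (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the latent size is illustrative):

static void BuildVae(NeuralNetworkArchitecture<float> architecture)
{
    // A 32-dimensional latent space: a common starting point that balances
    // compression against reconstruction quality (tune for your data).
    var layers = LayerHelper<float>.CreateDefaultVAELayers(architecture, latentSize: 32);
    // ... add 'layers' to the network ...
}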

CreateDefaultVGGLayers(NeuralNetworkArchitecture<T>, VGGConfiguration)

Creates layers for a VGG network based on the specified configuration.

public static IEnumerable<ILayer<T>> CreateDefaultVGGLayers(NeuralNetworkArchitecture<T> architecture, VGGConfiguration configuration)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

configuration VGGConfiguration

The VGG-specific configuration.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VGG network.

Remarks

For Beginners: VGG networks are deep convolutional neural networks known for their simplicity and effectiveness. They use stacks of 3x3 convolutions followed by max pooling to progressively extract higher-level features from images.

The VGG architecture consists of:

  • 5 convolutional blocks with an increasing number of filters (64 -> 128 -> 256 -> 512 -> 512)
  • Max pooling after each block to reduce spatial dimensions by half
  • Optional batch normalization after each convolution (in _BN variants)
  • 3 fully connected layers (4096 -> 4096 -> numClasses)
  • Dropout regularization in the fully connected layers

CreateDefaultVRTLayers(int, int, int, int, int, int, int)

Creates layers for a VRT (Video Restoration Transformer) model.

public static IEnumerable<ILayer<T>> CreateDefaultVRTLayers(int inputChannels = 3, int inputHeight = 64, int inputWidth = 64, int embedDim = 120, int numFrames = 6, int numBlocks = 8, int scaleFactor = 4)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

embedDim int

Embedding dimension (default: 120).

numFrames int

Number of temporal frames (default: 6).

numBlocks int

Number of transformer blocks (default: 8).

scaleFactor int

Upscaling factor for super-resolution. Supported values: 1, 2, or 4 (default: 4).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video restoration.

Remarks

For Beginners: VRT (Video Restoration Transformer) is a powerful model for:

  • Video super-resolution (increasing video resolution)
  • Video deblurring (removing motion blur)
  • Video denoising (removing noise from videos)

It uses attention mechanisms to leverage both spatial and temporal information from multiple video frames to produce high-quality restored frames.

Architecture (based on the paper):

  1. Shallow feature extraction from input frames
  2. Temporal mutual self-attention (TMSA) blocks
  3. Deep feature extraction with parallel warping
  4. Reconstruction module for output

Reference: "VRT: A Video Restoration Transformer" https://arxiv.org/abs/2201.12288

Exceptions

ArgumentException

Thrown when scaleFactor is not 1, 2, or 4.
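Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported; the values below are illustrative):

// scaleFactor must be 1, 2, or 4; any other value (e.g. 3) throws the
// ArgumentException described above.
var vrtLayers = LayerHelper<float>.CreateDefaultVRTLayers(
    inputHeight: 64,
    inputWidth: 64,
    scaleFactor: 4);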

CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Variational Physics-Informed Neural Network (VPINN).

public static IEnumerable<ILayer<T>> CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenLayerCount int

Number of hidden layers (default: 4).

hiddenLayerSize int

Number of neurons in each hidden layer (default: 50).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VPINN.

Remarks

For Beginners: Variational PINNs solve PDEs using the weak (variational) form instead of the strong form. This is similar to Finite Element Methods but uses neural networks, and it is often more stable for complex PDEs than standard PINNs.

The network uses Tanh activation throughout for the smooth derivatives needed in the variational formulation, and a linear output layer since PDE solutions can take any real value.

CreateDefaultVideoMAELayers(int, int, int, int, int, int)

Creates default layers for VideoMAE (Video Masked Autoencoder) action recognition model.

public static IEnumerable<ILayer<T>> CreateDefaultVideoMAELayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numFeatures = 768, int numClasses = 400, int tubeletSize = 2)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

numFeatures int

Number of feature channels (default: 768).

numClasses int

Number of action classes (default: 400 for Kinetics).

tubeletSize int

Temporal size of each tube (default: 2).

Returns

IEnumerable<ILayer<T>>

An enumerable of layers configured for VideoMAE.

Remarks

For Beginners: VideoMAE is a self-supervised learning model that learns video representations by masking and reconstructing video patches. It's used for action recognition and video understanding tasks.

Architecture:

  • 3D patch embedding (spatiotemporal)
  • Transformer encoder blocks
  • Classification head for action recognition
  • Decoder for masked reconstruction during pretraining

Reference: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training" https://arxiv.org/abs/2203.12602

CreateDefaultVideoStabilizationLayers(int, int, int)

Creates layers for a video stabilization model (StabNet-style).

public static IEnumerable<ILayer<T>> CreateDefaultVideoStabilizationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height.

inputWidth int

Input frame width.

Returns

IEnumerable<ILayer<T>>

A collection of layers for video stabilization.

Remarks

For Beginners: Video stabilization removes camera shake. The model predicts how to warp each frame to align with a smooth camera path, similar to what smartphone cameras do in real time.

Architecture:

  1. Feature encoder processes input frames
  2. Motion estimator predicts camera motion
  3. Smoother learns the smooth target path
  4. Warper transforms frames to match smooth path

CreateDefaultVideoSuperResolutionLayers(int, int, int, int, int, int, bool)

Creates layers for a video super-resolution model (Real-ESRGAN/BasicVSR++ style).

public static IEnumerable<ILayer<T>> CreateDefaultVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64, int numResBlocks = 16, int scaleFactor = 2, bool useTemporalConsistency = true)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input video height.

inputWidth int

Input video width.

numFeatures int

Number of feature channels (default: 64).

numResBlocks int

Number of residual blocks (default: 16).

scaleFactor int

Upscaling factor (default: 2).

useTemporalConsistency bool

Whether to add temporal aggregation layer (default: true).

Returns

IEnumerable<ILayer<T>>

A collection of layers for video super-resolution.

Remarks

For Beginners: Super-resolution models increase video resolution. This architecture uses residual blocks (skip connections) to preserve details while learning to add new ones. The upsampling at the end increases the spatial size by the scale factor.

Architecture overview:

  1. Initial convolution to extract features
  2. Multiple residual blocks for deep feature learning
  3. Temporal aggregation for video consistency (optional)
  4. Pixel shuffle upsampling for resolution increase
  5. Final convolution for output reconstruction
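Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported; the values below are illustrative):

// 4x upscaling with temporal aggregation enabled (the default); the other
// parameters keep their documented defaults.
var vsrLayers = LayerHelper<float>.CreateDefaultVideoSuperResolutionLayers(
    scaleFactor: 4,
    useTemporalConsistency: true);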

CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T>, int, int, int, int[]?)

Creates default VoxLingua107 layers for 107-language identification.

public static IEnumerable<ILayer<T>> CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 256, int[]? dilations = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

numMels int

Number of mel filterbank channels (default: 80).

tdnnChannels int

Number of TDNN channels (default: 1024).

embeddingDimension int

Embedding dimension (default: 256).

dilations int[]

Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a VoxLingua107 language identifier.

Remarks

VoxLingua107 uses an ECAPA-TDNN architecture trained on 107 languages from the VoxLingua107 dataset (YouTube speech samples).

CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T>, int, int, int)

Creates default layers for a Voxel-based 3D Convolutional Neural Network.

public static IEnumerable<ILayer<T>> CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numConvBlocks = 3, int baseFilters = 32)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture specification.

voxelResolution int

The resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.

numConvBlocks int

The number of convolutional blocks (each block has Conv3D + MaxPool3D). Default is 3.

baseFilters int

The number of filters in the first convolutional layer. Doubles with each block. Default is 32.

Returns

IEnumerable<ILayer<T>>

A collection of layers configured for voxel-based 3D classification.

Remarks

For Beginners: A Voxel CNN is like a 3D version of a regular image classifier. Instead of looking at a 2D image, it examines a 3D grid of "blocks" (voxels) to understand 3D shapes. This is like how Minecraft represents the world - each block is either filled or empty, and the pattern of blocks creates recognizable objects.

The architecture follows a standard pattern:

  • Multiple Conv3D + MaxPool3D blocks to extract hierarchical 3D features
  • Each block doubles the number of filters while halving the spatial resolution
  • Global average pooling to aggregate spatial information
  • Dense output layer for classification

Applications include:

  • Recognizing 3D objects from voxelized point clouds (e.g., ModelNet40)
  • Medical image analysis (CT, MRI volumetric scans)
  • Spatial occupancy prediction from depth sensors

Exceptions

InvalidOperationException

Thrown when the architecture has invalid input or output dimensions.

CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, double)

Creates default Wav2Vec2 layers for spoken language identification.

public static IEnumerable<ILayer<T>> CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 768, int numLayers = 12, int numAttentionHeads = 12, int intermediateSize = 3072, int numLanguages = 20, double dropoutRate = 0.1)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

hiddenSize int

Hidden size of transformer (default: 768).

numLayers int

Number of transformer layers (default: 12).

numAttentionHeads int

Number of attention heads (default: 12).

intermediateSize int

Feed-forward intermediate size (default: 3072).

numLanguages int

Number of languages to classify (default: 20).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Wav2Vec2 language identifier.

Remarks

Wav2Vec2-LID uses Meta's self-supervised speech representation model:

  • 7-layer CNN feature encoder processing raw waveform
  • Transformer encoder for contextual representations
  • Classification head for language prediction

CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, double)

Creates default layers for Whisper-style speech recognition models.

public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int numMels = 80, int modelDimension = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int feedForwardDim = 2048, int vocabularySize = 51865, int maxSequenceLength = 1500, double dropoutRate = 0.1)

Parameters

numMels int

Number of mel spectrogram bins (default: 80).

modelDimension int

Hidden dimension of the model (default: 512).

numEncoderLayers int

Number of encoder layers (default: 6).

numDecoderLayers int

Number of decoder layers (default: 6).

numHeads int

Number of attention heads (default: 8).

feedForwardDim int

Feed-forward dimension (default: 2048).

vocabularySize int

Output vocabulary size (default: 51865).

maxSequenceLength int

Maximum sequence length (default: 1500).

dropoutRate double

Dropout rate (default: 0.1).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Whisper-style ASR model.

Remarks

For Beginners: Whisper is an encoder-decoder transformer for speech recognition.

The architecture consists of:

  1. Audio encoder: Converts mel spectrograms to hidden representations
    • Convolutional layers to process spectrogram
    • Transformer encoder layers with self-attention
  2. Text decoder: Generates text tokens autoregressively
    • Embedding layer for text tokens
    • Transformer decoder layers with self-attention
    • Output projection to vocabulary

This creates a trainable model structure from scratch. The decoder layers expect encoder outputs to be provided during the forward pass (as implemented in WhisperModel<T>). For inference with pre-trained weights, use the ONNX-based WhisperModel.CreateAsync() method instead.
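Example: a minimal usage sketch (assumes the AiDotNet.Helpers namespace is imported). Because this class also declares a second CreateDefaultWhisperLayers overload in which every parameter has a default, a bare CreateDefaultWhisperLayers() call would be ambiguous; naming a parameter unique to one overload resolves it:

// 'modelDimension' exists only in this overload, so naming it selects the
// variant documented above.
var whisperLayers = LayerHelper<float>.CreateDefaultWhisperLayers(
    modelDimension: 512,
    numEncoderLayers: 6,
    numDecoderLayers: 6);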

CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, int, double)

Creates default Whisper layers for automatic speech recognition.

public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int modelDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int ffDim = 2048, int numMels = 80, int maxFrames = 3000, int maxTokens = 448, int vocabSize = 51865, double dropoutRate = 0)

Parameters

modelDim int

Model hidden dimension (default: 512 for Base).

numEncoderLayers int

Number of encoder transformer layers (default: 6 for Base).

numDecoderLayers int

Number of decoder transformer layers (default: 6 for Base).

numHeads int

Number of attention heads (default: 8 for Base).

ffDim int

Feed-forward hidden dimension (default: 2048 for Base).

numMels int

Number of mel spectrogram bins (default: 80).

maxFrames int

Maximum mel spectrogram frames (default: 3000 for 30s audio).

maxTokens int

Maximum output token sequence length (default: 448).

vocabSize int

Whisper vocabulary size (default: 51865).

dropoutRate double

Dropout rate (default: 0.0 for inference-optimized).

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Whisper encoder-decoder architecture.

Remarks

Whisper is OpenAI's state-of-the-art automatic speech recognition model with:

  • Mel spectrogram audio preprocessing (80 bins, 16kHz)
  • Convolutional stem for initial audio feature extraction
  • Transformer encoder for audio representation learning
  • Transformer decoder with cross-attention for text generation
  • Support for 99+ languages and translation to English

Reference: "Robust Speech Recognition via Large-Scale Weak Supervision" by Radford et al., 2022

CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T>, int, int)

Creates default layers for a Word2Vec model (Skip-Gram or CBOW).

public static IEnumerable<ILayer<T>> CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

vocabSize int

The size of the vocabulary.

embeddingDimension int

The dimension of the embedding vectors.

Returns

IEnumerable<ILayer<T>>

A collection of layers forming a Word2Vec model.

Remarks

For Beginners: Word2Vec learns to represent words as vectors of numbers (embeddings) such that words with similar meanings are close to each other.
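Example: a minimal sketch (how the NeuralNetworkArchitecture<float> instance is constructed is application-specific and not shown; the sizes are illustrative):

static void BuildWord2Vec(NeuralNetworkArchitecture<float> architecture)
{
    // Illustrative sizes: a 10,000-word vocabulary mapped to
    // 300-dimensional embeddings (a common Word2Vec setting).
    var layers = LayerHelper<float>.CreateDefaultWord2VecLayers(
        architecture, vocabSize: 10_000, embeddingDimension: 300);
    // ... add 'layers' to the network ...
}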

CreateDefaultXMemLayers(int, int, int, int)

Creates layers for an XMem long-term video object segmentation model.

public static IEnumerable<ILayer<T>> CreateDefaultXMemLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input frame height (default: 480).

inputWidth int

Input frame width (default: 854).

numFeatures int

Feature dimension (default: 256).

Returns

IEnumerable<ILayer<T>>

A collection of layers for long-term video object segmentation.

Remarks

For Beginners: XMem is designed for tracking objects in very long videos using a three-tier memory system inspired by human memory:

  • Sensory memory: Very recent frames (high detail, fast to forget)
  • Working memory: Important recent frames (moderate detail)
  • Long-term memory: Key historical frames (compressed, permanent)

Reference: "XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model" https://arxiv.org/abs/2207.07115

CreateSAM2ImageEncoderLayers(int, int, int, int)

Creates the image encoder layers for SAM2 (Segment Anything Model 2).

public static IEnumerable<ILayer<T>> CreateSAM2ImageEncoderLayers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 1024).

inputWidth int

Input width (default: 1024).

numFeatures int

Number of output feature channels (default: 256).

Returns

IEnumerable<ILayer<T>>

Image encoder layers that downsample input to feature maps.

Remarks

For Beginners: This creates the image encoder part of SAM2, which processes input images into feature maps. The output has shape [numFeatures, H/16, W/16].

Note: SAM2 is a multi-branch architecture. Use separate factory methods (see the sketch below):

  • CreateSAM2ImageEncoderLayers: Image feature extraction (this method)
  • CreateSAM2PromptEncoderLayers: Point/box/mask prompt encoding
  • CreateSAM2MemoryLayers: Temporal memory attention
  • CreateSAM2MaskDecoderLayers: Mask prediction head
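Example: a minimal sketch building each branch with its documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Build each SAM2 branch. The branches are combined at run time by the
// model, not chained sequentially.
var imageEncoder  = LayerHelper<float>.CreateSAM2ImageEncoderLayers();   // [256, 64, 64] from a 1024x1024 input
var promptEncoder = LayerHelper<float>.CreateSAM2PromptEncoderLayers();  // point/box/mask prompt features
var memory        = LayerHelper<float>.CreateSAM2MemoryLayers();         // temporal memory attention
var maskDecoder   = LayerHelper<float>.CreateSAM2MaskDecoderLayers();    // shared refinement before the heads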

CreateSAM2IoUHead(int, int, int, int)

Creates the IoU (Intersection over Union) prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2IoUHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

numMaskCandidates int

Number of mask candidates (default: 4).

Returns

IEnumerable<ILayer<T>>

IoU prediction layers. Output shape: [numMaskCandidates]

Remarks

For Beginners: This head predicts the quality (IoU score) for each mask candidate. Higher scores indicate better masks. Used to select the best mask from candidates.

CreateSAM2MaskDecoderLayers(int, int, int)

Creates the shared mask decoder refinement layers for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2MaskDecoderLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Shared refinement layers that process fused features.

Remarks

For Beginners: These layers refine the combined image and prompt features before branching into separate prediction heads. Output shape: [numFeatures, h, w]

Usage: Apply these layers first, then branch to the three separate heads (see the sketch below):

  • CreateSAM2MaskHead: Produces mask candidates
  • CreateSAM2IoUHead: Predicts mask quality scores
  • CreateSAM2OcclusionHead: Predicts occlusion
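A minimal sketch of that structure with the documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Shared refinement first, then the three prediction heads in parallel.
var shared        = LayerHelper<float>.CreateSAM2MaskDecoderLayers();
var maskHead      = LayerHelper<float>.CreateSAM2MaskHead(numMaskCandidates: 4);  // [4, h, w] candidate masks
var iouHead       = LayerHelper<float>.CreateSAM2IoUHead(numMaskCandidates: 4);   // [4] quality scores
var occlusionHead = LayerHelper<float>.CreateSAM2OcclusionHead();                 // [1] occlusion score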

CreateSAM2MaskHead(int, int, int, int)

Creates the mask prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2MaskHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

numMaskCandidates int

Number of mask candidates to output (default: 4).

Returns

IEnumerable<ILayer<T>>

Mask prediction layers. Output shape: [numMaskCandidates, h, w]

Remarks

For Beginners: This head produces multiple candidate segmentation masks. Each candidate is a probability map indicating object presence at each pixel.

CreateSAM2MemoryLayers(int, int, int)

Creates the memory attention layers for SAM2 temporal consistency.

public static IEnumerable<ILayer<T>> CreateSAM2MemoryLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Memory attention layers for video object tracking.

Remarks

For Beginners: Memory layers help SAM2 track objects across video frames by maintaining a memory of past segmentations and matching them to new frames.

CreateSAM2OcclusionHead(int, int, int)

Creates the occlusion prediction head for SAM2.

public static IEnumerable<ILayer<T>> CreateSAM2OcclusionHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)

Parameters

numFeatures int

Number of input feature channels (default: 256).

featureHeight int

Height of feature maps (default: 64).

featureWidth int

Width of feature maps (default: 64).

Returns

IEnumerable<ILayer<T>>

Occlusion prediction layers. Output shape: [1]

Remarks

For Beginners: This head predicts whether the tracked object is occluded (hidden by other objects). A high score indicates the object may be temporarily invisible.

CreateSAM2PromptEncoderLayers(int, int, int)

Creates the prompt encoder layers for SAM2 (point, box, and mask prompts).

public static IEnumerable<ILayer<T>> CreateSAM2PromptEncoderLayers(int numFeatures = 256, int maskHeight = 256, int maskWidth = 256)

Parameters

numFeatures int

Number of output feature channels (default: 256).

maskHeight int

Height of mask prompt input (default: 256).

maskWidth int

Width of mask prompt input (default: 256).

Returns

IEnumerable<ILayer<T>>

Prompt encoder layers for different prompt types.

Remarks

For Beginners: SAM2 accepts different types of prompts to tell it what to segment:

  • Points: Click on the object (x, y coordinates)
  • Boxes: Draw a bounding box (x1, y1, x2, y2)
  • Masks: Provide an initial mask estimate

Usage: These layers are applied to prompt inputs separately, then combined with image features in the mask decoder. They are NOT chained sequentially with the image encoder.

CreateSimpleVideoSuperResolutionLayers(int, int, int, int)

Creates a simple super-resolution architecture for testing and lightweight use.

public static IEnumerable<ILayer<T>> CreateSimpleVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int scaleFactor = 2)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input video height.

inputWidth int

Input video width.

scaleFactor int

Upscaling factor (default: 2).

Returns

IEnumerable<ILayer<T>>

A collection of layers for simple super-resolution.

Remarks

For Beginners: This is a smaller, faster model that trades quality for speed. Good for real-time applications or when GPU memory is limited.

CreateSlowFastFastPathwayLayers(int, int, int, int)

Creates the fast pathway layers for SlowFast video recognition.

public static IEnumerable<ILayer<T>> CreateSlowFastFastPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int fastChannels = 8)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

fastChannels int

Base channel count for fast pathway (default: 8).

Returns

IEnumerable<ILayer<T>>

Fast pathway layers that process more frames at lower capacity.

Remarks

For Beginners: The fast pathway processes video at a high frame rate (e.g., 32 fps) but with lower channel capacity (1/8 of the slow pathway). It captures motion and temporal dynamics. Output shape: [fastChannels * 8, H/16, W/16]

CreateSlowFastFusionLayers(int, int, int, int, int)

Creates the fusion and classification layers for SlowFast.

public static IEnumerable<ILayer<T>> CreateSlowFastFusionLayers(int slowChannels = 64, int fastChannels = 8, int featureHeight = 14, int featureWidth = 14, int numClasses = 400)

Parameters

slowChannels int

Base channel count for slow pathway (default: 64).

fastChannels int

Base channel count for fast pathway (default: 8).

featureHeight int

Height of feature maps after pathways (default: 14).

featureWidth int

Width of feature maps after pathways (default: 14).

numClasses int

Number of action classes (default: 400 for Kinetics).

Returns

IEnumerable<ILayer<T>>

Fusion layers that combine pathways and classify actions.

Remarks

For Beginners: This fuses the slow and fast pathway features (after concatenation) and produces the final action classification. The SlowFast model should (see the sketch below):

  1. Run the slow pathway on subsampled frames
  2. Run the fast pathway on all frames
  3. Concatenate the outputs along the channel dimension
  4. Apply these fusion layers
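A minimal sketch assembling the three layer groups with their documented defaults (assumes the AiDotNet.Helpers namespace is imported):

// Assemble the three SlowFast layer groups. Frame subsampling and the
// channel-wise concatenation happen in the forward pass, not here.
var slowPathway = LayerHelper<float>.CreateSlowFastSlowPathwayLayers(slowChannels: 64);
var fastPathway = LayerHelper<float>.CreateSlowFastFastPathwayLayers(fastChannels: 8);
var fusion      = LayerHelper<float>.CreateSlowFastFusionLayers(
    slowChannels: 64, fastChannels: 8, numClasses: 400);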

CreateSlowFastSlowPathwayLayers(int, int, int, int)

Creates the slow pathway layers for SlowFast video recognition.

public static IEnumerable<ILayer<T>> CreateSlowFastSlowPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int slowChannels = 64)

Parameters

inputChannels int

Number of input channels (default: 3 for RGB).

inputHeight int

Input height (default: 224).

inputWidth int

Input width (default: 224).

slowChannels int

Base channel count for slow pathway (default: 64).

Returns

IEnumerable<ILayer<T>>

Slow pathway layers that process fewer frames at higher capacity.

Remarks

For Beginners: The slow pathway processes video at a low frame rate (e.g., 4 fps) but with high channel capacity. It captures spatial semantics and appearance features. Output shape: [slowChannels * 8, H/16, W/16]

Note: SlowFast is a dual-pathway architecture. Use separate factory methods:

  • CreateSlowFastSlowPathwayLayers: Low frame rate, high capacity (this method)
  • CreateSlowFastFastPathwayLayers: High frame rate, low capacity
  • CreateSlowFastFusionLayers: Combines pathways for classification