Class LayerHelper<T>
Provides helper methods for creating various neural network layer configurations.
public static class LayerHelper<T>
Type Parameters
T: The numeric type used for calculations (typically float or double).
Inheritance
object → LayerHelper<T>
Remarks
This class contains factory methods that create pre-configured sets of neural network layers for common architectures like standard feed-forward networks, CNNs, ResNets, and more.
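Example
A minimal usage sketch of the factory pattern these methods share. This is hedged: a using directive for the library namespace that defines LayerHelper<T>, ILayer<T>, and NeuralNetworkArchitecture<T> is assumed, and the architecture instance is configured elsewhere in your project.
using System.Collections.Generic;
// Build a small feed-forward stack from a pre-configured architecture,
// then materialize the returned sequence as a list.
static List<ILayer<double>> BuildLayers(NeuralNetworkArchitecture<double> architecture)
{
    IEnumerable<ILayer<double>> layers = LayerHelper<double>.CreateDefaultFeedForwardLayers(
        architecture, hiddenLayerCount: 2, hiddenLayerSize: 64);
    return new List<ILayer<double>>(layers);
}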
Methods
CreateDefaultABINetLayers(int, int, int, int, int, int)
Creates default ABINet (Autonomous, Bidirectional, Iterative) layers.
public static IEnumerable<ILayer<T>> CreateDefaultABINetLayers(int imageWidth = 128, int imageHeight = 32, int visionDim = 512, int languageDim = 512, int numIterations = 3, int charsetSize = 95)
Parameters
imageWidth (int): Input image width (default: 128).
imageHeight (int): Input image height (default: 32).
visionDim (int): Vision encoder dimension (default: 512).
languageDim (int): Language model dimension (default: 512).
numIterations (int): Number of refinement iterations (default: 3).
charsetSize (int): Character set size (default: 95).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an ABINet model.
CreateDefaultAnimateDiffLayers(int, int, int, int, int)
Creates layers for an AnimateDiff motion module that adds temporal coherence.
public static IEnumerable<ILayer<T>> CreateDefaultAnimateDiffLayers(int inputChannels = 320, int inputHeight = 64, int inputWidth = 64, int numLayers = 8, int numFrames = 16)
Parameters
inputChannels (int): Number of input feature channels (default: 320).
inputHeight (int): Input feature height (default: 64).
inputWidth (int): Input feature width (default: 64).
numLayers (int): Number of motion transformer layers (default: 8).
numFrames (int): Number of video frames (default: 16).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for motion modeling.
Remarks
For Beginners: AnimateDiff is a motion module that plugs into existing image generation models (like Stable Diffusion) to create animated videos. It learns temporal dynamics from video data.
Architecture (based on the paper):
- Input features come from the base image model
- Temporal attention layers model motion across frames
- Cross-attention with motion context enables coherent animation
- Output features blend back into the base model
The motion module is designed to be inserted at multiple points in the U-Net.
Reference: "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models" https://arxiv.org/abs/2307.04725
CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T>)
Creates a default set of attention-based layers for transformer-style architectures.
public static IEnumerable<ILayer<T>> CreateDefaultAttentionLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an attention-based neural network.
Remarks
For Beginners: Attention mechanisms allow neural networks to focus on specific parts of the input that are most relevant for a given task. Similar to how humans pay attention to specific details in a conversation, these layers help the network "pay attention" to important parts of the data. Transformers use this mechanism to process sequences (like text) very effectively.
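Example
A minimal sketch (the architecture instance is assumed to be configured elsewhere, and the wrapper name is hypothetical):
using System.Collections.Generic;
// The attention layer stack is derived entirely from the architecture's shapes.
static IEnumerable<ILayer<float>> BuildAttentionStack(NeuralNetworkArchitecture<float> architecture) =>
    LayerHelper<float>.CreateDefaultAttentionLayers(architecture);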
CreateDefaultAudioGenLayers(int, int, int, int, int, int, int, int, double)
Creates default AudioGen layers for text-to-audio generation.
public static IEnumerable<ILayer<T>> CreateDefaultAudioGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 1024, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)
Parameters
textHiddenDim (int): Text encoder hidden dimension (default: 768 for T5-base).
lmHiddenDim (int): Language model hidden dimension (default: 1536).
numLmLayers (int): Number of language model transformer layers (default: 24).
numHeads (int): Number of attention heads (default: 16).
numCodebooks (int): Number of EnCodec codebooks (default: 4).
codebookSize (int): Size of each codebook vocabulary (default: 1024).
maxTextLength (int): Maximum text sequence length (default: 256).
maxAudioTokens (int): Maximum audio tokens at ~50 tokens/sec (default: 1500 for 30s).
dropoutRate (double): Dropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an AudioGen model.
Remarks
AudioGen is a text-to-audio generation model that uses a transformer language model operating over EnCodec audio codes. Unlike MusicGen, it focuses on general audio and environmental sounds rather than music.
- T5-based text encoder for conditioning
- Transformer decoder generating audio codes autoregressively
- EnCodec neural audio codec for audio reconstruction
Reference: "AudioGen: Textually Guided Audio Generation" by Kreuk et al., 2022
CreateDefaultAudioLDMLayers(int, int, int, int, int[]?, int, int, int, double)
Creates default AudioLDM layers for text-to-audio generation using latent diffusion.
public static IEnumerable<ILayer<T>> CreateDefaultAudioLDMLayers(int textHiddenDim = 768, int latentDim = 8, int unetChannels = 256, int numResBlocks = 2, int[]? attentionResolutions = null, int numHeads = 8, int numMels = 64, int maxTextLength = 77, double dropoutRate = 0.1)
Parameters
textHiddenDim (int): Text encoder hidden dimension (default: 768 for CLAP).
latentDim (int): Latent space dimension (default: 8).
unetChannels (int): U-Net base channels (default: 256).
numResBlocks (int): Number of residual blocks per level (default: 2).
attentionResolutions (int[]): Resolutions at which to apply attention (default: [4, 2, 1]).
numHeads (int): Number of attention heads (default: 8).
numMels (int): Number of mel spectrogram channels (default: 64).
maxTextLength (int): Maximum text sequence length (default: 77).
dropoutRate (double): Dropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an AudioLDM model.
Remarks
AudioLDM uses latent diffusion for text-to-audio generation:
- CLAP text encoder for conditioning
- VAE to encode/decode mel spectrograms to latent space
- U-Net for denoising in latent space
- HiFi-GAN vocoder for waveform generation
Reference: "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Liu et al., 2023
CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T>)
Creates a default autoencoder neural network architecture.
public static IEnumerable<ILayer<T>> CreateDefaultAutoEncoderLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an autoencoder neural network.
Remarks
For Beginners: An autoencoder is a type of neural network that learns to compress data into a smaller representation and then reconstruct it back to the original form. Think of it like learning to create a thumbnail of an image and then expanding it back to full size. The network has two main parts: an encoder that compresses the data and a decoder that reconstructs it.
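Example
A minimal sketch (architecture configured elsewhere; the wrapper name is hypothetical):
using System.Collections.Generic;
// Encoder and decoder sizes are derived from the architecture's input and output shapes.
static IEnumerable<ILayer<double>> BuildAutoEncoder(NeuralNetworkArchitecture<double> architecture) =>
    LayerHelper<double>.CreateDefaultAutoEncoderLayers(architecture);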
CreateDefaultBGELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for a BGE (BAAI General Embedding) model.
public static IEnumerable<ILayer<T>> CreateDefaultBGELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
vocabSize (int): Vocabulary size (default: 30522).
embeddingDimension (int): Embedding dimension (default: 768).
maxSequenceLength (int): Maximum sequence length (default: 512).
numLayers (int): Number of transformer layers (default: 12).
numHeads (int): Number of attention heads (default: 12).
feedForwardDim (int): Feed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default configuration of layers for a Bayesian neural network (Bayes-by-Backprop style).
public static IEnumerable<ILayer<T>> CreateDefaultBayesianNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
Remarks
This mirrors the library's default dense+activation patterns, but uses Bayesian dense layers so the network can express epistemic uncertainty through weight distributions.
CreateDefaultBlip2Layers(int, int, int, int, int, int, int, int, int, int, int, int)
Creates default layers for a BLIP-2 neural network.
public static IEnumerable<ILayer<T>> CreateDefaultBlip2Layers(int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 30522, int embeddingDimension = 256, int qformerHiddenDim = 768, int visionHiddenDim = 1408, int lmHiddenDim = 2560, int numQformerLayers = 12, int numHeads = 12, int numLmDecoderLayers = 6, int maxSequenceLength = 32)
Parameters
imageSize (int): Input image size (default: 224).
channels (int): Number of input channels (default: 3).
patchSize (int): Vision patch size (default: 14).
vocabularySize (int): Vocabulary size (default: 30522).
embeddingDimension (int): Embedding dimension (default: 256).
qformerHiddenDim (int): Q-Former hidden dimension (default: 768).
visionHiddenDim (int): Vision encoder hidden dimension (default: 1408).
lmHiddenDim (int): Language model hidden dimension (default: 2560).
numQformerLayers (int): Number of Q-Former layers (default: 12).
numHeads (int): Number of attention heads (default: 12).
numLmDecoderLayers (int): Number of language model decoder layers (default: 6).
maxSequenceLength (int): Maximum sequence length (default: 32).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultByteTrackLayers(int, int, int, int, int)
Creates default layers for ByteTrack multi-object tracking.
public static IEnumerable<ILayer<T>> CreateDefaultByteTrackLayers(int inputChannels = 3, int inputHeight = 800, int inputWidth = 1440, int numFeatures = 256, int numClasses = 1)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 800).
inputWidth (int): Input frame width (default: 1440).
numFeatures (int): Feature dimension (default: 256).
numClasses (int): Number of object classes (default: 1).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultCNNLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates a Convolutional Neural Network (CNN) with configurable layers.
public static IEnumerable<ILayer<T>> CreateDefaultCNNLayers(NeuralNetworkArchitecture<T> architecture, int convLayerCount = 2, int filterCount = 32, int kernelSize = 3, int denseLayerCount = 1, int denseLayerSize = 64, int outputSize = 1)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
convLayerCount (int): Number of convolutional layers (default: 2).
filterCount (int): Number of filters in each convolutional layer (default: 32).
kernelSize (int): Size of the convolutional kernel (default: 3).
denseLayerCount (int): Number of dense layers after the convolutional layers (default: 1).
denseLayerSize (int): Number of neurons in each dense layer (default: 64).
outputSize (int): Number of output neurons (default: 1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a CNN.
Remarks
For Beginners: A Convolutional Neural Network (CNN) is specialized for processing grid-like data, such as images. Instead of connecting every input to every neuron (which would be inefficient for images), CNNs use filters that scan across the image to detect features like edges, textures, and shapes.
Key components in this CNN:
- Convolutional layers: Detect features in the input using filters
- Pooling layers: Reduce the size of the data while keeping important information
- Flatten layer: Converts the multi-dimensional data to a flat vector
- Dense layers: Process the extracted features to make predictions
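Example
A hedged sketch of a small image classifier (architecture configured elsewhere; the 10-class output and the wrapper name are illustrative):
using System.Collections.Generic;
// 3 conv layers of 32 3x3 filters, one 64-unit dense layer, 10 outputs.
static IEnumerable<ILayer<float>> BuildSmallCnn(NeuralNetworkArchitecture<float> architecture) =>
    LayerHelper<float>.CreateDefaultCNNLayers(
        architecture, convLayerCount: 3, filterCount: 32, kernelSize: 3,
        denseLayerCount: 1, denseLayerSize: 64, outputSize: 10);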
CreateDefaultCRAFTLayers(int, int, int)
Creates default CRAFT layers for character-level text detection.
public static IEnumerable<ILayer<T>> CreateDefaultCRAFTLayers(int imageSize = 768, int backboneChannels = 512, int upscaleChannels = 256)
Parameters
imageSize (int): Input image size (default: 768).
backboneChannels (int): Backbone output channels (default: 512).
upscaleChannels (int): Upscale network channels (default: 256).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a CRAFT model.
Remarks
Reference: "Character Region Awareness for Text Detection" (CVPR 2019)
CreateDefaultCRNNLayers(int, int, int, int, int, int)
Creates default CRNN layers for sequence text recognition.
public static IEnumerable<ILayer<T>> CreateDefaultCRNNLayers(int imageWidth = 128, int imageHeight = 32, int cnnChannels = 512, int rnnHiddenSize = 256, int rnnLayers = 2, int charsetSize = 95)
Parameters
imageWidth (int): Input image width (default: 128).
imageHeight (int): Input image height (default: 32).
cnnChannels (int): CNN output channels (default: 512).
rnnHiddenSize (int): RNN hidden size (default: 256).
rnnLayers (int): Number of RNN layers (default: 2).
charsetSize (int): Character set size (default: 95).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a CRNN model.
Remarks
Reference: "An End-to-End Trainable Neural Network for Image-based Sequence Recognition" (TPAMI 2017)
CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default capsule network architecture.
public static IEnumerable<ILayer<T>> CreateDefaultCapsuleNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a capsule network.
Remarks
For Beginners: A capsule network is an advanced type of neural network that tries to better understand spatial relationships in data. Unlike traditional networks that just detect features, capsule networks also track the position, orientation, and size of features. Think of it like the difference between recognizing a face by just its parts (eyes, nose, mouth) versus understanding how those parts relate to each other in 3D space.
The network consists of special "capsule" layers that group neurons together to represent entities and their properties, allowing the network to better understand complex structures in data.
CreateDefaultClipLayers(NeuralNetworkArchitecture<T>, int)
Creates default layers for CLIP-style multimodal networks.
public static IEnumerable<ILayer<T>> CreateDefaultClipLayers(NeuralNetworkArchitecture<T> architecture, int projectionDim = 512)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
projectionDim (int): The projection dimension for embeddings (default: 512).
Returns
- IEnumerable<ILayer<T>>
A collection of projection layers for CLIP fine-tuning.
Remarks
CLIP uses pre-trained ONNX encoders for most of its work, but these layers provide optional projection heads for fine-tuning or feature extraction.
For Beginners: CLIP has two main parts: an image encoder and a text encoder. These pre-trained encoders are loaded from ONNX files. The projection layers here are optional additions that can:
- Adapt the embeddings for specific tasks
- Allow fine-tuning on new domains
- Match embedding dimensions between different model variants
If you're just using CLIP for inference (getting embeddings), you typically don't need these layers. They're useful when you want to adapt CLIP for a specific task.
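Example
A hedged sketch adding a 256-dimensional projection head for fine-tuning (architecture configured elsewhere; the wrapper name is hypothetical):
using System.Collections.Generic;
// Optional projection layers; skip these if you only need raw CLIP embeddings.
static IEnumerable<ILayer<float>> BuildClipProjection(NeuralNetworkArchitecture<float> architecture) =>
    LayerHelper<float>.CreateDefaultClipLayers(architecture, projectionDim: 256);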
CreateDefaultCogVideoLayers(int, int, int, int, int, int)
Creates layers for a CogVideo text-to-video generation model.
public static IEnumerable<ILayer<T>> CreateDefaultCogVideoLayers(int inputChannels = 4, int inputHeight = 32, int inputWidth = 32, int embedDim = 1024, int numLayers = 24, int numFrames = 16)
Parameters
inputChannels (int): Number of input channels for the latent (default: 4).
inputHeight (int): Input latent height (default: 32).
inputWidth (int): Input latent width (default: 32).
embedDim (int): Embedding dimension (default: 1024).
numLayers (int): Number of transformer layers (default: 24).
numFrames (int): Number of video frames to generate (default: 16).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video generation.
Remarks
For Beginners: CogVideo generates videos from text descriptions. It works in the latent space (compressed representation) and uses a diffusion-based approach to iteratively refine noise into coherent video.
Architecture (based on the CogVideoX paper):
- Text encoder processes the input prompt
- Latent space diffusion model generates video frames
- VAE decoder converts latent to pixel space
This creates the denoising U-Net backbone that refines latent codes.
Reference: "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" https://arxiv.org/abs/2408.06072
CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for a ColBERT (Contextualized Late Interaction over BERT) model.
public static IEnumerable<ILayer<T>> CreateDefaultColBERTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 128, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
vocabSize (int): Vocabulary size (default: 30522).
embeddingDimension (int): Embedding dimension (default: 128).
maxSequenceLength (int): Maximum sequence length (default: 512).
numLayers (int): Number of transformer layers (default: 12).
numHeads (int): Number of attention heads (default: 12).
feedForwardDim (int): Feed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultCutieLayers(int, int, int, int)
Creates layers for a Cutie video object segmentation model.
public static IEnumerable<ILayer<T>> CreateDefaultCutieLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)
Parameters
inputChannels (int): Number of input channels (default: 3 for RGB).
inputHeight (int): Input frame height (default: 480).
inputWidth (int): Input frame width (default: 854).
numFeatures (int): Feature dimension (default: 256).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video object segmentation.
Remarks
For Beginners: Cutie is designed for semi-supervised video object segmentation (VOS). Given a mask for an object in the first frame, it tracks and segments that object throughout the entire video with high accuracy.
Architecture:
- Image encoder (ResNet-like backbone) extracts features
- Object encoder processes mask with features
- Memory attention matches current frame to stored memories
- Mask decoder produces segmentation output
Reference: "Putting the Object Back into Video Object Segmentation" https://arxiv.org/abs/2310.12982
CreateDefaultDBNetLayers(int, int, int)
Creates default layers for DBNet text detection model.
public static IEnumerable<ILayer<T>> CreateDefaultDBNetLayers(int imageSize = 640, int backboneChannels = 256, int innerChannels = 256)
Parameters
imageSize (int): Input image size (default: 640).
backboneChannels (int): Backbone output channels (default: 256).
innerChannels (int): FPN inner channels (default: 256).
Returns
- IEnumerable<ILayer<T>>
Enumerable of layers for DBNet.
Remarks
DBNet uses a ResNet backbone with FPN for multi-scale features, followed by probability and threshold prediction heads.
Reference: "Real-time Scene Text Detection with Differentiable Binarization" (AAAI 2020)
CreateDefaultDIFRINTLayers(int, int, int, int, int)
Creates default layers for DIFRINT video stabilization.
public static IEnumerable<ILayer<T>> CreateDefaultDIFRINTLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 64, int numIterations = 3)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 480).
inputWidth (int): Input frame width (default: 640).
numFeatures (int): Feature dimension (default: 64).
numIterations (int): Number of iterations (default: 3).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultDNCLayers(NeuralNetworkArchitecture<T>, int, int, int, int)
Creates a default Differentiable Neural Computer (DNC) with pre-configured layers.
public static IEnumerable<ILayer<T>> CreateDefaultDNCLayers(NeuralNetworkArchitecture<T> architecture, int controllerSize, int memoryWordSize, int readHeads, int interfaceSize)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
controllerSize (int): The size of the controller network.
memoryWordSize (int): The size of each memory word.
readHeads (int): The number of read heads.
interfaceSize (int): The size of the interface between controller and memory.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Differentiable Neural Computer.
Remarks
For Beginners: A Differentiable Neural Computer (DNC) is like a neural network with a built-in memory system. Traditional neural networks process information and then forget it, but a DNC can store information in its "memory" and retrieve it later when needed. This makes DNCs good at tasks that require remembering information over time, like answering questions about a story or navigating through complex environments.
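Example
A hedged sketch with illustrative memory settings (architecture configured elsewhere; the interface size shown is an arbitrary example value, not one derived from the library's conventions):
using System.Collections.Generic;
// 128-unit controller, 20-wide memory words, 4 read heads.
static IEnumerable<ILayer<double>> BuildDnc(NeuralNetworkArchitecture<double> architecture) =>
    LayerHelper<double>.CreateDefaultDNCLayers(
        architecture, controllerSize: 128, memoryWordSize: 20, readHeads: 4, interfaceSize: 128);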
CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default Deep Belief Network (DBN) with pre-configured layers.
public static IEnumerable<ILayer<T>> CreateDefaultDeepBeliefNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Deep Belief Network.
Remarks
For Beginners: A Deep Belief Network is a type of neural network that learns to recognize patterns in data by building multiple layers that each specialize in finding specific features. It works by training each layer one at a time (called "pre-training"), which helps the network learn more effectively, especially when you don't have a lot of labeled training data.
CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T>)
Creates default layers for a Deep Boltzmann Machine (DBM).
public static IEnumerable<ILayer<T>> CreateDefaultDeepBoltzmannMachineLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration that defines input and output shapes.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Deep Boltzmann Machine.
Remarks
For Beginners: A Deep Boltzmann Machine is a type of neural network that learns to recognize patterns in data without supervision. It's made up of multiple layers of "hidden units" that learn to represent features of the input data. DBMs are particularly good at learning complex patterns and can be used for tasks like feature learning, dimensionality reduction, and generating new data similar to the training set.
CreateDefaultDeepOperatorNetworkLayers(int, int, int, int, int)
Creates default layers for a Deep Operator Network (DeepONet).
public static (IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers) CreateDefaultDeepOperatorNetworkLayers(int branchInputSize, int trunkInputSize, int outputSize = 1, int hiddenLayerCount = 3, int hiddenLayerSize = 64)
Parameters
branchInputSize (int): Size of the branch network input (function samples).
trunkInputSize (int): Size of the trunk network input (query locations).
outputSize (int): Number of output components (default: 1 for scalar operators). For multi-output operators, each output component uses hiddenLayerSize basis functions, so the final layer outputs hiddenLayerSize * outputSize values that are reshaped and summed.
hiddenLayerCount (int): Number of hidden layers in each sub-network (default: 3).
hiddenLayerSize (int): Number of neurons in each hidden layer, and the number of basis functions per output component (default: 64).
Returns
- (IEnumerable<ILayer<T>> BranchLayers, IEnumerable<ILayer<T>> TrunkLayers)
A tuple of (branchLayers, trunkLayers) for the DeepONet architecture.
Remarks
For Beginners: DeepONet learns operators - functions that take functions as input. For example, an operator might take a temperature distribution as input and output the resulting heat flow. The branch network encodes the input function, while the trunk network handles where you want to evaluate the output.
Architecture: Branch encodes input function, Trunk encodes query location. Output = sum(Branch * Trunk) + bias, allowing learning of complex operators.
Multi-output handling: For operators with multiple output components (e.g., velocity
with x,y,z components), set outputSize to the number of components.
Each component gets its own set of basis functions. The branch and trunk networks
output hiddenLayerSize * outputSize values, which are grouped as
[component1_basis1..p, component2_basis1..p, ...] where p = hiddenLayerSize.
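Example
A minimal sketch for a 3-component operator (e.g. a velocity field) evaluated at 2-D query points (library namespace assumed):
using System.Collections.Generic;
// Branch sees the input function sampled at 100 points; trunk sees (x, y) queries.
// Each of the 3 output components gets 64 basis functions, so the final layers
// emit 64 * 3 = 192 values that are grouped per component and summed.
var (branchLayers, trunkLayers) = LayerHelper<double>.CreateDefaultDeepOperatorNetworkLayers(
    branchInputSize: 100, trunkInputSize: 2, outputSize: 3,
    hiddenLayerCount: 3, hiddenLayerSize: 64);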
CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default Deep Q-Network (DQN) with pre-configured layers for reinforcement learning.
public static IEnumerable<ILayer<T>> CreateDefaultDeepQNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Deep Q-Network.
Remarks
For Beginners: A Deep Q-Network is a type of neural network used in reinforcement learning, which is how computers learn to make decisions by trying different actions and receiving rewards. Think of it like teaching a dog new tricks with treats. The network learns which actions (like moving left or right in a game) will lead to the highest rewards over time.
CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for the Deep Ritz Method network.
public static IEnumerable<ILayer<T>> CreateDefaultDeepRitzLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
hiddenLayerCount (int): Number of hidden layers (default: 4).
hiddenLayerSize (int): Number of neurons in each hidden layer (default: 50).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Deep Ritz network.
Remarks
For Beginners: The Deep Ritz Method solves PDEs by minimizing an energy functional instead of directly enforcing the PDE. This is based on the Ritz method from calculus of variations. The network learns the function that minimizes the energy.
Similar architecture to VPINN but used with energy-based loss functions. Tanh activation provides smooth second derivatives needed for energy computations.
CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T>, DenseNetConfiguration)
Creates default layers for a DenseNet network based on the specified configuration.
public static IEnumerable<ILayer<T>> CreateDefaultDenseNetLayers(NeuralNetworkArchitecture<T> architecture, DenseNetConfiguration configuration)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
configuration (DenseNetConfiguration): The DenseNet-specific configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a DenseNet network.
Remarks
For Beginners: DenseNet (Densely Connected Convolutional Network) connects each layer to every other layer in a feed-forward fashion. This creates strong gradient flow and feature reuse, enabling very deep networks with fewer parameters.
The DenseNet architecture consists of:
- Stem: Initial 7x7 conv with stride 2, followed by 3x3 max pooling
- Dense Blocks: Multiple dense blocks with transition layers between them
- Transition Layers: 1x1 conv for channel reduction followed by 2x2 avg pooling
- Classification Head: Global average pooling followed by a dense layer
CreateDefaultDepthAnythingV2Layers(int, int, int, int, int)
Creates default layers for Depth Anything V2 monocular depth estimation model.
public static IEnumerable<ILayer<T>> CreateDefaultDepthAnythingV2Layers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 640, int numFeatures = 768, int numEncoderBlocks = 12)
Parameters
inputChannels (int): Number of input channels (default: 3 for RGB).
inputHeight (int): Input height (default: 480).
inputWidth (int): Input width (default: 640).
numFeatures (int): Number of feature channels (default: 768 for Base).
numEncoderBlocks (int): Number of encoder transformer blocks (default: 12).
Returns
- IEnumerable<ILayer<T>>
An enumerable of layers configured for Depth Anything V2.
Remarks
For Beginners: Depth Anything V2 estimates depth maps from single images. Given an RGB image, it predicts the relative distance of each pixel from the camera.
Architecture:
- ViT-based encoder with DINOv2 initialization
- Multi-scale decoder for dense prediction
- Depth prediction head
Reference: "Depth Anything V2" https://arxiv.org/abs/2406.09414
CreateDefaultDessurtLayers(int, int, int, int, int, int)
Creates default Dessurt (self-supervised document transformer) layers.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDessurtLayers(int encoderDim = 768, int decoderDim = 768, int encoderLayers = 12, int decoderLayers = 6, int numHeads = 12, int vocabSize = 50265)
Parameters
encoderDim (int): Encoder dimension (default: 768).
decoderDim (int): Decoder dimension (default: 768).
encoderLayers (int): Number of encoder layers (default: 12).
decoderLayers (int): Number of decoder layers (default: 6).
numHeads (int): Number of attention heads (default: 12).
vocabSize (int): Vocabulary size (default: 50265).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Encoder and decoder layers for a Dessurt model.
CreateDefaultDiTLayers(int, int, int, int, int, int)
Creates default DiT (Document Image Transformer) layers.
public static IEnumerable<ILayer<T>> CreateDefaultDiTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int patchSize = 16, int imageSize = 224, int numClasses = 16)
Parameters
hiddenDim (int): Hidden dimension (default: 768).
numLayers (int): Number of transformer layers (default: 12).
numHeads (int): Number of attention heads (default: 12).
patchSize (int): Patch size for ViT (default: 16).
imageSize (int): Input image size (default: 224).
numClasses (int): Number of output classes (default: 16).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a DiT model.
CreateDefaultDocBankLayers(int, int, int, int)
Creates default layers for DocBank page segmentation model.
public static IEnumerable<ILayer<T>> CreateDefaultDocBankLayers(int imageSize = 1024, int backboneChannels = 256, int numClasses = 13, int hiddenDim = 256)
Parameters
imageSize (int): Input image size (default: 1024).
backboneChannels (int): Backbone output channels (default: 256).
numClasses (int): Number of segmentation classes (default: 13).
hiddenDim (int): Hidden dimension for the segmentation head (default: 256).
Returns
- IEnumerable<ILayer<T>>
Enumerable of layers for DocBank.
Remarks
DocBank uses a ResNet backbone with FPN for semantic segmentation.
Reference: "DocBank: A Benchmark Dataset for Document Layout Analysis" (COLING 2020)
CreateDefaultDocFormerLayers(int, int, int, int, int, int, int)
Creates default DocFormer layers for document understanding with shared spatial encodings.
public static IEnumerable<ILayer<T>> CreateDefaultDocFormerLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int spatialDim = 128, int numClasses = 16)
Parameters
hiddenDim (int): Hidden dimension (default: 768).
numLayers (int): Number of transformer layers (default: 12).
numHeads (int): Number of attention heads (default: 12).
vocabSize (int): Vocabulary size (default: 30522).
imageSize (int): Input image size (default: 224).
spatialDim (int): Spatial embedding dimension (default: 128).
numClasses (int): Number of output classes (default: 16).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a DocFormer model.
Remarks
DocFormer uses shared spatial encodings across text, visual, and layout modalities.
Reference: "DocFormer: End-to-End Transformer for Document Understanding" (ICCV 2021)
CreateDefaultDocGCNLayers(int, int, int, int)
Creates default DocGCN (Document Graph Convolutional Network) layers.
public static IEnumerable<ILayer<T>> CreateDefaultDocGCNLayers(int inputDim = 768, int hiddenDim = 256, int numGCNLayers = 3, int numClasses = 7)
Parameters
inputDim (int): Input feature dimension (default: 768).
hiddenDim (int): Hidden dimension (default: 256).
numGCNLayers (int): Number of GCN layers (default: 3).
numClasses (int): Number of output classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a DocGCN model.
CreateDefaultDocOwlLayers(int, int, int, int, int, int)
Creates default DocOwl (mPLUG-DocOwl) layers for document understanding.
public static IEnumerable<ILayer<T>> CreateDefaultDocOwlLayers(int visionDim = 1024, int textDim = 4096, int visionLayers = 24, int textLayers = 32, int numHeads = 16, int vocabSize = 32000)
Parameters
visionDim (int): Vision encoder dimension (default: 1024).
textDim (int): Text encoder dimension (default: 4096).
visionLayers (int): Number of vision layers (default: 24).
textLayers (int): Number of text layers (default: 32).
numHeads (int): Number of attention heads (default: 16).
vocabSize (int): Vocabulary size (default: 32000).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a DocOwl model.
CreateDefaultDonutLayers(int, int, int, int, int[]?, int[]?, int, int, int, int, int, int, int, int)
Creates default Donut layers for OCR-free document understanding.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultDonutLayers(int imageHeight = 1920, int imageWidth = 2560, int inputChannels = 3, int embedDim = 128, int[]? depths = null, int[]? numHeads = null, int windowSize = 10, int patchSize = 4, int mlpRatio = 4, int decoderHiddenDim = 1024, int numDecoderLayers = 4, int decoderHeads = 16, int vocabSize = 57522, int maxGenerationLength = 768)
Parameters
imageHeight (int): Input image height (default: 1920 for donut-base).
imageWidth (int): Input image width (default: 2560 for donut-base).
inputChannels (int): Number of input channels (default: 3 for RGB).
embedDim (int): Initial embedding dimension (default: 128 for Swin-B).
depths (int[]): Depths of each Swin stage (default: {2,2,14,2} for donut-base).
numHeads (int[]): Attention heads per stage (default: {4,8,16,32}).
windowSize (int): Window size for attention (default: 10 for donut-base).
patchSize (int): Initial patch size (default: 4).
mlpRatio (int): MLP expansion ratio (default: 4).
decoderHiddenDim (int): Decoder hidden dimension (default: 1024).
numDecoderLayers (int): Number of decoder layers (default: 4).
decoderHeads (int): Number of decoder attention heads (default: 16).
vocabSize (int): Vocabulary size (default: 57522).
maxGenerationLength (int): Maximum output sequence length (default: 768).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
A tuple of (EncoderLayers, DecoderLayers) forming a Donut architecture.
Remarks
Donut (Document Understanding Transformer) is an OCR-free end-to-end model:
- Swin Transformer-B encoder with hierarchical stages for image features
- BART-style decoder for text generation
- Direct pixel-to-text conversion without explicit OCR
For Beginners: This creates a model that can "read" documents directly from pixels without needing a separate OCR step. The encoder extracts visual features at multiple scales using the Swin Transformer architecture, while the decoder generates text autoregressively.
Default Configuration (donut-base):
- Input: 2560×1920 RGB images
- Encoder: Swin-B with depths {2,2,14,2}, 128 initial dim, window size 10
- Decoder: 4-layer BART-style with 1024 hidden dim
Reference: "OCR-free Document Understanding Transformer" (ECCV 2022)
CreateDefaultEASTLayers(int, int, int, string)
Creates default EAST (Efficient and Accurate Scene Text Detector) layers.
public static IEnumerable<ILayer<T>> CreateDefaultEASTLayers(int imageSize = 512, int backboneChannels = 512, int featureChannels = 128, string geometryType = "RBOX")
Parameters
imageSize (int): Input image size (default: 512).
backboneChannels (int): Backbone output channels (default: 512).
featureChannels (int): Feature map channels (default: 128).
geometryType (string): Geometry output type, RBOX or QUAD (default: RBOX).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an EAST model.
CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int[]?)
Creates default ECAPA-TDNN layers for spoken language identification.
public static IEnumerable<ILayer<T>> CreateDefaultECAPATDNNLanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 192, int numLanguages = 20, int[]? dilations = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
numMels (int): Number of mel filterbank channels (default: 80).
tdnnChannels (int): Number of TDNN channels (default: 1024).
embeddingDimension (int): Embedding dimension (default: 192).
numLanguages (int): Number of languages to classify (default: 20).
dilations (int[]): Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an ECAPA-TDNN language identifier.
Remarks
ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation TDNN) is a state-of-the-art architecture for speaker and language recognition using:
- SE-Res2Net blocks with channel attention
- Multi-layer feature aggregation (MFA)
- Attentive statistics pooling
CreateDefaultEDVRLayers(int, int, int, int, int, int, int)
Creates default layers for EDVR video restoration.
public static IEnumerable<ILayer<T>> CreateDefaultEDVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numFrames = 5, int numGroups = 8, int numBlocks = 5)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 256).
inputWidth (int): Input frame width (default: 256).
numFeatures (int): Number of feature channels (default: 64).
numFrames (int): Number of input frames (default: 5).
numGroups (int): Number of groups (default: 8).
numBlocks (int): Number of blocks (default: 5).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultELMLayers(NeuralNetworkArchitecture<T>, int)
Creates default layers for an Extreme Learning Machine (ELM) neural network.
public static IEnumerable<ILayer<T>> CreateDefaultELMLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerSize)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
hiddenLayerSize (int): The size of the hidden layer.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an Extreme Learning Machine.
Remarks
For Beginners: An Extreme Learning Machine (ELM) is a simplified neural network where only the output layer weights are trained. The hidden layer weights are randomly initialized and never updated. This makes ELMs very fast to train compared to traditional neural networks, while still providing good performance for many tasks. Think of it as a "shortcut" approach to neural network training.
ELMs work by projecting the input data into a higher-dimensional space using random weights, then finding the best output weights to solve the problem. They're particularly useful when you need a quick solution and don't have time for extensive training.
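Example
A hedged sketch with a deliberately wide random hidden layer (architecture configured elsewhere; the wrapper name and size are illustrative):
using System.Collections.Generic;
// Only the output weights of the returned stack are meant to be trained.
static IEnumerable<ILayer<double>> BuildElm(NeuralNetworkArchitecture<double> architecture) =>
    LayerHelper<double>.CreateDefaultELMLayers(architecture, hiddenLayerSize: 1000);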
CreateDefaultESNLayers(int, int, int, double, double)
Creates a default Echo State Network (ESN) with pre-configured layers.
public static IEnumerable<ILayer<T>> CreateDefaultESNLayers(int inputSize, int outputSize, int reservoirSize, double spectralRadius = 0.9, double sparsity = 0.1)
Parameters
inputSize (int): The size of the input layer.
outputSize (int): The size of the output layer.
reservoirSize (int): The size of the reservoir (hidden layer).
spectralRadius (double): Controls the stability of the reservoir dynamics (default: 0.9).
sparsity (double): The connection sparsity in the reservoir (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an Echo State Network.
Remarks
For Beginners: An Echo State Network is a special type of recurrent neural network where most of the connections between neurons are fixed (not trained). Only the connections from the hidden layer to the output are trained. Think of it like having a pool of water (the reservoir) that you disturb with input signals, and then you learn to read the ripple patterns to predict outputs. This makes ESNs very fast to train compared to other recurrent networks.
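Example
A minimal, self-contained sketch (library namespace assumed; the sizes are illustrative):
using System.Collections.Generic;
// 8 inputs, 2 outputs, 500-neuron reservoir; spectralRadius < 1 keeps the
// reservoir dynamics stable, and sparsity 0.1 gives 10% internal connectivity.
IEnumerable<ILayer<double>> esnLayers = LayerHelper<double>.CreateDefaultESNLayers(
    inputSize: 8, outputSize: 2, reservoirSize: 500, spectralRadius: 0.9, sparsity: 0.1);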
CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T>, EfficientNetConfiguration)
Creates default layers for an EfficientNet network based on the specified configuration.
public static IEnumerable<ILayer<T>> CreateDefaultEfficientNetLayers(NeuralNetworkArchitecture<T> architecture, EfficientNetConfiguration configuration)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
configuration (EfficientNetConfiguration): The EfficientNet-specific configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an EfficientNet network.
Remarks
For Beginners: EfficientNet uses compound scaling to balance network depth, width, and resolution. Each variant (B0-B7) represents a different scale factor, achieving excellent accuracy with fewer parameters than previous architectures.
CreateDefaultFLAVRLayers(int, int, int, int, int)
Creates default layers for FLAVR frame interpolation.
public static IEnumerable<ILayer<T>> CreateDefaultFLAVRLayers(int inputChannels = 3, int inputHeight = 256, int inputWidth = 256, int numFeatures = 64, int numInputFrames = 4)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 256).
inputWidth (int): Input frame width (default: 256).
numFeatures (int): Number of feature channels (default: 64).
numInputFrames (int): Number of input frames (default: 4).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultFastDVDNetLayers(int, int, int, int, int)
Creates default layers for FastDVDNet video denoising.
public static IEnumerable<ILayer<T>> CreateDefaultFastDVDNetLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 32, int numInputFrames = 5)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 480).
inputWidth (int): Input frame width (default: 854).
numFeatures (int): Number of feature channels (default: 32).
numInputFrames (int): Number of input frames (default: 5).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates default layers for a FastText model.
public static IEnumerable<ILayer<T>> CreateDefaultFastTextLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int bucketSize, int embeddingDimension)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
vocabSize (int): The size of the vocabulary.
bucketSize (int): The number of buckets for n-gram hashing.
embeddingDimension (int): The dimension of the embedding vectors.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a FastText model.
Remarks
For Beginners: FastText improves on Word2Vec by considering sub-word information (character n-grams). It represents words as the sum of their n-gram embeddings.
CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a feed-forward neural network.
public static IEnumerable<ILayer<T>> CreateDefaultFeedForwardLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 64)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration that defines input and output shapes.
hiddenLayerCount (int): Number of hidden layers (default: 2).
hiddenLayerSize (int): Number of neurons in each hidden layer (default: 64).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a feed-forward neural network.
Remarks
For Beginners: This method builds a basic feed-forward neural network. Think of it as a series of connected layers where information flows from the input, through "hidden" processing layers, to the output.
Key components:
- Input layer: Receives the raw data
- Hidden layers: Process and transform the data, learning patterns
- Output layer: Produces the final prediction or classification
The network automatically adjusts for different types of tasks (like classification or regression) by choosing appropriate activation functions for the output layer.
CreateDefaultFlowFormerLayers(int, int, int, int, int)
Creates default layers for FlowFormer optical flow estimation.
public static IEnumerable<ILayer<T>> CreateDefaultFlowFormerLayers(int inputChannels = 3, int inputHeight = 448, int inputWidth = 1024, int embedDim = 256, int numLayers = 6)
Parameters
inputChannels (int): Number of input channels (default: 3).
inputHeight (int): Input frame height (default: 448).
inputWidth (int): Input frame width (default: 1024).
embedDim (int): Embedding dimension (default: 256).
numLayers (int): Number of transformer layers (default: 6).
Returns
- IEnumerable<ILayer<T>>
CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T>, int[], int, int, int)
Creates default layers for a Fourier Neural Operator (FNO).
public static IEnumerable<ILayer<T>> CreateDefaultFourierNeuralOperatorLayers(NeuralNetworkArchitecture<T> architecture, int[] spatialDimensions, int numFourierLayers = 4, int hiddenChannels = 64, int numModes = 12)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
spatialDimensions (int[]): Dimensions of the spatial domain (e.g., [64, 64] for a 2D grid, [32] for 1D). This determines the FFT size for spectral operations.
numFourierLayers (int): Number of Fourier layers (default: 4).
hiddenChannels (int): Number of hidden channels/width (default: 64).
numModes (int): Number of Fourier modes to retain (default: 12). Lower = smoother, higher = more detail. Should be less than or equal to the smallest spatial dimension.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Fourier Neural Operator.
Remarks
For Beginners: Fourier Neural Operators learn mappings between function spaces by operating in frequency domain. They're especially powerful for PDEs because many physical phenomena have simple representations in frequency space.
Architecture:
- Lifting layer: Projects input to higher-dimensional channel space
- Fourier layers: Apply spectral convolution (FFT → learnable weights → IFFT) + local linear transform
- Projection layers: Map back to output dimension
Key FNO Properties:
- Resolution-invariant: Train at one resolution, evaluate at another
- Global receptive field through spectral operations
- Efficient for smooth functions (low-frequency dominated)
Note: For full FNO functionality with training, use the FourierNeuralOperator<T> class directly, which provides a complete neural operator implementation.
Exceptions
- ArgumentNullException
Thrown when spatialDimensions is null.
- ArgumentException
Thrown when spatialDimensions is empty.
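Example
A hedged sketch for a 64x64 grid (architecture configured elsewhere; the wrapper name is hypothetical, and numModes must not exceed the smallest spatial dimension):
using System.Collections.Generic;
// 4 Fourier layers, 64 hidden channels, 12 retained modes per dimension.
static IEnumerable<ILayer<double>> BuildFno(NeuralNetworkArchitecture<double> architecture) =>
    LayerHelper<double>.CreateDefaultFourierNeuralOperatorLayers(
        architecture, spatialDimensions: new[] { 64, 64 },
        numFourierLayers: 4, hiddenChannels: 64, numModes: 12);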
CreateDefaultFrameInterpolationLayers(int, int, int, int)
Creates layers for a frame interpolation model (FILM/RIFE-style).
public static IEnumerable<ILayer<T>> CreateDefaultFrameInterpolationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64)
Parameters
inputChannels (int): Number of input channels (default: 3 for RGB).
inputHeight (int): Input frame height (default: 128).
inputWidth (int): Input frame width (default: 128).
numFeatures (int): Number of feature channels (default: 64).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for frame interpolation.
Remarks
For Beginners: Frame interpolation creates new frames between existing ones to make video smoother (e.g., 30fps to 60fps). The model learns to "imagine" what the in-between frames should look like based on the surrounding frames.
Architecture:
- Feature pyramid extracts multi-scale features
- Flow estimation predicts motion
- Synthesis network generates interpolated frames
CreateDefaultGNNLayers(NeuralNetworkArchitecture<T>)
Creates default layers for a Graph Neural Network (GNN).
public static IEnumerable<ILayer<T>> CreateDefaultGNNLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Graph Neural Network.
Remarks
For Beginners: Graph Neural Networks (GNNs) are specialized neural networks designed to work with graph-structured data, where information is represented as nodes (points) connected by edges (lines). Examples include social networks, molecular structures, or road networks.
Unlike standard neural networks that process individual data points independently, GNNs can understand relationships between data points. They work by passing information between connected nodes, allowing each node to "learn" from its neighbors. This makes GNNs powerful for tasks where relationships between entities matter, such as recommending friends on social media, predicting protein interactions, or analyzing traffic patterns.
CreateDefaultGRULayers(NeuralNetworkArchitecture<T>)
Creates a default Gated Recurrent Unit (GRU) neural network layer configuration.
public static IEnumerable<ILayer<T>> CreateDefaultGRULayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for GRU-based processing.
Remarks
For Beginners: A GRU (Gated Recurrent Unit) is a type of recurrent neural network that's especially good at learning patterns in sequences of data, like text or time series. It's similar to LSTM but with a simpler structure, making it faster to train while still capturing long-term dependencies in data.
This method automatically configures appropriate GRU layers based on your task type, with sensible defaults for hidden layer sizes and activation functions.
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid input or output dimensions.
CreateDefaultGenreClassifierLayers(int, int, int, int, int, double)
Creates default genre classification layers.
public static IEnumerable<ILayer<T>> CreateDefaultGenreClassifierLayers(int numMels = 128, int hiddenDim = 256, int numClasses = 10, int maxFrames = 1000, int numAttentionLayers = 4, double dropoutRate = 0.3)
Parameters
numMels (int): Number of mel spectrogram bins (default: 128).
hiddenDim (int): Hidden layer dimension (default: 256).
numClasses (int): Number of genre classes (default: 10).
maxFrames (int): Maximum input frames (default: 1000).
numAttentionLayers (int): Number of attention layers (default: 4).
dropoutRate (double): Dropout rate (default: 0.3).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for genre classification.
Remarks
Audio classification architecture with:
- Mel spectrogram feature extraction
- Transformer encoder for temporal modeling
- Global average pooling
- Classification head with softmax output
CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a GloVe (Global Vectors) model.
public static IEnumerable<ILayer<T>> CreateDefaultGloVeLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
vocabSize (int): The size of the vocabulary.
embeddingDimension (int): The dimension of the embedding vectors.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a GloVe model.
Remarks
For Beginners: GloVe creates word embeddings by learning from the co-occurrence statistics of words. It uses two sets of embeddings and two sets of biases.
Note: The layers returned by this method are not intended to be used as a sequential feed-forward stack. They represent the four components (W, W_tilde, b, b_tilde) required for the GloVe model's custom forward pass.
CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T>, int, int, double)
Creates default layers for a Graph Attention Network (GAT).
public static IEnumerable<ILayer<T>> CreateDefaultGraphAttentionLayers(NeuralNetworkArchitecture<T> architecture, int numHeads = 8, int numLayers = 2, double dropoutRate = 0.6)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
numHeads (int): Number of attention heads per layer (default: 8).
numLayers (int): Number of GAT layers (default: 2).
dropoutRate (double): Dropout rate for attention coefficients (default: 0.6).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for GAT processing.
Remarks
For Beginners: GAT uses attention mechanisms to learn which neighbors are most important for each node, allowing dynamic weighting of neighbor contributions.
CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T>, int, int, int, double)
Creates default layers for a Graph Classification model.
public static IEnumerable<ILayer<T>> CreateDefaultGraphClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 128, int numGnnLayers = 3, double dropoutRate = 0.5)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
hiddenDim (int): Hidden dimension size (default: 64).
embeddingDim (int): Graph embedding dimension (default: 128).
numGnnLayers (int): Number of GNN layers (default: 3).
dropoutRate (double): Dropout rate for regularization (default: 0.5).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for graph classification.
Remarks
For Beginners: Graph classification predicts labels for entire graphs. This architecture uses multiple GCN layers followed by pooling and classification.
CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Graph Generation model (VGAE encoder).
public static IEnumerable<ILayer<T>> CreateDefaultGraphGenerationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 32, int numEncoderLayers = 2)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
hiddenDim (int): Hidden dimension size (default: 32).
numEncoderLayers (int): Number of encoder GNN layers (default: 2).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for graph generation encoder.
Remarks
For Beginners: Graph generation models learn to create new graph structures. This encoder uses GCN layers to map node features to a latent space.
CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T>, int, int, bool, double)
Creates default layers for a Graph Isomorphism Network (GIN).
public static IEnumerable<ILayer<T>> CreateDefaultGraphIsomorphismLayers(NeuralNetworkArchitecture<T> architecture, int mlpHiddenDim = 64, int numLayers = 5, bool learnEpsilon = true, double initialEpsilon = 0)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
mlpHiddenDim (int): Hidden dimension for the MLP within GIN layers (default: 64).
numLayers (int): Number of GIN layers (default: 5).
learnEpsilon (bool): Whether to learn the epsilon parameter (default: true).
initialEpsilon (double): Initial value for epsilon (default: 0.0).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for GIN processing.
Remarks
For Beginners: GIN is provably as powerful as the Weisfeiler-Lehman graph isomorphism test, making it optimal for distinguishing graph structures.
CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T>, SAGEAggregatorType, int, bool)
Creates default layers for a GraphSAGE (Graph Sample and Aggregate) Network.
public static IEnumerable<ILayer<T>> CreateDefaultGraphSAGELayers(NeuralNetworkArchitecture<T> architecture, SAGEAggregatorType aggregatorType = SAGEAggregatorType.Mean, int numLayers = 2, bool normalize = true)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
aggregatorTypeSAGEAggregatorTypeThe type of aggregation function (default: Mean).
numLayersintNumber of GraphSAGE layers (default: 2).
normalizeboolWhether to apply L2 normalization (default: true).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for GraphSAGE processing.
Remarks
For Beginners: GraphSAGE learns to aggregate neighbor information for inductive learning. It can generalize to new, unseen nodes by learning aggregation functions.
CreateDefaultHTMLayers(NeuralNetworkArchitecture<T>, int, int, double)
Creates a default Hierarchical Temporal Memory (HTM) neural network layer configuration.
public static IEnumerable<ILayer<T>> CreateDefaultHTMLayers(NeuralNetworkArchitecture<T> architecture, int columnCount, int cellsPerColumn, double sparsityThreshold)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture specification.
columnCount (int): The number of columns in the HTM network.
cellsPerColumn (int): The number of cells per column.
sparsityThreshold (double): The sparsity threshold for the spatial pooler.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for HTM processing.
Remarks
For Beginners: Hierarchical Temporal Memory (HTM) is a machine learning technology that mimics certain structural and algorithmic properties of the neocortex (the part of the brain responsible for higher-order thinking). HTM is particularly good at learning patterns in sequential data and making predictions.
Key HTM concepts:
- Columns: Vertical arrangements of cells that work together
- Cells: The basic processing units (like neurons)
- Sparsity: Only a small percentage of cells are active at any time, which helps with learning
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid input or output dimensions.
CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Hamiltonian Neural Network.
public static IEnumerable<ILayer<T>> CreateDefaultHamiltonianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration that defines input and output shapes.
hiddenLayerCount (int): Number of hidden layers (default: 3).
hiddenLayerSize (int): Number of neurons in each hidden layer (default: 64).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Hamiltonian neural network.
Remarks
For Beginners: Hamiltonian Neural Networks (HNNs) learn the energy function (Hamiltonian) of a physical system. The network takes a state vector [q, p] (positions and momenta) as input and outputs a scalar energy value.
Key design choices:
- Uses Tanh activation in hidden layers for smooth, bounded outputs that help with gradient computation
- Output layer has linear activation since the Hamiltonian can be any real number
- Architecture is designed for computing gradients (∂H/∂q, ∂H/∂p) to derive dynamics
The network structure enables Hamilton's equations:
- dq/dt = ∂H/∂p (velocity from momentum gradient)
- dp/dt = -∂H/∂q (force from position gradient)
This guarantees energy conservation by construction.
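Example
A minimal sketch for a system whose state [q, p] matches the architecture's input shape (architecture configured elsewhere; the wrapper name is hypothetical):
using System.Collections.Generic;
// The returned stack maps a state [q, p] to a scalar energy H(q, p).
static IEnumerable<ILayer<double>> BuildHnn(NeuralNetworkArchitecture<double> architecture) =>
    LayerHelper<double>.CreateDefaultHamiltonianLayers(architecture, hiddenLayerCount: 3, hiddenLayerSize: 64);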
CreateDefaultInfographicVQALayers(int, int, int, int, int, int, int, int)
Creates default InfographicVQA layers for infographic understanding.
public static IEnumerable<ILayer<T>> CreateDefaultInfographicVQALayers(int imageSize = 1024, int visionDim = 768, int textDim = 768, int fusionDim = 768, int visionLayers = 12, int fusionLayers = 6, int numHeads = 12, int vocabSize = 30522)
Parameters
imageSizeintInput image size (default: 1024).
visionDimintVision encoder dimension (default: 768).
textDimintText encoder dimension (default: 768).
fusionDimintFusion dimension (default: 768).
visionLayersintNumber of vision layers (default: 12).
fusionLayersintNumber of fusion layers (default: 6).
numHeadsintNumber of attention heads (default: 12).
vocabSizeintVocabulary size (default: 30522).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an InfographicVQA model.
CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for an Instructor/E5 (Instruction-Tuned) embedding model.
public static IEnumerable<ILayer<T>> CreateDefaultInstructorLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintVocabulary size (default: 30522).
embeddingDimensionintEmbedding dimension (default: 768).
maxSequenceLengthintMaximum sequence length (default: 512).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
feedForwardDimintFeed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an Instructor-style embedding model.
CreateDefaultInternVideo2Layers(int, int, int, int, int, int)
Creates layers for an InternVideo2-style video understanding model.
public static IEnumerable<ILayer<T>> CreateDefaultInternVideo2Layers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numEncoderLayers = 12, int patchSize = 14)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height.
inputWidthintInput frame width.
embedDimintEmbedding dimension (default: 768).
numEncoderLayersintNumber of transformer encoder layers (default: 12).
patchSizeintPatch size for video tokenization (default: 14).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video understanding.
Remarks
For Beginners: InternVideo2 understands video content by encoding frames into embeddings that capture both spatial (what's in each frame) and temporal (how things change over time) information. It can be used for:
- Video classification (identifying what's happening)
- Video-text retrieval (finding videos matching descriptions)
- Video question answering
Architecture (based on the paper):
- Patch embedding converts video frames into tokens
- Spatial attention processes within-frame relationships
- Temporal attention processes across-frame relationships
- FFN layers add non-linearity and expressiveness
- Projection maps to a shared video-text embedding space
Reference: "InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding" https://arxiv.org/abs/2403.15377
CreateDefaultLSMLayers(NeuralNetworkArchitecture<T>, int, double, double, double, double)
Creates a default configuration of layers for a Liquid State Machine (LSM) neural network.
public static IEnumerable<ILayer<T>> CreateDefaultLSMLayers(NeuralNetworkArchitecture<T> architecture, int reservoirSize = 100, double connectionProbability = 0.1, double spectralRadius = 0.9, double inputScaling = 1, double leakingRate = 1)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
reservoirSizeintThe size of the reservoir (number of neurons in the reservoir layer). Default is 100.
connectionProbabilitydoubleThe probability of connection between neurons in the reservoir. Default is 0.1 (10%).
spectralRadiusdoubleControls the stability of the reservoir dynamics. Default is 0.9.
inputScalingdoubleScaling factor for input connections. Default is 1.0.
leakingRatedoubleControls how quickly the reservoir responds to new inputs. Default is 1.0.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for a Liquid State Machine.
Remarks
For Beginners: A Liquid State Machine is a special type of neural network inspired by how the brain processes information. The key component is the "reservoir" - imagine it as a pool of randomly connected neurons that create complex patterns when input is fed into them.
- The reservoirSize is how many neurons are in this pool
- The connectionProbability determines how densely connected these neurons are
- The spectralRadius affects how stable the patterns in the reservoir are
- The inputScaling controls how strongly the input affects the reservoir
- The leakingRate determines how quickly the reservoir responds to new information
LSMs are particularly good at processing time-dependent data like speech or video.
Exceptions
- ArgumentNullException
Thrown when architecture is null.
- InvalidOperationException
Thrown when input shape is not specified or input/output size is not positive.
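A minimal usage sketch (architecture configured elsewhere; a spectral radius below 1.0 is the usual choice for stable reservoir dynamics):
// 200 reservoir neurons with 10% connectivity; the remaining values
// are the documented defaults.
var layers = LayerHelper<double>.CreateDefaultLSMLayers(
    architecture,
    reservoirSize: 200,
    connectionProbability: 0.1,
    spectralRadius: 0.9,
    inputScaling: 1.0,
    leakingRate: 1.0);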
CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default configuration of layers for a Long Short-Term Memory (LSTM) neural network.
public static IEnumerable<ILayer<T>> CreateDefaultLSTMNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for an LSTM neural network.
Remarks
For Beginners: LSTM (Long Short-Term Memory) networks are a special kind of neural network designed to remember information for long periods of time. Think of them like a person with a good memory who can recall things from the past to make decisions in the present.
LSTMs are particularly useful for:
- Text prediction (like autocomplete on your phone)
- Speech recognition
- Time series forecasting (like stock prices or weather)
- Any task where the order of data matters
Key terms explained:
- Hidden Size: How much information the network can remember at once (bigger = more memory)
- Layers: How many processing steps the data goes through (more layers = more complex patterns)
- Activation Function: How neurons decide whether to fire (like Tanh or Sigmoid)
- Recurrent Activation: Special activation function used for the memory gates
Exceptions
- ArgumentNullException
Thrown when architecture is null.
- InvalidOperationException
Thrown when input shape is not specified or input/output size is not positive.
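A minimal usage sketch (assumes an architecture whose input shape describes a sequence, e.g. [timeSteps, featuresPerStep]):
// All LSTM hyperparameters (hidden size, activations) come from
// the helper's sensible defaults.
var layers = LayerHelper<double>.CreateDefaultLSTMNetworkLayers(architecture);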
CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Lagrangian Neural Network.
public static IEnumerable<ILayer<T>> CreateDefaultLagrangianLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 3, int hiddenLayerSize = 64)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration that defines input and output shapes.
hiddenLayerCountintNumber of hidden layers (default: 3).
hiddenLayerSizeintNumber of neurons in each hidden layer (default: 64).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Lagrangian neural network.
Remarks
For Beginners: Lagrangian Neural Networks (LNNs) learn the Lagrangian function L(q, q̇) of a physical system. The Lagrangian is typically L = T - V (kinetic minus potential energy).
Key design choices:
- Uses Tanh activation in hidden layers for smooth derivatives needed in Euler-Lagrange equations
- Output is scalar (the Lagrangian value)
- Structure supports computing second derivatives for equations of motion
The Euler-Lagrange equation: d/dt(∂L/∂q̇) = ∂L/∂q
This gives the equations of motion while automatically respecting conservation laws.
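A minimal usage sketch (assumes an architecture whose input shape matches [q, q̇] and whose output is a single scalar):
// The network maps [q, q̇] to the scalar Lagrangian L(q, q̇).
var layers = LayerHelper<double>.CreateDefaultLagrangianLayers(
    architecture,
    hiddenLayerCount: 3,
    hiddenLayerSize: 64);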
CreateDefaultLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates a standard feed-forward neural network with configurable hidden layers.
public static IEnumerable<ILayer<T>> CreateDefaultLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 1, int hiddenLayerSize = 64, int outputSize = 1)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenLayerCountintNumber of hidden layers (default: 1).
hiddenLayerSizeintNumber of neurons in each hidden layer (default: 64).
outputSizeintNumber of output neurons (default: 1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a feed-forward neural network.
Remarks
For Beginners: A feed-forward neural network is the simplest type of neural network where information flows in one direction from input to output. Think of it as an assembly line where each layer processes the data and passes it to the next layer.
This method creates:
- An input layer that takes your data
- One or more hidden layers that learn patterns in your data
- An output layer that produces the final prediction
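A minimal usage sketch (architecture configured elsewhere; the layer sizes are illustrative):
// Two hidden layers of 128 neurons feeding a single regression output.
var layers = LayerHelper<double>.CreateDefaultLayers(
    architecture,
    hiddenLayerCount: 2,
    hiddenLayerSize: 128,
    outputSize: 1);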
CreateDefaultLayoutGraphLayers(int, int, int, int)
Creates default LayoutGraph layers for graph-based layout analysis.
public static IEnumerable<ILayer<T>> CreateDefaultLayoutGraphLayers(int inputDim = 768, int hiddenDim = 256, int numGraphLayers = 4, int numClasses = 7)
Parameters
inputDimintInput feature dimension (default: 768).
hiddenDimintHidden dimension (default: 256).
numGraphLayersintNumber of graph layers (default: 4).
numClassesintNumber of output classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LayoutGraph model.
CreateDefaultLayoutLMLayers(int, int, int, int, int, int)
Creates default LayoutLM (v1) layers for document understanding with layout-aware pre-training.
public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int maxSequenceLength = 512, int numClasses = 7)
Parameters
hiddenDimintHidden dimension (default: 768 for BERT-base).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
vocabSizeintVocabulary size (default: 30522 for BERT).
maxSequenceLengthintMaximum sequence length (default: 512).
numClassesintNumber of output classes (default: 7 for FUNSD).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LayoutLM model.
Remarks
LayoutLM v1 combines BERT text embeddings with 2D position embeddings to jointly model text and layout. Unlike v2/v3, it does NOT use visual features.
Reference: "LayoutLM: Pre-training of Text and Layout for Document Image Understanding" (KDD 2020) https://arxiv.org/abs/1912.13318
CreateDefaultLayoutLMv2Layers(int, int, int, int, int, int, int)
Creates default LayoutLMv2 layers for document understanding with visual features.
public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv2Layers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 30522, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)
Parameters
hiddenDimintHidden dimension (default: 768 for BERT-base).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
vocabSizeintVocabulary size (default: 30522 for BERT).
imageSizeintInput image size (default: 224).
visualBackboneChannelsintVisual backbone output channels (default: 256).
numClassesintNumber of output classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LayoutLMv2 model.
Remarks
LayoutLMv2 extends LayoutLM by adding visual features from a ResNeXt-FPN backbone, enabling the model to understand documents through text, layout, AND image features.
Key components:
- Visual backbone (ResNeXt-101 with FPN)
- Text encoder (BERT-base)
- Spatial-aware self-attention mechanism
Reference: "LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding" (ACL 2021) https://arxiv.org/abs/2012.14740
CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int)
Creates default LayoutLMv3 layers for document understanding.
public static IEnumerable<ILayer<T>> CreateDefaultLayoutLMv3Layers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 50265, int imageSize = 224, int patchSize = 16, int numClasses = 17)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenDimintHidden dimension size (default: 768 from paper).
numLayersintNumber of transformer layers (default: 12 from paper).
numHeadsintNumber of attention heads (default: 12 from paper).
vocabSizeintVocabulary size (default: 50265 for RoBERTa tokenizer).
imageSizeintInput image size (default: 224).
patchSizeintVision patch size (default: 16).
numClassesintNumber of output classes (default: 17 for layout detection).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LayoutLMv3 architecture.
Remarks
LayoutLMv3 uses unified multimodal pre-training with:
- Text embedding layer (RoBERTa-style)
- Image patch embedding (ViT-style)
- Transformer encoder with spatial-aware self-attention
- Classification head for layout detection or other tasks
Reference: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" (ACM MM 2022)
CreateDefaultLayoutXLMLayers(int, int, int, int, int, int, int)
Creates default LayoutXLM layers for multilingual document understanding.
public static IEnumerable<ILayer<T>> CreateDefaultLayoutXLMLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int vocabSize = 250002, int imageSize = 224, int visualBackboneChannels = 256, int numClasses = 7)
Parameters
hiddenDimintHidden dimension (default: 768).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
vocabSizeintVocabulary size (default: 250002 for XLM-RoBERTa).
imageSizeintInput image size (default: 224).
visualBackboneChannelsintVisual backbone channels (default: 256).
numClassesintNumber of output classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LayoutXLM model.
CreateDefaultLiLTLayers(int, int, int, int, int, int)
Creates default LiLT (Language-Independent Layout Transformer) layers.
public static IEnumerable<ILayer<T>> CreateDefaultLiLTLayers(int hiddenDim = 768, int numLayers = 12, int numHeads = 12, int layoutDim = 768, int vocabSize = 30522, int numClasses = 7)
Parameters
hiddenDimintHidden dimension (default: 768).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
layoutDimintLayout embedding dimension (default: 768).
vocabSizeintVocabulary size (default: 30522).
numClassesintNumber of output classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a LiLT model.
CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T>, int, int, int, double)
Creates default layers for a Link Prediction model encoder.
public static IEnumerable<ILayer<T>> CreateDefaultLinkPredictionLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int embeddingDim = 32, int numLayers = 2, double dropoutRate = 0.5)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
hiddenDimintHidden dimension size (default: 64).
embeddingDimintNode embedding dimension (default: 32).
numLayersintNumber of GNN layers (default: 2).
dropoutRatedoubleDropout rate for regularization (default: 0.5).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for link prediction.
Remarks
For Beginners: Link prediction predicts whether edges should exist between nodes. This encoder learns node embeddings that can be combined to score potential edges.
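A minimal usage sketch (architecture configured elsewhere; how pairs of embeddings are combined into edge scores, e.g. by dot product, is up to the caller):
// Encoder producing 32-dimensional node embeddings for edge scoring.
var layers = LayerHelper<double>.CreateDefaultLinkPredictionLayers(
    architecture,
    hiddenDim: 64,
    embeddingDim: 32,
    numLayers: 2,
    dropoutRate: 0.5);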
CreateDefaultMATCHALayers(int, int, int, int, int, int, int)
Creates default MATCHA (chart understanding) layers.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultMATCHALayers(int encoderDim = 1536, int decoderDim = 1536, int encoderLayers = 18, int decoderLayers = 18, int numHeads = 24, int vocabSize = 50265, int maxPatchesPerImage = 4096)
Parameters
encoderDimintEncoder dimension (default: 1536).
decoderDimintDecoder dimension (default: 1536).
encoderLayersintNumber of encoder layers (default: 18).
decoderLayersintNumber of decoder layers (default: 18).
numHeadsintNumber of attention heads (default: 24).
vocabSizeintVocabulary size (default: 50265).
maxPatchesPerImageintMaximum patches per image (default: 4096).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Encoder and decoder layers for a MATCHA model.
CreateDefaultMRLLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for a Matryoshka Representation Learning (MRL) model.
public static IEnumerable<ILayer<T>> CreateDefaultMRLLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int maxEmbeddingDimension = 1536, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintVocabulary size (default: 30522).
maxEmbeddingDimensionintMaximum (full) embedding dimension (default: 1536).
maxSequenceLengthintMaximum sequence length (default: 512).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
feedForwardDimintFeed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an MRL embedding model.
CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T>, int, int)
Creates a default Memory Network layer configuration.
public static IEnumerable<ILayer<T>> CreateDefaultMemoryNetworkLayers(NeuralNetworkArchitecture<T> architecture, int memorySize, int embeddingSize)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
memorySizeintThe size of the memory component (number of memory slots).
embeddingSizeintThe dimension of the embedding vectors.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for a Memory Network.
Remarks
For Beginners: A Memory Network is a type of neural network that has an explicit memory component. Think of it like a notebook that the network can write to and read from while processing information. This makes it particularly good at tasks that require remembering context from earlier in a sequence, such as answering questions about a story or maintaining a conversation.
The memory size parameter controls how many "pages" are in the notebook, while the embedding size determines how detailed each "note" can be.
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid input or output dimensions.
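A minimal usage sketch (architecture configured elsewhere; both parameters are required):
// 100 memory slots ("pages"), each holding a 64-dimensional embedding ("note").
var layers = LayerHelper<double>.CreateDefaultMemoryNetworkLayers(
    architecture,
    memorySize: 100,
    embeddingSize: 64);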
CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T>, int, int[]?, int[]?, int[]?, int, bool, double, bool)
Creates default layers for a MeshCNN architecture for mesh classification/segmentation.
public static IEnumerable<ILayer<T>> CreateDefaultMeshCNNLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 5, int[]? convChannels = null, int[]? poolTargets = null, int[]? fcSizes = null, int numNeighbors = 4, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = false)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
inputFeaturesintNumber of input features per edge. Default is 5.
convChannelsint[]Channel sizes for each edge convolution block.
poolTargetsint[]Target edge counts after each pooling operation.
fcSizesint[]Sizes of fully connected layers before output.
numNeighborsintNumber of neighboring edges per edge. Default is 4.
useBatchNormboolWhether to use batch normalization. Default is true.
dropoutRatedoubleDropout rate for regularization. Default is 0.5.
useGlobalAveragePoolingboolWhether to use global average pooling. Default is false (max pooling).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for mesh processing.
Remarks
For Beginners: MeshCNN processes 3D mesh data by learning from edge features.
The architecture consists of:
- Edge convolution blocks: Learn patterns from edge neighborhoods
- Mesh pooling: Simplify the mesh by removing less important edges
- Global pooling: Aggregate all edge features into a fixed-size vector
- Fully connected layers: Map aggregated features to class predictions
Applications include:
- 3D shape classification from mesh data
- Mesh segmentation (labeling different parts)
- Learning from CAD models and 3D scans
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid output size.
CreateDefaultMiDaSLayers(int, int, int, int, int)
Creates default layers for MiDaS depth estimation.
public static IEnumerable<ILayer<T>> CreateDefaultMiDaSLayers(int inputChannels = 3, int inputHeight = 384, int inputWidth = 384, int embedDim = 768, int numEncoderLayers = 12)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput image height (default: 384).
inputWidthintInput image width (default: 384).
embedDimintEmbedding dimension (default: 768).
numEncoderLayersintNumber of transformer encoder layers (default: 12).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for MiDaS depth estimation.
CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T>, MobileNetV2Configuration)
Creates default layers for a MobileNetV2 network based on the specified configuration.
public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV2Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV2Configuration configuration)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
configurationMobileNetV2ConfigurationThe MobileNetV2-specific configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a MobileNetV2 network.
Remarks
For Beginners: MobileNetV2 is designed for efficient mobile inference, using inverted residual blocks with linear bottlenecks to achieve high accuracy with low computational cost.
CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T>, MobileNetV3Configuration)
Creates default layers for a MobileNetV3 network based on the specified configuration.
public static IEnumerable<ILayer<T>> CreateDefaultMobileNetV3Layers(NeuralNetworkArchitecture<T> architecture, MobileNetV3Configuration configuration)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
configurationMobileNetV3ConfigurationThe MobileNetV3-specific configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a MobileNetV3 network.
Remarks
For Beginners: MobileNetV3 builds on MobileNetV2 with additional optimizations including squeeze-and-excitation blocks and hard-swish activation for improved accuracy and efficiency.
CreateDefaultMusicGenLayers(int, int, int, int, int, int, int, int, double)
Creates default MusicGen layers for text-to-music generation.
public static IEnumerable<ILayer<T>> CreateDefaultMusicGenLayers(int textHiddenDim = 768, int lmHiddenDim = 1536, int numLmLayers = 24, int numHeads = 16, int numCodebooks = 4, int codebookSize = 2048, int maxTextLength = 256, int maxAudioTokens = 1500, double dropoutRate = 0.1)
Parameters
textHiddenDimintText encoder hidden dimension (default: 768 for T5-base).
lmHiddenDimintLanguage model hidden dimension (default: 1536).
numLmLayersintNumber of language model transformer layers (default: 24).
numHeadsintNumber of attention heads (default: 16).
numCodebooksintNumber of EnCodec codebooks (default: 4).
codebookSizeintSize of each codebook vocabulary (default: 2048).
maxTextLengthintMaximum text sequence length (default: 256).
maxAudioTokensintMaximum audio tokens (~50 tokens/sec) (default: 1500 for 30s).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a MusicGen model.
Remarks
MusicGen is Meta's text-to-music generation model that uses a single-stage transformer language model operating over EnCodec audio codes. Key features:
- Delay pattern for codebook interleaving (reduces sequence length)
- T5-based text encoder for conditioning
- Transformer decoder generating audio codes autoregressively
- EnCodec neural audio codec for high-quality audio reconstruction
Reference: "Simple and Controllable Music Generation" by Copet et al., 2023
CreateDefaultNTMLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates a default configuration of layers for a Neural Turing Machine (NTM).
public static IEnumerable<ILayer<T>> CreateDefaultNTMLayers(NeuralNetworkArchitecture<T> architecture, int memorySize = 128, int memoryVectorSize = 20, int controllerSize = 100)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
memorySizeintThe number of memory locations (default: 128).
memoryVectorSizeintThe size of each memory vector (default: 20).
controllerSizeintThe size of the controller network (default: 100).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for a Neural Turing Machine.
Remarks
For Beginners: A Neural Turing Machine (NTM) is a type of neural network that has an external memory component, similar to how computers have RAM. The network learns to read from and write to this memory, which helps it solve tasks that require remembering information over long periods.
- memorySize: How many "slots" are in the memory (like pages in a notebook)
- memoryVectorSize: How much information each memory slot can hold
- controllerSize: How complex the "brain" of the network is that decides what to read/write
Exceptions
- ArgumentNullException
Thrown when architecture is null.
- ArgumentException
Thrown when memory parameters are not positive.
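A minimal usage sketch (architecture configured elsewhere; the values shown are the documented defaults):
// 128 memory locations of 20 values each, driven by a 100-unit controller.
var layers = LayerHelper<double>.CreateDefaultNTMLayers(
    architecture,
    memorySize: 128,
    memoryVectorSize: 20,
    controllerSize: 100);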
CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T>)
Creates a default configuration of layers for a standard neural network.
public static IEnumerable<ILayer<T>> CreateDefaultNeuralNetworkLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for a standard neural network.
Remarks
For Beginners: This method creates the basic building blocks (layers) of a neural network. Think of layers as a series of connected processing units that transform your input data step by step until it produces the desired output. The complexity setting in the architecture determines how many layers and neurons the network will have: Simple networks have fewer layers, while Deep networks have more layers for handling more complex problems.
Exceptions
- ArgumentNullException
Thrown when architecture is null.
- InvalidOperationException
Thrown when input size or output size is not positive.
CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T>, int, int, double)
Creates default layers for a Node Classification model.
public static IEnumerable<ILayer<T>> CreateDefaultNodeClassificationLayers(NeuralNetworkArchitecture<T> architecture, int hiddenDim = 64, int numLayers = 2, double dropoutRate = 0.5)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
hiddenDimintHidden dimension size (default: 64).
numLayersintNumber of GNN layers (default: 2).
dropoutRatedoubleDropout rate for regularization (default: 0.5).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for node classification.
Remarks
For Beginners: Node classification predicts labels for individual nodes in a graph. This architecture uses GCN layers with dropout for semi-supervised learning on graphs.
CreateDefaultNougatLayers(int, int, int, int, int, int, int, int)
Creates default Nougat layers for academic document understanding.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultNougatLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 10, int numHeads = 16, int vocabSize = 50000, int imageSize = 896, int patchSize = 16, int maxSequenceLength = 4096)
Parameters
hiddenDimintHidden dimension (default: 1024).
numEncoderLayersintNumber of encoder layers (default: 12).
numDecoderLayersintNumber of decoder layers (default: 10).
numHeadsintNumber of attention heads (default: 16).
vocabSizeintVocabulary size (default: 50000).
imageSizeintInput image size (default: 896).
patchSizeintPatch size (default: 16).
maxSequenceLengthintMaximum sequence length (default: 4096).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Tuple of encoder and decoder layers.
Remarks
Reference: "Nougat: Neural Optical Understanding for Academic Documents" (arXiv 2023)
CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T>)
Creates default layers for an occupancy detection neural network without temporal data.
public static IEnumerable<ILayer<T>> CreateDefaultOccupancyLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration that defines input and output shapes.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a non-temporal occupancy detection network.
Remarks
For Beginners: This method builds a simpler neural network for detecting occupancy (whether a space is occupied by people) using data from a single point in time, rather than a sequence of time points. It uses standard Dense layers (also called fully connected layers) to process the input features.
Non-temporal data means the model makes predictions based only on current data points without considering how values have changed over time. For example, using the current temperature, humidity, and CO2 levels to predict occupancy without looking at historical values.
CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T>, int)
Creates default layers for an occupancy detection neural network with temporal data.
public static IEnumerable<ILayer<T>> CreateDefaultOccupancyTemporalLayers(NeuralNetworkArchitecture<T> architecture, int historyWindowSize)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration that defines input and output shapes.
historyWindowSizeintThe number of time steps to consider in the temporal data (how many past observations to include).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a temporal occupancy detection network.
Remarks
For Beginners: This method builds a neural network specifically designed to detect occupancy (whether a space is occupied by people) using data that changes over time. It uses special layer types like LSTM (Long Short-Term Memory) that can "remember" patterns in sequential data, and attention mechanisms that help the network focus on the most important time steps in the data sequence.
Temporal data refers to data collected over time, where the sequence and patterns across time points are important for making predictions. For example, sensor readings collected every minute over several hours would be temporal data.
CreateDefaultOpticalFlowLayers(int, int, int, int)
Creates layers for an optical flow estimation model (RAFT-style).
public static IEnumerable<ILayer<T>> CreateDefaultOpticalFlowLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int hiddenDim = 192)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height.
inputWidthintInput frame width.
hiddenDimintHidden dimension for flow estimation (default: 192).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for optical flow estimation.
Remarks
For Beginners: Optical flow tells you how each pixel moves between two frames. This is useful for motion analysis, video editing, and as input to other models. The output is a 2-channel tensor showing horizontal and vertical motion.
Architecture:
- Feature encoder extracts features from both frames
- Correlation volume computes matching scores
- Iterative refinement improves the flow estimate
CreateDefaultPICKLayers(int, int, int, int, int, int)
Creates default PICK layers for key information extraction.
public static IEnumerable<ILayer<T>> CreateDefaultPICKLayers(int hiddenDim = 256, int numGcnLayers = 2, int numHeads = 8, int vocabSize = 30522, int numEntityTypes = 14, int maxSequenceLength = 512)
Parameters
hiddenDimintHidden dimension (default: 256).
numGcnLayersintNumber of GCN layers (default: 2).
numHeadsintNumber of attention heads (default: 8).
vocabSizeintVocabulary size (default: 30522).
numEntityTypesintNumber of entity types (default: 14).
maxSequenceLengthintMaximum sequence length (default: 512).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a PICK model.
Remarks
Reference: "PICK: Processing Key Information Extraction" (ICPR 2020)
CreateDefaultPINNLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Physics-Informed Neural Network (PINN).
public static IEnumerable<ILayer<T>> CreateDefaultPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 32)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenLayerCountintNumber of hidden layers (default: 4).
hiddenLayerSizeintNumber of neurons in each hidden layer (default: 32).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a PINN.
Remarks
For Beginners: Physics-Informed Neural Networks (PINNs) solve PDEs by training a neural network to minimize the PDE residual at collocation points. The network learns the solution function u(x,t) while respecting the physics (PDE, boundary conditions, and initial conditions).
Uses Tanh activation for smooth derivatives (important for computing PDE residuals). Multiple hidden layers capture complex solution behavior. Linear output layer since PDE solutions can take any real value.
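A minimal usage sketch (assumes an architecture whose input shape matches the PDE coordinates, e.g. [x, t] for a 1D time-dependent problem):
// The network learns the solution u(x, t); the PDE residual loss is
// applied during training, outside this factory method.
var layers = LayerHelper<double>.CreateDefaultPINNLayers(
    architecture,
    hiddenLayerCount: 4,
    hiddenLayerSize: 32);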
CreateDefaultPSENetLayers(int, int, int, int)
Creates default PSENet (Progressive Scale Expansion Network) layers.
public static IEnumerable<ILayer<T>> CreateDefaultPSENetLayers(int imageSize = 640, int backboneChannels = 256, int featureChannels = 256, int numKernels = 7)
Parameters
imageSizeintInput image size (default: 640).
backboneChannelsintBackbone channels (default: 256).
featureChannelsintFeature channels (default: 256).
numKernelsintNumber of scale kernels (default: 7).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a PSENet model.
CreateDefaultPix2StructLayers(int, int, int, int, int, int, int, int)
Creates default Pix2Struct layers for screenshot parsing.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultPix2StructLayers(int hiddenDim = 1024, int numEncoderLayers = 18, int numDecoderLayers = 18, int numHeads = 16, int vocabSize = 50000, int patchSize = 16, int maxPatches = 4096, int maxSequenceLength = 1024)
Parameters
hiddenDimintHidden dimension (default: 1024).
numEncoderLayersintNumber of encoder layers (default: 18).
numDecoderLayersintNumber of decoder layers (default: 18).
numHeadsintNumber of attention heads (default: 16).
vocabSizeintVocabulary size (default: 50000).
patchSizeintPatch size (default: 16).
maxPatchesintMaximum patches (default: 4096).
maxSequenceLengthintMaximum sequence length (default: 1024).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Tuple of encoder and decoder layers.
Remarks
Reference: "Pix2Struct: Screenshot Parsing as Pretraining" (ICML 2023)
CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T>, int)
Creates a default configuration of layers for a Quantum Neural Network.
public static IEnumerable<ILayer<T>> CreateDefaultQuantumNetworkLayers(NeuralNetworkArchitecture<T> architecture, int numQubits = 4)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
numQubitsintThe number of qubits to use in quantum layers (default: 4).
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for a Quantum Neural Network.
Remarks
For Beginners: A Quantum Neural Network combines quantum computing concepts with neural networks. Think of qubits as special units that can exist in multiple states at once (unlike regular bits that are either 0 or 1). This gives quantum networks potential advantages for certain problems. The numQubits parameter controls how many of these special quantum units are used in each quantum layer.
Exceptions
- ArgumentNullException
Thrown when architecture is null.
- ArgumentException
Thrown when numQubits is not positive.
CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T>, int, IRadialBasisFunction<T>?)
Creates a default Radial Basis Function (RBF) neural network layer configuration.
public static IEnumerable<ILayer<T>> CreateDefaultRBFNetworkLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 0, IRadialBasisFunction<T>? rbfFunction = null)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
hiddenSizeintThe size of the hidden layer. If set to 0 or negative, a default size will be calculated.
rbfFunctionIRadialBasisFunction<T>The radial basis function to use. If null, a default Gaussian RBF will be used.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for RBF network processing.
Remarks
For Beginners: A Radial Basis Function (RBF) Network is a special type of neural network that uses "distance" to make predictions. Instead of gradually learning patterns through weights like standard neural networks, RBF networks measure how similar or different an input is from known examples.
Think of it like this: if you want to identify a fruit, you might compare how similar it looks to fruits you already know. An RBF network works in a similar way - it has "reference points" and measures how close new data is to these points.
RBF networks are particularly good at function approximation, pattern recognition, and time series prediction.
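A minimal usage sketch (architecture configured elsewhere):
// hiddenSize: 0 lets the helper pick a default size, and
// rbfFunction: null selects the default Gaussian RBF.
var layers = LayerHelper<double>.CreateDefaultRBFNetworkLayers(
    architecture,
    hiddenSize: 0,
    rbfFunction: null);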
CreateDefaultRNNLayers(NeuralNetworkArchitecture<T>)
Creates a default Recurrent Neural Network (RNN) layer configuration.
public static IEnumerable<ILayer<T>> CreateDefaultRNNLayers(NeuralNetworkArchitecture<T> architecture)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for RNN-based processing.
Remarks
For Beginners: A Recurrent Neural Network (RNN) is designed to work with sequential data by maintaining a form of "memory" of previous inputs. Unlike standard neural networks, RNNs can use their internal state to process sequences of inputs, making them ideal for tasks like text analysis, speech recognition, or time series prediction.
This method automatically configures appropriate RNN layers with sensible defaults, including hidden layer sizes and activation functions.
CreateDefaultRVMLayers(int, int, int, int)
Creates default layers for RVM (Robust Video Matting).
public static IEnumerable<ILayer<T>> CreateDefaultRVMLayers(int inputChannels = 3, int inputHeight = 512, int inputWidth = 512, int numFeatures = 32)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput image height (default: 512).
inputWidthintInput image width (default: 512).
numFeaturesintNumber of feature channels (default: 32).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for robust video matting.
CreateDefaultResNetLayers(NeuralNetworkArchitecture<T>, int, int)
Creates a Residual Neural Network (ResNet) with configurable blocks.
public static IEnumerable<ILayer<T>> CreateDefaultResNetLayers(NeuralNetworkArchitecture<T> architecture, int blockCount = 3, int blockSize = 2)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
blockCountintNumber of residual blocks (default: 3).
blockSizeintNumber of convolutional layers in each block (default: 2).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a ResNet.
Remarks
For Beginners: A Residual Network (ResNet) is designed to solve the "vanishing gradient problem" that occurs when training very deep networks. It does this by adding "skip connections" that allow information to bypass some layers.
Think of it like this: In a traditional network, each layer must learn everything from scratch. In a ResNet, each layer only needs to learn the "difference" (or residual) between its input and the desired output, which is often easier to learn.
Key components:
- Initial convolutional layer: Processes the raw input
- Residual blocks: Groups of layers with skip connections
- Global pooling: Reduces the spatial dimensions to a single value per feature map
- Final dense layer: Makes the prediction based on the extracted features
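A minimal usage sketch (architecture configured elsewhere; the block counts are illustrative):
// Four residual blocks of two convolutional layers each.
var layers = LayerHelper<double>.CreateDefaultResNetLayers(
    architecture,
    blockCount: 4,
    blockSize: 2);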
CreateDefaultSAM2Layers(int, int, int, int)
Creates all SAM2 layers for backward compatibility.
[Obsolete("Use individual SAM2 factory methods (CreateSAM2ImageEncoderLayers, etc.) for proper multi-branch architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSAM2Layers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput image height (default: 1024).
inputWidthintInput image width (default: 1024).
numFeaturesintNumber of feature channels (default: 256).
Returns
- IEnumerable<ILayer<T>>
A combined collection of layers from all SAM2 branches (see the warning below).
Remarks
Warning: This method returns layers from multiple branches that cannot be chained sequentially. Use the individual factory methods (CreateSAM2ImageEncoderLayers, CreateSAM2PromptEncoderLayers, CreateSAM2MemoryLayers, CreateSAM2MaskDecoderLayers) for proper multi-branch handling.
CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for an SGPT (Sentence GPT) decoder-only embedding model.
public static IEnumerable<ILayer<T>> CreateDefaultSGPTLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 50257, int embeddingDimension = 768, int maxSequenceLength = 1024, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintVocabulary size (default: 50257).
embeddingDimensionintEmbedding dimension (default: 768).
maxSequenceLengthintMaximum sequence length (default: 1024).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
feedForwardDimintFeed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an SGPT embedding model.
CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for a SPLADE (Sparse Lexical and Expansion Model) embedding model.
public static IEnumerable<ILayer<T>> CreateDefaultSPLADELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintVocabulary size (default: 30522).
embeddingDimensionintEmbedding dimension (default: 768).
maxSequenceLengthintMaximum sequence length (default: 512).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
feedForwardDimintFeed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a SPLADE embedding model.
CreateDefaultSVTRLayers(int, int, int, int, int, int)
Creates default SVTR (Scene Text Visual Transformer Recognizer) layers.
public static IEnumerable<ILayer<T>> CreateDefaultSVTRLayers(int imageWidth = 256, int imageHeight = 64, int hiddenDim = 192, int numLayers = 8, int numHeads = 6, int charsetSize = 95)
Parameters
imageWidthintInput image width (default: 256).
imageHeightintInput image height (default: 64).
hiddenDimintHidden dimension (default: 192).
numLayersintNumber of transformer layers (default: 8).
numHeadsintNumber of attention heads (default: 6).
charsetSizeintCharacter set size (default: 95).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming an SVTR model.
CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates default layers for a Siamese neural network using a Transformer-based encoder.
public static IEnumerable<ILayer<T>> CreateDefaultSiameseLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintThe size of the vocabulary (default: 30522).
embeddingDimensionintThe dimension of the embedding vectors (default: 768).
maxSequenceLengthintThe maximum length of input sequences (default: 512).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Siamese encoder.
Remarks
For Beginners: A Siamese Network uses two identical "twin" networks to process different inputs. This method sets up the structure for one of those twins, typically using a Transformer encoder to turn text into an embedding (a numeric vector) that can be compared to others.
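A minimal usage sketch (architecture configured elsewhere; the values shown are the documented defaults):
// One "twin" encoder; run the same layers over both inputs and compare
// the resulting embeddings (e.g., by cosine similarity).
var encoder = LayerHelper<double>.CreateDefaultSiameseLayers(
    architecture,
    vocabSize: 30522,
    embeddingDimension: 768,
    maxSequenceLength: 512);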
CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, int)
Creates default layers for a SimCSE (Simple Contrastive Learning of Sentence Embeddings) model.
public static IEnumerable<ILayer<T>> CreateDefaultSimCSELayers(NeuralNetworkArchitecture<T> architecture, int vocabSize = 30522, int embeddingDimension = 768, int maxSequenceLength = 512, int numLayers = 12, int numHeads = 12, int feedForwardDim = 3072)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintVocabulary size (default: 30522).
embeddingDimensionintEmbedding dimension (default: 768).
maxSequenceLengthintMaximum sequence length (default: 512).
numLayersintNumber of transformer layers (default: 12).
numHeadsintNumber of attention heads (default: 12).
feedForwardDimintFeed-forward dimension (default: 3072).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a SimCSE embedding model.
CreateDefaultSlowFastLayers(int, int, int, int, int, int, int)
Creates all SlowFast layers for backward compatibility (returns only slow pathway).
[Obsolete("Use individual SlowFast factory methods for proper dual-pathway architecture.")]
public static IEnumerable<ILayer<T>> CreateDefaultSlowFastLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numClasses = 400, int slowChannels = 64, int fastChannels = 8, int alpha = 8)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height (default: 224).
inputWidthintInput frame width (default: 224).
numClassesintNumber of output classes (default: 400).
slowChannelsintBase channels for the slow pathway (default: 64).
fastChannelsintBase channels for the fast pathway (default: 8).
alphaintFrame-rate ratio between the fast and slow pathways (default: 8).
Returns
- IEnumerable<ILayer<T>>
The slow-pathway layers only (see the warning below).
Remarks
Warning: SlowFast is a dual-pathway architecture that cannot be represented as a single sequential layer list. Use the individual factory methods:
- CreateSlowFastSlowPathwayLayers
- CreateSlowFastFastPathwayLayers
- CreateSlowFastFusionLayers
CreateDefaultSourceSeparationLayers(int, int, int, int, double)
Creates default music source separation layers (U-Net style).
public static IEnumerable<ILayer<T>> CreateDefaultSourceSeparationLayers(int numMels = 513, int baseChannels = 32, int numSources = 4, int maxFrames = 512, double dropoutRate = 0.1)
Parameters
numMelsintNumber of spectrogram frequency bins (default: 513 for STFT with 1024 window).
baseChannelsintBase channel count for U-Net (default: 32).
numSourcesintNumber of output sources (default: 4 for vocals, drums, bass, other).
maxFramesintMaximum time frames (default: 512).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for music source separation.
Remarks
U-Net inspired architecture for source separation with:
- Encoder path with downsampling
- Bottleneck with attention
- Decoder path with upsampling and skip connections
- Multi-source mask prediction
Reference: "Open-Unmix - A Reference Implementation for Music Source Separation"
CreateDefaultSpeakerEmbeddingLayers(int, int, int, int, int, double)
Creates default speaker embedding layers for speaker verification and identification.
public static IEnumerable<ILayer<T>> CreateDefaultSpeakerEmbeddingLayers(int numMels = 80, int hiddenDim = 512, int embeddingDim = 256, int numLayers = 3, int maxFrames = 500, double dropoutRate = 0.1)
Parameters
numMelsintNumber of mel spectrogram bins (default: 80).
hiddenDimintHidden layer dimension (default: 512).
embeddingDimintOutput embedding dimension (default: 256).
numLayersintNumber of LSTM-like layers (default: 3).
maxFramesintMaximum input frames (default: 500).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for speaker embedding extraction.
Remarks
ECAPA-TDNN inspired architecture for speaker embedding with:
- Frame-level feature extraction with attention
- Temporal context aggregation
- Attentive statistics pooling
- Speaker embedding projection
Reference: "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN"
CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T>, SpikingNeuronType, double, double, bool, bool)
Creates default layers for a Spiking Neural Network (SNN).
public static IEnumerable<ILayer<T>> CreateDefaultSpikingLayers(NeuralNetworkArchitecture<T> architecture, SpikingNeuronType neuronType = SpikingNeuronType.LeakyIntegrateAndFire, double tau = 10, double refractoryPeriod = 2, bool useLayerNormalization = false, bool useOutputConversion = true)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
neuronTypeSpikingNeuronTypeThe type of spiking neuron to use.
taudoubleThe membrane time constant that controls how quickly neurons respond to inputs.
refractoryPerioddoubleThe period after firing during which a neuron cannot fire again.
useLayerNormalizationboolWhether to use layer normalization to stabilize training.
useOutputConversionboolWhether to convert spike outputs to continuous values.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Spiking Neural Network.
Remarks
For Beginners: Spiking Neural Networks (SNNs) are a type of neural network that more closely mimics how real neurons in the brain work. Unlike traditional neural networks that use continuous values, SNNs use "spikes" (binary on/off signals) to communicate between neurons. This makes them more biologically realistic and potentially more energy-efficient for certain tasks.
The tau parameter controls how quickly a neuron "forgets" previous inputs - larger values make the neuron remember inputs for longer. The refractory period is like a "rest time" after a neuron fires, during which it cannot fire again, similar to how real neurons behave.
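A minimal usage sketch (architecture configured elsewhere; the values shown are the documented defaults, with units following the library's simulation step):
// Leaky integrate-and-fire neurons with tau = 10 and a refractory
// period of 2; spike outputs are converted back to continuous values.
var layers = LayerHelper<double>.CreateDefaultSpikingLayers(
    architecture,
    neuronType: SpikingNeuronType.LeakyIntegrateAndFire,
    tau: 10,
    refractoryPeriod: 2,
    useLayerNormalization: false,
    useOutputConversion: true);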
CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T>, int, int, int[]?, double[]?, int[]?, bool, double, bool)
Creates the default layer sequence for a SpiralNet mesh neural network.
public static IEnumerable<ILayer<T>> CreateDefaultSpiralNetLayers(NeuralNetworkArchitecture<T> architecture, int inputFeatures = 3, int spiralLength = 9, int[]? convChannels = null, double[]? poolRatios = null, int[]? fcSizes = null, bool useBatchNorm = true, double dropoutRate = 0.5, bool useGlobalAveragePooling = true)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
inputFeaturesintNumber of input features per vertex (default: 3 for coordinates).
spiralLengthintLength of spiral sequences for convolutions.
convChannelsint[]Channel sizes for each spiral convolution block.
poolRatiosdouble[]Pooling ratios for mesh simplification at each level.
fcSizesint[]Sizes of fully connected layers before output.
useBatchNormboolWhether to use batch normalization after convolutions.
dropoutRatedoubleDropout rate for fully connected layers.
useGlobalAveragePoolingboolWhether to use global average (true) or max (false) pooling.
Returns
- IEnumerable<ILayer<T>>
An enumerable of layers forming the SpiralNet architecture.
Remarks
For Beginners: This method builds the default layer stack for SpiralNet++.
Architecture pattern:
- Multiple spiral convolution blocks (SpiralConv + optional BatchNorm)
- Global pooling to aggregate vertex features
- Fully connected layers for classification
Applications:
- 3D face recognition and reconstruction
- Human body shape analysis
- Medical mesh analysis
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid output size.
CreateDefaultStableAudioLayers(int, int, int, int, int, int, int, double)
Creates default Stable Audio layers for text-to-audio generation.
public static IEnumerable<ILayer<T>> CreateDefaultStableAudioLayers(int textHiddenDim = 768, int latentDim = 64, int ditHiddenDim = 1024, int numDitBlocks = 24, int numHeads = 16, int maxTextLength = 512, int maxAudioLength = 2048, double dropoutRate = 0.1)
Parameters
textHiddenDimintText encoder hidden dimension (default: 768).
latentDimintLatent space dimension (default: 64).
ditHiddenDimintDiT hidden dimension (default: 1024).
numDitBlocksintNumber of DiT transformer blocks (default: 24).
numHeadsintNumber of attention heads (default: 16).
maxTextLengthintMaximum text sequence length (default: 512).
maxAudioLengthintMaximum audio latent sequence length (default: 2048).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Stable Audio model.
Remarks
Stable Audio by Stability AI uses a Diffusion Transformer (DiT) architecture:
- T5-based text encoder for conditioning
- Variational autoencoder for audio latent compression
- DiT (Diffusion Transformer) for denoising in latent space
- Supports variable-length audio generation with timing conditioning
Reference: "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion" by Evans et al., 2024
CreateDefaultTRIELayers(int, int, int, int, int, int)
Creates default TRIE (Text Reading and Information Extraction) layers.
public static IEnumerable<ILayer<T>> CreateDefaultTRIELayers(int imageSize = 512, int visualDim = 256, int textDim = 256, int graphDim = 256, int numEntityTypes = 10, int maxEntities = 100)
Parameters
imageSizeintInput image size (default: 512).
visualDimintVisual encoder dimension (default: 256).
textDimintText encoder dimension (default: 256).
graphDimintGraph dimension (default: 256).
numEntityTypesintNumber of entity types (default: 10).
maxEntitiesintMaximum entities (default: 100).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a TRIE model.
CreateDefaultTableTransformerLayers(int, int, int, int, int, int, int)
Creates default layers for TableTransformer model.
public static IEnumerable<ILayer<T>> CreateDefaultTableTransformerLayers(int imageSize = 800, int hiddenDim = 256, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numQueries = 100, int numStructureClasses = 7)
Parameters
imageSizeintInput image size (default: 800).
hiddenDimintTransformer hidden dimension (default: 256).
numEncoderLayersintNumber of encoder layers (default: 6).
numDecoderLayersintNumber of decoder layers (default: 6).
numHeadsintNumber of attention heads (default: 8).
numQueriesintNumber of object queries (default: 100).
numStructureClassesintNumber of structure classes (default: 7).
Returns
- IEnumerable<ILayer<T>>
Enumerable of layers for TableTransformer.
Remarks
TableTransformer uses a DETR-style architecture with ResNet backbone.
Reference: "PubTables-1M: Towards Comprehensive Table Extraction" (CVPR 2022)
CreateDefaultTimeSformerLayers(int, int, int, int, int, int, int)
Creates default layers for TimeSformer video classification.
public static IEnumerable<ILayer<T>> CreateDefaultTimeSformerLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int embedDim = 768, int numLayers = 12, int patchSize = 16, int numClasses = 400)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height (default: 224).
inputWidthintInput frame width (default: 224).
embedDimintEmbedding dimension (default: 768).
numLayersintNumber of transformer layers (default: 12).
patchSizeintPatch size (default: 16).
numClassesintNumber of output classes (default: 400).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for TimeSformer video classification.
CreateDefaultTrOCRLayers(int, int, int, int, int, int, int, int, int, int)
Creates default layers for TrOCR text recognition model.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultTrOCRLayers(int imageSize = 384, int patchSize = 16, int encoderHiddenDim = 768, int decoderHiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 6, int numEncoderHeads = 12, int numDecoderHeads = 12, int vocabSize = 50265, int maxSequenceLength = 128)
Parameters
imageSizeintInput image size (default: 384).
patchSizeintViT patch size (default: 16).
encoderHiddenDimintEncoder hidden dimension (default: 768).
decoderHiddenDimintDecoder hidden dimension (default: 768).
numEncoderLayersintNumber of encoder layers (default: 12).
numDecoderLayersintNumber of decoder layers (default: 6).
numEncoderHeadsintNumber of encoder heads (default: 12).
numDecoderHeadsintNumber of decoder heads (default: 12).
vocabSizeintVocabulary size (default: 50265).
maxSequenceLengthintMaximum sequence length (default: 128).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Tuple of encoder and decoder layers.
Remarks
TrOCR uses a Vision Transformer (ViT) encoder and a Transformer decoder.
Reference: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" (AAAI 2022)
CreateDefaultTransformerLayers(TransformerArchitecture<T>)
Creates a default Transformer neural network with pre-configured encoder and decoder layers.
public static IEnumerable<ILayer<T>> CreateDefaultTransformerLayers(TransformerArchitecture<T> architecture)
Parameters
architectureTransformerArchitecture<T>The transformer architecture configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Transformer neural network.
Remarks
For Beginners: A Transformer is a powerful type of neural network especially good at processing sequences like text or time series data. Unlike older networks, Transformers can look at all parts of the input at once (using "attention") rather than processing it step by step. This makes them excellent for tasks like translation, text generation, and understanding language.
Key concepts:
- Attention: Allows the model to focus on relevant parts of the input regardless of position
- Multi-head attention: Lets the model focus on different aspects of the input simultaneously
- Encoder: Processes the input sequence
- Decoder: Generates the output sequence
- Positional encoding: Helps the model understand the order of elements in a sequence
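A minimal usage sketch (assumes a TransformerArchitecture<double> instance named transformerArchitecture, configured elsewhere with model dimension, head count, and encoder/decoder depth):
// The transformer-specific settings live on the architecture object,
// so no further parameters are needed here.
var layers = LayerHelper<double>.CreateDefaultTransformerLayers(transformerArchitecture);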
CreateDefaultTtsLayers(int, int, int, int, int, int, int, int, int, double)
Creates default TTS (Text-to-Speech) layers for speech synthesis.
public static IEnumerable<ILayer<T>> CreateDefaultTtsLayers(int textHiddenDim = 256, int audioHiddenDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int numMels = 80, int maxTextLength = 512, int maxMelFrames = 1000, int vocabSize = 148, double dropoutRate = 0.1)
Parameters
textHiddenDimintText encoder hidden dimension (default: 256).
audioHiddenDimintAudio decoder hidden dimension (default: 512).
numEncoderLayersintNumber of encoder transformer layers (default: 6).
numDecoderLayersintNumber of decoder transformer layers (default: 6).
numHeadsintNumber of attention heads (default: 8).
numMelsintNumber of mel spectrogram bins (default: 80).
maxTextLengthintMaximum input text length (default: 512).
maxMelFramesintMaximum mel spectrogram frames (default: 1000).
vocabSizeintPhoneme/character vocabulary size (default: 148).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a TTS encoder-decoder architecture.
Remarks
TTS architecture with:
- Character/phoneme embedding with positional encoding
- Transformer encoder for text representation
- Transformer decoder with cross-attention for mel generation
- Post-net convolutional refinement (simulated with dense layers)
Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2)
CreateDefaultUDOPLayers(int, int, int, int, int, int, int)
Creates default UDOP layers for unified document processing.
public static (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers) CreateDefaultUDOPLayers(int hiddenDim = 1024, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 16, int vocabSize = 50000, int imageSize = 224, int maxSequenceLength = 2048)
Parameters
hiddenDimintHidden dimension (default: 1024).
numEncoderLayersintNumber of encoder layers (default: 12).
numDecoderLayersintNumber of decoder layers (default: 12).
numHeadsintNumber of attention heads (default: 16).
vocabSizeintVocabulary size (default: 50000).
imageSizeintInput image size (default: 224).
maxSequenceLengthintMaximum sequence length (default: 2048).
Returns
- (IEnumerable<ILayer<T>> EncoderLayers, IEnumerable<ILayer<T>> DecoderLayers)
Tuple of encoder and decoder layers.
Remarks
Reference: "UDOP: Unifying Vision, Text, and Layout" (CVPR 2023)
CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates default layers for a 3D U-Net architecture for volumetric segmentation.
public static IEnumerable<ILayer<T>> CreateDefaultUNet3DLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numEncoderBlocks = 4, int baseFilters = 32)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
voxelResolutionintThe resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.
numEncoderBlocksintThe number of encoder blocks. Default is 4.
baseFiltersintThe number of filters in the first convolutional layer. Doubles with each block. Default is 32.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for 3D volumetric segmentation.
Remarks
For Beginners: A 3D U-Net is like a specialized 3D image processor that can identify different parts of a 3D volume (like organs in a CT scan or objects in a point cloud).
The U-shape architecture:
- Encoder: Progressively downsamples to capture context (like zooming out)
- Bottleneck: Smallest representation capturing global features
- Decoder: Progressively upsamples to restore resolution (like zooming in)
- Skip connections: Link encoder to decoder to preserve fine details
Applications include:
- 3D semantic segmentation of point clouds
- Medical image segmentation (organs, tumors in CT/MRI)
- Part segmentation of 3D shapes
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid dimensions.
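Example (illustrative; 'architecture' stands for a valid NeuralNetworkArchitecture<float> instance):
// With numEncoderBlocks = 4 and baseFilters = 32, encoder filter counts double per block: 32, 64, 128, 256.
var unet3dLayers = LayerHelper<float>.CreateDefaultUNet3DLayers(architecture, voxelResolution: 64);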
CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Universal Differential Equation (UDE) network.
public static IEnumerable<ILayer<T>> CreateDefaultUniversalDELayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 2, int hiddenLayerSize = 32)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenLayerCountintNumber of hidden layers (default: 2).
hiddenLayerSizeintNumber of neurons in each hidden layer (default: 32).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a UDE neural network component.
Remarks
For Beginners: Universal Differential Equations combine known physics with neural networks. The neural network learns the unknown parts of the dynamics while known physics equations are added explicitly. This is perfect for scientific applications where you know some of the physics but not all of it.
The network takes [state, time] as input and outputs the learned correction to the dynamics. Uses Tanh activation for smooth derivatives needed in ODE integration. Output uses linear (identity) activation since corrections can be positive or negative.
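Example (illustrative; 'architecture' is assumed to describe the [state, time] input and a state-sized output):
// The returned layers learn only the correction term; known physics is added outside the network.
var udeLayers = LayerHelper<float>.CreateDefaultUniversalDELayers(architecture, hiddenLayerCount: 3, hiddenLayerSize: 64);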
CreateDefaultVAELayers(NeuralNetworkArchitecture<T>, int)
Creates a default Variational Autoencoder (VAE) with pre-configured layers.
public static IEnumerable<ILayer<T>> CreateDefaultVAELayers(NeuralNetworkArchitecture<T> architecture, int latentSize)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
latentSizeintThe size of the latent space dimension.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Variational Autoencoder.
Remarks
For Beginners: A Variational Autoencoder (VAE) is a type of neural network that learns to compress data into a smaller representation (encoding) and then reconstruct it back (decoding). What makes VAEs special is that they create a "fuzzy" compressed representation rather than an exact one, which helps the network learn meaningful patterns in your data. This makes VAEs excellent for generating new data similar to your training examples.
The latent space is the compressed representation where your data exists in a simplified form. Think of it as a "creative space" where the network understands the essential features of your data.
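Example (illustrative; latentSize has no default and must be chosen for your data):
var vaeLayers = LayerHelper<float>.CreateDefaultVAELayers(architecture, latentSize: 16); // 16 is an arbitrary latent dimension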
CreateDefaultVGGLayers(NeuralNetworkArchitecture<T>, VGGConfiguration)
Creates layers for a VGG network based on the specified configuration.
public static IEnumerable<ILayer<T>> CreateDefaultVGGLayers(NeuralNetworkArchitecture<T> architecture, VGGConfiguration configuration)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
configurationVGGConfigurationThe VGG-specific configuration.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a VGG network.
Remarks
For Beginners: VGG networks are deep convolutional neural networks known for their simplicity and effectiveness. They use stacks of 3x3 convolutions followed by max pooling to progressively extract higher-level features from images.
The VGG architecture consists of:
- 5 convolutional blocks with increasing number of filters (64 -> 128 -> 256 -> 512 -> 512)
- Max pooling after each block to reduce spatial dimensions by half
- Optional batch normalization after each convolution (in _BN variants)
- 3 fully connected layers (4096 -> 4096 -> numClasses)
- Dropout regularization in the fully connected layers
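Example (illustrative; 'vggConfig' stands for a VGGConfiguration value, e.g. a VGG16 variant, whose construction is not shown on this page):
var vggLayers = LayerHelper<float>.CreateDefaultVGGLayers(architecture, vggConfig);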
CreateDefaultVRTLayers(int, int, int, int, int, int, int)
Creates layers for a VRT (Video Restoration Transformer) model.
public static IEnumerable<ILayer<T>> CreateDefaultVRTLayers(int inputChannels = 3, int inputHeight = 64, int inputWidth = 64, int embedDim = 120, int numFrames = 6, int numBlocks = 8, int scaleFactor = 4)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height (default: 64).
inputWidthintInput frame width (default: 64).
embedDimintEmbedding dimension (default: 120).
numFramesintNumber of temporal frames (default: 6).
numBlocksintNumber of transformer blocks (default: 8).
scaleFactorintUpscaling factor for super-resolution. Supported values: 1, 2, or 4 (default: 4).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video restoration.
Remarks
For Beginners: VRT (Video Restoration Transformer) is a powerful model for:
- Video super-resolution (increasing video resolution)
- Video deblurring (removing motion blur)
- Video denoising (removing noise from videos)
It uses attention mechanisms to leverage both spatial and temporal information from multiple video frames to produce high-quality restored frames.
Architecture (based on the paper):
- Shallow feature extraction from input frames
- Temporal mutual self-attention (TMSA) blocks
- Deep feature extraction with parallel warping
- Reconstruction module for output
Reference: "VRT: A Video Restoration Transformer" https://arxiv.org/abs/2201.12288
Exceptions
- ArgumentException
Thrown when scaleFactor is not 1, 2, or 4.
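Example (illustrative; T = float):
// scaleFactor accepts only 1, 2, or 4; anything else throws ArgumentException.
var vrtLayers = LayerHelper<float>.CreateDefaultVRTLayers(scaleFactor: 4, numFrames: 6);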
CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Variational Physics-Informed Neural Network (VPINN).
public static IEnumerable<ILayer<T>> CreateDefaultVariationalPINNLayers(NeuralNetworkArchitecture<T> architecture, int hiddenLayerCount = 4, int hiddenLayerSize = 50)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenLayerCountintNumber of hidden layers (default: 4).
hiddenLayerSizeintNumber of neurons in each hidden layer (default: 50).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a VPINN.
Remarks
For Beginners: Variational PINNs solve PDEs using the weak (variational) form instead of the strong form. This is similar to Finite Element Methods but using neural networks. Often more stable for complex PDEs than standard PINNs.
Uses Tanh activation throughout for smooth derivatives needed in variational formulation. Linear output layer since PDE solutions can take any real value.
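Example (illustrative; 'architecture' is a configured NeuralNetworkArchitecture<float>):
var vpinnLayers = LayerHelper<float>.CreateDefaultVariationalPINNLayers(architecture); // defaults: 4 hidden layers of 50 neurons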
CreateDefaultVideoMAELayers(int, int, int, int, int, int)
Creates default layers for VideoMAE (Video Masked Autoencoder) action recognition model.
public static IEnumerable<ILayer<T>> CreateDefaultVideoMAELayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int numFeatures = 768, int numClasses = 400, int tubeletSize = 2)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput height (default: 224).
inputWidthintInput width (default: 224).
numFeaturesintNumber of feature channels (default: 768).
numClassesintNumber of action classes (default: 400 for Kinetics).
tubeletSizeintTemporal size of each tube (default: 2).
Returns
- IEnumerable<ILayer<T>>
An enumerable of layers configured for VideoMAE.
Remarks
For Beginners: VideoMAE is a self-supervised learning model that learns video representations by masking and reconstructing video patches. It's used for action recognition and video understanding tasks.
Architecture:
- 3D patch embedding (spatiotemporal)
- Transformer encoder blocks
- Classification head for action recognition
- Decoder for masked reconstruction during pretraining
Reference: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training" https://arxiv.org/abs/2203.12602
CreateDefaultVideoStabilizationLayers(int, int, int)
Creates layers for a video stabilization model (StabNet-style).
public static IEnumerable<ILayer<T>> CreateDefaultVideoStabilizationLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height (default: 128).
inputWidthintInput frame width (default: 128).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video stabilization.
Remarks
For Beginners: Video stabilization removes camera shake. The model predicts how to warp each frame to align with a smooth camera path. This is similar to what smartphone cameras do in real-time.
Architecture:
- Feature encoder processes input frames
- Motion estimator predicts camera motion
- Smoother learns the smooth target path
- Warper transforms frames to match smooth path
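Example (illustrative; T = float):
var stabilizationLayers = LayerHelper<float>.CreateDefaultVideoStabilizationLayers(inputHeight: 128, inputWidth: 128);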
CreateDefaultVideoSuperResolutionLayers(int, int, int, int, int, int, bool)
Creates layers for a video super-resolution model (Real-ESRGAN/BasicVSR++ style).
public static IEnumerable<ILayer<T>> CreateDefaultVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int numFeatures = 64, int numResBlocks = 16, int scaleFactor = 2, bool useTemporalConsistency = true)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput video height (default: 128).
inputWidthintInput video width (default: 128).
numFeaturesintNumber of feature channels (default: 64).
numResBlocksintNumber of residual blocks (default: 16).
scaleFactorintUpscaling factor (default: 2).
useTemporalConsistencyboolWhether to add temporal aggregation layer (default: true).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for video super-resolution.
Remarks
For Beginners: Super-resolution models increase video resolution. This architecture uses residual blocks (skip connections) to preserve details while learning to add new ones. The upsampling at the end increases the spatial size by the scale factor.
Architecture overview:
- Initial convolution to extract features
- Multiple residual blocks for deep feature learning
- Temporal aggregation for video consistency (optional)
- Pixel shuffle upsampling for resolution increase
- Final convolution for output reconstruction
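Example (illustrative; T = float):
// With scaleFactor = 2, a 128x128 input is reconstructed at 256x256.
var vsrLayers = LayerHelper<float>.CreateDefaultVideoSuperResolutionLayers(scaleFactor: 2, useTemporalConsistency: true);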
CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T>, int, int, int, int[]?)
Creates default VoxLingua107 layers for 107-language identification.
public static IEnumerable<ILayer<T>> CreateDefaultVoxLingua107Layers(NeuralNetworkArchitecture<T> architecture, int numMels = 80, int tdnnChannels = 1024, int embeddingDimension = 256, int[]? dilations = null)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
numMelsintNumber of mel filterbank channels (default: 80).
tdnnChannelsintNumber of TDNN channels (default: 1024).
embeddingDimensionintEmbedding dimension (default: 256).
dilationsint[]Dilation factors for TDNN layers (default: [1, 2, 3, 4, 1]).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a VoxLingua107 language identifier.
Remarks
VoxLingua107 uses ECAPA-TDNN architecture trained on 107 languages from the VoxLingua107 dataset (YouTube speech samples).
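Example (illustrative; 'architecture' is a configured NeuralNetworkArchitecture<float>):
// Passing null for dilations selects the documented default of [1, 2, 3, 4, 1].
var lidLayers = LayerHelper<float>.CreateDefaultVoxLingua107Layers(architecture, dilations: null);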
CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T>, int, int, int)
Creates default layers for a Voxel-based 3D Convolutional Neural Network.
public static IEnumerable<ILayer<T>> CreateDefaultVoxelCNNLayers(NeuralNetworkArchitecture<T> architecture, int voxelResolution = 32, int numConvBlocks = 3, int baseFilters = 32)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture specification.
voxelResolutionintThe resolution of the voxel grid (e.g., 32 for 32x32x32). Default is 32.
numConvBlocksintThe number of convolutional blocks (each block has Conv3D + MaxPool3D). Default is 3.
baseFiltersintThe number of filters in the first convolutional layer. Doubles with each block. Default is 32.
Returns
- IEnumerable<ILayer<T>>
A collection of layers configured for voxel-based 3D classification.
Remarks
For Beginners: A Voxel CNN is like a 3D version of a regular image classifier. Instead of looking at a 2D image, it examines a 3D grid of "blocks" (voxels) to understand 3D shapes. This is like how Minecraft represents the world - each block is either filled or empty, and the pattern of blocks creates recognizable objects.
The architecture follows a standard pattern:
- Multiple Conv3D + MaxPool3D blocks to extract hierarchical 3D features
- Each block doubles the number of filters while halving the spatial resolution
- Global average pooling to aggregate spatial information
- Dense output layer for classification
Applications include:
- Recognizing 3D objects from voxelized point clouds (e.g., ModelNet40)
- Medical image analysis (CT, MRI volumetric scans)
- Spatial occupancy prediction from depth sensors
Exceptions
- InvalidOperationException
Thrown when the architecture has invalid input or output dimensions.
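Example (illustrative; 'architecture' is a configured NeuralNetworkArchitecture<float>):
// Three blocks on a 32^3 grid: filters grow 32 -> 64 -> 128 while the resolution halves each block.
var voxelLayers = LayerHelper<float>.CreateDefaultVoxelCNNLayers(architecture, voxelResolution: 32, numConvBlocks: 3);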
CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T>, int, int, int, int, int, double)
Creates default Wav2Vec2 layers for spoken language identification.
public static IEnumerable<ILayer<T>> CreateDefaultWav2Vec2LanguageIdentifierLayers(NeuralNetworkArchitecture<T> architecture, int hiddenSize = 768, int numLayers = 12, int numAttentionHeads = 12, int intermediateSize = 3072, int numLanguages = 20, double dropoutRate = 0.1)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
hiddenSizeintHidden size of transformer (default: 768).
numLayersintNumber of transformer layers (default: 12).
numAttentionHeadsintNumber of attention heads (default: 12).
intermediateSizeintFeed-forward intermediate size (default: 3072).
numLanguagesintNumber of languages to classify (default: 20).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Wav2Vec2 language identifier.
Remarks
Wav2Vec2-LID uses Meta's self-supervised speech representation model:
- 7-layer CNN feature encoder processing raw waveform
- Transformer encoder for contextual representations
- Classification head for language prediction
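Example (illustrative; 'architecture' is a configured NeuralNetworkArchitecture<float>):
var w2v2LidLayers = LayerHelper<float>.CreateDefaultWav2Vec2LanguageIdentifierLayers(architecture, numLanguages: 20);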
CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, double)
Creates default layers for Whisper-style speech recognition models.
public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int numMels = 80, int modelDimension = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int feedForwardDim = 2048, int vocabularySize = 51865, int maxSequenceLength = 1500, double dropoutRate = 0.1)
Parameters
numMelsintNumber of mel spectrogram bins (default: 80).
modelDimensionintHidden dimension of the model (default: 512).
numEncoderLayersintNumber of encoder layers (default: 6).
numDecoderLayersintNumber of decoder layers (default: 6).
numHeadsintNumber of attention heads (default: 8).
feedForwardDimintFeed-forward dimension (default: 2048).
vocabularySizeintOutput vocabulary size (default: 51865).
maxSequenceLengthintMaximum sequence length (default: 1500).
dropoutRatedoubleDropout rate (default: 0.1).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Whisper-style ASR model.
Remarks
For Beginners: Whisper is an encoder-decoder transformer for speech recognition.
The architecture consists of:
- Audio encoder: Converts mel spectrograms to hidden representations
- Convolutional layers to process spectrogram
- Transformer encoder layers with self-attention
- Text decoder: Generates text tokens autoregressively
- Embedding layer for text tokens
- Transformer decoder layers with self-attention
- Output projection to vocabulary
This creates a trainable model structure from scratch. The decoder layers expect encoder outputs to be provided during the forward pass (as implemented in WhisperModel<T>). For inference with pre-trained weights, use the ONNX-based WhisperModel.CreateAsync() method instead.
CreateDefaultWhisperLayers(int, int, int, int, int, int, int, int, int, double)
Creates default Whisper layers for automatic speech recognition.
public static IEnumerable<ILayer<T>> CreateDefaultWhisperLayers(int modelDim = 512, int numEncoderLayers = 6, int numDecoderLayers = 6, int numHeads = 8, int ffDim = 2048, int numMels = 80, int maxFrames = 3000, int maxTokens = 448, int vocabSize = 51865, double dropoutRate = 0)
Parameters
modelDimintModel hidden dimension (default: 512 for Base).
numEncoderLayersintNumber of encoder transformer layers (default: 6 for Base).
numDecoderLayersintNumber of decoder transformer layers (default: 6 for Base).
numHeadsintNumber of attention heads (default: 8 for Base).
ffDimintFeed-forward hidden dimension (default: 2048 for Base).
numMelsintNumber of mel spectrogram bins (default: 80).
maxFramesintMaximum mel spectrogram frames (default: 3000 for 30s audio).
maxTokensintMaximum output token sequence length (default: 448).
vocabSizeintWhisper vocabulary size (default: 51865).
dropoutRatedoubleDropout rate (default: 0.0 for inference-optimized).
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Whisper encoder-decoder architecture.
Remarks
Whisper is OpenAI's state-of-the-art automatic speech recognition model with:
- Mel spectrogram audio preprocessing (80 bins, 16kHz)
- Convolutional stem for initial audio feature extraction
- Transformer encoder for audio representation learning
- Transformer decoder with cross-attention for text generation
- Support for 99+ languages and translation to English
Reference: "Robust Speech Recognition via Large-Scale Weak Supervision" by Radford et al., 2022
CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T>, int, int)
Creates default layers for a Word2Vec model (Skip-Gram or CBOW).
public static IEnumerable<ILayer<T>> CreateDefaultWord2VecLayers(NeuralNetworkArchitecture<T> architecture, int vocabSize, int embeddingDimension)
Parameters
architectureNeuralNetworkArchitecture<T>The neural network architecture configuration.
vocabSizeintThe size of the vocabulary.
embeddingDimensionintThe dimension of the embedding vectors.
Returns
- IEnumerable<ILayer<T>>
A collection of layers forming a Word2Vec model.
Remarks
For Beginners: Word2Vec learns to represent words as vectors of numbers (embeddings) such that words with similar meanings are close to each other.
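Example (illustrative; vocabSize and embeddingDimension have no defaults, so the values below are arbitrary):
var word2VecLayers = LayerHelper<float>.CreateDefaultWord2VecLayers(architecture, vocabSize: 10000, embeddingDimension: 100);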
CreateDefaultXMemLayers(int, int, int, int)
Creates layers for an XMem long-term video object segmentation model.
public static IEnumerable<ILayer<T>> CreateDefaultXMemLayers(int inputChannels = 3, int inputHeight = 480, int inputWidth = 854, int numFeatures = 256)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput frame height (default: 480).
inputWidthintInput frame width (default: 854).
numFeaturesintFeature dimension (default: 256).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for long-term video object segmentation.
Remarks
For Beginners: XMem is designed for tracking objects in very long videos using a three-tier memory system inspired by human memory:
- Sensory memory: Very recent frames (high detail, fast to forget)
- Working memory: Important recent frames (moderate detail)
- Long-term memory: Key historical frames (compressed, permanent)
Reference: "XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model" https://arxiv.org/abs/2207.07115
CreateSAM2ImageEncoderLayers(int, int, int, int)
Creates the image encoder layers for SAM2 (Segment Anything Model 2).
public static IEnumerable<ILayer<T>> CreateSAM2ImageEncoderLayers(int inputChannels = 3, int inputHeight = 1024, int inputWidth = 1024, int numFeatures = 256)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput height (default: 1024).
inputWidthintInput width (default: 1024).
numFeaturesintNumber of output feature channels (default: 256).
Returns
- IEnumerable<ILayer<T>>
Image encoder layers that downsample input to feature maps.
Remarks
For Beginners: This creates the image encoder part of SAM2, which processes input images into feature maps. The output has shape [numFeatures, H/16, W/16].
Note: SAM2 is a multi-branch architecture. Use separate factory methods (see the combined sketch below):
- CreateSAM2ImageEncoderLayers: Image feature extraction (this method)
- CreateSAM2PromptEncoderLayers: Point/box/mask prompt encoding
- CreateSAM2MemoryLayers: Temporal memory attention
- CreateSAM2MaskDecoderLayers: Mask prediction head
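A combined sketch of creating all SAM2 branches (illustrative only; wiring the branches together at run time is the host model's responsibility and is not shown on this page):
var imageEncoder = LayerHelper<float>.CreateSAM2ImageEncoderLayers(inputHeight: 1024, inputWidth: 1024);
var promptEncoder = LayerHelper<float>.CreateSAM2PromptEncoderLayers();
var memoryLayers = LayerHelper<float>.CreateSAM2MemoryLayers();
var decoderTrunk = LayerHelper<float>.CreateSAM2MaskDecoderLayers();
// After the shared trunk, branch into the three prediction heads:
var maskHead = LayerHelper<float>.CreateSAM2MaskHead(numMaskCandidates: 4);
var iouHead = LayerHelper<float>.CreateSAM2IoUHead(numMaskCandidates: 4);
var occlusionHead = LayerHelper<float>.CreateSAM2OcclusionHead();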
CreateSAM2IoUHead(int, int, int, int)
Creates the IoU (Intersection over Union) prediction head for SAM2.
public static IEnumerable<ILayer<T>> CreateSAM2IoUHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)
Parameters
numFeaturesintNumber of input feature channels (default: 256).
featureHeightintHeight of feature maps (default: 64).
featureWidthintWidth of feature maps (default: 64).
numMaskCandidatesintNumber of mask candidates (default: 4).
Returns
- IEnumerable<ILayer<T>>
IoU prediction layers. Output shape: [numMaskCandidates]
Remarks
For Beginners: This head predicts the quality (IoU score) for each mask candidate. Higher scores indicate better masks. Used to select the best mask from candidates.
CreateSAM2MaskDecoderLayers(int, int, int)
Creates the shared mask decoder refinement layers for SAM2.
public static IEnumerable<ILayer<T>> CreateSAM2MaskDecoderLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)
Parameters
numFeaturesintNumber of feature channels (default: 256).
featureHeightintHeight of feature maps (default: 64).
featureWidthintWidth of feature maps (default: 64).
Returns
- IEnumerable<ILayer<T>>
Shared refinement layers that process fused features.
Remarks
For Beginners: These layers refine the combined image and prompt features before branching into separate prediction heads. Output shape: [numFeatures, h, w]
Usage: Apply these layers first, then branch to the three separate heads:
- CreateSAM2MaskHead: Produces mask candidates
- CreateSAM2IoUHead: Predicts mask quality scores
- CreateSAM2OcclusionHead: Predicts occlusion
CreateSAM2MaskHead(int, int, int, int)
Creates the mask prediction head for SAM2.
public static IEnumerable<ILayer<T>> CreateSAM2MaskHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64, int numMaskCandidates = 4)
Parameters
numFeaturesintNumber of input feature channels (default: 256).
featureHeightintHeight of feature maps (default: 64).
featureWidthintWidth of feature maps (default: 64).
numMaskCandidatesintNumber of mask candidates to output (default: 4).
Returns
- IEnumerable<ILayer<T>>
Mask prediction layers. Output shape: [numMaskCandidates, h, w]
Remarks
For Beginners: This head produces multiple candidate segmentation masks. Each candidate is a probability map indicating object presence at each pixel.
CreateSAM2MemoryLayers(int, int, int)
Creates the memory attention layers for SAM2 temporal consistency.
public static IEnumerable<ILayer<T>> CreateSAM2MemoryLayers(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)
Parameters
numFeaturesintNumber of feature channels (default: 256).
featureHeightintHeight of feature maps (default: 64).
featureWidthintWidth of feature maps (default: 64).
Returns
- IEnumerable<ILayer<T>>
Memory attention layers for video object tracking.
Remarks
For Beginners: Memory layers help SAM2 track objects across video frames by maintaining a memory of past segmentations and matching them to new frames.
CreateSAM2OcclusionHead(int, int, int)
Creates the occlusion prediction head for SAM2.
public static IEnumerable<ILayer<T>> CreateSAM2OcclusionHead(int numFeatures = 256, int featureHeight = 64, int featureWidth = 64)
Parameters
numFeaturesintNumber of input feature channels (default: 256).
featureHeightintHeight of feature maps (default: 64).
featureWidthintWidth of feature maps (default: 64).
Returns
- IEnumerable<ILayer<T>>
Occlusion prediction layers. Output shape: [1]
Remarks
For Beginners: This head predicts whether the tracked object is occluded (hidden by other objects). A high score indicates the object may be temporarily invisible.
CreateSAM2PromptEncoderLayers(int, int, int)
Creates the prompt encoder layers for SAM2 (point, box, and mask prompts).
public static IEnumerable<ILayer<T>> CreateSAM2PromptEncoderLayers(int numFeatures = 256, int maskHeight = 256, int maskWidth = 256)
Parameters
numFeaturesintNumber of output feature channels (default: 256).
maskHeightintHeight of mask prompt input (default: 256).
maskWidthintWidth of mask prompt input (default: 256).
Returns
- IEnumerable<ILayer<T>>
Prompt encoder layers for different prompt types.
Remarks
For Beginners: SAM2 accepts different types of prompts to tell it what to segment:
- Points: Click on the object (x, y coordinates)
- Boxes: Draw a bounding box (x1, y1, x2, y2)
- Masks: Provide an initial mask estimate
Usage: These layers are applied to prompt inputs separately, then combined with image features in the mask decoder. They are NOT chained sequentially with the image encoder.
CreateSimpleVideoSuperResolutionLayers(int, int, int, int)
Creates a simple super-resolution architecture for testing and lightweight use.
public static IEnumerable<ILayer<T>> CreateSimpleVideoSuperResolutionLayers(int inputChannels = 3, int inputHeight = 128, int inputWidth = 128, int scaleFactor = 2)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput video height (default: 128).
inputWidthintInput video width (default: 128).
scaleFactorintUpscaling factor (default: 2).
Returns
- IEnumerable<ILayer<T>>
A collection of layers for simple super-resolution.
Remarks
For Beginners: This is a smaller, faster model that trades quality for speed. Good for real-time applications or when GPU memory is limited.
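Example (illustrative; T = float):
// Lightweight alternative to CreateDefaultVideoSuperResolutionLayers for tests or constrained devices.
var simpleVsrLayers = LayerHelper<float>.CreateSimpleVideoSuperResolutionLayers(scaleFactor: 2);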
CreateSlowFastFastPathwayLayers(int, int, int, int)
Creates the fast pathway layers for SlowFast video recognition.
public static IEnumerable<ILayer<T>> CreateSlowFastFastPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int fastChannels = 8)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput height (default: 224).
inputWidthintInput width (default: 224).
fastChannelsintBase channel count for fast pathway (default: 8).
Returns
- IEnumerable<ILayer<T>>
Fast pathway layers that process more frames at lower capacity.
Remarks
For Beginners: The fast pathway processes video at a high frame rate (e.g., 32 fps) but with lower channel capacity (1/8 of slow pathway). It captures motion and temporal dynamics. Output shape: [fastChannels * 8, H/16, W/16]
CreateSlowFastFusionLayers(int, int, int, int, int)
Creates the fusion and classification layers for SlowFast.
public static IEnumerable<ILayer<T>> CreateSlowFastFusionLayers(int slowChannels = 64, int fastChannels = 8, int featureHeight = 14, int featureWidth = 14, int numClasses = 400)
Parameters
slowChannelsintBase channel count for slow pathway (default: 64).
fastChannelsintBase channel count for fast pathway (default: 8).
featureHeightintHeight of feature maps after pathways (default: 14).
featureWidthintWidth of feature maps after pathways (default: 14).
numClassesintNumber of action classes (default: 400 for Kinetics).
Returns
- IEnumerable<ILayer<T>>
Fusion layers that combine pathways and classify actions.
Remarks
For Beginners: This fuses the slow and fast pathway features (after concatenation) and produces the final action classification. The SlowFast model should:
1. Run slow pathway on subsampled frames
2. Run fast pathway on all frames
3. Concatenate outputs along channel dimension
4. Apply these fusion layers
CreateSlowFastSlowPathwayLayers(int, int, int, int)
Creates the slow pathway layers for SlowFast video recognition.
public static IEnumerable<ILayer<T>> CreateSlowFastSlowPathwayLayers(int inputChannels = 3, int inputHeight = 224, int inputWidth = 224, int slowChannels = 64)
Parameters
inputChannelsintNumber of input channels (default: 3 for RGB).
inputHeightintInput height (default: 224).
inputWidthintInput width (default: 224).
slowChannelsintBase channel count for slow pathway (default: 64).
Returns
- IEnumerable<ILayer<T>>
Slow pathway layers that process fewer frames at higher capacity.
Remarks
For Beginners: The slow pathway processes video at a low frame rate (e.g., 4 fps) but with high channel capacity. It captures spatial semantics and appearance features. Output shape: [slowChannels * 8, H/16, W/16]
Note: SlowFast is a dual-pathway architecture. Use separate factory methods (combined in the sketch below):
- CreateSlowFastSlowPathwayLayers: Low frame rate, high capacity (this method)
- CreateSlowFastFastPathwayLayers: High frame rate, low capacity
- CreateSlowFastFusionLayers: Combines pathways for classification
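A combined sketch of the dual-pathway setup (illustrative only; running the pathways and concatenating their outputs happens in the host model and is not shown here):
var slowPathway = LayerHelper<float>.CreateSlowFastSlowPathwayLayers(slowChannels: 64);
var fastPathway = LayerHelper<float>.CreateSlowFastFastPathwayLayers(fastChannels: 8);
// After concatenating pathway outputs along the channel dimension:
var fusionLayers = LayerHelper<float>.CreateSlowFastFusionLayers(slowChannels: 64, fastChannels: 8, numClasses: 400);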