Class VideoCLIPNeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

VideoCLIP neural network for video-text alignment and temporal understanding.

public class VideoCLIPNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IVideoCLIPModel<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
VideoCLIPNeuralNetwork<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

VideoCLIP extends CLIP's contrastive learning paradigm to the video domain, enabling text-to-video and video-to-text retrieval, action recognition, and temporal understanding.

For Beginners: VideoCLIP is like CLIP but for videos!

Architecture overview:

  1. Vision Encoder: Extracts features from each frame (shared CLIP ViT)
  2. Temporal Encoder: Aggregates frame features over time
  3. Text Encoder: Processes text descriptions
  4. Contrastive Learning: Aligns video and text in shared embedding space

Key capabilities:

  • Video retrieval: Find videos matching text descriptions
  • Action recognition: Classify actions zero-shot, without task-specific training
  • Moment localization: Find specific moments in videos
  • Video QA: Answer questions about video content

Constructors

VideoCLIPNeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, int, int, double, string, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a VideoCLIP network using native library layers.

public VideoCLIPNeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 224, int channels = 3, int patchSize = 16, int vocabularySize = 49408, int maxSequenceLength = 77, int embeddingDimension = 512, int visionHiddenDim = 768, int textHiddenDim = 512, int numFrameEncoderLayers = 12, int numTemporalLayers = 4, int numTextLayers = 12, int numHeads = 12, int numFrames = 8, double frameRate = 1, string temporalAggregation = "temporal_transformer", ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
imageSize int
channels int
patchSize int
vocabularySize int
maxSequenceLength int
embeddingDimension int
visionHiddenDim int
textHiddenDim int
numFrameEncoderLayers int
numTemporalLayers int
numTextLayers int
numHeads int
numFrames int
frameRate double
temporalAggregation string
tokenizer ITokenizer
optimizer IOptimizer<T, Tensor<T>, Tensor<T>>
lossFunction ILossFunction<T>
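
To show how the default sizes relate, the sketch below works through the patch arithmetic of a ViT-style frame encoder, which splits each frame into non-overlapping patches. PatchMath and its methods are hypothetical helper names, not part of AiDotNet. Whether the library adds a class token or handles sizes that are not a multiple of patchSize is not stated on this page, so the sketch rejects such sizes.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet: arithmetic on the constructor defaults.
public static class PatchMath
{
    // Number of non-overlapping patchSize x patchSize patches in one square frame.
    public static int PatchesPerFrame(int imageSize, int patchSize)
    {
        if (imageSize <= 0 || patchSize <= 0 || imageSize % patchSize != 0)
            throw new ArgumentException("imageSize must be a positive multiple of patchSize.");
        int perSide = imageSize / patchSize;
        return perSide * perSide;
    }

    // Number of values in one flattened patch (all channels included).
    public static int PatchLength(int channels, int patchSize)
    {
        return channels * patchSize * patchSize;
    }

    // Total patches across all sampled frames of one clip.
    public static int PatchesPerClip(int imageSize, int patchSize, int numFrames)
    {
        return PatchesPerFrame(imageSize, patchSize) * numFrames;
    }
}
```

With the defaults (imageSize = 224, patchSize = 16, channels = 3, numFrames = 8), this gives 196 patches per frame, 768 values per flattened patch, and 1568 patches per clip.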

VideoCLIPNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, ITokenizer, int, double, string, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a VideoCLIP network using pretrained ONNX models.

public VideoCLIPNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string videoEncoderPath, string textEncoderPath, ITokenizer tokenizer, int numFrames = 8, double frameRate = 1, string temporalAggregation = "temporal_transformer", int embeddingDimension = 512, int maxSequenceLength = 77, int imageSize = 224, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
videoEncoderPath string
textEncoderPath string
tokenizer ITokenizer
numFrames int
frameRate double
temporalAggregation string
embeddingDimension int
maxSequenceLength int
imageSize int
optimizer IOptimizer<T, Tensor<T>, Tensor<T>>
lossFunction ILossFunction<T>

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

FrameRate

Gets the frame rate (frames per second) for video sampling.

public double FrameRate { get; }

Property Value

double

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

NumFrames

Gets the number of frames the model processes per video clip.

public int NumFrames { get; }

Property Value

int

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.

Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.
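
The caching behavior described above can be pictured with a minimal sketch. CachedParameterCounter is a hypothetical class written for illustration only, and the actual AiDotNet implementation may differ. It stores per-layer counts, caches their sum, and clears the cache whenever a layer is added.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of a count that is cached and invalidated on change.
// This is not AiDotNet source code.
public sealed class CachedParameterCounter
{
    private readonly List<int> _layerParameterCounts = new List<int>();
    private int? _cachedTotal; // null means the sum must be recomputed

    // Counts how often the sum was actually recomputed (for demonstration).
    public int RecomputeCount { get; private set; }

    public void AddLayer(int parameterCount)
    {
        _layerParameterCounts.Add(parameterCount);
        _cachedTotal = null; // modifying the layers invalidates the cache
    }

    public int ParameterCount
    {
        get
        {
            if (!_cachedTotal.HasValue)
            {
                int total = 0;
                foreach (int count in _layerParameterCounts) total += count;
                _cachedTotal = total;
                RecomputeCount++;
            }
            return _cachedTotal.Value;
        }
    }
}
```

Reading ParameterCount repeatedly without changing the layers recomputes the sum only once.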

TemporalAggregation

Gets the temporal aggregation method used.

public string TemporalAggregation { get; }

Property Value

string

Remarks

Common methods: "mean_pooling", "temporal_transformer", "late_fusion"

Methods

AnswerVideoQuestion(IEnumerable<Tensor<T>>, string, int)

Answers a question about video content.

public string AnswerVideoQuestion(IEnumerable<Tensor<T>> frames, string question, int maxLength = 64)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

question string

Question about the video.

maxLength int

Maximum answer length.

Returns

string

Generated answer.

Remarks

For Beginners: Ask questions about videos!

Examples:

  • "What is the person doing?" → "Playing guitar"
  • "How many people are in the video?" → "Three"
  • "What happens at the end?" → "The dog catches the frisbee"

Backward(Tensor<T>)

Backward pass through video encoder layers.

public Tensor<T> Backward(Tensor<T> gradient)

Parameters

gradient Tensor<T>

Returns

Tensor<T>

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>
imageEmbedding Vector<T>

Returns

T

Similarity score (cosine similarity for normalized embeddings).
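
For reference, cosine similarity can be sketched on plain arrays as follows. CosineSketch is a hypothetical helper that uses double[] in place of the library's Vector<T>; it is not the library's implementation. For embeddings that are already L2-normalized, the result equals the dot product.

```csharp
using System;

// Hypothetical helper on plain arrays; not part of AiDotNet.
public static class CosineSketch
{
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Convention chosen for this sketch: a zero vector has similarity 0.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```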

ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>>, IEnumerable<Tensor<T>>)

Computes temporal similarity matrix between video segments.

public Tensor<T> ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>> video1Frames, IEnumerable<Tensor<T>> video2Frames)

Parameters

video1Frames IEnumerable<Tensor<T>>

First video frames.

video2Frames IEnumerable<Tensor<T>>

Second video frames.

Returns

Tensor<T>

Similarity matrix with shape [numFrames1, numFrames2].

Remarks

Useful for video alignment, finding corresponding moments, or detecting repetitions.
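
One plausible way to build such a matrix is to compare every frame feature of the first video with every frame feature of the second using cosine similarity. The sketch below assumes that per-frame feature vectors are already available as plain arrays. TemporalMatrixSketch is a hypothetical name, and the library may compute the matrix differently.

```csharp
using System;

// Hypothetical sketch: pairwise cosine similarity between two sets of frame features.
public static class TemporalMatrixSketch
{
    public static double[,] SimilarityMatrix(double[][] video1, double[][] video2)
    {
        var matrix = new double[video1.Length, video2.Length];
        for (int i = 0; i < video1.Length; i++)
            for (int j = 0; j < video2.Length; j++)
                matrix[i, j] = Cosine(video1[i], video2[j]);
        return matrix; // shape [numFrames1, numFrames2]
    }

    private static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Feature vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.Length; k++)
        {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

High values along a diagonal of the matrix suggest that the two videos progress through similar content in the same order.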

ComputeVideoTextSimilarity(string, IEnumerable<Tensor<T>>)

Computes similarity between a text description and a video.

public T ComputeVideoTextSimilarity(string text, IEnumerable<Tensor<T>> frames)

Parameters

text string

Text description of an action or event.

frames IEnumerable<Tensor<T>>

Video frames to compare against.

Returns

T

Similarity score, typically in the range [-1, 1].

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy from SerializeNetworkSpecificData, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

ExtractFrameFeatures(IEnumerable<Tensor<T>>)

Extracts frame-level features before temporal aggregation.

public Tensor<T> ExtractFrameFeatures(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

Returns

Tensor<T>

Feature tensor with shape [numFrames, featureDim].

GenerateVideoCaption(IEnumerable<Tensor<T>>, int)

Generates a caption describing the video content.

public string GenerateVideoCaption(IEnumerable<Tensor<T>> frames, int maxLength = 77)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to caption.

maxLength int

Maximum caption length.

Returns

string

Generated caption describing the video.

GetImageEmbedding(Tensor<T>)

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

GetParameters()

Gets all trainable parameters of the network as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters of the network.

Remarks

For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.

GetTextEmbedding(string)

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

GetVideoEmbedding(IEnumerable<Tensor<T>>)

Converts a video (sequence of frames) into an embedding vector.

public Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of preprocessed frame tensors with shape [channels, height, width].

Returns

Vector<T>

A normalized embedding vector representing the entire video.

Remarks

For Beginners: This converts a video into a single vector!

Process:

  1. Each frame is encoded independently (like CLIP)
  2. Frame features are aggregated over time
  3. Result is a single vector capturing the video's content and actions

Now you can compare videos to text or other videos!
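
Steps 2 and 3 can be illustrated with the simplest aggregation, mean pooling followed by L2 normalization. This is only a sketch: the default temporalAggregation is "temporal_transformer", which is a learned layer, not a plain average. MeanPoolingSketch is a hypothetical helper that works on plain arrays of already-encoded frame features.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of mean-pooling aggregation; not AiDotNet source code.
public static class MeanPoolingSketch
{
    // Averages per-frame feature vectors into one vector.
    public static double[] MeanPool(IReadOnlyList<double[]> frameFeatures)
    {
        if (frameFeatures.Count == 0)
            throw new ArgumentException("At least one frame is required.");

        int dim = frameFeatures[0].Length;
        var pooled = new double[dim];
        foreach (double[] frame in frameFeatures)
        {
            if (frame.Length != dim)
                throw new ArgumentException("All frames must share one feature size.");
            for (int i = 0; i < dim; i++) pooled[i] += frame[i];
        }
        for (int i = 0; i < dim; i++) pooled[i] /= frameFeatures.Count;
        return pooled;
    }

    // Scales a vector to unit length; a zero vector is returned unchanged.
    public static double[] L2Normalize(double[] vector)
    {
        double sumSquares = 0;
        foreach (double x in vector) sumSquares += x * x;
        double norm = Math.Sqrt(sumSquares);

        var result = (double[])vector.Clone();
        if (norm == 0) return result;
        for (int i = 0; i < result.Length; i++) result[i] /= norm;
        return result;
    }
}
```

For example, pooling the two frame features (3, 0) and (3, 8) gives (3, 4), which normalizes to (0.6, 0.8).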

GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>>)

Converts multiple videos into embedding vectors in a batch.

public IEnumerable<Vector<T>> GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>> videos)

Parameters

videos IEnumerable<IEnumerable<Tensor<T>>>

Collection of videos, each as a sequence of frames.

Returns

IEnumerable<Vector<T>>

Collection of normalized embedding vectors.

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

LocalizeMoments(IEnumerable<Tensor<T>>, string, int)

Localizes moments in a video that match a text description.

public IEnumerable<(int StartFrame, int EndFrame, T Score)> LocalizeMoments(IEnumerable<Tensor<T>> frames, string query, int windowSize = 16)

Parameters

frames IEnumerable<Tensor<T>>

Full video as a sequence of frames.

query string

Text describing the moment to find.

windowSize int

Number of frames per moment window.

Returns

IEnumerable<(int StartFrame, int EndFrame, T Score)>

Sequence of (StartFrame, EndFrame, Score) tuples for matching moments.

Remarks

For Beginners: Find specific moments in a video!

Example:

  • Video: 5 minutes of a cooking show
  • Query: "chopping vegetables"
  • Result: [(300, 450, 0.92), (1200, 1350, 0.87)] - two segments where chopping happens
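
A sliding-window search is one common way to produce results of this shape. The sketch below assumes a per-frame similarity score to the query has already been computed. It averages the scores inside each window and returns the windows sorted from best to worst. MomentWindowSketch, the stride parameter, and the exclusive EndFrame convention are assumptions made for illustration; this page does not say how the library scores or merges windows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sliding-window scorer; not AiDotNet source code.
public static class MomentWindowSketch
{
    // frameScores[i] is the similarity of frame i to the text query.
    // EndFrame is exclusive in this sketch.
    public static List<(int StartFrame, int EndFrame, double Score)> ScoreWindows(
        double[] frameScores, int windowSize, int stride)
    {
        if (windowSize <= 0 || stride <= 0)
            throw new ArgumentException("windowSize and stride must be positive.");

        var windows = new List<(int StartFrame, int EndFrame, double Score)>();
        for (int start = 0; start + windowSize <= frameScores.Length; start += stride)
        {
            double sum = 0;
            for (int i = start; i < start + windowSize; i++) sum += frameScores[i];
            windows.Add((start, start + windowSize, sum / windowSize));
        }

        // Best-matching windows first.
        return windows.OrderByDescending(w => w.Score).ToList();
    }
}
```

A caller would typically keep only the first few windows, or those above a score threshold.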

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

PredictNextAction(IEnumerable<Tensor<T>>, IEnumerable<string>)

Predicts the next action or event in a video.

public Dictionary<string, T> PredictNextAction(IEnumerable<Tensor<T>> frames, IEnumerable<string> possibleNextActions)

Parameters

frames IEnumerable<Tensor<T>>

Observed video frames.

possibleNextActions IEnumerable<string>

Candidate actions that might happen next.

Returns

Dictionary<string, T>

Probability distribution over possible next actions.

RetrieveTextsForVideo(IEnumerable<Tensor<T>>, IEnumerable<string>, int)

Retrieves the most relevant text descriptions for a video.

public IEnumerable<(int Index, T Score)> RetrieveTextsForVideo(IEnumerable<Tensor<T>> frames, IEnumerable<string> candidateTexts, int topK = 10)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to find descriptions for.

candidateTexts IEnumerable<string>

Pool of text descriptions to search.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of best matching texts with scores.

RetrieveVideos(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant videos for a text query.

public IEnumerable<(int Index, T Score)> RetrieveVideos(string query, IEnumerable<Vector<T>> videoEmbeddings, int topK = 10)

Parameters

query string

Text description of desired video content.

videoEmbeddings IEnumerable<Vector<T>>

Pre-computed embeddings of video database.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of top matching videos with their scores.
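
Conceptually, retrieval over pre-computed embeddings is a ranking problem: score every database embedding against the query embedding and keep the topK highest. The sketch below uses the dot product, which equals cosine similarity when all embeddings are L2-normalized. RetrievalSketch is a hypothetical helper on plain arrays, and the query embedding is assumed to come from a text encoder such as EncodeText.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical top-K ranking over pre-computed embeddings; not AiDotNet source code.
public static class RetrievalSketch
{
    public static List<(int Index, double Score)> TopK(
        double[] queryEmbedding, IReadOnlyList<double[]> videoEmbeddings, int topK)
    {
        return videoEmbeddings
            .Select((embedding, index) => (Index: index, Score: Dot(queryEmbedding, embedding)))
            .OrderByDescending(result => result.Score) // stable: ties keep database order
            .Take(topK)
            .ToList();
    }

    private static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same length.");
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
```

Because the video embeddings are computed once and reused, each query costs only one text encoding plus one dot product per video.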

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

SetParameters(Vector<T>)

Sets the parameters of the neural network.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameters to set.

Remarks

This method distributes the parameters to all layers in the network. The parameters should be in the same format as returned by GetParameters.
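
The relationship between a single flat parameter vector and per-layer parameters can be sketched as follows. ParameterVectorSketch is a hypothetical helper that treats each layer as a plain double[]; the library's actual layer types and ordering are not shown on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flatten/distribute helpers; not AiDotNet source code.
public static class ParameterVectorSketch
{
    // Concatenates all layer parameters into one vector, in layer order.
    public static double[] Flatten(IReadOnlyList<double[]> layers)
    {
        int total = 0;
        foreach (double[] layer in layers) total += layer.Length;

        var flat = new double[total];
        int offset = 0;
        foreach (double[] layer in layers)
        {
            Array.Copy(layer, 0, flat, offset, layer.Length);
            offset += layer.Length;
        }
        return flat;
    }

    // Copies consecutive slices of the flat vector back into each layer.
    public static void Distribute(double[] flat, IReadOnlyList<double[]> layers)
    {
        int total = 0;
        foreach (double[] layer in layers) total += layer.Length;
        if (flat.Length != total)
            throw new ArgumentException("Parameter vector length does not match the layers.");

        int offset = 0;
        foreach (double[] layer in layers)
        {
            Array.Copy(flat, offset, layer, 0, layer.Length);
            offset += layer.Length;
        }
    }
}
```

The length check mirrors the requirement that the vector be in the same format as the one produced when the parameters were collected.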

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

ZeroShotActionRecognition(IEnumerable<Tensor<T>>, IEnumerable<string>)

Performs zero-shot action classification on a video.

public Dictionary<string, T> ZeroShotActionRecognition(IEnumerable<Tensor<T>> frames, IEnumerable<string> actionLabels)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to classify.

actionLabels IEnumerable<string>

Candidate action labels.

Returns

Dictionary<string, T>

Dictionary mapping actions to probability scores.

Remarks

For Beginners: Recognize actions without training!

Example:

  • Video: Someone shooting a basketball
  • Labels: ["playing basketball", "playing soccer", "swimming", "running"]
  • Result: {"playing basketball": 0.85, "running": 0.08, ...}

Works with any action you can describe in text!
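
CLIP-style zero-shot classification usually turns video-to-label similarities into probabilities with a scaled softmax. The sketch below assumes the similarities are already computed. ZeroShotSketch and the logitScale parameter are hypothetical names, and the scale value the library uses is not stated on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical scaled softmax over label similarities; not AiDotNet source code.
public static class ZeroShotSketch
{
    public static Dictionary<string, double> ToProbabilities(
        IReadOnlyList<string> labels, IReadOnlyList<double> similarities, double logitScale)
    {
        if (labels.Count == 0 || labels.Count != similarities.Count)
            throw new ArgumentException("Provide one similarity per label.");

        // Subtract the maximum before exponentiating for numerical stability.
        double max = double.NegativeInfinity;
        foreach (double s in similarities) max = Math.Max(max, s);

        var exps = new double[labels.Count];
        double sum = 0;
        for (int i = 0; i < labels.Count; i++)
        {
            exps[i] = Math.Exp((similarities[i] - max) * logitScale);
            sum += exps[i];
        }

        var probabilities = new Dictionary<string, double>();
        for (int i = 0; i < labels.Count; i++)
            probabilities[labels[i]] = exps[i] / sum;
        return probabilities;
    }
}
```

A larger logitScale sharpens the distribution toward the best-matching label, while a scale of 1 keeps the probabilities close together, because cosine similarities span only a narrow range.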

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)

Parameters

image Tensor<T>
classLabels IEnumerable<string>

Returns

Dictionary<string, T>

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.