Class VideoCLIP<T>

Namespace
AiDotNet.Video.Understanding
Assembly
AiDotNet.dll

VideoCLIP model for video-text understanding and retrieval.

public class VideoCLIP<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable

Type Parameters

T

The numeric type used for calculations (e.g., float, double).

Inheritance
object
NeuralNetworkBase<T>
VideoCLIP<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

For Beginners: VideoCLIP learns to understand both videos and text descriptions in a shared "embedding space" where similar concepts are close together.

Key capabilities:

  • Video-to-Text Search: Find text descriptions that match a video
  • Text-to-Video Search: Find videos that match a text query
  • Zero-Shot Classification: Classify videos into categories without training
  • Video Captioning: Generate descriptions for videos
  • Video Question Answering: Answer questions about video content

The model creates embeddings (numerical representations) for both videos and text that can be compared using similarity measures. Videos and their corresponding descriptions will have similar embeddings.

Technical Details:

  • Contrastive learning on video-text pairs
  • Temporal transformer for video understanding
  • Text transformer for language understanding
  • Joint embedding space with cosine similarity
  • Pre-trained on large-scale video-text datasets

Reference: Xu et al., "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP 2021.
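
A minimal usage sketch, assuming a configured NeuralNetworkArchitecture<float> named architecture and that Tensor<float> exposes a constructor taking an int[] shape (the tensor creation and the 224x224 frame size are assumptions; adapt them to the actual Tensor<T> API):

using System;
using System.Collections.Generic;

var model = new VideoCLIP<float>(architecture);

// A 32-frame RGB clip; the [T, C, H, W] layout and frame size are assumptions.
var video = new Tensor<float>(new[] { 32, 3, 224, 224 });

// Zero-shot classification: describe each candidate class in plain language.
var classes = new List<string> { "a video of cooking", "a video of someone running" };
foreach (var (className, probability) in model.ZeroShotClassify(video, classes))
    Console.WriteLine($"{className}: {probability:P1}");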

Constructors

VideoCLIP(NeuralNetworkArchitecture<T>, int, int, int, int, double, string?, string?)

Initializes a new instance of the VideoCLIP class.

public VideoCLIP(NeuralNetworkArchitecture<T> architecture, int numFrames = 32, int embeddingDim = 512, int textMaxLength = 77, int vocabSize = 49408, double temperature = 0.07, string? vocabPath = null, string? mergesPath = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

numFrames int

Number of video frames to process.

embeddingDim int

Dimension of the shared embedding space.

textMaxLength int

Maximum text sequence length.

vocabSize int

Vocabulary size for text encoding.

temperature double

Temperature for softmax scaling.

vocabPath string?

Optional path to CLIP vocabulary JSON file for production tokenization.

mergesPath string?

Optional path to CLIP BPE merges file for production tokenization.

Remarks

For Production Use: Provide vocabPath and mergesPath to use proper CLIP tokenization. Download these files from HuggingFace's openai/clip-vit-base-patch32 repository:

  • vocab.json: Token vocabulary mapping
  • merges.txt: BPE merge rules

For Testing: Omit vocabPath and mergesPath to use a simple test tokenizer.
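
For example (the file paths below are placeholders, and architecture is assumed to be configured elsewhere):

// Production: real CLIP BPE tokenization via the downloaded vocab/merges files.
var model = new VideoCLIP<float>(
    architecture,
    numFrames: 32,
    embeddingDim: 512,
    vocabPath: "models/clip/vocab.json",   // from openai/clip-vit-base-patch32
    mergesPath: "models/clip/merges.txt");

// Testing: omit the paths to fall back to the simple test tokenizer.
var testModel = new VideoCLIP<float>(architecture);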

Properties

SupportsTraining

Gets whether training is supported.

public override bool SupportsTraining { get; }

Property Value

bool

Methods

ComputeSimilarity(Tensor<T>, Tensor<T>)

Computes similarity between video and text embeddings.

public double ComputeSimilarity(Tensor<T> videoEmbedding, Tensor<T> textEmbedding)

Parameters

videoEmbedding Tensor<T>

Video embedding.

textEmbedding Tensor<T>

Text embedding.

Returns

double

Similarity score (higher = more similar).
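
A short sketch, assuming a constructed model and that videoFrames and tokenIds tensors have already been prepared (see EncodeVideo and EncodeText):

var videoEmbedding = model.EncodeVideo(videoFrames);
var textEmbedding = model.EncodeText(tokenIds);

// Higher scores mean the text is a closer match to the video.
double similarity = model.ComputeSimilarity(videoEmbedding, textEmbedding);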

ComputeSimilarityMatrix(List<Tensor<T>>, List<Tensor<T>>)

Computes video-text similarity matrix for a batch.

public Tensor<T> ComputeSimilarityMatrix(List<Tensor<T>> videoFramesBatch, List<Tensor<T>> textsBatch)

Parameters

videoFramesBatch List<Tensor<T>>

Batch of B videos as a list of tensors, each [T, C, H, W].

textsBatch List<Tensor<T>>

Batch of B texts as a list of token-ID tensors, each [SeqLen].

Returns

Tensor<T>

Similarity matrix [B, B] where (i,j) is similarity between video i and text j.
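
A sketch of a batched sanity check, assuming a constructed model and that videoFramesBatch and textsBatch hold aligned video-text pairs:

// Entry (i, j) is the similarity between video i and text j; with aligned
// pairs, the diagonal entries should dominate their rows and columns.
Tensor<float> matrix = model.ComputeSimilarityMatrix(videoFramesBatch, textsBatch);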

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: In the suitcase analogy used for serialization (see SerializeNetworkSpecificData), this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

EncodeText(Tensor<T>)

Encodes text into an embedding vector.

public Tensor<T> EncodeText(Tensor<T> tokenIds)

Parameters

tokenIds Tensor<T>

Token IDs [SeqLen] or [B, SeqLen].

Returns

Tensor<T>

Text embedding [EmbeddingDim] or [B, EmbeddingDim].

Remarks

For Beginners: This converts text (as token IDs) into a numerical vector. Text with similar meaning will have similar embeddings.
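
A sketch, assuming a constructed model and that tokenIds was produced by the model's tokenizer and padded to textMaxLength (77 by default):

// tokenIds shape: [SeqLen] for one text, or [B, SeqLen] for a batch.
var textEmbedding = model.EncodeText(tokenIds);   // [EmbeddingDim] or [B, EmbeddingDim]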

EncodeVideo(Tensor<T>)

Encodes a video into an embedding vector.

public Tensor<T> EncodeVideo(Tensor<T> videoFrames)

Parameters

videoFrames Tensor<T>

Input video [T, C, H, W] or [B, T, C, H, W].

Returns

Tensor<T>

Video embedding [EmbeddingDim] or [B, EmbeddingDim].

Remarks

For Beginners: This converts a video into a numerical vector (embedding). Videos with similar content will have similar embeddings.
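
A sketch, again assuming a Tensor<float> shape constructor and a 224x224 frame size (both assumptions, not part of this API):

// 32 RGB frames in [T, C, H, W] layout.
var video = new Tensor<float>(new[] { 32, 3, 224, 224 });
var videoEmbedding = model.EncodeVideo(video);   // [EmbeddingDim]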

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

TextToVideoRetrieval(string, List<Tensor<T>>, int)

Retrieves the most similar videos to a text query.

public List<(int VideoIndex, double Similarity)> TextToVideoRetrieval(string query, List<Tensor<T>> videoEmbeddings, int topK = 10)

Parameters

query string

Text query describing the desired video.

videoEmbeddings List<Tensor<T>>

Pre-computed video embeddings.

topK int

Number of results to return.

Returns

List<(int VideoIndex, double Similarity)>

List of (videoIndex, similarity) pairs, sorted by similarity.
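
A sketch of query-time search over a pre-indexed collection, assuming a constructed model and that videoClips holds prepared frame tensors:

using System;
using System.Linq;

// Embed the collection once; reuse the embeddings for every query.
var videoEmbeddings = videoClips.Select(v => model.EncodeVideo(v)).ToList();

var results = model.TextToVideoRetrieval("a dog catching a frisbee", videoEmbeddings, topK: 5);
foreach (var (videoIndex, similarity) in results)
    Console.WriteLine($"video {videoIndex}: {similarity:F3}");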

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
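
A minimal sketch of one training step, assuming input and expectedOutput tensors are prepared to match the network's expected shapes:

using System;

model.Train(input, expectedOutput);

// The loss should trend downward over many steps if learning is progressing.
var loss = model.GetLastLoss();
Console.WriteLine($"loss: {loss}");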

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

VideoToTextRetrieval(Tensor<T>, List<string>, int)

Retrieves the most similar text descriptions for a video.

public List<(string Text, double Similarity)> VideoToTextRetrieval(Tensor<T> videoFrames, List<string> candidateTexts, int topK = 10)

Parameters

videoFrames Tensor<T>

Input video frames.

candidateTexts List<string>

List of candidate text descriptions.

topK int

Number of results to return.

Returns

List<(string Text, double Similarity)>

List of (text, similarity) pairs, sorted by similarity.
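
A sketch, assuming a constructed model and that videoFrames has been prepared as in EncodeVideo:

using System;
using System.Collections.Generic;

var candidates = new List<string>
{
    "a person slicing vegetables",
    "a soccer match highlight",
    "a timelapse of a sunset"
};

foreach (var (text, similarity) in model.VideoToTextRetrieval(videoFrames, candidates, topK: 3))
    Console.WriteLine($"{text}: {similarity:F3}");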

ZeroShotClassify(Tensor<T>, List<string>)

Performs zero-shot video classification.

public List<(string ClassName, double Probability)> ZeroShotClassify(Tensor<T> videoFrames, List<string> classTexts)

Parameters

videoFrames Tensor<T>

Input video frames.

classTexts List<string>

List of class descriptions (e.g., "a video of cooking").

Returns

List<(string ClassName, double Probability)>

A probability distribution over the provided classes, as (className, probability) pairs.

Remarks

For Beginners: This classifies videos without any training on those specific categories. Simply provide text descriptions of each class (like "a video of someone running"), and the model will determine which description best matches the video.
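
For example, with a constructed model and videoFrames prepared as in EncodeVideo:

using System;
using System.Collections.Generic;

var classTexts = new List<string>
{
    "a video of cooking",
    "a video of someone running",
    "a video of a musical performance"
};

// The returned probabilities sum to 1 across the provided classes.
foreach (var (className, probability) in model.ZeroShotClassify(videoFrames, classTexts))
    Console.WriteLine($"{className}: {probability:P1}");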