Interface IVideoCLIPModel<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Defines the contract for VideoCLIP-style models that align video and text in a shared embedding space.

public interface IVideoCLIPModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

VideoCLIP extends CLIP's contrastive learning paradigm to the video domain, enabling text-to-video and video-to-text retrieval, action recognition, and temporal understanding.

For Beginners: VideoCLIP is like CLIP but for videos!

While CLIP matches images with text, VideoCLIP matches VIDEOS with text:

  • Understands actions and events that unfold over time
  • Can find videos matching text descriptions
  • Can generate descriptions for video clips

Key capabilities:

  • Temporal understanding: "A person picks up a ball then throws it"
  • Action recognition: "Playing basketball", "Cooking", "Dancing"
  • Video retrieval: Find videos matching any text query
  • Video-text alignment: Match video segments to text descriptions

Architecture differences from CLIP:

  • Processes multiple frames, not just one image
  • Uses temporal attention/pooling across frames
  • Learns motion and action patterns
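
A minimal sketch of the basic workflow, assuming an IVideoCLIPModel<double> instance and pre-decoded frame tensors; the helper name and candidate captions are illustrative, and Tensor<T> additionally needs a using for whichever AiDotNet namespace defines it:

    using System;
    using System.Collections.Generic;
    using AiDotNet.Interfaces;

    // Illustrative helper: score candidate captions against one clip.
    static void ScoreCaptions(
        IVideoCLIPModel<double> model,
        IEnumerable<Tensor<double>> frames)
    {
        string[] captions =
        {
            "a person playing guitar",
            "a dog catching a frisbee",
            "someone chopping vegetables"
        };

        foreach (string caption in captions)
        {
            // Higher scores mean the text and the video sit closer
            // together in the shared embedding space.
            double score = model.ComputeVideoTextSimilarity(caption, frames);
            Console.WriteLine($"{caption}: {score:F3}");
        }
    }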

Properties

FrameRate

Gets the frame rate (frames per second) for video sampling.

double FrameRate { get; }

Property Value

double

NumFrames

Gets the number of frames the model processes per video clip.

int NumFrames { get; }

Property Value

int
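
FrameRate and NumFrames together determine how a source video should be sampled before it is handed to the model. A rough sketch; the decoding step, totalFrames, and sourceFps are assumptions outside this interface:

    // Sketch: choose which source frames to feed the model, given a
    // video decoded at sourceFps. Steps through the source at the
    // model's expected rate until NumFrames indices are produced.
    static IEnumerable<int> FrameIndicesToSample(
        IVideoCLIPModel<double> model, int totalFrames, double sourceFps)
    {
        // Consecutive model frames are 1/FrameRate seconds apart,
        // which is sourceFps / FrameRate frames in the source video.
        double stride = sourceFps / model.FrameRate;
        for (int i = 0; i < model.NumFrames; i++)
        {
            int index = (int)Math.Round(i * stride);
            if (index >= totalFrames) yield break;
            yield return index;
        }
    }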

TemporalAggregation

Gets the temporal aggregation method used.

string TemporalAggregation { get; }

Property Value

string

Remarks

Common methods: "mean_pooling", "temporal_transformer", "late_fusion"
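
As a rough illustration of the simplest option, "mean_pooling" averages the per-frame features over time. A standalone sketch on plain arrays, not the library's own implementation:

    // Collapse per-frame features of shape [numFrames, featureDim]
    // into a single [featureDim] vector by averaging across time.
    static double[] MeanPool(double[,] frameFeatures)
    {
        int numFrames = frameFeatures.GetLength(0);
        int featureDim = frameFeatures.GetLength(1);
        var pooled = new double[featureDim];

        for (int f = 0; f < numFrames; f++)
            for (int d = 0; d < featureDim; d++)
                pooled[d] += frameFeatures[f, d];

        for (int d = 0; d < featureDim; d++)
            pooled[d] /= numFrames;

        return pooled;
    }

Mean pooling ignores frame order, so models that need motion sensitivity tend to use a temporal transformer instead.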

Methods

AnswerVideoQuestion(IEnumerable<Tensor<T>>, string, int)

Answers a question about video content.

string AnswerVideoQuestion(IEnumerable<Tensor<T>> frames, string question, int maxLength = 64)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

question string

Question about the video.

maxLength int

Maximum answer length.

Returns

string

Generated answer.

Remarks

For Beginners: Ask questions about videos!

Examples:

  • "What is the person doing?" → "Playing guitar"
  • "How many people are in the video?" → "Three"
  • "What happens at the end?" → "The dog catches the frisbee"

ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>>, IEnumerable<Tensor<T>>)

Computes temporal similarity matrix between video segments.

Tensor<T> ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>> video1Frames, IEnumerable<Tensor<T>> video2Frames)

Parameters

video1Frames IEnumerable<Tensor<T>>

First video frames.

video2Frames IEnumerable<Tensor<T>>

Second video frames.

Returns

Tensor<T>

Similarity matrix with shape [numFrames1, numFrames2].

Remarks

Useful for video alignment, finding corresponding moments, or detecting repetitions.
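
The matrix can be read directly: a strong diagonal suggests the two clips are temporally aligned, while repeated off-diagonal bands hint at a repeated action. A minimal sketch of the call:

    // Entry [i, j] scores frame i of video1 against frame j of video2.
    Tensor<double> sim = model.ComputeTemporalSimilarityMatrix(
        video1Frames, video2Frames);
    // sim has shape [numFrames1, numFrames2]; inspect it for a
    // diagonal (alignment) or repeated bands (repetition).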

ComputeVideoTextSimilarity(string, IEnumerable<Tensor<T>>)

Computes similarity between a text description and a video.

T ComputeVideoTextSimilarity(string text, IEnumerable<Tensor<T>> frames)

Parameters

text string

Text description of an action or event.

frames IEnumerable<Tensor<T>>

Video frames to compare against.

Returns

T

Similarity score, typically in range [-1, 1].

ExtractFrameFeatures(IEnumerable<Tensor<T>>)

Extracts frame-level features before temporal aggregation.

Tensor<T> ExtractFrameFeatures(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

Returns

Tensor<T>

Feature tensor with shape [numFrames, featureDim].

GenerateVideoCaption(IEnumerable<Tensor<T>>, int)

Generates a caption describing the video content.

string GenerateVideoCaption(IEnumerable<Tensor<T>> frames, int maxLength = 77)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to caption.

maxLength int

Maximum caption length.

Returns

string

Generated caption describing the video.
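
A sketch of a call; the default of 77 matches CLIP's usual text-length limit, and whether maxLength counts tokens or characters is implementation-defined:

    // Caption an already-decoded clip.
    string caption = model.GenerateVideoCaption(frames, maxLength: 77);
    Console.WriteLine(caption); // e.g. "a dog catching a frisbee in a park"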

GetVideoEmbedding(IEnumerable<Tensor<T>>)

Converts a video (sequence of frames) into an embedding vector.

Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of preprocessed frame tensors with shape [channels, height, width].

Returns

Vector<T>

A normalized embedding vector representing the entire video.

Remarks

For Beginners: This converts a video into a single vector!

Process:

  1. Each frame is encoded independently (like CLIP)
  2. Frame features are aggregated over time
  3. Result is a single vector capturing the video's content and actions

Now you can compare videos to text or other videos!
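
Because the returned vector is normalized, a plain dot product between two video embeddings equals their cosine similarity. A sketch, assuming Vector<T> exposes Length and an indexer; adapt to the actual API if it differs:

    // clipA, clipB: IEnumerable<Tensor<double>> of preprocessed frames.
    Vector<double> a = model.GetVideoEmbedding(clipA);
    Vector<double> b = model.GetVideoEmbedding(clipB);

    double dot = 0;
    for (int i = 0; i < a.Length; i++)
        dot += a[i] * b[i]; // cosine similarity, since both are unit-length

    Console.WriteLine($"video-video similarity: {dot:F3}");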

GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>>)

Converts multiple videos into embedding vectors in a batch.

IEnumerable<Vector<T>> GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>> videos)

Parameters

videos IEnumerable<IEnumerable<Tensor<T>>>

Collection of videos, each as a sequence of frames.

Returns

IEnumerable<Vector<T>>

Collection of normalized embedding vectors.

LocalizeMoments(IEnumerable<Tensor<T>>, string, int)

Localizes moments in a video that match a text description.

IEnumerable<(int StartFrame, int EndFrame, T Score)> LocalizeMoments(IEnumerable<Tensor<T>> frames, string query, int windowSize = 16)

Parameters

frames IEnumerable<Tensor<T>>

Full video as sequence of frames.

query string

Text describing the moment to find.

windowSize int

Number of frames per moment window.

Returns

IEnumerable<(int StartFrame, int EndFrame, T Score)>

List of (startFrame, endFrame, score) for matching moments.

Remarks

For Beginners: Find specific moments in a video!

Example:

  • Video: 5 minutes of a cooking show
  • Query: "chopping vegetables"
  • Result: [(300, 450, 0.92), (1200, 1350, 0.87)] → two segments where chopping happens
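
A sketch that converts the returned frame indices to seconds, assuming the frames were sampled at the model's FrameRate:

    foreach (var (start, end, score) in
             model.LocalizeMoments(frames, "chopping vegetables", windowSize: 16))
    {
        double startSec = start / model.FrameRate;
        double endSec = end / model.FrameRate;
        Console.WriteLine($"{startSec:F1}s to {endSec:F1}s (score {score:F2})");
    }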

PredictNextAction(IEnumerable<Tensor<T>>, IEnumerable<string>)

Predicts the next action or event in a video.

Dictionary<string, T> PredictNextAction(IEnumerable<Tensor<T>> frames, IEnumerable<string> possibleNextActions)

Parameters

frames IEnumerable<Tensor<T>>

Observed video frames.

possibleNextActions IEnumerable<string>

Candidate actions that might happen next.

Returns

Dictionary<string, T>

Probability distribution over possible next actions.
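
A sketch ranking candidate continuations; the candidates are illustrative, and the ordering needs System.Linq:

    var candidates = new[] { "shoots the ball", "passes the ball", "stops dribbling" };
    Dictionary<string, double> probs = model.PredictNextAction(frames, candidates);

    foreach (var kv in probs.OrderByDescending(kv => kv.Value))
        Console.WriteLine($"{kv.Key}: {kv.Value:P1}");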

RetrieveTextsForVideo(IEnumerable<Tensor<T>>, IEnumerable<string>, int)

Retrieves the most relevant text descriptions for a video.

IEnumerable<(int Index, T Score)> RetrieveTextsForVideo(IEnumerable<Tensor<T>> frames, IEnumerable<string> candidateTexts, int topK = 10)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to find descriptions for.

candidateTexts IEnumerable<string>

Pool of text descriptions to search.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of best matching texts with scores.

RetrieveVideos(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant videos for a text query.

IEnumerable<(int Index, T Score)> RetrieveVideos(string query, IEnumerable<Vector<T>> videoEmbeddings, int topK = 10)

Parameters

query string

Text description of desired video content.

videoEmbeddings IEnumerable<Vector<T>>

Pre-computed embeddings of video database.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of top matching videos with their scores.
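
A sketch of the usual two-phase pattern: embed the database once with GetVideoEmbeddings, then answer text queries cheaply against the cached vectors. The database contents and query are illustrative, and ToList needs System.Linq:

    static void SearchVideos(
        IVideoCLIPModel<double> model,
        IReadOnlyList<IEnumerable<Tensor<double>>> videoDatabase)
    {
        // Embed every video once; cache these vectors for reuse.
        var embeddings = model.GetVideoEmbeddings(videoDatabase).ToList();

        foreach (var (index, score) in
                 model.RetrieveVideos("a cat jumping onto a table", embeddings, topK: 5))
        {
            Console.WriteLine($"video #{index}: {score:F3}");
        }
    }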

ZeroShotActionRecognition(IEnumerable<Tensor<T>>, IEnumerable<string>)

Performs zero-shot action classification on a video.

Dictionary<string, T> ZeroShotActionRecognition(IEnumerable<Tensor<T>> frames, IEnumerable<string> actionLabels)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to classify.

actionLabels IEnumerable<string>

Candidate action labels.

Returns

Dictionary<string, T>

Dictionary mapping actions to probability scores.

Remarks

For Beginners: Recognize actions without training!

Example:

  • Video: Someone shooting a basketball
  • Labels: ["playing basketball", "playing soccer", "swimming", "running"]
  • Result: {"playing basketball": 0.85, "running": 0.08, ...}

Works with any action you can describe in text!
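
A sketch that takes the highest-scoring label as the prediction (System.Linq assumed):

    var labels = new[]
    {
        "playing basketball", "playing soccer", "swimming", "running"
    };

    Dictionary<string, double> scores =
        model.ZeroShotActionRecognition(frames, labels);

    string predicted = scores.OrderByDescending(kv => kv.Value).First().Key;
    Console.WriteLine(predicted); // e.g. "playing basketball"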