Interface IVideoCLIPModel<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Defines the contract for VideoCLIP-style models that align video and text in a shared embedding space.

public interface IVideoCLIPModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

VideoCLIP extends CLIP's contrastive learning paradigm to the video domain, enabling text-to-video and video-to-text retrieval, action recognition, and temporal understanding.

For Beginners: VideoCLIP is like CLIP but for videos!

While CLIP matches images with text, VideoCLIP matches VIDEOS with text:

  • Understands actions and events that unfold over time
  • Can find videos matching text descriptions
  • Can generate descriptions for video clips

Key capabilities:

  • Temporal understanding: "A person picks up a ball then throws it"
  • Action recognition: "Playing basketball", "Cooking", "Dancing"
  • Video retrieval: Find videos matching any text query
  • Video-text alignment: Match video segments to text descriptions

Architecture differences from CLIP:

  • Processes multiple frames, not just one image
  • Uses temporal attention/pooling across frames
  • Learns motion and action patterns
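
A minimal sketch of the basic workflow, assuming an IVideoCLIPModel<double> instance and pre-decoded frame tensors; the helper name and candidate captions are illustrative, and Tensor<T> additionally needs a using for whichever AiDotNet namespace defines it:

    using System;
    using System.Collections.Generic;
    using AiDotNet.Interfaces;

    // Illustrative helper: score candidate captions against one clip.
    static void ScoreCaptions(
        IVideoCLIPModel<double> model,
        IEnumerable<Tensor<double>> frames)
    {
        string[] captions =
        {
            "a person playing guitar",
            "a dog catching a frisbee",
            "someone chopping vegetables"
        };

        foreach (string caption in captions)
        {
            // Higher scores mean the text and the video sit closer
            // together in the shared embedding space.
            double score = model.ComputeVideoTextSimilarity(caption, frames);
            Console.WriteLine($"{caption}: {score:F3}");
        }
    }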

Properties

FrameRate

Gets the frame rate (frames per second) for video sampling.

double FrameRate { get; }

Property Value

double

NumFrames

Gets the number of frames the model processes per video clip.

int NumFrames { get; }

Property Value

int
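
FrameRate and NumFrames together determine how a source video should be sampled before it is handed to the model. A rough sketch; the decoding step, totalFrames, and sourceFps are assumptions outside this interface:

    // Sketch: choose which source frames to feed the model, given a
    // video decoded at sourceFps. Steps through the source at the
    // model's expected rate until NumFrames indices are produced.
    static IEnumerable<int> FrameIndicesToSample(
        IVideoCLIPModel<double> model, int totalFrames, double sourceFps)
    {
        // Consecutive model frames are 1/FrameRate seconds apart,
        // which is sourceFps / FrameRate frames in the source video.
        double stride = sourceFps / model.FrameRate;
        for (int i = 0; i < model.NumFrames; i++)
        {
            int index = (int)Math.Round(i * stride);
            if (index >= totalFrames) yield break;
            yield return index;
        }
    }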

TemporalAggregation

Gets the temporal aggregation method used.

string TemporalAggregation { get; }

Property Value

string

Remarks

Common methods: "mean_pooling", "temporal_transformer", "late_fusion"
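
As a rough illustration of the simplest option, "mean_pooling" averages the per-frame features over time. A standalone sketch on plain arrays, not the library's own implementation:

    // Collapse per-frame features of shape [numFrames, featureDim]
    // into a single [featureDim] vector by averaging across time.
    static double[] MeanPool(double[,] frameFeatures)
    {
        int numFrames = frameFeatures.GetLength(0);
        int featureDim = frameFeatures.GetLength(1);
        var pooled = new double[featureDim];

        for (int f = 0; f < numFrames; f++)
            for (int d = 0; d < featureDim; d++)
                pooled[d] += frameFeatures[f, d];

        for (int d = 0; d < featureDim; d++)
            pooled[d] /= numFrames;

        return pooled;
    }

Mean pooling ignores frame order, so models that need motion sensitivity tend to use a temporal transformer instead.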

Methods

AnswerVideoQuestion(IEnumerable<Tensor<T>>, string, int)

Answers a question about video content.

string AnswerVideoQuestion(IEnumerable<Tensor<T>> frames, string question, int maxLength = 64)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

question string

Question about the video.

maxLength int

Maximum answer length.

Returns

string

Generated answer.

Remarks

For Beginners: Ask questions about videos!

Examples:

  • "What is the person doing?" → "Playing guitar"
  • "How many people are in the video?" → "Three"
  • "What happens at the end?" → "The dog catches the frisbee"

ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>>, IEnumerable<Tensor<T>>)

Computes temporal similarity matrix between video segments.

Tensor<T> ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>> video1Frames, IEnumerable<Tensor<T>> video2Frames)

Parameters

video1Frames IEnumerable<Tensor<T>>

First video frames.

video2Frames IEnumerable<Tensor<T>>

Second video frames.

Returns

Tensor<T>

Similarity matrix with shape [numFrames1, numFrames2].

Remarks

Useful for video alignment, finding corresponding moments, or detecting repetitions.
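
The matrix can be read directly: a strong diagonal suggests the two clips are temporally aligned, while repeated off-diagonal bands hint at a repeated action. A minimal sketch of the call:

    // Entry [i, j] scores frame i of video1 against frame j of video2.
    Tensor<double> sim = model.ComputeTemporalSimilarityMatrix(
        video1Frames, video2Frames);
    // sim has shape [numFrames1, numFrames2]; inspect it for a
    // diagonal (alignment) or repeated bands (repetition).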

ComputeVideoTextSimilarity(string, IEnumerable<Tensor<T>>)

Computes similarity between a text description and a video.

T ComputeVideoTextSimilarity(string text, IEnumerable<Tensor<T>> frames)

Parameters

text string

Text description of an action or event.

frames IEnumerable<Tensor<T>>

Video frames to compare against.

Returns

T

Similarity score, typically in range [-1, 1].

ExtractFrameFeatures(IEnumerable<Tensor<T>>)

Extracts frame-level features before temporal aggregation.

Tensor<T> ExtractFrameFeatures(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

Returns

Tensor<T>

Feature tensor with shape [numFrames, featureDim].

GenerateVideoCaption(IEnumerable<Tensor<T>>, int)

Generates a caption describing the video content.

string GenerateVideoCaption(IEnumerable<Tensor<T>> frames, int maxLength = 77)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to caption.

maxLength int

Maximum caption length.

Returns

string

Generated caption describing the video.
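
A sketch of a call; the default of 77 matches CLIP's usual text-length limit, and whether maxLength counts tokens or characters is implementation-defined:

    // Caption an already-decoded clip.
    string caption = model.GenerateVideoCaption(frames, maxLength: 77);
    Console.WriteLine(caption); // e.g. "a dog catching a frisbee in a park"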

GetVideoEmbedding(IEnumerable<Tensor<T>>)

Converts a video (sequence of frames) into an embedding vector.

Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of preprocessed frame tensors with shape [channels, height, width].

Returns

Vector<T>

A normalized embedding vector representing the entire video.

Remarks

For Beginners: This converts a video into a single vector!

Process:

  1. Each frame is encoded independently (like CLIP)
  2. Frame features are aggregated over time
  3. Result is a single vector capturing the video's content and actions

Now you can compare videos to text or other videos!
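
Because the returned vector is normalized, a plain dot product between two video embeddings equals their cosine similarity. A sketch, assuming Vector<T> exposes Length and an indexer; adapt to the actual API if it differs:

    // clipA, clipB: IEnumerable<Tensor<double>> of preprocessed frames.
    Vector<double> a = model.GetVideoEmbedding(clipA);
    Vector<double> b = model.GetVideoEmbedding(clipB);

    double dot = 0;
    for (int i = 0; i < a.Length; i++)
        dot += a[i] * b[i]; // cosine similarity, since both are unit-length

    Console.WriteLine($"video-video similarity: {dot:F3}");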

GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>>)

Converts multiple videos into embedding vectors in a batch.

IEnumerable<Vector<T>> GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>> videos)

Parameters

videos IEnumerable<IEnumerable<Tensor<T>>>

Collection of videos, each as a sequence of frames.

Returns

IEnumerable<Vector<T>>

Collection of normalized embedding vectors.

LocalizeMoments(IEnumerable<Tensor<T>>, string, int)

Localizes moments in a video that match a text description.

IEnumerable<(int StartFrame, int EndFrame, T Score)> LocalizeMoments(IEnumerable<Tensor<T>> frames, string query, int windowSize = 16)

Parameters

frames IEnumerable<Tensor<T>>

Full video as sequence of frames.

query string

Text describing the moment to find.

windowSize int

Number of frames per moment window.

Returns

IEnumerable<(int StartFrame, int EndFrame, T Score)>

List of (startFrame, endFrame, score) for matching moments.

Remarks

For Beginners: Find specific moments in a video!

Example:

  • Video: 5 minutes of a cooking show
  • Query: "chopping vegetables"
  • Result: [(300, 450, 0.92), (1200, 1350, 0.87)] → two segments where chopping happens
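
A sketch that converts the returned frame indices to seconds, assuming the frames were sampled at the model's FrameRate:

    foreach (var (start, end, score) in
             model.LocalizeMoments(frames, "chopping vegetables", windowSize: 16))
    {
        double startSec = start / model.FrameRate;
        double endSec = end / model.FrameRate;
        Console.WriteLine($"{startSec:F1}s to {endSec:F1}s (score {score:F2})");
    }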

PredictNextAction(IEnumerable<Tensor<T>>, IEnumerable<string>)

Predicts the next action or event in a video.

Dictionary<string, T> PredictNextAction(IEnumerable<Tensor<T>> frames, IEnumerable<string> possibleNextActions)

Parameters

frames IEnumerable<Tensor<T>>

Observed video frames.

possibleNextActions IEnumerable<string>

Candidate actions that might happen next.

Returns

Dictionary<string, T>

Probability distribution over possible next actions.
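
A sketch ranking candidate continuations; the candidates are illustrative, and the ordering needs System.Linq:

    var candidates = new[] { "shoots the ball", "passes the ball", "stops dribbling" };
    Dictionary<string, double> probs = model.PredictNextAction(frames, candidates);

    foreach (var kv in probs.OrderByDescending(kv => kv.Value))
        Console.WriteLine($"{kv.Key}: {kv.Value:P1}");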

RetrieveTextsForVideo(IEnumerable<Tensor<T>>, IEnumerable<string>, int)

Retrieves the most relevant text descriptions for a video.

IEnumerable<(int Index, T Score)> RetrieveTextsForVideo(IEnumerable<Tensor<T>> frames, IEnumerable<string> candidateTexts, int topK = 10)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to find descriptions for.

candidateTexts IEnumerable<string>

Pool of text descriptions to search.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of best matching texts with scores.

RetrieveVideos(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant videos for a text query.

IEnumerable<(int Index, T Score)> RetrieveVideos(string query, IEnumerable<Vector<T>> videoEmbeddings, int topK = 10)

Parameters

query string

Text description of desired video content.

videoEmbeddings IEnumerable<Vector<T>>

Pre-computed embeddings of video database.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of top matching videos with their scores.
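
A sketch of the usual two-phase pattern: embed the database once with GetVideoEmbeddings, then answer text queries cheaply against the cached vectors. The database contents and query are illustrative, and ToList needs System.Linq:

    static void SearchVideos(
        IVideoCLIPModel<double> model,
        IReadOnlyList<IEnumerable<Tensor<double>>> videoDatabase)
    {
        // Embed every video once; cache these vectors for reuse.
        var embeddings = model.GetVideoEmbeddings(videoDatabase).ToList();

        foreach (var (index, score) in
                 model.RetrieveVideos("a cat jumping onto a table", embeddings, topK: 5))
        {
            Console.WriteLine($"video #{index}: {score:F3}");
        }
    }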

ZeroShotActionRecognition(IEnumerable<Tensor<T>>, IEnumerable<string>)

Performs zero-shot action classification on a video.

Dictionary<string, T> ZeroShotActionRecognition(IEnumerable<Tensor<T>> frames, IEnumerable<string> actionLabels)

Parameters

frames IEnumerable<Tensor<T>>

Video frames to classify.

actionLabels IEnumerable<string>

Candidate action labels.

Returns

Dictionary<string, T>

Dictionary mapping actions to probability scores.

Remarks

For Beginners: Recognize actions without training!

Example:

  • Video: Someone shooting a basketball
  • Labels: ["playing basketball", "playing soccer", "swimming", "running"]
  • Result: {"playing basketball": 0.85, "running": 0.08, ...}

Works with any action you can describe in text!
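
A sketch that takes the highest-scoring label as the prediction (System.Linq assumed):

    var labels = new[]
    {
        "playing basketball", "playing soccer", "swimming", "running"
    };

    Dictionary<string, double> scores =
        model.ZeroShotActionRecognition(frames, labels);

    string predicted = scores.OrderByDescending(kv => kv.Value).First().Key;
    Console.WriteLine(predicted); // e.g. "playing basketball"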