Interface IVideoCLIPModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for VideoCLIP-style models that align video and text in a shared embedding space.
public interface IVideoCLIPModel<T> : IMultimodalEmbedding<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
VideoCLIP extends CLIP's contrastive learning paradigm to the video domain, enabling text-to-video and video-to-text retrieval, action recognition, and temporal understanding.
For Beginners: VideoCLIP is like CLIP but for videos!
While CLIP matches images with text, VideoCLIP matches VIDEOS with text:
- Understands actions and events that unfold over time
- Can find videos matching text descriptions
- Can generate descriptions for video clips
Key capabilities:
- Temporal understanding: "A person picks up a ball then throws it"
- Action recognition: "Playing basketball", "Cooking", "Dancing"
- Video retrieval: Find videos matching any text query
- Video-text alignment: Match video segments to text descriptions
Architecture differences from CLIP:
- Processes multiple frames, not just one image
- Uses temporal attention/pooling across frames
- Learns motion and action patterns
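The sketch below shows the core workflow in minimal form. It is illustrative only: it assumes an IVideoCLIPModel<double> implementation instance and a list of preprocessed frame tensors are already available (both hypothetical here, since constructing them depends on the implementing class), and that the Tensor<T> and Vector<T> types are in scope.

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

static class VideoClipQuickStart
{
    // model: any IVideoCLIPModel<double> implementation (construction not shown).
    // frames: preprocessed [channels, height, width] tensors for one clip.
    static void Run(IVideoCLIPModel<double> model, List<Tensor<double>> frames)
    {
        // Encode the whole clip into one normalized vector
        // (per-frame encoding followed by temporal aggregation).
        Vector<double> videoEmbedding = model.GetVideoEmbedding(frames);

        // Compare the clip directly against a free-form description.
        double score = model.ComputeVideoTextSimilarity("a person throwing a ball", frames);
        Console.WriteLine($"Video-text similarity: {score:F3}"); // typically in [-1, 1]
    }
}

The shorter fragments in the method sections below continue this context (same usings and the same hypothetical model and frames variables).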
Properties
FrameRate
Gets the frame rate (frames per second) for video sampling.
double FrameRate { get; }
Property Value
- double
NumFrames
Gets the number of frames the model processes per video clip.
int NumFrames { get; }
Property Value
- int
TemporalAggregation
Gets the temporal aggregation method used.
string TemporalAggregation { get; }
Property Value
- string
Remarks
Common methods: "mean_pooling", "temporal_transformer", "late_fusion"
Methods
AnswerVideoQuestion(IEnumerable<Tensor<T>>, string, int)
Answers a question about video content.
string AnswerVideoQuestion(IEnumerable<Tensor<T>> frames, string question, int maxLength = 64)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames.
question (string): Question about the video.
maxLength (int): Maximum answer length.
Returns
- string
Generated answer.
Remarks
For Beginners: Ask questions about videos!
Examples:
- "What is the person doing?" → "Playing guitar"
- "How many people are in the video?" → "Three"
- "What happens at the end?" → "The dog catches the frisbee"
ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>>, IEnumerable<Tensor<T>>)
Computes a temporal similarity matrix between the frames of two videos.
Tensor<T> ComputeTemporalSimilarityMatrix(IEnumerable<Tensor<T>> video1Frames, IEnumerable<Tensor<T>> video2Frames)
Parameters
video1Frames (IEnumerable<Tensor<T>>): First video frames.
video2Frames (IEnumerable<Tensor<T>>): Second video frames.
Returns
- Tensor<T>
Similarity matrix with shape [numFrames1, numFrames2].
Remarks
Useful for video alignment, finding corresponding moments, or detecting repetitions.
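The fragment below finds the best-aligned frame pair between two clips held as hypothetical List<Tensor<double>> variables (video1Frames, video2Frames). It assumes Tensor<T> exposes a two-index [row, col] indexer over the [numFrames1, numFrames2] matrix; adapt the element access to the actual Tensor<T> API.

Tensor<double> sim = model.ComputeTemporalSimilarityMatrix(video1Frames, video2Frames);

// Scan for the most similar frame pair (assumed [i, j] indexer).
int bestI = 0, bestJ = 0;
double best = double.NegativeInfinity;
for (int i = 0; i < video1Frames.Count; i++)
    for (int j = 0; j < video2Frames.Count; j++)
        if (sim[i, j] > best) { best = sim[i, j]; bestI = i; bestJ = j; }

Console.WriteLine($"Frames {bestI} and {bestJ} align best (score {best:F3}).");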
ComputeVideoTextSimilarity(string, IEnumerable<Tensor<T>>)
Computes similarity between a text description and a video.
T ComputeVideoTextSimilarity(string text, IEnumerable<Tensor<T>> frames)
Parameters
text (string): Text description of an action or event.
frames (IEnumerable<Tensor<T>>): Video frames to compare against.
Returns
- T
Similarity score, typically in range [-1, 1].
ExtractFrameFeatures(IEnumerable<Tensor<T>>)
Extracts frame-level features before temporal aggregation.
Tensor<T> ExtractFrameFeatures(IEnumerable<Tensor<T>> frames)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames.
Returns
- Tensor<T>
Feature tensor with shape [numFrames, featureDim].
GenerateVideoCaption(IEnumerable<Tensor<T>>, int)
Generates a caption describing the video content.
string GenerateVideoCaption(IEnumerable<Tensor<T>> frames, int maxLength = 77)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames to caption.
maxLength (int): Maximum caption length.
Returns
- string
Generated caption describing the video.
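For example (continuing the quick-start context):

string caption = model.GenerateVideoCaption(frames); // up to 77 tokens by default
Console.WriteLine(caption); // e.g. "a dog catching a frisbee in a park"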
GetVideoEmbedding(IEnumerable<Tensor<T>>)
Converts a video (sequence of frames) into an embedding vector.
Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)
Parameters
frames (IEnumerable<Tensor<T>>): Sequence of preprocessed frame tensors, each with shape [channels, height, width].
Returns
- Vector<T>
A normalized embedding vector representing the entire video.
Remarks
For Beginners: This converts a video into a single vector!
Process:
- Each frame is encoded independently (like CLIP)
- Frame features are aggregated over time
- Result is a single vector capturing the video's content and actions
Now you can compare videos to text or other videos!
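Because the returned embeddings are normalized, a dot product between two of them is a cosine similarity, which enables video-to-video comparison. The fragment below uses two hypothetical frame lists (clipA, clipB) and assumes Vector<T> exposes a Length property and an [i] indexer; adjust to the actual Vector<T> API.

Vector<double> a = model.GetVideoEmbedding(clipA);
Vector<double> b = model.GetVideoEmbedding(clipB);

// Dot product of unit vectors = cosine similarity.
double dot = 0;
for (int i = 0; i < a.Length; i++) dot += a[i] * b[i];
Console.WriteLine($"Video-to-video similarity: {dot:F3}");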
GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>>)
Converts multiple videos into embedding vectors in a batch.
IEnumerable<Vector<T>> GetVideoEmbeddings(IEnumerable<IEnumerable<Tensor<T>>> videos)
Parameters
videos (IEnumerable<IEnumerable<Tensor<T>>>): Collection of videos, each as a sequence of frames.
Returns
- IEnumerable<Vector<T>>
Collection of normalized embedding vectors.
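A typical use is to embed a collection once and keep the vectors for later retrieval (see RetrieveVideos below). Here videoClips is a hypothetical IEnumerable<IEnumerable<Tensor<double>>>, one frame sequence per video; ToList requires System.Linq.

var database = model.GetVideoEmbeddings(videoClips).ToList();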
LocalizeMoments(IEnumerable<Tensor<T>>, string, int)
Localizes moments in a video that match a text description.
IEnumerable<(int StartFrame, int EndFrame, T Score)> LocalizeMoments(IEnumerable<Tensor<T>> frames, string query, int windowSize = 16)
Parameters
frames (IEnumerable<Tensor<T>>): Full video as a sequence of frames.
query (string): Text describing the moment to find.
windowSize (int): Number of frames per moment window.
Returns
- IEnumerable<(int StartFrame, int EndFrame, T Score)>
Sequence of (StartFrame, EndFrame, Score) tuples for matching moments.
Remarks
For Beginners: Find specific moments in a video!
Example:
- Video: 5 minutes of a cooking show
- Query: "chopping vegetables"
- Result: [(300, 450, 0.92), (1200, 1350, 0.87)] - two segments where chopping happens
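The fragment below prints matching windows as timestamps, converting frame indices to seconds with the model's FrameRate property (continuing the quick-start context):

foreach (var (startFrame, endFrame, score) in model.LocalizeMoments(frames, "chopping vegetables", windowSize: 16))
{
    // Frame index / sampling rate = time in seconds.
    double startSec = startFrame / model.FrameRate;
    double endSec = endFrame / model.FrameRate;
    Console.WriteLine($"{startSec:F1}s to {endSec:F1}s (score {score:F2})");
}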
PredictNextAction(IEnumerable<Tensor<T>>, IEnumerable<string>)
Predicts the next action or event in a video.
Dictionary<string, T> PredictNextAction(IEnumerable<Tensor<T>> frames, IEnumerable<string> possibleNextActions)
Parameters
frames (IEnumerable<Tensor<T>>): Observed video frames.
possibleNextActions (IEnumerable<string>): Candidate actions that might happen next.
Returns
- Dictionary<string, T>
Probability distribution over possible next actions.
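For example, scoring a few candidate continuations and picking the most likely one (continuing the quick-start context; OrderByDescending requires System.Linq):

var candidates = new[] { "throws the ball", "drops the ball", "walks away" };
Dictionary<string, double> probs = model.PredictNextAction(frames, candidates);

// Select the highest-probability candidate.
var best = probs.OrderByDescending(kv => kv.Value).First();
Console.WriteLine($"Most likely next: {best.Key} ({best.Value:P0})");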
RetrieveTextsForVideo(IEnumerable<Tensor<T>>, IEnumerable<string>, int)
Retrieves the most relevant text descriptions for a video.
IEnumerable<(int Index, T Score)> RetrieveTextsForVideo(IEnumerable<Tensor<T>> frames, IEnumerable<string> candidateTexts, int topK = 10)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames to find descriptions for.
candidateTexts (IEnumerable<string>): Pool of text descriptions to search.
topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices of best matching texts with scores.
RetrieveVideos(string, IEnumerable<Vector<T>>, int)
Retrieves the most relevant videos for a text query.
IEnumerable<(int Index, T Score)> RetrieveVideos(string query, IEnumerable<Vector<T>> videoEmbeddings, int topK = 10)
Parameters
query (string): Text description of desired video content.
videoEmbeddings (IEnumerable<Vector<T>>): Precomputed embeddings of the video database.
topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices of top matching videos with their scores.
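For example, searching the precomputed embedding list built in the GetVideoEmbeddings example above:

foreach (var (index, score) in model.RetrieveVideos("someone baking bread", database, topK: 5))
    Console.WriteLine($"Video #{index}: score {score:F3}");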
ZeroShotActionRecognition(IEnumerable<Tensor<T>>, IEnumerable<string>)
Performs zero-shot action classification on a video.
Dictionary<string, T> ZeroShotActionRecognition(IEnumerable<Tensor<T>> frames, IEnumerable<string> actionLabels)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames to classify.
actionLabels (IEnumerable<string>): Candidate action labels.
Returns
- Dictionary<string, T>
Dictionary mapping actions to probability scores.
Remarks
For Beginners: Recognize actions without training!
Example:
- Video: Someone shooting a basketball
- Labels: ["playing basketball", "playing soccer", "swimming", "running"]
- Result: {"playing basketball": 0.85, "running": 0.08, ...}
Works with any action you can describe in text!
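For example, ranking a handful of candidate labels (continuing the quick-start context; OrderByDescending requires System.Linq):

var labels = new[] { "playing basketball", "playing soccer", "swimming", "running" };
Dictionary<string, double> scores = model.ZeroShotActionRecognition(frames, labels);

// Print labels from most to least likely.
foreach (var kv in scores.OrderByDescending(kv => kv.Value))
    Console.WriteLine($"{kv.Key}: {kv.Value:F2}");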