Interface IMultimodalEmbedding<T>

Namespace: AiDotNet.Interfaces

Assembly: AiDotNet.dll

Interface for multimodal embedding models that can encode multiple modalities (text, images, audio) into a shared embedding space.

public interface IMultimodalEmbedding<T>

Type Parameters

T: The numeric type used for computations.

Remarks

Multimodal embedding models like CLIP (Contrastive Language-Image Pre-training) learn to project different types of data into the same vector space, enabling cross-modal similarity search and zero-shot classification.

For Beginners: Imagine you want to search for images using text queries. A multimodal model learns to convert both "a photo of a cat" and an actual cat image into similar vectors, allowing direct comparison between text and images.
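To make the idea concrete, here is a minimal, self-contained C# sketch of text-to-image search. It does not call AiDotNet. The toy 3-dimensional vectors stand in for the normalized outputs of EncodeText and EncodeImage, and the helper names (Dot, RankImages) are hypothetical. Because the vectors are assumed to be unit length, the dot product serves as the cosine similarity.

```csharp
using System;
using System.Linq;

// Toy, already-normalized embeddings (hypothetical values, not real model output).
double[] textQuery = { 1.0, 0.0, 0.0 };           // stands in for EncodeText("a photo of a cat")
double[][] imageEmbeddings =
{
    new[] { 0.0, 1.0, 0.0 },                      // e.g. a dog photo
    new[] { 0.8, 0.6, 0.0 },                      // e.g. a cat photo
    new[] { 0.6, 0.0, 0.8 },                      // e.g. a cat drawing
};

int[] ranking = RankImages(textQuery, imageEmbeddings);
Console.WriteLine(string.Join(", ", ranking));    // prints "1, 2, 0"

// Dot product; equals cosine similarity when both vectors have unit length.
static double Dot(double[] a, double[] b)
{
    double sum = 0.0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

// Returns image indices ordered from most to least similar to the query.
static int[] RankImages(double[] query, double[][] images) =>
    Enumerable.Range(0, images.Length)
              .OrderByDescending(i => Dot(query, images[i]))
              .ToArray();
```

The closest image ranks first, even though the query is text and the candidates are images: this is only possible because both live in the same embedding space.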

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected side length, in pixels, of the square input image (images are ImageSize × ImageSize pixels).

int ImageSize { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

int MaxSequenceLength { get; }

Property Value

int

Methods

ComputeSimilarity(Vector<T>, Vector<T>)

Computes the similarity between two embeddings.

T ComputeSimilarity(Vector<T> embedding1, Vector<T> embedding2)

Parameters

embedding1 Vector<T>: The first embedding.
embedding2 Vector<T>: The second embedding.

Returns

T: The similarity score. For normalized embeddings this is the cosine similarity, which equals their dot product.
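As a reference point, the following sketch computes cosine similarity, cos(a, b) = a·b / (‖a‖ ‖b‖), over plain double[] arrays. It illustrates the formula and is not AiDotNet's implementation; the helper names CosineSimilarity and Dot are hypothetical. It also shows that once both vectors are L2-normalized, the plain dot product gives the same value.

```csharp
using System;

double[] u = { 3.0, 4.0 };
double[] w = { 4.0, 3.0 };

// Full cosine similarity: 24 / (5 * 5), approximately 0.96.
double cosine = CosineSimilarity(u, w);
Console.WriteLine(cosine);

// The same vectors divided by their length (5): the dot product alone gives the same score.
double[] uUnit = { 0.6, 0.8 };
double[] wUnit = { 0.8, 0.6 };
double dotOfUnit = Dot(uUnit, wUnit);
Console.WriteLine(dotOfUnit);

// Cosine similarity: dot(a, b) / (|a| * |b|). Throws on length mismatch or zero vectors.
static double CosineSimilarity(double[] a, double[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Embeddings must have the same length.");
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) throw new ArgumentException("Cannot compare a zero vector.");
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Plain dot product of two equal-length vectors.
static double Dot(double[] a, double[] b)
{
    double sum = 0.0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}
```

Skipping the norm computation is the usual reason encoders return normalized vectors: similarity then reduces to a single dot product.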

EncodeImage(double[])

Encodes an image into an embedding vector.

Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]: The preprocessed image data, flattened into a one-dimensional array in CHW (channel, height, width) order.

Returns

Vector<T>: A normalized embedding vector.
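The sketch below shows one way to turn interleaved RGB pixels (HWC: height, width, channel, the layout many image decoders produce) into a flattened CHW array. The three-channel assumption, the scaling to [0, 1], and the helper name HwcToChw are illustrative assumptions; any further normalization a particular model requires (for example per-channel mean and standard deviation) is not specified here. For a model with, say, ImageSize = 224, the resulting array would have 3 × 224 × 224 = 150,528 elements.

```csharp
using System;

// Hypothetical 2x2 RGB image stored as interleaved bytes in HWC order:
// row 0: red, green; row 1: blue, white.
byte[] hwc =
{
    255, 0, 0,    0, 255, 0,
    0, 0, 255,    255, 255, 255,
};
int size = 2;  // plays the role of ImageSize
double[] chw = HwcToChw(hwc, size, size, channels: 3);
Console.WriteLine(chw.Length);  // 12 = 3 * 2 * 2

// Reorders HWC bytes into CHW doubles scaled to [0, 1].
// CHW index = c * H * W + y * W + x; HWC index = (y * W + x) * C + c.
static double[] HwcToChw(byte[] pixels, int height, int width, int channels)
{
    if (pixels.Length != height * width * channels)
        throw new ArgumentException("Pixel buffer does not match the given dimensions.");
    var result = new double[channels * height * width];
    for (int c = 0; c < channels; c++)
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                result[c * height * width + y * width + x] =
                    pixels[(y * width + x) * channels + c] / 255.0;
    return result;
}
```

In CHW order all red values come first, then all green, then all blue, so the example produces the planes [1, 0, 0, 1], [0, 1, 0, 1], and [0, 0, 1, 1].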

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>: The preprocessed images as flattened arrays.

Returns

Matrix<T>: A matrix where each row is an embedding for the corresponding image.
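A batch method returns one row per input, in input order. The following sketch illustrates that contract with a hypothetical per-item encoder delegate (encodeOne) standing in for EncodeImage, and a plain double[,] in place of Matrix<T>. It is not how AiDotNet implements batching; a real implementation may process the whole batch at once for efficiency. The helper name EncodeBatch is also hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical encoder: maps an input to a 2-dimensional "embedding" (sum, length).
Func<double[], double[]> encodeOne = input => new double[] { input.Sum(), input.Length };

var batch = new List<double[]> { new[] { 1.0, 2.0 }, new[] { 3.0, 4.0, 5.0 } };
double[,] embeddings = EncodeBatch(batch, encodeOne, embeddingDimension: 2);
Console.WriteLine($"{embeddings[1, 0]}, {embeddings[1, 1]}");  // prints "12, 3"

// Stacks per-item embeddings into a matrix: row i holds the embedding of item i.
static double[,] EncodeBatch(IEnumerable<double[]> items, Func<double[], double[]> encode, int embeddingDimension)
{
    var rows = items.Select(encode).ToList();
    var matrix = new double[rows.Count, embeddingDimension];
    for (int r = 0; r < rows.Count; r++)
    {
        if (rows[r].Length != embeddingDimension)
            throw new InvalidOperationException($"Row {r} has length {rows[r].Length}, expected {embeddingDimension}.");
        for (int c = 0; c < embeddingDimension; c++)
            matrix[r, c] = rows[r][c];
    }
    return matrix;
}
```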

EncodeText(string)

Encodes text into an embedding vector.

Vector<T> EncodeText(string text)

Parameters

text string: The text to encode.

Returns

Vector<T>: A normalized embedding vector.
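EncodeText, like EncodeImage, is documented to return a normalized vector. Assuming this means unit L2 norm, as in CLIP-style models, the normalization step looks like the following sketch; the helper name L2Normalize is hypothetical.

```csharp
using System;

double[] raw = { 3.0, 0.0, 4.0 };
double[] unit = L2Normalize(raw);
Console.WriteLine(string.Join(", ", unit));  // components 3/5, 0, 4/5

// Divides each component by the vector's Euclidean length so the result has norm 1.
static double[] L2Normalize(double[] v)
{
    double sumSquares = 0.0;
    foreach (double x in v) sumSquares += x * x;
    double norm = Math.Sqrt(sumSquares);
    if (norm == 0.0) throw new ArgumentException("Cannot normalize a zero vector.");
    var result = new double[v.Length];
    for (int i = 0; i < v.Length; i++) result[i] = v[i] / norm;
    return result;
}
```

Normalizing at encode time means every later comparison can use a plain dot product.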

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>: The texts to encode.

Returns

Matrix<T>: A matrix where each row is an embedding for the corresponding text.
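Because both batch methods return one embedding per row, a full image-to-text similarity table can be computed as the matrix product S = I Tᵀ, where I holds image embeddings and T holds text embeddings. With normalized rows, entry S[i, j] is the cosine similarity between image i and text j. The sketch uses double[,] in place of Matrix<T>, and the helper name SimilarityMatrix is hypothetical.

```csharp
using System;

// Rows are (assumed normalized) embeddings: 2 images, 3 texts, dimension 2.
double[,] images = { { 1.0, 0.0 }, { 0.6, 0.8 } };
double[,] texts  = { { 1.0, 0.0 }, { 0.0, 1.0 }, { 0.8, 0.6 } };

double[,] sim = SimilarityMatrix(images, texts);
// sim[1, 2] = 0.6 * 0.8 + 0.8 * 0.6, approximately 0.96
Console.WriteLine(sim[1, 2]);

// Computes A * B^T: entry [i, j] is the dot product of row i of A and row j of B.
static double[,] SimilarityMatrix(double[,] a, double[,] b)
{
    int dim = a.GetLength(1);
    if (b.GetLength(1) != dim) throw new ArgumentException("Embedding dimensions differ.");
    var result = new double[a.GetLength(0), b.GetLength(0)];
    for (int i = 0; i < a.GetLength(0); i++)
        for (int j = 0; j < b.GetLength(0); j++)
        {
            double sum = 0.0;
            for (int k = 0; k < dim; k++) sum += a[i, k] * b[j, k];
            result[i, j] = sum;
        }
    return result;
}
```

Each row of the result ranks all texts for one image; each column ranks all images for one text.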

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]: The preprocessed image data.
labels IEnumerable<string>: The candidate class labels.

Returns

Dictionary<string, T>: A dictionary mapping each label to its probability score.
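A common way to obtain such probabilities, used by CLIP, is to compute the similarity between the image embedding and each label's text embedding, multiply by a scale factor, and apply a softmax. The sketch below assumes that approach; it is not confirmed to be what AiDotNet does internally. The scale of 100 roughly matches CLIP's learned logit scale but is an assumption here, as is any prompt template such as "a photo of a {label}". The similarities are supplied directly rather than computed by a real model, and the helper name SoftmaxScores is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical cosine similarities between one image and each label's text embedding.
var similarities = new Dictionary<string, double>
{
    ["cat"] = 0.30,
    ["dog"] = 0.25,
    ["car"] = 0.10,
};

Dictionary<string, double> probs = SoftmaxScores(similarities, scale: 100.0);
foreach (var entry in probs) Console.WriteLine($"{entry.Key}: {entry.Value:F4}");

// Converts scaled similarities to probabilities that sum to 1 (numerically stable softmax).
static Dictionary<string, double> SoftmaxScores(Dictionary<string, double> sims, double scale)
{
    if (scale <= 0) throw new ArgumentOutOfRangeException(nameof(scale), "Scale must be positive.");
    double max = sims.Values.Max() * scale;  // subtract the largest logit so Math.Exp cannot overflow
    var exps = sims.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value * scale - max));
    double total = exps.Values.Sum();
    return exps.ToDictionary(kv => kv.Key, kv => kv.Value / total);
}
```

The scale sharpens the distribution: raw similarities of 0.30 and 0.25 are close, but after multiplying by 100 the softmax assigns "cat" over 99% of the probability.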
