Interface IMultimodalEmbedding<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for multimodal embedding models that can encode multiple modalities (text, images, audio) into a shared embedding space.

public interface IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for computations.

Remarks

Multimodal embedding models like CLIP (Contrastive Language-Image Pre-training) learn to project different types of data into the same vector space, enabling cross-modal similarity search and zero-shot classification.

For Beginners: Imagine you want to search for images using text queries. A multimodal model learns to convert both "a photo of a cat" and an actual cat image into similar vectors, allowing direct comparison between text and images.

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected side length, in pixels, of the square input image (ImageSize × ImageSize).

int ImageSize { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

int MaxSequenceLength { get; }

Property Value

int

Methods

ComputeSimilarity(Vector<T>, Vector<T>)

Computes the similarity between two embeddings.

T ComputeSimilarity(Vector<T> embedding1, Vector<T> embedding2)

Parameters

embedding1 Vector<T>

The first embedding.

embedding2 Vector<T>

The second embedding.

Returns

T

The similarity score. For normalized embeddings, this is the cosine similarity.
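To make the returned score concrete, the following is a minimal sketch of cosine similarity on plain double[] arrays. It does not use AiDotNet's Vector<T> and is not the library's implementation; the class and method names (SimilaritySketch, CosineSimilarity, Normalize) are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class SimilaritySketch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same length.");

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Cosine similarity is undefined for a zero vector;
        // returning 0 is a convention chosen for this sketch.
        if (normA == 0.0 || normB == 0.0)
            return 0.0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scales a vector to unit L2 length; a zero vector is returned unchanged.
    public static double[] Normalize(double[] v)
    {
        double sumSquares = 0.0;
        foreach (double x in v)
            sumSquares += x * x;

        double norm = Math.Sqrt(sumSquares);
        var result = new double[v.Length];
        for (int i = 0; i < v.Length; i++)
            result[i] = norm == 0.0 ? v[i] : v[i] / norm;
        return result;
    }
}
```

When both inputs already have unit length, the denominator is 1 and the cosine similarity reduces to the dot product, which is why models that return normalized embeddings can compare them cheaply.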

EncodeImage(double[])

Encodes an image into an embedding vector.

Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW (channel, height, width) order.

Returns

Vector<T>

A normalized embedding vector.
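Image decoders often produce interleaved pixel data (HWC, for example RGBRGB...), while this method expects planar CHW data. The sketch below shows one way to reorder such a buffer. It assumes the caller already has a flattened HWC array of doubles; the class and method names are hypothetical, and any resizing or value normalization a particular model requires is not covered because the source does not specify it.

```csharp
using System;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class ChwSketch
{
    // Reorders interleaved HWC data into planar CHW order:
    // all values of channel 0 first, then channel 1, and so on.
    public static double[] HwcToChw(double[] hwc, int height, int width, int channels)
    {
        if (hwc.Length != height * width * channels)
            throw new ArgumentException("Buffer length does not match the given dimensions.");

        var chw = new double[hwc.Length];
        for (int c = 0; c < channels; c++)
        {
            for (int y = 0; y < height; y++)
            {
                for (int x = 0; x < width; x++)
                {
                    int source = (y * width + x) * channels + c;    // HWC index
                    int target = c * height * width + y * width + x; // CHW index
                    chw[target] = hwc[source];
                }
            }
        }
        return chw;
    }
}
```

For a square input, height and width would both be the model's ImageSize; the channel count (commonly 3 for RGB) is an assumption that depends on the concrete implementation.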

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.
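A typical use of a batch of image embeddings is cross-modal retrieval: scoring every row against one text embedding and ranking the images. The sketch below illustrates the idea with a jagged double[][] standing in for Matrix<T>, since the Matrix<T> API is not described here. It assumes all embeddings are normalized, so the dot product serves as the similarity; the names RetrievalSketch and RankRows are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class RetrievalSketch
{
    // Returns row indices ordered from most to least similar to the query.
    // Assumes rows and query are unit-length, so dot product == cosine similarity.
    public static int[] RankRows(double[][] rows, double[] query)
    {
        double[] scores = rows.Select(row => Dot(row, query)).ToArray();
        return Enumerable.Range(0, rows.Length)
                         .OrderByDescending(i => scores[i])
                         .ToArray();
    }

    private static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
```

The first index in the result identifies the image whose embedding lies closest to the text query in the shared space.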

EncodeText(string)

Encodes text into an embedding vector.

Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
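The interface does not state how the probability scores are derived. In CLIP-style models, a common approach is to compute the similarity between the image embedding and each label's text embedding, multiply by a scale factor, and apply a softmax. The sketch below shows that final step only, on plain doubles. The names ZeroShotSketch and ToProbabilities and the scale parameter are hypothetical, and an actual implementation may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class ZeroShotSketch
{
    // Converts per-label similarities into probabilities with a softmax.
    // 'scale' is an assumed logit multiplier; larger values sharpen the distribution.
    public static Dictionary<string, double> ToProbabilities(
        IReadOnlyList<string> labels,
        IReadOnlyList<double> similarities,
        double scale)
    {
        if (labels.Count != similarities.Count)
            throw new ArgumentException("Each label needs exactly one similarity.");
        if (labels.Count == 0)
            return new Dictionary<string, double>();

        // Subtract the maximum logit before exponentiating for numerical stability.
        double maxLogit = double.NegativeInfinity;
        for (int i = 0; i < similarities.Count; i++)
            maxLogit = Math.Max(maxLogit, similarities[i] * scale);

        var exps = new double[labels.Count];
        double total = 0.0;
        for (int i = 0; i < labels.Count; i++)
        {
            exps[i] = Math.Exp(similarities[i] * scale - maxLogit);
            total += exps[i];
        }

        // Duplicate labels overwrite earlier entries in this sketch.
        var result = new Dictionary<string, double>();
        for (int i = 0; i < labels.Count; i++)
            result[labels[i]] = exps[i] / total;
        return result;
    }
}
```

Because of the softmax, the scores sum to 1 across the supplied labels, so each value is relative to the candidate set rather than an absolute confidence.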