Interface IImageBindModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for ImageBind models that bind multiple modalities (6+) into a shared embedding space.

public interface IImageBindModel<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

ImageBind learns a joint embedding space across six modalities: images, text, audio, depth, thermal, and IMU data. It uses images as the binding modality: because web data contains many (image, text) pairs, (image, audio) pairs from videos, and so on, the model can learn cross-modal relationships even without direct pairs between every two modalities.

For Beginners: ImageBind connects ALL types of data together!

The breakthrough insight:

  • Images are paired with many things: text captions, video audio, depth sensors, etc.
  • By learning all these (image, X) pairs, images become a "bridge"
  • Now audio and text can be compared, even without audio-text training data!

Six modalities in one model:

  1. Images: Regular RGB photos
  2. Text: Natural language descriptions
  3. Audio: Sound waveforms (speech, music, effects)
  4. Depth: 3D distance information
  5. Thermal: Heat maps from infrared cameras
  6. IMU: Motion sensor data (accelerometer, gyroscope)

Video (sequences of frames) is also supported, via GetVideoEmbedding.

Why this matters:

  • Search audio by describing sounds in text
  • Find images that match a piece of music
  • Match thermal images to regular photos
  • Universal multimodal understanding!

Properties

EmbeddingDimension

Gets the dimensionality of the shared embedding space.

int EmbeddingDimension { get; }

Property Value

int

SupportedModalities

Gets the list of supported modalities.

IReadOnlyList<ModalityType> SupportedModalities { get; }

Property Value

IReadOnlyList<ModalityType>

Methods

ComputeAlignment(ModalityType, object, ModalityType, object)

Computes the alignment between two modalities given paired data.

(T AlignmentScore, Dictionary<string, object> Details) ComputeAlignment(ModalityType modality1, object data1, ModalityType modality2, object data2)

Parameters

modality1 ModalityType

First modality type.

data1 object

Data from first modality.

modality2 ModalityType

Second modality type.

data2 object

Data from second modality.

Returns

(T AlignmentScore, Dictionary<string, object> Details)

Alignment score and optional alignment details.

ComputeCrossModalSimilarity(Vector<T>, Vector<T>)

Computes similarity between embeddings from any two modalities.

T ComputeCrossModalSimilarity(Vector<T> embedding1, Vector<T> embedding2)

Parameters

embedding1 Vector<T>

First embedding vector.

embedding2 Vector<T>

Second embedding vector.

Returns

T

Cosine similarity score in range [-1, 1].
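For reference, cosine similarity can be sketched in a few lines. This is a stand-alone illustration over plain double[] arrays, not the library's Vector&lt;T&gt; implementation:

```csharp
using System;

static class CosineDemo
{
    // Cosine similarity: dot(a, b) / (||a|| * ||b||), always in [-1, 1].
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    static void Main()
    {
        double[] x = { 1, 0, 0 };
        double[] y = { 0, 1, 0 };
        Console.WriteLine(CosineSimilarity(x, x)); // 1 (same direction)
        Console.WriteLine(CosineSimilarity(x, y)); // 0 (orthogonal)
    }
}
```

Because the Get*Embedding methods return normalized vectors, the denominator is 1 and the similarity reduces to a plain dot product.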

ComputeEmergentAudioTextSimilarity(Tensor<T>, string)

Computes audio-text similarity, an emergent cross-modal relationship learned without explicit (audio, text) pairing.

T ComputeEmergentAudioTextSimilarity(Tensor<T> audio, string text)

Parameters

audio Tensor<T>

Audio waveform.

text string

Text description.

Returns

T

Similarity score between audio and text.

Remarks

For Beginners: The magic of ImageBind!

Even though ImageBind was never trained on (audio, text) pairs directly, it can still compare them through the shared embedding space!

This works because:

  • Audio is aligned to images (from video)
  • Text is aligned to images (from captions)
  • Therefore, audio and text become implicitly aligned!

Here, "emergent" means the capability appeared without being explicitly trained for.
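A toy numeric sketch of this transitivity (the vectors below are hand-picked for illustration, not real model output): audio and text embeddings that were each pulled toward the same image anchor end up close to each other as well.

```csharp
using System;

static class EmergentDemo
{
    // Plain cosine similarity over double[] arrays.
    public static double Cos(double[] u, double[] v)
    {
        double dot = 0, nu = 0, nv = 0;
        for (int k = 0; k < u.Length; k++) { dot += u[k] * v[k]; nu += u[k] * u[k]; nv += v[k] * v[k]; }
        return dot / (Math.Sqrt(nu) * Math.Sqrt(nv));
    }

    static void Main()
    {
        // Toy embeddings: audio and text were each aligned to the image anchor,
        // never to each other, yet they end up close in the shared space.
        double[] image = { 1.0, 0.0, 0.0 };
        double[] audio = { 0.9, 0.1, 0.0 };  // trained toward image
        double[] text  = { 0.9, 0.0, 0.1 };  // trained toward image

        Console.WriteLine($"audio-image: {Cos(audio, image):F3}");
        Console.WriteLine($"text-image:  {Cos(text, image):F3}");
        Console.WriteLine($"audio-text:  {Cos(audio, text):F3}"); // high, though never paired
    }
}
```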

CrossModalRetrieval(Vector<T>, IEnumerable<Vector<T>>, int)

Performs cross-modal retrieval from one modality to another.

IEnumerable<(int Index, T Score)> CrossModalRetrieval(Vector<T> queryEmbedding, IEnumerable<Vector<T>> targetEmbeddings, int topK = 10)

Parameters

queryEmbedding Vector<T>

Query embedding from source modality.

targetEmbeddings IEnumerable<Vector<T>>

Database of embeddings from target modality.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of most similar items.

Remarks

For Beginners: Search across different types of data!

Examples:

  • Audio → Images: "Find images that match this sound"
  • Text → Audio: "Find sounds matching 'thunderstorm'"
  • Thermal → RGB: "Find color photos of this heat signature"
  • IMU → Video: "Find videos of people doing this motion"
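The retrieval loop itself can be sketched stand-alone (plain double[] embeddings and LINQ, rather than the library's Vector&lt;T&gt;): score every target against the query, sort descending, keep the top K.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RetrievalDemo
{
    // Rank targets by dot product with the query; embeddings are assumed
    // unit-normalized, so the dot product equals cosine similarity.
    public static IEnumerable<(int Index, double Score)> TopK(
        double[] query, IReadOnlyList<double[]> targets, int topK = 10)
    {
        return targets
            .Select((t, i) => (Index: i, Score: t.Zip(query, (x, y) => x * y).Sum()))
            .OrderByDescending(p => p.Score)
            .Take(topK);
    }

    static void Main()
    {
        double[] query = { 1, 0 };
        var db = new List<double[]> { new[] { 0.0, 1.0 }, new[] { 1.0, 0.0 }, new[] { 0.7, 0.7 } };
        foreach (var (index, score) in TopK(query, db, 2))
            Console.WriteLine($"#{index}: {score:F2}");
        // Best match is index 1 (same direction), then index 2.
    }
}
```

Because query and targets only meet through their embeddings, the same loop serves every modality pairing listed above.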

FindBestMatch(ModalityType, object, IEnumerable<(ModalityType Modality, object Data)>)

Finds the best matching modality representation for a query.

(ModalityType Modality, object Data, T Score) FindBestMatch(ModalityType queryModality, object queryData, IEnumerable<(ModalityType Modality, object Data)> candidates)

Parameters

queryModality ModalityType

Type of the query data.

queryData object

The query data.

candidates IEnumerable<(ModalityType Modality, object Data)>

Sequence of (modality, data) candidate pairs.

Returns

(ModalityType Modality, object Data, T Score)

Best matching candidate with its similarity score.

FuseModalities(Dictionary<ModalityType, Vector<T>>, string)

Performs multimodal fusion by combining embeddings from multiple modalities.

Vector<T> FuseModalities(Dictionary<ModalityType, Vector<T>> modalityEmbeddings, string fusionMethod = "mean")

Parameters

modalityEmbeddings Dictionary<ModalityType, Vector<T>>

Dictionary of (modality, embedding) pairs.

fusionMethod string

Method for combining: "mean", "concat", "attention".

Returns

Vector<T>

Fused embedding vector.
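The "mean" strategy can be sketched stand-alone. The final re-normalization step here is an assumption, mirroring the normalized embeddings the interface returns elsewhere:

```csharp
using System;
using System.Collections.Generic;

static class FusionDemo
{
    // "mean" fusion: average the embeddings element-wise, then re-normalize
    // so the fused vector lives on the same unit sphere as the inputs.
    public static double[] FuseMean(IReadOnlyList<double[]> embeddings)
    {
        int dim = embeddings[0].Length;
        var fused = new double[dim];
        foreach (var e in embeddings)
            for (int i = 0; i < dim; i++) fused[i] += e[i];

        double norm = 0;
        for (int i = 0; i < dim; i++) { fused[i] /= embeddings.Count; norm += fused[i] * fused[i]; }
        norm = Math.Sqrt(norm);
        for (int i = 0; i < dim; i++) fused[i] /= norm;
        return fused;
    }

    static void Main()
    {
        var fused = FuseMean(new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } });
        Console.WriteLine($"({fused[0]:F3}, {fused[1]:F3})"); // roughly (0.707, 0.707)
    }
}
```

Note that "concat" would instead change the output dimensionality, so fused vectors from different methods are not directly comparable.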

GenerateDescriptions(ModalityType, object, IEnumerable<string>, int)

Selects the best-matching text descriptions for non-text modalities from a pool of candidates.

IEnumerable<(string Description, T Score)> GenerateDescriptions(ModalityType modality, object data, IEnumerable<string> candidateDescriptions, int topK = 5)

Parameters

modality ModalityType

The input modality type.

data object

The data to describe.

candidateDescriptions IEnumerable<string>

Pool of possible descriptions.

topK int

Number of best descriptions to return.

Returns

IEnumerable<(string Description, T Score)>

Best matching descriptions with scores.

GetAudioEmbedding(Tensor<T>, int)

Converts audio into a shared embedding vector.

Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate = 16000)

Parameters

audioWaveform Tensor<T>

Audio waveform tensor [samples] or [channels, samples].

sampleRate int

Audio sample rate in Hz.

Returns

Vector<T>

Normalized embedding vector.

Remarks

For Beginners: Convert sound into the same vector space as images and text!

This allows:

  • Find images that match a sound (bird chirping → bird photos)
  • Search audio with text ("dog barking" → actual barking sounds)
  • Compare different sounds for similarity

GetDepthEmbedding(Tensor<T>)

Converts depth map into a shared embedding vector.

Vector<T> GetDepthEmbedding(Tensor<T> depthMap)

Parameters

depthMap Tensor<T>

Depth map tensor [height, width] with distance values.

Returns

Vector<T>

Normalized embedding vector.

Remarks

Depth maps represent 3D structure. ImageBind can find RGB images with similar spatial structure or match to text descriptions.

GetEmbedding(ModalityType, object)

Gets embedding for any supported modality using a generic interface.

Vector<T> GetEmbedding(ModalityType modality, object data)

Parameters

modality ModalityType

The type of modality.

data object

The data to embed (type depends on modality).

Returns

Vector<T>

Normalized embedding vector.

GetIMUEmbedding(Tensor<T>)

Converts IMU sensor data into a shared embedding vector.

Vector<T> GetIMUEmbedding(Tensor<T> imuData)

Parameters

imuData Tensor<T>

IMU readings [timesteps, 6] for accelerometer and gyroscope (x,y,z each).

Returns

Vector<T>

Normalized embedding vector.

Remarks

For Beginners: IMU is the motion sensor in your phone!

IMU captures movement patterns:

  • Walking, running, jumping
  • Phone gestures
  • Device orientation

ImageBind can match these motions to videos or text descriptions!

GetImageEmbedding(Tensor<T>)

Converts an image into a shared embedding vector.

Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Preprocessed image tensor [channels, height, width].

Returns

Vector<T>

Normalized embedding vector.

GetTextEmbedding(string)

Converts text into a shared embedding vector.

Vector<T> GetTextEmbedding(string text)

Parameters

text string

Text string to embed.

Returns

Vector<T>

Normalized embedding vector.

GetThermalEmbedding(Tensor<T>)

Converts thermal image into a shared embedding vector.

Vector<T> GetThermalEmbedding(Tensor<T> thermalImage)

Parameters

thermalImage Tensor<T>

Thermal/infrared image tensor.

Returns

Vector<T>

Normalized embedding vector.

Remarks

Thermal images capture heat signatures. ImageBind can match thermal images to their RGB counterparts or find related audio/text.

GetVideoEmbedding(IEnumerable<Tensor<T>>)

Converts video into a shared embedding vector.

Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames as sequence of image tensors.

Returns

Vector<T>

Normalized embedding vector.

ZeroShotClassify(ModalityType, object, IEnumerable<string>)

Performs zero-shot classification across modalities.

Dictionary<string, T> ZeroShotClassify(ModalityType modality, object data, IEnumerable<string> classLabels)

Parameters

modality ModalityType

The modality of the input data.

data object

The data to classify.

classLabels IEnumerable<string>

Text labels for classification.

Returns

Dictionary<string, T>

Dictionary mapping labels to probability scores.

Remarks

Works for any supported modality - classify audio by text labels, classify thermal images, classify motion patterns, etc.
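The exact scoring is implementation-defined, but a common CLIP-style recipe embeds each text label, computes its similarity to the data embedding, and applies a softmax to obtain probabilities. A stand-alone sketch of that last step (the similarity numbers below are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ZeroShotDemo
{
    // Softmax over per-label similarity scores turns raw similarities into a
    // probability distribution over the candidate labels.
    public static Dictionary<string, double> Softmax(Dictionary<string, double> sims)
    {
        double max = sims.Values.Max();  // subtract the max for numerical stability
        var exps = sims.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value - max));
        double total = exps.Values.Sum();
        return exps.ToDictionary(kv => kv.Key, kv => kv.Value / total);
    }

    static void Main()
    {
        // Hypothetical similarities between an audio clip and each text label.
        var sims = new Dictionary<string, double>
        {
            ["a dog barking"] = 0.82,
            ["rain falling"] = 0.31,
            ["a car engine"] = 0.15,
        };
        foreach (var (label, p) in Softmax(sims))
            Console.WriteLine($"{label}: {p:F3}");
    }
}
```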