Interface IImageBindModel<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Defines the contract for ImageBind models that bind multiple modalities (6+) into a shared embedding space.
public interface IImageBindModel<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
ImageBind learns a joint embedding space across six modalities: images, text, audio, depth, thermal, and IMU data. It uses images as a binding modality - since web data contains many (image, text) pairs, (image, audio) pairs from videos, etc., the model can learn cross-modal relationships even without direct pairs between all modalities.
For Beginners: ImageBind connects ALL types of data together!
The breakthrough insight:
- Images are paired with many things: text captions, video audio, depth sensors, etc.
- By learning all these (image, X) pairs, images become a "bridge"
- Now audio and text can be compared, even without audio-text training data!
Six modalities in one model (video is processed as a sequence of image frames, so it shares the visual modality with images):
- Images: Regular RGB photos
- Text: Natural language descriptions
- Audio: Sound waveforms (speech, music, effects)
- Video: Moving images (sequences of frames)
- Thermal: Heat maps from infrared cameras
- Depth: 3D distance information
- IMU: Motion sensor data (accelerometer, gyroscope)
Why this matters:
- Search audio by describing sounds in text
- Find images that match a piece of music
- Match thermal images to regular photos
- Universal multimodal understanding!
Properties
EmbeddingDimension
Gets the dimensionality of the shared embedding space.
int EmbeddingDimension { get; }
Property Value
- int
SupportedModalities
Gets the list of supported modalities.
IReadOnlyList<ModalityType> SupportedModalities { get; }
Property Value
- IReadOnlyList<ModalityType>
Methods
ComputeAlignment(ModalityType, object, ModalityType, object)
Computes the alignment between two modalities given paired data.
(T AlignmentScore, Dictionary<string, object> Details) ComputeAlignment(ModalityType modality1, object data1, ModalityType modality2, object data2)
Parameters
- modality1 (ModalityType): First modality type.
- data1 (object): Data from the first modality.
- modality2 (ModalityType): Second modality type.
- data2 (object): Data from the second modality.
Returns
- (T AlignmentScore, Dictionary<string, object> Details)
Alignment score and optional alignment details.
ComputeCrossModalSimilarity(Vector<T>, Vector<T>)
Computes similarity between embeddings from any two modalities.
T ComputeCrossModalSimilarity(Vector<T> embedding1, Vector<T> embedding2)
Parameters
- embedding1 (Vector<T>): First embedding vector.
- embedding2 (Vector<T>): Second embedding vector.
Returns
- T
Cosine similarity score in range [-1, 1].
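The score described above follows the standard cosine similarity formula: the dot product of the two vectors divided by the product of their lengths. The following is a minimal sketch of that formula on plain double[] arrays. The class name CosineSketch and the zero-vector handling are assumptions made for illustration; this is not the AiDotNet implementation and it does not use Vector<T>.

```csharp
using System;

public static class CosineSketch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Hypothetical stand-in on double[]; not the library's code.
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Assumption: a zero vector yields similarity 0 instead of a division by zero.
        if (normA == 0.0 || normB == 0.0) return 0.0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1, which matches the documented [-1, 1] range.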
ComputeEmergentAudioTextSimilarity(Tensor<T>, string)
Computes the similarity between an audio clip and a text description, a cross-modal relationship that emerges without explicit (audio, text) pairing during training.
T ComputeEmergentAudioTextSimilarity(Tensor<T> audio, string text)
Parameters
- audio (Tensor<T>): Audio waveform.
- text (string): Text description.
Returns
- T
Similarity score between audio and text.
Remarks
For Beginners: The magic of ImageBind!
Even though ImageBind was never trained on (audio, text) pairs directly, it can still compare them through the shared embedding space!
This works because:
- Audio is aligned to images (from video)
- Text is aligned to images (from captions)
- Therefore, audio and text become implicitly aligned!
"Emergent" means this capability appeared without explicit training.
CrossModalRetrieval(Vector<T>, IEnumerable<Vector<T>>, int)
Performs cross-modal retrieval from one modality to another.
IEnumerable<(int Index, T Score)> CrossModalRetrieval(Vector<T> queryEmbedding, IEnumerable<Vector<T>> targetEmbeddings, int topK = 10)
Parameters
- queryEmbedding (Vector<T>): Query embedding from the source modality.
- targetEmbeddings (IEnumerable<Vector<T>>): Database of embeddings from the target modality.
- topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices and scores of most similar items.
Remarks
For Beginners: Search across different types of data!
Examples:
- Audio → Images: "Find images that match this sound"
- Text → Audio: "Find sounds matching 'thunderstorm'"
- Thermal → RGB: "Find color photos of this heat signature"
- IMU → Video: "Find videos of people doing this motion"
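Because all modalities share one embedding space, retrieval of this kind reduces to scoring every target against the query and keeping the best topK. The sketch below illustrates that idea on plain double[] arrays. It assumes the embeddings are already unit length, so the dot product equals cosine similarity; the names RetrievalSketch and TopK are hypothetical and this is not the library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RetrievalSketch
{
    // Scores every target against the query and returns the topK best,
    // highest score first. Assumes unit-length embeddings (dot == cosine).
    public static List<(int Index, double Score)> TopK(
        double[] query, IEnumerable<double[]> targets, int topK = 10)
    {
        return targets
            .Select((t, i) => (Index: i, Score: Dot(query, t)))
            .OrderByDescending(r => r.Score)
            .Take(topK)
            .ToList();
    }

    private static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double sum = 0.0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
```

The returned Index values refer to positions in the targets sequence, mirroring the (int Index, T Score) shape of the documented return type.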
FindBestMatch(ModalityType, object, IEnumerable<(ModalityType Modality, object Data)>)
Finds the best matching modality representation for a query.
(ModalityType Modality, object Data, T Score) FindBestMatch(ModalityType queryModality, object queryData, IEnumerable<(ModalityType Modality, object Data)> candidates)
Parameters
- queryModality (ModalityType): Type of the query data.
- queryData (object): The query data.
- candidates (IEnumerable<(ModalityType Modality, object Data)>): Sequence of (modality, data) candidates.
Returns
- (ModalityType Modality, object Data, T Score)
Best matching candidate with its similarity score.
FuseModalities(Dictionary<ModalityType, Vector<T>>, string)
Performs multimodal fusion by combining embeddings from multiple modalities.
Vector<T> FuseModalities(Dictionary<ModalityType, Vector<T>> modalityEmbeddings, string fusionMethod = "mean")
Parameters
- modalityEmbeddings (Dictionary<ModalityType, Vector<T>>): Dictionary of (modality, embedding) pairs.
- fusionMethod (string): Method for combining: "mean", "concat", "attention".
Returns
- Vector<T>
Fused embedding vector.
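As an illustration of the default "mean" option, the sketch below averages several embeddings element by element and then rescales the result to unit length. The final L2 renormalization is an assumption (the page does not say whether the fused vector is renormalized), and the names FusionSketch and MeanFuse are hypothetical; this is not the library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FusionSketch
{
    // Element-wise mean of the input embeddings, followed by L2 renormalization.
    // The renormalization step is an assumption, not documented behavior.
    public static double[] MeanFuse(IEnumerable<double[]> embeddings)
    {
        var list = embeddings.ToList();
        if (list.Count == 0)
            throw new ArgumentException("At least one embedding is required.");

        int dim = list[0].Length;
        var fused = new double[dim];
        foreach (var e in list)
        {
            if (e.Length != dim)
                throw new ArgumentException("All embeddings must have the same dimension.");
            for (int i = 0; i < dim; i++) fused[i] += e[i] / list.Count;
        }

        double norm = Math.Sqrt(fused.Sum(x => x * x));
        if (norm > 0.0)
            for (int i = 0; i < dim; i++) fused[i] /= norm;

        return fused;
    }
}
```

Averaging keeps the fused vector in the same dimension as its inputs, so it can still be compared against single-modality embeddings.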
GenerateDescriptions(ModalityType, object, IEnumerable<string>, int)
Produces text descriptions for data from a non-text modality by ranking a pool of candidate descriptions.
IEnumerable<(string Description, T Score)> GenerateDescriptions(ModalityType modality, object data, IEnumerable<string> candidateDescriptions, int topK = 5)
Parameters
- modality (ModalityType): The input modality type.
- data (object): The data to describe.
- candidateDescriptions (IEnumerable<string>): Pool of possible descriptions.
- topK (int): Number of best descriptions to return.
Returns
- IEnumerable<(string Description, T Score)>
Best matching descriptions with scores.
GetAudioEmbedding(Tensor<T>, int)
Converts audio into a shared embedding vector.
Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate = 16000)
Parameters
- audioWaveform (Tensor<T>): Audio waveform tensor [samples] or [channels, samples].
- sampleRate (int): Audio sample rate in Hz.
Returns
- Vector<T>
Normalized embedding vector.
Remarks
For Beginners: Convert sound into the same vector space as images and text!
This allows:
- Find images that match a sound (bird chirping → bird photos)
- Search audio with text ("dog barking" → actual barking sounds)
- Compare different sounds for similarity
GetDepthEmbedding(Tensor<T>)
Converts a depth map into a shared embedding vector.
Vector<T> GetDepthEmbedding(Tensor<T> depthMap)
Parameters
- depthMap (Tensor<T>): Depth map tensor [height, width] with distance values.
Returns
- Vector<T>
Normalized embedding vector.
Remarks
Depth maps represent 3D structure. ImageBind can find RGB images with similar spatial structure or match depth maps to text descriptions.
GetEmbedding(ModalityType, object)
Gets embedding for any supported modality using a generic interface.
Vector<T> GetEmbedding(ModalityType modality, object data)
Parameters
- modality (ModalityType): The type of modality.
- data (object): The data to embed (type depends on modality).
Returns
- Vector<T>
Normalized embedding vector.
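The embedding methods on this page all return a "normalized" vector. A common reading, assumed here rather than stated by the page, is L2 normalization: each component is divided by the vector's Euclidean length so the result has length 1. The sketch below shows that operation on a plain double[]; the names NormalizeSketch and L2Normalize are hypothetical.

```csharp
using System;

public static class NormalizeSketch
{
    // Returns a copy of v scaled to unit Euclidean length.
    // Assumption: "normalized" means L2-normalized; a zero vector is returned unchanged.
    public static double[] L2Normalize(double[] v)
    {
        double sumSquares = 0.0;
        foreach (double x in v) sumSquares += x * x;

        double norm = Math.Sqrt(sumSquares);
        var result = new double[v.Length];
        for (int i = 0; i < v.Length; i++)
            result[i] = norm > 0.0 ? v[i] / norm : v[i];

        return result;
    }
}
```

With unit-length vectors, a plain dot product gives the same value as cosine similarity, which keeps cross-modal comparisons cheap.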
GetIMUEmbedding(Tensor<T>)
Converts IMU sensor data into a shared embedding vector.
Vector<T> GetIMUEmbedding(Tensor<T> imuData)
Parameters
- imuData (Tensor<T>): IMU readings [timesteps, 6] for accelerometer and gyroscope (x, y, z each).
Returns
- Vector<T>
Normalized embedding vector.
Remarks
For Beginners: IMU is the motion sensor in your phone!
IMU captures movement patterns:
- Walking, running, jumping
- Phone gestures
- Device orientation
ImageBind can match these motions to videos or text descriptions!
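To make the [timesteps, 6] layout concrete, the sketch below packs separate accelerometer and gyroscope samples into a single 2D array with one row per timestep. The column order (accelerometer x, y, z followed by gyroscope x, y, z) is an assumption based on the order in the parameter description, and the result is a plain double[,] rather than a Tensor<T>. The names ImuLayoutSketch and Pack are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ImuLayoutSketch
{
    // Builds a [timesteps, 6] array: columns 0-2 hold accelerometer x, y, z,
    // columns 3-5 hold gyroscope x, y, z (assumed order).
    public static double[,] Pack(
        IReadOnlyList<(double X, double Y, double Z)> accel,
        IReadOnlyList<(double X, double Y, double Z)> gyro)
    {
        if (accel.Count != gyro.Count)
            throw new ArgumentException("Accelerometer and gyroscope must have the same number of samples.");

        var data = new double[accel.Count, 6];
        for (int t = 0; t < accel.Count; t++)
        {
            data[t, 0] = accel[t].X;
            data[t, 1] = accel[t].Y;
            data[t, 2] = accel[t].Z;
            data[t, 3] = gyro[t].X;
            data[t, 4] = gyro[t].Y;
            data[t, 5] = gyro[t].Z;
        }
        return data;
    }
}
```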
GetImageEmbedding(Tensor<T>)
Converts an image into a shared embedding vector.
Vector<T> GetImageEmbedding(Tensor<T> image)
Parameters
- image (Tensor<T>): Preprocessed image tensor [channels, height, width].
Returns
- Vector<T>
Normalized embedding vector.
GetTextEmbedding(string)
Converts text into a shared embedding vector.
Vector<T> GetTextEmbedding(string text)
Parameters
- text (string): Text string to embed.
Returns
- Vector<T>
Normalized embedding vector.
GetThermalEmbedding(Tensor<T>)
Converts a thermal image into a shared embedding vector.
Vector<T> GetThermalEmbedding(Tensor<T> thermalImage)
Parameters
- thermalImage (Tensor<T>): Thermal/infrared image tensor.
Returns
- Vector<T>
Normalized embedding vector.
Remarks
Thermal images capture heat signatures. ImageBind can match thermal images to their RGB counterparts or find related audio/text.
GetVideoEmbedding(IEnumerable<Tensor<T>>)
Converts a video into a shared embedding vector.
Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)
Parameters
- frames (IEnumerable<Tensor<T>>): Video frames as a sequence of image tensors.
Returns
- Vector<T>
Normalized embedding vector.
ZeroShotClassify(ModalityType, object, IEnumerable<string>)
Performs zero-shot classification across modalities.
Dictionary<string, T> ZeroShotClassify(ModalityType modality, object data, IEnumerable<string> classLabels)
Parameters
- modality (ModalityType): The modality of the input data.
- data (object): The data to classify.
- classLabels (IEnumerable<string>): Text labels for classification.
Returns
- Dictionary<string, T>
Dictionary mapping labels to probability scores.
Remarks
Works for any supported modality - classify audio by text labels, classify thermal images, classify motion patterns, etc.
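One common way to turn similarities into label probabilities, used in CLIP-style zero-shot classification, is a softmax over scaled similarity scores between the input embedding and each label's text embedding. The sketch below shows that approach on plain double[] arrays. It assumes unit-length embeddings, and the scale factor of 100 is an illustrative assumption; the page does not say how this interface computes its probabilities. The names ZeroShotSketch and Classify are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZeroShotSketch
{
    // Softmax over scaled dot products between the input embedding and each
    // label embedding. Assumes unit-length embeddings; scale is an assumption.
    public static Dictionary<string, double> Classify(
        double[] input, Dictionary<string, double[]> labelEmbeddings, double scale = 100.0)
    {
        if (labelEmbeddings.Count == 0)
            throw new ArgumentException("At least one label is required.");

        var logits = labelEmbeddings.ToDictionary(
            kv => kv.Key, kv => scale * Dot(input, kv.Value));

        // Subtract the maximum logit before exponentiating for numerical stability.
        double max = logits.Values.Max();
        var exps = logits.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value - max));
        double sum = exps.Values.Sum();

        return exps.ToDictionary(kv => kv.Key, kv => kv.Value / sum);
    }

    private static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double s = 0.0;
        for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
        return s;
    }
}
```

The returned probabilities sum to 1, and the label whose embedding is closest to the input receives the highest value.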