Class ImageBindNeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

ImageBind neural network for binding multiple modalities (6+) into a shared embedding space.

public class ImageBindNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IImageBindModel<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object ← NeuralNetworkBase<T> ← ImageBindNeuralNetwork<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IImageBindModel<T>

Remarks

ImageBind learns a joint embedding space across multiple modalities: images, text, audio, depth, thermal, and IMU data. Images serve as the binding modality: because web-scale data contains abundant (image, text) pairs from captions and (image, audio) pairs from videos, the model can learn cross-modal relationships even between modalities that were never directly paired.

For Beginners: ImageBind connects ALL types of data together!

Architecture overview:

  1. Modality-Specific Encoders: Each modality has its own encoder (ViT for images, Transformer for text, etc.)
  2. Projection Heads: Map each modality's features to the shared embedding space
  3. Contrastive Learning: Align modalities using image as the bridge modality

Key capabilities:

  • Cross-modal retrieval: Find images matching audio, text matching video, etc.
  • Zero-shot classification: Classify any modality using text labels
  • Emergent alignment: Compare modalities never directly paired during training
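
Example

A minimal sketch of cross-modal comparison, assuming model is an already-constructed ImageBindNeuralNetwork<double> and image and audio hold preprocessed tensors (preprocessing is not shown):

// Embed an image, a sound, and a text string into the shared space.
Vector<double> imageEmbedding = model.GetImageEmbedding(image);   // [channels, height, width]
Vector<double> audioEmbedding = model.GetAudioEmbedding(audio);   // raw waveform
Vector<double> textEmbedding = model.GetTextEmbedding("a dog barking");

// All three live in the same embedding space, so any pair can be compared.
double imageVsText = model.ComputeCrossModalSimilarity(imageEmbedding, textEmbedding);
double audioVsText = model.ComputeCrossModalSimilarity(audioEmbedding, textEmbedding);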

Constructors

ImageBindNeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, int, int, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an ImageBind network using native library layers.

public ImageBindNeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 49408, int maxSequenceLength = 77, int embeddingDimension = 1024, int hiddenDim = 1280, int numEncoderLayers = 32, int numHeads = 16, int audioSampleRate = 16000, int audioMaxDuration = 10, int imuTimesteps = 2000, int numVideoFrames = 2, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The architecture definition for the network.

imageSize int

Input image resolution in pixels. Default: 224.

channels int

Number of image channels. Default: 3 (RGB).

patchSize int

Patch size for the vision transformer encoder. Default: 14.

vocabularySize int

Size of the text tokenizer vocabulary. Default: 49408.

maxSequenceLength int

Maximum number of text tokens per input. Default: 77.

embeddingDimension int

Dimensionality of the shared embedding space. Default: 1024.

hiddenDim int

Hidden dimension of the encoder layers. Default: 1280.

numEncoderLayers int

Number of transformer encoder layers. Default: 32.

numHeads int

Number of attention heads. Default: 16.

audioSampleRate int

Expected audio sample rate in Hz. Default: 16000.

audioMaxDuration int

Maximum audio clip duration in seconds. Default: 10.

imuTimesteps int

Number of IMU timesteps per input. Default: 2000.

numVideoFrames int

Number of video frames sampled per clip. Default: 2.

tokenizer ITokenizer

Optional tokenizer for text input.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer used during training.

lossFunction ILossFunction<T>

Optional loss function used during training.
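
Example

A sketch of constructing the native-layer network, keeping most defaults. The architecture value is a placeholder; configure it for your own data:

var architecture = new NeuralNetworkArchitecture<double>(/* configure for your data */);
var model = new ImageBindNeuralNetwork<double>(
    architecture,
    embeddingDimension: 1024,
    numEncoderLayers: 12);   // smaller than the default 32, for faster experiments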

ImageBindNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, int, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an ImageBind network using pretrained ONNX models.

public ImageBindNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string imageEncoderPath, string textEncoderPath, string audioEncoderPath, ITokenizer tokenizer, int embeddingDimension = 1024, int maxSequenceLength = 77, int imageSize = 224, int audioSampleRate = 16000, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The architecture definition for the network.

imageEncoderPath string

Path to the pretrained image encoder ONNX file.

textEncoderPath string

Path to the pretrained text encoder ONNX file.

audioEncoderPath string

Path to the pretrained audio encoder ONNX file.

tokenizer ITokenizer

Tokenizer used to convert text into token IDs.

embeddingDimension int

Dimensionality of the shared embedding space. Default: 1024.

maxSequenceLength int

Maximum number of text tokens per input. Default: 77.

imageSize int

Input image resolution in pixels. Default: 224.

audioSampleRate int

Expected audio sample rate in Hz. Default: 16000.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer used during training.

lossFunction ILossFunction<T>

Optional loss function used during training.
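
Example

A sketch of loading pretrained ONNX encoders. The file paths, the architecture, and the tokenizer variable are placeholders for your own exported models and ITokenizer implementation:

var model = new ImageBindNeuralNetwork<double>(
    architecture,
    imageEncoderPath: "models/image_encoder.onnx",   // placeholder path
    textEncoderPath: "models/text_encoder.onnx",     // placeholder path
    audioEncoderPath: "models/audio_encoder.onnx",   // placeholder path
    tokenizer: tokenizer);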

Properties

EmbeddingDimension

Gets the dimensionality of the shared embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.

Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.

SupportedModalities

Gets the list of supported modalities.

public IReadOnlyList<ModalityType> SupportedModalities { get; }

Property Value

IReadOnlyList<ModalityType>

Methods

Backward(Tensor<T>)

Backward pass through encoder layers.

public Tensor<T> Backward(Tensor<T> gradient)

Parameters

gradient Tensor<T>

Returns

Tensor<T>

ComputeAlignment(ModalityType, object, ModalityType, object)

Computes the alignment between two modalities given paired data.

public (T AlignmentScore, Dictionary<string, object> Details) ComputeAlignment(ModalityType modality1, object data1, ModalityType modality2, object data2)

Parameters

modality1 ModalityType

First modality type.

data1 object

Data from first modality.

modality2 ModalityType

Second modality type.

data2 object

Data from second modality.

Returns

(T AlignmentScore, Dictionary<string, object> Details)

Alignment score and optional alignment details.

ComputeCrossModalSimilarity(Vector<T>, Vector<T>)

Computes similarity between embeddings from any two modalities.

public T ComputeCrossModalSimilarity(Vector<T> embedding1, Vector<T> embedding2)

Parameters

embedding1 Vector<T>

First embedding vector.

embedding2 Vector<T>

Second embedding vector.

Returns

T

Cosine similarity score in range [-1, 1].
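
Example

Because every embedding method returns a normalized vector, cosine similarity behaves like a dot product. A usage sketch, assuming the embeddings came from this model:

double similarity = model.ComputeCrossModalSimilarity(imageEmbedding, textEmbedding);
// Near 1: strong match; near 0: unrelated; near -1: opposed.
Console.WriteLine($"similarity: {similarity:F3}");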

ComputeEmergentAudioTextSimilarity(Tensor<T>, string)

Computes emergent cross-modal relationships without explicit pairing.

public T ComputeEmergentAudioTextSimilarity(Tensor<T> audio, string text)

Parameters

audio Tensor<T>

Audio waveform.

text string

Text description.

Returns

T

Similarity score between audio and text.

Remarks

For Beginners: The magic of ImageBind!

Even though ImageBind was never trained on (audio, text) pairs directly, it can still compare them through the shared embedding space!

This works because:

  • Audio is aligned to images (from video)
  • Text is aligned to images (from captions)
  • Therefore, audio and text become implicitly aligned!

"emergent" means this capability appeared without explicit training.

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

CrossModalRetrieval(Vector<T>, IEnumerable<Vector<T>>, int)

Performs cross-modal retrieval from one modality to another.

public IEnumerable<(int Index, T Score)> CrossModalRetrieval(Vector<T> queryEmbedding, IEnumerable<Vector<T>> targetEmbeddings, int topK = 10)

Parameters

queryEmbedding Vector<T>

Query embedding from source modality.

targetEmbeddings IEnumerable<Vector<T>>

Database of embeddings from target modality.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of the most similar items.

Remarks

For Beginners: Search across different types of data!

Examples:

  • Audio → Images: "Find images that match this sound"
  • Text → Audio: "Find sounds matching 'thunderstorm'"
  • Thermal → RGB: "Find color photos of this heat signature"
  • IMU → Video: "Find videos of people doing this motion"
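
Example

A text-to-audio retrieval sketch, assuming audioEmbeddings is a precomputed list of embeddings for an audio library:

// Find the 5 library sounds that best match a description.
Vector<double> query = model.GetTextEmbedding("thunderstorm");
foreach (var (index, score) in model.CrossModalRetrieval(query, audioEmbeddings, topK: 5))
    Console.WriteLine($"clip {index}: score {score}");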

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy from SerializeNetworkSpecificData(BinaryWriter), this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

FindBestMatch(ModalityType, object, IEnumerable<(ModalityType Modality, object Data)>)

Finds the best matching modality representation for a query.

public (ModalityType Modality, object Data, T Score) FindBestMatch(ModalityType queryModality, object queryData, IEnumerable<(ModalityType Modality, object Data)> candidates)

Parameters

queryModality ModalityType

Type of the query data.

queryData object

The query data.

candidates IEnumerable<(ModalityType Modality, object Data)>

Sequence of (modality, data) candidate pairs.

Returns

(ModalityType Modality, object Data, T Score)

Best matching candidate with its similarity score.

FuseModalities(Dictionary<ModalityType, Vector<T>>, string)

Performs multimodal fusion by combining embeddings from multiple modalities.

public Vector<T> FuseModalities(Dictionary<ModalityType, Vector<T>> modalityEmbeddings, string fusionMethod = "mean")

Parameters

modalityEmbeddings Dictionary<ModalityType, Vector<T>>

Dictionary of (modality, embedding) pairs.

fusionMethod string

Method for combining: "mean", "concat", "attention".

Returns

Vector<T>

Fused embedding vector.
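
Example

A fusion sketch combining image and audio evidence into one query vector. The enum member names ModalityType.Image and ModalityType.Audio are assumed here:

var parts = new Dictionary<ModalityType, Vector<double>>
{
    [ModalityType.Image] = model.GetImageEmbedding(image),
    [ModalityType.Audio] = model.GetAudioEmbedding(audio)
};
Vector<double> fused = model.FuseModalities(parts, fusionMethod: "mean");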

GenerateDescriptions(ModalityType, object, IEnumerable<string>, int)

Generates text descriptions for non-text modalities.

public IEnumerable<(string Description, T Score)> GenerateDescriptions(ModalityType modality, object data, IEnumerable<string> candidateDescriptions, int topK = 5)

Parameters

modality ModalityType

The input modality type.

data object

The data to describe.

candidateDescriptions IEnumerable<string>

Pool of possible descriptions.

topK int

Number of best descriptions to return.

Returns

IEnumerable<(string Description, T Score)>

Best matching descriptions with scores.
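
Example

A retrieval-based captioning sketch: the method ranks a fixed pool of candidate strings rather than generating free-form text. ModalityType.Audio is an assumed enum member name:

var candidates = new[] { "a dog barking", "ocean waves", "city traffic" };
foreach (var (description, score) in model.GenerateDescriptions(ModalityType.Audio, waveform, candidates, topK: 2))
    Console.WriteLine($"{description}: {score}");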

GetAudioEmbedding(Tensor<T>, int)

Converts audio into a shared embedding vector.

public Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate = 16000)

Parameters

audioWaveform Tensor<T>

Audio waveform tensor [samples] or [channels, samples].

sampleRate int

Audio sample rate in Hz.

Returns

Vector<T>

Normalized embedding vector.

Remarks

For Beginners: Convert sound into the same vector space as images and text!

This allows:

  • Find images that match a sound (bird chirping → bird photos)
  • Search audio with text ("dog barking" → actual barking sounds)
  • Compare different sounds for similarity
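
Example

A sketch embedding a mono 16 kHz clip, assuming waveform holds raw samples in a [samples] tensor:

Vector<double> audioEmbedding = model.GetAudioEmbedding(waveform, sampleRate: 16000);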

GetDepthEmbedding(Tensor<T>)

Converts depth map into a shared embedding vector.

public Vector<T> GetDepthEmbedding(Tensor<T> depthMap)

Parameters

depthMap Tensor<T>

Depth map tensor [height, width] with distance values.

Returns

Vector<T>

Normalized embedding vector.

Remarks

Depth maps represent 3D structure. ImageBind can find RGB images with similar spatial structure or match to text descriptions.

GetEmbedding(ModalityType, object)

Gets embedding for any supported modality using a generic interface.

public Vector<T> GetEmbedding(ModalityType modality, object data)

Parameters

modality ModalityType

The type of modality.

data object

The data to embed (type depends on modality).

Returns

Vector<T>

Normalized embedding vector.

GetIMUEmbedding(Tensor<T>)

Converts IMU sensor data into a shared embedding vector.

public Vector<T> GetIMUEmbedding(Tensor<T> imuData)

Parameters

imuData Tensor<T>

IMU readings [timesteps, 6] for accelerometer and gyroscope (x,y,z each).

Returns

Vector<T>

Normalized embedding vector.

Remarks

For Beginners: IMU is the motion sensor in your phone!

IMU captures movement patterns:

  • Walking, running, jumping
  • Phone gestures
  • Device orientation

ImageBind can match these motions to videos or text descriptions!
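
Example

A sketch embedding motion data. The tensor construction here is schematic; fill it with real accelerometer and gyroscope readings:

// [timesteps, 6]: accelerometer (x, y, z) followed by gyroscope (x, y, z).
var imuData = new Tensor<double>(new[] { 2000, 6 });
Vector<double> motionEmbedding = model.GetIMUEmbedding(imuData);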

GetImageEmbedding(Tensor<T>)

Converts an image into a shared embedding vector.

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Preprocessed image tensor [channels, height, width].

Returns

Vector<T>

Normalized embedding vector.

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

GetParameters()

Gets all trainable parameters of the network as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters of the network.

Remarks

For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.

GetTextEmbedding(string)

Converts text into a shared embedding vector.

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Text string to embed.

Returns

Vector<T>

Normalized embedding vector.

GetThermalEmbedding(Tensor<T>)

Converts thermal image into a shared embedding vector.

public Vector<T> GetThermalEmbedding(Tensor<T> thermalImage)

Parameters

thermalImage Tensor<T>

Thermal/infrared image tensor.

Returns

Vector<T>

Normalized embedding vector.

Remarks

Thermal images capture heat signatures. ImageBind can match thermal images to their RGB counterparts or find related audio/text.

GetVideoEmbedding(IEnumerable<Tensor<T>>)

Converts video into a shared embedding vector.

public Vector<T> GetVideoEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames as sequence of image tensors.

Returns

Vector<T>

Normalized embedding vector.

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

SetParameters(Vector<T>)

Sets the parameters of the neural network.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameters to set.

Remarks

This method distributes the parameters to all layers in the network. The parameters should be in the same format as returned by GetParameters.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
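
Example

A single training step, assuming input and expectedOutput are tensors prepared for this network:

model.Train(input, expectedOutput);   // one forward/backward/update cycle
var loss = model.GetLastLoss();       // see how wrong the prediction was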

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

ZeroShotClassify(ModalityType, object, IEnumerable<string>)

Performs zero-shot classification across modalities.

public Dictionary<string, T> ZeroShotClassify(ModalityType modality, object data, IEnumerable<string> classLabels)

Parameters

modality ModalityType

The modality of the input data.

data object

The data to classify.

classLabels IEnumerable<string>

Text labels for classification.

Returns

Dictionary<string, T>

Dictionary mapping labels to probability scores.

Remarks

Works for any supported modality - classify audio by text labels, classify thermal images, classify motion patterns, etc.
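
Example

A zero-shot classification sketch for a thermal image; ModalityType.Thermal is an assumed enum member name:

var labels = new[] { "person", "vehicle", "animal" };
Dictionary<string, double> scores = model.ZeroShotClassify(ModalityType.Thermal, thermalImage, labels);
foreach (var (label, probability) in scores)
    Console.WriteLine($"{label}: {probability:P1}");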