Interface IAudioVisualCorrespondenceModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for audio-visual correspondence learning models.

public interface IAudioVisualCorrespondenceModel<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Audio-visual correspondence learning focuses on understanding the relationship between what we see and what we hear. This enables tasks like finding the source of a sound in a video, synchronizing audio and video, and understanding audio-visual events.

For Beginners: Teaching AI to connect sounds with visuals!

Key capabilities:

  • Sound source localization: Where in the image is the sound coming from?
  • Audio-visual synchronization: Are the audio and video in sync?
  • Cross-modal retrieval: Find images matching sounds and vice versa
  • Audio-visual scene understanding: What's happening based on both modalities?

Examples:

  • A dog barking → The model highlights the dog in the image
  • Piano music → The model finds images of pianos
  • Clapping sound → The model locates hands in the video

Properties

AudioSampleRate

Gets the expected audio sample rate, in hertz (samples per second).

int AudioSampleRate { get; }

Property Value

int

EmbeddingDimension

Gets the embedding dimension for audio-visual features.

int EmbeddingDimension { get; }

Property Value

int

VideoFrameRate

Gets the expected video frame rate, in frames per second.

double VideoFrameRate { get; }

Property Value

double

Methods

CheckSynchronization(Tensor<T>, IEnumerable<Tensor<T>>)

Checks audio-visual synchronization.

(double OffsetSeconds, T Confidence) CheckSynchronization(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

(double OffsetSeconds, T Confidence)

Sync offset in seconds (positive = audio ahead, negative = audio behind) and confidence.
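
Examples

A minimal usage sketch (illustrative, not from the library's own docs). It assumes an implementation of this interface is available, and that Tensor<T> lives in a namespace such as AiDotNet.LinearAlgebra:

    using System;
    using System.Collections.Generic;
    using AiDotNet.Interfaces;
    using AiDotNet.LinearAlgebra; // assumed namespace for Tensor<T>

    static void ReportSync(
        IAudioVisualCorrespondenceModel<float> model,
        Tensor<float> waveform,
        IEnumerable<Tensor<float>> frames)
    {
        var (offsetSeconds, confidence) = model.CheckSynchronization(waveform, frames);

        // Positive offset means the audio runs ahead of the video; negative means it lags.
        var direction = offsetSeconds >= 0 ? "audio ahead (or in sync)" : "audio behind";
        Console.WriteLine($"Offset {offsetSeconds:F3}s ({direction}), confidence {confidence}");
    }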

ClassifyScene(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>)

Classifies audio-visual scenes.

Dictionary<string, T> ClassifyScene(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, IEnumerable<string> sceneLabels)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

sceneLabels IEnumerable<string>

Possible scene labels.

Returns

Dictionary<string, T>

A dictionary mapping each scene label to its probability.
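
Examples

A sketch of ranking candidate labels; model, waveform, and frames are assumed to exist as in the earlier example (System.Linq is also required):

    var labels = new[] { "concert", "street", "kitchen" };
    Dictionary<string, float> scores = model.ClassifyScene(waveform, frames, labels);

    // Print labels from most to least likely.
    foreach (var kv in scores.OrderByDescending(x => x.Value))
        Console.WriteLine($"{kv.Key}: {kv.Value:P1}");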

ComputeCorrespondence(Tensor<T>, IEnumerable<Tensor<T>>)

Computes audio-visual correspondence score.

T ComputeCorrespondence(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

T

Correspondence score (higher = better match).
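
Examples

A sketch of a simple sanity check: a clip's own audio should score higher against its frames than against frames from an unrelated clip. dogAudio, dogFrames, and beachFrames are hypothetical inputs:

    float matched = model.ComputeCorrespondence(dogAudio, dogFrames);
    float mismatched = model.ComputeCorrespondence(dogAudio, beachFrames);

    // Higher score = better audio-visual match, so matched should exceed mismatched.
    Console.WriteLine($"matched: {matched}, mismatched: {mismatched}");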

DescribeExpectedAudio(IEnumerable<Tensor<T>>)

Generates audio description from visual content.

string DescribeExpectedAudio(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

Returns

string

Description of expected sounds.
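
Examples

A sketch: asking the model what a silent clip should sound like (frames as before):

    string expected = model.DescribeExpectedAudio(frames);
    Console.WriteLine($"Expected sounds: {expected}");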

GetAudioEmbedding(Tensor<T>, int)

Computes audio embedding from waveform.

Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate)

Parameters

audioWaveform Tensor<T>

Audio waveform tensor.

sampleRate int

Sample rate of the audio, in hertz.

Returns

Vector<T>

Normalized audio embedding.
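
Examples

A sketch that embeds a waveform at the model's own expected rate; if your audio was recorded at a different rate, pass that rate instead:

    // The result is normalized, so embeddings can be compared directly.
    Vector<float> audioEmbedding = model.GetAudioEmbedding(waveform, model.AudioSampleRate);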

GetVisualEmbedding(IEnumerable<Tensor<T>>)

Computes visual embedding from video frames.

Vector<T> GetVisualEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of video frames.

Returns

Vector<T>

Normalized visual embedding.
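
Examples

A sketch that precomputes a database of visual embeddings for later retrieval; clips is a hypothetical collection of frame sequences (System.Linq assumed):

    // One normalized embedding per clip, usable as a visualDatabase below.
    var visualDatabase = clips
        .Select(clip => model.GetVisualEmbedding(clip))
        .ToList();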

LearnCorrespondence(IEnumerable<Tensor<T>>, IEnumerable<IEnumerable<Tensor<T>>>, int)

Learns correspondence from paired audio-visual data.

void LearnCorrespondence(IEnumerable<Tensor<T>> audioSamples, IEnumerable<IEnumerable<Tensor<T>>> visualSamples, int epochs = 10)

Parameters

audioSamples IEnumerable<Tensor<T>>

Audio samples.

visualSamples IEnumerable<IEnumerable<Tensor<T>>>

Corresponding visual samples.

epochs int

Number of training epochs (default: 10).
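
Examples

A sketch of training; each audio sample must be the soundtrack of the visual sample at the same position, since the model learns from the pairing. audioSamples and visualSamples are hypothetical collections:

    // 20 passes over the paired data instead of the default 10.
    model.LearnCorrespondence(audioSamples, visualSamples, epochs: 20);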

LocalizeSoundSource(Tensor<T>, IEnumerable<Tensor<T>>)

Localizes sound sources in video frames.

IEnumerable<Tensor<T>> LocalizeSoundSource(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

IEnumerable<Tensor<T>>

Attention maps showing sound source locations for each frame.

Remarks

For Beginners: Find where sounds come from in images!

Returns a "heat map" for each frame showing which regions are most likely producing the sound we hear.
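Examples

A sketch: one attention map comes back per input frame (System.Linq assumed):

    var heatmaps = model.LocalizeSoundSource(waveform, frames).ToList();

    // Each tensor is a per-frame heat map; larger values mark likelier sound sources.
    Console.WriteLine($"Received {heatmaps.Count} attention maps.");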

RetrieveAudioFromVisuals(IEnumerable<Tensor<T>>, IEnumerable<Vector<T>>, int)

Retrieves audio content matching visual input.

IEnumerable<(int Index, T Score)> RetrieveAudioFromVisuals(IEnumerable<Tensor<T>> frames, IEnumerable<Vector<T>> audioDatabase, int topK = 10)

Parameters

frames IEnumerable<Tensor<T>>

Query video frames.

audioDatabase IEnumerable<Vector<T>>

Database of audio embeddings.

topK int

Number of top results to return (default: 10).

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of matching audio.
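
Examples

A sketch: querying a precomputed audio-embedding database with a silent video. audioDatabase is assumed to hold embeddings produced by GetAudioEmbedding:

    foreach (var (index, score) in model.RetrieveAudioFromVisuals(frames, audioDatabase, topK: 5))
        Console.WriteLine($"audio #{index}: score {score}");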

RetrieveVisualsFromAudio(Tensor<T>, IEnumerable<Vector<T>>, int)

Retrieves visual content matching audio.

IEnumerable<(int Index, T Score)> RetrieveVisualsFromAudio(Tensor<T> audioWaveform, IEnumerable<Vector<T>> visualDatabase, int topK = 10)

Parameters

audioWaveform Tensor<T>

Query audio.

visualDatabase IEnumerable<Vector<T>>

Database of visual embeddings.

topK int

Number of top results to return (default: 10).

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of matching visuals.
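
Examples

The mirror-image query: a sound as input, visual clips as results. visualDatabase could be the list built in the GetVisualEmbedding example:

    foreach (var (index, score) in model.RetrieveVisualsFromAudio(waveform, visualDatabase, topK: 3))
        Console.WriteLine($"clip #{index}: score {score}");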

SeparateAudioByVisual(Tensor<T>, Tensor<T>)

Separates audio sources based on visual guidance.

Tensor<T> SeparateAudioByVisual(Tensor<T> mixedAudio, Tensor<T> targetVisual)

Parameters

mixedAudio Tensor<T>

Mixed audio waveform.

targetVisual Tensor<T>

Visual of the target sound source.

Returns

Tensor<T>

Separated audio for the target source.

Remarks

Uses visual information to guide audio source separation. For example, given a mixed recording of two people talking and a visual of the target speaker, the model extracts just that speaker's voice.
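
Examples

A sketch of visually guided separation; mixedAudio and targetSpeakerVisual are hypothetical tensors (e.g. a two-speaker recording and a crop of one speaker's face):

    // Returns a waveform containing (ideally) only the target source.
    Tensor<float> isolatedVoice = model.SeparateAudioByVisual(mixedAudio, targetSpeakerVisual);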