Interface IAudioVisualCorrespondenceModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for audio-visual correspondence learning models.

public interface IAudioVisualCorrespondenceModel<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Audio-visual correspondence learning focuses on understanding the relationship between what we see and what we hear. This enables tasks like finding the source of a sound in a video, synchronizing audio and video, and understanding audio-visual events.

For Beginners: Teaching AI to connect sounds with visuals!

Key capabilities:

  • Sound source localization: Where in the image is the sound coming from?
  • Audio-visual synchronization: Are the audio and video in sync?
  • Cross-modal retrieval: Find images matching sounds and vice versa
  • Audio-visual scene understanding: What's happening based on both modalities?

Examples:

  • A dog barking → The model highlights the dog in the image
  • Piano music → The model finds images of pianos
  • Clapping sound → The model locates hands in the video

Properties

AudioSampleRate

Gets the expected audio sample rate, in hertz (samples per second).

int AudioSampleRate { get; }

Property Value

int

EmbeddingDimension

Gets the embedding dimension for audio-visual features.

int EmbeddingDimension { get; }

Property Value

int

VideoFrameRate

Gets the expected video frame rate, in frames per second.

double VideoFrameRate { get; }

Property Value

double

Methods

CheckSynchronization(Tensor<T>, IEnumerable<Tensor<T>>)

Checks audio-visual synchronization.

(double OffsetSeconds, T Confidence) CheckSynchronization(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

(double OffsetSeconds, T Confidence)

Sync offset in seconds (positive = audio ahead, negative = audio behind) and confidence.
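
Examples

A minimal usage sketch (illustrative, not from the library's own docs). It assumes an implementation of this interface is available, and that Tensor<T> lives in a namespace such as AiDotNet.LinearAlgebra:

    using System;
    using System.Collections.Generic;
    using AiDotNet.Interfaces;
    using AiDotNet.LinearAlgebra; // assumed namespace for Tensor<T>

    static void ReportSync(
        IAudioVisualCorrespondenceModel<float> model,
        Tensor<float> waveform,
        IEnumerable<Tensor<float>> frames)
    {
        var (offsetSeconds, confidence) = model.CheckSynchronization(waveform, frames);

        // Positive offset means the audio runs ahead of the video; negative means it lags.
        var direction = offsetSeconds >= 0 ? "audio ahead (or in sync)" : "audio behind";
        Console.WriteLine($"Offset {offsetSeconds:F3}s ({direction}), confidence {confidence}");
    }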

ClassifyScene(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>)

Classifies audio-visual scenes.

Dictionary<string, T> ClassifyScene(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, IEnumerable<string> sceneLabels)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

sceneLabels IEnumerable<string>

Possible scene labels.

Returns

Dictionary<string, T>

A dictionary mapping each scene label to its probability.
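
Examples

A sketch of ranking candidate labels; model, waveform, and frames are assumed to exist as in the earlier example (System.Linq is also required):

    var labels = new[] { "concert", "street", "kitchen" };
    Dictionary<string, float> scores = model.ClassifyScene(waveform, frames, labels);

    // Print labels from most to least likely.
    foreach (var kv in scores.OrderByDescending(x => x.Value))
        Console.WriteLine($"{kv.Key}: {kv.Value:P1}");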

ComputeCorrespondence(Tensor<T>, IEnumerable<Tensor<T>>)

Computes audio-visual correspondence score.

T ComputeCorrespondence(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

T

Correspondence score (higher = better match).
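
Examples

A sketch of a simple sanity check: a clip's own audio should score higher against its frames than against frames from an unrelated clip. dogAudio, dogFrames, and beachFrames are hypothetical inputs:

    float matched = model.ComputeCorrespondence(dogAudio, dogFrames);
    float mismatched = model.ComputeCorrespondence(dogAudio, beachFrames);

    // Higher score = better audio-visual match, so matched should exceed mismatched.
    Console.WriteLine($"matched: {matched}, mismatched: {mismatched}");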

DescribeExpectedAudio(IEnumerable<Tensor<T>>)

Generates audio description from visual content.

string DescribeExpectedAudio(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Video frames.

Returns

string

Description of expected sounds.
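
Examples

A sketch: asking the model what a silent clip should sound like (frames as before):

    string expected = model.DescribeExpectedAudio(frames);
    Console.WriteLine($"Expected sounds: {expected}");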

GetAudioEmbedding(Tensor<T>, int)

Computes audio embedding from waveform.

Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate)

Parameters

audioWaveform Tensor<T>

Audio waveform tensor.

sampleRate int

Sample rate of the audio, in hertz.

Returns

Vector<T>

Normalized audio embedding.
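
Examples

A sketch that embeds a waveform at the model's own expected rate; if your audio was recorded at a different rate, pass that rate instead:

    // The result is normalized, so embeddings can be compared directly.
    Vector<float> audioEmbedding = model.GetAudioEmbedding(waveform, model.AudioSampleRate);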

GetVisualEmbedding(IEnumerable<Tensor<T>>)

Computes visual embedding from video frames.

Vector<T> GetVisualEmbedding(IEnumerable<Tensor<T>> frames)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of video frames.

Returns

Vector<T>

Normalized visual embedding.
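
Examples

A sketch that precomputes a database of visual embeddings for later retrieval; clips is a hypothetical collection of frame sequences (System.Linq assumed):

    // One normalized embedding per clip, usable as a visualDatabase below.
    var visualDatabase = clips
        .Select(clip => model.GetVisualEmbedding(clip))
        .ToList();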

LearnCorrespondence(IEnumerable<Tensor<T>>, IEnumerable<IEnumerable<Tensor<T>>>, int)

Learns correspondence from paired audio-visual data.

void LearnCorrespondence(IEnumerable<Tensor<T>> audioSamples, IEnumerable<IEnumerable<Tensor<T>>> visualSamples, int epochs = 10)

Parameters

audioSamples IEnumerable<Tensor<T>>

Audio samples.

visualSamples IEnumerable<IEnumerable<Tensor<T>>>

Corresponding visual samples.

epochs int

Number of training epochs (default: 10).
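
Examples

A sketch of training; each audio sample must be the soundtrack of the visual sample at the same position, since the model learns from the pairing. audioSamples and visualSamples are hypothetical collections:

    // 20 passes over the paired data instead of the default 10.
    model.LearnCorrespondence(audioSamples, visualSamples, epochs: 20);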

LocalizeSoundSource(Tensor<T>, IEnumerable<Tensor<T>>)

Localizes sound sources in video frames.

IEnumerable<Tensor<T>> LocalizeSoundSource(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

Returns

IEnumerable<Tensor<T>>

Attention maps showing sound source locations for each frame.

Remarks

For Beginners: Find where sounds come from in images!

Returns a "heat map" for each frame showing which regions are most likely producing the sound we hear.
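Examples

A sketch: one attention map comes back per input frame (System.Linq assumed):

    var heatmaps = model.LocalizeSoundSource(waveform, frames).ToList();

    // Each tensor is a per-frame heat map; larger values mark likelier sound sources.
    Console.WriteLine($"Received {heatmaps.Count} attention maps.");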

RetrieveAudioFromVisuals(IEnumerable<Tensor<T>>, IEnumerable<Vector<T>>, int)

Retrieves audio content matching visual input.

IEnumerable<(int Index, T Score)> RetrieveAudioFromVisuals(IEnumerable<Tensor<T>> frames, IEnumerable<Vector<T>> audioDatabase, int topK = 10)

Parameters

frames IEnumerable<Tensor<T>>

Query video frames.

audioDatabase IEnumerable<Vector<T>>

Database of audio embeddings.

topK int

Number of top results to return (default: 10).

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of matching audio.
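
Examples

A sketch: querying a precomputed audio-embedding database with a silent video. audioDatabase is assumed to hold embeddings produced by GetAudioEmbedding:

    foreach (var (index, score) in model.RetrieveAudioFromVisuals(frames, audioDatabase, topK: 5))
        Console.WriteLine($"audio #{index}: score {score}");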

RetrieveVisualsFromAudio(Tensor<T>, IEnumerable<Vector<T>>, int)

Retrieves visual content matching audio.

IEnumerable<(int Index, T Score)> RetrieveVisualsFromAudio(Tensor<T> audioWaveform, IEnumerable<Vector<T>> visualDatabase, int topK = 10)

Parameters

audioWaveform Tensor<T>

Query audio.

visualDatabase IEnumerable<Vector<T>>

Database of visual embeddings.

topK int

Number of top results to return (default: 10).

Returns

IEnumerable<(int Index, T Score)>

Indices and scores of matching visuals.
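
Examples

The mirror-image query: a sound as input, visual clips as results. visualDatabase could be the list built in the GetVisualEmbedding example:

    foreach (var (index, score) in model.RetrieveVisualsFromAudio(waveform, visualDatabase, topK: 3))
        Console.WriteLine($"clip #{index}: score {score}");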

SeparateAudioByVisual(Tensor<T>, Tensor<T>)

Separates audio sources based on visual guidance.

Tensor<T> SeparateAudioByVisual(Tensor<T> mixedAudio, Tensor<T> targetVisual)

Parameters

mixedAudio Tensor<T>

Mixed audio waveform.

targetVisual Tensor<T>

Visual of the target sound source.

Returns

Tensor<T>

Separated audio for the target source.

Remarks

Uses visual information to guide audio source separation. For example, given a mixed recording of two people talking and a visual of the target speaker, the model extracts just that speaker's voice.
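
Examples

A sketch of visually guided separation; mixedAudio and targetSpeakerVisual are hypothetical tensors (e.g. a two-speaker recording and a crop of one speaker's face):

    // Returns a waveform containing (ideally) only the target source.
    Tensor<float> isolatedVoice = model.SeparateAudioByVisual(mixedAudio, targetSpeakerVisual);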