Interface IAudioVisualCorrespondenceModel<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Defines the contract for audio-visual correspondence learning models.
public interface IAudioVisualCorrespondenceModel<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Audio-visual correspondence learning focuses on understanding the relationship between what we see and what we hear. This enables tasks like finding the source of a sound in a video, synchronizing audio and video, and understanding audio-visual events.
For Beginners: Teaching AI to connect sounds with visuals!
Key capabilities:
- Sound source localization: Where in the image is the sound coming from?
- Audio-visual synchronization: Are the audio and video in sync?
- Cross-modal retrieval: Find images matching sounds and vice versa
- Audio-visual scene understanding: What's happening based on both modalities?
Examples:
- A dog barking → The model highlights the dog in the image
- Piano music → The model finds images of pianos
- Clapping sound → The model locates hands in the video
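Example
A minimal usage sketch. The implementation instance, the input-loading code, and the AiDotNet.LinearAlgebra namespace are assumptions for illustration; only the interface members shown are taken from this page.
using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;
using AiDotNet.LinearAlgebra; // assumed namespace for Tensor<T> and Vector<T>

static class CorrespondenceDemo
{
    // `model`, `audio`, and `frames` are assumed to be supplied by your own setup code.
    static void Run(IAudioVisualCorrespondenceModel<double> model,
                    Tensor<double> audio,
                    IEnumerable<Tensor<double>> frames)
    {
        // Embed each modality into the shared space; embeddings can be cached for retrieval.
        Vector<double> audioEmbedding = model.GetAudioEmbedding(audio, model.AudioSampleRate);
        Vector<double> visualEmbedding = model.GetVisualEmbedding(frames);

        // Score how well the audio matches the video as a whole.
        double score = model.ComputeCorrespondence(audio, frames);
        Console.WriteLine($"Embedding dimension: {model.EmbeddingDimension}, correspondence score: {score}");
    }
}
The method examples below continue this setup and reuse `model`, `audio`, and `frames`.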
Properties
AudioSampleRate
Gets the expected audio sample rate.
int AudioSampleRate { get; }
Property Value
- int
EmbeddingDimension
Gets the embedding dimension for audio-visual features.
int EmbeddingDimension { get; }
Property Value
- int
VideoFrameRate
Gets the expected video frame rate.
double VideoFrameRate { get; }
Property Value
- double
Methods
CheckSynchronization(Tensor<T>, IEnumerable<Tensor<T>>)
Checks audio-visual synchronization.
(double OffsetSeconds, T Confidence) CheckSynchronization(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
Returns
- (double OffsetSeconds, T Confidence)
Sync offset in seconds (positive = audio ahead, negative = audio behind) and confidence.
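Example
A sketch of interpreting the result, continuing the setup from the interface-level example above.
var (offsetSeconds, confidence) = model.CheckSynchronization(audio, frames);
string direction = offsetSeconds >= 0 ? "audio is ahead of video" : "audio is behind video";
Console.WriteLine($"{direction} by {Math.Abs(offsetSeconds):F3} s (confidence: {confidence})");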
ClassifyScene(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>)
Classifies audio-visual scenes.
Dictionary<string, T> ClassifyScene(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, IEnumerable<string> sceneLabels)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
sceneLabels (IEnumerable<string>): Possible scene labels.
Returns
- Dictionary<string, T>
Classification probabilities.
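Example
A sketch continuing the setup above; the candidate labels are illustrative.
var sceneLabels = new[] { "beach", "concert", "street" }; // illustrative labels
Dictionary<string, double> probabilities = model.ClassifyScene(audio, frames, sceneLabels);
foreach (var entry in probabilities)
    Console.WriteLine($"{entry.Key}: {entry.Value:P1}"); // probability per candidate label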
ComputeCorrespondence(Tensor<T>, IEnumerable<Tensor<T>>)
Computes audio-visual correspondence score.
T ComputeCorrespondence(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
Returns
- T
Correspondence score (higher = better match).
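Example
A sketch continuing the setup above; the 0.5 threshold is illustrative only, since score scales are model-specific.
double score = model.ComputeCorrespondence(audio, frames);
Console.WriteLine(score > 0.5 ? "Audio likely matches this video" : "Audio likely does not match");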
DescribeExpectedAudio(IEnumerable<Tensor<T>>)
Generates audio description from visual content.
string DescribeExpectedAudio(IEnumerable<Tensor<T>> frames)
Parameters
frames (IEnumerable<Tensor<T>>): Video frames.
Returns
- string
Description of expected sounds.
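Example
Continuing the setup above; the output format is model-specific.
string expected = model.DescribeExpectedAudio(frames);
Console.WriteLine(expected); // e.g. "dog barking, distant traffic"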
GetAudioEmbedding(Tensor<T>, int)
Computes audio embedding from waveform.
Vector<T> GetAudioEmbedding(Tensor<T> audioWaveform, int sampleRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform tensor.
sampleRate (int): Sample rate of the audio.
Returns
- Vector<T>
Normalized audio embedding.
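Example
A sketch of building an embedding database for the retrieval methods below; `audioClips` is an assumed, preloaded collection of waveform tensors.
var audioDatabase = new List<Vector<double>>();
foreach (Tensor<double> clip in audioClips)
    audioDatabase.Add(model.GetAudioEmbedding(clip, model.AudioSampleRate));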
GetVisualEmbedding(IEnumerable<Tensor<T>>)
Computes visual embedding from video frames.
Vector<T> GetVisualEmbedding(IEnumerable<Tensor<T>> frames)
Parameters
frames (IEnumerable<Tensor<T>>): Sequence of video frames.
Returns
- Vector<T>
Normalized visual embedding.
LearnCorrespondence(IEnumerable<Tensor<T>>, IEnumerable<IEnumerable<Tensor<T>>>, int)
Learns correspondence from paired audio-visual data.
void LearnCorrespondence(IEnumerable<Tensor<T>> audioSamples, IEnumerable<IEnumerable<Tensor<T>>> visualSamples, int epochs = 10)
Parameters
audioSamples (IEnumerable<Tensor<T>>): Audio samples.
visualSamples (IEnumerable<IEnumerable<Tensor<T>>>): Corresponding visual samples.
epochs (int): Training epochs. Defaults to 10.
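Example
A training sketch; the paired collections are assumed preloaded, with audioSamples[i] matching visualSamples[i].
model.LearnCorrespondence(audioSamples, visualSamples, epochs: 20);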
LocalizeSoundSource(Tensor<T>, IEnumerable<Tensor<T>>)
Localizes sound sources in video frames.
IEnumerable<Tensor<T>> LocalizeSoundSource(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
Returns
- IEnumerable<Tensor<T>>
Attention maps showing sound source locations for each frame.
Remarks
For Beginners: Find where sounds come from in images!
Returns a "heat map" for each frame showing which regions are most likely producing the sound we hear.
RetrieveAudioFromVisuals(IEnumerable<Tensor<T>>, IEnumerable<Vector<T>>, int)
Retrieves audio content matching visual input.
IEnumerable<(int Index, T Score)> RetrieveAudioFromVisuals(IEnumerable<Tensor<T>> frames, IEnumerable<Vector<T>> audioDatabase, int topK = 10)
Parameters
frames (IEnumerable<Tensor<T>>): Query video frames.
audioDatabase (IEnumerable<Vector<T>>): Database of audio embeddings.
topK (int): Number of results. Defaults to 10.
Returns
- IEnumerable<(int Index, T Score)>
Indices and scores of matching audio.
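Example
A sketch reusing the `audioDatabase` built in the GetAudioEmbedding example above.
foreach (var (index, score) in model.RetrieveAudioFromVisuals(frames, audioDatabase, topK: 5))
    Console.WriteLine($"Audio clip #{index}: score {score}");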
RetrieveVisualsFromAudio(Tensor<T>, IEnumerable<Vector<T>>, int)
Retrieves visual content matching audio.
IEnumerable<(int Index, T Score)> RetrieveVisualsFromAudio(Tensor<T> audioWaveform, IEnumerable<Vector<T>> visualDatabase, int topK = 10)
Parameters
audioWaveform (Tensor<T>): Query audio.
visualDatabase (IEnumerable<Vector<T>>): Database of visual embeddings.
topK (int): Number of results. Defaults to 10.
Returns
- IEnumerable<(int Index, T Score)>
Indices and scores of matching visuals.
SeparateAudioByVisual(Tensor<T>, Tensor<T>)
Separates audio sources based on visual guidance.
Tensor<T> SeparateAudioByVisual(Tensor<T> mixedAudio, Tensor<T> targetVisual)
Parameters
mixedAudio (Tensor<T>): Mixed audio waveform.
targetVisual (Tensor<T>): Visual of the target sound source.
Returns
- Tensor<T>
Separated audio for the target source.
Remarks
Uses visual information to guide audio source separation. For example, given mixed audio of two people talking and an image crop of one speaker as the target visual, it extracts just that speaker's voice.
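Example
A sketch continuing the setup above; `mixedAudio` and a `targetVisual` crop of the desired source are assumed preloaded.
Tensor<double> isolated = model.SeparateAudioByVisual(mixedAudio, targetVisual);
// `isolated` approximates the waveform of the target source only.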