Interface IAudioVisualEventLocalizationModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for audio-visual event localization models.

public interface IAudioVisualEventLocalizationModel<T>

Type Parameters

T

The numeric type used for calculations (for example, float or double).

Remarks

Audio-visual event localization identifies WHEN and WHERE events occur in video by jointly analyzing audio and visual streams. This goes beyond simple detection to provide precise temporal boundaries and spatial locations.

For Beginners: Finding events in videos using sight AND sound!

Key capabilities:

  • Temporal localization: When does the dog bark? (2.3s - 4.1s)
  • Spatial localization: Where is the barking dog? (bounding box)
  • Event classification: What kind of event is it? (animal sound)
  • Multi-event detection: Find all events in a video

Use cases:

  • Video surveillance: Detect glass breaking sounds and locate the window
  • Sports analysis: Find and timestamp all goals using crowd cheering
  • Content moderation: Detect and locate inappropriate audio-visual content

Properties

SupportedEventCategories

Gets the supported event categories.

IReadOnlyList<string> SupportedEventCategories { get; }

Property Value

IReadOnlyList<string>

TemporalResolution

Gets the temporal resolution in seconds.

double TemporalResolution { get; }

Property Value

double
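
Examples

A hypothetical sketch of checking model capabilities before running detection; `model` is assumed to be an existing `IAudioVisualEventLocalizationModel<float>` implementation, and `System.Linq` supplies `Contains`:

```csharp
// Check whether the model knows the category we care about.
if (model.SupportedEventCategories.Contains("dog_bark"))
{
    // Events shorter than TemporalResolution cannot be localized reliably.
    Console.WriteLine($"Temporal resolution: {model.TemporalResolution} s");
}
```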

Methods

AnswerEventQuestion(Tensor<T>, IEnumerable<Tensor<T>>, string, double)

Answers questions about events in the video.

(string Answer, IEnumerable<(double StartTime, double EndTime)> Evidence) AnswerEventQuestion(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, string question, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

question string

Question about events.

frameRate double

Video frame rate.

Returns

(string Answer, IEnumerable<(double StartTime, double EndTime)> Evidence)

Answer with supporting temporal evidence.
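
Examples

A hypothetical sketch; `model`, `audio`, and `frames` are assumed to be an implementation instance and pre-decoded inputs:

```csharp
var (answer, evidence) = model.AnswerEventQuestion(
    audio, frames, "When does the dog bark?", frameRate: 30.0);

Console.WriteLine(answer);
foreach (var (start, end) in evidence)
    Console.WriteLine($"Supporting segment: {start:F1}s - {end:F1}s");
```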

ClassifyEvent(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>)

Classifies a pre-segmented event.

Dictionary<string, T> ClassifyEvent(Tensor<T> audioSegment, IEnumerable<Tensor<T>> frameSegment, IEnumerable<string> candidateLabels)

Parameters

audioSegment Tensor<T>

Audio segment for the event.

frameSegment IEnumerable<Tensor<T>>

Video frames for the event.

candidateLabels IEnumerable<string>

Possible event labels.

Returns

Dictionary<string, T>

Classification probabilities.
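
Examples

A hypothetical sketch of scoring a pre-cut segment against candidate labels; `model`, `audioSegment`, and `frameSegment` are assumed to exist, and `System.Linq` is used to pick the top label:

```csharp
var labels = new[] { "dog_bark", "glass_break", "speech" };
Dictionary<string, float> probs =
    model.ClassifyEvent(audioSegment, frameSegment, labels);

// Highest-probability label wins.
string best = probs.OrderByDescending(p => p.Value).First().Key;
```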

ComputeEventAttention(Tensor<T>, IEnumerable<Tensor<T>>)

Computes event-level audio-visual attention.

(Tensor<T> AudioToVisualAttention, Tensor<T> VisualToAudioAttention) ComputeEventAttention(Tensor<T> audioSegment, IEnumerable<Tensor<T>> frameSegment)

Parameters

audioSegment Tensor<T>

Audio segment.

frameSegment IEnumerable<Tensor<T>>

Video frame segment.

Returns

(Tensor<T> AudioToVisualAttention, Tensor<T> VisualToAudioAttention)

Cross-modal attention weights.
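
Examples

A hypothetical sketch; `model`, `audioSegment`, and `frameSegment` are assumed inputs. The two tensors describe how each modality attends to the other:

```csharp
var (audioToVisual, visualToAudio) =
    model.ComputeEventAttention(audioSegment, frameSegment);

// audioToVisual: per audio time step, attention weights over visual regions.
// visualToAudio: per visual region, attention weights over audio time steps.
```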

DetectAnomalies(Tensor<T>, IEnumerable<Tensor<T>>, double)

Detects anomalous events that don't match expected patterns.

IEnumerable<(double StartTime, double EndTime, T AnomalyScore, string Description)> DetectAnomalies(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<(double StartTime, double EndTime, T AnomalyScore, string Description)>

Detected anomalies with anomaly scores.
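
Examples

A hypothetical surveillance-style sketch; `model`, `audio`, and `frames` are assumed to exist:

```csharp
foreach (var (start, end, score, description) in
         model.DetectAnomalies(audio, frames, frameRate: 30.0))
{
    Console.WriteLine($"{start:F1}s-{end:F1}s  score={score}  {description}");
}
```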

DetectEvents(Tensor<T>, IEnumerable<Tensor<T>>, double)

Detects and localizes all audio-visual events in a video.

IEnumerable<AudioVisualEvent> DetectEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<AudioVisualEvent>

List of detected events with temporal and spatial localization.
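
Examples

A hypothetical end-to-end sketch; `model`, `audio`, and `frames` are assumed to exist, and `AudioVisualEvent` is assumed to carry the temporal and spatial localization described above (see its own documentation for members):

```csharp
IEnumerable<AudioVisualEvent> events =
    model.DetectEvents(audio, frames, frameRate: 30.0);

foreach (AudioVisualEvent ev in events)
{
    Console.WriteLine(ev);
}
```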

DetectSpecificEvents(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>, double)

Detects events of specific categories.

IEnumerable<AudioVisualEvent> DetectSpecificEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, IEnumerable<string> targetCategories, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

targetCategories IEnumerable<string>

Categories to detect.

frameRate double

Video frame rate.

Returns

IEnumerable<AudioVisualEvent>

Detected events matching the target categories.
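
Examples

A hypothetical sketch restricting detection to categories of interest; `model`, `audio`, `frames`, and the downstream `HandleAlert` method are all assumptions:

```csharp
var targets = new[] { "glass_break", "scream" };
foreach (var ev in model.DetectSpecificEvents(audio, frames, targets, frameRate: 30.0))
{
    HandleAlert(ev); // hypothetical application handler
}
```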

DetectSyncEvents(Tensor<T>, IEnumerable<Tensor<T>>, double)

Detects audio-visual synchronization events (e.g., lip sync).

IEnumerable<(double StartTime, double EndTime, T SyncQuality, string Description)> DetectSyncEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<(double StartTime, double EndTime, T SyncQuality, string Description)>

Sync events with quality scores.
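
Examples

A hypothetical lip-sync quality check; `model`, `audio`, and `frames` are assumed to exist:

```csharp
foreach (var (start, end, quality, description) in
         model.DetectSyncEvents(audio, frames, frameRate: 30.0))
{
    Console.WriteLine($"{start:F1}s-{end:F1}s  sync={quality}  ({description})");
}
```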

GenerateDenseCaptions(Tensor<T>, IEnumerable<Tensor<T>>, double)

Generates dense event captions for the entire video.

IEnumerable<(double Time, string Caption)> GenerateDenseCaptions(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<(double Time, string Caption)>

Time-stamped captions describing events.
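
Examples

A hypothetical sketch building a time-stamped summary of a clip; `model`, `audio`, and `frames` are assumed to exist:

```csharp
foreach (var (time, caption) in
         model.GenerateDenseCaptions(audio, frames, frameRate: 30.0))
{
    Console.WriteLine($"[{time,7:F1}s] {caption}");
}
```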

GenerateProposals(Tensor<T>, IEnumerable<Tensor<T>>, double)

Generates temporal proposals for potential events.

IEnumerable<(double StartTime, double EndTime, T EventnessScore)> GenerateProposals(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<(double StartTime, double EndTime, T EventnessScore)>

Proposed time segments that may contain events.
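
Examples

A hypothetical two-stage sketch: generate proposals, keep the confident ones, then classify each (segment slicing is left out); `model` is assumed to be an `IAudioVisualEventLocalizationModel<float>`, with `audio` and `frames` prepared elsewhere:

```csharp
foreach (var (start, end, eventness) in
         model.GenerateProposals(audio, frames, frameRate: 30.0))
{
    // The 0.5 threshold is an application choice, not part of this interface.
    if (eventness < 0.5f) continue;

    // ...slice audio/frames to [start, end] and call ClassifyEvent...
}
```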

LocalizeEventByDescription(Tensor<T>, IEnumerable<Tensor<T>>, string, double)

Localizes a specific event described in text.

IEnumerable<(double StartTime, double EndTime, T Confidence)> LocalizeEventByDescription(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, string eventDescription, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

eventDescription string

Text description of the event.

frameRate double

Video frame rate.

Returns

IEnumerable<(double StartTime, double EndTime, T Confidence)>

Temporal segments where the event occurs.

Remarks

For Beginners: Find events using natural language!

Example: "person laughing" → returns [(5.2s, 7.8s), (15.1s, 16.4s)]
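
Examples

A hypothetical natural-language query sketch; `model`, `audio`, and `frames` are assumed to exist:

```csharp
var hits = model.LocalizeEventByDescription(
    audio, frames, "person laughing", frameRate: 30.0);

foreach (var (start, end, confidence) in hits)
    Console.WriteLine($"{start:F1}s - {end:F1}s (confidence {confidence})");
```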

SegmentScenes(Tensor<T>, IEnumerable<Tensor<T>>, double)

Segments video into coherent audio-visual scenes.

IEnumerable<(double StartTime, double EndTime, string SceneDescription)> SegmentScenes(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)

Parameters

audioWaveform Tensor<T>

Audio waveform.

frames IEnumerable<Tensor<T>>

Video frames.

frameRate double

Video frame rate.

Returns

IEnumerable<(double StartTime, double EndTime, string SceneDescription)>

Scene boundaries with descriptions.
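
Examples

A hypothetical chaptering sketch; `model`, `audio`, and `frames` are assumed to exist:

```csharp
foreach (var (start, end, description) in
         model.SegmentScenes(audio, frames, frameRate: 30.0))
{
    Console.WriteLine($"Scene {start:F1}s-{end:F1}s: {description}");
}
```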

TrackEvent(Tensor<T>, IEnumerable<Tensor<T>>, AudioVisualEvent, double)

Tracks a detected event over time.

IEnumerable<AudioVisualEvent> TrackEvent(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, AudioVisualEvent initialEvent, double frameRate)

Parameters

audioWaveform Tensor<T>

Full audio waveform.

frames IEnumerable<Tensor<T>>

All video frames.

initialEvent AudioVisualEvent

Initial event detection.

frameRate double

Video frame rate.

Returns

IEnumerable<AudioVisualEvent>

Event trajectory with updated temporal and spatial locations.
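
Examples

A hypothetical sketch following one detection through the rest of a clip; `model`, `audio`, and `frames` are assumed to exist, and `System.Linq` supplies `First()`:

```csharp
AudioVisualEvent initial = model.DetectEvents(audio, frames, 30.0).First();

foreach (AudioVisualEvent step in model.TrackEvent(audio, frames, initial, 30.0))
{
    // Each element carries the event's updated temporal/spatial location.
    Console.WriteLine(step);
}
```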