Interface IAudioVisualEventLocalizationModel<T>
- Namespace
- AiDotNet.Interfaces
- Assembly
- AiDotNet.dll
Defines the contract for audio-visual event localization models.
public interface IAudioVisualEventLocalizationModel<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Audio-visual event localization identifies WHEN and WHERE events occur in video by jointly analyzing audio and visual streams. This goes beyond simple detection to provide precise temporal boundaries and spatial locations.
For Beginners: Finding events in videos using sight AND sound!
Key capabilities:
- Temporal localization: When does the dog bark? (2.3s - 4.1s)
- Spatial localization: Where is the barking dog? (bounding box)
- Event classification: What kind of event is it? (animal sound)
- Multi-event detection: Find all events in a video
Use cases:
- Video surveillance: Detect glass breaking sounds and locate the window
- Sports analysis: Find and timestamp all goals using crowd cheering
- Content moderation: Detect and locate inappropriate audio-visual content
Properties
SupportedEventCategories
Gets the supported event categories.
IReadOnlyList<string> SupportedEventCategories { get; }
Property Value
- IReadOnlyList<string>
TemporalResolution
Gets the temporal resolution in seconds.
double TemporalResolution { get; }
Property Value
- double
Methods
AnswerEventQuestion(Tensor<T>, IEnumerable<Tensor<T>>, string, double)
Answers questions about events in the video.
(string Answer, IEnumerable<(double StartTime, double EndTime)> Evidence) AnswerEventQuestion(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, string question, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
question (string): Question about events.
frameRate (double): Video frame rate.
Returns
- (string Answer, IEnumerable<(double StartTime, double EndTime)> Evidence)
Answer with supporting temporal evidence.
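A usage sketch, assuming `model` is some implementation of this interface with `T = double`, and `audio` and `frames` are a preloaded waveform tensor and frame sequence (none of these names come from the library):

```csharp
// Hypothetical setup: 'model' implements IAudioVisualEventLocalizationModel<double>;
// 'audio' (Tensor<double>) and 'frames' (IEnumerable<Tensor<double>>) are preloaded.
var (answer, evidence) = model.AnswerEventQuestion(
    audio, frames, question: "When does the dog bark?", frameRate: 30.0);
Console.WriteLine(answer);
foreach (var (start, end) in evidence)
    Console.WriteLine($"  supporting segment: {start:F1}s - {end:F1}s");
```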
ClassifyEvent(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>)
Classifies a pre-segmented event.
Dictionary<string, T> ClassifyEvent(Tensor<T> audioSegment, IEnumerable<Tensor<T>> frameSegment, IEnumerable<string> candidateLabels)
Parameters
audioSegment (Tensor<T>): Audio segment for the event.
frameSegment (IEnumerable<Tensor<T>>): Video frames for the event.
candidateLabels (IEnumerable<string>): Possible event labels.
Returns
- Dictionary<string, T>
Classification probabilities.
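A sketch of closed-set classification, assuming `model`, `audioSegment`, and `frameSegment` are already in hand and `System.Linq` is imported:

```csharp
// Score a pre-segmented clip against a fixed set of candidate labels.
var labels = new[] { "dog barking", "glass breaking", "applause" };
Dictionary<string, double> scores = model.ClassifyEvent(audioSegment, frameSegment, labels);
// Pick the most probable label (using System.Linq).
string bestLabel = scores.OrderByDescending(kv => kv.Value).First().Key;
```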
ComputeEventAttention(Tensor<T>, IEnumerable<Tensor<T>>)
Computes event-level audio-visual attention.
(Tensor<T> AudioToVisualAttention, Tensor<T> VisualToAudioAttention) ComputeEventAttention(Tensor<T> audioSegment, IEnumerable<Tensor<T>> frameSegment)
Parameters
audioSegment (Tensor<T>): Audio segment.
frameSegment (IEnumerable<Tensor<T>>): Video frame segment.
Returns
- (Tensor<T> AudioToVisualAttention, Tensor<T> VisualToAudioAttention)
Bidirectional attention maps between the audio and visual streams.
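The two attention maps can be retrieved by tuple deconstruction (with `model` and the segment inputs assumed as in the examples above):

```csharp
// Which visual regions does the audio attend to, and vice versa?
var (audioToVisual, visualToAudio) = model.ComputeEventAttention(audioSegment, frameSegment);
```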
DetectAnomalies(Tensor<T>, IEnumerable<Tensor<T>>, double)
Detects anomalous events that don't match expected patterns.
IEnumerable<(double StartTime, double EndTime, T AnomalyScore, string Description)> DetectAnomalies(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double StartTime, double EndTime, T AnomalyScore, string Description)>
Detected anomalies with anomaly scores.
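A filtering sketch, with `model`, `audio`, and `frames` assumed as above; the 0.8 threshold is illustrative, not part of the API:

```csharp
// Report only strongly anomalous segments.
foreach (var (start, end, score, description) in model.DetectAnomalies(audio, frames, 30.0))
{
    if (score > 0.8) // arbitrary cutoff for illustration
        Console.WriteLine($"{start:F1}s-{end:F1}s (score {score:F2}): {description}");
}
```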
DetectEvents(Tensor<T>, IEnumerable<Tensor<T>>, double)
Detects and localizes all audio-visual events in a video.
IEnumerable<AudioVisualEvent> DetectEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<AudioVisualEvent>
List of detected events with temporal and spatial localization.
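A minimal detection loop, assuming `model`, `audio`, and `frames` as above. The property names on AudioVisualEvent are guesses for illustration; consult that type's documentation for the actual members:

```csharp
// Detect everything in the clip. Property names on AudioVisualEvent below
// are illustrative, not confirmed by this page.
foreach (AudioVisualEvent ev in model.DetectEvents(audio, frames, frameRate: 30.0))
{
    Console.WriteLine($"{ev.Category}: {ev.StartTime:F1}s - {ev.EndTime:F1}s");
}
```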
DetectSpecificEvents(Tensor<T>, IEnumerable<Tensor<T>>, IEnumerable<string>, double)
Detects events of specific categories.
IEnumerable<AudioVisualEvent> DetectSpecificEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, IEnumerable<string> targetCategories, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
targetCategories (IEnumerable<string>): Categories to detect.
frameRate (double): Video frame rate.
Returns
- IEnumerable<AudioVisualEvent>
Detected events matching the target categories.
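A targeted-detection sketch (inputs assumed as above; the category strings are examples, and should ideally be drawn from SupportedEventCategories):

```csharp
// Restrict detection to categories of interest.
var targets = new[] { "glass breaking", "alarm" };
IEnumerable<AudioVisualEvent> hits = model.DetectSpecificEvents(audio, frames, targets, 30.0);
```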
DetectSyncEvents(Tensor<T>, IEnumerable<Tensor<T>>, double)
Detects audio-visual synchronization events (e.g., lip sync).
IEnumerable<(double StartTime, double EndTime, T SyncQuality, string Description)> DetectSyncEvents(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double StartTime, double EndTime, T SyncQuality, string Description)>
Sync events with quality scores.
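A lip-sync quality-check sketch (inputs assumed as above; the 0.5 cutoff is illustrative, not part of the API):

```csharp
// Flag poorly synchronized segments, e.g. for lip-sync quality checks.
foreach (var (start, end, quality, description) in model.DetectSyncEvents(audio, frames, 30.0))
{
    if (quality < 0.5) // arbitrary cutoff for illustration
        Console.WriteLine($"Sync issue {start:F1}s-{end:F1}s: {description}");
}
```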
GenerateDenseCaptions(Tensor<T>, IEnumerable<Tensor<T>>, double)
Generates dense event captions for the entire video.
IEnumerable<(double Time, string Caption)> GenerateDenseCaptions(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double Time, string Caption)>
Time-stamped captions describing events.
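With inputs assumed as above, the captions print directly as a time-stamped outline:

```csharp
// Print a time-stamped outline of everything happening in the video.
foreach (var (time, caption) in model.GenerateDenseCaptions(audio, frames, 30.0))
    Console.WriteLine($"[{time:F1}s] {caption}");
```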
GenerateProposals(Tensor<T>, IEnumerable<Tensor<T>>, double)
Generates temporal proposals for potential events.
IEnumerable<(double StartTime, double EndTime, T EventnessScore)> GenerateProposals(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double StartTime, double EndTime, T EventnessScore)>
Proposed time segments that may contain events.
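A proposal-filtering sketch (inputs assumed as above; the 0.5 cutoff is illustrative and `System.Linq` is required):

```csharp
// Keep only high-eventness proposals, e.g. as input to ClassifyEvent.
var candidates = model.GenerateProposals(audio, frames, 30.0)
                      .Where(p => p.EventnessScore > 0.5) // arbitrary cutoff
                      .ToList();
```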
LocalizeEventByDescription(Tensor<T>, IEnumerable<Tensor<T>>, string, double)
Localizes a specific event described in text.
IEnumerable<(double StartTime, double EndTime, T Confidence)> LocalizeEventByDescription(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, string eventDescription, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
eventDescription (string): Text description of the event.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double StartTime, double EndTime, T Confidence)>
Temporal segments where the event occurs.
Remarks
For Beginners: Find events using natural language!
Example: "person laughing" → returns [(5.2s, 7.8s), (15.1s, 16.4s)]
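The "person laughing" example, sketched in code (with `model`, `audio`, and `frames` assumed as above):

```csharp
// Text-driven localization, mirroring the example in the remarks.
foreach (var (start, end, confidence) in
         model.LocalizeEventByDescription(audio, frames, "person laughing", 30.0))
{
    Console.WriteLine($"{start:F1}s - {end:F1}s (confidence {confidence:F2})");
}
```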
SegmentScenes(Tensor<T>, IEnumerable<Tensor<T>>, double)
Segments video into coherent audio-visual scenes.
IEnumerable<(double StartTime, double EndTime, string SceneDescription)> SegmentScenes(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, double frameRate)
Parameters
audioWaveform (Tensor<T>): Audio waveform.
frames (IEnumerable<Tensor<T>>): Video frames.
frameRate (double): Video frame rate.
Returns
- IEnumerable<(double StartTime, double EndTime, string SceneDescription)>
Scene boundaries with descriptions.
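A scene-outline sketch (inputs assumed as above):

```csharp
// Produce a scene-by-scene outline of the video.
foreach (var (start, end, description) in model.SegmentScenes(audio, frames, 30.0))
    Console.WriteLine($"Scene {start:F1}s-{end:F1}s: {description}");
```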
TrackEvent(Tensor<T>, IEnumerable<Tensor<T>>, AudioVisualEvent, double)
Tracks an event across time.
IEnumerable<AudioVisualEvent> TrackEvent(Tensor<T> audioWaveform, IEnumerable<Tensor<T>> frames, AudioVisualEvent initialEvent, double frameRate)
Parameters
audioWaveform (Tensor<T>): Full audio waveform.
frames (IEnumerable<Tensor<T>>): All video frames.
initialEvent (AudioVisualEvent): Initial event detection.
frameRate (double): Video frame rate.
Returns
- IEnumerable<AudioVisualEvent>
Event trajectory with updated temporal and spatial locations.
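A sketch chaining DetectEvents into TrackEvent (inputs assumed as above; `System.Linq` is required, and this assumes at least one event is detected):

```csharp
// Follow the first detected event through the full video.
AudioVisualEvent seed = model.DetectEvents(audio, frames, 30.0).First();
IEnumerable<AudioVisualEvent> trajectory = model.TrackEvent(audio, frames, seed, 30.0);
```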