Interface IVoiceActivityDetector<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for Voice Activity Detection (VAD) models.
public interface IVoiceActivityDetector<T>
Type Parameters
T
The numeric type used for calculations.
Remarks
Voice Activity Detection determines when speech is present in an audio signal. This is a fundamental building block for many speech processing systems.
For Beginners: VAD answers the question "Is someone speaking right now?"
Why VAD is important:
- Speech Recognition: Only process audio when speech is present (saves compute)
- Voice Assistants: Detect when user starts/stops talking
- VoIP/Video Calls: Only transmit audio when speaking (saves bandwidth)
- Transcription: Find speech segments in long recordings
- Speaker Diarization: First step to identify who spoke when
How it works:
- Traditional: Look at energy levels, zero-crossing rate, spectral features
- Modern (Neural): Train a model to classify frames as speech/non-speech
Key metrics:
- Accuracy: How often it's correct
- False Positive Rate: Saying "speech" when it's noise (annoying in voice assistants)
- False Negative Rate: Missing actual speech (drops words in transcription)
- Latency: How quickly it detects speech onset
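The traditional approach above can be sketched as a short-time-energy classifier. This is an illustrative, standalone sketch; the helper name and threshold value are assumptions, not part of the AiDotNet API:

```csharp
// Minimal sketch of a traditional energy-based frame classifier.
// The method name and default threshold are illustrative assumptions.
static bool IsSpeechFrame(float[] frame, float energyThreshold = 0.01f)
{
    // Short-time energy: mean squared amplitude over the frame.
    float energy = 0f;
    foreach (float sample in frame)
        energy += sample * sample;
    energy /= frame.Length;

    // Frames whose energy exceeds the threshold are labeled speech.
    return energy > energyThreshold;
}
```

Real implementations typically combine energy with zero-crossing rate or spectral features, or replace the rule entirely with a trained neural classifier.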
Properties
FrameSize
Gets the frame size in samples used for detection.
int FrameSize { get; }
Property Value
- int
MinSilenceDurationMs
Gets or sets the minimum silence duration in milliseconds.
int MinSilenceDurationMs { get; set; }
Property Value
- int
Remarks
Silence gaps shorter than this don't split speech segments.
MinSpeechDurationMs
Gets or sets the minimum speech duration in milliseconds.
int MinSpeechDurationMs { get; set; }
Property Value
- int
Remarks
Speech segments shorter than this are ignored (reduces false triggers).
SampleRate
Gets the sample rate this VAD operates at.
int SampleRate { get; }
Property Value
- int
Threshold
Gets or sets the detection threshold (0.0 to 1.0).
double Threshold { get; set; }
Property Value
- double
Remarks
A higher threshold yields fewer false positives but may miss quiet speech; a lower threshold catches more speech but may trigger on noise. The default is typically 0.5.
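As a sketch of tuning these properties together, assuming some implementation of this interface is available (the scenario values below are assumptions, not recommended defaults):

```csharp
// Hypothetical tuning helper. IVoiceActivityDetector<T> is the interface
// documented here; the specific values are illustrative assumptions.
static void ConfigureForAssistant(IVoiceActivityDetector<float> vad)
{
    vad.Threshold = 0.7;            // stricter: fewer false positives
    vad.MinSpeechDurationMs = 250;  // ignore very short bursts of sound
    vad.MinSilenceDurationMs = 300; // don't split segments on brief pauses
}
```

For transcription, where missed words are costlier than false triggers, a lower Threshold (e.g. 0.3) would be the opposite trade-off.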
Methods
DetectSpeech(Tensor<T>)
Detects whether speech is present in an audio frame.
bool DetectSpeech(Tensor<T> audioFrame)
Parameters
audioFrame Tensor<T>
Audio frame with shape [samples] or [channels, samples].
Returns
- bool
True if speech is detected, false otherwise.
DetectSpeechSegments(Tensor<T>)
Detects speech segments in a longer audio recording.
IReadOnlyList<(int StartSample, int EndSample)> DetectSpeechSegments(Tensor<T> audio)
Parameters
audio Tensor<T>
Full audio recording.
Returns
- IReadOnlyList<(int StartSample, int EndSample)>
List of (startSample, endSample) tuples for each speech segment.
Remarks
For Beginners: This finds all the parts where someone is talking.
Example result for a 10-second recording (shown in seconds for readability; the returned tuples contain sample indices): [(0.5s, 2.3s), (4.1s, 6.8s), (8.0s, 9.5s)], meaning speech from 0.5-2.3s, silence, speech from 4.1-6.8s, and speech from 8.0-9.5s.
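A sketch of converting the returned sample indices into seconds, assuming an implementation of this interface and a loaded recording (the method name here is an assumption):

```csharp
// Hypothetical usage sketch: print detected speech segments in seconds.
// `vad` is any IVoiceActivityDetector<float>; `audio` holds the recording.
static void PrintSegments(IVoiceActivityDetector<float> vad, Tensor<float> audio)
{
    foreach (var (start, end) in vad.DetectSpeechSegments(audio))
    {
        // Sample index divided by sample rate gives time in seconds.
        double startSec = (double)start / vad.SampleRate;
        double endSec = (double)end / vad.SampleRate;
        Console.WriteLine($"Speech from {startSec:F2}s to {endSec:F2}s");
    }
}
```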
GetFrameProbabilities(Tensor<T>)
Gets frame-by-frame speech probabilities for the entire audio.
T[] GetFrameProbabilities(Tensor<T> audio)
Parameters
audio Tensor<T>
Full audio recording.
Returns
- T[]
Array of speech probabilities, one per frame.
GetSpeechProbability(Tensor<T>)
Gets the speech probability for an audio frame.
T GetSpeechProbability(Tensor<T> audioFrame)
Parameters
audioFrame Tensor<T>
Audio frame to analyze.
Returns
- T
Probability of speech (0.0 = definitely not speech, 1.0 = definitely speech).
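A sketch relating the raw probability to the boolean decision, assuming an implementation where DetectSpeech compares the probability against Threshold (implied by the Threshold remarks, but not guaranteed by this interface):

```csharp
// Hypothetical sketch: inspect a frame's raw score alongside the decision.
static void InspectFrame(IVoiceActivityDetector<float> vad, Tensor<float> frame)
{
    float p = vad.GetSpeechProbability(frame); // raw score in [0, 1]
    bool decision = vad.DetectSpeech(frame);   // typically p vs. Threshold
    Console.WriteLine($"p = {p}, speech = {decision}");
}
```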
ProcessChunk(Tensor<T>)
Processes audio in streaming mode, maintaining state between calls.
(bool IsSpeech, T Probability) ProcessChunk(Tensor<T> audioChunk)
Parameters
audioChunk Tensor<T>
A chunk of audio for real-time processing.
Returns
- (bool IsSpeech, T Probability)
Speech detection result with probability.
ResetState()
Resets internal state for streaming mode.
void ResetState()
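A sketch of a streaming loop using ProcessChunk and ResetState together; how chunks are obtained and sized is an assumption, since the interface does not prescribe it:

```csharp
// Hypothetical streaming sketch: feed chunks and reset between streams.
static void RunStreaming(IVoiceActivityDetector<float> vad,
                         IEnumerable<Tensor<float>> audioChunks)
{
    // Start from a clean internal state for a new audio stream.
    vad.ResetState();

    foreach (var chunk in audioChunks)
    {
        var (isSpeech, probability) = vad.ProcessChunk(chunk);
        if (isSpeech)
            Console.WriteLine($"Speech detected (p = {probability})");
    }
}
```

Calling ResetState between independent streams matters because ProcessChunk maintains state across calls; carrying state from one stream into another can skew the first few decisions.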