Class VoiceActivityDetectorBase<T>
- Namespace
- AiDotNet.Audio.VoiceActivity
- Assembly
- AiDotNet.dll
Base class for algorithmic voice activity detection implementations (non-neural network).
public abstract class VoiceActivityDetectorBase<T> : IVoiceActivityDetector<T>
Type Parameters
T
The numeric type used for calculations.
- Inheritance
- object → VoiceActivityDetectorBase<T>
- Implements
- IVoiceActivityDetector<T>
Remarks
Voice Activity Detection (VAD) determines whether audio contains speech or silence. This is fundamental to many audio applications including speech recognition, communication systems, and noise reduction.
For Beginners: VAD answers a simple question: "Is someone speaking right now?"
Common uses:
- Skip silence during transcription
- Reduce transmission bandwidth in VoIP
- Trigger recording only when speech is detected
- Segment audio into speaker turns
This base class provides:
- Frame-based processing with hangover logic
- Streaming mode with state management
- Segment detection across entire audio files
For neural network-based VAD (like Silero), see classes that extend AudioNeuralNetworkBase.
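As a sketch of how a concrete detector might extend this base class, the following energy-based implementation overrides ComputeFrameProbability. The RMS-energy heuristic, the noiseFloor parameter, and the NumOps.FromDouble call are illustrative assumptions for this example, not confirmed AiDotNet API:

```csharp
// Illustrative sketch only: a minimal energy-based VAD built on this base class.
// The energy heuristic, noiseFloor, and NumOps.FromDouble are assumptions made
// for the example, not documented AiDotNet members.
public class EnergyVad<T> : VoiceActivityDetectorBase<T>
{
    private readonly double _noiseFloor;

    public EnergyVad(int sampleRate = 16000, double noiseFloor = 0.01)
        : base(sampleRate: sampleRate, frameSize: 480, threshold: 0.5)
    {
        _noiseFloor = noiseFloor;
    }

    protected override T ComputeFrameProbability(T[] frame)
    {
        // Root-mean-square energy of the frame (assumes T is convertible to double).
        double sum = 0;
        foreach (var sample in frame)
        {
            double s = Convert.ToDouble(sample);
            sum += s * s;
        }
        double rms = Math.Sqrt(sum / frame.Length);

        // Map energy to a rough 0-1 "speech probability" relative to the noise floor.
        double p = Math.Min(1.0, rms / (_noiseFloor * 10));
        return NumOps.FromDouble(p);
    }
}
```

The base class handles framing, hangover logic, and streaming state; a subclass only needs to score a single frame.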
Constructors
VoiceActivityDetectorBase(int, int, double, int, int)
Initializes a new instance of VoiceActivityDetectorBase.
protected VoiceActivityDetectorBase(int sampleRate = 16000, int frameSize = 480, double threshold = 0.5, int minSpeechDurationMs = 250, int minSilenceDurationMs = 300)
Parameters
sampleRate int
Audio sample rate in Hz.
frameSize int
Frame size in samples.
threshold double
Detection threshold (0-1).
minSpeechDurationMs int
Minimum speech duration in ms.
minSilenceDurationMs int
Minimum silence duration in ms.
Fields
NumOps
Numeric operations for type T.
protected readonly INumericOperations<T> NumOps
Field Value
- INumericOperations<T>
_inSpeech
Current speech state.
protected bool _inSpeech
Field Value
- bool
_silenceFrameCount
Number of consecutive silence frames.
protected int _silenceFrameCount
Field Value
- int
_speechFrameCount
Number of consecutive speech frames.
protected int _speechFrameCount
Field Value
- int
Properties
FrameSize
Gets the frame size in samples used for detection.
public int FrameSize { get; protected set; }
Property Value
- int
MinSilenceDurationMs
Gets or sets the minimum silence duration in milliseconds.
public int MinSilenceDurationMs { get; set; }
Property Value
- int
Remarks
Silence gaps shorter than this don't split speech segments.
MinSpeechDurationMs
Gets or sets the minimum speech duration in milliseconds.
public int MinSpeechDurationMs { get; set; }
Property Value
- int
Remarks
Speech segments shorter than this are ignored (reduces false triggers).
SampleRate
Gets the sample rate this VAD operates at.
public int SampleRate { get; protected set; }
Property Value
- int
Threshold
Gets or sets the detection threshold (0.0 to 1.0).
public double Threshold { get; set; }
Property Value
- double
Remarks
Higher threshold = fewer false positives but may miss quiet speech. Lower threshold = catches more speech but may trigger on noise. Default is typically 0.5.
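A short illustration of that trade-off (the values and the `vad` instance are illustrative, not recommendations):

```csharp
// Tuning the threshold for different environments (illustrative values only;
// "vad" is any concrete VoiceActivityDetectorBase<float> instance).
vad.Threshold = 0.7;  // noisy room: fewer false triggers, may miss quiet speech
vad.Threshold = 0.3;  // quiet studio: catches soft speech, may trigger on noise
```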
Methods
ComputeFrameProbability(T[])
Computes speech probability for a single frame.
protected abstract T ComputeFrameProbability(T[] frame)
Parameters
frame T[]
Audio frame data.
Returns
- T
Speech probability (0-1).
DetectSpeech(Tensor<T>)
Detects whether speech is present in an audio frame.
public virtual bool DetectSpeech(Tensor<T> audioFrame)
Parameters
audioFrame Tensor<T>
Audio frame with shape [samples] or [channels, samples].
Returns
- bool
True if speech is detected, false otherwise.
DetectSpeechSegments(Tensor<T>)
Detects speech segments in a longer audio recording.
public virtual IReadOnlyList<(int StartSample, int EndSample)> DetectSpeechSegments(Tensor<T> audio)
Parameters
audio Tensor<T>
Full audio recording.
Returns
- IReadOnlyList<(int StartSample, int EndSample)>
List of (startSample, endSample) tuples for each speech segment.
Remarks
For Beginners: This finds all the parts where someone is talking.
Example result for a 10-second recording: [(0.5s, 2.3s), (4.1s, 6.8s), (8.0s, 9.5s)], meaning speech from 0.5-2.3s, silence, speech from 4.1-6.8s, and so on. Note that the returned tuples contain sample indices; divide by SampleRate to convert them to seconds.
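A sketch of consuming the returned sample ranges, assuming a concrete detector instance (the `vad` and `audio` names are illustrative):

```csharp
// Convert detected segments from sample indices to seconds for display.
IReadOnlyList<(int StartSample, int EndSample)> segments =
    vad.DetectSpeechSegments(audio);

foreach (var (start, end) in segments)
{
    double startSec = (double)start / vad.SampleRate;
    double endSec = (double)end / vad.SampleRate;
    Console.WriteLine($"Speech: {startSec:F1}s - {endSec:F1}s");
}
```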
GetFrameProbabilities(Tensor<T>)
Gets frame-by-frame speech probabilities for the entire audio.
public virtual T[] GetFrameProbabilities(Tensor<T> audio)
Parameters
audio Tensor<T>
Full audio recording.
Returns
- T[]
Array of speech probabilities, one per frame.
GetSpeechProbability(Tensor<T>)
Gets the speech probability for an audio frame.
public virtual T GetSpeechProbability(Tensor<T> audioFrame)
Parameters
audioFrame Tensor<T>
Audio frame to analyze.
Returns
- T
Probability of speech (0.0 = definitely not speech, 1.0 = definitely speech).
ProcessChunk(Tensor<T>)
Processes audio in streaming mode, maintaining state between calls.
public virtual (bool IsSpeech, T Probability) ProcessChunk(Tensor<T> audioChunk)
Parameters
audioChunk Tensor<T>
A chunk of audio for real-time processing.
Returns
- (bool IsSpeech, T Probability)
Speech detection result with probability.
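A possible streaming loop built on ProcessChunk and ResetState; the `microphone` and `recorder` objects and their methods are hypothetical stand-ins for an audio source and sink:

```csharp
// Streaming VAD: internal state (hangover counters, in-speech flag)
// persists across ProcessChunk calls, so reset it between streams.
// "microphone", "recorder", and their methods are hypothetical helpers.
vad.ResetState();
while (microphone.TryReadChunk(out Tensor<float> chunk))
{
    var (isSpeech, probability) = vad.ProcessChunk(chunk);
    if (isSpeech)
    {
        recorder.Append(chunk);  // e.g., only record while speech is active
    }
}
vad.ResetState();  // clear state before processing an unrelated stream
```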
ResetState()
Resets internal state for streaming mode.
public virtual void ResetState()