Class VoiceActivityDetectorBase<T>

Namespace
AiDotNet.Audio.VoiceActivity
Assembly
AiDotNet.dll

Base class for algorithmic voice activity detection implementations (non-neural network).

public abstract class VoiceActivityDetectorBase<T> : IVoiceActivityDetector<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object → VoiceActivityDetectorBase<T>
Implements
IVoiceActivityDetector<T>

Remarks

Voice Activity Detection (VAD) determines whether audio contains speech or silence. This is fundamental to many audio applications including speech recognition, communication systems, and noise reduction.

For Beginners: VAD answers a simple question: "Is someone speaking right now?"

Common uses:

  • Skip silence during transcription
  • Reduce transmission bandwidth in VoIP
  • Trigger recording only when speech is detected
  • Segment audio into speaker turns

This base class provides:

  • Frame-based processing with hangover logic (the speech/silence state persists through brief pauses, so segments aren't split mid-word)
  • Streaming mode with state management
  • Segment detection across entire audio files

For neural network-based VAD (like Silero), see classes that extend AudioNeuralNetworkBase.
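
As a sketch of how a concrete detector plugs into this base class, the following hypothetical subclass implements ComputeFrameProbability with a simple short-term energy heuristic. The EnergyVad name and its RMS-to-probability mapping are illustrative assumptions; only VoiceActivityDetectorBase<T> and the abstract ComputeFrameProbability come from the documented API.

```csharp
using System;

// Hypothetical example: a minimal energy-based VAD built on the base class.
// The energy-to-probability mapping below is a crude placeholder; real
// detectors would use spectral features or a trained model.
public class EnergyVad : VoiceActivityDetectorBase<double>
{
    public EnergyVad(int sampleRate = 16000, int frameSize = 480)
        : base(sampleRate, frameSize)
    {
    }

    protected override double ComputeFrameProbability(double[] frame)
    {
        // Root-mean-square energy of the frame.
        double sumSquares = 0.0;
        for (int i = 0; i < frame.Length; i++)
            sumSquares += frame[i] * frame[i];
        double rms = Math.Sqrt(sumSquares / frame.Length);

        // Squash RMS into the required 0-1 range; the base class then
        // applies the threshold, hangover, and duration logic.
        return Math.Min(1.0, rms * 10.0);
    }
}
```

The base class handles thresholding, hangover, and minimum-duration filtering, so a subclass only needs to score individual frames.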

Constructors

VoiceActivityDetectorBase(int, int, double, int, int)

Initializes a new instance of VoiceActivityDetectorBase.

protected VoiceActivityDetectorBase(int sampleRate = 16000, int frameSize = 480, double threshold = 0.5, int minSpeechDurationMs = 250, int minSilenceDurationMs = 300)

Parameters

sampleRate int

Audio sample rate in Hz (default 16000).

frameSize int

Frame size in samples (default 480, i.e. 30 ms at 16 kHz).

threshold double

Detection threshold between 0 and 1 (default 0.5).

minSpeechDurationMs int

Minimum speech duration in milliseconds (default 250).

minSilenceDurationMs int

Minimum silence duration in milliseconds (default 300).

Fields

NumOps

Numeric operations for type T.

protected readonly INumericOperations<T> NumOps

Field Value

INumericOperations<T>

_inSpeech

Current speech state.

protected bool _inSpeech

Field Value

bool

_silenceFrameCount

Number of consecutive silence frames.

protected int _silenceFrameCount

Field Value

int

_speechFrameCount

Number of consecutive speech frames.

protected int _speechFrameCount

Field Value

int

Properties

FrameSize

Gets the frame size in samples used for detection.

public int FrameSize { get; protected set; }

Property Value

int

MinSilenceDurationMs

Gets or sets the minimum silence duration in milliseconds.

public int MinSilenceDurationMs { get; set; }

Property Value

int

Remarks

Silence gaps shorter than this don't split speech segments.

MinSpeechDurationMs

Gets or sets the minimum speech duration in milliseconds.

public int MinSpeechDurationMs { get; set; }

Property Value

int

Remarks

Speech segments shorter than this are ignored (reduces false triggers).

SampleRate

Gets the sample rate this VAD operates at.

public int SampleRate { get; protected set; }

Property Value

int

Threshold

Gets or sets the detection threshold (0.0 to 1.0).

public double Threshold { get; set; }

Property Value

double

Remarks

Higher threshold = fewer false positives but may miss quiet speech. Lower threshold = catches more speech but may trigger on noise. Default is typically 0.5.

Methods

ComputeFrameProbability(T[])

Computes speech probability for a single frame.

protected abstract T ComputeFrameProbability(T[] frame)

Parameters

frame T[]

Audio frame data.

Returns

T

Speech probability (0-1).

DetectSpeech(Tensor<T>)

Detects whether speech is present in an audio frame.

public virtual bool DetectSpeech(Tensor<T> audioFrame)

Parameters

audioFrame Tensor<T>

Audio frame with shape [samples] or [channels, samples].

Returns

bool

True if speech is detected, false otherwise.

DetectSpeechSegments(Tensor<T>)

Detects speech segments in a longer audio recording.

public virtual IReadOnlyList<(int StartSample, int EndSample)> DetectSpeechSegments(Tensor<T> audio)

Parameters

audio Tensor<T>

Full audio recording.

Returns

IReadOnlyList<(int StartSample, int EndSample)>

List of (startSample, endSample) tuples for each speech segment.

Remarks

For Beginners: This finds all the parts where someone is talking.

Example result for a 10-second recording at 16 kHz: [(8000, 36800), (65600, 108800), (128000, 152000)], meaning speech from 0.5-2.3 s, silence, speech from 4.1-6.8 s, then speech from 8.0-9.5 s. Divide the sample indices by SampleRate to convert them to seconds.
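
A hedged usage sketch: EnergyVad stands in for any concrete subclass of this base class, and LoadAudio is an assumed helper that returns a Tensor<double> of samples (neither name is part of the documented API).

```csharp
using System;

// Hypothetical usage of DetectSpeechSegments on a full recording.
var vad = new EnergyVad(sampleRate: 16000, frameSize: 480);

Tensor<double> audio = LoadAudio("recording.wav"); // assumed helper

foreach (var (start, end) in vad.DetectSpeechSegments(audio))
{
    // Segment boundaries are sample indices; divide by the sample
    // rate to convert them to seconds for display.
    double startSec = start / (double)vad.SampleRate;
    double endSec = end / (double)vad.SampleRate;
    Console.WriteLine($"Speech: {startSec:F2}s - {endSec:F2}s");
}
```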

GetFrameProbabilities(Tensor<T>)

Gets frame-by-frame speech probabilities for the entire audio.

public virtual T[] GetFrameProbabilities(Tensor<T> audio)

Parameters

audio Tensor<T>

Full audio recording.

Returns

T[]

Array of speech probabilities, one per frame.

GetSpeechProbability(Tensor<T>)

Gets the speech probability for an audio frame.

public virtual T GetSpeechProbability(Tensor<T> audioFrame)

Parameters

audioFrame Tensor<T>

Audio frame to analyze.

Returns

T

Probability of speech (0.0 = definitely not speech, 1.0 = definitely speech).

ProcessChunk(Tensor<T>)

Processes audio in streaming mode, maintaining state between calls.

public virtual (bool IsSpeech, T Probability) ProcessChunk(Tensor<T> audioChunk)

Parameters

audioChunk Tensor<T>

A chunk of audio for real-time processing.

Returns

(bool IsSpeech, T Probability)

Speech detection result with probability.

ResetState()

Resets internal state for streaming mode.

public virtual void ResetState()
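
For real-time pipelines, ProcessChunk and ResetState can be combined as sketched below. EnergyVad is a hypothetical concrete subclass, and CaptureAudioChunks and StartOrContinueRecording are assumed application-side functions, not part of the library.

```csharp
// Hypothetical streaming loop: audio chunks arrive from a capture source.
var vad = new EnergyVad(sampleRate: 16000, frameSize: 480);
vad.ResetState(); // start from a clean state before a new stream

foreach (Tensor<double> chunk in CaptureAudioChunks()) // assumed source
{
    // ProcessChunk keeps hangover and duration state between calls.
    var (isSpeech, probability) = vad.ProcessChunk(chunk);
    if (isSpeech)
        StartOrContinueRecording(chunk); // assumed sink
}

// When the stream ends or the audio source changes, reset before reuse
// so leftover state doesn't leak into the next stream.
vad.ResetState();
```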