
Class EnergyBasedVad<T>

Namespace
AiDotNet.Audio.VoiceActivity
Assembly
AiDotNet.dll

Simple energy-based voice activity detector (algorithmic, no neural network).

public class EnergyBasedVad<T> : VoiceActivityDetectorBase<T>, IVoiceActivityDetector<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
VoiceActivityDetectorBase<T>
EnergyBasedVad<T>
Implements
IVoiceActivityDetector<T>

Remarks

This is a basic VAD that detects speech based on signal energy (loudness). It combines multiple features for more robust detection:

  • Short-time energy
  • Zero-crossing rate
  • Spectral flatness

For Beginners: This is the simplest type of VAD:

Basic idea: Speech is louder than silence!

  • Compute the "energy" (sum of squared samples) for each frame
  • If energy exceeds a threshold, it's probably speech
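
The basic idea above can be sketched in a few lines. This is a minimal, standalone illustration and not the library's internal code: the class name FrameEnergy, the method names, and the fixed energyThreshold parameter are all hypothetical.

```csharp
public static class FrameEnergy
{
    // Energy of one frame: the sum of squared samples.
    public static double Compute(double[] frame)
    {
        double sum = 0.0;
        foreach (double sample in frame)
        {
            sum += sample * sample;
        }
        return sum;
    }

    // Hypothetical decision rule: flag the frame when its energy
    // exceeds a fixed, caller-supplied threshold.
    public static bool IsProbablySpeech(double[] frame, double energyThreshold)
    {
        return Compute(frame) > energyThreshold;
    }
}
```

Because the sum grows with frame length, a threshold chosen for one frame size would likely need rescaling for another.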

Enhanced features used here:

  1. Energy: How loud is the signal?
  2. Zero-crossings: How often does the signal cross zero?
    • Speech: Medium zero-crossings (voiced sounds)
    • Noise: High zero-crossings (random noise)
  3. Spectral flatness: Is it tonal or noisy?
    • Speech: Low flatness (has harmonic structure)
    • Noise: High flatness (random spectrum)
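
The two additional features can be illustrated with the standalone sketch below. It uses common textbook definitions and assumes nothing about how EnergyBasedVad<T> computes them internally. The class FrameFeatures and its methods are hypothetical names, and the naive DFT is chosen only to keep the example dependency-free.

```csharp
using System;

public static class FrameFeatures
{
    // Fraction of adjacent sample pairs whose signs differ
    // (0 = no crossings, 1 = a crossing between every pair).
    public static double ZeroCrossingRate(double[] frame)
    {
        if (frame.Length < 2) return 0.0;
        int crossings = 0;
        for (int i = 1; i < frame.Length; i++)
        {
            bool previousNonNegative = frame[i - 1] >= 0.0;
            bool currentNonNegative = frame[i] >= 0.0;
            if (previousNonNegative != currentNonNegative) crossings++;
        }
        return (double)crossings / (frame.Length - 1);
    }

    // Spectral flatness: geometric mean divided by arithmetic mean of the
    // power spectrum. Near 0 for tonal frames, near 1 for a flat spectrum.
    // Uses a naive O(n^2) DFT over bins 1..n/2 (the DC bin is skipped).
    public static double SpectralFlatness(double[] frame)
    {
        int n = frame.Length;
        int bins = n / 2;
        if (bins < 1) return 0.0;
        const double epsilon = 1e-12; // avoids log(0) for empty bins
        double logSum = 0.0;
        double sum = 0.0;
        for (int k = 1; k <= bins; k++)
        {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++)
            {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.Cos(angle);
                im -= frame[t] * Math.Sin(angle);
            }
            double power = re * re + im * im + epsilon;
            logSum += Math.Log(power);
            sum += power;
        }
        double geometricMean = Math.Exp(logSum / bins);
        double arithmeticMean = sum / bins;
        return geometricMean / arithmeticMean;
    }
}
```

A pure sine wave concentrates its power in one bin and so scores close to 0, while a single-sample impulse spreads power evenly across all bins and scores close to 1.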

Pros:

  • Very fast (no neural network)
  • Low latency
  • Works well in quiet environments

Cons:

  • Struggles with background noise
  • May trigger on loud non-speech sounds
  • Requires threshold tuning for different environments

For better noise robustness, use a neural-network-based VAD such as SileroVad.

Constructors

EnergyBasedVad(int, int, double, double, double, double, bool, int, int)

Creates an energy-based VAD. All parameters are optional and fall back to the defaults listed below.

public EnergyBasedVad(int sampleRate = 16000, int frameSize = 480, double threshold = 0.5, double energyWeight = 0.5, double zcrWeight = 0.25, double flatnessWeight = 0.25, bool adaptiveThreshold = true, int minSpeechDurationMs = 250, int minSilenceDurationMs = 300)

Parameters

sampleRate int

Audio sample rate (default: 16000).

frameSize int

Frame size in samples (default: 480, which is 30 ms at 16 kHz).

threshold double

Detection threshold in the range 0 to 1 (default: 0.5).

energyWeight double

Weight for energy feature (default: 0.5).

zcrWeight double

Weight for zero-crossing rate (default: 0.25).

flatnessWeight double

Weight for spectral flatness (default: 0.25).

adaptiveThreshold bool

Enable adaptive threshold (default: true).

minSpeechDurationMs int

Minimum speech duration in milliseconds (default: 250).

minSilenceDurationMs int

Minimum silence duration in milliseconds (default: 300).
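
Since every constructor parameter has a default, named arguments can override only the ones that matter. The sketch below assumes that float is a supported type argument for T and that the AiDotNet assembly is referenced. The VadSetup class, its method names, and the specific threshold and silence values are hypothetical and purely illustrative, not recommendations.

```csharp
using AiDotNet.Audio.VoiceActivity;

public static class VadSetup
{
    // Frame length in samples for a given duration: sampleRate * ms / 1000.
    public static int FrameSizeFor(int sampleRate, int frameMs)
    {
        return sampleRate * frameMs / 1000;
    }

    // Builds a detector with a stricter threshold and a longer silence
    // requirement; the feature weights keep their documented defaults.
    public static EnergyBasedVad<float> CreateStricterDetector()
    {
        const int sampleRate = 16000;
        return new EnergyBasedVad<float>(
            sampleRate: sampleRate,
            frameSize: FrameSizeFor(sampleRate, 30), // 480 samples at 16 kHz
            threshold: 0.6,                          // illustrative value
            minSilenceDurationMs: 500);              // illustrative value
    }
}
```

Deriving frameSize from the sample rate keeps the frame duration at 30 ms if the sample rate is later changed, for example to 8000 Hz (240 samples).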

Methods

ComputeFrameProbability(T[])

Computes speech probability for a single frame.

protected override T ComputeFrameProbability(T[] frame)

Parameters

frame T[]

Audio frame data.

Returns

T

Speech probability (0-1).
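
This page does not specify how the three features are fused into one probability. One plausible reading of the energyWeight, zcrWeight, and flatnessWeight constructor parameters is a weighted average, sketched below as an assumption rather than as the library's actual formula. ScoreFusion and Combine are hypothetical names. Each input score is assumed to be already normalized to 0-1 with higher meaning more speech-like, so a raw flatness value would first be inverted, for instance as 1 - flatness.

```csharp
using System;

public static class ScoreFusion
{
    // Hypothetical fusion: weighted average of three per-feature scores,
    // each already in [0, 1]. Defaults mirror the constructor defaults.
    public static double Combine(
        double energyScore,
        double zcrScore,
        double flatnessScore,
        double energyWeight = 0.5,
        double zcrWeight = 0.25,
        double flatnessWeight = 0.25)
    {
        double totalWeight = energyWeight + zcrWeight + flatnessWeight;
        if (totalWeight <= 0.0) return 0.0;

        double weighted = energyWeight * energyScore
                        + zcrWeight * zcrScore
                        + flatnessWeight * flatnessScore;

        // Normalize by the weight total and clamp so the result stays in [0, 1].
        return Math.Max(0.0, Math.Min(1.0, weighted / totalWeight));
    }
}
```

Under this reading, a frame would count as speech when the combined value exceeds the configured threshold (0.5 by default).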

ResetState()

Resets the VAD state including adaptive thresholds.

public override void ResetState()