Class EnergyBasedVad<T>

Namespace: AiDotNet.Audio.VoiceActivity

Assembly: AiDotNet.dll

Simple energy-based voice activity detector (algorithmic, no neural network).

public class EnergyBasedVad<T> : VoiceActivityDetectorBase<T>, IVoiceActivityDetector<T>

Type Parameters

T: The numeric type used for calculations.

Inheritance: object → VoiceActivityDetectorBase<T> → EnergyBasedVad<T>

Implements: IVoiceActivityDetector<T>

Inherited Members: VoiceActivityDetectorBase<T>.NumOps

VoiceActivityDetectorBase<T>.SampleRate

VoiceActivityDetectorBase<T>.FrameSize

VoiceActivityDetectorBase<T>.Threshold

VoiceActivityDetectorBase<T>.MinSpeechDurationMs

VoiceActivityDetectorBase<T>.MinSilenceDurationMs

VoiceActivityDetectorBase<T>._speechFrameCount

VoiceActivityDetectorBase<T>._silenceFrameCount

VoiceActivityDetectorBase<T>._inSpeech

VoiceActivityDetectorBase<T>.DetectSpeech(Tensor<T>)

VoiceActivityDetectorBase<T>.GetSpeechProbability(Tensor<T>)

VoiceActivityDetectorBase<T>.DetectSpeechSegments(Tensor<T>)

VoiceActivityDetectorBase<T>.GetFrameProbabilities(Tensor<T>)

VoiceActivityDetectorBase<T>.ProcessChunk(Tensor<T>)

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Remarks

This is a basic VAD that detects speech primarily from signal energy (loudness). For more robust detection, it combines three features:

- Short-time energy
- Zero-crossing rate
- Spectral flatness

For Beginners: This is the simplest type of VAD:

Basic idea: Speech is louder than silence!

Compute the "energy" (sum of squared samples) for each frame
If energy exceeds a threshold, it's probably speech
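The two steps above can be sketched in a few lines. This is a minimal illustration, not the class's actual implementation: `FrameEnergySketch`, `Energy`, `IsProbablySpeech`, and the `energyThreshold` parameter are hypothetical names, and the sketch works on plain `double[]` frames rather than the library's `Tensor<T>`.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet.
public static class FrameEnergySketch
{
    // Energy as defined above: the sum of squared samples in one frame.
    public static double Energy(double[] frame)
    {
        double sum = 0.0;
        foreach (double sample in frame)
        {
            sum += sample * sample;
        }
        return sum;
    }

    // Assumed decision rule: a frame whose energy exceeds a fixed
    // threshold is treated as probable speech.
    public static bool IsProbablySpeech(double[] frame, double energyThreshold)
    {
        return Energy(frame) > energyThreshold;
    }
}
```

Because the raw sum grows with frame length and input gain, a fixed threshold tuned for one recording setup will rarely transfer to another, which is one reason the threshold needs tuning per environment.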

Enhanced features used here:

Energy: How loud is the signal?
Zero-crossings: How often does the signal cross zero?
- Speech: Medium zero-crossings (voiced sounds)
- Noise: High zero-crossings (random noise)
Spectral flatness: Is it tonal or noisy?
- Speech: Low flatness (has harmonic structure)
- Noise: High flatness (random spectrum)
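For readers who want to see what the two extra features measure, here is a rough sketch using their textbook definitions. It assumes zero-crossing rate is the fraction of adjacent sample pairs that change sign, and spectral flatness is the geometric mean of the power spectrum divided by its arithmetic mean. The class and method names are hypothetical, and the naive DFT is chosen for clarity rather than speed; the library may compute these features differently.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet.
public static class FrameFeatureSketch
{
    // Fraction of adjacent sample pairs whose signs differ (0 = never, 1 = always).
    public static double ZeroCrossingRate(double[] frame)
    {
        if (frame.Length < 2) return 0.0;
        int crossings = 0;
        for (int i = 1; i < frame.Length; i++)
        {
            if ((frame[i - 1] >= 0) != (frame[i] >= 0)) crossings++;
        }
        return (double)crossings / (frame.Length - 1);
    }

    // Geometric mean / arithmetic mean of the power spectrum (bins 1..N/2, DC skipped).
    // Near 0 for tonal signals, near 1 for a flat (noise-like) spectrum.
    public static double SpectralFlatness(double[] frame)
    {
        int n = frame.Length;
        int bins = n / 2;
        if (bins < 1) return 0.0;

        const double eps = 1e-12; // keeps Math.Log finite for empty bins
        double logSum = 0.0;
        double sum = 0.0;

        for (int k = 1; k <= bins; k++)
        {
            // Naive O(N^2) DFT of bin k.
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++)
            {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.Cos(angle);
                im -= frame[t] * Math.Sin(angle);
            }
            double power = re * re + im * im + eps;
            logSum += Math.Log(power);
            sum += power;
        }

        double geometricMean = Math.Exp(logSum / bins);
        double arithmeticMean = sum / bins;
        return geometricMean / arithmeticMean;
    }
}
```

A pure sine wave concentrates its power in one bin and scores close to 0, while a single impulse spreads power evenly across all bins and scores close to 1.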

Pros:

Very fast (no neural network)
Low latency
Works well in quiet environments

Cons:

Struggles with background noise
May trigger on loud non-speech sounds
Requires threshold tuning for different environments

For better noise robustness, use a neural network-based VAD such as SileroVad.

Constructors

EnergyBasedVad(int, int, double, double, double, double, bool, int, int)

Creates an energy-based VAD. Every parameter is optional and falls back to the default shown in the signature.

public EnergyBasedVad(int sampleRate = 16000, int frameSize = 480, double threshold = 0.5, double energyWeight = 0.5, double zcrWeight = 0.25, double flatnessWeight = 0.25, bool adaptiveThreshold = true, int minSpeechDurationMs = 250, int minSilenceDurationMs = 300)

Parameters

sampleRate int: Audio sample rate (default: 16000).
frameSize int: Frame size in samples (default: 480 = 30ms at 16kHz).
threshold double: Detection threshold 0-1 (default: 0.5).
energyWeight double: Weight for energy feature (default: 0.5).
zcrWeight double: Weight for zero-crossing rate (default: 0.25).
flatnessWeight double: Weight for spectral flatness (default: 0.25).
adaptiveThreshold bool: Enable adaptive threshold (default: true).
minSpeechDurationMs int: Minimum speech duration in milliseconds (default: 250).
minSilenceDurationMs int: Minimum silence duration in milliseconds (default: 300).
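The relationship between `sampleRate`, `frameSize`, and the two duration parameters is simple arithmetic, sketched below. The frame-duration formula follows directly from the definitions. Rounding the frame count up is an assumption, since this page does not say how the base class converts milliseconds to frames. `FrameTimingSketch` and its methods are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet.
public static class FrameTimingSketch
{
    // Duration of one frame in milliseconds: frameSize samples at sampleRate samples/second.
    public static double FrameDurationMs(int sampleRate, int frameSize)
    {
        return 1000.0 * frameSize / sampleRate;
    }

    // Number of whole frames needed to cover a duration.
    // Assumption: partial frames are rounded up.
    public static int FramesForDuration(int durationMs, int sampleRate, int frameSize)
    {
        return (int)Math.Ceiling(durationMs / FrameDurationMs(sampleRate, frameSize));
    }
}
```

With the defaults, one frame lasts 30 ms, so under this rounding assumption 250 ms of speech corresponds to 9 frames and 300 ms of silence to 10 frames.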

Methods

ComputeFrameProbability(T[])

Computes speech probability for a single frame.

protected override T ComputeFrameProbability(T[] frame)

Parameters

frame T[]: Audio frame data.

Returns

T: Speech probability (0-1).
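This page does not document how the three feature weights are combined into one probability. One plausible scheme, shown here purely as an assumption, is a normalized weighted sum of per-feature scores that have each already been mapped to the range 0-1, where a higher score means "more speech-like". `ProbabilityCombinerSketch` and `Combine` are hypothetical names; only the default weights come from the constructor signature.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet.
public static class ProbabilityCombinerSketch
{
    // Assumed combination rule: weighted average of three scores in [0, 1],
    // clamped to [0, 1]. Default weights mirror the constructor defaults.
    public static double Combine(
        double energyScore,
        double zcrScore,
        double flatnessScore,
        double energyWeight = 0.5,
        double zcrWeight = 0.25,
        double flatnessWeight = 0.25)
    {
        double totalWeight = energyWeight + zcrWeight + flatnessWeight;
        if (totalWeight <= 0.0) return 0.0;

        double weighted =
            energyWeight * energyScore +
            zcrWeight * zcrScore +
            flatnessWeight * flatnessScore;

        double probability = weighted / totalWeight;
        return Math.Min(1.0, Math.Max(0.0, probability));
    }
}
```

Under this scheme and the default weights, a frame that scores 1 on energy but 0 on the other two features lands exactly on the default threshold of 0.5, which illustrates why energy dominates the decision.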

ResetState()

Resets the VAD state including adaptive thresholds.

public override void ResetState()
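To illustrate why a reset matters when a threshold adapts over time, here is a small sketch of one common adaptive scheme: tracking a background noise floor with an exponential moving average. It is an assumption made for illustration; this page does not describe how the class adapts its threshold. `AdaptiveNoiseFloorSketch`, `smoothing`, and `margin` are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet.
public sealed class AdaptiveNoiseFloorSketch
{
    private readonly double _smoothing;
    private double _noiseFloor;
    private bool _initialized;

    public AdaptiveNoiseFloorSketch(double smoothing = 0.95)
    {
        _smoothing = smoothing;
    }

    public double NoiseFloor => _noiseFloor;

    // Returns true when the frame energy exceeds margin * noise floor.
    // The first frame after construction or Reset() is assumed to be background
    // and only seeds the floor. The floor is updated on non-speech frames only,
    // so speech does not raise it.
    public bool Update(double energy, double margin = 3.0)
    {
        if (!_initialized)
        {
            _noiseFloor = energy;
            _initialized = true;
            return false;
        }

        bool speech = energy > margin * _noiseFloor;
        if (!speech)
        {
            _noiseFloor = _smoothing * _noiseFloor + (1.0 - _smoothing) * energy;
        }
        return speech;
    }

    // Discards the learned floor so the next frame seeds it again.
    public void Reset()
    {
        _noiseFloor = 0.0;
        _initialized = false;
    }
}
```

A floor learned in a noisy room would make the detector too insensitive on a quiet recording, so clearing adapted state between unrelated audio streams is the safe default.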
