Interface IVoiceActivityDetector<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for Voice Activity Detection (VAD) models.

public interface IVoiceActivityDetector<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Voice Activity Detection determines when speech is present in an audio signal. This is a fundamental building block for many speech processing systems.

For Beginners: VAD answers the question "Is someone speaking right now?"

Why VAD is important:

  • Speech Recognition: Only process audio when speech is present (saves compute)
  • Voice Assistants: Detect when user starts/stops talking
  • VoIP/Video Calls: Only transmit audio when speaking (saves bandwidth)
  • Transcription: Find speech segments in long recordings
  • Speaker Diarization: First step to identify who spoke when

How it works:

  1. Traditional: Look at energy levels, zero-crossing rate, spectral features
  2. Modern (Neural): Train a model to classify frames as speech/non-speech

Key metrics:

  • Accuracy: How often it's correct
  • False Positive Rate: Saying "speech" when it's noise (annoying in voice assistants)
  • False Negative Rate: Missing actual speech (drops words in transcription)
  • Latency: How quickly it detects speech onset
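The flow above can be sketched with this interface. Note that `EnergyVad` and `LoadAudioFrame` below are hypothetical placeholders (an implementation and an audio loader are not part of this interface); any concrete `IVoiceActivityDetector<T>` would slot in the same way:

```csharp
// Sketch only: EnergyVad and LoadAudioFrame are hypothetical stand-ins,
// not types shipped by AiDotNet.
IVoiceActivityDetector<float> vad = new EnergyVad();

// A single frame of audio with shape [samples], sized to vad.FrameSize.
Tensor<float> frame = LoadAudioFrame("mic_frame.wav");

// "Is someone speaking right now?"
if (vad.DetectSpeech(frame))
{
    Console.WriteLine("Speech detected in this frame.");
}
```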

Properties

FrameSize

Gets the frame size in samples used for detection.

int FrameSize { get; }

Property Value

int

MinSilenceDurationMs

Gets or sets the minimum silence duration in milliseconds.

int MinSilenceDurationMs { get; set; }

Property Value

int

Remarks

Silence gaps shorter than this don't split speech segments.

MinSpeechDurationMs

Gets or sets the minimum speech duration in milliseconds.

int MinSpeechDurationMs { get; set; }

Property Value

int

Remarks

Speech segments shorter than this are ignored (reduces false triggers).

SampleRate

Gets the sample rate this VAD operates at.

int SampleRate { get; }

Property Value

int

Threshold

Gets or sets the detection threshold (0.0 to 1.0).

double Threshold { get; set; }

Property Value

double

Remarks

Higher threshold = fewer false positives but may miss quiet speech. Lower threshold = catches more speech but may trigger on noise. Default is typically 0.5.
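A hedged sketch of that trade-off in practice (assuming `vad` is an existing `IVoiceActivityDetector<float>` instance; the specific values are illustrative, not recommendations):

```csharp
// Voice assistant in a noisy room: prefer fewer false wake-ups,
// at the cost of possibly missing quiet speech.
vad.Threshold = 0.7;

// Transcribing a quiet interview: prefer catching every word,
// accepting occasional triggers on background noise.
vad.Threshold = 0.3;
```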

Methods

DetectSpeech(Tensor<T>)

Detects whether speech is present in an audio frame.

bool DetectSpeech(Tensor<T> audioFrame)

Parameters

audioFrame Tensor<T>

Audio frame with shape [samples] or [channels, samples].

Returns

bool

True if speech is detected, false otherwise.

DetectSpeechSegments(Tensor<T>)

Detects speech segments in a longer audio recording.

IReadOnlyList<(int StartSample, int EndSample)> DetectSpeechSegments(Tensor<T> audio)

Parameters

audio Tensor<T>

Full audio recording.

Returns

IReadOnlyList<(int StartSample, int EndSample)>

List of (startSample, endSample) tuples for each speech segment.

Remarks

For Beginners: This finds all the parts where someone is talking.

Example for a 10-second recording (times shown in seconds for readability; the returned values are sample indices): [(0.5s, 2.3s), (4.1s, 6.8s), (8.0s, 9.5s)], meaning speech from 0.5-2.3s, silence, speech from 4.1-6.8s, and so on.
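Since the segments are returned as sample indices, a common follow-up is converting them to seconds using SampleRate. A minimal sketch, assuming `vad` and a loaded `Tensor<float> audio` already exist:

```csharp
// Convert each (startSample, endSample) pair to seconds for display.
foreach ((int start, int end) in vad.DetectSpeechSegments(audio))
{
    double startSec = (double)start / vad.SampleRate;
    double endSec = (double)end / vad.SampleRate;
    Console.WriteLine($"Speech: {startSec:F2}s - {endSec:F2}s");
}
```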

GetFrameProbabilities(Tensor<T>)

Gets frame-by-frame speech probabilities for the entire audio.

T[] GetFrameProbabilities(Tensor<T> audio)

Parameters

audio Tensor<T>

Full audio recording.

Returns

T[]

Array of speech probabilities, one per frame.
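Frame probabilities are useful when you want to apply your own decision logic instead of the built-in thresholding. A sketch, assuming `vad` is an `IVoiceActivityDetector<float>` and `audio` is a loaded recording:

```csharp
// One probability per frame; frame i starts at sample i * FrameSize.
float[] probs = vad.GetFrameProbabilities(audio);

for (int i = 0; i < probs.Length; i++)
{
    bool isSpeech = probs[i] >= vad.Threshold;
    double timeSec = (double)i * vad.FrameSize / vad.SampleRate;
    Console.WriteLine($"{timeSec:F2}s: p={probs[i]:F2} speech={isSpeech}");
}
```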

GetSpeechProbability(Tensor<T>)

Gets the speech probability for an audio frame.

T GetSpeechProbability(Tensor<T> audioFrame)

Parameters

audioFrame Tensor<T>

Audio frame to analyze.

Returns

T

Probability of speech (0.0 = definitely not speech, 1.0 = definitely speech).

ProcessChunk(Tensor<T>)

Processes audio in streaming mode, maintaining state between calls.

(bool IsSpeech, T Probability) ProcessChunk(Tensor<T> audioChunk)

Parameters

audioChunk Tensor<T>

A chunk of audio for real-time processing.

Returns

(bool IsSpeech, T Probability)

Speech detection result with probability.
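A typical streaming loop, sketched under the assumption that `vad` already exists and `MicrophoneChunks()` is a hypothetical source of audio chunks (not part of this interface). ResetState() clears the detector's internal state before a new stream:

```csharp
// Start a fresh stream so no state leaks from a previous session.
vad.ResetState();

foreach (Tensor<float> chunk in MicrophoneChunks())  // hypothetical source
{
    (bool isSpeech, float probability) = vad.ProcessChunk(chunk);
    if (isSpeech)
    {
        Console.WriteLine($"Speaking (p={probability:F2})");
    }
}
```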

ResetState()

Resets internal state for streaming mode.

void ResetState()