Interface ISpeakerDiarizer<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for speaker diarization models that segment audio by speaker ("who spoke when").

public interface ISpeakerDiarizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.


Remarks

Speaker diarization partitions an audio stream into segments based on speaker identity. It answers the question "Who spoke when?" without necessarily knowing who the speakers are (unlike speaker identification, which requires enrolled speakers).

For Beginners: Diarization is like labeling a transcript with "Speaker A said...", "Speaker B said...", without knowing the speakers' names.

How it works:

  1. Audio is segmented into small chunks
  2. Speaker embeddings are extracted for each chunk
  3. Clustering groups similar embeddings together
  4. Each cluster represents a unique speaker
  5. Output: Timeline showing when each speaker talks
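
The five steps above can be sketched end-to-end. This is a conceptual illustration in Python, not AiDotNet code: the "embeddings" are hand-made 2-D points, and the clustering is a simple greedy nearest-centroid threshold rather than the neural embeddings and spectral/agglomerative clustering real systems use.

```python
# Toy diarization pipeline: chunk -> embed -> cluster -> timeline.
# Conceptual sketch only; thresholds and chunk lengths are made up.

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy clustering: start a new speaker when no centroid is close."""
    centroids, labels = [], []
    for emb in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(emb, c)) ** 0.5
                 for c in centroids]
        if dists and min(dists) < threshold:
            labels.append(dists.index(min(dists)))
        else:
            centroids.append(emb)           # first chunk of a new speaker
            labels.append(len(centroids) - 1)
    return labels

def to_timeline(labels, chunk_seconds=1.0):
    """Merge consecutive chunks with the same label into speaker turns."""
    timeline = []
    for i, lab in enumerate(labels):
        start = i * chunk_seconds
        if timeline and timeline[-1][2] == f"Speaker_{lab}":
            timeline[-1] = (timeline[-1][0], start + chunk_seconds, timeline[-1][2])
        else:
            timeline.append((start, start + chunk_seconds, f"Speaker_{lab}"))
    return timeline

# Hand-made "embeddings": chunks 0-1 and 4 sound alike, 2-3 sound alike.
embs = [(0.1, 0.1), (0.12, 0.09), (0.9, 0.95), (0.88, 0.9), (0.11, 0.12)]
print(to_timeline(cluster_embeddings(embs)))
# [(0.0, 2.0, 'Speaker_0'), (2.0, 4.0, 'Speaker_1'), (4.0, 5.0, 'Speaker_0')]
```

Note how the timeline (step 5) falls out of merging consecutive chunks that share a cluster label.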

Common use cases:

  • Meeting transcription (separating participants)
  • Podcast/interview processing
  • Call center analytics
  • Medical dictation

Challenges:

  • Overlapping speech (multiple people talking at once)
  • Short turns (quick back-and-forth conversation)
  • Similar voices (e.g., siblings)
  • Background noise and music

This interface extends IFullModel<T, Tensor<T>, Tensor<T>> for Tensor-based audio processing.

Properties

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

MinSegmentDuration

Gets the minimum segment duration in seconds.

double MinSegmentDuration { get; }

Property Value

double

Remarks

Segments shorter than this may not contain enough speech for reliable speaker assignment.

SampleRate

Gets the expected sample rate for input audio.

int SampleRate { get; }

Property Value

int

SupportsOverlapDetection

Gets whether this model can detect overlapping speech.

bool SupportsOverlapDetection { get; }

Property Value

bool

Remarks

For Beginners: Overlapping speech is when two or more people talk at the same time. Not all diarization systems can handle this.

Methods

Diarize(Tensor<T>, int?, int, int)

Performs speaker diarization on audio.

DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

Returns

DiarizationResult<T>

Diarization result with speaker segments.

Remarks

For Beginners: This is the main method for finding out who spoke when.

  • Pass in audio of a conversation
  • Get back a timeline of speaker turns
  • Speakers are labeled as "Speaker_0", "Speaker_1", etc.
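
When numSpeakers is null, the speaker count must be inferred within [minSpeakers, maxSpeakers]. One common family of approaches is agglomerative: merge the closest clusters until the remaining speakers are sufficiently distinct. A minimal Python sketch (conceptual only, not AiDotNet's algorithm; the merge threshold is a made-up tuning value):

```python
# Conceptual sketch of speaker-count auto-detection: agglomerative merging
# of chunk embeddings, bounded by minimum/maximum speaker counts.

def auto_detect_speakers(embeddings, min_speakers=1, max_speakers=10,
                         merge_threshold=0.5):
    clusters = [[e] for e in embeddings]   # each chunk starts as its own cluster

    def centroid(c):
        return tuple(sum(v) / len(c) for v in zip(*c))

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    while len(clusters) > min_speakers:
        # Find the closest pair of cluster centroids.
        pairs = [(dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        # Stop merging once all remaining speakers are distinct enough,
        # unless we still exceed the allowed maximum.
        if d > merge_threshold and len(clusters) <= max_speakers:
            break
        clusters[i] += clusters.pop(j)
    return len(clusters)

embs = [(0.1, 0.1), (0.12, 0.09), (0.9, 0.95), (0.88, 0.9)]
print(auto_detect_speakers(embs))  # 2
```

The minSpeakers/maxSpeakers bounds act exactly as in the sketch: they clamp how far the merging may go in either direction.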

DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)

Performs speaker diarization asynchronously.

Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<DiarizationResult<T>>

Diarization result with speaker segments.

DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)

Performs diarization with known speaker profiles.

DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

knownSpeakers IReadOnlyList<SpeakerProfile<T>>

Known speaker profiles to match against.

allowUnknownSpeakers bool

Whether to create new labels for unknown speakers.

Returns

DiarizationResult<T>

Diarization result with identified speaker segments.

Remarks

For Beginners: If you know who might be speaking, you can provide their voice profiles and the system will label segments with actual names instead of generic "Speaker_0" labels.
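The matching idea can be sketched for a single segment: compare its embedding against each known profile and accept the best match only when the similarity clears a threshold. A conceptual Python illustration (not AiDotNet code; the dict-of-vectors profile shape, the cosine threshold, and next_id are all hypothetical):

```python
# Conceptual sketch: label a segment embedding with a known speaker's name
# when similar enough, else fall back to a generic "Speaker_N" label.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def label_segment(embedding, known_profiles, allow_unknown=True,
                  threshold=0.8, next_id=0):
    """known_profiles: dict name -> reference embedding (hypothetical shape)."""
    best = max(known_profiles, key=lambda n: cosine(embedding, known_profiles[n]))
    if cosine(embedding, known_profiles[best]) >= threshold:
        return best                              # confident match: real name
    # No profile is close enough: new generic label (or forced best match).
    return f"Speaker_{next_id}" if allow_unknown else best

profiles = {"Alice": (1.0, 0.0), "Bob": (0.0, 1.0)}
print(label_segment((0.95, 0.1), profiles))   # Alice
print(label_segment((0.7, 0.7), profiles))    # Speaker_0 (too ambiguous)
```

This also shows what allowUnknownSpeakers controls: with it disabled, every segment is forced onto the nearest known profile.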

ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)

Extracts speaker embeddings for each detected speaker.

IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

diarizationResult DiarizationResult<T>

Previous diarization result.

Returns

IReadOnlyDictionary<string, Tensor<T>>

Dictionary mapping speaker labels to their embeddings.
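
One simple way to derive a per-speaker embedding from a diarization result is to average the embeddings of all chunks assigned to that speaker. A conceptual Python sketch (not AiDotNet's implementation):

```python
# Conceptual sketch: one embedding per speaker, computed as the mean of the
# chunk embeddings carrying that speaker's label.

def speaker_embeddings(chunk_embeddings, chunk_labels):
    by_speaker = {}
    for emb, label in zip(chunk_embeddings, chunk_labels):
        by_speaker.setdefault(label, []).append(emb)
    # Average component-wise across each speaker's chunks.
    return {label: tuple(sum(v) / len(chunks) for v in zip(*chunks))
            for label, chunks in by_speaker.items()}

embs = [(0.1, 0.1), (0.3, 0.1), (0.9, 0.9)]
labels = ["Speaker_0", "Speaker_0", "Speaker_1"]
print(speaker_embeddings(embs, labels))
# {'Speaker_0': (0.2, 0.1), 'Speaker_1': (0.9, 0.9)}
```

Such per-speaker embeddings can then serve as profiles for DiarizeWithKnownSpeakers on later recordings.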

RefineDiarization(Tensor<T>, DiarizationResult<T>, T)

Refines a diarization result by re-segmenting with a different merge threshold.

DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

previousResult DiarizationResult<T>

Previous diarization result to refine.

mergeThreshold T

Threshold for merging similar speakers.

Returns

DiarizationResult<T>

Refined diarization result.
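
The role of mergeThreshold can be sketched concretely: speakers whose embeddings are closer than the threshold are folded into one label, which repairs over-segmentation (the same person split into two "speakers"). A conceptual Python illustration (not AiDotNet code; the segment tuples and distance metric are hypothetical):

```python
# Conceptual sketch: merge speakers whose embeddings are closer than
# merge_threshold, then relabel the segments accordingly.

def refine(segments, embeddings, merge_threshold):
    """segments: list of (start, end, label); embeddings: label -> vector."""
    labels = sorted(embeddings)
    remap = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            d = sum((x - y) ** 2
                    for x, y in zip(embeddings[a], embeddings[b])) ** 0.5
            if d < merge_threshold:
                remap[b] = remap.get(a, a)   # fold speaker b into speaker a
    return [(s, e, remap.get(lab, lab)) for s, e, lab in segments]

segments = [(0.0, 2.0, "Speaker_0"), (2.0, 3.0, "Speaker_1"), (3.0, 4.0, "Speaker_2")]
embeddings = {"Speaker_0": (0.1, 0.1), "Speaker_1": (0.12, 0.1), "Speaker_2": (0.9, 0.9)}
print(refine(segments, embeddings, merge_threshold=0.2))
# [(0.0, 2.0, 'Speaker_0'), (2.0, 3.0, 'Speaker_0'), (3.0, 4.0, 'Speaker_2')]
```

A larger threshold merges more aggressively (fewer, longer-lived speakers); a smaller one keeps similar voices separate.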