Interface ISpeakerDiarizer<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for speaker diarization models that segment audio by speaker ("who spoke when").

public interface ISpeakerDiarizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.


Remarks

Speaker diarization partitions an audio stream into segments based on speaker identity. It answers the question "Who spoke when?" without necessarily knowing who the speakers are (unlike speaker identification, which requires enrolled speakers).

For Beginners: Diarization is like labeling a transcript with "Speaker A said...", "Speaker B said...", without knowing the speakers' names.

How it works:

  1. Audio is segmented into small chunks
  2. Speaker embeddings are extracted for each chunk
  3. Clustering groups similar embeddings together
  4. Each cluster represents a unique speaker
  5. Output: Timeline showing when each speaker talks
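
The five steps above can be sketched end-to-end. This is a conceptual illustration in Python, not AiDotNet code: the "embeddings" are hand-made 2-D points, and the clustering is a simple greedy nearest-centroid threshold rather than the neural embeddings and spectral/agglomerative clustering real systems use.

```python
# Toy diarization pipeline: chunk -> embed -> cluster -> timeline.
# Conceptual sketch only; thresholds and chunk lengths are made up.

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy clustering: start a new speaker when no centroid is close."""
    centroids, labels = [], []
    for emb in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(emb, c)) ** 0.5
                 for c in centroids]
        if dists and min(dists) < threshold:
            labels.append(dists.index(min(dists)))
        else:
            centroids.append(emb)           # first chunk of a new speaker
            labels.append(len(centroids) - 1)
    return labels

def to_timeline(labels, chunk_seconds=1.0):
    """Merge consecutive chunks with the same label into speaker turns."""
    timeline = []
    for i, lab in enumerate(labels):
        start = i * chunk_seconds
        if timeline and timeline[-1][2] == f"Speaker_{lab}":
            timeline[-1] = (timeline[-1][0], start + chunk_seconds, timeline[-1][2])
        else:
            timeline.append((start, start + chunk_seconds, f"Speaker_{lab}"))
    return timeline

# Hand-made "embeddings": chunks 0-1 and 4 sound alike, 2-3 sound alike.
embs = [(0.1, 0.1), (0.12, 0.09), (0.9, 0.95), (0.88, 0.9), (0.11, 0.12)]
print(to_timeline(cluster_embeddings(embs)))
# [(0.0, 2.0, 'Speaker_0'), (2.0, 4.0, 'Speaker_1'), (4.0, 5.0, 'Speaker_0')]
```

Note how the timeline (step 5) falls out of merging consecutive chunks that share a cluster label.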

Common use cases:

  • Meeting transcription (separating participants)
  • Podcast/interview processing
  • Call center analytics
  • Medical dictation

Challenges:

  • Overlapping speech (multiple people talking at once)
  • Short turns (quick back-and-forth conversation)
  • Similar voices (e.g., siblings)
  • Background noise and music

This interface extends IFullModel<T, Tensor<T>, Tensor<T>> for Tensor-based audio processing.

Properties

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

MinSegmentDuration

Gets the minimum segment duration in seconds.

double MinSegmentDuration { get; }

Property Value

double

Remarks

Segments shorter than this may not contain enough speech for reliable speaker assignment.

SampleRate

Gets the expected sample rate for input audio.

int SampleRate { get; }

Property Value

int

SupportsOverlapDetection

Gets whether this model can detect overlapping speech.

bool SupportsOverlapDetection { get; }

Property Value

bool

Remarks

For Beginners: Overlapping speech is when two or more people talk at the same time. Not all diarization systems can handle this.

Methods

Diarize(Tensor<T>, int?, int, int)

Performs speaker diarization on audio.

DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

Returns

DiarizationResult<T>

Diarization result with speaker segments.

Remarks

For Beginners: This is the main method for finding out who spoke when.

  • Pass in audio of a conversation
  • Get back a timeline of speaker turns
  • Speakers are labeled as "Speaker_0", "Speaker_1", etc.
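
When numSpeakers is null, the speaker count must be inferred within [minSpeakers, maxSpeakers]. One common family of approaches is agglomerative: merge the closest clusters until the remaining speakers are sufficiently distinct. A minimal Python sketch (conceptual only, not AiDotNet's algorithm; the merge threshold is a made-up tuning value):

```python
# Conceptual sketch of speaker-count auto-detection: agglomerative merging
# of chunk embeddings, bounded by minimum/maximum speaker counts.

def auto_detect_speakers(embeddings, min_speakers=1, max_speakers=10,
                         merge_threshold=0.5):
    clusters = [[e] for e in embeddings]   # each chunk starts as its own cluster

    def centroid(c):
        return tuple(sum(v) / len(c) for v in zip(*c))

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    while len(clusters) > min_speakers:
        # Find the closest pair of cluster centroids.
        pairs = [(dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        # Stop merging once all remaining speakers are distinct enough,
        # unless we still exceed the allowed maximum.
        if d > merge_threshold and len(clusters) <= max_speakers:
            break
        clusters[i] += clusters.pop(j)
    return len(clusters)

embs = [(0.1, 0.1), (0.12, 0.09), (0.9, 0.95), (0.88, 0.9)]
print(auto_detect_speakers(embs))  # 2
```

The minSpeakers/maxSpeakers bounds act exactly as in the sketch: they clamp how far the merging may go in either direction.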

DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)

Performs speaker diarization asynchronously.

Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<DiarizationResult<T>>

Diarization result with speaker segments.

DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)

Performs diarization with known speaker profiles.

DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

knownSpeakers IReadOnlyList<SpeakerProfile<T>>

Known speaker profiles to match against.

allowUnknownSpeakers bool

Whether to create new labels for unknown speakers.

Returns

DiarizationResult<T>

Diarization result with identified speaker segments.

Remarks

For Beginners: If you know who might be speaking, you can provide their voice profiles and the system will label segments with actual names instead of generic "Speaker_0" labels.
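The matching idea can be sketched for a single segment: compare its embedding against each known profile and accept the best match only when the similarity clears a threshold. A conceptual Python illustration (not AiDotNet code; the dict-of-vectors profile shape, the cosine threshold, and next_id are all hypothetical):

```python
# Conceptual sketch: label a segment embedding with a known speaker's name
# when similar enough, else fall back to a generic "Speaker_N" label.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def label_segment(embedding, known_profiles, allow_unknown=True,
                  threshold=0.8, next_id=0):
    """known_profiles: dict name -> reference embedding (hypothetical shape)."""
    best = max(known_profiles, key=lambda n: cosine(embedding, known_profiles[n]))
    if cosine(embedding, known_profiles[best]) >= threshold:
        return best                              # confident match: real name
    # No profile is close enough: new generic label (or forced best match).
    return f"Speaker_{next_id}" if allow_unknown else best

profiles = {"Alice": (1.0, 0.0), "Bob": (0.0, 1.0)}
print(label_segment((0.95, 0.1), profiles))   # Alice
print(label_segment((0.7, 0.7), profiles))    # Speaker_0 (too ambiguous)
```

This also shows what allowUnknownSpeakers controls: with it disabled, every segment is forced onto the nearest known profile.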

ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)

Extracts speaker embeddings for each detected speaker.

IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

diarizationResult DiarizationResult<T>

Previous diarization result.

Returns

IReadOnlyDictionary<string, Tensor<T>>

Dictionary mapping speaker labels to their embeddings.
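
One simple way to derive a per-speaker embedding from a diarization result is to average the embeddings of all chunks assigned to that speaker. A conceptual Python sketch (not AiDotNet's implementation):

```python
# Conceptual sketch: one embedding per speaker, computed as the mean of the
# chunk embeddings carrying that speaker's label.

def speaker_embeddings(chunk_embeddings, chunk_labels):
    by_speaker = {}
    for emb, label in zip(chunk_embeddings, chunk_labels):
        by_speaker.setdefault(label, []).append(emb)
    # Average component-wise across each speaker's chunks.
    return {label: tuple(sum(v) / len(chunks) for v in zip(*chunks))
            for label, chunks in by_speaker.items()}

embs = [(0.1, 0.1), (0.3, 0.1), (0.9, 0.9)]
labels = ["Speaker_0", "Speaker_0", "Speaker_1"]
print(speaker_embeddings(embs, labels))
# {'Speaker_0': (0.2, 0.1), 'Speaker_1': (0.9, 0.9)}
```

Such per-speaker embeddings can then serve as profiles for DiarizeWithKnownSpeakers on later recordings.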

RefineDiarization(Tensor<T>, DiarizationResult<T>, T)

Refines a diarization result by re-segmenting with a different merge threshold.

DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)

Parameters

audio Tensor<T>

Audio waveform tensor [samples].

previousResult DiarizationResult<T>

Previous diarization result to refine.

mergeThreshold T

Threshold for merging similar speakers.

Returns

DiarizationResult<T>

Refined diarization result.
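
The role of mergeThreshold can be sketched concretely: speakers whose embeddings are closer than the threshold are folded into one label, which repairs over-segmentation (the same person split into two "speakers"). A conceptual Python illustration (not AiDotNet code; the segment tuples and distance metric are hypothetical):

```python
# Conceptual sketch: merge speakers whose embeddings are closer than
# merge_threshold, then relabel the segments accordingly.

def refine(segments, embeddings, merge_threshold):
    """segments: list of (start, end, label); embeddings: label -> vector."""
    labels = sorted(embeddings)
    remap = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            d = sum((x - y) ** 2
                    for x, y in zip(embeddings[a], embeddings[b])) ** 0.5
            if d < merge_threshold:
                remap[b] = remap.get(a, a)   # fold speaker b into speaker a
    return [(s, e, remap.get(lab, lab)) for s, e, lab in segments]

segments = [(0.0, 2.0, "Speaker_0"), (2.0, 3.0, "Speaker_1"), (3.0, 4.0, "Speaker_2")]
embeddings = {"Speaker_0": (0.1, 0.1), "Speaker_1": (0.12, 0.1), "Speaker_2": (0.9, 0.9)}
print(refine(segments, embeddings, merge_threshold=0.2))
# [(0.0, 2.0, 'Speaker_0'), (2.0, 3.0, 'Speaker_0'), (3.0, 4.0, 'Speaker_2')]
```

A larger threshold merges more aggressively (fewer, longer-lived speakers); a smaller one keeps similar voices separate.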