Interface ISpeakerDiarizer<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for speaker diarization models that segment audio by speaker ("who spoke when").
public interface ISpeakerDiarizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Speaker diarization partitions an audio stream into segments based on speaker identity. It answers the question "Who spoke when?" without necessarily knowing who the speakers are (unlike speaker identification, which requires enrolled speakers).
For Beginners: Diarization is like labeling a transcript with "Speaker A said...", "Speaker B said...", and so on, without knowing the speakers' names.
How it works:
- Audio is segmented into small chunks
- Speaker embeddings are extracted for each chunk
- Clustering groups similar embeddings together
- Each cluster represents a unique speaker
- Output: Timeline showing when each speaker talks
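The pipeline above can be sketched with a toy clustering pass over per-chunk embeddings. This is illustrative only: the plain arrays, the cosine-similarity threshold, and the greedy centroid assignment stand in for the model's learned embeddings and real clustering algorithm.

```csharp
using System;
using System.Collections.Generic;

static class DiarizationSketch
{
    static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Assign each chunk's embedding to the most similar existing speaker,
    // or open a new speaker cluster when nothing is similar enough.
    static int[] ClusterEmbeddings(double[][] embeddings, double threshold = 0.8)
    {
        var labels = new int[embeddings.Length];
        var centroids = new List<double[]>();
        for (int i = 0; i < embeddings.Length; i++)
        {
            int best = -1;
            double bestSim = threshold;
            for (int c = 0; c < centroids.Count; c++)
            {
                double sim = CosineSimilarity(embeddings[i], centroids[c]);
                if (sim > bestSim) { bestSim = sim; best = c; }
            }
            if (best < 0) { centroids.Add(embeddings[i]); best = centroids.Count - 1; }
            labels[i] = best; // chunk i is attributed to "Speaker_{best}"
        }
        return labels;
    }
}
```

Cluster indices map directly to the generic "Speaker_0", "Speaker_1" labels in the timeline output.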
Common use cases:
- Meeting transcription (separating participants)
- Podcast/interview processing
- Call center analytics
- Medical dictation
Challenges:
- Overlapping speech (multiple people talking at once)
- Short turns (quick back-and-forth conversation)
- Similar voices (e.g., siblings)
- Background noise and music
This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
Properties
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
MinSegmentDuration
Gets the minimum segment duration in seconds.
double MinSegmentDuration { get; }
Property Value
- double
Remarks
Segments shorter than this may not contain enough speech for reliable speaker assignment.
SampleRate
Gets the expected sample rate for input audio.
int SampleRate { get; }
Property Value
- int
SupportsOverlapDetection
Gets whether this model can detect overlapping speech.
bool SupportsOverlapDetection { get; }
Property Value
- bool
Remarks
For Beginners: Overlapping speech is when two or more people talk at the same time. Not all diarization systems can handle this.
Methods
Diarize(Tensor<T>, int?, int, int)
Performs speaker diarization on audio.
DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples].
numSpeakers (int?): Expected number of speakers. Auto-detected if null.
minSpeakers (int): Minimum number of speakers (for auto-detection).
maxSpeakers (int): Maximum number of speakers (for auto-detection).
Returns
- DiarizationResult<T>
Diarization result with speaker segments.
Remarks
For Beginners: This is the main method for finding who spoke when.
- Pass in audio of a conversation
- Get back a timeline of speaker turns
- Speakers are labeled as "Speaker_0", "Speaker_1", etc.
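A minimal usage sketch follows. LoadMonoAudio is an assumed helper, not part of AiDotNet; any concrete ISpeakerDiarizer<float> implementation will do.

```csharp
void DiarizeMeeting(ISpeakerDiarizer<float> diarizer)
{
    // Resample/load audio at the rate the model expects (hypothetical helper).
    Tensor<float> audio = LoadMonoAudio("meeting.wav", diarizer.SampleRate);

    // Auto-detect the speaker count, bounded to between 2 and 6 speakers.
    DiarizationResult<float> result =
        diarizer.Diarize(audio, numSpeakers: null, minSpeakers: 2, maxSpeakers: 6);
}
```

Passing an explicit numSpeakers skips auto-detection entirely, which is usually more reliable when the participant count is known in advance.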
DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)
Performs speaker diarization asynchronously.
Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples].
numSpeakers (int?): Expected number of speakers. Auto-detected if null.
minSpeakers (int): Minimum number of speakers (for auto-detection).
maxSpeakers (int): Maximum number of speakers (for auto-detection).
cancellationToken (CancellationToken): Cancellation token for async operation.
Returns
- Task<DiarizationResult<T>>
Diarization result with speaker segments.
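For long recordings, the async overload lets callers bound the work with a timeout. A sketch, assuming only standard CancellationTokenSource behavior:

```csharp
async Task DiarizeWithTimeoutAsync(ISpeakerDiarizer<float> diarizer, Tensor<float> audio)
{
    // Request cancellation automatically if diarization exceeds five minutes.
    using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

    DiarizationResult<float> result = await diarizer.DiarizeAsync(
        audio, numSpeakers: null, minSpeakers: 1, maxSpeakers: 10,
        cancellationToken: cts.Token);
}
```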
DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)
Performs diarization with known speaker profiles.
DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples].
knownSpeakers (IReadOnlyList<SpeakerProfile<T>>): Known speaker profiles to match against.
allowUnknownSpeakers (bool): Whether to create new labels for unknown speakers.
Returns
- DiarizationResult<T>
Diarization result with identified speaker segments.
Remarks
For Beginners: If you know who might be speaking, you can provide their voice profiles and the system will label segments with actual names instead of generic "Speaker_0" labels.
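A sketch of supplying enrolled profiles. How SpeakerProfile<T> instances are created is not defined by this interface, so the alice and bob profiles here are assumed inputs:

```csharp
DiarizationResult<float> LabelWithNames(
    ISpeakerDiarizer<float> diarizer,
    Tensor<float> audio,
    SpeakerProfile<float> alice,
    SpeakerProfile<float> bob)
{
    IReadOnlyList<SpeakerProfile<float>> known = new[] { alice, bob };

    // Segments matching a profile get that profile's label; any unmatched
    // voice still receives a generic "Speaker_N" label because
    // allowUnknownSpeakers is true.
    return diarizer.DiarizeWithKnownSpeakers(audio, known, allowUnknownSpeakers: true);
}
```

Setting allowUnknownSpeakers to false would instead force every segment onto the closest known profile, which is appropriate only when the participant list is closed.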
ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)
Gets speaker embeddings for each detected speaker.
IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)
Parameters
audioTensor<T>Audio waveform tensor [samples].
diarizationResultDiarizationResult<T>Previous diarization result.
Returns
- IReadOnlyDictionary<string, Tensor<T>>
Dictionary mapping speaker labels to their embeddings.
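A sketch of reusing a prior diarization to pull one embedding per detected speaker, for example to match speakers across recordings:

```csharp
void CollectVoiceprints(ISpeakerDiarizer<float> diarizer, Tensor<float> audio)
{
    DiarizationResult<float> result = diarizer.Diarize(audio);

    IReadOnlyDictionary<string, Tensor<float>> embeddings =
        diarizer.ExtractSpeakerEmbeddings(audio, result);

    foreach (var pair in embeddings)
    {
        // pair.Key is a label such as "Speaker_0"; pair.Value is that
        // speaker's embedding, usable e.g. for cross-recording matching.
    }
}
```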
RefineDiarization(Tensor<T>, DiarizationResult<T>, T)
Refines diarization result by re-segmenting with different parameters.
DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples].
previousResult (DiarizationResult<T>): Previous diarization result to refine.
mergeThreshold (T): Threshold for merging similar speakers.
Returns
- DiarizationResult<T>
Refined diarization result.
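A sketch of the typical refinement loop: if a first pass over-splits one voice into two labels, re-run with a merge threshold that combines similar speaker clusters. The value 0.75f is illustrative, not a recommended default:

```csharp
DiarizationResult<float> MergeOverSplitSpeakers(
    ISpeakerDiarizer<float> diarizer, Tensor<float> audio)
{
    DiarizationResult<float> first = diarizer.Diarize(audio);

    // Refine without re-running the full pipeline from scratch.
    return diarizer.RefineDiarization(audio, first, mergeThreshold: 0.75f);
}
```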