Class SpeakerDiarizer<T>

Namespace
AiDotNet.Audio.Speaker
Assembly
AiDotNet.dll

Performs speaker diarization (who spoke when) on audio recordings.

public class SpeakerDiarizer<T> : SpeakerRecognitionBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeakerDiarizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
SpeakerDiarizer<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

Speaker diarization segments audio by speaker, answering "who spoke when?" It extracts speaker embeddings from sliding windows over the audio and clusters them to identify speaker turns.

This class supports both:

  • ONNX mode: Load pre-trained models for fast inference
  • Native training mode: Train from scratch using the layer architecture

For Beginners: Diarization is like automatically labeling a meeting recording with "Speaker A: 0:00-0:15, Speaker B: 0:15-0:45..."

The process:

  1. Split audio into short segments
  2. Extract speaker embeddings for each segment
  3. Cluster similar embeddings together
  4. Each cluster represents a different speaker
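
The steps above can be sketched in plain C# without the library. This is a minimal illustration, not the class's actual implementation: the names `DiarizationSketch`, `SlidingWindows`, and `ClusterEmbeddings` are hypothetical, the greedy centroid clustering is a stand-in for whatever clustering the diarizer really uses, and step 2 (embedding extraction) is assumed to have already produced one `double[]` per window.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; none of these names come from AiDotNet.
public static class DiarizationSketch
{
    // Step 1: split the audio into fixed-length windows that advance by hopSize samples.
    public static List<(int Start, int Length)> SlidingWindows(int totalSamples, int windowSize, int hopSize)
    {
        var windows = new List<(int Start, int Length)>();
        for (int start = 0; start + windowSize <= totalSamples; start += hopSize)
        {
            windows.Add((start, windowSize));
        }
        return windows;
    }

    // Cosine similarity between two embeddings; returns 0 for zero-length vectors.
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Steps 3-4: greedy clustering. Each embedding joins the most similar existing
    // cluster when the similarity reaches the threshold; otherwise it starts a new
    // cluster. The returned array holds one cluster (speaker) index per embedding.
    public static int[] ClusterEmbeddings(IReadOnlyList<double[]> embeddings, double threshold)
    {
        var centroids = new List<double[]>();
        var counts = new List<int>();
        var labels = new int[embeddings.Count];

        for (int i = 0; i < embeddings.Count; i++)
        {
            int best = -1;
            double bestSimilarity = threshold;
            for (int c = 0; c < centroids.Count; c++)
            {
                double similarity = CosineSimilarity(embeddings[i], centroids[c]);
                if (similarity >= bestSimilarity)
                {
                    best = c;
                    bestSimilarity = similarity;
                }
            }

            if (best < 0)
            {
                // No cluster is similar enough: treat this window as a new speaker.
                centroids.Add((double[])embeddings[i].Clone());
                counts.Add(1);
                best = centroids.Count - 1;
            }
            else
            {
                // Update the cluster centroid as a running mean of its members.
                counts[best]++;
                for (int d = 0; d < centroids[best].Length; d++)
                {
                    centroids[best][d] += (embeddings[i][d] - centroids[best][d]) / counts[best];
                }
            }
            labels[i] = best;
        }
        return labels;
    }
}
```

In this sketch the threshold plays the role that a clustering threshold typically plays: a higher value splits the audio into more speakers, a lower value merges more windows into the same speaker.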

Common applications:

  • Meeting transcription
  • Call center analytics
  • Podcast processing

Usage:

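// architecture, modelPath, audioTensor, features and labels are assumed to be defined by the caller.

// ONNX mode (recommended for inference)
var onnxDiarizer = new SpeakerDiarizer<float>(architecture, modelPath);
var result = onnxDiarizer.Diarize(audioTensor);

// Native training mode (separate variable name so both snippets compile in one scope)
var trainableDiarizer = new SpeakerDiarizer<float>(architecture);
trainableDiarizer.Train(features, labels);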

Constructors

SpeakerDiarizer(SpeakerDiarizerOptions?)

Creates a new speaker diarizer with legacy options only.

public SpeakerDiarizer(SpeakerDiarizerOptions? options = null)

Parameters

options SpeakerDiarizerOptions

Diarization options.

Remarks

Legacy API: Prefer the constructors that take a NeuralNetworkArchitecture<T> parameter. This constructor creates a default architecture for backward compatibility.

SpeakerDiarizer(NeuralNetworkArchitecture<T>, SpeakerDiarizerOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?)

Creates a new speaker diarizer in native training mode.

public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, SpeakerDiarizerOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null)

Parameters

architecture NeuralNetworkArchitecture<T>

Neural network architecture configuration.

options SpeakerDiarizerOptions

Diarization options.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional custom optimizer (defaults to AdamW).

Remarks

For Beginners: Use this constructor when you want to train a model from scratch or fine-tune an existing model. Training requires labeled diarization data.

SpeakerDiarizer(NeuralNetworkArchitecture<T>, string, SpeakerDiarizerOptions?)

Creates a new speaker diarizer in ONNX inference mode.

public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, string modelPath, SpeakerDiarizerOptions? options = null)

Parameters

architecture NeuralNetworkArchitecture<T>

Neural network architecture configuration.

modelPath string

Path to the ONNX model file.

options SpeakerDiarizerOptions

Diarization options.

Remarks

For Beginners: Use this constructor for production inference with pre-trained models. ONNX models are optimized for fast execution on various hardware.

Exceptions

ArgumentNullException

Thrown when modelPath is null.

FileNotFoundException

Thrown when the model file doesn't exist.

Properties

ClusteringThreshold

Gets the clustering threshold.

public double ClusteringThreshold { get; }

Property Value

double

IsOnnxMode

Gets whether the model is operating in ONNX inference mode.

public bool IsOnnxMode { get; }

Property Value

bool

MinSegmentDuration

Gets the minimum segment duration in seconds.

public double MinSegmentDuration { get; }

Property Value

double

MinTurnDuration

Gets the minimum turn duration in seconds.

public double MinTurnDuration { get; }

Property Value

double

Remarks

Legacy API: Use MinSegmentDuration instead.
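
The page does not say how the minimum duration is applied. One common approach, sketched below under that assumption, is to merge consecutive windows that share a speaker label into turns and then drop turns shorter than the minimum. `TurnMergingSketch` and `BuildTurns` are hypothetical names, and the per-window labels are assumed to come from an earlier clustering step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it illustrates one way a minimum duration could be enforced.
public static class TurnMergingSketch
{
    // Merges runs of identical window labels into turns (in seconds) and keeps
    // only the turns that last at least minTurnSeconds.
    public static List<(string Speaker, double Start, double End)> BuildTurns(
        IReadOnlyList<string> windowLabels,
        double hopSeconds,
        double windowSeconds,
        double minTurnSeconds)
    {
        var turns = new List<(string Speaker, double Start, double End)>();
        int i = 0;
        while (i < windowLabels.Count)
        {
            // Extend j to the last consecutive window with the same label as window i.
            int j = i;
            while (j + 1 < windowLabels.Count && windowLabels[j + 1] == windowLabels[i])
            {
                j++;
            }

            double start = i * hopSeconds;
            double end = j * hopSeconds + windowSeconds;
            if (end - start >= minTurnSeconds)
            {
                turns.Add((windowLabels[i], start, end));
            }
            i = j + 1;
        }
        return turns;
    }
}
```

Because the windows overlap whenever the hop is shorter than the window, neighbouring turns in this simplified sketch can overlap slightly at their boundaries; a real implementation would need a rule for resolving that.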

SampleRate

Gets the sample rate.

public int SampleRate { get; }

Property Value

int

SupportsOverlapDetection

Gets whether this model can detect overlapping speech.

public bool SupportsOverlapDetection { get; }

Property Value

bool

Remarks

For Beginners: Overlapping speech is when two or more people talk at the same time. This implementation currently does not support overlap detection.

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

New model instance.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Binary reader.

Diarize(Tensor<T>, int?, int, int)

Performs speaker diarization on audio.

public DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)

Parameters

audio Tensor<T>

Audio waveform tensor with shape [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

Returns

DiarizationResult<T>

Diarization result with speaker segments.
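
The interaction between numSpeakers, minSpeakers and maxSpeakers can be pictured with the small sketch below. It assumes that an explicit numSpeakers is used as given and that an automatically estimated count is clamped to the [minSpeakers, maxSpeakers] range; the page does not confirm either rule, and `SpeakerCountSketch`, `ResolveSpeakerCount` and `estimatedSpeakers` are hypothetical names.

```csharp
using System;

// Hypothetical helper; the real auto-detection logic is not documented here.
public static class SpeakerCountSketch
{
    public static int ResolveSpeakerCount(
        int? numSpeakers,
        int estimatedSpeakers,
        int minSpeakers = 1,
        int maxSpeakers = 10)
    {
        if (minSpeakers < 1 || maxSpeakers < minSpeakers)
        {
            throw new ArgumentException("Require 1 <= minSpeakers <= maxSpeakers.");
        }

        // Assumption: a caller-supplied count is trusted and used directly.
        if (numSpeakers.HasValue)
        {
            return numSpeakers.Value;
        }

        // Assumption: an estimated count is clamped to the allowed range.
        return Math.Max(minSpeakers, Math.Min(maxSpeakers, estimatedSpeakers));
    }
}
```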

DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)

Performs speaker diarization asynchronously.

public Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)

Parameters

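audio Tensor<T>

Audio waveform tensor with shape [samples].

numSpeakers int?

Expected number of speakers. Auto-detected if null.

minSpeakers int

Minimum number of speakers (for auto-detection).

maxSpeakers int

Maximum number of speakers (for auto-detection).

cancellationToken CancellationToken

Token used to cancel the operation.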

Returns

Task<DiarizationResult<T>>

DiarizeLegacy(Tensor<T>)

Performs diarization on audio (legacy API).

public DiarizationResult DiarizeLegacy(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio samples as a tensor.

Returns

DiarizationResult

Legacy diarization result.

Remarks

Legacy API: Prefer using Diarize(Tensor<T>, int?, int, int) instead.

DiarizeLegacy(Vector<T>)

Performs diarization on audio (legacy API).

public DiarizationResult DiarizeLegacy(Vector<T> audio)

Parameters

audio Vector<T>

Audio samples as a vector.

Returns

DiarizationResult

Legacy diarization result.

DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)

Performs diarization with known speaker profiles.

public DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)

Parameters

audio Tensor<T>
knownSpeakers IReadOnlyList<SpeakerProfile<T>>
allowUnknownSpeakers bool

Returns

DiarizationResult<T>
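
The parameters of this method are not described on this page, so the following is only a sketch of how matching against known profiles is commonly done: compare a segment embedding with each enrolled embedding by cosine similarity and pick the best match. `KnownSpeakerMatchSketch`, `AssignSpeaker`, the `"unknown"` label, the `threshold` parameter, and the interpretation of `allowUnknownSpeakers` are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; it does not use SpeakerProfile<T> or any AiDotNet type.
public static class KnownSpeakerMatchSketch
{
    public const string UnknownLabel = "unknown";

    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the name of the best-matching known speaker. When unknown speakers
    // are allowed and the best similarity is below the threshold, the segment is
    // labeled as unknown instead. With no profiles at all, the unknown label is
    // returned regardless of the flag.
    public static string AssignSpeaker(
        double[] segmentEmbedding,
        IReadOnlyDictionary<string, double[]> knownSpeakers,
        double threshold,
        bool allowUnknownSpeakers)
    {
        string bestName = UnknownLabel;
        double bestSimilarity = double.NegativeInfinity;

        foreach (var profile in knownSpeakers)
        {
            double similarity = CosineSimilarity(segmentEmbedding, profile.Value);
            if (similarity > bestSimilarity)
            {
                bestSimilarity = similarity;
                bestName = profile.Key;
            }
        }

        if (allowUnknownSpeakers && bestSimilarity < threshold)
        {
            return UnknownLabel;
        }
        return bestName;
    }
}
```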

Dispose()

Disposes resources.

public void Dispose()

Dispose(bool)

Disposes managed resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)

Gets speaker embeddings for each detected speaker.

public IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)

Parameters

audio Tensor<T>
diarizationResult DiarizationResult<T>

Returns

IReadOnlyDictionary<string, Tensor<T>>

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

Model metadata.

InitializeLayers()

Initializes the neural network layers.

protected override void InitializeLayers()

Remarks

This follows the golden standard pattern:

  1. If in ONNX mode, layers are not needed (inference uses ONNX runtime)
  2. If Architecture.Layers is provided, use those layers
  3. Otherwise, fall back to LayerHelper.CreateDefaultSpeakerEmbeddingLayers()

For Beginners: Layers are only initialized in native training mode. In ONNX mode, the model is already fully trained and ready for inference.
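
The fallback order described in these remarks can be expressed as a small decision function. The sketch below uses plain strings as stand-ins for layer objects and a delegate in place of LayerHelper.CreateDefaultSpeakerEmbeddingLayers(); `LayerSelectionSketch` and `SelectLayers` are hypothetical names, and treating an empty layer list as "not provided" is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; strings stand in for real layer instances.
public static class LayerSelectionSketch
{
    public static IReadOnlyList<string> SelectLayers(
        bool isOnnxMode,
        IReadOnlyList<string> architectureLayers,
        Func<IReadOnlyList<string>> createDefaultLayers)
    {
        // 1. ONNX mode: inference runs in the ONNX runtime, so no native layers are built.
        if (isOnnxMode)
        {
            return Array.Empty<string>();
        }

        // 2. Caller-provided layers take precedence over defaults.
        if (architectureLayers != null && architectureLayers.Count > 0)
        {
            return architectureLayers;
        }

        // 3. Otherwise fall back to the default layer stack.
        return createDefaultLayers();
    }
}
```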

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Model output tensor.

Returns

Tensor<T>

Postprocessed output.

Predict(Tensor<T>)

Predicts output for the given input.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor (audio features).

Returns

Tensor<T>

Output tensor (speaker probabilities per frame).

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Raw audio waveform.

Returns

Tensor<T>

Preprocessed audio features.

RefineDiarization(Tensor<T>, DiarizationResult<T>, T)

Refines a diarization result by re-segmenting with different parameters.

public DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)

Parameters

audio Tensor<T>
previousResult DiarizationResult<T>
mergeThreshold T

Returns

DiarizationResult<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Binary writer.

Train(Tensor<T>, Tensor<T>)

Trains the model on a single example.

public override void Train(Tensor<T> input, Tensor<T> expected)

Parameters

input Tensor<T>

Input features.

expected Tensor<T>

Expected output.

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Parameter vector.