Class SpeakerDiarizer<T>

Namespace: AiDotNet.Audio.Speaker

Assembly: AiDotNet.dll

Performs speaker diarization (who spoke when) on audio recordings.

public class SpeakerDiarizer<T> : SpeakerRecognitionBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeakerDiarizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T: The numeric type used for calculations.

Inheritance: object

NeuralNetworkBase<T>

AudioNeuralNetworkBase<T>

SpeakerRecognitionBase<T>

SpeakerDiarizer<T>

Implements: INeuralNetworkModel<T>

INeuralNetwork<T>

IInterpretableModel<T>

IInputGradientComputable<T>

IDisposable

ISpeakerDiarizer<T>

IFullModel<T, Tensor<T>, Tensor<T>>

IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>

IModelSerializer

ICheckpointableModel

IParameterizable<T, Tensor<T>, Tensor<T>>

IFeatureAware

IFeatureImportance<T>

ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>

IGradientComputable<T, Tensor<T>, Tensor<T>>

IJitCompilable<T>

Inherited Members: SpeakerRecognitionBase<T>.EmbeddingDimension

SpeakerRecognitionBase<T>.MfccExtractor

SpeakerRecognitionBase<T>.ComputeCosineSimilarity(Vector<T>, Vector<T>)

SpeakerRecognitionBase<T>.ComputeCosineSimilarity(Tensor<T>, Tensor<T>)

SpeakerRecognitionBase<T>.NormalizeEmbedding(Tensor<T>)

SpeakerRecognitionBase<T>.AggregateEmbeddings(IReadOnlyList<Tensor<T>>)

SpeakerRecognitionBase<T>.CreateMfccExtractor(int, int)

AudioNeuralNetworkBase<T>.NumMels

AudioNeuralNetworkBase<T>.OnnxEncoder

AudioNeuralNetworkBase<T>.OnnxDecoder

AudioNeuralNetworkBase<T>.OnnxModel

AudioNeuralNetworkBase<T>.MelSpec

AudioNeuralNetworkBase<T>.SupportsTraining

AudioNeuralNetworkBase<T>.RunOnnxInference(Tensor<T>)

AudioNeuralNetworkBase<T>.Forward(Tensor<T>)

AudioNeuralNetworkBase<T>.DefaultLossFunction

AudioNeuralNetworkBase<T>.CreateMelSpectrogram(int, int, int, int)

NeuralNetworkBase<T>.Layers

NeuralNetworkBase<T>.LayerCount

NeuralNetworkBase<T>.Architecture

NeuralNetworkBase<T>.NumOps

NeuralNetworkBase<T>.Engine

NeuralNetworkBase<T>._layerInputs

NeuralNetworkBase<T>._layerOutputs

NeuralNetworkBase<T>.Random

NeuralNetworkBase<T>.LossFunction

NeuralNetworkBase<T>.LastLoss

NeuralNetworkBase<T>.IsTrainingMode

NeuralNetworkBase<T>.SupportsGpuTraining

NeuralNetworkBase<T>.CanTrainOnGpu

NeuralNetworkBase<T>.GpuEngine

NeuralNetworkBase<T>.MaxGradNorm

NeuralNetworkBase<T>._mixedPrecisionContext

NeuralNetworkBase<T>._memoryManager

NeuralNetworkBase<T>.IsMemoryManagementEnabled

NeuralNetworkBase<T>.IsGradientCheckpointingEnabled

NeuralNetworkBase<T>.IsMixedPrecisionEnabled

NeuralNetworkBase<T>.ClipGradients(List<Tensor<T>>)

NeuralNetworkBase<T>.ClipGradient(Tensor<T>)

NeuralNetworkBase<T>.ClipGradient(Vector<T>)

NeuralNetworkBase<T>.GetParameters()

NeuralNetworkBase<T>.Backpropagate(Tensor<T>)

NeuralNetworkBase<T>.BackpropagateWithRecompute(Tensor<T>)

NeuralNetworkBase<T>.ForwardGpu(IGpuTensor<T>)

NeuralNetworkBase<T>.BackpropagateGpu(IGpuTensor<T>)

NeuralNetworkBase<T>.BackpropagateGpuDeferred(IGpuTensor<T>, GpuExecutionOptions)

NeuralNetworkBase<T>.UpdateParametersGpu(T, T, T)

NeuralNetworkBase<T>.UpdateParametersGpu(IGpuOptimizerConfig)

NeuralNetworkBase<T>.UpdateParametersGpuDeferred(IGpuOptimizerConfig, GpuExecutionOptions)

NeuralNetworkBase<T>.TrainBatchGpuDeferred(IGpuTensor<T>, IGpuTensor<T>, IGpuOptimizerConfig, GpuExecutionOptions)

NeuralNetworkBase<T>.TrainBatchGpuDeferredAsync(IGpuTensor<T>, IGpuTensor<T>, IGpuOptimizerConfig, GpuExecutionOptions, CancellationToken)

NeuralNetworkBase<T>.UploadWeightsToGpu()

NeuralNetworkBase<T>.DownloadWeightsFromGpu()

NeuralNetworkBase<T>.ZeroGradientsGpu()

NeuralNetworkBase<T>.ExtractSingleExample(Tensor<T>, int)

NeuralNetworkBase<T>.ForwardWithMemory(Tensor<T>)

NeuralNetworkBase<T>.ForwardWithCheckpointing(Tensor<T>)

NeuralNetworkBase<T>.CanUseGpuResidentPath()

NeuralNetworkBase<T>.TryForwardGpuOptimized(Tensor<T>, out Tensor<T>)

NeuralNetworkBase<T>.ForwardGpu(Tensor<T>)

NeuralNetworkBase<T>.ForwardDeferred(Tensor<T>)

NeuralNetworkBase<T>.ForwardDeferredAsync(Tensor<T>, CancellationToken)

NeuralNetworkBase<T>.BeginGpuExecution(GpuExecutionOptions)

NeuralNetworkBase<T>.ForwardWithGpuContext(Tensor<T>)

NeuralNetworkBase<T>.ForwardWithGpuContext(IGpuTensor<T>)

NeuralNetworkBase<T>.GetGpuMemoryStats()

NeuralNetworkBase<T>.ForwardWithFeatures(Tensor<T>, int[])

NeuralNetworkBase<T>.ParameterCount

NeuralNetworkBase<T>.GetParameterCount()

NeuralNetworkBase<T>.InvalidateParameterCountCache()

NeuralNetworkBase<T>.AddLayerToCollection(ILayer<T>)

NeuralNetworkBase<T>.RemoveLayerFromCollection(ILayer<T>)

NeuralNetworkBase<T>.ClearLayers()

NeuralNetworkBase<T>.ValidateCustomLayers(List<ILayer<T>>)

NeuralNetworkBase<T>.ValidateCustomLayersInternal(List<ILayer<T>>)

NeuralNetworkBase<T>.IsValidInputLayer(ILayer<T>)

NeuralNetworkBase<T>.IsValidOutputLayer(ILayer<T>)

NeuralNetworkBase<T>.AreLayersCompatible(ILayer<T>, ILayer<T>)

NeuralNetworkBase<T>.GetParameterGradients()

NeuralNetworkBase<T>.EnsureArchitectureInitialized()

NeuralNetworkBase<T>.SetTrainingMode(bool)

NeuralNetworkBase<T>.EnableMemoryManagement(TrainingMemoryConfig)

NeuralNetworkBase<T>.DisableMemoryManagement()

NeuralNetworkBase<T>.GetMemoryEstimate(int, int)

NeuralNetworkBase<T>.GetLastLoss()

NeuralNetworkBase<T>.ResetState()

NeuralNetworkBase<T>.BackwardWithInputGradient(Tensor<T>)

NeuralNetworkBase<T>.ComputeInputGradient(Vector<T>, Vector<T>)

NeuralNetworkBase<T>.ComputeInputGradient(Tensor<T>, Tensor<T>)

NeuralNetworkBase<T>.SaveModel(string)

NeuralNetworkBase<T>.LoadModel(string)

NeuralNetworkBase<T>.Serialize()

NeuralNetworkBase<T>.Deserialize(byte[])

NeuralNetworkBase<T>.WithParameters(Vector<T>)

NeuralNetworkBase<T>.GetActiveFeatureIndices()

NeuralNetworkBase<T>.IsFeatureUsed(int)

NeuralNetworkBase<T>.DeepCopy()

NeuralNetworkBase<T>.Clone()

NeuralNetworkBase<T>.SetActiveFeatureIndices(IEnumerable<int>)

NeuralNetworkBase<T>._enabledMethods

NeuralNetworkBase<T>._sensitiveFeatures

NeuralNetworkBase<T>._fairnessMetrics

NeuralNetworkBase<T>._baseModel

NeuralNetworkBase<T>.GetGlobalFeatureImportanceAsync()

NeuralNetworkBase<T>.GetLocalFeatureImportanceAsync(Tensor<T>)

NeuralNetworkBase<T>.GetShapValuesAsync(Tensor<T>)

NeuralNetworkBase<T>.GetLimeExplanationAsync(Tensor<T>, int)

NeuralNetworkBase<T>.GetPartialDependenceAsync(Vector<int>, int)

NeuralNetworkBase<T>.GetCounterfactualAsync(Tensor<T>, Tensor<T>, int)

NeuralNetworkBase<T>.GetModelSpecificInterpretabilityAsync()

NeuralNetworkBase<T>.GenerateTextExplanationAsync(Tensor<T>, Tensor<T>)

NeuralNetworkBase<T>.GetFeatureInteractionAsync(int, int)

NeuralNetworkBase<T>.ValidateFairnessAsync(Tensor<T>, int)

NeuralNetworkBase<T>.GetAnchorExplanationAsync(Tensor<T>, T)

NeuralNetworkBase<T>.SetBaseModel<TInput, TOutput>(IFullModel<T, TInput, TOutput>)

NeuralNetworkBase<T>.EnableMethod(params InterpretationMethod[])

NeuralNetworkBase<T>.ConfigureFairness(Vector<int>, params FairnessMetric[])

NeuralNetworkBase<T>.GetNamedLayerActivations(Tensor<T>)

NeuralNetworkBase<T>.GetArchitecture()

NeuralNetworkBase<T>.GetFeatureImportance()

NeuralNetworkBase<T>.SetParameters(Vector<T>)

NeuralNetworkBase<T>.AddLayer(LayerType, int, ActivationFunction)

NeuralNetworkBase<T>.AddConvolutionalLayer(int, int, int, ActivationFunction)

NeuralNetworkBase<T>.AddLSTMLayer(int, bool)

NeuralNetworkBase<T>.AddDropoutLayer(double)

NeuralNetworkBase<T>.AddBatchNormalizationLayer(int, double, double)

NeuralNetworkBase<T>.AddPoolingLayer(int[], PoolingType, int, int?)

NeuralNetworkBase<T>.GetGradients()

NeuralNetworkBase<T>.GetInputShape()

NeuralNetworkBase<T>.GetLayerActivations(Tensor<T>)

NeuralNetworkBase<T>.ComputeGradients(Tensor<T>, Tensor<T>, ILossFunction<T>)

NeuralNetworkBase<T>.ApplyGradients(Vector<T>, T)

NeuralNetworkBase<T>.SaveState(Stream)

NeuralNetworkBase<T>.LoadState(Stream)

NeuralNetworkBase<T>.SupportsJitCompilation

NeuralNetworkBase<T>.ExportComputationGraph(List<ComputationNode<T>>)

NeuralNetworkBase<T>.ConvertLayerToGraph(ILayer<T>, ComputationNode<T>)

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Extension Methods: DistributedExtensions.AsDistributedForHighBandwidth<T, TInput, TOutput>(IFullModel<T, TInput, TOutput>, ICommunicationBackend<T>)

DistributedExtensions.AsDistributedForLowBandwidth<T, TInput, TOutput>(IFullModel<T, TInput, TOutput>, ICommunicationBackend<T>)

DistributedExtensions.AsDistributed<T, TInput, TOutput>(IFullModel<T, TInput, TOutput>, ICommunicationBackend<T>)

DistributedExtensions.AsDistributed<T, TInput, TOutput>(IFullModel<T, TInput, TOutput>, IShardingConfiguration<T>)

Remarks

Speaker diarization segments audio by speaker, answering "who spoke when?" It extracts speaker embeddings from sliding windows and clusters them to identify speaker turns.

This class supports both:

ONNX mode: Load pre-trained models for fast inference
Native training mode: Train from scratch using the layer architecture

For Beginners: Diarization is like automatically labeling a meeting recording with "Speaker A: 0:00-0:15, Speaker B: 0:15-0:45..."

The process:

Split audio into short segments
Extract speaker embeddings for each segment
Cluster similar embeddings together
Each cluster represents a different speaker
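
The steps above can be sketched in plain C#. This is a minimal illustration, not the library's implementation: the class DiarizationSketch and its methods are hypothetical names, embeddings are assumed to be precomputed double[] vectors (step 2 is not shown), and a simple greedy threshold clustering stands in for whatever clustering SpeakerDiarizer<T> actually uses.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: none of these names are part of AiDotNet.
public static class DiarizationSketch
{
    // Cosine similarity between two equal-length vectors (0 if either is all zeros).
    public static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Step 1: start offsets (in samples) of fixed-length sliding windows.
    public static List<int> WindowStarts(int totalSamples, int windowSize, int hop)
    {
        var starts = new List<int>();
        for (int s = 0; s + windowSize <= totalSamples; s += hop)
            starts.Add(s);
        return starts;
    }

    // Steps 3-4: assign each embedding to the most similar existing cluster,
    // or open a new cluster when no cluster reaches the threshold.
    public static int[] Cluster(IReadOnlyList<double[]> embeddings, double threshold)
    {
        // Running sum per cluster; it points in the same direction as the mean,
        // which is all cosine similarity needs.
        var sums = new List<double[]>();
        var labels = new int[embeddings.Count];
        for (int i = 0; i < embeddings.Count; i++)
        {
            int best = -1;
            double bestSim = threshold;
            for (int c = 0; c < sums.Count; c++)
            {
                double sim = Cosine(embeddings[i], sums[c]);
                if (sim >= bestSim) { best = c; bestSim = sim; }
            }
            if (best < 0)
            {
                sums.Add((double[])embeddings[i].Clone());
                best = sums.Count - 1;
            }
            else
            {
                for (int d = 0; d < sums[best].Length; d++)
                    sums[best][d] += embeddings[i][d];
            }
            labels[i] = best;
        }
        return labels;
    }
}
```

Each distinct label in the returned array corresponds to one speaker; consecutive windows with the same label would then be joined into a speaker turn.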

Common applications:

Meeting transcription
Call center analytics
Podcast processing

Usage:

// ONNX mode (recommended for inference)
var onnxDiarizer = new SpeakerDiarizer<float>(architecture, modelPath);
var result = onnxDiarizer.Diarize(audioTensor);

// Native training mode
var trainableDiarizer = new SpeakerDiarizer<float>(architecture);
trainableDiarizer.Train(features, labels);

Constructors

SpeakerDiarizer(SpeakerDiarizerOptions?)

Creates a new speaker diarizer with legacy options only.

public SpeakerDiarizer(SpeakerDiarizerOptions? options = null)

Parameters

options SpeakerDiarizerOptions: Diarization options.

Remarks

Legacy API: Prefer the constructors that take a NeuralNetworkArchitecture<T> parameter. This constructor creates a default architecture for backward compatibility.

SpeakerDiarizer(NeuralNetworkArchitecture<T>, SpeakerDiarizerOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?)

Creates a new speaker diarizer in native training mode.

public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, SpeakerDiarizerOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null)

Parameters

architecture NeuralNetworkArchitecture<T>: Neural network architecture configuration.
options SpeakerDiarizerOptions: Diarization options.
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>: Optional custom optimizer (defaults to AdamW).

Remarks

For Beginners: Use this constructor when you want to train a model from scratch or fine-tune an existing model. Training requires labeled diarization data.

SpeakerDiarizer(NeuralNetworkArchitecture<T>, string, SpeakerDiarizerOptions?)

Creates a new speaker diarizer in ONNX inference mode.

public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, string modelPath, SpeakerDiarizerOptions? options = null)

Parameters

architecture NeuralNetworkArchitecture<T>: Neural network architecture configuration.
modelPath string: Path to the ONNX model file.
options SpeakerDiarizerOptions: Diarization options.

Remarks

For Beginners: Use this constructor for production inference with pre-trained models. ONNX models are optimized for fast execution on various hardware.

Exceptions

ArgumentNullException: Thrown when modelPath is null.
FileNotFoundException: Thrown when the model file doesn't exist.
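
A caller may want to check the path before constructing the diarizer, for example to report a configuration error early. The helper below is a hypothetical sketch (ModelPathGuard is not part of AiDotNet) that raises the same two exception types documented above; it makes no claim about how the constructor itself performs the check.

```csharp
using System;
using System.IO;

// Hypothetical caller-side helper mirroring the documented exceptions.
public static class ModelPathGuard
{
    // Returns the path unchanged when it is non-null and the file exists.
    public static string Validate(string? modelPath)
    {
        if (modelPath is null)
            throw new ArgumentNullException(nameof(modelPath));
        if (!File.Exists(modelPath))
            throw new FileNotFoundException("ONNX model file not found.", modelPath);
        return modelPath;
    }
}
```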

Properties

ClusteringThreshold

Gets the clustering threshold.

public double ClusteringThreshold { get; }

Property Value

double

IsOnnxMode

Gets whether the model is operating in ONNX inference mode.

public bool IsOnnxMode { get; }

Property Value

bool

MinSegmentDuration

Gets the minimum segment duration in seconds.

public double MinSegmentDuration { get; }

Property Value

double
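
To illustrate what a minimum segment duration means in practice, the sketch below drops segments shorter than a given number of seconds. SegmentSketch and MinDurationFilter are hypothetical types written for this example; the source does not say whether SpeakerDiarizer<T> drops, merges, or relabels short segments, so dropping is an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical segment type: a speaker label plus start/end times in seconds.
public sealed class SegmentSketch
{
    public SegmentSketch(string speaker, double start, double end)
    {
        Speaker = speaker;
        Start = start;
        End = end;
    }

    public string Speaker { get; }
    public double Start { get; }
    public double End { get; }
    public double Duration => End - Start;
}

public static class MinDurationFilter
{
    // Keeps only segments at least minSegmentDuration seconds long (assumed policy).
    public static List<SegmentSketch> Apply(IEnumerable<SegmentSketch> segments, double minSegmentDuration) =>
        segments.Where(s => s.Duration >= minSegmentDuration).ToList();
}
```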

MinTurnDuration

Gets the minimum turn duration in seconds.

public double MinTurnDuration { get; }

Property Value

double

Remarks

Legacy API: Use MinSegmentDuration instead.

SampleRate

Gets the sample rate.

public int SampleRate { get; }

Property Value

int

SupportsOverlapDetection

Gets whether this model can detect overlapping speech.

public bool SupportsOverlapDetection { get; }

Property Value

bool

Remarks

For Beginners: Overlapping speech is when two or more people talk at the same time. This implementation currently does not support overlap detection.

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>: New model instance.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader: Binary reader.

Diarize(Tensor<T>, int?, int, int)

Performs speaker diarization on audio.

public DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)

Parameters

audio Tensor<T>: Audio waveform tensor [samples].
numSpeakers int?: Expected number of speakers. Auto-detected if null.
minSpeakers int: Minimum number of speakers (for auto-detection).
maxSpeakers int: Maximum number of speakers (for auto-detection).

Returns

DiarizationResult<T>: Diarization result with speaker segments.
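
The interaction between numSpeakers, minSpeakers, and maxSpeakers can be read as follows: an explicit count wins, otherwise an automatically estimated count is kept within the given bounds. The sketch below encodes that reading. SpeakerCountSketch.Resolve and the estimated argument are hypothetical, and the source does not confirm that the library resolves the count exactly this way.

```csharp
using System;

// Hypothetical helper; not part of AiDotNet.
public static class SpeakerCountSketch
{
    // numSpeakers: caller-supplied count, or null to use the estimate.
    // estimated:   count produced by some auto-detection step (assumed to exist).
    public static int Resolve(int? numSpeakers, int estimated, int minSpeakers = 1, int maxSpeakers = 10)
    {
        if (minSpeakers < 1 || maxSpeakers < minSpeakers)
            throw new ArgumentOutOfRangeException(nameof(minSpeakers), "Require 1 <= minSpeakers <= maxSpeakers.");

        // An explicit count takes precedence over auto-detection.
        if (numSpeakers.HasValue)
            return numSpeakers.Value;

        // Otherwise keep the estimate inside [minSpeakers, maxSpeakers].
        return Math.Clamp(estimated, minSpeakers, maxSpeakers);
    }
}
```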

DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)

Performs speaker diarization asynchronously.

public Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)

Parameters

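audio Tensor<T>: Audio waveform tensor [samples].
numSpeakers int?: Expected number of speakers. Auto-detected if null.
minSpeakers int: Minimum number of speakers (for auto-detection).
maxSpeakers int: Maximum number of speakers (for auto-detection).
cancellationToken CancellationToken: Token used to cancel the operation.

Returns

Task<DiarizationResult<T>>: A task that yields the diarization result with speaker segments.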

DiarizeLegacy(Tensor<T>)

Performs diarization on audio (legacy API).

public DiarizationResult DiarizeLegacy(Tensor<T> audio)

Parameters

audio Tensor<T>: Audio samples as a tensor.

Returns

DiarizationResult: Legacy diarization result.

Remarks

Legacy API: Prefer using Diarize(Tensor<T>, int?, int, int) instead.

DiarizeLegacy(Vector<T>)

Performs diarization on audio (legacy API).

public DiarizationResult DiarizeLegacy(Vector<T> audio)

Parameters

audio Vector<T>: Audio samples as a vector.

Returns

DiarizationResult: Legacy diarization result.

DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)

Performs diarization with known speaker profiles.

public DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)

Parameters

audio Tensor<T>
knownSpeakers IReadOnlyList<SpeakerProfile<T>>
allowUnknownSpeakers bool

Returns

DiarizationResult<T>

Dispose()

Disposes resources.

public void Dispose()

Dispose(bool)

Disposes managed resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)

Gets speaker embeddings for each detected speaker.

public IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)

Parameters

audio Tensor<T>
diarizationResult DiarizationResult<T>

Returns

IReadOnlyDictionary<string, Tensor<T>>

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>: Model metadata.

InitializeLayers()

Initializes the neural network layers.

protected override void InitializeLayers()

Remarks

This follows the golden standard pattern:

1. If in ONNX mode, layers are not needed (inference uses ONNX runtime)
2. If Architecture.Layers is provided, use those layers
3. Otherwise, fall back to LayerHelper.CreateDefaultSpeakerEmbeddingLayers()

For Beginners: Layers are only initialized in native training mode. In ONNX mode, the model is already fully trained and ready for inference.
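
The three-way selection described in these remarks can be sketched generically. LayerSelectionSketch is a hypothetical helper, the layer type is left as a type parameter, and the createDefaults delegate stands in for LayerHelper.CreateDefaultSpeakerEmbeddingLayers(); treating an empty layer list the same as a missing one is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the selection order; not part of AiDotNet.
public static class LayerSelectionSketch
{
    public static IReadOnlyList<TLayer> Select<TLayer>(
        bool isOnnxMode,
        IReadOnlyList<TLayer>? architectureLayers,
        Func<IReadOnlyList<TLayer>> createDefaults)
    {
        // 1. ONNX mode: inference runs in the ONNX runtime, so no native layers.
        if (isOnnxMode)
            return Array.Empty<TLayer>();

        // 2. Layers supplied with the architecture take priority.
        if (architectureLayers != null && architectureLayers.Count > 0)
            return architectureLayers;

        // 3. Otherwise fall back to the default layer stack.
        return createDefaults();
    }
}
```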

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>: Model output tensor.

Returns

Tensor<T>: Postprocessed output.

Predict(Tensor<T>)

Predicts output for the given input.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>: Input tensor (audio features).

Returns

Tensor<T>: Output tensor (speaker probabilities per frame).

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>: Raw audio waveform.

Returns

Tensor<T>: Preprocessed audio features.

RefineDiarization(Tensor<T>, DiarizationResult<T>, T)

Refines diarization result by re-segmenting with different parameters.

public DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)

Parameters

audio Tensor<T>
previousResult DiarizationResult<T>
mergeThreshold T

Returns

DiarizationResult<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter: Binary writer.

Train(Tensor<T>, Tensor<T>)

Trains the model on a single example.

public override void Train(Tensor<T> input, Tensor<T> expected)

Parameters

input Tensor<T>: Input features.
expected Tensor<T>: Expected output.

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>: Parameter vector.
