Class SpeakerDiarizer<T>
Performs speaker diarization (who spoke when) on audio recordings.
public class SpeakerDiarizer<T> : SpeakerRecognitionBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeakerDiarizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
SpeakerDiarizer<T>
Remarks
Speaker diarization segments audio by speaker, answering "who spoke when?" It uses embeddings from sliding windows and clustering to identify speaker turns.
This class supports two modes:
- ONNX mode: Load pre-trained models for fast inference
- Native training mode: Train from scratch using the layer architecture
For Beginners: Diarization is like automatically labeling a meeting recording with "Speaker A: 0:00-0:15, Speaker B: 0:15-0:45..."
The process:
- Split audio into short segments
- Extract speaker embeddings for each segment
- Cluster similar embeddings together
- Each cluster represents a different speaker
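The four steps above can be sketched in a few lines of self-contained C#. This is an illustration only, not the SpeakerDiarizer<T> implementation: `ToyEmbedding` is a hypothetical stand-in for a neural speaker embedding, and the greedy threshold clustering in `AssignSpeakers` is an assumed, simplified substitute for whatever clustering the library performs.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; all names here are hypothetical.
public static class DiarizationSketch
{
    // Stand-in for a speaker embedding:
    // [mean absolute amplitude, zero-crossing rate] of one window.
    public static double[] ToyEmbedding(float[] audio, int start, int length)
    {
        double amplitude = 0;
        int crossings = 0;
        for (int i = start; i < start + length; i++)
        {
            amplitude += Math.Abs(audio[i]);
            if (i > start && (audio[i] >= 0) != (audio[i - 1] >= 0)) crossings++;
        }
        return new[] { amplitude / length, (double)crossings / length };
    }

    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Steps 1-4: window the audio, embed each window, and assign it to the
    // most similar existing cluster, or open a new cluster if none is
    // similar enough. Returns one cluster (speaker) index per window.
    public static int[] AssignSpeakers(float[] audio, int windowSize, int hopSize, double threshold)
    {
        var centroids = new List<double[]>();
        var labels = new List<int>();
        for (int start = 0; start + windowSize <= audio.Length; start += hopSize)
        {
            double[] embedding = ToyEmbedding(audio, start, windowSize);
            int best = -1;
            double bestSimilarity = threshold;
            for (int c = 0; c < centroids.Count; c++)
            {
                double similarity = CosineSimilarity(embedding, centroids[c]);
                if (similarity >= bestSimilarity) { best = c; bestSimilarity = similarity; }
            }
            if (best < 0) { centroids.Add(embedding); best = centroids.Count - 1; }
            labels.Add(best);
        }
        return labels.ToArray();
    }
}
```

A higher threshold makes the sketch open new clusters more readily, which is one plausible reading of how a clustering threshold trades over-segmentation against merged speakers.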
Common applications:
- Meeting transcription
- Call center analytics
- Podcast processing
Usage:
// ONNX mode (recommended for inference)
var onnxDiarizer = new SpeakerDiarizer<float>(architecture, modelPath);
var result = onnxDiarizer.Diarize(audioTensor);
// Native training mode
var trainableDiarizer = new SpeakerDiarizer<float>(architecture);
trainableDiarizer.Train(features, labels);
Constructors
SpeakerDiarizer(SpeakerDiarizerOptions?)
Creates a new speaker diarizer with legacy options only.
public SpeakerDiarizer(SpeakerDiarizerOptions? options = null)
Parameters
options (SpeakerDiarizerOptions): Diarization options.
Remarks
Legacy API: Prefer the constructors that take a NeuralNetworkArchitecture<T> parameter. This constructor creates a default architecture for backward compatibility.
SpeakerDiarizer(NeuralNetworkArchitecture<T>, SpeakerDiarizerOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?)
Creates a new speaker diarizer in native training mode.
public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, SpeakerDiarizerOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): Neural network architecture configuration.
options (SpeakerDiarizerOptions): Diarization options.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>): Optional custom optimizer (defaults to AdamW).
Remarks
For Beginners: Use this constructor when you want to train a model from scratch or fine-tune an existing model. Training requires labeled diarization data.
SpeakerDiarizer(NeuralNetworkArchitecture<T>, string, SpeakerDiarizerOptions?)
Creates a new speaker diarizer in ONNX inference mode.
public SpeakerDiarizer(NeuralNetworkArchitecture<T> architecture, string modelPath, SpeakerDiarizerOptions? options = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): Neural network architecture configuration.
modelPath (string): Path to the ONNX model file.
options (SpeakerDiarizerOptions): Diarization options.
Remarks
For Beginners: Use this constructor for production inference with pre-trained models. ONNX models are optimized for fast execution on various hardware.
Exceptions
- ArgumentNullException
Thrown when modelPath is null.
- FileNotFoundException
Thrown when the model file doesn't exist.
Properties
ClusteringThreshold
Gets the clustering threshold.
public double ClusteringThreshold { get; }
Property Value
- double
IsOnnxMode
Gets whether the model is operating in ONNX inference mode.
public bool IsOnnxMode { get; }
Property Value
- bool
MinSegmentDuration
Gets the minimum segment duration in seconds.
public double MinSegmentDuration { get; }
Property Value
- double
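To make the role of a minimum segment duration concrete, here is a minimal sketch of one way short segments could be handled. It assumes that a segment shorter than the minimum is absorbed by the segment before it, and that adjacent segments from the same speaker are merged; the library may apply a different rule. `SpeakerSegment` and `AbsorbShortSegments` are hypothetical names, not part of this API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical segment type for illustration; times are in seconds.
public record SpeakerSegment(string Speaker, double Start, double End)
{
    public double Duration => End - Start;
}

public static class SegmentFilterSketch
{
    // Assumed rule: a segment shorter than minSegmentDuration, or one with the
    // same speaker as its predecessor, extends the previous segment.
    // A short first segment is kept because there is nothing to absorb it.
    public static List<SpeakerSegment> AbsorbShortSegments(
        IReadOnlyList<SpeakerSegment> segments, double minSegmentDuration)
    {
        var result = new List<SpeakerSegment>();
        foreach (var segment in segments)
        {
            int last = result.Count - 1;
            bool absorb = last >= 0
                && (segment.Duration < minSegmentDuration
                    || segment.Speaker == result[last].Speaker);
            if (absorb)
            {
                // Stretch the previous segment to cover this one.
                result[last] = result[last] with { End = segment.End };
            }
            else
            {
                result.Add(segment);
            }
        }
        return result;
    }
}
```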
MinTurnDuration
Gets the minimum turn duration in seconds.
public double MinTurnDuration { get; }
Property Value
- double
Remarks
Legacy API: Use MinSegmentDuration instead.
SampleRate
Gets the sample rate.
public int SampleRate { get; }
Property Value
- int
SupportsOverlapDetection
Gets whether this model can detect overlapping speech.
public bool SupportsOverlapDetection { get; }
Property Value
- bool
Remarks
For Beginners: Overlapping speech is when two or more people talk at the same time. This implementation currently does not support overlap detection.
Methods
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
New model instance.
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader): Binary reader.
Diarize(Tensor<T>, int?, int, int)
Performs speaker diarization on audio.
public DiarizationResult<T> Diarize(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples].
numSpeakers (int?): Expected number of speakers. Auto-detected if null.
minSpeakers (int): Minimum number of speakers (for auto-detection).
maxSpeakers (int): Maximum number of speakers (for auto-detection).
Returns
- DiarizationResult<T>
Diarization result with speaker segments.
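The interaction between numSpeakers, minSpeakers, and maxSpeakers can be sketched as follows. This assumes, based only on the parameter descriptions above, that an explicit numSpeakers is used as given and that the bounds constrain only an auto-detected estimate. `ResolveSpeakerCount` and its `estimated` argument are hypothetical and stand in for whatever estimate the clustering step produces.

```csharp
using System;

public static class SpeakerCountSketch
{
    // Hypothetical helper: picks the speaker count to cluster into.
    public static int ResolveSpeakerCount(
        int? numSpeakers, int estimated, int minSpeakers = 1, int maxSpeakers = 10)
    {
        if (minSpeakers < 1 || maxSpeakers < minSpeakers)
            throw new ArgumentException("Expected 1 <= minSpeakers <= maxSpeakers.");

        // An explicit count wins; otherwise clamp the estimate to the allowed range.
        return numSpeakers ?? Math.Clamp(estimated, minSpeakers, maxSpeakers);
    }
}
```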
DiarizeAsync(Tensor<T>, int?, int, int, CancellationToken)
Performs speaker diarization asynchronously.
public Task<DiarizationResult<T>> DiarizeAsync(Tensor<T> audio, int? numSpeakers = null, int minSpeakers = 1, int maxSpeakers = 10, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>)
numSpeakers (int?)
minSpeakers (int)
maxSpeakers (int)
cancellationToken (CancellationToken)
Returns
- Task<DiarizationResult<T>>
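For readers new to the Task-plus-CancellationToken pattern in this signature, the sketch below shows one common way a synchronous operation is exposed asynchronously. It is an assumption about the general pattern, not a description of how DiarizeAsync is implemented; `RunAsync` is a hypothetical helper.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncDiarizationSketch
{
    // Hypothetical wrapper: runs synchronous work on the thread pool.
    // Cancellation is honored only before the work starts; the work
    // itself is not interrupted once it is running.
    public static Task<TResult> RunAsync<TInput, TResult>(
        Func<TInput, TResult> work, TInput input,
        CancellationToken cancellationToken = default)
    {
        return Task.Run(() =>
        {
            cancellationToken.ThrowIfCancellationRequested();
            return work(input);
        }, cancellationToken);
    }
}
```

A caller would typically await the returned task and catch OperationCanceledException to handle a cancelled request.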
DiarizeLegacy(Tensor<T>)
Performs diarization on audio (legacy API).
public DiarizationResult DiarizeLegacy(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio samples as a tensor.
Returns
- DiarizationResult
Legacy diarization result.
Remarks
Legacy API: Prefer using Diarize(Tensor<T>, int?, int, int) instead.
DiarizeLegacy(Vector<T>)
Performs diarization on audio (legacy API).
public DiarizationResult DiarizeLegacy(Vector<T> audio)
Parameters
audio (Vector<T>): Audio samples as a vector.
Returns
- DiarizationResult
Legacy diarization result.
DiarizeWithKnownSpeakers(Tensor<T>, IReadOnlyList<SpeakerProfile<T>>, bool)
Performs diarization with known speaker profiles.
public DiarizationResult<T> DiarizeWithKnownSpeakers(Tensor<T> audio, IReadOnlyList<SpeakerProfile<T>> knownSpeakers, bool allowUnknownSpeakers = true)
Parameters
audio (Tensor<T>)
knownSpeakers (IReadOnlyList<SpeakerProfile<T>>)
allowUnknownSpeakers (bool)
Returns
- DiarizationResult<T>
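Matching a segment against known speaker profiles can be pictured as a nearest-profile lookup. The sketch below assumes cosine similarity between embeddings, a fixed similarity threshold, and an "Unknown" label when no profile is close enough and unknown speakers are allowed. None of these details are confirmed by this page, and `MatchSpeaker` is a hypothetical helper that uses plain arrays instead of SpeakerProfile<T>.

```csharp
using System;
using System.Collections.Generic;

public static class KnownSpeakerSketch
{
    public static double CosineSimilarity(IReadOnlyList<double> a, IReadOnlyList<double> b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Count; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the name of the closest profile. If the best score is below the
    // threshold and unknown speakers are allowed, returns "Unknown" instead.
    public static string MatchSpeaker(
        IReadOnlyList<double> embedding,
        IReadOnlyDictionary<string, double[]> knownSpeakers,
        double threshold,
        bool allowUnknownSpeakers = true)
    {
        if (knownSpeakers.Count == 0)
            throw new ArgumentException("At least one known speaker is required.", nameof(knownSpeakers));

        string bestName = "";
        double bestScore = double.NegativeInfinity;
        foreach (var profile in knownSpeakers)
        {
            double score = CosineSimilarity(embedding, profile.Value);
            if (score > bestScore) { bestScore = score; bestName = profile.Key; }
        }

        if (bestScore < threshold && allowUnknownSpeakers) return "Unknown";
        return bestName;
    }
}
```

With allowUnknownSpeakers set to false, the sketch forces every segment onto the closest known profile, even when the match is weak.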
Dispose()
Disposes resources.
public void Dispose()
Dispose(bool)
Disposes managed resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool)
ExtractSpeakerEmbeddings(Tensor<T>, DiarizationResult<T>)
Gets speaker embeddings for each detected speaker.
public IReadOnlyDictionary<string, Tensor<T>> ExtractSpeakerEmbeddings(Tensor<T> audio, DiarizationResult<T> diarizationResult)
Parameters
audio (Tensor<T>)
diarizationResult (DiarizationResult<T>)
Returns
- IReadOnlyDictionary<string, Tensor<T>>
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
Model metadata.
InitializeLayers()
Initializes the neural network layers.
protected override void InitializeLayers()
Remarks
This follows the golden standard pattern:
1. If in ONNX mode, layers are not needed (inference uses ONNX Runtime).
2. If Architecture.Layers is provided, use those layers.
3. Otherwise, fall back to LayerHelper.CreateDefaultSpeakerEmbeddingLayers().
For Beginners: Layers are only initialized in native training mode. In ONNX mode, the model is already fully trained and ready for inference.
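The three-way decision described in these remarks can be sketched as a small selection function. Strings stand in for layer objects here, and `SelectLayers` together with its `createDefaultLayers` callback are hypothetical; the real method works on the network's own layer types and LayerHelper.

```csharp
using System;
using System.Collections.Generic;

public static class LayerSelectionSketch
{
    // Hypothetical illustration of the fallback order; strings stand in for layers.
    public static IReadOnlyList<string> SelectLayers(
        bool isOnnxMode,
        IReadOnlyList<string> architectureLayers,
        Func<IReadOnlyList<string>> createDefaultLayers)
    {
        // 1. ONNX mode: no native layers are needed.
        if (isOnnxMode) return Array.Empty<string>();

        // 2. Layers supplied by the architecture take priority.
        if (architectureLayers != null && architectureLayers.Count > 0)
            return architectureLayers;

        // 3. Otherwise fall back to the default layer stack.
        return createDefaultLayers();
    }
}
```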
PostprocessOutput(Tensor<T>)
Postprocesses model output into the final result format.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>): Model output tensor.
Returns
- Tensor<T>
Postprocessed output.
Predict(Tensor<T>)
Predicts output for the given input.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>): Input tensor (audio features).
Returns
- Tensor<T>
Output tensor (speaker probabilities per frame).
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>): Raw audio waveform.
Returns
- Tensor<T>
Preprocessed audio features.
RefineDiarization(Tensor<T>, DiarizationResult<T>, T)
Refines diarization result by re-segmenting with different parameters.
public DiarizationResult<T> RefineDiarization(Tensor<T> audio, DiarizationResult<T> previousResult, T mergeThreshold)
Parameters
audio (Tensor<T>)
previousResult (DiarizationResult<T>)
mergeThreshold (T)
Returns
- DiarizationResult<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter): Binary writer.
Train(Tensor<T>, Tensor<T>)
Trains the model on a single example.
public override void Train(Tensor<T> input, Tensor<T> expected)
Parameters
input (Tensor<T>): Input features.
expected (Tensor<T>): Expected output.
UpdateParameters(Vector<T>)
Updates model parameters.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): Parameter vector.