Class SpeakerEmbeddingExtractor<T>

Namespace
AiDotNet.Audio.Speaker
Assembly
AiDotNet.dll

Extracts speaker embeddings (d-vectors) from audio for speaker recognition.

public class SpeakerEmbeddingExtractor<T> : SpeakerRecognitionBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeakerEmbeddingExtractor<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
SpeakerRecognitionBase<T>
SpeakerEmbeddingExtractor<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

Speaker embeddings are compact vector representations that capture the unique characteristics of a speaker's voice. These can be used for speaker verification (is this the same person?) and speaker identification (who is speaking?).

For Beginners: Each person's voice has unique characteristics like pitch, rhythm, and timbre (tone color). This class converts audio into a numerical "fingerprint" of the speaker's voice.

These embeddings are vectors (lists of numbers) that are:

  • Close together for the same speaker
  • Far apart for different speakers

Usage (ONNX Mode):

var extractor = new SpeakerEmbeddingExtractor<float>(
    architecture,
    modelPath: "speaker_model.onnx");
var embedding = extractor.ExtractEmbedding(audio);

Usage (Native Training Mode):

var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
extractor.Train(audioInput, expectedEmbedding);

Constructors

SpeakerEmbeddingExtractor()

Creates a SpeakerEmbeddingExtractor with default settings for native training mode.

public SpeakerEmbeddingExtractor()

Remarks

For Beginners: This is the simplest way to create a speaker embedding extractor. It uses default settings suitable for most use cases.

SpeakerEmbeddingExtractor(SpeakerEmbeddingOptions)

Creates a SpeakerEmbeddingExtractor with custom options.

public SpeakerEmbeddingExtractor(SpeakerEmbeddingOptions options)

Parameters

options SpeakerEmbeddingOptions

Configuration options for the extractor.

Remarks

For Beginners: Use this constructor to customize sample rate, embedding dimension, etc.

SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T>, int, int, double, int, int, int, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a SpeakerEmbeddingExtractor for native training mode.

public SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T> architecture, int sampleRate = 16000, int embeddingDimension = 256, double minimumDurationSeconds = 0.5, int hiddenDim = 256, int numEncoderLayers = 3, int numHeads = 4, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

sampleRate int

Expected sample rate for input audio. Default is 16000.

embeddingDimension int

Dimension of output embeddings. Default is 256.

minimumDurationSeconds double

Minimum audio duration for reliable embedding. Default is 0.5.

hiddenDim int

Hidden dimension for encoder layers. Default is 256.

numEncoderLayers int

Number of encoder layers. Default is 3.

numHeads int

Number of attention heads. Default is 4.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, AdamW is used.

lossFunction ILossFunction<T>

Loss function for training. If null, MSE loss is used.

Remarks

For Beginners: Use this constructor to train your own speaker embedding model.

Example:

var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
extractor.Train(audioInput, expectedEmbedding);

SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T>, string, int, int, double, OnnxModelOptions?)

Creates a SpeakerEmbeddingExtractor for ONNX inference with a pretrained model.

public SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T> architecture, string modelPath, int sampleRate = 16000, int embeddingDimension = 256, double minimumDurationSeconds = 0.5, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelPath string

Required path to speaker embedding ONNX model.

sampleRate int

Expected sample rate for input audio. Default is 16000.

embeddingDimension int

Dimension of output embeddings. Default is 256.

minimumDurationSeconds double

Minimum audio duration for reliable embedding. Default is 0.5.

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor when you have a pretrained speaker embedding model.

Example:

var extractor = new SpeakerEmbeddingExtractor<float>(
    architecture,
    modelPath: "ecapa_tdnn.onnx");

Properties

HasNeuralModel

Gets whether a neural model is loaded.

public bool HasNeuralModel { get; }

Property Value

bool

IsOnnxMode

Gets whether the model is in ONNX inference mode.

public bool IsOnnxMode { get; }

Property Value

bool

MinimumDurationSeconds

Gets the minimum audio duration required for reliable embedding extraction.

public double MinimumDurationSeconds { get; }

Property Value

double
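
Example of checking this threshold before extraction (sketch; the 16 kHz sample rate matches the constructor default, but the numSamples variable holding the clip's sample count is assumed and not part of this API):

double durationSeconds = (double)numSamples / 16000;
if (durationSeconds < extractor.MinimumDurationSeconds)
{
    // Too short for a reliable embedding; skip this clip or buffer more audio.
}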

Methods

ComputeSimilarity(SpeakerEmbedding<T>, SpeakerEmbedding<T>)

Computes cosine similarity between two speaker embeddings (legacy API).

public T ComputeSimilarity(SpeakerEmbedding<T> embedding1, SpeakerEmbedding<T> embedding2)

Parameters

embedding1 SpeakerEmbedding<T>
embedding2 SpeakerEmbedding<T>

Returns

T

ComputeSimilarity(Tensor<T>, Tensor<T>)

Computes similarity between two speaker embeddings.

public T ComputeSimilarity(Tensor<T> embedding1, Tensor<T> embedding2)

Parameters

embedding1 Tensor<T>
embedding2 Tensor<T>

Returns

T
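
Example of a speaker-verification check built on this method (sketch; the 0.7 threshold is illustrative, not a library default, and should be tuned on your own data):

var emb1 = extractor.ExtractEmbedding(enrollmentAudio);
var emb2 = extractor.ExtractEmbedding(testAudio);
float similarity = extractor.ComputeSimilarity(emb1, emb2);
bool samePerson = similarity > 0.7f; // illustrative threshold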

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

Extract(Tensor<T>)

Extracts a speaker embedding from audio (legacy API).

public SpeakerEmbedding<T> Extract(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio samples as a tensor.

Returns

SpeakerEmbedding<T>

Speaker embedding result.

Extract(Vector<T>)

Extracts a speaker embedding from audio (legacy API).

public SpeakerEmbedding<T> Extract(Vector<T> audio)

Parameters

audio Vector<T>

Audio samples as a vector.

Returns

SpeakerEmbedding<T>

Speaker embedding result.

ExtractBatch(IEnumerable<Tensor<T>>)

Extracts embeddings from multiple audio segments (legacy API).

public List<SpeakerEmbedding<T>> ExtractBatch(IEnumerable<Tensor<T>> segments)

Parameters

segments IEnumerable<Tensor<T>>

Returns

List<SpeakerEmbedding<T>>

ExtractEmbedding(Tensor<T>)

Extracts a speaker embedding from audio.

public Tensor<T> ExtractEmbedding(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [samples] or [batch, samples].

Returns

Tensor<T>

Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].
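
The same call handles a single clip or a batch. Example (sketch; construction of the input tensors is assumed to follow the shapes above):

// Single clip: [samples] in, [embedding_dim] out.
var embedding = extractor.ExtractEmbedding(audio);
// Batch: [batch, samples] in, [batch, embedding_dim] out.
var embeddings = extractor.ExtractEmbedding(batchedAudio);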

ExtractEmbeddingAsync(Tensor<T>, CancellationToken)

Extracts a speaker embedding from audio asynchronously.

public Task<Tensor<T>> ExtractEmbeddingAsync(Tensor<T> audio, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>
cancellationToken CancellationToken

Returns

Task<Tensor<T>>
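
Example of cancellable extraction (sketch; uses the standard CancellationTokenSource from System.Threading, with an illustrative 5-second timeout):

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
try
{
    var embedding = await extractor.ExtractEmbeddingAsync(audio, cts.Token);
}
catch (OperationCanceledException)
{
    // Extraction was cancelled before completing.
}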

ExtractEmbeddings(IReadOnlyList<Tensor<T>>)

Extracts embeddings from multiple audio segments.

public IReadOnlyList<Tensor<T>> ExtractEmbeddings(IReadOnlyList<Tensor<T>> audioSegments)

Parameters

audioSegments IReadOnlyList<Tensor<T>>

Returns

IReadOnlyList<Tensor<T>>
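
Example of embedding several audio segments in one call (sketch; the segment tensors are assumed to be prepared elsewhere, e.g. by a diarization step):

var segments = new List<Tensor<float>> { segment1, segment2, segment3 };
IReadOnlyList<Tensor<float>> embeddings = extractor.ExtractEmbeddings(segments);
// embeddings[i] corresponds to segments[i].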

ExtractTensor(Tensor<T>)

Extracts speaker embedding from audio as a Tensor.

public Tensor<T> ExtractTensor(Tensor<T> audio)

Parameters

audio Tensor<T>

Returns

Tensor<T>

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the layers for the speaker embedding model.

protected override void InitializeLayers()

Remarks

Follows the golden standard pattern:

  1. Check if in native mode (ONNX mode returns early)
  2. Use Architecture.Layers if provided by user
  3. Fall back to LayerHelper.CreateDefaultSpeakerEmbeddingLayers() otherwise

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>
expectedOutput Tensor<T>
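
Example of a minimal training loop (sketch; trainingPairs is an assumed collection of audio/target-embedding tensor pairs, and batching and learning-rate details depend on the optimizer supplied at construction):

var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
for (int epoch = 0; epoch < numEpochs; epoch++)
{
    foreach (var (audio, target) in trainingPairs)
    {
        extractor.Train(audio, target); // one optimization step toward the target embedding
    }
}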

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>