Interface ISpeakerEmbeddingExtractor<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for speaker embedding extraction models (d-vector/x-vector extraction).

public interface ISpeakerEmbeddingExtractor<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Speaker embedding extractors convert voice audio into fixed-length vectors that capture the unique characteristics of a speaker's voice. These embeddings enable speaker verification, identification, and diarization tasks.

For Beginners: Speaker embeddings are like a "voiceprint" - a compact representation of what makes someone's voice unique.

How speaker embeddings work:

  1. Audio of someone speaking is fed into the model
  2. The model outputs a fixed-size vector (e.g., 256 or 512 numbers)
  3. This vector captures voice characteristics (pitch, timbre, accent, etc.)
  4. Vectors from the same speaker end up close together; vectors from different speakers end up far apart
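
The sketch below traces these four steps using the methods defined on this interface. Obtaining a concrete implementation and loading audio are placeholders: GetExtractor and LoadAudio are hypothetical helpers, and the waveforms are assumed to already match the model's expected sample rate.

// Obtain an implementation (the concrete class depends on your setup;
// GetExtractor is a hypothetical factory).
ISpeakerEmbeddingExtractor<float> extractor = GetExtractor();

// LoadAudio is a hypothetical helper returning a waveform tensor [samples]
// sampled at extractor.SampleRate.
Tensor<float> audioA = LoadAudio("alice_1.wav");
Tensor<float> audioB = LoadAudio("unknown.wav");

// Steps 1-3: each clip becomes a fixed-size voiceprint vector.
Tensor<float> embeddingA = extractor.ExtractEmbedding(audioA);
Tensor<float> embeddingB = extractor.ExtractEmbedding(audioB);

// Step 4: similar vectors (score near 1.0) suggest the same speaker.
float score = extractor.ComputeSimilarity(embeddingA, embeddingB);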

Common use cases:

  • Voice authentication ("Is this person who they claim to be?")
  • Speaker identification ("Who is speaking?")
  • Voice cloning (TTS with specific voice)
  • Meeting transcription (separating speakers)

Key concepts:

  • d-vector: An early embedding approach based on a deep neural network (DNN)
  • x-vector: A modern approach using a time-delay neural network (TDNN) with statistics pooling
  • ECAPA-TDNN: A state-of-the-art architecture for speaker embedding extraction

This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.

Properties

EmbeddingDimension

Gets the dimension of output speaker embeddings.

int EmbeddingDimension { get; }

Property Value

int

Remarks

Common values: 192, 256, or 512. Higher dimensions may capture more nuance but require more storage and computation.

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
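
A caller might branch on this flag before deciding whether training is possible; a minimal sketch, where extractor is the instance assumed earlier:

if (extractor.IsOnnxMode)
{
    // Pre-trained ONNX weights: the model is ready for inference as-is.
}
else
{
    // Trainable mode: the model can be trained from scratch through the
    // inherited IFullModel infrastructure before extracting embeddings.
}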

MinimumDurationSeconds

Gets the minimum audio duration required for reliable embedding extraction.

double MinimumDurationSeconds { get; }

Property Value

double

Remarks

For Beginners: Very short audio clips may not contain enough voice information for accurate speaker representation. This property tells you the minimum length needed for reliable results.
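
A caller can combine this property with SampleRate to reject clips that are too short; a minimal sketch, assuming audio is a 1-D waveform tensor whose sample count is exposed via a Length property (the exact Tensor<T> member may differ):

double durationSeconds = (double)audio.Length / extractor.SampleRate;
if (durationSeconds < extractor.MinimumDurationSeconds)
{
    throw new ArgumentException(
        $"Clip is {durationSeconds:F2}s long; at least " +
        $"{extractor.MinimumDurationSeconds}s is needed for a reliable embedding.");
}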

SampleRate

Gets the expected sample rate for input audio.

int SampleRate { get; }

Property Value

int

Remarks

Typically 16000 Hz for speaker recognition models.

Methods

AggregateEmbeddings(IReadOnlyList<Tensor<T>>)

Aggregates multiple embeddings into a single representative embedding.

Tensor<T> AggregateEmbeddings(IReadOnlyList<Tensor<T>> embeddings)

Parameters

embeddings IReadOnlyList<Tensor<T>>

Collection of embeddings from the same speaker.

Returns

Tensor<T>

Aggregated embedding representing the speaker.

Remarks

For Beginners: If you have multiple recordings of the same person, this combines them into one stronger voiceprint. More samples generally yield a more accurate representation of the speaker.
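
A typical enrollment flow is sketched below; recordings is an assumed collection of waveform tensors from one speaker:

// Extract one embedding per recording of the speaker.
var embeddings = new List<Tensor<float>>();
foreach (Tensor<float> recording in recordings)
{
    embeddings.Add(extractor.ExtractEmbedding(recording));
}

// Combine them into a single, more robust voiceprint for later comparisons.
Tensor<float> voiceprint = extractor.AggregateEmbeddings(embeddings);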

ComputeSimilarity(Tensor<T>, Tensor<T>)

Computes similarity between two speaker embeddings.

T ComputeSimilarity(Tensor<T> embedding1, Tensor<T> embedding2)

Parameters

embedding1 Tensor<T>

First speaker embedding.

embedding2 Tensor<T>

Second speaker embedding.

Returns

T

Similarity score, typically cosine similarity. For normalized speaker embeddings, scores usually fall between 0 and 1, with higher values indicating greater similarity.

Remarks

For Beginners: This tells you how similar two voiceprints are.

  • Score close to 1.0: Likely the same speaker
  • Score close to 0.0: Likely different speakers
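
A simple verification check is sketched below; the 0.7 threshold is purely illustrative, since real systems tune the decision threshold on held-out data for the chosen model:

// Compare a stored voiceprint against an embedding from a live sample.
float score = extractor.ComputeSimilarity(storedVoiceprint, liveEmbedding);

// 0.7 is an assumed, illustrative threshold; tune it for your model and data.
bool accepted = score >= 0.7f;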

ExtractEmbedding(Tensor<T>)

Extracts speaker embedding from audio.

Tensor<T> ExtractEmbedding(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [samples] or [batch, samples].

Returns

Tensor<T>

Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].

Remarks

For Beginners: This is the main method for extracting a voiceprint.

  • Pass in audio of someone speaking
  • Get back a compact vector representing their voice
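
Both single-clip and batched calls are supported, as sketched here; clip and clips are assumed waveform tensors of the shapes noted in the comments:

// Single clip: shape [samples] in, [embedding_dim] out.
Tensor<float> embedding = extractor.ExtractEmbedding(clip);

// Batched clips: shape [batch, samples] in, [batch, embedding_dim] out.
Tensor<float> embeddings = extractor.ExtractEmbedding(clips);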

ExtractEmbeddingAsync(Tensor<T>, CancellationToken)

Extracts speaker embedding from audio asynchronously.

Task<Tensor<T>> ExtractEmbeddingAsync(Tensor<T> audio, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

Audio waveform tensor [samples] or [batch, samples].

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<Tensor<T>>

Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].
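
A minimal async usage sketch, inside an async method; the 10-second timeout is illustrative:

// Cancel the extraction automatically if it takes longer than 10 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
Tensor<float> embedding = await extractor.ExtractEmbeddingAsync(audio, cts.Token);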

ExtractEmbeddings(IReadOnlyList<Tensor<T>>)

Extracts embeddings from multiple audio segments.

IReadOnlyList<Tensor<T>> ExtractEmbeddings(IReadOnlyList<Tensor<T>> audioSegments)

Parameters

audioSegments IReadOnlyList<Tensor<T>>

List of audio waveform tensors.

Returns

IReadOnlyList<Tensor<T>>

List of speaker embedding tensors.

Remarks

Useful for processing multiple utterances from the same recording or comparing embeddings across different audio files.
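
For example, segments cut from one recording can be embedded in a single call and then merged into one voiceprint; segments is an assumed list of waveform tensors:

IReadOnlyList<Tensor<float>> segmentEmbeddings = extractor.ExtractEmbeddings(segments);
Tensor<float> speakerProfile = extractor.AggregateEmbeddings(segmentEmbeddings);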

NormalizeEmbedding(Tensor<T>)

Normalizes an embedding for comparison (typically L2 normalization).

Tensor<T> NormalizeEmbedding(Tensor<T> embedding)

Parameters

embedding Tensor<T>

The embedding to normalize.

Returns

Tensor<T>

Normalized embedding with unit length.
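
Remarks

L2 normalization divides every component by the embedding's Euclidean norm, so the result has unit length; cosine similarity between two unit-length vectors is then simply their dot product. A minimal sketch, reusing the extractor and embeddings assumed in the earlier examples:

// Normalize both embeddings to unit length before comparing them.
Tensor<float> normA = extractor.NormalizeEmbedding(embeddingA);
Tensor<float> normB = extractor.NormalizeEmbedding(embeddingB);

// With unit-length inputs, cosine similarity reduces to a dot product.
float score = extractor.ComputeSimilarity(normA, normB);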