Interface ISpeakerEmbeddingExtractor<T>
- Namespace
- AiDotNet.Interfaces
- Assembly
- AiDotNet.dll
Interface for speaker embedding extraction models (d-vector/x-vector extraction).
public interface ISpeakerEmbeddingExtractor<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Speaker embedding extractors convert voice audio into fixed-length vectors that capture the unique characteristics of a speaker's voice. These embeddings enable speaker verification, identification, and diarization tasks.
For Beginners: Speaker embeddings are like a "voiceprint" - a compact representation of what makes someone's voice unique.
How speaker embeddings work:
- Audio of someone speaking is fed into the model
- The model outputs a fixed-size vector (e.g., 256 or 512 numbers)
- This vector captures voice characteristics (pitch, timbre, accent, etc.)
- Vectors from the same speaker are similar; vectors from different speakers are dissimilar
Common use cases:
- Voice authentication ("Is this person who they claim to be?")
- Speaker identification ("Who is speaking?")
- Voice cloning (TTS with specific voice)
- Meeting transcription (separating speakers)
Key concepts:
- d-vector: an early embedding approach based on a deep neural network (DNN)
- x-vector: a modern approach using a time-delay neural network (TDNN) with statistics pooling
- ECAPA-TDNN: a state-of-the-art speaker embedding architecture
This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
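For example, a minimal verification-style workflow is sketched below. CreateExtractor and LoadAudio are hypothetical placeholders for obtaining a concrete implementation and loading resampled audio; they are not part of this interface, and only the members documented on this page are real.

// Sketch: compare two recordings using members of ISpeakerEmbeddingExtractor<T>.
// CreateExtractor() and LoadAudio(...) are hypothetical helpers, not AiDotNet API.
ISpeakerEmbeddingExtractor<float> extractor = CreateExtractor();

Tensor<float> audioA = LoadAudio("enrolled.wav", extractor.SampleRate);
Tensor<float> audioB = LoadAudio("unknown.wav", extractor.SampleRate);

Tensor<float> embeddingA = extractor.ExtractEmbedding(audioA);
Tensor<float> embeddingB = extractor.ExtractEmbedding(audioB);

float similarity = extractor.ComputeSimilarity(embeddingA, embeddingB);
// A score near 1.0 suggests the same speaker; near 0.0 suggests different speakers.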
Properties
EmbeddingDimension
Gets the dimension of output speaker embeddings.
int EmbeddingDimension { get; }
Property Value
- int
Remarks
Common values: 192, 256, or 512. Higher dimensions may capture more nuance but require more storage and computation.
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
Remarks
When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
MinimumDurationSeconds
Gets the minimum audio duration required for reliable embedding extraction.
double MinimumDurationSeconds { get; }
Property Value
- double
Remarks
For Beginners: Very short audio clips may not contain enough voice information for accurate speaker representation. This property tells you the minimum length needed for reliable results.
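A caller might validate clip length before extraction, roughly as sketched below. This assumes the last tensor dimension holds raw samples and that Tensor<T> exposes a Shape array; both are assumptions for illustration, not documented API, and extractor and audio are assumed to be in scope.

// Sketch: reject clips that are too short for a reliable embedding.
// Assumes audio.Shape exists and its last dimension is the sample count.
int samples = audio.Shape[audio.Shape.Length - 1];
double durationSeconds = (double)samples / extractor.SampleRate;

if (durationSeconds < extractor.MinimumDurationSeconds)
{
    throw new ArgumentException(
        $"Clip is {durationSeconds:F2}s; at least {extractor.MinimumDurationSeconds:F2}s is needed.");
}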
SampleRate
Gets the expected sample rate for input audio.
int SampleRate { get; }
Property Value
- int
Remarks
Typically 16000 Hz for speaker recognition models.
Methods
AggregateEmbeddings(IReadOnlyList<Tensor<T>>)
Aggregates multiple embeddings into a single representative embedding.
Tensor<T> AggregateEmbeddings(IReadOnlyList<Tensor<T>> embeddings)
Parameters
embeddings (IReadOnlyList<Tensor<T>>): Collection of embeddings from the same speaker.
Returns
- Tensor<T>
Aggregated embedding representing the speaker.
Remarks
For Beginners: If you have multiple recordings of the same person, this combines them into one stronger voiceprint. More samples = better accuracy.
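A typical enrollment flow, sketched below, extracts one embedding per clip and aggregates them into a single reference voiceprint. Here clips is assumed to be a pre-loaded IReadOnlyList<Tensor<float>> of recordings from one speaker.

// Sketch: build a single reference voiceprint from several recordings.
IReadOnlyList<Tensor<float>> embeddings = extractor.ExtractEmbeddings(clips);
Tensor<float> voiceprint = extractor.AggregateEmbeddings(embeddings);
// Store the voiceprint and compare future recordings against it.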
ComputeSimilarity(Tensor<T>, Tensor<T>)
Computes similarity between two speaker embeddings.
T ComputeSimilarity(Tensor<T> embedding1, Tensor<T> embedding2)
Parameters
embedding1 (Tensor<T>): First speaker embedding.
embedding2 (Tensor<T>): Second speaker embedding.
Returns
- T
Similarity score, typically cosine similarity; higher values indicate more similar voices. Cosine similarity formally ranges from -1 to 1, though scores between speaker embeddings usually fall between 0 and 1.
Remarks
For Beginners: This tells you how similar two voiceprints are.
- Score close to 1.0: likely the same speaker
- Score close to 0.0: likely different speakers
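In verification, the score is usually compared against a tuned threshold. The sketch below uses 0.7 purely as an illustrative value, not a library default; real systems calibrate the threshold on held-out data. The enrolledVoiceprint and candidateEmbedding tensors are assumed to exist already.

// Sketch: accept/reject decision from a similarity score.
float score = extractor.ComputeSimilarity(enrolledVoiceprint, candidateEmbedding);
bool accepted = score >= 0.7f; // 0.7 is illustrative, not a recommended value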
ExtractEmbedding(Tensor<T>)
Extracts speaker embedding from audio.
Tensor<T> ExtractEmbedding(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples] or [batch, samples].
Returns
- Tensor<T>
Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].
Remarks
For Beginners: This is the main method for extracting a voiceprint.
- Pass in audio of someone speaking
- Get back a compact vector representing their voice
ExtractEmbeddingAsync(Tensor<T>, CancellationToken)
Extracts speaker embedding from audio asynchronously.
Task<Tensor<T>> ExtractEmbeddingAsync(Tensor<T> audio, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples] or [batch, samples].
cancellationToken (CancellationToken): Cancellation token for the async operation.
Returns
- Task<Tensor<T>>
Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].
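A minimal sketch of cancellable extraction inside an async method; the extractor and audio tensor are assumed to be set up already.

// Sketch: extract an embedding without blocking, with a 10-second timeout.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
Tensor<float> embedding = await extractor.ExtractEmbeddingAsync(audio, cts.Token);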
ExtractEmbeddings(IReadOnlyList<Tensor<T>>)
Extracts embeddings from multiple audio segments.
IReadOnlyList<Tensor<T>> ExtractEmbeddings(IReadOnlyList<Tensor<T>> audioSegments)
Parameters
audioSegments (IReadOnlyList<Tensor<T>>): List of audio waveform tensors.
Returns
- IReadOnlyList<Tensor<T>>
List of speaker embedding tensors.
Remarks
Useful for processing multiple utterances from the same recording or comparing embeddings across different audio files.
NormalizeEmbedding(Tensor<T>)
Normalizes an embedding for comparison (typically L2 normalization).
Tensor<T> NormalizeEmbedding(Tensor<T> embedding)
Parameters
embedding (Tensor<T>): The embedding to normalize.
Returns
- Tensor<T>
Normalized embedding with unit length.
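Remarks
L2 normalization scales an embedding v to v / ||v||, giving it unit length; the cosine similarity of two unit-length vectors is then simply their dot product, which makes normalized embeddings cheap to compare and store. A minimal usage sketch, with rawEmbedding assumed to come from ExtractEmbedding:

// Sketch: normalize an embedding before storing or comparing it.
Tensor<float> normalized = extractor.NormalizeEmbedding(rawEmbedding);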