Class SpeakerEmbeddingExtractor<T>
Extracts speaker embeddings (d-vectors) from audio for speaker recognition.
public class SpeakerEmbeddingExtractor<T> : SpeakerRecognitionBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeakerEmbeddingExtractor<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance: SpeakerRecognitionBase<T> → SpeakerEmbeddingExtractor<T>
Remarks
Speaker embeddings are compact vector representations that capture the unique characteristics of a speaker's voice. These can be used for speaker verification (is this the same person?) and speaker identification (who is speaking?).
For Beginners: Each person's voice has unique characteristics like pitch, rhythm, and timbre (tone color). This class converts audio into a numerical "fingerprint" of the speaker's voice.
These embeddings are vectors (lists of numbers) that are:
- Close together for the same speaker
- Far apart for different speakers
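A sketch of how this property is typically used for speaker verification, combining the extraction and similarity methods documented below. The threshold value and variable names are illustrative assumptions, not library defaults:

```csharp
// Hypothetical verification check: extract an embedding for each utterance,
// then compare them with cosine similarity.
var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
var embeddingA = extractor.ExtractEmbedding(audioA);
var embeddingB = extractor.ExtractEmbedding(audioB);

float similarity = extractor.ComputeSimilarity(embeddingA, embeddingB);

// The decision threshold is application-specific; 0.7 is only an example.
bool sameSpeaker = similarity > 0.7f;
```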
Usage (ONNX Mode):

```csharp
var extractor = new SpeakerEmbeddingExtractor<float>(
    architecture,
    modelPath: "speaker_model.onnx");
var embedding = extractor.ExtractEmbedding(audio);
```

Usage (Native Training Mode):

```csharp
var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
extractor.Train(audioInput, expectedEmbedding);
```
Constructors
SpeakerEmbeddingExtractor()
Creates a SpeakerEmbeddingExtractor with default settings for native training mode.
public SpeakerEmbeddingExtractor()
Remarks
For Beginners: This is the simplest way to create a speaker embedding extractor. It uses default settings suitable for most use cases.
SpeakerEmbeddingExtractor(SpeakerEmbeddingOptions)
Creates a SpeakerEmbeddingExtractor with custom options.
public SpeakerEmbeddingExtractor(SpeakerEmbeddingOptions options)
Parameters
options (SpeakerEmbeddingOptions): Configuration options for the extractor.
Remarks
For Beginners: Use this constructor to customize sample rate, embedding dimension, etc.
SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T>, int, int, double, int, int, int, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a SpeakerEmbeddingExtractor for native training mode.
public SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T> architecture, int sampleRate = 16000, int embeddingDimension = 256, double minimumDurationSeconds = 0.5, int hiddenDim = 256, int numEncoderLayers = 3, int numHeads = 4, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
sampleRate (int): Expected sample rate for input audio. Default is 16000.
embeddingDimension (int): Dimension of output embeddings. Default is 256.
minimumDurationSeconds (double): Minimum audio duration for reliable embedding. Default is 0.5.
hiddenDim (int): Hidden dimension for encoder layers. Default is 256.
numEncoderLayers (int): Number of encoder layers. Default is 3.
numHeads (int): Number of attention heads. Default is 4.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optimizer for training. If null, AdamW is used.
lossFunction (ILossFunction<T>?): Loss function for training. If null, MSE loss is used.
Remarks
For Beginners: Use this constructor to train your own speaker embedding model.
Example:

```csharp
var extractor = new SpeakerEmbeddingExtractor<float>(architecture);
extractor.Train(audioInput, expectedEmbedding);
```
SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T>, string, int, int, double, OnnxModelOptions?)
Creates a SpeakerEmbeddingExtractor for ONNX inference with a pretrained model.
public SpeakerEmbeddingExtractor(NeuralNetworkArchitecture<T> architecture, string modelPath, int sampleRate = 16000, int embeddingDimension = 256, double minimumDurationSeconds = 0.5, OnnxModelOptions? onnxOptions = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
modelPath (string): Required path to the speaker embedding ONNX model.
sampleRate (int): Expected sample rate for input audio. Default is 16000.
embeddingDimension (int): Dimension of output embeddings. Default is 256.
minimumDurationSeconds (double): Minimum audio duration for reliable embedding. Default is 0.5.
onnxOptions (OnnxModelOptions?): ONNX runtime options.
Remarks
For Beginners: Use this constructor when you have a pretrained speaker embedding model.
Example:

```csharp
var extractor = new SpeakerEmbeddingExtractor<float>(
    architecture,
    modelPath: "ecapa_tdnn.onnx");
```
Properties
HasNeuralModel
Gets whether a neural model is loaded.
public bool HasNeuralModel { get; }
Property Value
- bool
IsOnnxMode
Gets whether the model is in ONNX inference mode.
public bool IsOnnxMode { get; }
Property Value
- bool
MinimumDurationSeconds
Gets the minimum audio duration required for reliable embedding extraction.
public double MinimumDurationSeconds { get; }
Property Value
- double
Methods
ComputeSimilarity(SpeakerEmbedding<T>, SpeakerEmbedding<T>)
Computes cosine similarity between two speaker embeddings (legacy API).
public T ComputeSimilarity(SpeakerEmbedding<T> embedding1, SpeakerEmbedding<T> embedding2)
Parameters
embedding1 (SpeakerEmbedding<T>)
embedding2 (SpeakerEmbedding<T>)
Returns
- T
ComputeSimilarity(Tensor<T>, Tensor<T>)
Computes similarity between two speaker embeddings.
public T ComputeSimilarity(Tensor<T> embedding1, Tensor<T> embedding2)
Parameters
embedding1 (Tensor<T>)
embedding2 (Tensor<T>)
Returns
- T
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader)
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool)
Extract(Tensor<T>)
Extracts a speaker embedding from audio (legacy API).
public SpeakerEmbedding<T> Extract(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio samples as a tensor.
Returns
- SpeakerEmbedding<T>
Speaker embedding result.
Extract(Vector<T>)
Extracts a speaker embedding from audio (legacy API).
public SpeakerEmbedding<T> Extract(Vector<T> audio)
Parameters
audio (Vector<T>): Audio samples as a vector.
Returns
- SpeakerEmbedding<T>
Speaker embedding result.
ExtractBatch(IEnumerable<Tensor<T>>)
Extracts embeddings from multiple audio segments (legacy API).
public List<SpeakerEmbedding<T>> ExtractBatch(IEnumerable<Tensor<T>> segments)
Parameters
segments (IEnumerable<Tensor<T>>)
Returns
- List<SpeakerEmbedding<T>>
ExtractEmbedding(Tensor<T>)
Extracts a speaker embedding from audio.
public Tensor<T> ExtractEmbedding(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [samples] or [batch, samples].
Returns
- Tensor<T>
Speaker embedding tensor [embedding_dim] or [batch, embedding_dim].
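For example, following the input and output shapes stated above (variable names are illustrative):

```csharp
// Single utterance: [samples] -> [embedding_dim]
Tensor<float> single = extractor.ExtractEmbedding(monoAudio);

// Batched input: [batch, samples] -> [batch, embedding_dim]
Tensor<float> batched = extractor.ExtractEmbedding(batchAudio);
```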
ExtractEmbeddingAsync(Tensor<T>, CancellationToken)
Extracts a speaker embedding from audio asynchronously.
public Task<Tensor<T>> ExtractEmbeddingAsync(Tensor<T> audio, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>)
cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
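A minimal usage sketch, assuming an existing extractor instance and audio tensor; the 5-second timeout is an illustrative value, not a library default:

```csharp
// Cancel extraction if it runs longer than the timeout.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
Tensor<float> embedding = await extractor.ExtractEmbeddingAsync(audio, cts.Token);
```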
ExtractEmbeddings(IReadOnlyList<Tensor<T>>)
Extracts embeddings from multiple audio segments.
public IReadOnlyList<Tensor<T>> ExtractEmbeddings(IReadOnlyList<Tensor<T>> audioSegments)
Parameters
audioSegments (IReadOnlyList<Tensor<T>>)
Returns
- IReadOnlyList<Tensor<T>>
ExtractTensor(Tensor<T>)
Extracts speaker embedding from audio as a Tensor.
public Tensor<T> ExtractTensor(Tensor<T> audio)
Parameters
audio (Tensor<T>)
Returns
- Tensor<T>
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the layers for the speaker embedding model.
protected override void InitializeLayers()
Remarks
Follows the golden standard pattern:
- Check if in native mode (ONNX mode returns early)
- Use Architecture.Layers if provided by user
- Fall back to LayerHelper.CreateDefaultSpeakerEmbeddingLayers() otherwise
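The pattern described above can be sketched as follows. This is an illustrative outline only, not the actual implementation; member names such as Layers and Architecture.Layers are assumptions inferred from the remarks:

```csharp
protected override void InitializeLayers()
{
    // ONNX mode runs inference through the loaded ONNX graph,
    // so no native layers are built.
    if (IsOnnxMode)
        return;

    // Prefer user-supplied layers when the architecture provides them.
    if (Architecture.Layers != null && Architecture.Layers.Count > 0)
    {
        Layers.AddRange(Architecture.Layers);
        return;
    }

    // Otherwise fall back to the default layer stack.
    Layers.AddRange(LayerHelper.CreateDefaultSpeakerEmbeddingLayers());
}
```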
PostprocessOutput(Tensor<T>)
Postprocesses model output into the final result format.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter)
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>)
expectedOutput (Tensor<T>)
UpdateParameters(Vector<T>)
Updates model parameters.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parametersVector<T>