Class SpeakerRecognitionBase<T>

Namespace
AiDotNet.Audio.Speaker
Assembly
AiDotNet.dll

Base class for speaker recognition models (embedding extraction, verification, diarization).

public abstract class SpeakerRecognitionBase<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable

Type Parameters

T

The numeric type used for calculations.

Inheritance
SpeakerRecognitionBase<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

Speaker recognition encompasses tasks that identify or verify speakers based on their voice. This base class provides common functionality for:

  • Speaker embedding extraction (d-vectors, x-vectors)
  • Speaker verification (is this the claimed speaker?)
  • Speaker diarization (who spoke when?)

For Beginners: Speaker recognition is like voice fingerprinting. Just as fingerprints are unique to each person, voice characteristics (pitch, speaking style, accent) can identify individuals.

This base class provides:

  • Feature extraction utilities (MFCCs, spectral features)
  • Embedding dimension management
  • Similarity computation methods

Constructors

SpeakerRecognitionBase(NeuralNetworkArchitecture<T>, ILossFunction<T>?)

Initializes a new instance of the SpeakerRecognitionBase class.

protected SpeakerRecognitionBase(NeuralNetworkArchitecture<T> architecture, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture.

lossFunction ILossFunction<T>

The loss function to use. If null, a default MSE loss is used.

Properties

EmbeddingDimension

Gets the dimension of output speaker embeddings.

public int EmbeddingDimension { get; protected set; }

Property Value

int

Remarks

Common values: 192, 256, or 512. Higher dimensions may capture more nuance but require more storage and computation.
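
To make the storage side of this trade-off concrete, the following sketch estimates the memory needed to hold a set of embeddings. It is an illustration only: `EmbeddingStorageSketch` and `StorageBytes` are hypothetical names, not part of AiDotNet, and the calculation assumes each value is stored as a fixed-width number (for example 4 bytes for `float`) with no container overhead.

```csharp
using System;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class EmbeddingStorageSketch
{
    // Bytes needed to store `speakerCount` embeddings, each holding
    // `dimension` values that are `bytesPerValue` bytes wide.
    // Assumption: raw values only, no per-object or index overhead.
    public static long StorageBytes(int dimension, long speakerCount, int bytesPerValue)
    {
        if (dimension <= 0) throw new ArgumentOutOfRangeException(nameof(dimension));
        if (speakerCount < 0) throw new ArgumentOutOfRangeException(nameof(speakerCount));
        if (bytesPerValue <= 0) throw new ArgumentOutOfRangeException(nameof(bytesPerValue));

        return (long)dimension * speakerCount * bytesPerValue;
    }
}
```

Under these assumptions, one 192-dimensional `float` embedding takes 768 bytes and one 512-dimensional embedding takes 2,048 bytes, so the cost of a larger EmbeddingDimension grows linearly with the number of enrolled speakers.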

MfccExtractor

Gets or sets the MFCC (Mel-frequency cepstral coefficient) extractor used for preprocessing.

protected MfccExtractor<T>? MfccExtractor { get; set; }

Property Value

MfccExtractor<T>

Methods

AggregateEmbeddings(IReadOnlyList<Tensor<T>>)

Aggregates multiple embeddings into a single representative embedding.

protected Tensor<T> AggregateEmbeddings(IReadOnlyList<Tensor<T>> embeddings)

Parameters

embeddings IReadOnlyList<Tensor<T>>

Collection of embeddings to aggregate.

Returns

Tensor<T>

Aggregated embedding (normalized mean).

Remarks

For Beginners: If you have multiple recordings of the same person, this combines them into one stronger voiceprint by averaging and normalizing.
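
The "normalized mean" described above can be sketched on plain arrays. This is a minimal illustration, not the library's implementation: `EmbeddingAggregationSketch` is a hypothetical class, `double[]` stands in for `Tensor<T>`, and returning an all-zero mean unchanged is an assumption about edge-case handling.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration; not part of AiDotNet.
public static class EmbeddingAggregationSketch
{
    // Averages the embeddings element-wise, then scales the mean to unit L2 length.
    public static double[] AggregateEmbeddings(IReadOnlyList<double[]> embeddings)
    {
        if (embeddings == null || embeddings.Count == 0)
            throw new ArgumentException("At least one embedding is required.", nameof(embeddings));

        int dim = embeddings[0].Length;
        var mean = new double[dim];

        foreach (var embedding in embeddings)
        {
            if (embedding.Length != dim)
                throw new ArgumentException("All embeddings must have the same dimension.", nameof(embeddings));

            for (int i = 0; i < dim; i++)
                mean[i] += embedding[i] / embeddings.Count;
        }

        // L2 norm of the mean vector.
        double sumOfSquares = 0.0;
        foreach (double value in mean)
            sumOfSquares += value * value;
        double norm = Math.Sqrt(sumOfSquares);

        // Assumption: an all-zero mean is returned as-is to avoid dividing by zero.
        if (norm == 0.0)
            return mean;

        for (int i = 0; i < dim; i++)
            mean[i] /= norm;

        return mean;
    }
}
```

For example, aggregating (3, 0) and (0, 4) gives the mean (1.5, 2), whose length is 2.5, so the result is (0.6, 0.8).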

ComputeCosineSimilarity(Tensor<T>, Tensor<T>)

Computes cosine similarity between two speaker embedding tensors.

protected T ComputeCosineSimilarity(Tensor<T> embedding1, Tensor<T> embedding2)

Parameters

embedding1 Tensor<T>

First speaker embedding tensor.

embedding2 Tensor<T>

Second speaker embedding tensor.

Returns

T

Cosine similarity score between -1 and 1.

ComputeCosineSimilarity(Vector<T>, Vector<T>)

Computes cosine similarity between two speaker embeddings.

protected T ComputeCosineSimilarity(Vector<T> embedding1, Vector<T> embedding2)

Parameters

embedding1 Vector<T>

First speaker embedding vector.

embedding2 Vector<T>

Second speaker embedding vector.

Returns

T

Cosine similarity score between -1 and 1.

Remarks

For Beginners: Cosine similarity measures how similar two embeddings are.

  • Score close to 1.0: Very similar (likely same speaker)
  • Score close to 0.0: Not similar
  • Score close to -1.0: Opposite (very different)
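
The score interpretation above follows from the standard cosine formula, the dot product divided by the product of the two vector lengths. The sketch below shows that formula on plain arrays. It is not the library's code: `CosineSimilaritySketch` and `IsSameSpeaker` are hypothetical names, `double[]` stands in for `Vector<T>`, returning 0 for a zero-length vector is an assumption, and the decision threshold is a caller-supplied value that would need tuning on real data.

```csharp
using System;

// Hypothetical stand-in for illustration; not part of AiDotNet.
public static class CosineSimilaritySketch
{
    // cos(a, b) = (a . b) / (|a| * |b|), which always lies in [-1, 1].
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double dot = 0.0, sumSquaresA = 0.0, sumSquaresB = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            sumSquaresA += a[i] * a[i];
            sumSquaresB += b[i] * b[i];
        }

        // Assumption: a zero-length vector has no direction, so report 0.
        if (sumSquaresA == 0.0 || sumSquaresB == 0.0)
            return 0.0;

        return dot / (Math.Sqrt(sumSquaresA) * Math.Sqrt(sumSquaresB));
    }

    // Hypothetical verification decision: accept when the score reaches the threshold.
    public static bool IsSameSpeaker(double[] enrolled, double[] probe, double threshold)
    {
        return CosineSimilarity(enrolled, probe) >= threshold;
    }
}
```

Identical directions score 1, perpendicular vectors score 0, and opposite directions score -1, which matches the three cases listed above.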

CreateMfccExtractor(int, int)

Creates an MFCC extractor for preprocessing speaker audio.

protected MfccExtractor<T> CreateMfccExtractor(int sampleRate = 16000, int numCoeffs = 40)

Parameters

sampleRate int

Sample rate of the input audio, in Hz.

numCoeffs int

Number of MFCCs (cepstral coefficients) to extract.

Returns

MfccExtractor<T>

A configured MFCC extractor.

NormalizeEmbedding(Tensor<T>)

Normalizes an embedding to unit length (L2 normalization).

protected Tensor<T> NormalizeEmbedding(Tensor<T> embedding)

Parameters

embedding Tensor<T>

The embedding to normalize.

Returns

Tensor<T>

Normalized embedding with unit length.

Remarks

For Beginners: Normalizing embeddings makes them easier to compare. After normalization, all embeddings have length 1, so cosine similarity becomes equivalent to a simple dot product.
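
The equivalence between cosine similarity and the dot product of unit-length vectors can be checked with a short sketch. This is an illustration under stated assumptions, not the library's implementation: `EmbeddingNormalizationSketch`, `Normalize`, and `Dot` are hypothetical names, `double[]` stands in for `Tensor<T>`, and leaving an all-zero embedding unchanged is an assumed edge-case behavior.

```csharp
using System;

// Hypothetical stand-in for illustration; not part of AiDotNet.
public static class EmbeddingNormalizationSketch
{
    // Returns a copy of the embedding scaled to unit L2 length.
    public static double[] Normalize(double[] embedding)
    {
        if (embedding == null) throw new ArgumentNullException(nameof(embedding));

        double sumOfSquares = 0.0;
        foreach (double value in embedding)
            sumOfSquares += value * value;
        double norm = Math.Sqrt(sumOfSquares);

        var result = (double[])embedding.Clone();

        // Assumption: an all-zero embedding cannot be scaled, so return the copy as-is.
        if (norm == 0.0)
            return result;

        for (int i = 0; i < result.Length; i++)
            result[i] /= norm;

        return result;
    }

    // Plain dot product; for unit-length inputs this equals their cosine similarity.
    public static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
```

For example, (3, 4) normalizes to (0.6, 0.8) and (4, 3) normalizes to (0.8, 0.6). Their dot product is 0.96, the same value as the cosine similarity of the original vectors, 24 / (5 × 5).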