Class Wav2Vec2Model<T>

Namespace: AiDotNet.Audio.SpeechRecognition
Assembly: AiDotNet.dll

Wav2Vec2 self-supervised speech recognition model.

public class Wav2Vec2Model<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeechRecognizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
Wav2Vec2Model<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ISpeechRecognizer<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

Wav2Vec2 is a self-supervised learning model for speech recognition developed by Meta AI. It learns representations from raw audio through contrastive learning, then can be fine-tuned for speech recognition tasks.

For Beginners: Wav2Vec2 works differently from traditional speech recognition:

  1. It processes raw audio directly (no mel spectrograms needed)
  2. It learns speech patterns from unlabeled audio data
  3. It can be fine-tuned with small amounts of labeled data

Architecture:

  • Convolutional feature encoder: Processes raw audio into features
  • Transformer encoder: Captures long-range dependencies in speech
  • CTC head: Aligns speech frames to text via Connectionist Temporal Classification (see the decoding sketch below)
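
To make the CTC step concrete, here is a minimal greedy CTC decoding sketch. It is illustrative only: the method name, the [timeSteps, vocabSize] logits layout, and the blank-at-index-0 convention are assumptions, not part of this API (Transcribe performs decoding internally).

// Hypothetical greedy CTC decode: take the best token per time step,
// collapse consecutive repeats, then drop blanks.
string GreedyCtcDecode(float[,] logits, string[] vocabulary, int blankIndex = 0)
{
    var text = new System.Text.StringBuilder();
    int previous = -1;
    for (int t = 0; t < logits.GetLength(0); t++)
    {
        // Pick the most likely vocabulary entry at this time step.
        int best = 0;
        for (int v = 1; v < logits.GetLength(1); v++)
            if (logits[t, v] > logits[t, best]) best = v;

        // CTC rule: emit only when the token is not blank and not a repeat.
        if (best != blankIndex && best != previous)
            text.Append(vocabulary[best]);
        previous = best;
    }
    return text.ToString();
}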

Two ways to use this class:

  1. ONNX Mode: Load pretrained Wav2Vec2 models for fast inference
  2. Native Mode: Train your own speech recognition model from scratch

ONNX Mode Example:

var wav2vec2 = new Wav2Vec2Model<float>(
    architecture,
    modelPath: "path/to/wav2vec2.onnx");
var result = wav2vec2.Transcribe(audioTensor);
Console.WriteLine(result.Text);

Training Mode Example:

var wav2vec2 = new Wav2Vec2Model<float>(architecture);
for (int epoch = 0; epoch < 100; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        wav2vec2.Train(audio, tokens);
    }
}

Constructors

Wav2Vec2Model(NeuralNetworkArchitecture<T>, string?, int, int, int, int, int, int, string[]?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a Wav2Vec2 network for training from scratch using native layers.

public Wav2Vec2Model(NeuralNetworkArchitecture<T> architecture, string? language = "en", int sampleRate = 16000, int maxAudioLengthSeconds = 30, int hiddenDim = 768, int numTransformerLayers = 12, int numHeads = 12, int ffDim = 3072, string[]? vocabulary = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

language string

Target language code (e.g., "en", "es"). Default is "en".

sampleRate int

Audio sample rate in Hz. Default is 16000.

maxAudioLengthSeconds int

Maximum audio length to process. Default is 30 seconds.

hiddenDim int

Hidden dimension for transformer. Default is 768.

numTransformerLayers int

Number of transformer layers. Default is 12.

numHeads int

Number of attention heads. Default is 12.

ffDim int

Feed-forward dimension. Default is 3072.

vocabulary string[]

CTC vocabulary for decoding. If null, uses default English alphabet.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, uses Adam with default settings.

lossFunction ILossFunction<T>

Loss function for training. If null, uses CTC loss.

Remarks

For Beginners: Use this constructor to train a speech recognition model from scratch.

Training Wav2Vec2 typically involves:

  1. Pre-training on unlabeled audio (self-supervised)
  2. Fine-tuning on labeled transcription data

Example:

var wav2vec2 = new Wav2Vec2Model<float>(
    architecture,
    language: "en",
    hiddenDim: 768,
    numTransformerLayers: 12);

// Training loop
for (int epoch = 0; epoch < numEpochs; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        wav2vec2.Train(audio, tokens);
    }
}

Wav2Vec2Model(NeuralNetworkArchitecture<T>, string, string?, int, int, string[]?, OnnxModelOptions?)

Creates a Wav2Vec2 network using a pretrained ONNX model.

public Wav2Vec2Model(NeuralNetworkArchitecture<T> architecture, string modelPath, string? language = "en", int sampleRate = 16000, int maxAudioLengthSeconds = 30, string[]? vocabulary = null, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelPath string

Path to the ONNX model file.

language string

Target language code (e.g., "en", "es"). Default is "en".

sampleRate int

Audio sample rate in Hz. Wav2Vec2 expects 16000.

maxAudioLengthSeconds int

Maximum audio length to process. Default is 30 seconds.

vocabulary string[]

CTC vocabulary for decoding. If null, uses default English alphabet.

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor when you have a pretrained Wav2Vec2 ONNX model.

You can get ONNX models from:

  • HuggingFace: facebook/wav2vec2-base-960h, etc.
  • Convert from PyTorch using ONNX export tools

Example:

var wav2vec2 = new Wav2Vec2Model<float>(
    architecture,
    modelPath: "wav2vec2-base.onnx",
    language: "en");

Properties

IsReady

Gets whether the model is ready for inference.

public bool IsReady { get; }

Property Value

bool

Language

Gets the target language for transcription.

public string? Language { get; }

Property Value

string

MaxAudioLengthSeconds

Gets the maximum audio length in seconds.

public int MaxAudioLengthSeconds { get; }

Property Value

int

SupportedLanguages

Gets the list of languages supported by this model.

public IReadOnlyList<string> SupportedLanguages { get; }

Property Value

IReadOnlyList<string>

SupportsStreaming

Gets whether this model supports real-time streaming transcription.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsWordTimestamps

Gets whether this model can identify timestamps for each word.

public bool SupportsWordTimestamps { get; }

Property Value

bool
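
Before transcribing, you can check what a given instance supports using only the properties above:

if (wav2vec2.IsReady)
{
    Console.WriteLine($"Language: {wav2vec2.Language}");
    Console.WriteLine($"Max audio: {wav2vec2.MaxAudioLengthSeconds}s");
    Console.WriteLine($"Streaming: {wav2vec2.SupportsStreaming}");
    Console.WriteLine($"Word timestamps: {wav2vec2.SupportsWordTimestamps}");
    Console.WriteLine($"Languages: {string.Join(", ", wav2vec2.SupportedLanguages)}");
}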

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

DetectLanguage(Tensor<T>)

Detects the language spoken in the audio.

public string DetectLanguage(Tensor<T> audio)

Parameters

audio Tensor<T>

The raw audio tensor to analyze.

Returns

string

The detected language code (e.g., "en").

DetectLanguageProbabilities(Tensor<T>)

Gets language detection probabilities for the audio.

public IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)

Parameters

audio Tensor<T>

The raw audio tensor to analyze.

Returns

IReadOnlyDictionary<string, T>

A dictionary mapping language codes to their detection probabilities.
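
Usage sketch for both detection methods (audioTensor is assumed to hold raw audio at the model's sample rate):

// Single best guess, or the full probability distribution.
string detected = wav2vec2.DetectLanguage(audioTensor);
var probabilities = wav2vec2.DetectLanguageProbabilities(audioTensor);
foreach (var pair in probabilities)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}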

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True when called from Dispose(); false when called from a finalizer.

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes layers for ONNX inference mode.

protected override void InitializeLayers()

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input audio tensor.

Returns

Tensor<T>

The model output tensor (typically CTC logits over the vocabulary).
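
For lower-level access than Transcribe, Predict returns the raw output tensor. A minimal sketch (the Shape member and the [timeSteps, vocabSize] interpretation are assumptions for illustration):

// Run the model directly; Transcribe normally decodes this for you.
var logits = wav2vec2.Predict(audioTensor);
// For a CTC model this is typically [timeSteps, vocabSize] scores.
Console.WriteLine($"Output shape: [{string.Join(", ", logits.Shape)}]");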

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

StartStreamingSession(string?)

Starts a streaming transcription session.

public IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)

Parameters

language string

Optional language code for the session. If null, uses the model's default language.

Returns

IStreamingTranscriptionSession<T>

A session for incremental, real-time transcription.
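
A minimal sketch of starting a session; check SupportsStreaming first. Feeding audio chunks and reading partial results go through the IStreamingTranscriptionSession<T> members, which are not documented on this page:

if (wav2vec2.SupportsStreaming)
{
    var session = wav2vec2.StartStreamingSession(language: "en");
    // Feed audio chunks and read partial transcripts through the
    // session's members (see IStreamingTranscriptionSession<T>).
}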

Train(Tensor<T>, Tensor<T>)

Trains the model on a single batch.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The audio tensor for this batch.

expectedOutput Tensor<T>

The expected token sequence (indices into the CTC vocabulary).
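
A single-step sketch. EncodeTranscript is a hypothetical helper: how you map a transcript to token indices depends on your CTC vocabulary:

// One training step: raw audio in, target token indices as the label.
// EncodeTranscript (hypothetical) maps characters to vocabulary indices.
Tensor<float> tokens = EncodeTranscript("hello world", vocabulary);
wav2vec2.Train(audioTensor, tokens);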

Transcribe(Tensor<T>, string?, bool)

Transcribes audio to text.

public TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)

Parameters

audio Tensor<T>

The audio tensor to transcribe.

language string

Optional language code. If null, uses the model's default language.

includeTimestamps bool

Whether to include word timestamps in the result (see SupportsWordTimestamps).

Returns

TranscriptionResult<T>

The transcription result, including the recognized text.
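
Usage sketch (only the Text member of the result is shown on this page):

var result = wav2vec2.Transcribe(
    audioTensor,
    language: "en",
    includeTimestamps: true);
Console.WriteLine(result.Text);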

TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)

Transcribes audio to text asynchronously.

public Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

The audio tensor to transcribe.

language string

Optional language code. If null, uses the model's default language.

includeTimestamps bool

Whether to include word timestamps in the result.

cancellationToken CancellationToken

A token to cancel the operation.

Returns

Task<TranscriptionResult<T>>

A task that resolves to the transcription result.
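
An async sketch with a timeout-based cancellation token:

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
var result = await wav2vec2.TranscribeAsync(
    audioTensor,
    language: "en",
    cancellationToken: cts.Token);
Console.WriteLine(result.Text);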

UpdateParameters(Vector<T>)

Updates model parameters by applying gradient descent.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

The gradient vector to apply to the model's parameters.