Class WhisperModel<T>

Namespace: AiDotNet.Audio.Whisper
Assembly: AiDotNet.dll

Whisper automatic speech recognition model for transcribing audio to text.

public class WhisperModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeechRecognizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
WhisperModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ISpeechRecognizer<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

Whisper is a state-of-the-art speech recognition model by OpenAI that can:

  • Transcribe speech in 99+ languages
  • Translate non-English speech to English
  • Detect the spoken language automatically
  • Handle noisy audio and accents well

For Beginners: Whisper converts spoken audio into text. It works by:

  1. Converting audio to a mel spectrogram (a visual representation of sound)
  2. Processing the spectrogram through an encoder neural network
  3. Generating text tokens through a decoder neural network

Two ways to use this class:

  1. ONNX Mode: Load pretrained models for fast inference
  2. Native Mode: Train your own speech recognition model from scratch

ONNX Mode Example:

var whisper = new WhisperModel<float>(
    architecture,
    encoderPath: "path/to/encoder.onnx",
    decoderPath: "path/to/decoder.onnx");
var result = whisper.Transcribe(audioTensor);
Console.WriteLine(result.Text);

Training Mode Example:

var whisper = new WhisperModel<float>(architecture);
for (int epoch = 0; epoch < 100; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        whisper.Train(audio, tokens);
    }
}

Constructors

WhisperModel(NeuralNetworkArchitecture<T>, WhisperModelSize, string?, bool, int, int, int, int, int, double, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a Whisper network for training from scratch using native layers.

public WhisperModel(NeuralNetworkArchitecture<T> architecture, WhisperModelSize modelSize = WhisperModelSize.Base, string? language = null, bool translate = false, int sampleRate = 16000, int numMels = 80, int maxAudioLengthSeconds = 30, int maxTokens = 448, int beamSize = 5, double temperature = 0, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelSize WhisperModelSize

Model size variant (determines layer dimensions).

language string

Target language code. Null for multilingual training.

translate bool

Whether to train for translation task.

sampleRate int

Audio sample rate in Hz.

numMels int

Number of mel filterbank channels.

maxAudioLengthSeconds int

Maximum audio length to process, in seconds.

maxTokens int

Maximum sequence length for decoder.

beamSize int

Beam size for inference.

temperature double

Sampling temperature.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, uses Adam with default settings.

lossFunction ILossFunction<T>

Loss function for training. If null, uses cross-entropy.

Remarks

For Beginners: Use this constructor when you want to train a speech recognition model from scratch with your own data.

Training Whisper requires:

  1. Large amounts of paired audio-transcript data
  2. Significant compute resources (GPUs recommended)
  3. Many training epochs

This is useful for:

  • Domain-specific vocabulary (medical, legal, technical)
  • Languages not well supported by pretrained models
  • Specific accent or dialect adaptation
  • Research and experimentation

Example:

var whisper = new WhisperModel<float>(
    architecture,
    modelSize: WhisperModelSize.Base,
    language: "en");

// Training loop
for (int epoch = 0; epoch < numEpochs; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        whisper.Train(audio, tokens);
    }
}

WhisperModel(NeuralNetworkArchitecture<T>, string, string, WhisperModelSize, string?, bool, int, int, int, int, int, double, OnnxModelOptions?)

Creates a Whisper network using pretrained ONNX models.

public WhisperModel(NeuralNetworkArchitecture<T> architecture, string encoderPath, string decoderPath, WhisperModelSize modelSize = WhisperModelSize.Base, string? language = null, bool translate = false, int sampleRate = 16000, int numMels = 80, int maxAudioLengthSeconds = 30, int maxTokens = 448, int beamSize = 5, double temperature = 0, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

encoderPath string

Path to the encoder ONNX model.

decoderPath string

Path to the decoder ONNX model.

modelSize WhisperModelSize

Model size variant (Tiny, Base, Small, Medium, Large).

language string

Target language code (e.g., "en", "es"). Null for auto-detection.

translate bool

Whether to translate non-English to English.

sampleRate int

Audio sample rate in Hz. Whisper expects 16000.

numMels int

Number of mel filterbank channels. Whisper uses 80.

maxAudioLengthSeconds int

Maximum audio length to process, in seconds. Whisper processes audio in 30-second chunks.

maxTokens int

Maximum number of tokens to generate.

beamSize int

Beam size for beam search decoding.

temperature double

Sampling temperature (0 = greedy/deterministic).

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor when you have downloaded Whisper ONNX models.

The encoder processes audio features and the decoder generates text tokens. Both are needed for transcription.

You can get ONNX models from:

  • HuggingFace: openai/whisper-base, openai/whisper-small, etc.
  • Convert from PyTorch using ONNX export tools

Example:

var whisper = new WhisperModel<float>(
    architecture,
    encoderPath: "whisper-base-encoder.onnx",
    decoderPath: "whisper-base-decoder.onnx",
    modelSize: WhisperModelSize.Base,
    language: "en");  // English transcription

Properties

IsReady

Gets whether the model is ready for inference.

public bool IsReady { get; }

Property Value

bool

Language

Gets the target language for transcription.

public string? Language { get; }

Property Value

string

MaxAudioLengthSeconds

Gets the maximum audio length in seconds.

public int MaxAudioLengthSeconds { get; }

Property Value

int

ModelSize

Gets the model size variant.

public WhisperModelSize ModelSize { get; }

Property Value

WhisperModelSize

SupportedLanguages

Gets the list of languages supported by this model.

public IReadOnlyList<string> SupportedLanguages { get; }

Property Value

IReadOnlyList<string>

SupportsStreaming

Gets whether this model supports real-time streaming transcription.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsWordTimestamps

Gets whether this model can identify timestamps for each word.

public bool SupportsWordTimestamps { get; }

Property Value

bool

Translate

Gets whether translation to English is enabled.

public bool Translate { get; }

Property Value

bool
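
A combined sketch of using these properties to guard inference (whisper and audioTensor are assumed to be a constructed model and a prepared waveform tensor):

// Requires using System.Linq for Contains on IReadOnlyList<string>.
if (whisper.IsReady && whisper.SupportedLanguages.Contains("en"))
{
    var result = whisper.Transcribe(audioTensor, language: "en");
    Console.WriteLine(result.Text);
}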

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

DetectLanguage(Tensor<T>)

Detects the language spoken in the audio.

public string DetectLanguage(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

Returns

string

The detected language code (e.g., "en").
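
Example (a minimal sketch; audioTensor is assumed to be a prepared waveform tensor):

// Detect the spoken language, then transcribe in that language.
string detected = whisper.DetectLanguage(audioTensor);
Console.WriteLine($"Detected language: {detected}");
var result = whisper.Transcribe(audioTensor, language: detected);
Console.WriteLine(result.Text);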

DetectLanguageProbabilities(Tensor<T>)

Gets language detection probabilities for the audio.

public IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

Returns

IReadOnlyDictionary<string, T>

Dictionary mapping language codes to detection probabilities.
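
Example (a sketch; audioTensor is assumed to hold the audio waveform):

var probabilities = whisper.DetectLanguageProbabilities(audioTensor);
foreach (var kvp in probabilities)
{
    // kvp.Key is a language code, kvp.Value its detection probability.
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");
}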

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes layers following the golden standard pattern.

protected override void InitializeLayers()

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

StartStreamingSession(string?)

Starts a streaming transcription session.

public IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)

Parameters

language string

Optional language code. Auto-detected if null.

Returns

IStreamingTranscriptionSession<T>

The streaming transcription session.
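
Example (a minimal sketch; feeding audio and reading results goes through the IStreamingTranscriptionSession<T> interface, whose members are not shown here):

var session = whisper.StartStreamingSession(language: "en");
// Push audio chunks to the session as they arrive and read
// incremental transcription results through its interface members.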

Train(Tensor<T>, Tensor<T>)

Trains the model on a single batch of audio and expected transcription tokens.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

expectedOutput Tensor<T>

Expected token sequence tensor [batch, sequence_length].

Remarks

For Beginners: Training teaches the model to transcribe audio correctly.

The training process:

  1. Forward pass: Audio goes through encoder-decoder to predict tokens
  2. Loss calculation: Compare predicted tokens to expected tokens
  3. Backward pass: Calculate gradients showing how to improve
  4. Update: Adjust model parameters to reduce error

Call this method repeatedly with different audio/transcript pairs. After many iterations, the model learns to transcribe correctly.
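
A single training step, sketched with the documented shapes (audioBatch and tokenBatch are placeholders; how they are built depends on your data pipeline):

// audioBatch:  [batch, samples]          e.g. [8, 480000] (8 clips of 30 s at 16 kHz)
// tokenBatch:  [batch, sequence_length]  e.g. [8, 448]
whisper.Train(audioBatch, tokenBatch);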

Transcribe(Tensor<T>, string?, bool)

Transcribes audio to text.

public TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

language string

Optional language code. Auto-detected if null.

includeTimestamps bool

Whether to include word-level timestamps.

Returns

TranscriptionResult<T>

Transcription result containing text and optional timestamps.
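
Example (a sketch; audioTensor is assumed to hold a 16 kHz waveform):

var result = whisper.Transcribe(audioTensor, language: "en", includeTimestamps: true);
Console.WriteLine(result.Text);
// Word-level timestamps are included in the result when
// includeTimestamps is true and the model supports them.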

TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)

Transcribes audio to text asynchronously.

public Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

language string

Optional language code. Auto-detected if null.

includeTimestamps bool

Whether to include word-level timestamps.

cancellationToken CancellationToken

Token used to cancel the operation.

Returns

Task<TranscriptionResult<T>>

A task that resolves to the transcription result.
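
Example (a sketch; uses a cancellation token to bound transcription time):

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
var result = await whisper.TranscribeAsync(audioTensor, cancellationToken: cts.Token);
Console.WriteLine(result.Text);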

UpdateParameters(Vector<T>)

Updates model parameters by applying gradient descent.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Remarks

Applies the simple gradient descent update rule: params = params - learning_rate * gradients.

For Beginners: This is how the model learns!

During training:

  1. The model transcribes audio
  2. We compare to the correct transcription (loss)
  3. We compute gradients (which direction to adjust each parameter)
  4. This method applies those adjustments to improve transcription

The learning rate controls adjustment magnitude:

  • Too big: May overshoot optimal values
  • Too small: Learning is slow but precise
  • Default (0.001): Good starting point
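
The rule above, sketched elementwise (an illustration of the formula, not the library's actual implementation):

// params = params - learning_rate * gradients, one element at a time:
for (int i = 0; i < parameters.Length; i++)
{
    parameters[i] -= learningRate * gradients[i];
}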