Class WhisperModel<T>
Whisper automatic speech recognition model for transcribing audio to text.
public class WhisperModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ISpeechRecognizer<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T
The numeric type used for calculations.
Remarks
Whisper is a state-of-the-art speech recognition model by OpenAI that can:
- Transcribe speech in 99+ languages
- Translate non-English speech to English
- Detect the spoken language automatically
- Handle noisy audio and accents well
For Beginners: Whisper converts spoken audio into text. It works by:
1. Converting audio to a mel spectrogram (a visual representation of sound)
2. Processing the spectrogram through an encoder neural network
3. Generating text tokens through a decoder neural network
Two ways to use this class:
- ONNX Mode: Load pretrained models for fast inference
- Native Mode: Train your own speech recognition model from scratch
ONNX Mode Example:
var whisper = new WhisperModel<float>(
    architecture,
    encoderPath: "path/to/encoder.onnx",
    decoderPath: "path/to/decoder.onnx");
var result = whisper.Transcribe(audioTensor);
Console.WriteLine(result.Text);
Training Mode Example:
var whisper = new WhisperModel<float>(architecture);
for (int epoch = 0; epoch < 100; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        whisper.Train(audio, tokens);
    }
}
Constructors
WhisperModel(NeuralNetworkArchitecture<T>, WhisperModelSize, string?, bool, int, int, int, int, int, double, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a Whisper network for training from scratch using native layers.
public WhisperModel(NeuralNetworkArchitecture<T> architecture, WhisperModelSize modelSize = WhisperModelSize.Base, string? language = null, bool translate = false, int sampleRate = 16000, int numMels = 80, int maxAudioLengthSeconds = 30, int maxTokens = 448, int beamSize = 5, double temperature = 0, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture NeuralNetworkArchitecture<T>
The neural network architecture configuration.
modelSize WhisperModelSize
Model size variant (determines layer dimensions).
language string
Target language code. Null for multilingual training.
translate bool
Whether to train for the translation task.
sampleRate int
Audio sample rate in Hz.
numMels int
Number of mel filterbank channels.
maxAudioLengthSeconds int
Maximum audio length to process, in seconds.
maxTokens int
Maximum sequence length for the decoder.
beamSize int
Beam size for inference.
temperature double
Sampling temperature.
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>
Optimizer for training. If null, uses Adam with default settings.
lossFunction ILossFunction<T>
Loss function for training. If null, uses cross-entropy.
Remarks
For Beginners: Use this constructor when you want to train a speech recognition model from scratch with your own data.
Training Whisper requires:
- Large amounts of paired audio-transcript data
- Significant compute resources (GPUs recommended)
- Many training epochs
This is useful for:
- Domain-specific vocabulary (medical, legal, technical)
- Languages not well supported by pretrained models
- Specific accent or dialect adaptation
- Research and experimentation
Example:
var whisper = new WhisperModel<float>(
    architecture,
    modelSize: WhisperModelSize.Base,
    language: "en");
// Training loop
for (int epoch = 0; epoch < numEpochs; epoch++)
{
    foreach (var (audio, tokens) in trainingData)
    {
        whisper.Train(audio, tokens);
    }
}
WhisperModel(NeuralNetworkArchitecture<T>, string, string, WhisperModelSize, string?, bool, int, int, int, int, int, double, OnnxModelOptions?)
Creates a Whisper network using pretrained ONNX models.
public WhisperModel(NeuralNetworkArchitecture<T> architecture, string encoderPath, string decoderPath, WhisperModelSize modelSize = WhisperModelSize.Base, string? language = null, bool translate = false, int sampleRate = 16000, int numMels = 80, int maxAudioLengthSeconds = 30, int maxTokens = 448, int beamSize = 5, double temperature = 0, OnnxModelOptions? onnxOptions = null)
Parameters
architecture NeuralNetworkArchitecture<T>
The neural network architecture configuration.
encoderPath string
Path to the encoder ONNX model.
decoderPath string
Path to the decoder ONNX model.
modelSize WhisperModelSize
Model size variant (Tiny, Base, Small, Medium, Large).
language string
Target language code (e.g., "en", "es"). Null for auto-detection.
translate bool
Whether to translate non-English speech to English.
sampleRate int
Audio sample rate in Hz. Whisper expects 16000.
numMels int
Number of mel filterbank channels. Whisper uses 80.
maxAudioLengthSeconds int
Maximum audio length to process. Whisper uses 30-second chunks.
maxTokens int
Maximum number of tokens to generate.
beamSize int
Beam size for beam search decoding.
temperature double
Sampling temperature (0 = greedy/deterministic).
onnxOptions OnnxModelOptions
ONNX runtime options.
Remarks
For Beginners: Use this constructor when you have downloaded Whisper ONNX models.
The encoder processes audio features and the decoder generates text tokens. Both are needed for transcription.
You can get ONNX models from:
- HuggingFace: openai/whisper-base, openai/whisper-small, etc.
- Converting from PyTorch using ONNX export tools
Example:
var whisper = new WhisperModel<float>(
    architecture,
    encoderPath: "whisper-base-encoder.onnx",
    decoderPath: "whisper-base-decoder.onnx",
    modelSize: WhisperModelSize.Base,
    language: "en"); // English transcription
Properties
IsReady
Gets whether the model is ready for inference.
public bool IsReady { get; }
Property Value
- bool
Language
Gets the target language for transcription.
public string? Language { get; }
Property Value
- string
MaxAudioLengthSeconds
Gets the maximum audio length in seconds.
public int MaxAudioLengthSeconds { get; }
Property Value
- int
ModelSize
Gets the model size variant.
public WhisperModelSize ModelSize { get; }
Property Value
- WhisperModelSize
SupportedLanguages
Gets the list of languages supported by this model.
public IReadOnlyList<string> SupportedLanguages { get; }
Property Value
- IReadOnlyList<string>
SupportsStreaming
Gets whether this model supports real-time streaming transcription.
public bool SupportsStreaming { get; }
Property Value
- bool
SupportsWordTimestamps
Gets whether this model can identify timestamps for each word.
public bool SupportsWordTimestamps { get; }
Property Value
- bool
Translate
Gets whether translation to English is enabled.
public bool Translate { get; }
Property Value
- bool
Methods
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader BinaryReader
DetectLanguage(Tensor<T>)
Detects the language spoken in the audio.
public string DetectLanguage(Tensor<T> audio)
Parameters
audio Tensor<T>
Returns
- string
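Example (a minimal sketch; assumes whisper is a constructed WhisperModel<float> and audioTensor holds mono audio samples at the model's sample rate):
var detected = whisper.DetectLanguage(audioTensor);
Console.WriteLine(detected); // e.g., "en"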
DetectLanguageProbabilities(Tensor<T>)
Gets language detection probabilities for the audio.
public IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)
Parameters
audio Tensor<T>
Returns
- IReadOnlyDictionary<string, T>
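Example (a minimal sketch that prints each candidate language with its probability; assumes whisper and audioTensor are set up as in the examples above):
var probabilities = whisper.DetectLanguageProbabilities(audioTensor);
foreach (var pair in probabilities)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}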
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
disposing bool
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes layers following the golden standard pattern.
protected override void InitializeLayers()
PostprocessOutput(Tensor<T>)
Postprocesses model output into the final result format.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput Tensor<T>
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input Tensor<T>
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio Tensor<T>
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer BinaryWriter
StartStreamingSession(string?)
Starts a streaming transcription session.
public IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)
Parameters
language string
Returns
- IStreamingTranscriptionSession<T>
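Example (a minimal sketch; feeding audio chunks into the session and reading partial results depends on the IStreamingTranscriptionSession<T> members, which are documented separately):
var session = whisper.StartStreamingSession(language: "en");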
Train(Tensor<T>, Tensor<T>)
Trains the model on a single batch of audio and expected transcription tokens.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input Tensor<T>
Audio waveform tensor [batch, samples] or [samples].
expectedOutput Tensor<T>
Expected token sequence tensor [batch, sequence_length].
Remarks
For Beginners: Training teaches the model to transcribe audio correctly.
The training process:
- Forward pass: Audio goes through encoder-decoder to predict tokens
- Loss calculation: Compare predicted tokens to expected tokens
- Backward pass: Calculate gradients showing how to improve
- Update: Adjust model parameters to reduce error
Call this method repeatedly with different audio/transcript pairs. After many iterations, the model learns to transcribe correctly.
Transcribe(Tensor<T>, string?, bool)
Transcribes audio to text.
public TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)
Parameters
audio Tensor<T>
Audio waveform tensor [batch, samples] or [samples].
language string
Optional language code. Auto-detected if null.
includeTimestamps bool
Whether to include word-level timestamps.
Returns
- TranscriptionResult<T>
Transcription result containing text and optional timestamps.
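Example (a minimal sketch; assumes whisper is a constructed WhisperModel<float> and audioTensor holds mono 16 kHz audio):
var result = whisper.Transcribe(audioTensor, language: "en", includeTimestamps: true);
Console.WriteLine(result.Text);
// Accessing word-level timestamps depends on the TranscriptionResult<T> members,
// documented separately.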
TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)
Transcribes audio to text asynchronously.
public Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)
Parameters
audio Tensor<T>
Audio waveform tensor [batch, samples] or [samples].
language string
Optional language code. Auto-detected if null.
includeTimestamps bool
Whether to include word-level timestamps.
cancellationToken CancellationToken
Token to cancel the transcription operation.
Returns
- Task<TranscriptionResult<T>>
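Example (a minimal sketch with a 30-second timeout; assumes an async context, plus whisper and audioTensor set up as in the examples above):
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var result = await whisper.TranscribeAsync(audioTensor, language: "en", cancellationToken: cts.Token);
Console.WriteLine(result.Text);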
UpdateParameters(Vector<T>)
Updates model parameters by applying gradient descent.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters Vector<T>
Remarks
Applies the simple gradient descent update rule: params = params - learning_rate * gradients.
For Beginners: This is how the model learns!
During training:
- The model transcribes audio
- We compare to the correct transcription (loss)
- We compute gradients (which direction to adjust each parameter)
- This method applies those adjustments to improve transcription
The learning rate controls adjustment magnitude:
- Too big: May overshoot optimal values
- Too small: Learning is slow but precise
- Default (0.001): Good starting point
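Example (an illustrative sketch of the update rule above using plain arrays; learningRate, parameters, and gradients here are hypothetical local values, not the library's internals):
float learningRate = 0.001f;
float[] parameters = { 0.5f, -0.2f };
float[] gradients = { 0.1f, -0.4f };
for (int i = 0; i < parameters.Length; i++)
{
    parameters[i] -= learningRate * gradients[i]; // params = params - learning_rate * gradients
}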