Interface ISpeechRecognizer<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for speech recognition models that transcribe audio to text (ASR - Automatic Speech Recognition).

public interface ISpeechRecognizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.


Remarks

Speech recognition models convert spoken audio into written text. They analyze audio waveforms or spectrograms to identify phonemes, words, and sentences. Modern speech recognition uses encoder-decoder architectures (like Whisper) or CTC-based models.

For Beginners: Speech recognition is like having a transcriptionist listen to audio and type out what they hear.

How speech recognition works:

  1. Audio is converted to features (spectrograms or mel-spectrograms)
  2. The model processes these features to identify speech patterns
  3. Patterns are decoded into words and sentences

Common use cases:

  • Voice assistants (Siri, Alexa, Google Assistant)
  • Video/podcast transcription
  • Real-time captioning for accessibility
  • Voice typing and dictation

Key challenges:

  • Different accents and speaking styles
  • Background noise and multiple speakers
  • Domain-specific vocabulary (medical, legal terms)

This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
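
The sketch below shows typical end-to-end usage. It assumes an existing ISpeechRecognizer<float> instance (construction is implementation-specific and outside this interface) and that TranscriptionResult<T> exposes the transcribed text through a Text property, which this page does not document.

void PrintTranscript(ISpeechRecognizer<float> recognizer, Tensor<float> audio)
{
    // audio: waveform tensor of shape [samples] or [batch, samples],
    // already resampled to recognizer.SampleRate.
    TranscriptionResult<float> result = recognizer.Transcribe(audio);

    // Text is an assumed property of TranscriptionResult<T>.
    Console.WriteLine(result.Text);
}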

Properties

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
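
For example, a caller might branch on this flag before deciding whether training is possible (a minimal sketch):

if (recognizer.IsOnnxMode)
{
    // Inference only: pre-trained ONNX weights are in use.
}
else
{
    // Trainable from scratch via the inherited
    // IFullModel<T, Tensor<T>, Tensor<T>> infrastructure.
}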

SampleRate

Gets the sample rate expected by this model.

int SampleRate { get; }

Property Value

int

Remarks

Most speech recognition models expect 16000 Hz audio. Input audio should be resampled to match this rate before processing.
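
A sketch of validating input audio against this property before transcription; the Resample helper is hypothetical and stands in for any resampler:

Tensor<float> PrepareAudio(ISpeechRecognizer<float> recognizer, Tensor<float> audio, int sourceRate)
{
    // Already at the expected rate: nothing to do.
    if (sourceRate == recognizer.SampleRate)
        return audio;

    // Resample is a hypothetical helper that converts the waveform
    // to recognizer.SampleRate (typically 16000 Hz).
    return Resample(audio, sourceRate, recognizer.SampleRate);
}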

SupportedLanguages

Gets the list of languages supported by this model.

IReadOnlyList<string> SupportedLanguages { get; }

Property Value

IReadOnlyList<string>

Remarks

Multilingual models like Whisper support many languages, while monolingual models support only one. Check this property before transcribing audio in a language other than the model's primary one.
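
For example (requires System.Linq for Contains on IReadOnlyList<string>):

bool CanTranscribe(ISpeechRecognizer<float> recognizer, string languageCode)
{
    // Language codes are short identifiers such as "en", "es", "fr".
    return recognizer.SupportedLanguages.Contains(languageCode);
}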

SupportsStreaming

Gets whether this model supports real-time streaming transcription.

bool SupportsStreaming { get; }

Property Value

bool

Remarks

For Beginners: Streaming mode transcribes audio as it comes in, without waiting for the entire recording. Good for live captioning.

SupportsWordTimestamps

Gets whether this model can identify timestamps for each word.

bool SupportsWordTimestamps { get; }

Property Value

bool

Remarks

Word-level timestamps are useful for subtitle generation and audio editing.
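
For example, request timestamps only when the model can produce them:

// Avoid requesting timestamps from models that cannot produce them.
bool withTimestamps = recognizer.SupportsWordTimestamps;
TranscriptionResult<float> result = recognizer.Transcribe(audio, includeTimestamps: withTimestamps);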

Methods

DetectLanguage(Tensor<T>)

Detects the language spoken in the audio.

string DetectLanguage(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

Returns

string

Detected language code (e.g., "en", "es", "fr").

Remarks

For Beginners: This identifies what language is being spoken before transcription. Useful for multilingual applications.
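
A detect-then-transcribe sketch. Note that Transcribe(Tensor<T>, string?, bool) already auto-detects when language is null, so a separate DetectLanguage call is mainly useful when the application needs the language code itself:

string language = recognizer.DetectLanguage(audio); // e.g., "en"
Console.WriteLine($"Detected language: {language}");

// Pass the detected code explicitly, e.g., to keep it consistent
// across several transcription calls on the same recording.
TranscriptionResult<float> result = recognizer.Transcribe(audio, language);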

DetectLanguageProbabilities(Tensor<T>)

Gets language detection probabilities for the audio.

IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

Returns

IReadOnlyDictionary<string, T>

Dictionary mapping language codes to confidence scores (0.0 to 1.0).
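
For example, to inspect the top candidates instead of a single best guess (sketch for T = float; requires System.Linq):

IReadOnlyDictionary<string, float> probs = recognizer.DetectLanguageProbabilities(audio);

// Print the three most likely languages with their confidence scores.
foreach (var pair in probs.OrderByDescending(p => p.Value).Take(3))
{
    Console.WriteLine($"{pair.Key}: {pair.Value:P1}");
}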

StartStreamingSession(string?)

Starts a streaming transcription session.

IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)

Parameters

language string

Optional language code for transcription.

Returns

IStreamingTranscriptionSession<T>

A streaming session that can receive audio chunks incrementally.

Exceptions

NotSupportedException

Thrown when the model does not support streaming (that is, when SupportsStreaming is false).
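
A hedged sketch of a streaming loop. The members of IStreamingTranscriptionSession<T> are not documented on this page, so ProcessChunk and Finish below are illustrative assumptions; check SupportsStreaming first to avoid the NotSupportedException:

if (recognizer.SupportsStreaming)
{
    IStreamingTranscriptionSession<float> session = recognizer.StartStreamingSession("en");

    // GetAudioChunks is a hypothetical source of incoming audio;
    // ProcessChunk and Finish are assumed session members.
    foreach (Tensor<float> chunk in GetAudioChunks())
    {
        session.ProcessChunk(chunk);
    }

    session.Finish();
}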

Transcribe(Tensor<T>, string?, bool)

Transcribes audio to text.

TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

language string

Optional language code (e.g., "en", "es"). Auto-detected if null.

includeTimestamps bool

Whether to include word-level timestamps.

Returns

TranscriptionResult<T>

Transcription result containing text and optional timestamps.

Remarks

For Beginners: This is the main method for converting speech to text.

  • Pass in audio data (as a tensor of samples)
  • Get back the transcribed text
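
A minimal sketch, assuming TranscriptionResult<T> exposes a Text property (not documented on this page):

// Auto-detect the language (language: null) and request word timestamps.
TranscriptionResult<float> result = recognizer.Transcribe(
    audio,
    language: null,
    includeTimestamps: true);

Console.WriteLine(result.Text); // Text is an assumed property.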

TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)

Transcribes audio to text asynchronously.

Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)

Parameters

audio Tensor<T>

Audio waveform tensor [batch, samples] or [samples].

language string

Optional language code (e.g., "en", "es"). Auto-detected if null.

includeTimestamps bool

Whether to include word-level timestamps.

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<TranscriptionResult<T>>

Transcription result containing text and optional timestamps.
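
A sketch of asynchronous use inside an async method, with a timeout so long-running transcriptions can be cancelled:

// Cancel automatically if transcription takes longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

TranscriptionResult<float> result = await recognizer.TranscribeAsync(
    audio,
    language: "en",
    includeTimestamps: false,
    cancellationToken: cts.Token);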