Interface ISpeechRecognizer<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for speech recognition models that transcribe audio to text (ASR - Automatic Speech Recognition).
public interface ISpeechRecognizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Speech recognition models convert spoken audio into written text. They analyze audio waveforms or spectrograms to identify phonemes, words, and sentences. Modern speech recognition uses encoder-decoder architectures (like Whisper) or CTC-based models.
For Beginners: Speech recognition is like having a transcriptionist listen to audio and type out what they hear.
How speech recognition works:
- Audio is converted to features (spectrograms or mel-spectrograms)
- The model processes these features to identify speech patterns
- Patterns are decoded into words and sentences
Common use cases:
- Voice assistants (Siri, Alexa, Google Assistant)
- Video/podcast transcription
- Real-time captioning for accessibility
- Voice typing and dictation
Key challenges:
- Different accents and speaking styles
- Background noise and multiple speakers
- Domain-specific vocabulary (medical, legal terms)
This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
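A minimal usage sketch, assuming recognizer is an existing ISpeechRecognizer&lt;float&gt; implementation, audio is a Tensor&lt;float&gt; waveform already resampled to recognizer.SampleRate, and that TranscriptionResult&lt;T&gt; exposes a Text property (an assumption for illustration):
// Transcribe a waveform to text; the language is auto-detected when not specified.
TranscriptionResult<float> result = recognizer.Transcribe(audio);
Console.WriteLine(result.Text); // `Text` is assumed here; check TranscriptionResult<T> for the actual member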
Properties
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
Remarks
When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
SampleRate
Gets the sample rate expected by this model.
int SampleRate { get; }
Property Value
- int
Remarks
Most speech recognition models expect 16000 Hz audio. Input audio should be resampled to match this rate before processing.
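For example, audio recorded at 44100 Hz should be resampled to the model's rate before transcription. A hedged sketch, assuming recordedRate holds the rate of the source audio and Resample is a hypothetical helper (not part of this library):
int targetRate = recognizer.SampleRate; // typically 16000
if (recordedRate != targetRate)
{
    // `Resample` is a hypothetical helper; substitute your own resampling routine.
    audio = Resample(audio, recordedRate, targetRate);
}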
SupportedLanguages
Gets the list of languages supported by this model.
IReadOnlyList<string> SupportedLanguages { get; }
Property Value
- IReadOnlyList<string>
Remarks
Multilingual models like Whisper support many languages. Monolingual models may only support one. Check this property before processing foreign audio.
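A sketch of such a check, assuming recognizer is an ISpeechRecognizer&lt;T&gt; instance (Contains on an IReadOnlyList&lt;string&gt; requires using System.Linq):
// Verify the target language is supported before transcribing.
if (!recognizer.SupportedLanguages.Contains("ja"))
{
    Console.WriteLine("This model does not support Japanese.");
}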
SupportsStreaming
Gets whether this model supports real-time streaming transcription.
bool SupportsStreaming { get; }
Property Value
- bool
Remarks
For Beginners: Streaming mode transcribes audio as it comes in, without waiting for the entire recording. Good for live captioning.
SupportsWordTimestamps
Gets whether this model can identify timestamps for each word.
bool SupportsWordTimestamps { get; }
Property Value
- bool
Remarks
Word-level timestamps are useful for subtitle generation and audio editing.
Methods
DetectLanguage(Tensor<T>)
Detects the language spoken in the audio.
string DetectLanguage(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
Returns
- string
Detected language code (e.g., "en", "es", "fr").
Remarks
For Beginners: This identifies what language is being spoken before transcription. Useful for multilingual applications.
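A hedged sketch that detects the language first and then passes it explicitly to Transcribe (recognizer and audio as in the earlier examples):
// Detect the spoken language, then transcribe with that language fixed.
string languageCode = recognizer.DetectLanguage(audio); // e.g., "en"
var result = recognizer.Transcribe(audio, language: languageCode);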
DetectLanguageProbabilities(Tensor<T>)
Gets language detection probabilities for the audio.
IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
Returns
- IReadOnlyDictionary<string, T>
Dictionary mapping language codes to confidence scores (0.0 to 1.0).
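A sketch of inspecting the scores (recognizer and audio as in the earlier examples):
// Print each candidate language with its confidence score.
foreach (var pair in recognizer.DetectLanguageProbabilities(audio))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}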
StartStreamingSession(string?)
Starts a streaming transcription session.
IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)
Parameters
language (string): Optional language code for transcription.
Returns
- IStreamingTranscriptionSession<T>
A streaming session that can receive audio chunks incrementally.
Exceptions
- NotSupportedException
Thrown if streaming is not supported.
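A hedged sketch of starting a session; the members called on the session object (such as a chunk-feeding method) are assumptions for illustration, since they are defined by IStreamingTranscriptionSession&lt;T&gt;:
// Guard on SupportsStreaming to avoid NotSupportedException.
if (recognizer.SupportsStreaming)
{
    var session = recognizer.StartStreamingSession("en");
    // Feed audio chunks as they arrive; the member names below are assumed:
    // session.ProcessChunk(chunk);
    // var partial = session.GetCurrentTranscription();
}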
Transcribe(Tensor<T>, string?, bool)
Transcribes audio to text.
TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
language (string): Optional language code (e.g., "en", "es"). Auto-detected if null.
includeTimestamps (bool): Whether to include word-level timestamps.
Returns
- TranscriptionResult<T>
Transcription result containing text and optional timestamps.
Remarks
For Beginners: This is the main method for converting speech to text.
- Pass in audio data (as a tensor of samples)
- Get back the transcribed text
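A hedged sketch requesting word-level timestamps (the result member accessed here is an assumption; see TranscriptionResult&lt;T&gt; for the actual API):
// Check SupportsWordTimestamps before asking for timestamps.
if (recognizer.SupportsWordTimestamps)
{
    var result = recognizer.Transcribe(audio, language: "en", includeTimestamps: true);
    Console.WriteLine(result.Text); // `Text` is an assumed member of TranscriptionResult<T>
}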
TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)
Transcribes audio to text asynchronously.
Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
language (string): Optional language code (e.g., "en", "es"). Auto-detected if null.
includeTimestamps (bool): Whether to include word-level timestamps.
cancellationToken (CancellationToken): Cancellation token for async operation.
Returns
- Task<TranscriptionResult<T>>
Transcription result containing text and optional timestamps.
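A hedged async sketch with a timeout (must run inside an async method; recognizer and audio as in the earlier examples):
// Cancel the transcription automatically if it takes longer than two minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
TranscriptionResult<float> result =
    await recognizer.TranscribeAsync(audio, cancellationToken: cts.Token);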