Interface ISpeechRecognizer<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for speech recognition models that transcribe audio to text (ASR - Automatic Speech Recognition).
public interface ISpeechRecognizer<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Speech recognition models convert spoken audio into written text. They analyze audio waveforms or spectrograms to identify phonemes, words, and sentences. Modern speech recognition uses encoder-decoder architectures (like Whisper) or CTC-based models.
For Beginners: Speech recognition is like having a transcriptionist listen to audio and type out what they hear.
How speech recognition works:
- Audio is converted to features (spectrograms or mel-spectrograms)
- The model processes these features to identify speech patterns
- Patterns are decoded into words and sentences
Common use cases:
- Voice assistants (Siri, Alexa, Google Assistant)
- Video/podcast transcription
- Real-time captioning for accessibility
- Voice typing and dictation
Key challenges:
- Different accents and speaking styles
- Background noise and multiple speakers
- Domain-specific vocabulary (medical, legal terms)
This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
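A minimal usage sketch, assuming recognizer is an existing ISpeechRecognizer&lt;float&gt; implementation, audio is a Tensor&lt;float&gt; waveform already resampled to recognizer.SampleRate, and that TranscriptionResult&lt;T&gt; exposes a Text property (an assumption for illustration):
// Transcribe a waveform to text; the language is auto-detected when not specified.
TranscriptionResult<float> result = recognizer.Transcribe(audio);
Console.WriteLine(result.Text); // `Text` is assumed here; check TranscriptionResult<T> for the actual member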
Properties
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
Remarks
When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
SampleRate
Gets the sample rate expected by this model.
int SampleRate { get; }
Property Value
- int
Remarks
Most speech recognition models expect 16000 Hz audio. Input audio should be resampled to match this rate before processing.
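For example, audio recorded at 44100 Hz should be resampled to the model's rate before transcription. A hedged sketch, assuming recordedRate holds the rate of the source audio and Resample is a hypothetical helper (not part of this library):
int targetRate = recognizer.SampleRate; // typically 16000
if (recordedRate != targetRate)
{
    // `Resample` is a hypothetical helper; substitute your own resampling routine.
    audio = Resample(audio, recordedRate, targetRate);
}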
SupportedLanguages
Gets the list of languages supported by this model.
IReadOnlyList<string> SupportedLanguages { get; }
Property Value
- IReadOnlyList<string>
Remarks
Multilingual models like Whisper support many languages. Monolingual models may only support one. Check this property before processing foreign audio.
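A sketch of such a check, assuming recognizer is an ISpeechRecognizer&lt;T&gt; instance (Contains on an IReadOnlyList&lt;string&gt; requires using System.Linq):
// Verify the target language is supported before transcribing.
if (!recognizer.SupportedLanguages.Contains("ja"))
{
    Console.WriteLine("This model does not support Japanese.");
}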
SupportsStreaming
Gets whether this model supports real-time streaming transcription.
bool SupportsStreaming { get; }
Property Value
- bool
Remarks
For Beginners: Streaming mode transcribes audio as it comes in, without waiting for the entire recording. Good for live captioning.
SupportsWordTimestamps
Gets whether this model can identify timestamps for each word.
bool SupportsWordTimestamps { get; }
Property Value
- bool
Remarks
Word-level timestamps are useful for subtitle generation and audio editing.
Methods
DetectLanguage(Tensor<T>)
Detects the language spoken in the audio.
string DetectLanguage(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
Returns
- string
Detected language code (e.g., "en", "es", "fr").
Remarks
For Beginners: This identifies what language is being spoken before transcription. Useful for multilingual applications.
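A hedged sketch that detects the language first and then passes it explicitly to Transcribe (recognizer and audio as in the earlier examples):
// Detect the spoken language, then transcribe with that language fixed.
string languageCode = recognizer.DetectLanguage(audio); // e.g., "en"
var result = recognizer.Transcribe(audio, language: languageCode);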
DetectLanguageProbabilities(Tensor<T>)
Gets language detection probabilities for the audio.
IReadOnlyDictionary<string, T> DetectLanguageProbabilities(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
Returns
- IReadOnlyDictionary<string, T>
Dictionary mapping language codes to confidence scores (0.0 to 1.0).
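A sketch of inspecting the scores (recognizer and audio as in the earlier examples):
// Print each candidate language with its confidence score.
foreach (var pair in recognizer.DetectLanguageProbabilities(audio))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}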
StartStreamingSession(string?)
Starts a streaming transcription session.
IStreamingTranscriptionSession<T> StartStreamingSession(string? language = null)
Parameters
language (string): Optional language code for transcription.
Returns
- IStreamingTranscriptionSession<T>
A streaming session that can receive audio chunks incrementally.
Exceptions
- NotSupportedException
Thrown if streaming is not supported.
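A hedged sketch of starting a session; the members called on the session object (such as a chunk-feeding method) are assumptions for illustration, since they are defined by IStreamingTranscriptionSession&lt;T&gt;:
// Guard on SupportsStreaming to avoid NotSupportedException.
if (recognizer.SupportsStreaming)
{
    var session = recognizer.StartStreamingSession("en");
    // Feed audio chunks as they arrive; the member names below are assumed:
    // session.ProcessChunk(chunk);
    // var partial = session.GetCurrentTranscription();
}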
Transcribe(Tensor<T>, string?, bool)
Transcribes audio to text.
TranscriptionResult<T> Transcribe(Tensor<T> audio, string? language = null, bool includeTimestamps = false)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
language (string): Optional language code (e.g., "en", "es"). Auto-detected if null.
includeTimestamps (bool): Whether to include word-level timestamps.
Returns
- TranscriptionResult<T>
Transcription result containing text and optional timestamps.
Remarks
For Beginners: This is the main method for converting speech to text.
- Pass in audio data (as a tensor of samples)
- Get back the transcribed text
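A hedged sketch requesting word-level timestamps (the result member accessed here is an assumption; see TranscriptionResult&lt;T&gt; for the actual API):
// Check SupportsWordTimestamps before asking for timestamps.
if (recognizer.SupportsWordTimestamps)
{
    var result = recognizer.Transcribe(audio, language: "en", includeTimestamps: true);
    Console.WriteLine(result.Text); // `Text` is an assumed member of TranscriptionResult<T>
}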
TranscribeAsync(Tensor<T>, string?, bool, CancellationToken)
Transcribes audio to text asynchronously.
Task<TranscriptionResult<T>> TranscribeAsync(Tensor<T> audio, string? language = null, bool includeTimestamps = false, CancellationToken cancellationToken = default)
Parameters
audio (Tensor<T>): Audio waveform tensor [batch, samples] or [samples].
language (string): Optional language code (e.g., "en", "es"). Auto-detected if null.
includeTimestamps (bool): Whether to include word-level timestamps.
cancellationToken (CancellationToken): Cancellation token for async operation.
Returns
- Task<TranscriptionResult<T>>
Transcription result containing text and optional timestamps.
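A hedged async sketch with a timeout (must run inside an async method; recognizer and audio as in the earlier examples):
// Cancel the transcription automatically if it takes longer than two minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
TranscriptionResult<float> result =
    await recognizer.TranscribeAsync(audio, cancellationToken: cts.Token);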