Interface ITextToSpeech<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Interface for text-to-speech (TTS) models that synthesize spoken audio from text.
public interface ITextToSpeech<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
Text-to-speech models convert written text into natural-sounding spoken audio. Modern TTS systems use neural networks to produce high-quality, expressive speech that can sound nearly indistinguishable from human speakers.
For Beginners: TTS is like having a computer read text out loud to you.
How TTS works:
- Text is analyzed for pronunciation, emphasis, and pacing
- The model generates audio features (mel-spectrograms)
- A vocoder converts features to waveform audio
Common use cases:
- Accessibility (screen readers for visually impaired users)
- Voice assistants and chatbots
- Audiobook and podcast generation
- Language learning applications
Key features:
- Voice cloning: Make it sound like a specific person
- Emotion control: Express happiness, sadness, excitement
- Speed control: Speak faster or slower
This interface extends IFullModel<T, TInput, TOutput> with Tensor<T> as both the input and the output type, for tensor-based audio processing.
Properties
AvailableVoices
Gets the list of available built-in voices.
IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }
Property Value
- IReadOnlyList<VoiceInfo<T>>
Remarks
Each voice has unique characteristics (gender, age, accent, style).
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
Remarks
When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
SampleRate
Gets the sample rate of generated audio.
int SampleRate { get; }
Property Value
- int
Remarks
Common values: 22050 Hz (standard), 44100 Hz (high quality), 16000 Hz (telephony).
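The sample rate ties the length of a waveform to its playback duration: seconds equal samples per channel divided by SampleRate. The helper below is a minimal sketch of that arithmetic; the class and method names are illustrative and are not part of AiDotNet.

```csharp
using System;

// Illustrative helper (not part of AiDotNet): relates sample counts to durations.
public static class AudioDuration
{
    // Duration in seconds = samples per channel / sample rate in Hz.
    public static double SecondsFromSamples(long samplesPerChannel, int sampleRate)
    {
        if (sampleRate <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRate));
        if (samplesPerChannel < 0) throw new ArgumentOutOfRangeException(nameof(samplesPerChannel));
        return (double)samplesPerChannel / sampleRate;
    }

    // Number of samples per channel needed to hold the given duration.
    public static long SamplesFromSeconds(double seconds, int sampleRate)
    {
        if (sampleRate <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRate));
        if (seconds < 0) throw new ArgumentOutOfRangeException(nameof(seconds));
        return (long)Math.Round(seconds * sampleRate);
    }
}
```

For example, 44100 samples at 22050 Hz correspond to 2 seconds of audio, and 1.5 seconds at 16000 Hz need 24000 samples.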
SupportsEmotionControl
Gets whether this model supports emotional expression control.
bool SupportsEmotionControl { get; }
Property Value
- bool
SupportsStreaming
Gets whether this model supports streaming audio generation.
bool SupportsStreaming { get; }
Property Value
- bool
SupportsVoiceCloning
Gets whether this model supports voice cloning from reference audio.
bool SupportsVoiceCloning { get; }
Property Value
- bool
Remarks
For Beginners: Voice cloning lets you make the TTS sound like a specific person by providing a sample of their voice.
Methods
ExtractSpeakerEmbedding(Tensor<T>)
Extracts a speaker embedding from reference audio for voice cloning.
Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
- referenceAudio (Tensor<T>): Reference audio sample.
Returns
- Tensor<T>
Speaker embedding tensor that captures voice characteristics.
StartStreamingSession(string?, double)
Starts a streaming synthesis session for incremental audio generation.
IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)
Parameters
- voiceId (string): Optional voice identifier.
- speakingRate (double): Speed multiplier.
Returns
- IStreamingSynthesisSession<T>
A streaming session that can receive text incrementally.
Exceptions
- NotSupportedException
Thrown if streaming is not supported.
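Because a streaming session receives text incrementally, callers usually need to cut long input into smaller pieces first. The sketch below splits text at sentence-ending punctuation. It is only one possible chunking strategy, the TextChunker class is hypothetical, and it deliberately does not call any member of IStreamingSynthesisSession<T>, whose API is not shown on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of AiDotNet): prepares text for incremental feeding.
public static class TextChunker
{
    // Splits on whitespace that follows '.', '!' or '?', keeping the punctuation
    // attached to the sentence it ends.
    public static IReadOnlyList<string> SplitIntoSentences(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return Array.Empty<string>();

        string[] parts = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");
        var sentences = new List<string>(parts.Length);
        foreach (string part in parts)
        {
            if (part.Length > 0) sentences.Add(part);
        }
        return sentences;
    }
}
```

Each returned sentence can then be passed to the session one at a time, after confirming that SupportsStreaming is true.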
Synthesize(string, string?, double, double)
Synthesizes speech from text.
Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)
Parameters
- text (string): The text to speak.
- voiceId (string): Optional voice identifier. Uses the default voice if null.
- speakingRate (double): Speed multiplier (0.5 = half speed, 2.0 = double speed).
- pitch (double): Pitch adjustment in semitones (-12 to +12).
Returns
- Tensor<T>
Audio waveform tensor [samples] or [channels, samples].
Remarks
For Beginners: This is the main method for converting text to speech.
- Pass in text like "Hello, how are you?"
- Get back audio you can play through speakers
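The speakingRate and pitch parameters can be read as simple ratios. A semitone is one twelfth of an octave, so a shift of s semitones multiplies frequency by 2^(s/12), and a speed multiplier r divides the spoken duration by r. The following sketch shows only that arithmetic. How a given model applies the values internally is not specified here, and the ProsodyMath class is illustrative, not part of AiDotNet.

```csharp
using System;

// Illustrative helper (not part of AiDotNet): arithmetic behind rate and pitch values.
public static class ProsodyMath
{
    // Equal temperament: +12 semitones doubles the frequency, -12 halves it.
    public static double SemitonesToFrequencyRatio(double semitones)
    {
        return Math.Pow(2.0, semitones / 12.0);
    }

    // A rate of 2.0 (double speed) halves the duration; 0.5 doubles it.
    public static double ScaledDurationSeconds(double baseSeconds, double speakingRate)
    {
        if (speakingRate <= 0) throw new ArgumentOutOfRangeException(nameof(speakingRate));
        return baseSeconds / speakingRate;
    }
}
```

Under these assumptions, the documented pitch range of -12 to +12 semitones spans one octave down to one octave up.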
SynthesizeAsync(string, string?, double, double, CancellationToken)
Synthesizes speech from text asynchronously.
Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)
Parameters
- text (string): The text to speak.
- voiceId (string): Optional voice identifier. Uses the default voice if null.
- speakingRate (double): Speed multiplier (0.5 = half speed, 2.0 = double speed).
- pitch (double): Pitch adjustment in semitones (-12 to +12).
- cancellationToken (CancellationToken): Cancellation token for the async operation.
Returns
- Task<Tensor<T>>
A task that completes with the audio waveform tensor [samples] or [channels, samples].
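The cancellationToken parameter makes it possible to bound how long a synthesis call may run. The sketch below wraps any token-accepting async operation with a timeout using the standard CancellationTokenSource API. The SynthesisTimeout class is a hypothetical helper, and whether cancellation takes effect promptly depends on how the implementing model observes the token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of AiDotNet): adds a timeout to a cancellable operation.
public static class SynthesisTimeout
{
    public static async Task<TResult> RunWithTimeoutAsync<TResult>(
        Func<CancellationToken, Task<TResult>> operation,
        TimeSpan timeout,
        CancellationToken callerToken = default)
    {
        // The linked source cancels when the caller cancels or when the timeout elapses.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        linked.CancelAfter(timeout);
        return await operation(linked.Token).ConfigureAwait(false);
    }
}
```

Assuming `tts` is an ITextToSpeech<float> instance, a call might look like `await SynthesisTimeout.RunWithTimeoutAsync(ct => tts.SynthesizeAsync("Hello", cancellationToken: ct), TimeSpan.FromSeconds(30))`. A timeout surfaces as an OperationCanceledException if the operation honors the token.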
SynthesizeWithEmotion(string, string, double, string?, double)
Synthesizes speech with emotional expression.
Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)
Parameters
- text (string): The text to speak.
- emotion (string): The emotion to express (e.g., "happy", "sad", "angry").
- emotionIntensity (double): Intensity of the emotion (0.0 to 1.0).
- voiceId (string): Optional voice identifier.
- speakingRate (double): Speed multiplier.
Returns
- Tensor<T>
Audio waveform tensor with emotional expression.
Exceptions
- NotSupportedException
Thrown if emotion control is not supported.
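Since optional features throw NotSupportedException when unavailable, callers may want to check the matching capability flag first and fall back to a plain call. The generic helper below is one way to sketch that pattern. CapabilityFallback is a hypothetical name, and the delegates stand in for the actual interface calls.

```csharp
using System;

// Hypothetical helper (not part of AiDotNet): capability check with a fallback path.
public static class CapabilityFallback
{
    public static TResult InvokeOrFallback<TResult>(
        bool isSupported,
        Func<TResult> preferred,
        Func<TResult> fallback)
    {
        // Skip the preferred path entirely when the capability flag is false.
        if (!isSupported) return fallback();

        try
        {
            return preferred();
        }
        catch (NotSupportedException)
        {
            // Safety net in case the flag and the implementation disagree.
            return fallback();
        }
    }
}
```

Assuming `tts` is an ITextToSpeech<float> instance, usage might look like `CapabilityFallback.InvokeOrFallback(tts.SupportsEmotionControl, () => tts.SynthesizeWithEmotion(text, "happy"), () => tts.Synthesize(text))`. The same shape applies to SupportsVoiceCloning and SupportsStreaming.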
SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)
Synthesizes speech using a cloned voice from reference audio.
Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)
Parameters
- text (string): The text to speak.
- referenceAudio (Tensor<T>): Reference audio sample of the voice to clone.
- speakingRate (double): Speed multiplier.
- pitch (double): Pitch adjustment in semitones.
Returns
- Tensor<T>
Audio waveform tensor matching the reference voice.
Remarks
For Beginners: This creates speech that sounds like the person in the reference audio. The model learns the voice characteristics from the sample and applies them to new text.
Exceptions
- NotSupportedException
Thrown if voice cloning is not supported.