Interface ITextToSpeech<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Interface for text-to-speech (TTS) models that synthesize spoken audio from text.

public interface ITextToSpeech<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.


Remarks

Text-to-speech models convert written text into natural-sounding spoken audio. Modern TTS systems use neural networks to produce high-quality, expressive speech that can sound nearly indistinguishable from human speakers.

For Beginners: TTS is like having a computer read text out loud to you.

How TTS works:

  1. Text is analyzed for pronunciation, emphasis, and pacing
  2. The model generates audio features (mel-spectrograms)
  3. A vocoder converts features to waveform audio

Common use cases:

  • Accessibility (screen readers for visually impaired)
  • Voice assistants and chatbots
  • Audiobook and podcast generation
  • Language learning applications

Key features:

  • Voice cloning: Make it sound like a specific person
  • Emotion control: Express happiness, sadness, excitement
  • Speed control: Speak faster or slower

This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.

Properties

AvailableVoices

Gets the list of available built-in voices.

IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>

Remarks

Each voice has unique characteristics (gender, age, accent, style).
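
As a sketch, enumerating the built-in voices might look like this. Here `tts` is assumed to be some implementation of ITextToSpeech<float>, and the VoiceInfo<T> members shown (Id, Name) are assumptions for illustration:

```csharp
// Hypothetical sketch: `tts` is an ITextToSpeech<float> implementation, and
// the VoiceInfo<T> members used below (Id, Name) are assumed for illustration.
foreach (var voice in tts.AvailableVoices)
{
    Console.WriteLine($"{voice.Id}: {voice.Name}");
}
```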

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.

SampleRate

Gets the sample rate of generated audio.

int SampleRate { get; }

Property Value

int

Remarks

Common values: 22050 Hz (standard), 44100 Hz (high quality), 16000 Hz (telephony).
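
The sample rate relates sample count to playback duration (duration in seconds = samples / sample rate). A minimal sketch, assuming `tts` is some ITextToSpeech<float> implementation and that Tensor<T> exposes a Shape array whose last dimension holds samples:

```csharp
// Duration of a generated clip: samples divided by sample rate.
// Assumes a mono or [channels, samples] waveform tensor `audio`.
int sampleCount = audio.Shape[^1];
double seconds = (double)sampleCount / tts.SampleRate;
Console.WriteLine($"{seconds:F2} s of audio at {tts.SampleRate} Hz");
```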

SupportsEmotionControl

Gets whether this model supports emotional expression control.

bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

bool SupportsVoiceCloning { get; }

Property Value

bool

Remarks

For Beginners: Voice cloning lets you make the TTS sound like a specific person by providing a sample of their voice.

Methods

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio for voice cloning.

Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio sample.

Returns

Tensor<T>

Speaker embedding tensor that captures voice characteristics.
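
A minimal sketch, assuming `tts` is some ITextToSpeech<float> implementation and `referenceAudio` is a waveform tensor loaded elsewhere:

```csharp
// Capture a compact representation of the reference speaker's voice.
// `referenceAudio` is a waveform Tensor<float> loaded elsewhere.
Tensor<float> embedding = tts.ExtractSpeakerEmbedding(referenceAudio);
```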

StartStreamingSession(string?, double)

Starts a streaming synthesis session for incremental audio generation.

IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string

Optional voice identifier.

speakingRate double

Speed multiplier.

Returns

IStreamingSynthesisSession<T>

A streaming session that can receive text incrementally.

Exceptions

NotSupportedException

Thrown if streaming is not supported.
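
Since streaming is optional, callers should guard on SupportsStreaming. The sketch below assumes an ITextToSpeech<float> implementation `tts`; the members of IStreamingSynthesisSession<T> are not shown here because they depend on that interface's definition:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsStreaming)
{
    var session = tts.StartStreamingSession(voiceId: null, speakingRate: 1.0);
    // Feed text incrementally and consume audio chunks as they are produced;
    // the exact member names depend on IStreamingSynthesisSession<T>.
}
```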

Synthesize(string, string?, double, double)

Synthesizes speech from text.

Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to speak.

voiceId string

Optional voice identifier. Uses default if null.

speakingRate double

Speed multiplier (0.5 = half speed, 2.0 = double speed).

pitch double

Pitch adjustment in semitones (-12 to +12).

Returns

Tensor<T>

Audio waveform tensor [samples] or [channels, samples].

Remarks

For Beginners: This is the main method for converting text to speech.

  • Pass in text like "Hello, how are you?"
  • Get back audio you can play through speakers
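
A minimal usage sketch, assuming `tts` is some implementation of ITextToSpeech<float>:

```csharp
// Synthesize with the default voice at normal speed and pitch.
Tensor<float> audio = tts.Synthesize(
    "Hello, how are you?",
    voiceId: null,        // null selects the default voice
    speakingRate: 1.0,    // 1.0 = normal speed
    pitch: 0.0);          // 0 = no pitch shift
// `audio` holds the waveform at tts.SampleRate samples per second.
```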

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to speak.

voiceId string

Optional voice identifier. Uses default if null.

speakingRate double

Speed multiplier (0.5 = half speed, 2.0 = double speed).

pitch double

Pitch adjustment in semitones (-12 to +12).

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<Tensor<T>>

Audio waveform tensor [samples] or [channels, samples].
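
A sketch of cancellable asynchronous synthesis, assuming `tts` is some ITextToSpeech<float> implementation:

```csharp
// Cancel synthesis if it has not completed within 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
Tensor<float> audio = await tts.SynthesizeAsync(
    "This may take a while for long passages.",
    cancellationToken: cts.Token);
```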

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to speak.

emotion string

The emotion to express (e.g., "happy", "sad", "angry").

emotionIntensity double

Intensity of the emotion (0.0 to 1.0).

voiceId string

Optional voice identifier.

speakingRate double

Speed multiplier.

Returns

Tensor<T>

Audio waveform tensor with emotional expression.

Exceptions

NotSupportedException

Thrown if emotion control is not supported.
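
Because emotion control is optional, callers should check SupportsEmotionControl first. A sketch, assuming `tts` is some ITextToSpeech<float> implementation:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsEmotionControl)
{
    Tensor<float> audio = tts.SynthesizeWithEmotion(
        "We won the championship!",
        emotion: "happy",
        emotionIntensity: 0.8);  // 0.0 = subtle, 1.0 = maximum
}
```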

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to speak.

referenceAudio Tensor<T>

Reference audio sample of the voice to clone.

speakingRate double

Speed multiplier.

pitch double

Pitch adjustment in semitones.

Returns

Tensor<T>

Audio waveform tensor matching the reference voice.

Remarks

For Beginners: This creates speech that sounds like the person in the reference audio. The model learns the voice characteristics from the sample and applies them to new text.

Exceptions

NotSupportedException

Thrown if voice cloning is not supported.
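
A sketch combining the capability check with cloning, assuming `tts` is some ITextToSpeech<float> implementation and `referenceAudio` is a waveform tensor of the target speaker loaded elsewhere:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsVoiceCloning)
{
    Tensor<float> audio = tts.SynthesizeWithVoiceCloning(
        "This will sound like the reference speaker.",
        referenceAudio);
}
```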