Interface ITextToSpeech<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Interface for text-to-speech (TTS) models that synthesize spoken audio from text.

public interface ITextToSpeech<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.


Remarks

Text-to-speech models convert written text into natural-sounding spoken audio. Modern TTS systems use neural networks to produce high-quality, expressive speech that can sound nearly indistinguishable from human speakers.

For Beginners: TTS is like having a computer read text out loud to you.

How TTS works:

  1. Text is analyzed for pronunciation, emphasis, and pacing
  2. The model generates audio features (mel-spectrograms)
  3. A vocoder converts features to waveform audio

Common use cases:

  • Accessibility (screen readers for visually impaired)
  • Voice assistants and chatbots
  • Audiobook and podcast generation
  • Language learning applications

Key features:

  • Voice cloning: Make it sound like a specific person
  • Emotion control: Express happiness, sadness, excitement
  • Speed control: Speak faster or slower

This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.

Properties

AvailableVoices

Gets the list of available built-in voices.

IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>

Remarks

Each voice has unique characteristics (gender, age, accent, style).
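
As a sketch, enumerating the built-in voices might look like this. Here `tts` is assumed to be some implementation of ITextToSpeech<float>, and the VoiceInfo<T> members shown (Id, Name) are assumptions for illustration:

```csharp
// Hypothetical sketch: `tts` is an ITextToSpeech<float> implementation, and
// the VoiceInfo<T> members used below (Id, Name) are assumed for illustration.
foreach (var voice in tts.AvailableVoices)
{
    Console.WriteLine($"{voice.Id}: {voice.Name}");
}
```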

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.

SampleRate

Gets the sample rate of generated audio.

int SampleRate { get; }

Property Value

int

Remarks

Common values: 22050 Hz (standard), 44100 Hz (high quality), 16000 Hz (telephony).
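
The sample rate relates sample count to playback duration (duration in seconds = samples / sample rate). A minimal sketch, assuming `tts` is some ITextToSpeech<float> implementation and that Tensor<T> exposes a Shape array whose last dimension holds samples:

```csharp
// Duration of a generated clip: samples divided by sample rate.
// Assumes a mono or [channels, samples] waveform tensor `audio`.
int sampleCount = audio.Shape[^1];
double seconds = (double)sampleCount / tts.SampleRate;
Console.WriteLine($"{seconds:F2} s of audio at {tts.SampleRate} Hz");
```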

SupportsEmotionControl

Gets whether this model supports emotional expression control.

bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

bool SupportsVoiceCloning { get; }

Property Value

bool

Remarks

For Beginners: Voice cloning lets you make the TTS sound like a specific person by providing a sample of their voice.

Methods

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio for voice cloning.

Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio sample.

Returns

Tensor<T>

Speaker embedding tensor that captures voice characteristics.
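
A minimal sketch, assuming `tts` is some ITextToSpeech<float> implementation and `referenceAudio` is a waveform tensor loaded elsewhere:

```csharp
// Capture a compact representation of the reference speaker's voice.
// `referenceAudio` is a waveform Tensor<float> loaded elsewhere.
Tensor<float> embedding = tts.ExtractSpeakerEmbedding(referenceAudio);
```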

StartStreamingSession(string?, double)

Starts a streaming synthesis session for incremental audio generation.

IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string

Optional voice identifier.

speakingRate double

Speed multiplier.

Returns

IStreamingSynthesisSession<T>

A streaming session that can receive text incrementally.

Exceptions

NotSupportedException

Thrown if streaming is not supported.
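
Since streaming is optional, callers should guard on SupportsStreaming. The sketch below assumes an ITextToSpeech<float> implementation `tts`; the members of IStreamingSynthesisSession<T> are not shown here because they depend on that interface's definition:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsStreaming)
{
    var session = tts.StartStreamingSession(voiceId: null, speakingRate: 1.0);
    // Feed text incrementally and consume audio chunks as they are produced;
    // the exact member names depend on IStreamingSynthesisSession<T>.
}
```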

Synthesize(string, string?, double, double)

Synthesizes speech from text.

Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to speak.

voiceId string

Optional voice identifier. Uses default if null.

speakingRate double

Speed multiplier (0.5 = half speed, 2.0 = double speed).

pitch double

Pitch adjustment in semitones (-12 to +12).

Returns

Tensor<T>

Audio waveform tensor [samples] or [channels, samples].

Remarks

For Beginners: This is the main method for converting text to speech.

  • Pass in text like "Hello, how are you?"
  • Get back audio you can play through speakers
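
A minimal usage sketch, assuming `tts` is some implementation of ITextToSpeech<float>:

```csharp
// Synthesize with the default voice at normal speed and pitch.
Tensor<float> audio = tts.Synthesize(
    "Hello, how are you?",
    voiceId: null,        // null selects the default voice
    speakingRate: 1.0,    // 1.0 = normal speed
    pitch: 0.0);          // 0 = no pitch shift
// `audio` holds the waveform at tts.SampleRate samples per second.
```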

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to speak.

voiceId string

Optional voice identifier. Uses default if null.

speakingRate double

Speed multiplier (0.5 = half speed, 2.0 = double speed).

pitch double

Pitch adjustment in semitones (-12 to +12).

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<Tensor<T>>

Audio waveform tensor [samples] or [channels, samples].
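
A sketch of cancellable asynchronous synthesis, assuming `tts` is some ITextToSpeech<float> implementation:

```csharp
// Cancel synthesis if it has not completed within 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
Tensor<float> audio = await tts.SynthesizeAsync(
    "This may take a while for long passages.",
    cancellationToken: cts.Token);
```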

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to speak.

emotion string

The emotion to express (e.g., "happy", "sad", "angry").

emotionIntensity double

Intensity of the emotion (0.0 to 1.0).

voiceId string

Optional voice identifier.

speakingRate double

Speed multiplier.

Returns

Tensor<T>

Audio waveform tensor with emotional expression.

Exceptions

NotSupportedException

Thrown if emotion control is not supported.
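
Because emotion control is optional, callers should check SupportsEmotionControl first. A sketch, assuming `tts` is some ITextToSpeech<float> implementation:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsEmotionControl)
{
    Tensor<float> audio = tts.SynthesizeWithEmotion(
        "We won the championship!",
        emotion: "happy",
        emotionIntensity: 0.8);  // 0.0 = subtle, 1.0 = maximum
}
```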

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to speak.

referenceAudio Tensor<T>

Reference audio sample of the voice to clone.

speakingRate double

Speed multiplier.

pitch double

Pitch adjustment in semitones.

Returns

Tensor<T>

Audio waveform tensor matching the reference voice.

Remarks

For Beginners: This creates speech that sounds like the person in the reference audio. The model learns the voice characteristics from the sample and applies them to new text.

Exceptions

NotSupportedException

Thrown if voice cloning is not supported.
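
A sketch combining the capability check with cloning, assuming `tts` is some ITextToSpeech<float> implementation and `referenceAudio` is a waveform tensor of the target speaker loaded elsewhere:

```csharp
// Guard on the capability flag to avoid NotSupportedException.
if (tts.SupportsVoiceCloning)
{
    Tensor<float> audio = tts.SynthesizeWithVoiceCloning(
        "This will sound like the reference speaker.",
        referenceAudio);
}
```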