Class TtsModel<T>

Namespace
AiDotNet.Audio.TextToSpeech
Assembly
AiDotNet.dll

Text-to-speech model for synthesizing speech from text.

public class TtsModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
TtsModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ITextToSpeech<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

This TTS model uses a two-stage pipeline:

  1. Acoustic Model (FastSpeech2): Converts text/phonemes to a mel spectrogram
  2. Vocoder (HiFi-GAN or Griffin-Lim): Converts the mel spectrogram to an audio waveform

For Beginners: Text-to-Speech works like this:

  1. Your text is converted to phonemes (speech sounds)
  2. The acoustic model predicts what the speech should "look like" (a mel spectrogram)
  3. The vocoder makes it actually sound like speech

This class supports two modes:

  • ONNX Mode: Load pretrained FastSpeech2/HiFi-GAN models for instant synthesis
  • Native Mode: Train your own TTS model from scratch

Usage (ONNX Mode):

var tts = new TtsModel<float>(
    architecture,
    acousticModelPath: "path/to/fastspeech2.onnx",
    vocoderModelPath: "path/to/hifigan.onnx");

var audio = tts.Synthesize("Hello, world!");

Usage (Native Training Mode):

var tts = new TtsModel<float>(
    architecture,
    optimizer: new AdamOptimizer<float>(),
    lossFunction: new MeanSquaredErrorLoss<float>());

tts.Train(phonemeInput, expectedMelSpectrogram);

Constructors

TtsModel(NeuralNetworkArchitecture<T>, int, int, double, double, double, int?, string?, int, int, int, int, int, int, int, int, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a TtsModel for native training mode.

public TtsModel(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double pitchShift = 0, double energy = 1, int? speakerId = null, string? language = null, int hiddenDim = 256, int numHeads = 4, int numEncoderLayers = 4, int numDecoderLayers = 4, int maxPhonemeLength = 256, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

sampleRate int

Output sample rate in Hz. Default is 22050 (standard for TTS).

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

pitchShift double

Pitch shift in semitones. 0 = normal. Default is 0.

energy double

Energy/volume level. 1.0 = normal. Default is 1.0.

speakerId int?

Speaker ID for multi-speaker models. Default is null.

language string?

Language code for multi-lingual models. Default is null.

hiddenDim int

Hidden dimension for acoustic model. Default is 256.

numHeads int

Number of attention heads. Default is 4.

numEncoderLayers int

Number of encoder layers. Default is 4.

numDecoderLayers int

Number of decoder layers. Default is 4.

maxPhonemeLength int

Maximum phoneme sequence length. Default is 256.

fftSize int

FFT size for Griffin-Lim. Default is 1024.

hopLength int

Hop length for Griffin-Lim. Default is 256.

griffinLimIterations int

Number of Griffin-Lim iterations. Default is 60.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?

Optimizer for training. If null, a default Adam optimizer is used.

lossFunction ILossFunction<T>?

Loss function for training. If null, MSE loss is used.

Remarks

For Beginners: Use this constructor to train your own TTS model.

You'll need a dataset of (phoneme sequence, mel spectrogram) pairs. Training TTS from scratch requires significant data and compute resources.

Example:

var tts = new TtsModel<float>(
    architecture,
    optimizer: new AdamOptimizer<float>(),
    lossFunction: new MeanSquaredErrorLoss<float>());

// Train on your dataset
tts.Train(phonemeInput, expectedMelSpectrogram);

TtsModel(NeuralNetworkArchitecture<T>, string, string?, int, int, double, double, double, int?, string?, bool, int, int, int, OnnxModelOptions?)

Creates a TtsModel for ONNX inference with pretrained models.

public TtsModel(NeuralNetworkArchitecture<T> architecture, string acousticModelPath, string? vocoderModelPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double pitchShift = 0, double energy = 1, int? speakerId = null, string? language = null, bool useGriffinLimFallback = true, int griffinLimIterations = 60, int fftSize = 1024, int hopLength = 256, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

acousticModelPath string

Required path to acoustic model ONNX file (e.g., FastSpeech2).

vocoderModelPath string?

Optional path to vocoder ONNX file (e.g., HiFi-GAN). If null, uses Griffin-Lim.

sampleRate int

Output sample rate in Hz. Default is 22050 (standard for TTS).

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

pitchShift double

Pitch shift in semitones. 0 = normal. Default is 0.

energy double

Energy/volume level. 1.0 = normal. Default is 1.0.

speakerId int?

Speaker ID for multi-speaker models. Default is null.

language string?

Language code for multi-lingual models. Default is null.

useGriffinLimFallback bool

Whether to use Griffin-Lim as fallback. Default is true.

griffinLimIterations int

Number of Griffin-Lim iterations. Default is 60.

fftSize int

FFT size for Griffin-Lim. Default is 1024.

hopLength int

Hop length for Griffin-Lim. Default is 256.

onnxOptions OnnxModelOptions?

ONNX runtime options.

Remarks

For Beginners: Use this constructor when you have pretrained TTS models.

You need at least an acoustic model (which converts text to a mel spectrogram). The vocoder (which converts the mel spectrogram to audio) is optional; Griffin-Lim is used as a fallback when no vocoder is provided.

Example:

var tts = new TtsModel<float>(
    architecture,
    acousticModelPath: "fastspeech2.onnx",
    vocoderModelPath: "hifigan.onnx");

Properties

AvailableVoices

Gets the list of available built-in voices.

public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>
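
Example (a minimal sketch; the members of VoiceInfo<T> are not documented on this page, so each entry is printed via its default string representation):

// Enumerate the built-in voices.
foreach (var voice in tts.AvailableVoices)
{
    Console.WriteLine(voice);
}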

IsReady

Gets whether the model is ready for synthesis.

public bool IsReady { get; }

Property Value

bool

SupportsEmotionControl

Gets whether this model supports emotional expression control.

public bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

public bool SupportsVoiceCloning { get; }

Property Value

bool

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio for voice cloning.

public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio of the speaker whose embedding to extract.

Returns

Tensor<T>

The extracted speaker embedding.
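
Example (a minimal sketch; LoadReferenceAudio is a hypothetical helper standing in for your own audio-loading code):

// Extract a speaker embedding once, then reuse it across synthesis calls.
Tensor<float> referenceAudio = LoadReferenceAudio();  // hypothetical helper
var speakerEmbedding = tts.ExtractSpeakerEmbedding(referenceAudio);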

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the layers for the TTS model.

protected override void InitializeLayers()

Remarks

Follows the golden standard pattern:

  1. Check if in native mode (ONNX mode returns early)
  2. Use Architecture.Layers if provided by user
  3. Fall back to LayerHelper.CreateDefaultTtsLayers() otherwise
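
A conceptual sketch of this pattern (illustrative only; IsOnnxMode and the Layers assignment are assumptions standing in for the actual implementation):

protected override void InitializeLayers()
{
    // ONNX mode delegates to the pretrained models, so no native layers are built.
    if (IsOnnxMode)  // hypothetical mode check
        return;

    // Prefer user-provided layers; otherwise fall back to the library defaults.
    Layers = Architecture.Layers ?? LayerHelper.CreateDefaultTtsLayers();
}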

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>
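
Example (a minimal sketch; per the class remarks, the input is a phoneme tensor prepared by your own pipeline):

// Run the model directly on an already-prepared phoneme tensor.
var output = tts.Predict(phonemeInput);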

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

StartStreamingSession(string?, double)

Starts a streaming synthesis session.

public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string?

Optional voice ID. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

Returns

IStreamingSynthesisSession<T>
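
Example (a minimal sketch; only the documented parameters are shown, since the members of IStreamingSynthesisSession<T> are not listed on this page):

// Open a streaming session with the default voice at normal speed,
// then feed text and consume audio chunks through the session's own API.
var session = tts.StartStreamingSession(voiceId: null, speakingRate: 1.0);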

Synthesize(string, string?, double, double)

Synthesizes speech from text.

public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

voiceId string?

Optional voice ID. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

pitch double

Pitch shift in semitones. 0 = normal. Default is 0.

Returns

Tensor<T>

The synthesized audio waveform.
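
Example (a minimal sketch; "en-US-1" is a hypothetical voice ID):

// Synthesize slightly faster speech, raised by two semitones.
var audio = tts.Synthesize(
    "Welcome to AiDotNet!",
    voiceId: "en-US-1",   // hypothetical voice ID
    speakingRate: 1.1,    // 10% faster than normal
    pitch: 2.0);          // +2 semitones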

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to synthesize.

voiceId string?

Optional voice ID. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

pitch double

Pitch shift in semitones. 0 = normal. Default is 0.

cancellationToken CancellationToken

Token used to cancel the synthesis operation.

Returns

Task<Tensor<T>>

A task that resolves to the synthesized audio waveform.
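
Example (a minimal sketch showing cancellation after a timeout):

// Cancel the synthesis if it takes longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var audio = await tts.SynthesizeAsync(
    "This call does not block the calling thread.",
    cancellationToken: cts.Token);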

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to synthesize.

emotion string

Name of the emotion to express.

emotionIntensity double

Intensity of the emotional expression. Default is 0.5.

voiceId string?

Optional voice ID. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

Returns

Tensor<T>

The synthesized audio waveform.
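
Example (a minimal sketch; "happy" is a hypothetical emotion name, as the supported emotions are not listed on this page):

// Synthesize with a fairly strong emotional expression.
var audio = tts.SynthesizeWithEmotion(
    "We won the championship!",
    emotion: "happy",        // hypothetical emotion name
    emotionIntensity: 0.8);  // stronger than the 0.5 default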

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

referenceAudio Tensor<T>

Reference audio of the voice to clone.

speakingRate double

Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.

pitch double

Pitch shift in semitones. 0 = normal. Default is 0.

Returns

Tensor<T>

The synthesized audio waveform.
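
Example (a minimal sketch; LoadReferenceAudio is a hypothetical helper standing in for your own audio-loading code):

// Speak new text in the voice captured from the reference audio.
Tensor<float> referenceAudio = LoadReferenceAudio();  // hypothetical helper
var audio = tts.SynthesizeWithVoiceCloning(
    "This sentence is spoken in the cloned voice.",
    referenceAudio);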

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

Input tensor (e.g., phoneme sequences).

expectedOutput Tensor<T>

Expected output tensor (e.g., target mel spectrograms).
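
Example (a minimal sketch of a training loop; trainingPairs is a hypothetical collection of (phoneme tensor, mel spectrogram tensor) pairs from your data pipeline):

// One pass over the dataset; repeat for multiple epochs as needed.
foreach (var (phonemes, melSpectrogram) in trainingPairs)  // hypothetical collection
{
    tts.Train(phonemes, melSpectrogram);
}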

UpdateParameters(Vector<T>)

Updates model parameters using gradient descent.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>