Class Tacotron2Model<T>

Namespace
AiDotNet.Audio.TextToSpeech
Assembly
AiDotNet.dll

Tacotron2 attention-based text-to-speech model.

public class Tacotron2Model<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
Tacotron2Model<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ITextToSpeech<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

Tacotron2 is a classic neural TTS model that generates mel spectrograms from text. It uses an encoder-attention-decoder architecture with:

  • Character/phoneme encoder with convolutional layers
  • Location-sensitive attention for alignment
  • Autoregressive LSTM decoder
  • Post-net for mel spectrogram refinement

For Beginners: Tacotron2 is a two-stage TTS system:

Stage 1 (Tacotron2): Text -> Mel Spectrogram
Stage 2 (Vocoder): Mel Spectrogram -> Audio Waveform

Key characteristics:

  • Autoregressive: Generates one mel frame at a time
  • Attention-based: Learns to align text with audio
  • High quality but slower than parallel models like VITS

Two ways to use this class:

  1. ONNX Mode: Load pretrained Tacotron2 models for inference
  2. Native Mode: Train your own TTS model from scratch

ONNX Mode Example:

var tacotron = new Tacotron2Model<float>(
    architecture,
    acousticModelPath: "tacotron2.onnx",
    vocoderPath: "hifigan.onnx");
var audio = tacotron.Synthesize("Hello, world!");

Training Mode Example:

var tacotron = new Tacotron2Model<float>(architecture);
tacotron.Train(phonemeInput, expectedMelSpectrogram);

Constructors

Tacotron2Model(NeuralNetworkArchitecture<T>, int, int, double, int, int, int, int, int, int, int, int, int, int, int, double, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a Tacotron2 model for native training mode.

public Tacotron2Model(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, int vocabSize = 148, int embeddingDim = 512, int encoderDim = 512, int decoderDim = 1024, int attentionDim = 128, int prenetDim = 256, int postnetEmbeddingDim = 512, int numEncoderConvLayers = 3, int numPostnetConvLayers = 5, int numMelsPerFrame = 2, int maxDecoderSteps = 1000, double stopThreshold = 0.5, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

vocabSize int

Character/phoneme vocabulary size. Default is 148.

embeddingDim int

Embedding dimension. Default is 512.

encoderDim int

Encoder hidden dimension. Default is 512.

decoderDim int

Decoder hidden dimension. Default is 1024.

attentionDim int

Attention dimension. Default is 128.

prenetDim int

Pre-net dimension. Default is 256.

postnetEmbeddingDim int

Post-net embedding dimension. Default is 512.

numEncoderConvLayers int

Number of encoder conv layers. Default is 3.

numPostnetConvLayers int

Number of post-net conv layers. Default is 5.

numMelsPerFrame int

Mel frames per decoder step. Default is 2.

maxDecoderSteps int

Maximum decoder steps. Default is 1000.

stopThreshold double

Stop token threshold. Default is 0.5.

fftSize int

FFT size for Griffin-Lim. Default is 1024.

hopLength int

Hop length in samples. Default is 256.

griffinLimIterations int

Griffin-Lim iterations. Default is 60.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, uses Adam.

lossFunction ILossFunction<T>

Loss function for training. If null, uses MSE.

Remarks

For Beginners: Use this constructor to train your own Tacotron2 model.

Training Tacotron2 requires:

  1. Paired text-audio data with aligned phoneme sequences
  2. A GPU (training typically takes many hours)
  3. Teacher forcing during the training loop

Example:

var tacotron = new Tacotron2Model<float>(
    architecture,
    embeddingDim: 512,
    encoderDim: 512,
    decoderDim: 1024);

// Training loop
tacotron.Train(phonemeInput, expectedMelSpectrogram);
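
These framing defaults determine how much audio the decoder can produce. A quick sketch of the arithmetic, assuming the standard STFT convention that each mel frame advances hopLength samples:

int sampleRate = 22050;
int hopLength = 256;
int maxDecoderSteps = 1000;
int numMelsPerFrame = 2;

// Each mel frame covers hopLength samples of output audio.
double framesPerSecond = (double)sampleRate / hopLength;                  // ~86.1 frames/s
double maxSeconds = maxDecoderSteps * numMelsPerFrame / framesPerSecond;  // ~23.2 s of audio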

Tacotron2Model(NeuralNetworkArchitecture<T>, string, string?, int, int, double, int, double, int, int, int, OnnxModelOptions?)

Creates a Tacotron2 model for ONNX inference with pretrained models.

public Tacotron2Model(NeuralNetworkArchitecture<T> architecture, string acousticModelPath, string? vocoderPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, int maxDecoderSteps = 1000, double stopThreshold = 0.5, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

acousticModelPath string

Path to the Tacotron2 ONNX model.

vocoderPath string

Optional path to a vocoder ONNX model (HiFi-GAN or WaveGlow). Falls back to Griffin-Lim if null.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

maxDecoderSteps int

Maximum decoder steps. Default is 1000.

stopThreshold double

Stop token threshold. Default is 0.5.

fftSize int

FFT size for Griffin-Lim. Default is 1024.

hopLength int

Hop length in samples. Default is 256.

griffinLimIterations int

Griffin-Lim iterations. Default is 60.

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor with pretrained Tacotron2 models.

You need at least an acoustic model (Tacotron2). The vocoder is optional; Griffin-Lim is used as a fallback when no vocoder is provided.

Example:

var tacotron = new Tacotron2Model<float>(
    architecture,
    acousticModelPath: "tacotron2.onnx",
    vocoderPath: "hifigan.onnx");

Properties

AvailableVoices

Gets the list of available built-in voices.

public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>

IsReady

Gets whether the model is ready for synthesis.

public bool IsReady { get; }

Property Value

bool

MaxDecoderSteps

Gets the maximum decoder steps.

public int MaxDecoderSteps { get; }

Property Value

int

SupportsEmotionControl

Gets whether this model supports emotional expression control.

public bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

public bool SupportsVoiceCloning { get; }

Property Value

bool
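
The capability flags above can be checked before calling the corresponding synthesis methods. A minimal sketch (actual values depend on how the model was constructed):

var tacotron = new Tacotron2Model<float>(
    architecture,
    acousticModelPath: "tacotron2.onnx");

if (!tacotron.IsReady)
    throw new InvalidOperationException("Model is not ready for synthesis.");

foreach (var voice in tacotron.AvailableVoices)
    Console.WriteLine(voice); // enumerate built-in voices

if (tacotron.SupportsVoiceCloning)
{
    // Safe to call ExtractSpeakerEmbedding / SynthesizeWithVoiceCloning here.
}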

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True to release both managed and unmanaged resources; false to release only unmanaged resources.

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio.

public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio to extract the speaker embedding from.

Returns

Tensor<T>

The extracted speaker embedding.
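
A minimal usage sketch; referenceAudio is assumed to be a waveform tensor containing a few seconds of clean speech from the target speaker:

// referenceAudio: waveform tensor from the target speaker
Tensor<float> embedding = tacotron.ExtractSpeakerEmbedding(referenceAudio);
// The embedding characterizes the speaker; see SynthesizeWithVoiceCloning below.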

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes layers for ONNX inference mode.

protected override void InitializeLayers()

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>
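
In native mode, Predict presumably follows the same tensor convention as Train (see the training example in the constructor Remarks); a sketch under that assumption:

// phonemeInput: encoded phoneme IDs, as in the Train example
Tensor<float> mel = tacotron.Predict(phonemeInput); // predicted mel spectrogram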

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

StartStreamingSession(string?, double)

Starts a streaming synthesis session.

public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

IStreamingSynthesisSession<T>
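
A usage sketch; only the calls shown here are documented on this page, and the session's own members are left to IStreamingSynthesisSession<T>:

if (tacotron.SupportsStreaming)
{
    var session = tacotron.StartStreamingSession(voiceId: null, speakingRate: 1.0);
    // Feed text and consume audio chunks through the IStreamingSynthesisSession<T>
    // members (not documented on this page).
}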

Synthesize(string, string?, double, double)

Synthesizes speech from text.

public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0.

Returns

Tensor<T>

The synthesized audio waveform.
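
Example (the rate and pitch values are illustrative):

// Default voice, slightly slower speech, unchanged pitch
Tensor<float> audio = tacotron.Synthesize(
    "The quick brown fox jumps over the lazy dog.",
    voiceId: null,
    speakingRate: 0.9,
    pitch: 0);
// audio holds the waveform at the configured sample rate (22050 Hz by default).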

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0.

cancellationToken CancellationToken

Token used to cancel the synthesis operation.

Returns

Task<Tensor<T>>

A task that produces the synthesized audio waveform.
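
Example with a 30-second timeout via CancellationToken:

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
Tensor<float> audio = await tacotron.SynthesizeAsync(
    "Hello, world!",
    cancellationToken: cts.Token);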

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to synthesize.

emotion string

The emotion to express. Supported values are model-dependent.

emotionIntensity double

Emotion intensity. Default is 0.5.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

Tensor<T>

The synthesized audio waveform.
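
Example; the "happy" label is illustrative (supported emotion labels are model-dependent), and the call is gated on SupportsEmotionControl:

if (tacotron.SupportsEmotionControl)
{
    Tensor<float> audio = tacotron.SynthesizeWithEmotion(
        "I can't believe we won!",
        emotion: "happy",        // illustrative; supported labels are model-dependent
        emotionIntensity: 0.8);
}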

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

referenceAudio Tensor<T>

Reference audio of the target speaker.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0.

Returns

Tensor<T>

The synthesized audio waveform in the cloned voice.
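
Example; referenceAudio is assumed to be a waveform tensor of the target speaker, and the call is gated on SupportsVoiceCloning:

if (tacotron.SupportsVoiceCloning)
{
    // referenceAudio: a few seconds of clean speech from the target speaker
    Tensor<float> audio = tacotron.SynthesizeWithVoiceCloning(
        "This sentence is spoken in the cloned voice.",
        referenceAudio);
}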

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input tensor (phoneme sequence, as in the training examples above).

expectedOutput Tensor<T>

The expected output tensor (target mel spectrogram).
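
A minimal epoch loop extending the training example from the constructor Remarks; trainingPairs is an assumed collection of (phoneme tensor, mel target) pairs:

// trainingPairs: assumed IEnumerable<(Tensor<float>, Tensor<float>)> dataset
for (int epoch = 0; epoch < 100; epoch++) // epoch count is illustrative
{
    foreach (var (phonemeInput, melTarget) in trainingPairs)
    {
        tacotron.Train(phonemeInput, melTarget); // one teacher-forced step
    }
}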

UpdateParameters(Vector<T>)

Updates model parameters using the configured optimizer.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

The gradient vector applied by the configured optimizer.