Class Tacotron2Model<T>
- Namespace: AiDotNet.Audio.TextToSpeech
- Assembly: AiDotNet.dll
Tacotron2 attention-based text-to-speech model.
public class Tacotron2Model<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
- T: The numeric type used for calculations.
- Inheritance: AudioNeuralNetworkBase<T> -> Tacotron2Model<T>
Remarks
Tacotron2 is a classic neural TTS model that generates mel spectrograms from text. It uses an encoder-attention-decoder architecture with:
- Character/phoneme encoder with convolutional layers
- Location-sensitive attention for alignment
- Autoregressive LSTM decoder
- Post-net for mel spectrogram refinement
For Beginners: Tacotron2 is a two-stage TTS system:
Stage 1 (Tacotron2): Text -> Mel Spectrogram
Stage 2 (Vocoder): Mel Spectrogram -> Audio Waveform
Key characteristics:
- Autoregressive: Generates one mel frame at a time
- Attention-based: Learns to align text with audio
- High quality but slower than parallel models like VITS
Two ways to use this class:
- ONNX Mode: Load pretrained Tacotron2 models for inference
- Native Mode: Train your own TTS model from scratch
ONNX Mode Example:
var tacotron = new Tacotron2Model<float>(
architecture,
acousticModelPath: "tacotron2.onnx",
vocoderPath: "hifigan.onnx");
var audio = tacotron.Synthesize("Hello, world!");
Training Mode Example:
var tacotron = new Tacotron2Model<float>(architecture);
tacotron.Train(phonemeInput, expectedMelSpectrogram);
Constructors
Tacotron2Model(NeuralNetworkArchitecture<T>, int, int, double, int, int, int, int, int, int, int, int, int, int, int, double, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a Tacotron2 model for native training mode.
public Tacotron2Model(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, int vocabSize = 148, int embeddingDim = 512, int encoderDim = 512, int decoderDim = 1024, int attentionDim = 128, int prenetDim = 256, int postnetEmbeddingDim = 512, int numEncoderConvLayers = 3, int numPostnetConvLayers = 5, int numMelsPerFrame = 2, int maxDecoderSteps = 1000, double stopThreshold = 0.5, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- sampleRate (int): Output sample rate in Hz. Default is 22050.
- numMels (int): Number of mel spectrogram channels. Default is 80.
- speakingRate (double): Speaking rate multiplier. Default is 1.0.
- vocabSize (int): Character/phoneme vocabulary size. Default is 148.
- embeddingDim (int): Embedding dimension. Default is 512.
- encoderDim (int): Encoder hidden dimension. Default is 512.
- decoderDim (int): Decoder hidden dimension. Default is 1024.
- attentionDim (int): Attention dimension. Default is 128.
- prenetDim (int): Pre-net dimension. Default is 256.
- postnetEmbeddingDim (int): Post-net embedding dimension. Default is 512.
- numEncoderConvLayers (int): Number of encoder convolutional layers. Default is 3.
- numPostnetConvLayers (int): Number of post-net convolutional layers. Default is 5.
- numMelsPerFrame (int): Mel frames generated per decoder step. Default is 2.
- maxDecoderSteps (int): Maximum number of decoder steps. Default is 1000.
- stopThreshold (double): Stop token threshold. Default is 0.5.
- fftSize (int): FFT size for Griffin-Lim. Default is 1024.
- hopLength (int): Hop length in samples. Default is 256.
- griffinLimIterations (int): Number of Griffin-Lim iterations. Default is 60.
- optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>?): Optimizer for training. If null, uses Adam.
- lossFunction (ILossFunction<T>?): Loss function for training. If null, uses MSE.
Remarks
For Beginners: Use this constructor to train your own Tacotron2 model.
Training Tacotron2 requires:
- Paired text-audio data with aligned phoneme sequences
- GPU training is recommended (many hours of training)
- Teacher forcing is used during training
Example:
var tacotron = new Tacotron2Model<float>(
architecture,
embeddingDim: 512,
encoderDim: 512,
decoderDim: 1024);
// Training loop
tacotron.Train(phonemeInput, expectedMelSpectrogram);
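A minimal epoch-loop sketch (trainingPairs is an assumed data source of paired phoneme and mel tensors, not part of this API):
// trainingPairs: assumed IEnumerable of (phoneme IDs, target mel spectrogram) pairs.
foreach (var (phonemes, targetMel) in trainingPairs)
{
    // Teacher forcing happens inside Train: the decoder sees the
    // ground-truth mel frames instead of its own previous predictions.
    tacotron.Train(phonemes, targetMel);
}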
Tacotron2Model(NeuralNetworkArchitecture<T>, string, string?, int, int, double, int, double, int, int, int, OnnxModelOptions?)
Creates a Tacotron2 model for ONNX inference with pretrained models.
public Tacotron2Model(NeuralNetworkArchitecture<T> architecture, string acousticModelPath, string? vocoderPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, int maxDecoderSteps = 1000, double stopThreshold = 0.5, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, OnnxModelOptions? onnxOptions = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- acousticModelPath (string): Path to the Tacotron2 ONNX model.
- vocoderPath (string?): Optional path to a vocoder ONNX model (HiFi-GAN/WaveGlow). Uses Griffin-Lim if null.
- sampleRate (int): Output sample rate in Hz. Default is 22050.
- numMels (int): Number of mel spectrogram channels. Default is 80.
- speakingRate (double): Speaking rate multiplier. Default is 1.0.
- maxDecoderSteps (int): Maximum number of decoder steps. Default is 1000.
- stopThreshold (double): Stop token threshold. Default is 0.5.
- fftSize (int): FFT size for Griffin-Lim. Default is 1024.
- hopLength (int): Hop length in samples. Default is 256.
- griffinLimIterations (int): Number of Griffin-Lim iterations. Default is 60.
- onnxOptions (OnnxModelOptions?): ONNX runtime options.
Remarks
For Beginners: Use this constructor with pretrained Tacotron2 models.
You need at least an acoustic model (Tacotron2). The vocoder is optional - Griffin-Lim can be used as fallback.
Example:
var tacotron = new Tacotron2Model<float>(
architecture,
acousticModelPath: "tacotron2.onnx",
vocoderPath: "hifigan.onnx");
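If no vocoder model is available, omit vocoderPath and the model falls back to Griffin-Lim reconstruction (the file name below is a placeholder):
// No vocoderPath given: audio is reconstructed with Griffin-Lim.
var tacotronGriffinLim = new Tacotron2Model<float>(
    architecture,
    acousticModelPath: "tacotron2.onnx",
    griffinLimIterations: 60); // more iterations improve quality at some cost in speed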
Properties
AvailableVoices
Gets the list of available built-in voices.
public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }
Property Value
- IReadOnlyList<VoiceInfo<T>>
IsReady
Gets whether the model is ready for synthesis.
public bool IsReady { get; }
Property Value
- bool
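A typical guard before synthesis (a minimal sketch):
if (tacotron.IsReady)
{
    var audio = tacotron.Synthesize("The model is ready.");
}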
MaxDecoderSteps
Gets the maximum decoder steps.
public int MaxDecoderSteps { get; }
Property Value
- int
SupportsEmotionControl
Gets whether this model supports emotional expression control.
public bool SupportsEmotionControl { get; }
Property Value
- bool
SupportsStreaming
Gets whether this model supports streaming audio generation.
public bool SupportsStreaming { get; }
Property Value
- bool
SupportsVoiceCloning
Gets whether this model supports voice cloning from reference audio.
public bool SupportsVoiceCloning { get; }
Property Value
- bool
Methods
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
- reader (BinaryReader)
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
- disposing (bool)
ExtractSpeakerEmbedding(Tensor<T>)
Extracts speaker embedding from reference audio.
public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
- referenceAudio (Tensor<T>)
Returns
- Tensor<T>
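A minimal sketch (referenceAudio is an assumed tensor holding a few seconds of the target speaker's speech):
// Extract a speaker embedding from reference audio.
Tensor<float> embedding = tacotron.ExtractSpeakerEmbedding(referenceAudio);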
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes layers for ONNX inference mode.
protected override void InitializeLayers()
PostprocessOutput(Tensor<T>)
Postprocesses model output.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
- modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
- input (Tensor<T>)
Returns
- Tensor<T>
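In native mode, Predict maps text-side input to a mel spectrogram, consistent with the class remarks (phonemeInput is an assumed tensor of phoneme/character IDs):
// Text -> mel spectrogram (stage 1 of the two-stage pipeline).
var melSpectrogram = tacotron.Predict(phonemeInput);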
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
- rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
- writer (BinaryWriter)
StartStreamingSession(string?, double)
Starts a streaming synthesis session.
public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)
Parameters
- voiceId (string?): Optional voice ID. Default is null.
- speakingRate (double): Speaking rate multiplier. Default is 1.
Returns
- IStreamingSynthesisSession<T>
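A sketch of opening a session (how text is fed and audio chunks are read is defined by IStreamingSynthesisSession<T> and not shown here):
var session = tacotron.StartStreamingSession(speakingRate: 1.0);
// Feed text and consume audio chunks through the session's members.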
Synthesize(string, string?, double, double)
Synthesizes speech from text.
public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)
Parameters
- text (string): The text to synthesize.
- voiceId (string?): Optional voice ID. Default is null.
- speakingRate (double): Speaking rate multiplier. Default is 1.
- pitch (double): Pitch adjustment. Default is 0.
Returns
- Tensor<T>
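Example (speakingRate and pitch are the only knobs adjusted here):
var audio = tacotron.Synthesize(
    "The quick brown fox jumps over the lazy dog.",
    speakingRate: 1.2,  // 20% faster than normal
    pitch: 0);          // no pitch adjustment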
SynthesizeAsync(string, string?, double, double, CancellationToken)
Synthesizes speech from text asynchronously.
public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)
Parameters
- text (string): The text to synthesize.
- voiceId (string?): Optional voice ID. Default is null.
- speakingRate (double): Speaking rate multiplier. Default is 1.
- pitch (double): Pitch adjustment. Default is 0.
- cancellationToken (CancellationToken): Token to cancel the operation.
Returns
- Task<Tensor<T>>
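Example (inside an async method; the 30-second timeout is an arbitrary choice for illustration):
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var audio = await tacotron.SynthesizeAsync(
    "Hello from the async API.",
    cancellationToken: cts.Token);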
SynthesizeWithEmotion(string, string, double, string?, double)
Synthesizes speech with emotional expression.
public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)
Parameters
- text (string): The text to synthesize.
- emotion (string): The target emotion.
- emotionIntensity (double): Emotion intensity. Default is 0.5.
- voiceId (string?): Optional voice ID. Default is null.
- speakingRate (double): Speaking rate multiplier. Default is 1.
Returns
- Tensor<T>
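Example (check SupportsEmotionControl first; "happy" is a placeholder label, since supported emotions are model-dependent):
if (tacotron.SupportsEmotionControl)
{
    var audio = tacotron.SynthesizeWithEmotion(
        "What a wonderful day!",
        emotion: "happy",       // placeholder; supported labels depend on the model
        emotionIntensity: 0.8);
}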
SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)
Synthesizes speech using a cloned voice from reference audio.
public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)
Parameters
- text (string): The text to synthesize.
- referenceAudio (Tensor<T>): Reference audio of the target speaker.
- speakingRate (double): Speaking rate multiplier. Default is 1.
- pitch (double): Pitch adjustment. Default is 0.
Returns
- Tensor<T>
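Example (referenceAudio is an assumed tensor with a few seconds of the target speaker; gate the call on SupportsVoiceCloning):
if (tacotron.SupportsVoiceCloning)
{
    var audio = tacotron.SynthesizeWithVoiceCloning(
        "This voice should resemble the reference speaker.",
        referenceAudio);
}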
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
- input (Tensor<T>): The training input (e.g., a phoneme sequence).
- expectedOutput (Tensor<T>): The expected output (e.g., the target mel spectrogram).
UpdateParameters(Vector<T>)
Updates model parameters using the configured optimizer.
public override void UpdateParameters(Vector<T> gradients)
Parameters
- gradients (Vector<T>)