Class TtsModel<T>
- Namespace: AiDotNet.Audio.TextToSpeech
- Assembly: AiDotNet.dll
Text-to-speech model for synthesizing speech from text.
public class TtsModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
- AudioNeuralNetworkBase<T> → TtsModel<T>
Remarks
This TTS model uses a two-stage pipeline:
1. Acoustic Model (FastSpeech2): converts text/phonemes to a mel spectrogram.
2. Vocoder (HiFi-GAN or Griffin-Lim): converts the mel spectrogram to an audio waveform.
For Beginners: Text-to-Speech works like this:
1. Your text is converted to phonemes (speech sounds).
2. The acoustic model predicts what the speech should "look like" (a mel spectrogram).
3. The vocoder turns that into actual audible speech.
This class supports two modes:
- ONNX Mode: Load pretrained FastSpeech2/HiFi-GAN models for instant synthesis
- Native Mode: Train your own TTS model from scratch
Usage (ONNX Mode):
var tts = new TtsModel<float>(
    architecture,
    acousticModelPath: "path/to/fastspeech2.onnx",
    vocoderModelPath: "path/to/hifigan.onnx");
var audio = tts.Synthesize("Hello, world!");
Usage (Native Training Mode):
var tts = new TtsModel<float>(
    architecture,
    optimizer: new AdamOptimizer<float>(),
    lossFunction: new MeanSquaredErrorLoss<float>());
tts.Train(phonemeInput, expectedMelSpectrogram);
Constructors
TtsModel(NeuralNetworkArchitecture<T>, int, int, double, double, double, int?, string?, int, int, int, int, int, int, int, int, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a TtsModel for native training mode.
public TtsModel(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double pitchShift = 0, double energy = 1, int? speakerId = null, string? language = null, int hiddenDim = 256, int numHeads = 4, int numEncoderLayers = 4, int numDecoderLayers = 4, int maxPhonemeLength = 256, int fftSize = 1024, int hopLength = 256, int griffinLimIterations = 60, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
sampleRate (int): Output sample rate in Hz. Default is 22050 (standard for TTS).
numMels (int): Number of mel spectrogram channels. Default is 80.
speakingRate (double): Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.
pitchShift (double): Pitch shift in semitones. 0 = normal. Default is 0.
energy (double): Energy/volume level. 1.0 = normal. Default is 1.0.
speakerId (int?): Speaker ID for multi-speaker models. Default is null.
language (string?): Language code for multi-lingual models. Default is null.
hiddenDim (int): Hidden dimension for the acoustic model. Default is 256.
numHeads (int): Number of attention heads. Default is 4.
numEncoderLayers (int): Number of encoder layers. Default is 4.
numDecoderLayers (int): Number of decoder layers. Default is 4.
maxPhonemeLength (int): Maximum phoneme sequence length. Default is 256.
fftSize (int): FFT size for Griffin-Lim. Default is 1024.
hopLength (int): Hop length for Griffin-Lim. Default is 256.
griffinLimIterations (int): Number of Griffin-Lim iterations. Default is 60.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optimizer for training. If null, a default Adam optimizer is used.
lossFunction (ILossFunction<T>?): Loss function for training. If null, MSE loss is used.
Remarks
For Beginners: Use this constructor to train your own TTS model.
You'll need a dataset of (phoneme sequence, mel spectrogram) pairs. Training TTS from scratch requires significant data and compute resources.
Example:
var tts = new TtsModel<float>(
    architecture,
    optimizer: new AdamOptimizer<float>(),
    lossFunction: new MeanSquaredErrorLoss<float>());
// Train on your dataset
tts.Train(phonemeInput, expectedMelSpectrogram);
TtsModel(NeuralNetworkArchitecture<T>, string, string?, int, int, double, double, double, int?, string?, bool, int, int, int, OnnxModelOptions?)
Creates a TtsModel for ONNX inference with pretrained models.
public TtsModel(NeuralNetworkArchitecture<T> architecture, string acousticModelPath, string? vocoderModelPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double pitchShift = 0, double energy = 1, int? speakerId = null, string? language = null, bool useGriffinLimFallback = true, int griffinLimIterations = 60, int fftSize = 1024, int hopLength = 256, OnnxModelOptions? onnxOptions = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
acousticModelPath (string): Required path to the acoustic model ONNX file (e.g., FastSpeech2).
vocoderModelPath (string?): Optional path to the vocoder ONNX file (e.g., HiFi-GAN). If null, Griffin-Lim is used.
sampleRate (int): Output sample rate in Hz. Default is 22050 (standard for TTS).
numMels (int): Number of mel spectrogram channels. Default is 80.
speakingRate (double): Speaking rate multiplier. 1.0 = normal speed. Default is 1.0.
pitchShift (double): Pitch shift in semitones. 0 = normal. Default is 0.
energy (double): Energy/volume level. 1.0 = normal. Default is 1.0.
speakerId (int?): Speaker ID for multi-speaker models. Default is null.
language (string?): Language code for multi-lingual models. Default is null.
useGriffinLimFallback (bool): Whether to use Griffin-Lim as a fallback vocoder. Default is true.
griffinLimIterations (int): Number of Griffin-Lim iterations. Default is 60.
fftSize (int): FFT size for Griffin-Lim. Default is 1024.
hopLength (int): Hop length for Griffin-Lim. Default is 256.
onnxOptions (OnnxModelOptions?): ONNX runtime options.
Remarks
For Beginners: Use this constructor when you have pretrained TTS models.
You need at least an acoustic model (converts text to mel spectrogram). The vocoder (converts mel to audio) is optional - Griffin-Lim can be used as fallback.
Example:
var tts = new TtsModel<float>(
    architecture,
    acousticModelPath: "fastspeech2.onnx",
    vocoderModelPath: "hifigan.onnx");
Properties
AvailableVoices
Gets the list of available built-in voices.
public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }
Property Value
- IReadOnlyList<VoiceInfo<T>>
IsReady
Gets whether the model is ready for synthesis.
public bool IsReady { get; }
Property Value
- bool
SupportsEmotionControl
Gets whether this model supports emotional expression control.
public bool SupportsEmotionControl { get; }
Property Value
- bool
SupportsStreaming
Gets whether this model supports streaming audio generation.
public bool SupportsStreaming { get; }
Property Value
- bool
SupportsVoiceCloning
Gets whether this model supports voice cloning from reference audio.
public bool SupportsVoiceCloning { get; }
Property Value
- bool
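The capability flags above vary with how the model is configured, so a common pattern is to check them before invoking the corresponding features. A minimal sketch, assuming `tts` is an already-constructed TtsModel<float> (construction elided):

```csharp
// Guard synthesis and optional features behind the documented capability flags.
if (!tts.IsReady)
    throw new InvalidOperationException("TTS model is not ready for synthesis.");

if (tts.SupportsVoiceCloning)
    Console.WriteLine("Voice cloning is available.");

// Enumerate the built-in voices exposed by the model.
foreach (var voice in tts.AvailableVoices)
    Console.WriteLine($"Built-in voice: {voice}");
```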
Methods
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader)
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool): True to release managed resources in addition to unmanaged ones; false to release only unmanaged resources.
ExtractSpeakerEmbedding(Tensor<T>)
Extracts speaker embedding from reference audio for voice cloning.
public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
referenceAudio (Tensor<T>)
Returns
- Tensor<T>
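A hedged sketch of the two cloning entry points documented on this page; `tts` and `referenceAudio` (a tensor holding audio of the target speaker) are assumed to be prepared elsewhere:

```csharp
// Extract a reusable speaker embedding from reference audio.
Tensor<float> embedding = tts.ExtractSpeakerEmbedding(referenceAudio);

// For end-to-end cloning, SynthesizeWithVoiceCloning takes the reference audio directly.
Tensor<float> cloned = tts.SynthesizeWithVoiceCloning(
    "This sentence uses the cloned voice.",
    referenceAudio);
```

Extracting the embedding separately is useful when the same speaker identity is needed across many syntheses, though this page does not document a method that consumes a precomputed embedding.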
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the layers for the TTS model.
protected override void InitializeLayers()
Remarks
Follows the golden standard pattern:
- Check if in native mode (ONNX mode returns early)
- Use Architecture.Layers if provided by user
- Fall back to LayerHelper.CreateDefaultTtsLayers() otherwise
PostprocessOutput(Tensor<T>)
Postprocesses model output into the final result format.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter)
StartStreamingSession(string?, double)
Starts a streaming synthesis session.
public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)
Parameters
voiceId (string?)
speakingRate (double)
Returns
- IStreamingSynthesisSession<T>
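The members of IStreamingSynthesisSession<T> are not documented in this section, so the sketch below only shows starting a session; guarding on SupportsStreaming is grounded in this API, while the session usage itself is left as a comment:

```csharp
// Start a streaming synthesis session only when the model supports it.
if (tts.SupportsStreaming)
{
    var session = tts.StartStreamingSession(voiceId: null, speakingRate: 1.0);
    // ... feed text to the session and consume audio chunks as they arrive ...
}
```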
Synthesize(string, string?, double, double)
Synthesizes speech from text.
public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)
Parameters
text (string)
voiceId (string?)
speakingRate (double)
pitch (double)
Returns
- Tensor<T>
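A sketch of synthesis with explicit prosody settings. The parameter semantics are taken from the constructor documentation above: speakingRate is a multiplier (1.0 = normal) and pitch is a shift in semitones (0 = normal); the specific values here are illustrative:

```csharp
// Synthesize slightly faster and higher-pitched speech with the default voice.
Tensor<float> audio = tts.Synthesize(
    "Hello, world!",
    voiceId: null,      // null selects the default voice
    speakingRate: 1.2,  // 20% faster than normal
    pitch: 2.0);        // shifted up two semitones
```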
SynthesizeAsync(string, string?, double, double, CancellationToken)
Synthesizes speech from text asynchronously.
public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)
Parameters
text (string)
voiceId (string?)
speakingRate (double)
pitch (double)
cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
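A sketch of asynchronous synthesis with a timeout, using a standard CancellationTokenSource; the 30-second limit is an arbitrary illustration:

```csharp
// Cancel the synthesis automatically if it runs longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
Tensor<float> audio = await tts.SynthesizeAsync(
    "This may take a while for long passages.",
    cancellationToken: cts.Token);
```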
SynthesizeWithEmotion(string, string, double, string?, double)
Synthesizes speech with emotional expression.
public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)
Parameters
text (string)
emotion (string)
emotionIntensity (double)
voiceId (string?)
speakingRate (double)
Returns
- Tensor<T>
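A sketch of emotional synthesis guarded by SupportsEmotionControl. The emotion label "happy" and the intensity value are assumptions; this page only specifies a string emotion and a double intensity (default 0.5):

```csharp
// Only attempt emotional synthesis when the model advertises support for it.
if (tts.SupportsEmotionControl)
{
    Tensor<float> audio = tts.SynthesizeWithEmotion(
        "Great to see you!",
        emotion: "happy",        // label is illustrative; supported labels are model-specific
        emotionIntensity: 0.8);
}
```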
SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)
Synthesizes speech using a cloned voice from reference audio.
public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)
Parameters
text (string)
referenceAudio (Tensor<T>)
speakingRate (double)
pitch (double)
Returns
- Tensor<T>
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>)
expectedOutput (Tensor<T>)
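A minimal sketch of a native-mode training loop, assuming `trainingBatches` is a collection of (phoneme tensor, target mel spectrogram tensor) pairs prepared elsewhere, as described in the native-training constructor remarks:

```csharp
// One pass over the dataset: each call runs a training step on one batch.
foreach (var (phonemes, targetMels) in trainingBatches)
{
    tts.Train(phonemes, targetMels);
}
```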
UpdateParameters(Vector<T>)
Updates model parameters using gradient descent.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>)