Class VITSModel<T>
Namespace: AiDotNet.Audio.TextToSpeech
Assembly: AiDotNet.dll
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model.
public class VITSModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Inheritance
AudioNeuralNetworkBase<T> → VITSModel<T>
Remarks
VITS is a state-of-the-art end-to-end TTS model that generates high-quality speech directly from text without requiring a separate vocoder. It combines:
- Variational autoencoder (VAE) for learning latent representations
- Normalizing flows for improved audio quality
- Adversarial training for realistic speech synthesis
- Multi-speaker support with speaker embeddings
For Beginners: VITS is a modern TTS model with several advantages:
- End-to-end: Converts text directly to audio (no separate vocoder needed)
- Fast: Parallel generation is much faster than autoregressive models
- High quality: Produces natural-sounding speech
- Voice cloning: Can learn to speak in new voices from short audio samples
Two ways to use this class:
- ONNX Mode: Load pretrained VITS models for fast inference
- Native Mode: Train your own TTS model from scratch
ONNX Mode Example:
var vits = new VITSModel<float>(
    architecture,
    modelPath: "path/to/vits.onnx");
var audio = vits.Synthesize("Hello, world!");
Voice Cloning Example:
var audio = vits.SynthesizeWithVoiceCloning(
    "Hello, world!",
    referenceAudio);
Constructors
VITSModel(NeuralNetworkArchitecture<T>, int, int, double, double, double, int, int, int, int, int, int, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a VITS model for native training mode.
public VITSModel(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int hiddenDim = 192, int numHeads = 2, int numEncoderLayers = 6, int numFlowLayers = 4, int speakerEmbeddingDim = 256, int numSpeakers = 1, int maxPhonemeLength = 256, int fftSize = 1024, int hopLength = 256, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
sampleRate (int): Output sample rate in Hz. Default is 22050.
numMels (int): Number of mel spectrogram channels. Default is 80.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
noiseScale (double): Noise scale for variational sampling. Default is 0.667.
lengthScale (double): Length scale for duration control. Default is 1.0.
hiddenDim (int): Hidden dimension. Default is 192.
numHeads (int): Number of attention heads. Default is 2.
numEncoderLayers (int): Number of text encoder layers. Default is 6.
numFlowLayers (int): Number of flow layers. Default is 4.
speakerEmbeddingDim (int): Speaker embedding dimension. Default is 256.
numSpeakers (int): Number of speakers for a multi-speaker model. Default is 1.
maxPhonemeLength (int): Maximum phoneme sequence length. Default is 256.
fftSize (int): FFT size for audio generation. Default is 1024.
hopLength (int): Hop length for audio generation. Default is 256.
optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>?): Optimizer for training. If null, uses Adam.
lossFunction (ILossFunction<T>?): Loss function for training. If null, uses MSE.
Remarks
For Beginners: Use this constructor to train your own VITS model.
Training VITS requires:
- Large amounts of paired text-audio data
- Significant compute resources (GPUs recommended)
- Many training epochs
Example:
var vits = new VITSModel<float>(
    architecture,
    numSpeakers: 10,          // Multi-speaker model
    speakerEmbeddingDim: 256);

// Training loop
vits.Train(phonemeInput, audioOutput);
VITSModel(NeuralNetworkArchitecture<T>, string, string?, int, int, double, double, double, int, int, OnnxModelOptions?)
Creates a VITS model for ONNX inference with pretrained models.
public VITSModel(NeuralNetworkArchitecture<T> architecture, string modelPath, string? speakerEncoderPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int fftSize = 1024, int hopLength = 256, OnnxModelOptions? onnxOptions = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
modelPath (string): Path to the VITS ONNX model file.
speakerEncoderPath (string?): Optional path to a speaker encoder for voice cloning.
sampleRate (int): Output sample rate in Hz. Default is 22050.
numMels (int): Number of mel spectrogram channels. Default is 80.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
noiseScale (double): Noise scale for variational sampling. Default is 0.667.
lengthScale (double): Length scale for duration control. Default is 1.0.
fftSize (int): FFT size for audio generation. Default is 1024.
hopLength (int): Hop length for audio generation. Default is 256.
onnxOptions (OnnxModelOptions?): ONNX runtime options.
Remarks
For Beginners: Use this constructor with pretrained VITS models.
You can get ONNX models from:
- HuggingFace: Various VITS models
- Coqui TTS exports
Example:
var vits = new VITSModel<float>(
    architecture,
    modelPath: "vits-en.onnx",
    speakerEncoderPath: "speaker-encoder.onnx"); // For voice cloning
Properties
AvailableVoices
Gets the list of available built-in voices.
public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }
Property Value
- IReadOnlyList<VoiceInfo<T>>
IsReady
Gets whether the model is ready for synthesis.
public bool IsReady { get; }
Property Value
- bool
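Example (a minimal sketch using only members documented on this page):
if (vits.IsReady)
{
    // AvailableVoices is an IReadOnlyList<VoiceInfo<T>>, so Count is available.
    Console.WriteLine($"Model ready with {vits.AvailableVoices.Count} built-in voice(s).");
}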
SupportsEmotionControl
Gets whether this model supports emotional expression control.
public bool SupportsEmotionControl { get; }
Property Value
- bool
SupportsStreaming
Gets whether this model supports streaming audio generation.
public bool SupportsStreaming { get; }
Property Value
- bool
SupportsVoiceCloning
Gets whether this model supports voice cloning from reference audio.
public bool SupportsVoiceCloning { get; }
Property Value
- bool
Methods
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader)
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool)
ExtractSpeakerEmbedding(Tensor<T>)
Extracts speaker embedding from reference audio for voice cloning.
public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
referenceAudio (Tensor<T>): Reference audio to extract the speaker embedding from.
Returns
- Tensor<T>
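Example (a sketch; referenceAudio is assumed to be a waveform tensor of the target speaker at the model's sample rate):
// Extract a speaker embedding from a few seconds of reference audio.
// In ONNX mode, this presumably uses the model loaded via speakerEncoderPath.
Tensor<float> embedding = vits.ExtractSpeakerEmbedding(referenceAudio);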
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes layers for ONNX inference mode.
protected override void InitializeLayers()
PostprocessOutput(Tensor<T>)
Postprocesses model output.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
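Example (a sketch; the expected input encoding is not documented on this page, so phonemeInput is a placeholder, and most callers should prefer Synthesize, which accepts plain text):
// Low-level inference on an already-encoded input tensor.
Tensor<float> waveform = vits.Predict(phonemeInput);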
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter)
StartStreamingSession(string?, double)
Starts a streaming synthesis session.
public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)
Parameters
voiceId (string?): Optional voice identifier. If null, uses the default voice.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
Returns
- IStreamingSynthesisSession<T>
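Example (a sketch; the members of IStreamingSynthesisSession<T> are not documented on this page, so only the call itself is shown):
// Start a session with the default voice at normal speed.
var session = vits.StartStreamingSession(voiceId: null, speakingRate: 1.0);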
Synthesize(string, string?, double, double)
Synthesizes speech from text.
public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)
Parameters
text (string): The text to synthesize.
voiceId (string?): Optional voice identifier. If null, uses the default voice.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
pitch (double): Pitch adjustment. Default is 0.
Returns
- Tensor<T>
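Example (a sketch; the pitch units are not documented on this page, so the value 2.0 is illustrative):
// Slightly slower speech with a small pitch adjustment. The returned tensor
// holds waveform samples at the model's sample rate (22050 Hz by default).
var audio = vits.Synthesize(
    "Welcome to AiDotNet.",
    speakingRate: 0.9,
    pitch: 2.0);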
SynthesizeAsync(string, string?, double, double, CancellationToken)
Synthesizes speech from text asynchronously.
public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)
Parameters
text (string): The text to synthesize.
voiceId (string?): Optional voice identifier. If null, uses the default voice.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
pitch (double): Pitch adjustment. Default is 0.
cancellationToken (CancellationToken): Token used to cancel the operation.
Returns
- Task<Tensor<T>>
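Example (a minimal sketch showing cancellation via a timeout):
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var audio = await vits.SynthesizeAsync(
    "This call can be cancelled.",
    cancellationToken: cts.Token);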
SynthesizeWithEmotion(string, string, double, string?, double)
Synthesizes speech with emotional expression.
public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)
Parameters
text (string): The text to synthesize.
emotion (string): The emotion to express.
emotionIntensity (double): Emotion intensity. Default is 0.5.
voiceId (string?): Optional voice identifier. If null, uses the default voice.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
Returns
- Tensor<T>
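Example (a sketch; the emotion label "happy" is illustrative, since the supported labels are model-dependent and not listed on this page):
if (vits.SupportsEmotionControl)
{
    var audio = vits.SynthesizeWithEmotion(
        "Great to see you!",
        emotion: "happy",
        emotionIntensity: 0.7);
}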
SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)
Synthesizes speech using a cloned voice from reference audio.
public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)
Parameters
text (string): The text to synthesize.
referenceAudio (Tensor<T>): Reference audio of the target voice.
speakingRate (double): Speaking rate multiplier. Default is 1.0.
pitch (double): Pitch adjustment. Default is 0.
Returns
- Tensor<T>
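Example (a sketch; referenceAudio is assumed to be a short waveform tensor of the target speaker at the model's sample rate):
if (vits.SupportsVoiceCloning)
{
    var cloned = vits.SynthesizeWithVoiceCloning(
        "Now in a cloned voice.",
        referenceAudio,
        speakingRate: 1.0);
}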
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>)
expectedOutput (Tensor<T>)
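Example (a sketch; tensor shapes are model-specific and not documented on this page, so phonemeInput and audioTarget are placeholders):
for (int epoch = 0; epoch < numEpochs; epoch++)
{
    vits.Train(phonemeInput, audioTarget);
}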
UpdateParameters(Vector<T>)
Updates model parameters using the configured optimizer.
public override void UpdateParameters(Vector<T> gradients)
Parameters
gradients (Vector<T>)