Class VITSModel<T>

Namespace
AiDotNet.Audio.TextToSpeech
Assembly
AiDotNet.dll

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model.

public class VITSModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
AudioNeuralNetworkBase<T>
VITSModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ITextToSpeech<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

VITS is a state-of-the-art end-to-end TTS model that generates high-quality speech directly from text without requiring a separate vocoder. It combines:

  • Variational autoencoder (VAE) for learning latent representations
  • Normalizing flows for a more expressive latent distribution, improving audio quality
  • Adversarial training for realistic speech synthesis
  • Multi-speaker support with speaker embeddings

For Beginners: VITS is a modern TTS model with several advantages:

  1. End-to-end: Converts text directly to audio (no separate vocoder needed)
  2. Fast: Parallel generation is much faster than autoregressive models
  3. High quality: Produces natural-sounding speech
  4. Voice cloning: Can learn to speak in new voices from short audio samples

Two ways to use this class:

  1. ONNX Mode: Load pretrained VITS models for fast inference
  2. Native Mode: Train your own TTS model from scratch

ONNX Mode Example:

var vits = new VITSModel<float>(
    architecture,
    modelPath: "path/to/vits.onnx");
var audio = vits.Synthesize("Hello, world!");

Voice Cloning Example:

var audio = vits.SynthesizeWithVoiceCloning(
    "Hello, world!",
    referenceAudio);

Constructors

VITSModel(NeuralNetworkArchitecture<T>, int, int, double, double, double, int, int, int, int, int, int, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a VITS model for native training mode.

public VITSModel(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int hiddenDim = 192, int numHeads = 2, int numEncoderLayers = 6, int numFlowLayers = 4, int speakerEmbeddingDim = 256, int numSpeakers = 1, int maxPhonemeLength = 256, int fftSize = 1024, int hopLength = 256, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

noiseScale double

Noise scale for variational sampling. Default is 0.667.

lengthScale double

Length scale for duration control. Default is 1.0.

hiddenDim int

Hidden dimension. Default is 192.

numHeads int

Number of attention heads. Default is 2.

numEncoderLayers int

Number of text encoder layers. Default is 6.

numFlowLayers int

Number of flow layers. Default is 4.

speakerEmbeddingDim int

Speaker embedding dimension. Default is 256.

numSpeakers int

Number of speakers for multi-speaker model. Default is 1.

maxPhonemeLength int

Maximum phoneme sequence length. Default is 256.

fftSize int

FFT size for audio generation. Default is 1024.

hopLength int

Hop length (in samples) for audio generation. Default is 256. See the note after this parameter list for how hopLength and sampleRate determine the frame rate.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, uses Adam.

lossFunction ILossFunction<T>

Loss function for training. If null, uses MSE.
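
Note on fftSize and hopLength: with the defaults above, each hop of 256 samples at 22050 Hz covers about 11.6 ms of audio, so the model operates at roughly 86 spectrogram frames per second. A worked example (not part of the API):

// Frame rate at the default settings:
double framesPerSecond = 22050.0 / 256.0;           // ≈ 86.1 frames per second
double hopMilliseconds = 256.0 / 22050.0 * 1000.0;  // ≈ 11.6 ms per hop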

Remarks

For Beginners: Use this constructor to train your own VITS model.

Training VITS requires:

  1. Large amounts of paired text-audio data
  2. Significant compute resources (GPUs recommended)
  3. Many training epochs

Example:

var vits = new VITSModel<float>(
    architecture,
    numSpeakers: 10,  // Multi-speaker model
    speakerEmbeddingDim: 256);

// Training loop
vits.Train(phonemeInput, audioOutput);

VITSModel(NeuralNetworkArchitecture<T>, string, string?, int, int, double, double, double, int, int, OnnxModelOptions?)

Creates a VITS model for ONNX inference with pretrained models.

public VITSModel(NeuralNetworkArchitecture<T> architecture, string modelPath, string? speakerEncoderPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int fftSize = 1024, int hopLength = 256, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelPath string

Path to the VITS ONNX model file.

speakerEncoderPath string

Optional path to speaker encoder for voice cloning.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

noiseScale double

Noise scale for variational sampling. Default is 0.667.

lengthScale double

Length scale for duration control. Default is 1.0.

fftSize int

FFT size for audio generation. Default is 1024.

hopLength int

Hop length for audio generation. Default is 256.

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor with pretrained VITS models.

You can get ONNX models from:

  • HuggingFace: Various VITS models
  • Coqui TTS exports

Example:

var vits = new VITSModel<float>(
    architecture,
    modelPath: "vits-en.onnx",
    speakerEncoderPath: "speaker-encoder.onnx");  // For voice cloning

Properties

AvailableVoices

Gets the list of available built-in voices.

public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>

IsReady

Gets whether the model is ready for synthesis.

public bool IsReady { get; }

Property Value

bool

SupportsEmotionControl

Gets whether this model supports emotional expression control.

public bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

public bool SupportsVoiceCloning { get; }

Property Value

bool
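
The capability flags above can be checked before calling the corresponding synthesis methods. A minimal sketch, assuming an already constructed vits instance and a referenceAudio tensor:

// Guard synthesis calls with the model's capability flags.
if (!vits.IsReady)
    throw new InvalidOperationException("Model is not ready for synthesis.");

if (vits.SupportsVoiceCloning)
{
    var cloned = vits.SynthesizeWithVoiceCloning("Hello!", referenceAudio);
}

if (vits.SupportsEmotionControl)
{
    // "happy" is an illustrative emotion name; supported names are model-specific.
    var happy = vits.SynthesizeWithEmotion("Hello!", "happy");
}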

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The binary reader to read the network-specific data from.

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True to release both managed and unmanaged resources; false to release only unmanaged resources.

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio for voice cloning.

public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio from the target speaker.

Returns

Tensor<T>

The extracted speaker embedding.
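
Example (a minimal sketch; per the ONNX constructor above, voice cloning requires a speaker encoder loaded via speakerEncoderPath):

// referenceAudio: a short clip (a few seconds) of the target speaker
var embedding = vits.ExtractSpeakerEmbedding(referenceAudio);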

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes layers for ONNX inference mode.

protected override void InitializeLayers()

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

The raw model output to postprocess.

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

The raw audio tensor to preprocess for model input.

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The binary writer to write the network-specific data to.

StartStreamingSession(string?, double)

Starts a streaming synthesis session.

public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

IStreamingSynthesisSession<T>

A session object for incremental speech synthesis.
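
Example (a sketch; the method signature is shown above, but the members of IStreamingSynthesisSession<T> are not documented here, so the commented calls are hypothetical placeholders):

var session = vits.StartStreamingSession(voiceId: null, speakingRate: 1.0);
// Hypothetical usage of the session object; consult IStreamingSynthesisSession<T>
// for the actual member names:
//   session.FeedText("Hello, ");
//   session.FeedText("world!");
//   var chunk = session.ReadAudio();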

Synthesize(string, string?, double, double)

Synthesizes speech from text.

public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

Returns

Tensor<T>

The synthesized audio waveform.
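
Example (pitch units are model-dependent; 0 leaves pitch unchanged):

// Default voice, normal rate and pitch
var audio = vits.Synthesize("Hello, world!");

// 20% faster speech with a raised pitch
var faster = vits.Synthesize("Hello, world!", speakingRate: 1.2, pitch: 2.0);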

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

cancellationToken CancellationToken

Token used to cancel the synthesis operation.

Returns

Task<Tensor<T>>

A task that resolves to the synthesized audio waveform.
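
Example:

// Cancel the synthesis if it takes longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var audio = await vits.SynthesizeAsync(
    "Hello, world!",
    cancellationToken: cts.Token);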

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to synthesize.

emotion string

Name of the emotion to express.

emotionIntensity double

Intensity of the emotional expression. Default is 0.5.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

Tensor<T>

The synthesized audio waveform.
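
Example (the set of recognized emotion names is model-specific; "happy" here is illustrative):

var audio = vits.SynthesizeWithEmotion(
    "Great to see you!",
    emotion: "happy",
    emotionIntensity: 0.8);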

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

referenceAudio Tensor<T>

Reference audio from the target speaker.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

Returns

Tensor<T>

The synthesized audio waveform in the cloned voice.
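
Example:

// referenceAudio: a short clip (a few seconds) of the target speaker
var audio = vits.SynthesizeWithVoiceCloning(
    "Hello in a cloned voice!",
    referenceAudio,
    speakingRate: 1.0);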

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The phoneme input tensor.

expectedOutput Tensor<T>

The expected audio output tensor.
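
Example (a minimal sketch of an epoch loop; trainingPairs stands in for your prepared phoneme/audio dataset):

for (int epoch = 0; epoch < 100; epoch++)
{
    foreach (var (phonemes, audio) in trainingPairs)
    {
        vits.Train(phonemes, audio);
    }
}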

UpdateParameters(Vector<T>)

Updates model parameters using the configured optimizer.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

The gradient vector applied by the configured optimizer.