Class VITSModel<T>

Namespace
AiDotNet.Audio.TextToSpeech
Assembly
AiDotNet.dll

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model.

public class VITSModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ITextToSpeech<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
AudioNeuralNetworkBase<T>
VITSModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
ITextToSpeech<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

VITS is a state-of-the-art end-to-end TTS model that generates high-quality speech directly from text without requiring a separate vocoder. It combines:

  • Variational autoencoder (VAE) for learning latent representations
  • Normalizing flows for a more expressive latent distribution, improving audio quality
  • Adversarial training for realistic speech synthesis
  • Multi-speaker support with speaker embeddings

For Beginners: VITS is a modern TTS model with several advantages:

  1. End-to-end: Converts text directly to audio (no separate vocoder needed)
  2. Fast: Parallel generation is much faster than autoregressive models
  3. High quality: Produces natural-sounding speech
  4. Voice cloning: Can learn to speak in new voices from short audio samples

Two ways to use this class:

  1. ONNX Mode: Load pretrained VITS models for fast inference
  2. Native Mode: Train your own TTS model from scratch

ONNX Mode Example:

var vits = new VITSModel<float>(
    architecture,
    modelPath: "path/to/vits.onnx");
var audio = vits.Synthesize("Hello, world!");

Voice Cloning Example:

var audio = vits.SynthesizeWithVoiceCloning(
    "Hello, world!",
    referenceAudio);

Constructors

VITSModel(NeuralNetworkArchitecture<T>, int, int, double, double, double, int, int, int, int, int, int, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a VITS model for native training mode.

public VITSModel(NeuralNetworkArchitecture<T> architecture, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int hiddenDim = 192, int numHeads = 2, int numEncoderLayers = 6, int numFlowLayers = 4, int speakerEmbeddingDim = 256, int numSpeakers = 1, int maxPhonemeLength = 256, int fftSize = 1024, int hopLength = 256, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

noiseScale double

Noise scale for variational sampling. Default is 0.667.

lengthScale double

Length scale for duration control. Default is 1.0.

hiddenDim int

Hidden dimension. Default is 192.

numHeads int

Number of attention heads. Default is 2.

numEncoderLayers int

Number of text encoder layers. Default is 6.

numFlowLayers int

Number of flow layers. Default is 4.

speakerEmbeddingDim int

Speaker embedding dimension. Default is 256.

numSpeakers int

Number of speakers for multi-speaker model. Default is 1.

maxPhonemeLength int

Maximum phoneme sequence length. Default is 256.

fftSize int

FFT size for audio generation. Default is 1024.

hopLength int

Hop length (in samples) for audio generation. Default is 256. See the note after this parameter list for how hopLength and sampleRate determine the frame rate.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optimizer for training. If null, uses Adam.

lossFunction ILossFunction<T>

Loss function for training. If null, uses MSE.
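
Note on fftSize and hopLength: with the defaults above, each hop of 256 samples at 22050 Hz covers about 11.6 ms of audio, so the model operates at roughly 86 spectrogram frames per second. A worked example (not part of the API):

// Frame rate at the default settings:
double framesPerSecond = 22050.0 / 256.0;           // ≈ 86.1 frames per second
double hopMilliseconds = 256.0 / 22050.0 * 1000.0;  // ≈ 11.6 ms per hop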

Remarks

For Beginners: Use this constructor to train your own VITS model.

Training VITS requires:

  1. Large amounts of paired text-audio data
  2. Significant compute resources (GPUs recommended)
  3. Many training epochs

Example:

var vits = new VITSModel<float>(
    architecture,
    numSpeakers: 10,  // Multi-speaker model
    speakerEmbeddingDim: 256);

// Training loop
vits.Train(phonemeInput, audioOutput);

VITSModel(NeuralNetworkArchitecture<T>, string, string?, int, int, double, double, double, int, int, OnnxModelOptions?)

Creates a VITS model for ONNX inference with pretrained models.

public VITSModel(NeuralNetworkArchitecture<T> architecture, string modelPath, string? speakerEncoderPath = null, int sampleRate = 22050, int numMels = 80, double speakingRate = 1, double noiseScale = 0.667, double lengthScale = 1, int fftSize = 1024, int hopLength = 256, OnnxModelOptions? onnxOptions = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelPath string

Path to the VITS ONNX model file.

speakerEncoderPath string

Optional path to speaker encoder for voice cloning.

sampleRate int

Output sample rate in Hz. Default is 22050.

numMels int

Number of mel spectrogram channels. Default is 80.

speakingRate double

Speaking rate multiplier. Default is 1.0.

noiseScale double

Noise scale for variational sampling. Default is 0.667.

lengthScale double

Length scale for duration control. Default is 1.0.

fftSize int

FFT size for audio generation. Default is 1024.

hopLength int

Hop length for audio generation. Default is 256.

onnxOptions OnnxModelOptions

ONNX runtime options.

Remarks

For Beginners: Use this constructor with pretrained VITS models.

You can get ONNX models from:

  • HuggingFace: Various VITS models
  • Coqui TTS exports

Example:

var vits = new VITSModel<float>(
    architecture,
    modelPath: "vits-en.onnx",
    speakerEncoderPath: "speaker-encoder.onnx");  // For voice cloning

Properties

AvailableVoices

Gets the list of available built-in voices.

public IReadOnlyList<VoiceInfo<T>> AvailableVoices { get; }

Property Value

IReadOnlyList<VoiceInfo<T>>

IsReady

Gets whether the model is ready for synthesis.

public bool IsReady { get; }

Property Value

bool

SupportsEmotionControl

Gets whether this model supports emotional expression control.

public bool SupportsEmotionControl { get; }

Property Value

bool

SupportsStreaming

Gets whether this model supports streaming audio generation.

public bool SupportsStreaming { get; }

Property Value

bool

SupportsVoiceCloning

Gets whether this model supports voice cloning from reference audio.

public bool SupportsVoiceCloning { get; }

Property Value

bool
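
The capability flags above can be checked before calling the corresponding synthesis methods. A minimal sketch, assuming an already constructed vits instance and a referenceAudio tensor:

// Guard synthesis calls with the model's capability flags.
if (!vits.IsReady)
    throw new InvalidOperationException("Model is not ready for synthesis.");

if (vits.SupportsVoiceCloning)
{
    var cloned = vits.SynthesizeWithVoiceCloning("Hello!", referenceAudio);
}

if (vits.SupportsEmotionControl)
{
    // "happy" is an illustrative emotion name; supported names are model-specific.
    var happy = vits.SynthesizeWithEmotion("Hello!", "happy");
}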

Methods

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The binary reader to read the network-specific data from.

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True to release both managed and unmanaged resources; false to release only unmanaged resources.

ExtractSpeakerEmbedding(Tensor<T>)

Extracts speaker embedding from reference audio for voice cloning.

public Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio from the target speaker.

Returns

Tensor<T>

The extracted speaker embedding.
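
Example (a minimal sketch; per the ONNX constructor above, voice cloning requires a speaker encoder loaded via speakerEncoderPath):

// referenceAudio: a short clip (a few seconds) of the target speaker
var embedding = vits.ExtractSpeakerEmbedding(referenceAudio);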

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes layers for ONNX inference mode.

protected override void InitializeLayers()

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

The raw model output to postprocess.

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

The raw audio tensor to preprocess for model input.

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The binary writer to write the network-specific data to.

StartStreamingSession(string?, double)

Starts a streaming synthesis session.

public IStreamingSynthesisSession<T> StartStreamingSession(string? voiceId = null, double speakingRate = 1)

Parameters

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

IStreamingSynthesisSession<T>

A session object for incremental speech synthesis.
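
Example (a sketch; the method signature is shown above, but the members of IStreamingSynthesisSession<T> are not documented here, so the commented calls are hypothetical placeholders):

var session = vits.StartStreamingSession(voiceId: null, speakingRate: 1.0);
// Hypothetical usage of the session object; consult IStreamingSynthesisSession<T>
// for the actual member names:
//   session.FeedText("Hello, ");
//   session.FeedText("world!");
//   var chunk = session.ReadAudio();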

Synthesize(string, string?, double, double)

Synthesizes speech from text.

public Tensor<T> Synthesize(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

Returns

Tensor<T>

The synthesized audio waveform.
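
Example (pitch units are model-dependent; 0 leaves pitch unchanged):

// Default voice, normal rate and pitch
var audio = vits.Synthesize("Hello, world!");

// 20% faster speech with a raised pitch
var faster = vits.Synthesize("Hello, world!", speakingRate: 1.2, pitch: 2.0);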

SynthesizeAsync(string, string?, double, double, CancellationToken)

Synthesizes speech from text asynchronously.

public Task<Tensor<T>> SynthesizeAsync(string text, string? voiceId = null, double speakingRate = 1, double pitch = 0, CancellationToken cancellationToken = default)

Parameters

text string

The text to synthesize.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

cancellationToken CancellationToken

Token used to cancel the synthesis operation.

Returns

Task<Tensor<T>>

A task that resolves to the synthesized audio waveform.
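
Example:

// Cancel the synthesis if it takes longer than 30 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var audio = await vits.SynthesizeAsync(
    "Hello, world!",
    cancellationToken: cts.Token);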

SynthesizeWithEmotion(string, string, double, string?, double)

Synthesizes speech with emotional expression.

public Tensor<T> SynthesizeWithEmotion(string text, string emotion, double emotionIntensity = 0.5, string? voiceId = null, double speakingRate = 1)

Parameters

text string

The text to synthesize.

emotion string

Name of the emotion to express.

emotionIntensity double

Intensity of the emotional expression. Default is 0.5.

voiceId string

Optional voice identifier. If null, the default voice is used.

speakingRate double

Speaking rate multiplier. Default is 1.0.

Returns

Tensor<T>

The synthesized audio waveform.
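
Example (the set of recognized emotion names is model-specific; "happy" here is illustrative):

var audio = vits.SynthesizeWithEmotion(
    "Great to see you!",
    emotion: "happy",
    emotionIntensity: 0.8);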

SynthesizeWithVoiceCloning(string, Tensor<T>, double, double)

Synthesizes speech using a cloned voice from reference audio.

public Tensor<T> SynthesizeWithVoiceCloning(string text, Tensor<T> referenceAudio, double speakingRate = 1, double pitch = 0)

Parameters

text string

The text to synthesize.

referenceAudio Tensor<T>

Reference audio from the target speaker.

speakingRate double

Speaking rate multiplier. Default is 1.0.

pitch double

Pitch adjustment. Default is 0 (no change).

Returns

Tensor<T>

The synthesized audio waveform in the cloned voice.
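
Example:

// referenceAudio: a short clip (a few seconds) of the target speaker
var audio = vits.SynthesizeWithVoiceCloning(
    "Hello in a cloned voice!",
    referenceAudio,
    speakingRate: 1.0);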

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The phoneme input tensor.

expectedOutput Tensor<T>

The expected audio output tensor.
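
Example (a minimal sketch of an epoch loop; trainingPairs stands in for your prepared phoneme/audio dataset):

for (int epoch = 0; epoch < 100; epoch++)
{
    foreach (var (phonemes, audio) in trainingPairs)
    {
        vits.Train(phonemes, audio);
    }
}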

UpdateParameters(Vector<T>)

Updates model parameters using the configured optimizer.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

The gradient vector applied by the configured optimizer.