Class AudioLDMModel<T>

Namespace
AiDotNet.Audio.AudioLDM
Assembly
AiDotNet.dll

AudioLDM (Audio Latent Diffusion Model) for generating audio from text descriptions.

public class AudioLDMModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
AudioLDMModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IAudioGenerator<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

AudioLDM is a latent diffusion model that generates audio by learning to reverse a diffusion process in a compressed latent space. It uses CLAP (Contrastive Language-Audio Pretraining) for text conditioning and a VAE for efficient latent space learning.

Architecture components:

  1. CLAP Encoder: Contrastive text encoder that aligns text with audio features
  2. VAE: Variational autoencoder that compresses mel spectrograms to latent space
  3. U-Net Denoiser: Predicts noise to be removed at each diffusion step
  4. HiFi-GAN Vocoder: Converts mel spectrograms to audio waveforms

For Beginners: AudioLDM creates realistic audio from your descriptions:

How it works:

  1. You describe the sound you want ("a cat meowing")
  2. CLAP encodes your text into an audio-aligned representation
  3. The diffusion process generates a latent audio representation
  4. The VAE decoder converts latents to mel spectrogram
  5. HiFi-GAN vocoder converts the spectrogram to audio

Key features:

  • General audio and music generation
  • Environmental sounds, speech, music
  • Controllable through text prompts
  • High-quality 16kHz or 48kHz output

Usage:

var model = new AudioLDMModel<float>(architecture, options);
var audio = model.GenerateAudio("A dog barking in a park");

Reference: "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Liu et al., 2023

Constructors

AudioLDMModel(NeuralNetworkArchitecture<T>, AudioLDMOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an AudioLDM model using native layers for training from scratch.

public AudioLDMModel(NeuralNetworkArchitecture<T> architecture, AudioLDMOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

options AudioLDMOptions

AudioLDM configuration options.

tokenizer ITokenizer

Optional tokenizer. If null, creates CLAP-compatible tokenizer.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer. Defaults to AdamW.

lossFunction ILossFunction<T>

Optional loss function. Defaults to MSE.

Remarks

For Beginners: Use this constructor when:

  • Training AudioLDM from scratch (requires significant data)
  • Fine-tuning on custom audio types
  • Research and experimentation

For most use cases, load pretrained ONNX models instead.
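The from-scratch path can be sketched as follows; the constructor arguments shown as placeholders are assumptions, not working configuration values:

```csharp
// Placeholder configuration: a real training run needs a task-appropriate
// architecture, tuned options, and a large paired audio/text dataset.
var architecture = new NeuralNetworkArchitecture<float>(/* ... */);
var options = new AudioLDMOptions(/* ... */);

// Tokenizer, optimizer (AdamW), and loss function (MSE) fall back to defaults.
var model = new AudioLDMModel<float>(architecture, options);
```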

AudioLDMModel(NeuralNetworkArchitecture<T>, string, string, string, string, ITokenizer, AudioLDMOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an AudioLDM model using pretrained ONNX models for inference.

public AudioLDMModel(NeuralNetworkArchitecture<T> architecture, string clapEncoderPath, string vaePath, string unetPath, string vocoderPath, ITokenizer tokenizer, AudioLDMOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

clapEncoderPath string

Path to the CLAP text encoder ONNX model.

vaePath string

Path to the VAE ONNX model.

unetPath string

Path to the U-Net denoiser ONNX model.

vocoderPath string

Path to the HiFi-GAN vocoder ONNX model.

tokenizer ITokenizer

CLAP tokenizer for text processing (REQUIRED).

options AudioLDMOptions

AudioLDM configuration options.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.

Exceptions

ArgumentException

Thrown when required paths are empty.

FileNotFoundException

Thrown when model files don't exist.

ArgumentNullException

Thrown when tokenizer is null.
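A construction sketch for the pretrained path. The file names and the LoadClapTokenizer() helper below are hypothetical; only the parameter names come from the signature above:

```csharp
// Hypothetical paths: point these at your exported ONNX model files.
// LoadClapTokenizer() stands in for however you obtain a CLAP tokenizer;
// passing a null tokenizer throws ArgumentNullException.
ITokenizer tokenizer = LoadClapTokenizer();

var model = new AudioLDMModel<float>(
    architecture,
    clapEncoderPath: "models/clap_text_encoder.onnx",
    vaePath: "models/vae.onnx",
    unetPath: "models/unet.onnx",
    vocoderPath: "models/hifigan_vocoder.onnx",
    tokenizer: tokenizer);
```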

Properties

MaxDurationSeconds

Gets the maximum duration of audio that can be generated.

public double MaxDurationSeconds { get; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

public int SampleRate { get; }

Property Value

int

SupportsAudioContinuation

Gets whether this model supports audio continuation.

public bool SupportsAudioContinuation { get; }

Property Value

bool

SupportsAudioInpainting

Gets whether this model supports audio inpainting.

public bool SupportsAudioInpainting { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public bool SupportsTextToMusic { get; }

Property Value

bool

Methods

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues existing audio by extending it.

public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>
prompt string
extensionSeconds double
numInferenceSteps int
seed int?

Returns

Tensor<T>
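A usage sketch for ContinueAudio; existingAudio is assumed to be a waveform tensor at the model's sample rate:

```csharp
// Extend previously generated (or loaded) audio by five seconds.
// A fixed seed makes the extension reproducible.
Tensor<float> extended = model.ContinueAudio(
    inputAudio: existingAudio,
    prompt: "the barking continues, fading into the distance",
    extensionSeconds: 5,
    seed: 42);
```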

CreateNewInstance()

Creates a new instance for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes of model resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

GenerateAudio(string, string?, double, int, double, int?)

Generates audio from a text description.

public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?

Returns

Tensor<T>
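A usage sketch for GenerateAudio; the prompt strings are illustrative, and model is assumed to be an already constructed instance:

```csharp
// Generate ten seconds of audio. negativePrompt steers the output away
// from unwanted content, and a fixed seed makes the result reproducible.
Tensor<float> audio = model.GenerateAudio(
    prompt: "rain falling on a tin roof",
    negativePrompt: "speech, music",
    durationSeconds: 10,
    numInferenceSteps: 100,  // more steps: higher quality, slower generation
    guidanceScale: 3,        // how strongly the output follows the prompt
    seed: 1234);
```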

GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)

Generates audio asynchronously.

public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?
cancellationToken CancellationToken

Returns

Task<Tensor<T>>
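An asynchronous usage sketch (inside an async method); the two-minute timeout is an illustrative choice:

```csharp
// Cancelable generation, e.g. from a UI handler or web request.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

Tensor<float> audio = await model.GenerateAudioAsync(
    prompt: "ocean waves on a pebble beach",
    durationSeconds: 8,
    cancellationToken: cts.Token);
```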

GenerateMusic(string, string?, double, int, double, int?)

Generates music from a text description.

public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?

Returns

Tensor<T>
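A usage sketch for GenerateMusic; note its longer default duration (10 seconds) compared to GenerateAudio:

```csharp
// Music-specific entry point with an illustrative prompt.
Tensor<float> music = model.GenerateMusic(
    prompt: "calm solo piano, slow tempo",
    durationSeconds: 15,
    guidanceScale: 3);
```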

GetDefaultOptions()

Gets default generation options.

public AudioGenerationOptions<T> GetDefaultOptions()

Returns

AudioGenerationOptions<T>

GetModelMetadata()

Gets model metadata.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the neural network layers following the library's standard layer-initialization pattern.

protected override void InitializeLayers()

InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)

Fills in missing or masked sections of audio.

public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)

Parameters

audio Tensor<T>
mask Tensor<T>
prompt string
numInferenceSteps int
seed int?

Returns

Tensor<T>
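A usage sketch for InpaintAudio. The mask convention in the comment (1 marks regions to regenerate) is an assumption; verify it against your build:

```csharp
// Regenerate a damaged or missing region of the input audio.
// Assumed convention: mask value 1 = regenerate, 0 = keep original.
Tensor<float> repaired = model.InpaintAudio(
    audio: corruptedAudio,
    mask: regionMask,
    prompt: "continuous birdsong",
    numInferenceSteps: 100);
```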

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>
expectedOutput Tensor<T>

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>