Class AudioLDMModel<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

Audio Latent Diffusion Model (AudioLDM) for text-to-audio generation.

public class AudioLDMModel<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioDiffusionModelBase<T>
AudioLDMModel<T>
Implements
ILatentDiffusionModel<T>
IAudioDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create an AudioLDM model
var audioLDM = new AudioLDMModel<float>();

// Generate a sound effect
var dogBark = audioLDM.GenerateAudio(
    prompt: "A dog barking excitedly",
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.5);

// Generate music
var music = audioLDM.GenerateMusic(
    prompt: "Soft jazz piano with light drums",
    durationSeconds: 10.0,
    numInferenceSteps: 200,
    guidanceScale: 4.0);

// Save as a WAV file (SaveWav is assumed to be a user-supplied helper)
SaveWav(dogBark, "dog_bark.wav", sampleRate: 16000);

Remarks

AudioLDM is a latent diffusion model specifically designed for audio generation. It works by generating mel spectrograms in latent space and then converting them to audio using a vocoder (like HiFi-GAN).

For Beginners: AudioLDM lets you create sounds and music from text descriptions:

Example prompts:

  • "A dog barking in a park" -> generates dog barking sounds
  • "Rain falling on a window" -> generates rain sounds
  • "Jazz piano playing softly" -> generates jazz piano music

How it works:

  1. Text -> CLAP encoder -> text embedding (understands audio concepts)
  2. Text embedding guides diffusion in latent space
  3. Latent -> AudioVAE decoder -> mel spectrogram
  4. Mel spectrogram -> Vocoder -> audio waveform

Key features:

  • Text-to-audio: Generate sounds from descriptions
  • Audio-to-audio: Transform sounds while preserving some characteristics
  • Variable duration: Generate audio of different lengths
  • Classifier-free guidance: Control how closely to follow the prompt

Technical specifications:

  • Sample rate: 16 kHz (standard for speech/effects) or 48 kHz (music)
  • Latent channels: 8
  • Mel channels: 64 (AudioLDM) or 128 (AudioLDM 2)
  • Duration: typically 10 seconds, but configurable
  • Guidance scale: 2.5-5.0 typical

Constructors

AudioLDMModel()

Initializes a new AudioLDM model with default parameters.

public AudioLDMModel()

AudioLDMModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, int, double, int, bool, int?)

Initializes a new AudioLDM model with custom parameters.

public AudioLDMModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? conditioner = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64, bool isVersion2 = false, int? seed = null)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

unet UNetNoisePredictor<T>

Optional custom U-Net noise predictor.

audioVAE AudioVAE<T>

Optional custom AudioVAE.

conditioner IConditioningModule<T>

Optional CLAP conditioning module.

sampleRate int

Audio sample rate in Hz.

defaultDurationSeconds double

Default audio duration.

melChannels int

Number of mel spectrogram channels.

isVersion2 bool

Whether to use AudioLDM 2 configuration.

seed int?

Optional random seed.
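
For example, here is a minimal sketch of constructing an AudioLDM 2 style configuration; the values below are illustrative, not recommended defaults:

// Construct an AudioLDM 2 configuration with a fixed seed.
var audioLDM2 = new AudioLDMModel<float>(
    sampleRate: 16000,
    defaultDurationSeconds: 10,
    melChannels: 128,    // AudioLDM 2 uses 128 mel channels
    isVersion2: true,
    seed: 42);           // fixed seed for reproducible generation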

Fields

AUDIOLDM_LATENT_CHANNELS

Standard AudioLDM latent channels.

public const int AUDIOLDM_LATENT_CHANNELS = 8

Field Value

int

AUDIOLDM_MEL_CHANNELS

Standard AudioLDM mel channels.

public const int AUDIOLDM_MEL_CHANNELS = 64

Field Value

int

AUDIOLDM_SAMPLE_RATE

Standard AudioLDM sample rate.

public const int AUDIOLDM_SAMPLE_RATE = 16000

Field Value

int

Properties

AudioVAE

Gets the AudioVAE used for encoding/decoding.

public AudioVAE<T> AudioVAE { get; }

Property Value

AudioVAE<T>

Conditioner

Gets the conditioning module (optional, for conditioned generation).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

IsVersion2

Gets whether this is AudioLDM version 2.

public bool IsVersion2 { get; }

Property Value

bool

LatentChannels

Gets the number of latent channels.

public override int LatentChannels { get; }

Property Value

int

Remarks

Typically 4 for Stable Diffusion models; AudioLDM uses 8 (see AUDIOLDM_LATENT_CHANNELS).

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

public override bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public override bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public override bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

public override bool SupportsTextToSpeech { get; }

Property Value

bool

VAE

Gets the VAE model used for encoding and decoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Creates a deep copy of the model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

A new instance with the same parameters.

DeepCopy()

Creates a deep copy of this object.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A deep copy of this model.

GenerateAudio(string, string?, double?, int, double, int?)

Generates audio from a text prompt.

public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of audio to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

For Beginners: This generates audio matching your text description:

Prompt tips:

  • Be descriptive: "A loud thunderstorm with heavy rain" vs "thunder"
  • Include context: "A dog barking in a quiet park"
  • Specify style for music: "Upbeat electronic dance music with synth bass"

Guidance scale effects:

  • Lower (2.0-3.0): More variety, may not match prompt exactly
  • Medium (3.0-4.0): Good balance of quality and prompt following
  • Higher (4.0-6.0): Closely follows prompt, may reduce quality
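
A short sketch tying these tips together; the prompt, negative prompt, and seed below are illustrative:

// Generate a thunderstorm sound, steering away from unwanted content.
var storm = audioLDM.GenerateAudio(
    prompt: "A loud thunderstorm with heavy rain",
    negativePrompt: "music, speech",   // discourage these characteristics
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.5,                // medium: balanced quality and adherence
    seed: 42);                         // same seed reproduces the same output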

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text prompt.

public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of music to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

Music generation uses the same underlying model but with prompts focused on musical content. For best results:

  • Specify genre: "jazz", "electronic", "classical"
  • Mention instruments: "piano", "guitar", "synthesizer"
  • Describe mood: "upbeat", "melancholic", "energetic"
  • Include tempo hints: "slow ballad", "fast dance beat"
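
A sketch of a prompt following these tips (the prompt text is illustrative):

// Genre, instrument, mood, and tempo hints packed into one prompt.
var track = audioLDM.GenerateMusic(
    prompt: "Melancholic classical piano, slow ballad tempo",
    negativePrompt: "drums, vocals",
    durationSeconds: 10.0,
    numInferenceSteps: 200,
    guidanceScale: 4.0);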

GenerateVariations(Tensor<T>, int, double, int?)

Generates audio variations from an input audio.

public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform [batch, samples].

numVariations int

Number of variations to generate.

variationStrength double

How much to vary (0.0-1.0).

seed int?

Optional random seed.

Returns

List<Tensor<T>>

List of audio variation tensors.

Remarks

For Beginners: This creates multiple variations of your audio:

Use cases:

  • Sound design: Generate similar but unique sounds
  • Music production: Create instrument variations
  • Audio augmentation: Expand training data

Each variation will be similar to the input but with random differences.
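
A minimal sketch, assuming originalClip is a waveform tensor of shape [batch, samples] you already have:

// Produce four subtle variations of an existing clip for sound design.
List<Tensor<float>> variations = audioLDM.GenerateVariations(
    inputAudio: originalClip,
    numVariations: 4,
    variationStrength: 0.3,   // low strength preserves the original character
    seed: 123);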

GetParameters()

Gets the parameters that can be optimized.

public override Vector<T> GetParameters()

Returns

Vector<T>

The model's current trainable parameters as a flat vector.

SetParameters(Vector<T>)

Sets the model parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to set.

Remarks

This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.

Exceptions

ArgumentException

Thrown when the length of parameters does not match ParameterCount.
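
A minimal sketch of such an optimization round trip; ApplyUpdate is a hypothetical update routine, not part of the library:

// Read the current parameters, compute an update, and write them back.
Vector<float> parameters = audioLDM.GetParameters();
Vector<float> updated = ApplyUpdate(parameters);  // hypothetical optimizer step
audioLDM.SetParameters(updated);                  // length must equal ParameterCount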

TransformAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms audio based on a text prompt (audio-to-audio).

public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform [batch, samples].

prompt string

Text description for transformation.

negativePrompt string

Optional negative prompt.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform tensor.

Remarks

For Beginners: This transforms existing audio based on your description:

Examples:

  • Input: speech, Prompt: "whispered voice" -> quieter, intimate version
  • Input: guitar, Prompt: "electric guitar with distortion" -> adds effects
  • Input: ambient, Prompt: "add rain sounds" -> mixes in rain

Strength controls how much to change:

  • Low (0.2-0.4): Subtle changes, preserves original character
  • Medium (0.4-0.6): Noticeable changes while keeping structure
  • High (0.6-0.8): Major changes, may alter original significantly
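
A short sketch at medium strength, assuming inputGuitar is a waveform tensor of shape [batch, samples]:

// Add distortion character to an existing guitar recording.
var distorted = audioLDM.TransformAudio(
    inputAudio: inputGuitar,
    prompt: "electric guitar with heavy distortion",
    strength: 0.5,             // medium: noticeable change, keeps structure
    numInferenceSteps: 100,
    guidanceScale: 3.5);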