Class AudioLDM2Model<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

AudioLDM 2 - Enhanced Audio Latent Diffusion Model with dual text encoders.

public class AudioLDM2Model<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
AudioDiffusionModelBase<T>
AudioLDM2Model<T>
Implements
ILatentDiffusionModel<T>
IAudioDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create an AudioLDM 2 model
var audioLDM2 = new AudioLDM2Model<float>();

// Generate high-quality music
var music = audioLDM2.GenerateMusic(
    prompt: "Cinematic orchestral music with dramatic strings and timpani",
    durationSeconds: 20.0,
    numInferenceSteps: 200,
    guidanceScale: 4.5);

// Generate complex sound effects
var soundscape = audioLDM2.GenerateAudio(
    prompt: "A busy city street with traffic, horns, and people talking",
    durationSeconds: 15.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0);

Remarks

AudioLDM 2 is an improved version of AudioLDM with significant architectural enhancements for better text-to-audio and text-to-music generation. Key improvements include:

  1. Dual Text Encoders: Combines CLAP (audio-text) and T5/GPT-2 (language) embeddings
  2. Larger Architecture: 384 base channels vs 256 in AudioLDM 1
  3. Higher Resolution: 128 mel channels vs 64 for better audio quality
  4. Improved Music Generation: Better temporal coherence and musical structure
  5. Longer Duration Support: Up to 30 seconds of audio generation

For Beginners: AudioLDM 2 generates higher-quality audio than AudioLDM 1 and responds to more detailed prompts.

Example prompts:

  • "A symphony orchestra playing a dramatic crescendo" -> orchestral music
  • "Footsteps on gravel with birds chirping" -> detailed soundscape
  • "Electric guitar riff with heavy distortion" -> rock music

The dual encoder architecture means:

  • CLAP encoder understands audio concepts (instrument sounds, effects)
  • T5/GPT-2 encoder understands language (descriptions, context)
  • Combined, they produce audio that matches both sound and meaning

Technical specifications:

  • Sample rate: 16 kHz (speech/effects) or 48 kHz (high-quality music)
  • Latent channels: 8
  • Mel channels: 128 (double AudioLDM 1)
  • Base channels: 384 (1.5x AudioLDM 1)
  • Context dimension: 1024 (combined encoder output)
  • Duration: up to 30 seconds
  • Guidance scale: 3.0-6.0 typical

Constructors

AudioLDM2Model()

Initializes a new AudioLDM 2 model with default parameters.

public AudioLDM2Model()

AudioLDM2Model(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, AudioLDM2Variant, int, double, int?)

Initializes a new AudioLDM 2 model with custom parameters.

public AudioLDM2Model(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? clapConditioner = null, IConditioningModule<T>? languageConditioner = null, AudioLDM2Variant variant = AudioLDM2Variant.Large, int sampleRate = 16000, double defaultDurationSeconds = 10, int? seed = null)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

unet UNetNoisePredictor<T>

Optional custom U-Net noise predictor.

audioVAE AudioVAE<T>

Optional custom AudioVAE.

clapConditioner IConditioningModule<T>

Optional CLAP conditioning module.

languageConditioner IConditioningModule<T>

Optional T5/GPT-2 conditioning module.

variant AudioLDM2Variant

Model variant (Base, Large, or Music).

sampleRate int

Audio sample rate in Hz.

defaultDurationSeconds double

Default audio duration.

seed int?

Optional random seed.
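
For example, a music-oriented model can be constructed with the Music variant and the 48 kHz rate mentioned in the class remarks. A minimal sketch using only the named parameters documented above:

// Create a music-focused AudioLDM 2 model at 48 kHz with a fixed seed
var musicModel = new AudioLDM2Model<float>(
    variant: AudioLDM2Variant.Music,
    sampleRate: 48000,
    defaultDurationSeconds: 20.0,
    seed: 42);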

Fields

AUDIOLDM2_BASE_CHANNELS

AudioLDM 2 U-Net base channels (larger than AudioLDM 1).

public const int AUDIOLDM2_BASE_CHANNELS = 384

Field Value

int

AUDIOLDM2_CONTEXT_DIM

Combined context dimension from dual encoders.

public const int AUDIOLDM2_CONTEXT_DIM = 1024

Field Value

int

AUDIOLDM2_LATENT_CHANNELS

AudioLDM 2 latent space channels.

public const int AUDIOLDM2_LATENT_CHANNELS = 8

Field Value

int

AUDIOLDM2_MAX_DURATION

Maximum supported duration in seconds.

public const double AUDIOLDM2_MAX_DURATION = 30

Field Value

double

AUDIOLDM2_MEL_CHANNELS

AudioLDM 2 mel spectrogram channels (increased from 64 to 128).

public const int AUDIOLDM2_MEL_CHANNELS = 128

Field Value

int

AUDIOLDM2_SAMPLE_RATE

AudioLDM 2 default sample rate (16 kHz; 48 kHz can be used for high-quality music, per the class remarks).

public const int AUDIOLDM2_SAMPLE_RATE = 16000

Field Value

int
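
These constants are public and can be used directly, for example to size output buffers. A small sketch (note that in C# a constant on a generic type is accessed through a constructed type such as AudioLDM2Model<float>):

// Maximum sample count at the default 16 kHz rate: 16000 * 30 = 480,000
int maxSamples = (int)(AudioLDM2Model<float>.AUDIOLDM2_SAMPLE_RATE
    * AudioLDM2Model<float>.AUDIOLDM2_MAX_DURATION);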

Properties

AudioVAE

Gets the AudioVAE for direct access.

public AudioVAE<T> AudioVAE { get; }

Property Value

AudioVAE<T>

Conditioner

Gets the primary conditioning module (optional, for conditioned generation).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

LanguageConditioner

Gets the secondary language conditioning module.

public IConditioningModule<T>? LanguageConditioner { get; }

Property Value

IConditioningModule<T>

LatentChannels

Gets the number of latent channels.

public override int LatentChannels { get; }

Property Value

int

Remarks

8 for AudioLDM 2 (see AUDIOLDM2_LATENT_CHANNELS); image-oriented Stable Diffusion models typically use 4.

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

public override bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public override bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public override bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

public override bool SupportsTextToSpeech { get; }

Property Value

bool

VAE

Gets the VAE model used for encoding and decoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Variant

Gets the model variant.

public AudioLDM2Variant Variant { get; }

Property Value

AudioLDM2Variant

Methods

Clone()

Creates a deep copy of the model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

A new instance with the same parameters.

DeepCopy()

Creates a deep copy of this object.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A deep copy of this model.
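
Both methods create an independent copy; per the signatures above, they differ in the interface through which the copy is returned. A brief sketch:

var model = new AudioLDM2Model<float>();

// Clone returns the diffusion-model view of the copy...
IDiffusionModel<float> diffusionCopy = model.Clone();

// ...while DeepCopy returns the full-model view.
IFullModel<float, Tensor<float>, Tensor<float>> fullCopy = model.DeepCopy();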

GenerateAudio(string, string?, double?, int, double, int?)

Generates audio from a text prompt using dual encoders.

public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of audio to generate (max 30s).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

AudioLDM 2's dual encoder architecture provides better prompt understanding:

  • CLAP encoder: Understands audio-specific concepts (instruments, sounds, textures)
  • T5/GPT-2 encoder: Understands language semantics (descriptions, contexts, styles)

This combination allows for more nuanced control over generated audio.
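
For example, a prompt can mix audio-specific terms (resolved by CLAP) with descriptive language (resolved by T5/GPT-2). A sketch reusing the audioLDM2 instance from the Examples section:

// Generate a detailed soundscape with a negative prompt and fixed seed
var ambience = audioLDM2.GenerateAudio(
    prompt: "Gentle rain on a tin roof with distant thunder",
    negativePrompt: "music, speech",
    durationSeconds: 10.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0,
    seed: 1234);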

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text prompt with enhanced musical understanding.

public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4.5, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of music to generate (max 30s).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

AudioLDM 2 excels at music generation due to its dual encoder architecture. The T5/GPT-2 encoder provides better understanding of musical concepts like:

  • Genre descriptions ("jazz fusion", "baroque classical")
  • Mood and emotion ("melancholic", "uplifting")
  • Instrumentation ("string quartet", "electronic synths")
  • Tempo and rhythm ("slow waltz", "fast breakbeat")

The CLAP encoder ensures the generated audio sounds authentic.
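
A sketch combining genre, mood, instrumentation, and tempo in a single prompt (reusing the audioLDM2 instance from the Examples section):

// Genre + mood + instrumentation + tempo in one prompt
var waltz = audioLDM2.GenerateMusic(
    prompt: "Melancholic baroque string quartet playing a slow waltz",
    negativePrompt: "distortion, low quality",
    durationSeconds: 20.0,
    numInferenceSteps: 200,
    guidanceScale: 4.5);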

GenerateVariations(Tensor<T>, int, double, int?)

Generates audio variations with enhanced diversity.

public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform.

numVariations int

Number of variations to generate.

variationStrength double

How much to vary (0.0-1.0).

seed int?

Optional random seed.

Returns

List<Tensor<T>>

List of audio variation tensors.
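
A minimal sketch that feeds a generated clip back in as the variation source (reusing the audioLDM2 instance from the Examples section):

// Produce four mild variations of a generated clip
var source = audioLDM2.GenerateAudio(prompt: "Soft piano melody");
var variations = audioLDM2.GenerateVariations(
    inputAudio: source,
    numVariations: 4,
    variationStrength: 0.3,
    seed: 7);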

GetParameters()

Gets the parameters that can be optimized.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing the model's trainable parameters.

InterpolateAudio(Tensor<T>, Tensor<T>, int)

Interpolates between two audio samples in latent space.

public virtual List<Tensor<T>> InterpolateAudio(Tensor<T> audio1, Tensor<T> audio2, int numSteps = 5)

Parameters

audio1 Tensor<T>

First audio sample.

audio2 Tensor<T>

Second audio sample.

numSteps int

Number of interpolation steps.

Returns

List<Tensor<T>>

List of interpolated audio tensors.
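
A sketch that morphs between two generated clips; each returned tensor is one step along the latent-space path:

// Morph from rainfall into piano over five latent steps
var rain = audioLDM2.GenerateAudio(prompt: "Steady rainfall");
var piano = audioLDM2.GenerateAudio(prompt: "Soft piano melody");
var morph = audioLDM2.InterpolateAudio(rain, piano, numSteps: 5);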

SetParameters(Vector<T>)

Sets the model parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to set.

Remarks

This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.

Exceptions

ArgumentException

Thrown when the length of parameters does not match ParameterCount.
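
A round-trip sketch pairing this method with GetParameters; the vector written back must have exactly ParameterCount elements:

// Read the current parameters and write them back
Vector<float> parameters = audioLDM2.GetParameters();
// ... an optimizer would update `parameters` here ...
audioLDM2.SetParameters(parameters);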

TransformAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms audio based on a text prompt (audio-to-audio).

public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform [batch, samples].

prompt string

Text description for transformation.

negativePrompt string

Optional negative prompt.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform tensor.
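
A sketch of audio-to-audio transformation; lower strength values keep the output closer to the input (reusing the audioLDM2 instance from the Examples section):

// Re-style an acoustic clip toward an orchestral sound
var input = audioLDM2.GenerateAudio(prompt: "Simple acoustic guitar melody");
var stylized = audioLDM2.TransformAudio(
    inputAudio: input,
    prompt: "Orchestral arrangement with lush strings",
    negativePrompt: "noise, clipping",
    strength: 0.5,
    numInferenceSteps: 150,
    guidanceScale: 4.0);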