Class AudioLDM2Model<T>
AudioLDM 2 - Enhanced Audio Latent Diffusion Model with dual text encoders.
public class AudioLDM2Model<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
AudioDiffusionModelBase<T> → AudioLDM2Model<T>
- Implements
- Inherited Members
- Extension Methods
Examples
// Create an AudioLDM 2 model
var audioLDM2 = new AudioLDM2Model<float>();

// Generate high-quality music
var music = audioLDM2.GenerateMusic(
    prompt: "Cinematic orchestral music with dramatic strings and timpani",
    durationSeconds: 20.0,
    numInferenceSteps: 200,
    guidanceScale: 4.5);

// Generate complex sound effects
var soundscape = audioLDM2.GenerateAudio(
    prompt: "A busy city street with traffic, horns, and people talking",
    durationSeconds: 15.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0);
Remarks
AudioLDM 2 is an improved version of AudioLDM with significant architectural enhancements for better text-to-audio and text-to-music generation. Key improvements include:
- Dual Text Encoders: Combines CLAP (audio-text) and T5/GPT-2 (language) embeddings
- Larger Architecture: 384 base channels vs 256 in AudioLDM 1
- Higher Resolution: 128 mel channels vs 64 for better audio quality
- Improved Music Generation: Better temporal coherence and musical structure
- Longer Duration Support: Up to 30 seconds of audio generation
For Beginners: AudioLDM 2 generates higher-quality audio than AudioLDM 1, thanks to its larger U-Net, higher-resolution mel spectrograms, and dual text encoders.
Example prompts:
- "A symphony orchestra playing a dramatic crescendo" -> orchestral music
- "Footsteps on gravel with birds chirping" -> detailed soundscape
- "Electric guitar riff with heavy distortion" -> rock music
The dual encoder architecture means:
- CLAP encoder understands audio concepts (instrument sounds, effects)
- T5/GPT-2 encoder understands language (descriptions, context)
- Combined, they produce audio that matches both sound and meaning
Technical specifications:
- Sample rate: 16 kHz (speech/effects) or 48 kHz (high-quality music)
- Latent channels: 8
- Mel channels: 128 (double AudioLDM 1)
- Base channels: 384 (1.5x AudioLDM 1)
- Context dimension: 1024 (combined encoder output)
- Duration: Up to 30 seconds
- Guidance scale: 3.0-6.0 typical
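As a quick sanity check on the numbers above, the length of a generated waveform follows directly from the sample rate and duration (a sketch using the documented constants; the [1, samples] shape comes from the GenerateAudio return value):

```csharp
// samples = sampleRate * durationSeconds
int sampleRate = 16000;          // AUDIOLDM2_SAMPLE_RATE
double durationSeconds = 10.0;   // default duration
int samples = (int)(sampleRate * durationSeconds);  // 160000
// GenerateAudio with these settings returns a tensor of shape [1, 160000].
```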
Constructors
AudioLDM2Model()
Initializes a new AudioLDM 2 model with default parameters.
public AudioLDM2Model()
AudioLDM2Model(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, AudioLDM2Variant, int, double, int?)
Initializes a new AudioLDM 2 model with custom parameters.
public AudioLDM2Model(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? clapConditioner = null, IConditioningModule<T>? languageConditioner = null, AudioLDM2Variant variant = AudioLDM2Variant.Large, int sampleRate = 16000, double defaultDurationSeconds = 10, int? seed = null)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler.
unet (UNetNoisePredictor<T>): Optional custom U-Net noise predictor.
audioVAE (AudioVAE<T>): Optional custom AudioVAE.
clapConditioner (IConditioningModule<T>): Optional CLAP conditioning module.
languageConditioner (IConditioningModule<T>): Optional T5/GPT-2 conditioning module.
variant (AudioLDM2Variant): Model variant (Base, Large, or Music).
sampleRate (int): Audio sample rate in Hz.
defaultDurationSeconds (double): Default audio duration in seconds.
seed (int?): Optional random seed.
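For example, a model tuned for high-quality music might be constructed as follows (a sketch; it assumes the Music variant pairs naturally with the 48 kHz rate mentioned in the specifications above):

```csharp
var musicModel = new AudioLDM2Model<float>(
    variant: AudioLDM2Variant.Music,
    sampleRate: 48000,             // high-quality music rate
    defaultDurationSeconds: 20.0,
    seed: 42);                     // fixed seed for reproducible generation
```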
Fields
AUDIOLDM2_BASE_CHANNELS
AudioLDM 2 U-Net base channels (larger than AudioLDM 1).
public const int AUDIOLDM2_BASE_CHANNELS = 384
Field Value
- int
AUDIOLDM2_CONTEXT_DIM
Combined context dimension from dual encoders.
public const int AUDIOLDM2_CONTEXT_DIM = 1024
Field Value
- int
AUDIOLDM2_LATENT_CHANNELS
AudioLDM 2 latent space channels.
public const int AUDIOLDM2_LATENT_CHANNELS = 8
Field Value
- int
AUDIOLDM2_MAX_DURATION
Maximum supported duration in seconds.
public const double AUDIOLDM2_MAX_DURATION = 30
Field Value
- double
AUDIOLDM2_MEL_CHANNELS
AudioLDM 2 mel spectrogram channels (increased from 64 to 128).
public const int AUDIOLDM2_MEL_CHANNELS = 128
Field Value
- int
AUDIOLDM2_SAMPLE_RATE
AudioLDM 2 default sample rate for high-quality audio.
public const int AUDIOLDM2_SAMPLE_RATE = 16000
Field Value
- int
Properties
AudioVAE
Gets the AudioVAE for direct access.
public AudioVAE<T> AudioVAE { get; }
Property Value
- AudioVAE<T>
Conditioner
Gets the conditioning module (optional, for conditioned generation).
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>?
LanguageConditioner
Gets the secondary language conditioning module.
public IConditioningModule<T>? LanguageConditioner { get; }
Property Value
- IConditioningModule<T>?
LatentChannels
Gets the number of latent channels.
public override int LatentChannels { get; }
Property Value
- int
Remarks
AudioLDM 2 uses 8 latent channels (see AUDIOLDM2_LATENT_CHANNELS), unlike the 4 channels typical of Stable Diffusion models.
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
public override bool SupportsAudioToAudio { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public override bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public override bool SupportsTextToMusic { get; }
Property Value
- bool
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
public override bool SupportsTextToSpeech { get; }
Property Value
- bool
VAE
Gets the VAE model used for encoding and decoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Variant
Gets the model variant.
public AudioLDM2Variant Variant { get; }
Property Value
- AudioLDM2Variant
Methods
Clone()
Creates a deep copy of the model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
A new instance with the same parameters.
DeepCopy()
Creates a deep copy of this object.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
GenerateAudio(string, string?, double?, int, double, int?)
Generates audio from a text prompt using dual encoders.
public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string?): Optional negative prompt.
durationSeconds (double?): Duration of audio to generate (max 30 s).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
AudioLDM 2's dual encoder architecture provides better prompt understanding:
- CLAP encoder: Understands audio-specific concepts (instruments, sounds, textures)
- T5/GPT-2 encoder: Understands language semantics (descriptions, contexts, styles)
This combination allows for more nuanced control over generated audio.
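A typical call that leans on both encoders, using a negative prompt to suppress unwanted content (audioLDM2 is an instance created as in the Examples section):

```csharp
var rain = audioLDM2.GenerateAudio(
    prompt: "Rain falling on a tin roof with distant thunder",
    negativePrompt: "music, speech, distortion",   // concepts to steer away from
    durationSeconds: 12.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0,
    seed: 123);
// rain is a waveform tensor of shape [1, samples].
```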
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text prompt with enhanced musical understanding.
public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4.5, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string?): Optional negative prompt.
durationSeconds (double?): Duration of music to generate (max 30 s).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
AudioLDM 2 excels at music generation due to its dual encoder architecture. The T5/GPT-2 encoder provides better understanding of musical concepts like:
- Genre descriptions ("jazz fusion", "baroque classical")
- Mood and emotion ("melancholic", "uplifting")
- Instrumentation ("string quartet", "electronic synths")
- Tempo and rhythm ("slow waltz", "fast breakbeat")
The CLAP encoder ensures the generated audio sounds authentic.
GenerateVariations(Tensor<T>, int, double, int?)
Generates audio variations with enhanced diversity.
public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform.
numVariations (int): Number of variations to generate.
variationStrength (double): How much to vary (0.0-1.0).
seed (int?): Optional random seed.
Returns
- List<Tensor<T>>
List of audio variation tensors.
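A sketch of generating variations from an existing clip (inputAudio is assumed to be a previously generated or loaded waveform tensor):

```csharp
List<Tensor<float>> variations = audioLDM2.GenerateVariations(
    inputAudio: inputAudio,
    numVariations: 4,
    variationStrength: 0.3,   // low strength keeps variations close to the input
    seed: 7);
```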
GetParameters()
Gets the parameters that can be optimized.
public override Vector<T> GetParameters()
Returns
- Vector<T>
InterpolateAudio(Tensor<T>, Tensor<T>, int)
Interpolates between two audio samples in latent space.
public virtual List<Tensor<T>> InterpolateAudio(Tensor<T> audio1, Tensor<T> audio2, int numSteps = 5)
Parameters
audio1 (Tensor<T>): First audio sample.
audio2 (Tensor<T>): Second audio sample.
numSteps (int): Number of interpolation steps.
Returns
- List<Tensor<T>>
List of interpolated audio tensors.
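A sketch of latent-space interpolation between two clips (pianoClip and stringsClip are hypothetical [1, samples] waveform tensors):

```csharp
var morphs = audioLDM2.InterpolateAudio(
    audio1: pianoClip,
    audio2: stringsClip,
    numSteps: 5);
// morphs holds 5 waveforms blending gradually from audio1 toward audio2.
```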
SetParameters(Vector<T>)
Sets the model parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): The parameter vector to set.
Remarks
This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.
Exceptions
- ArgumentException
Thrown when the length of parameters does not match ParameterCount.
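A minimal sketch of the get/modify/set cycle an optimizer would run (the in-place update is elided):

```csharp
Vector<float> parameters = audioLDM2.GetParameters();
// ... apply an update to the vector (e.g., a gradient step) ...
audioLDM2.SetParameters(parameters);   // length must equal ParameterCount
```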
TransformAudio(Tensor<T>, string, string?, double, int, double, int?)
Transforms audio based on a text prompt (audio-to-audio).
public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform [batch, samples].
prompt (string): Text description for transformation.
negativePrompt (string?): Optional negative prompt.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed audio waveform tensor.
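A sketch of audio-to-audio transformation (inputAudio is an existing [batch, samples] waveform; higher strength values transform more aggressively):

```csharp
var transformed = audioLDM2.TransformAudio(
    inputAudio: inputAudio,
    prompt: "Lo-fi version with vinyl crackle and muffled drums",
    strength: 0.5,              // 0.0-1.0; how far to move from the input
    numInferenceSteps: 150,
    guidanceScale: 4.0);
```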