Class AudioLDM2Model<T>
AudioLDM 2 - Enhanced Audio Latent Diffusion Model with dual text encoders.
public class AudioLDM2Model<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
AudioDiffusionModelBase<T> → AudioLDM2Model<T>
- Implements
- Inherited Members
- Extension Methods
Examples
// Create an AudioLDM 2 model
var audioLDM2 = new AudioLDM2Model<float>();

// Generate high-quality music
var music = audioLDM2.GenerateMusic(
    prompt: "Cinematic orchestral music with dramatic strings and timpani",
    durationSeconds: 20.0,
    numInferenceSteps: 200,
    guidanceScale: 4.5);

// Generate complex sound effects
var soundscape = audioLDM2.GenerateAudio(
    prompt: "A busy city street with traffic, horns, and people talking",
    durationSeconds: 15.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0);
Remarks
AudioLDM 2 is an improved version of AudioLDM with significant architectural enhancements for better text-to-audio and text-to-music generation. Key improvements include:
- Dual Text Encoders: Combines CLAP (audio-text) and T5/GPT-2 (language) embeddings
- Larger Architecture: 384 base channels vs 256 in AudioLDM 1
- Higher Resolution: 128 mel channels vs 64 for better audio quality
- Improved Music Generation: Better temporal coherence and musical structure
- Longer Duration Support: Up to 30 seconds of audio generation
For Beginners: AudioLDM 2 generates higher-quality audio than AudioLDM 1, thanks to its larger U-Net, higher-resolution mel spectrograms, and dual text encoders.
Example prompts:
- "A symphony orchestra playing a dramatic crescendo" -> orchestral music
- "Footsteps on gravel with birds chirping" -> detailed soundscape
- "Electric guitar riff with heavy distortion" -> rock music
The dual encoder architecture means:
- CLAP encoder understands audio concepts (instrument sounds, effects)
- T5/GPT-2 encoder understands language (descriptions, context)
- Combined, they produce audio that matches both sound and meaning
Technical specifications:
- Sample rate: 16 kHz (speech/effects) or 48 kHz (high-quality music)
- Latent channels: 8
- Mel channels: 128 (double AudioLDM 1)
- Base channels: 384 (1.5x AudioLDM 1)
- Context dimension: 1024 (combined encoder output)
- Duration: Up to 30 seconds
- Guidance scale: 3.0-6.0 typical
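As a quick sanity check on the numbers above, the length of a generated waveform follows directly from the sample rate and duration (a sketch using the documented constants; the [1, samples] shape comes from the GenerateAudio return value):

```csharp
// samples = sampleRate * durationSeconds
int sampleRate = 16000;          // AUDIOLDM2_SAMPLE_RATE
double durationSeconds = 10.0;   // default duration
int samples = (int)(sampleRate * durationSeconds);  // 160000
// GenerateAudio with these settings returns a tensor of shape [1, 160000].
```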
Constructors
AudioLDM2Model()
Initializes a new AudioLDM 2 model with default parameters.
public AudioLDM2Model()
AudioLDM2Model(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, AudioLDM2Variant, int, double, int?)
Initializes a new AudioLDM 2 model with custom parameters.
public AudioLDM2Model(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? clapConditioner = null, IConditioningModule<T>? languageConditioner = null, AudioLDM2Variant variant = AudioLDM2Variant.Large, int sampleRate = 16000, double defaultDurationSeconds = 10, int? seed = null)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler.
unet (UNetNoisePredictor<T>): Optional custom U-Net noise predictor.
audioVAE (AudioVAE<T>): Optional custom AudioVAE.
clapConditioner (IConditioningModule<T>): Optional CLAP conditioning module.
languageConditioner (IConditioningModule<T>): Optional T5/GPT-2 conditioning module.
variant (AudioLDM2Variant): Model variant (Base, Large, or Music).
sampleRate (int): Audio sample rate in Hz.
defaultDurationSeconds (double): Default audio duration in seconds.
seed (int?): Optional random seed.
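For example, a model tuned for high-quality music might be constructed as follows (a sketch; it assumes the Music variant pairs naturally with the 48 kHz rate mentioned in the specifications above):

```csharp
var musicModel = new AudioLDM2Model<float>(
    variant: AudioLDM2Variant.Music,
    sampleRate: 48000,             // high-quality music rate
    defaultDurationSeconds: 20.0,
    seed: 42);                     // fixed seed for reproducible generation
```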
Fields
AUDIOLDM2_BASE_CHANNELS
AudioLDM 2 U-Net base channels (larger than AudioLDM 1).
public const int AUDIOLDM2_BASE_CHANNELS = 384
Field Value
- int
AUDIOLDM2_CONTEXT_DIM
Combined context dimension from dual encoders.
public const int AUDIOLDM2_CONTEXT_DIM = 1024
Field Value
- int
AUDIOLDM2_LATENT_CHANNELS
AudioLDM 2 latent space channels.
public const int AUDIOLDM2_LATENT_CHANNELS = 8
Field Value
- int
AUDIOLDM2_MAX_DURATION
Maximum supported duration in seconds.
public const double AUDIOLDM2_MAX_DURATION = 30
Field Value
- double
AUDIOLDM2_MEL_CHANNELS
AudioLDM 2 mel spectrogram channels (increased from 64 to 128).
public const int AUDIOLDM2_MEL_CHANNELS = 128
Field Value
- int
AUDIOLDM2_SAMPLE_RATE
AudioLDM 2 default sample rate for high-quality audio.
public const int AUDIOLDM2_SAMPLE_RATE = 16000
Field Value
- int
Properties
AudioVAE
Gets the AudioVAE for direct access.
public AudioVAE<T> AudioVAE { get; }
Property Value
- AudioVAE<T>
Conditioner
Gets the conditioning module (optional, for conditioned generation).
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>?
LanguageConditioner
Gets the secondary language conditioning module.
public IConditioningModule<T>? LanguageConditioner { get; }
Property Value
- IConditioningModule<T>?
LatentChannels
Gets the number of latent channels.
public override int LatentChannels { get; }
Property Value
- int
Remarks
AudioLDM 2 uses 8 latent channels (see AUDIOLDM2_LATENT_CHANNELS), unlike the 4 channels typical of Stable Diffusion models.
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
public override bool SupportsAudioToAudio { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public override bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public override bool SupportsTextToMusic { get; }
Property Value
- bool
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
public override bool SupportsTextToSpeech { get; }
Property Value
- bool
VAE
Gets the VAE model used for encoding and decoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Variant
Gets the model variant.
public AudioLDM2Variant Variant { get; }
Property Value
- AudioLDM2Variant
Methods
Clone()
Creates a deep copy of the model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
A new instance with the same parameters.
DeepCopy()
Creates a deep copy of this object.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
GenerateAudio(string, string?, double?, int, double, int?)
Generates audio from a text prompt using dual encoders.
public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string?): Optional negative prompt.
durationSeconds (double?): Duration of audio to generate (max 30 s).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
AudioLDM 2's dual encoder architecture provides better prompt understanding:
- CLAP encoder: Understands audio-specific concepts (instruments, sounds, textures)
- T5/GPT-2 encoder: Understands language semantics (descriptions, contexts, styles)
This combination allows for more nuanced control over generated audio.
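A typical call that leans on both encoders, using a negative prompt to suppress unwanted content (audioLDM2 is an instance created as in the Examples section):

```csharp
var rain = audioLDM2.GenerateAudio(
    prompt: "Rain falling on a tin roof with distant thunder",
    negativePrompt: "music, speech, distortion",   // concepts to steer away from
    durationSeconds: 12.0,
    numInferenceSteps: 150,
    guidanceScale: 4.0,
    seed: 123);
// rain is a waveform tensor of shape [1, samples].
```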
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text prompt with enhanced musical understanding.
public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4.5, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string?): Optional negative prompt.
durationSeconds (double?): Duration of music to generate (max 30 s).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
AudioLDM 2 excels at music generation due to its dual encoder architecture. The T5/GPT-2 encoder provides better understanding of musical concepts like:
- Genre descriptions ("jazz fusion", "baroque classical")
- Mood and emotion ("melancholic", "uplifting")
- Instrumentation ("string quartet", "electronic synths")
- Tempo and rhythm ("slow waltz", "fast breakbeat")
The CLAP encoder ensures the generated audio sounds authentic.
GenerateVariations(Tensor<T>, int, double, int?)
Generates audio variations with enhanced diversity.
public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform.
numVariations (int): Number of variations to generate.
variationStrength (double): How much to vary (0.0-1.0).
seed (int?): Optional random seed.
Returns
- List<Tensor<T>>
List of audio variation tensors.
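A sketch of generating variations from an existing clip (inputAudio is assumed to be a previously generated or loaded waveform tensor):

```csharp
List<Tensor<float>> variations = audioLDM2.GenerateVariations(
    inputAudio: inputAudio,
    numVariations: 4,
    variationStrength: 0.3,   // low strength keeps variations close to the input
    seed: 7);
```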
GetParameters()
Gets the parameters that can be optimized.
public override Vector<T> GetParameters()
Returns
- Vector<T>
InterpolateAudio(Tensor<T>, Tensor<T>, int)
Interpolates between two audio samples in latent space.
public virtual List<Tensor<T>> InterpolateAudio(Tensor<T> audio1, Tensor<T> audio2, int numSteps = 5)
Parameters
audio1 (Tensor<T>): First audio sample.
audio2 (Tensor<T>): Second audio sample.
numSteps (int): Number of interpolation steps.
Returns
- List<Tensor<T>>
List of interpolated audio tensors.
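A sketch of latent-space interpolation between two clips (pianoClip and stringsClip are hypothetical [1, samples] waveform tensors):

```csharp
var morphs = audioLDM2.InterpolateAudio(
    audio1: pianoClip,
    audio2: stringsClip,
    numSteps: 5);
// morphs holds 5 waveforms blending gradually from audio1 toward audio2.
```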
SetParameters(Vector<T>)
Sets the model parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): The parameter vector to set.
Remarks
This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.
Exceptions
- ArgumentException
Thrown when the length of parameters does not match ParameterCount.
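A minimal sketch of the get/modify/set cycle an optimizer would run (the in-place update is elided):

```csharp
Vector<float> parameters = audioLDM2.GetParameters();
// ... apply an update to the vector (e.g., a gradient step) ...
audioLDM2.SetParameters(parameters);   // length must equal ParameterCount
```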
TransformAudio(Tensor<T>, string, string?, double, int, double, int?)
Transforms audio based on a text prompt (audio-to-audio).
public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform [batch, samples].
prompt (string): Text description for transformation.
negativePrompt (string?): Optional negative prompt.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed audio waveform tensor.
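A sketch of audio-to-audio transformation (inputAudio is an existing [batch, samples] waveform; higher strength values transform more aggressively):

```csharp
var transformed = audioLDM2.TransformAudio(
    inputAudio: inputAudio,
    prompt: "Lo-fi version with vinyl crackle and muffled drums",
    strength: 0.5,              // 0.0-1.0; how far to move from the input
    numInferenceSteps: 150,
    guidanceScale: 4.0);
```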