Class AudioLDMModel<T>
Audio Latent Diffusion Model (AudioLDM) for text-to-audio generation.
public class AudioLDMModel<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
  - AudioDiffusionModelBase<T>
  - AudioLDMModel<T>
- Implements
- Inherited Members
- Extension Methods
Examples
// Create an AudioLDM model
var audioLDM = new AudioLDMModel<float>();
// Generate sound effects
var dogBark = audioLDM.GenerateFromText(
prompt: "A dog barking excitedly",
durationSeconds: 5.0,
numInferenceSteps: 100,
guidanceScale: 3.5);
// Generate music
var music = audioLDM.GenerateMusic(
prompt: "Soft jazz piano with light drums",
durationSeconds: 10.0,
numInferenceSteps: 200,
guidanceScale: 4.0);
// Save as audio file
SaveWav(dogBark, "dog_bark.wav", sampleRate: 16000);
Remarks
AudioLDM is a latent diffusion model specifically designed for audio generation. It works by generating mel spectrograms in latent space and then converting them to audio using a vocoder (like HiFi-GAN).
For Beginners: AudioLDM lets you create sounds and music from text descriptions:
Example prompts:
- "A dog barking in a park" -> generates dog barking sounds
- "Rain falling on a window" -> generates rain sounds
- "Jazz piano playing softly" -> generates jazz piano music
How it works:
- Text -> CLAP encoder -> text embedding (understands audio concepts)
- Text embedding guides diffusion in latent space
- Latent -> AudioVAE decoder -> mel spectrogram
- Mel spectrogram -> Vocoder -> audio waveform
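In code, the four stages above all run inside a single generation call. The sketch below is illustrative: only the constructor and GenerateAudio are documented APIs on this page; the intermediate stages are handled internally.

```csharp
// Illustrative sketch of the AudioLDM generation pipeline.
var model = new AudioLDMModel<float>();

// Internally, GenerateAudio performs:
//   1. Text -> CLAP encoder -> text embedding
//   2. Embedding-guided denoising in latent space (U-Net)
//   3. Latent -> AudioVAE decoder -> mel spectrogram
//   4. Mel spectrogram -> vocoder (e.g. HiFi-GAN) -> waveform
Tensor<float> waveform = model.GenerateAudio(
    prompt: "Rain falling on a window",
    numInferenceSteps: 100,
    guidanceScale: 3.5);
```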
Key features:
- Text-to-audio: Generate sounds from descriptions
- Audio-to-audio: Transform sounds while preserving some characteristics
- Variable duration: Generate audio of different lengths
- Classifier-free guidance: Control how closely to follow the prompt
Technical specifications:
- Sample rate: 16 kHz (standard for speech/effects) or 48 kHz (music)
- Latent channels: 8
- Mel channels: 64 (AudioLDM) or 128 (AudioLDM 2)
- Duration: typically 10 seconds, but configurable
- Guidance scale: 2.5-5.0 typical
Constructors
AudioLDMModel()
Initializes a new AudioLDM model with default parameters.
public AudioLDMModel()
AudioLDMModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, int, double, int, bool, int?)
Initializes a new AudioLDM model with custom parameters.
public AudioLDMModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? conditioner = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64, bool isVersion2 = false, int? seed = null)
Parameters
options (DiffusionModelOptions<T>?): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>?): Optional custom noise scheduler.
unet (UNetNoisePredictor<T>?): Optional custom U-Net noise predictor.
audioVAE (AudioVAE<T>?): Optional custom AudioVAE.
conditioner (IConditioningModule<T>?): Optional CLAP conditioning module.
sampleRate (int): Audio sample rate in Hz.
defaultDurationSeconds (double): Default audio duration in seconds.
melChannels (int): Number of mel spectrogram channels.
isVersion2 (bool): Whether to use the AudioLDM 2 configuration.
seed (int?): Optional random seed for reproducible output.
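A hedged example of constructing an AudioLDM 2 instance with the documented parameters; the specific values besides the stated defaults are illustrative:

```csharp
// Sketch: an AudioLDM 2 configuration with a fixed seed.
var audioLDM2 = new AudioLDMModel<float>(
    sampleRate: 16000,          // documented default
    defaultDurationSeconds: 10, // documented default
    melChannels: 128,           // AudioLDM 2 uses 128 mel channels
    isVersion2: true,
    seed: 42);                  // fixed seed for reproducible output
```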
Fields
AUDIOLDM_LATENT_CHANNELS
Standard AudioLDM latent channels.
public const int AUDIOLDM_LATENT_CHANNELS = 8
Field Value
- int
AUDIOLDM_MEL_CHANNELS
Standard AudioLDM mel channels.
public const int AUDIOLDM_MEL_CHANNELS = 64
Field Value
- int
AUDIOLDM_SAMPLE_RATE
Standard AudioLDM sample rate.
public const int AUDIOLDM_SAMPLE_RATE = 16000
Field Value
- int
Properties
AudioVAE
Gets the AudioVAE used for encoding/decoding.
public AudioVAE<T> AudioVAE { get; }
Property Value
- AudioVAE<T>
Conditioner
Gets the conditioning module (optional, for conditioned generation).
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>?
IsVersion2
Gets whether this is AudioLDM version 2.
public bool IsVersion2 { get; }
Property Value
- bool
LatentChannels
Gets the number of latent channels.
public override int LatentChannels { get; }
Property Value
- int
Remarks
Typically 4 for Stable Diffusion image models; AudioLDM uses 8 (see AUDIOLDM_LATENT_CHANNELS).
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
Remarks
This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
public override bool SupportsAudioToAudio { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public override bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public override bool SupportsTextToMusic { get; }
Property Value
- bool
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
public override bool SupportsTextToSpeech { get; }
Property Value
- bool
VAE
Gets the VAE model used for encoding and decoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
Clone()
Creates a deep copy of the model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
A new instance with the same parameters.
DeepCopy()
Creates a deep copy of this object.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
GenerateAudio(string, string?, double?, int, double, int?)
Generates audio from a text prompt.
public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string?): Optional negative prompt describing what to avoid.
durationSeconds (double?): Duration of audio to generate, in seconds.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed for reproducible output.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
For Beginners: This generates audio matching your text description:
Prompt tips:
- Be descriptive: "A loud thunderstorm with heavy rain" vs "thunder"
- Include context: "A dog barking in a quiet park"
- Specify style for music: "Upbeat electronic dance music with synth bass"
Guidance scale effects:
- Lower (2.0-3.0): More variety, may not match prompt exactly
- Medium (3.0-4.0): Good balance of quality and prompt following
- Higher (4.0-6.0): Closely follows prompt, may reduce quality
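The guidance-scale trade-off above can be demonstrated by generating the same prompt at two scales. This is a hedged sketch; holding the seed fixed isolates the effect of guidance:

```csharp
// Sketch: same prompt and seed, two guidance scales.
var model = new AudioLDMModel<float>();

// Lower guidance: more variety, looser prompt adherence.
var loose = model.GenerateAudio(
    prompt: "A loud thunderstorm with heavy rain",
    negativePrompt: "music, speech",
    guidanceScale: 2.5,
    seed: 0);

// Higher guidance: follows the prompt closely, may reduce quality.
var strict = model.GenerateAudio(
    prompt: "A loud thunderstorm with heavy rain",
    negativePrompt: "music, speech",
    guidanceScale: 5.0,
    seed: 0);
```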
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text prompt.
public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string?): Optional negative prompt describing what to avoid.
durationSeconds (double?): Duration of music to generate, in seconds.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed for reproducible output.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
Remarks
Music generation uses the same underlying model but with prompts focused on musical content. For best results:
- Specify genre: "jazz", "electronic", "classical"
- Mention instruments: "piano", "guitar", "synthesizer"
- Describe mood: "upbeat", "melancholic", "energetic"
- Include tempo hints: "slow ballad", "fast dance beat"
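A hedged sketch combining the four prompt tips above (genre, instruments, mood, tempo) in one call; the prompt text itself is illustrative:

```csharp
// Sketch: a music prompt using genre, instruments, mood, and tempo hints.
var model = new AudioLDMModel<float>();
var track = model.GenerateMusic(
    prompt: "Melancholic classical piano, slow ballad, soft strings",
    negativePrompt: "drums, vocals",
    durationSeconds: 10.0,
    numInferenceSteps: 200,
    guidanceScale: 4.0);
```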
GenerateVariations(Tensor<T>, int, double, int?)
Generates audio variations from an input audio.
public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform [batch, samples].
numVariations (int): Number of variations to generate.
variationStrength (double): How much to vary the input (0.0-1.0).
seed (int?): Optional random seed for reproducible output.
Returns
- List<Tensor<T>>
List of audio variation tensors.
Remarks
For Beginners: This creates multiple variations of your audio:
Use cases:
- Sound design: Generate similar but unique sounds
- Music production: Create instrument variations
- Audio augmentation: Expand training data
Each variation will be similar to the input but with random differences.
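A hedged usage sketch; `inputAudio` is assumed to be a [batch, samples] waveform loaded elsewhere, since audio loading is outside this API's scope:

```csharp
// Sketch: four subtle variations of an existing clip.
var model = new AudioLDMModel<float>();
List<Tensor<float>> variations = model.GenerateVariations(
    inputAudio,
    numVariations: 4,
    variationStrength: 0.3,  // low strength keeps the original character
    seed: 123);
```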
GetParameters()
Gets the parameters that can be optimized.
public override Vector<T> GetParameters()
Returns
- Vector<T>
SetParameters(Vector<T>)
Sets the model parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): The parameter vector to set.
Remarks
This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.
Exceptions
- ArgumentException: Thrown when the length of parameters does not match ParameterCount.
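A hedged sketch of the typical round-trip with GetParameters and SetParameters, e.g. when driving the model from an external optimizer:

```csharp
// Sketch: round-tripping parameters for external optimization.
var model = new AudioLDMModel<float>();
Vector<float> parameters = model.GetParameters();

// ... an optimizer updates `parameters` here ...

// The vector length must equal model.ParameterCount,
// otherwise SetParameters throws ArgumentException.
model.SetParameters(parameters);
```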
TransformAudio(Tensor<T>, string, string?, double, int, double, int?)
Transforms audio based on a text prompt (audio-to-audio).
public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)
Parameters
inputAudio (Tensor<T>): Input audio waveform [batch, samples].
prompt (string): Text description for the transformation.
negativePrompt (string?): Optional negative prompt describing what to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed for reproducible output.
Returns
- Tensor<T>
Transformed audio waveform tensor.
Remarks
For Beginners: This transforms existing audio based on your description:
Examples:
- Input: speech, Prompt: "whispered voice" -> quieter, intimate version
- Input: guitar, Prompt: "electric guitar with distortion" -> adds effects
- Input: ambient, Prompt: "add rain sounds" -> mixes in rain
Strength controls how much to change:
- Low (0.2-0.4): Subtle changes, preserves original character
- Medium (0.4-0.6): Noticeable changes while keeping structure
- High (0.6-0.8): Major changes, may alter original significantly
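A hedged usage sketch at medium strength; `speechClip` is an assumed [batch, samples] waveform loaded elsewhere:

```csharp
// Sketch: transforming speech into a whispered version.
var model = new AudioLDMModel<float>();
var whispered = model.TransformAudio(
    speechClip,
    prompt: "whispered voice",
    strength: 0.5,          // noticeable change, structure preserved
    numInferenceSteps: 100,
    guidanceScale: 3.5);
```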