Class AudioLDMModel<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

Audio Latent Diffusion Model (AudioLDM) for text-to-audio generation.

public class AudioLDMModel<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioDiffusionModelBase<T>
AudioLDMModel<T>
Implements
ILatentDiffusionModel<T>
IAudioDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create an AudioLDM model
var audioLDM = new AudioLDMModel<float>();

// Generate a sound effect
var dogBark = audioLDM.GenerateAudio(
    prompt: "A dog barking excitedly",
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.5);

// Generate music
var music = audioLDM.GenerateMusic(
    prompt: "Soft jazz piano with light drums",
    durationSeconds: 10.0,
    numInferenceSteps: 200,
    guidanceScale: 4.0);

// Save as a WAV file (SaveWav is assumed to be a user-supplied helper)
SaveWav(dogBark, "dog_bark.wav", sampleRate: 16000);

Remarks

AudioLDM is a latent diffusion model specifically designed for audio generation. It works by generating mel spectrograms in latent space and then converting them to audio using a vocoder (like HiFi-GAN).

For Beginners: AudioLDM lets you create sounds and music from text descriptions:

Example prompts:

  • "A dog barking in a park" -> generates dog barking sounds
  • "Rain falling on a window" -> generates rain sounds
  • "Jazz piano playing softly" -> generates jazz piano music

How it works:

  1. Text -> CLAP encoder -> text embedding (understands audio concepts)
  2. Text embedding guides diffusion in latent space
  3. Latent -> AudioVAE decoder -> mel spectrogram
  4. Mel spectrogram -> Vocoder -> audio waveform

Key features:

  • Text-to-audio: Generate sounds from descriptions
  • Audio-to-audio: Transform sounds while preserving some characteristics
  • Variable duration: Generate audio of different lengths
  • Classifier-free guidance: Control how closely to follow the prompt

Technical specifications:

  • Sample rate: 16 kHz (standard for speech/effects) or 48 kHz (music)
  • Latent channels: 8
  • Mel channels: 64 (AudioLDM) or 128 (AudioLDM 2)
  • Duration: typically 10 seconds, but configurable
  • Guidance scale: 2.5-5.0 typical

Constructors

AudioLDMModel()

Initializes a new AudioLDM model with default parameters.

public AudioLDMModel()

AudioLDMModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, int, double, int, bool, int?)

Initializes a new AudioLDM model with custom parameters.

public AudioLDMModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? audioVAE = null, IConditioningModule<T>? conditioner = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64, bool isVersion2 = false, int? seed = null)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

unet UNetNoisePredictor<T>

Optional custom U-Net noise predictor.

audioVAE AudioVAE<T>

Optional custom AudioVAE.

conditioner IConditioningModule<T>

Optional CLAP conditioning module.

sampleRate int

Audio sample rate in Hz.

defaultDurationSeconds double

Default audio duration.

melChannels int

Number of mel spectrogram channels.

isVersion2 bool

Whether to use AudioLDM 2 configuration.

seed int?

Optional random seed.
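
For example, here is a minimal sketch of constructing an AudioLDM 2 style configuration; the values below are illustrative, not recommended defaults:

// Construct an AudioLDM 2 configuration with a fixed seed.
var audioLDM2 = new AudioLDMModel<float>(
    sampleRate: 16000,
    defaultDurationSeconds: 10,
    melChannels: 128,    // AudioLDM 2 uses 128 mel channels
    isVersion2: true,
    seed: 42);           // fixed seed for reproducible generation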

Fields

AUDIOLDM_LATENT_CHANNELS

Standard AudioLDM latent channels.

public const int AUDIOLDM_LATENT_CHANNELS = 8

Field Value

int

AUDIOLDM_MEL_CHANNELS

Standard AudioLDM mel channels.

public const int AUDIOLDM_MEL_CHANNELS = 64

Field Value

int

AUDIOLDM_SAMPLE_RATE

Standard AudioLDM sample rate.

public const int AUDIOLDM_SAMPLE_RATE = 16000

Field Value

int

Properties

AudioVAE

Gets the AudioVAE used for encoding/decoding.

public AudioVAE<T> AudioVAE { get; }

Property Value

AudioVAE<T>

Conditioner

Gets the conditioning module (optional, for conditioned generation).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

IsVersion2

Gets whether this is AudioLDM version 2.

public bool IsVersion2 { get; }

Property Value

bool

LatentChannels

Gets the number of latent channels.

public override int LatentChannels { get; }

Property Value

int

Remarks

Typically 4 for Stable Diffusion models; AudioLDM uses 8 (see AUDIOLDM_LATENT_CHANNELS).

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

public override bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public override bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public override bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

public override bool SupportsTextToSpeech { get; }

Property Value

bool

VAE

Gets the VAE model used for encoding and decoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Creates a deep copy of the model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

A new instance with the same parameters.

DeepCopy()

Creates a deep copy of this object.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A deep copy of this model.

GenerateAudio(string, string?, double?, int, double, int?)

Generates audio from a text prompt.

public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of audio to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

For Beginners: This generates audio matching your text description:

Prompt tips:

  • Be descriptive: "A loud thunderstorm with heavy rain" vs "thunder"
  • Include context: "A dog barking in a quiet park"
  • Specify style for music: "Upbeat electronic dance music with synth bass"

Guidance scale effects:

  • Lower (2.0-3.0): More variety, may not match prompt exactly
  • Medium (3.0-4.0): Good balance of quality and prompt following
  • Higher (4.0-6.0): Closely follows prompt, may reduce quality
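
A short sketch tying these tips together; the prompt, negative prompt, and seed below are illustrative:

// Generate a thunderstorm sound, steering away from unwanted content.
var storm = audioLDM.GenerateAudio(
    prompt: "A loud thunderstorm with heavy rain",
    negativePrompt: "music, speech",   // discourage these characteristics
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.5,                // medium: balanced quality and adherence
    seed: 42);                         // same seed reproduces the same output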

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text prompt.

public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 4, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt.

durationSeconds double?

Duration of music to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [1, samples].

Remarks

Music generation uses the same underlying model but with prompts focused on musical content. For best results:

  • Specify genre: "jazz", "electronic", "classical"
  • Mention instruments: "piano", "guitar", "synthesizer"
  • Describe mood: "upbeat", "melancholic", "energetic"
  • Include tempo hints: "slow ballad", "fast dance beat"
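
A sketch of a prompt following these tips (the prompt text is illustrative):

// Genre, instrument, mood, and tempo hints packed into one prompt.
var track = audioLDM.GenerateMusic(
    prompt: "Melancholic classical piano, slow ballad tempo",
    negativePrompt: "drums, vocals",
    durationSeconds: 10.0,
    numInferenceSteps: 200,
    guidanceScale: 4.0);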

GenerateVariations(Tensor<T>, int, double, int?)

Generates audio variations from an input audio.

public virtual List<Tensor<T>> GenerateVariations(Tensor<T> inputAudio, int numVariations = 4, double variationStrength = 0.3, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform [batch, samples].

numVariations int

Number of variations to generate.

variationStrength double

How much to vary (0.0-1.0).

seed int?

Optional random seed.

Returns

List<Tensor<T>>

List of audio variation tensors.

Remarks

For Beginners: This creates multiple variations of your audio:

Use cases:

  • Sound design: Generate similar but unique sounds
  • Music production: Create instrument variations
  • Audio augmentation: Expand training data

Each variation will be similar to the input but with random differences.
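
A minimal sketch, assuming originalClip is a waveform tensor of shape [batch, samples] you already have:

// Produce four subtle variations of an existing clip for sound design.
List<Tensor<float>> variations = audioLDM.GenerateVariations(
    inputAudio: originalClip,
    numVariations: 4,
    variationStrength: 0.3,   // low strength preserves the original character
    seed: 123);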

GetParameters()

Gets the parameters that can be optimized.

public override Vector<T> GetParameters()

Returns

Vector<T>

The model's current trainable parameters as a flat vector.

SetParameters(Vector<T>)

Sets the model parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to set.

Remarks

This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.

Exceptions

ArgumentException

Thrown when the length of parameters does not match ParameterCount.
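
A minimal sketch of such an optimization round trip; ApplyUpdate is a hypothetical update routine, not part of the library:

// Read the current parameters, compute an update, and write them back.
Vector<float> parameters = audioLDM.GetParameters();
Vector<float> updated = ApplyUpdate(parameters);  // hypothetical optimizer step
audioLDM.SetParameters(updated);                  // length must equal ParameterCount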

TransformAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms audio based on a text prompt (audio-to-audio).

public virtual Tensor<T> TransformAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3.5, int? seed = null)

Parameters

inputAudio Tensor<T>

Input audio waveform [batch, samples].

prompt string

Text description for transformation.

negativePrompt string

Optional negative prompt.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform tensor.

Remarks

For Beginners: This transforms existing audio based on your description:

Examples:

  • Input: speech, Prompt: "whispered voice" -> quieter, intimate version
  • Input: guitar, Prompt: "electric guitar with distortion" -> adds effects
  • Input: ambient, Prompt: "add rain sounds" -> mixes in rain

Strength controls how much to change:

  • Low (0.2-0.4): Subtle changes, preserves original character
  • Medium (0.4-0.6): Noticeable changes while keeping structure
  • High (0.6-0.8): Major changes, may alter original significantly
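
A short sketch at medium strength, assuming inputGuitar is a waveform tensor of shape [batch, samples]:

// Add distortion character to an existing guitar recording.
var distorted = audioLDM.TransformAudio(
    inputAudio: inputGuitar,
    prompt: "electric guitar with heavy distortion",
    strength: 0.5,             // medium: noticeable change, keeps structure
    numInferenceSteps: 100,
    guidanceScale: 3.5);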