Class AudioLDMModel<T>
AudioLDM (Audio Latent Diffusion Model) for generating audio from text descriptions.
public class AudioLDMModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance: AudioNeuralNetworkBase<T> → AudioLDMModel<T>
Remarks
AudioLDM is a latent diffusion model that generates audio by learning to reverse a diffusion process in a compressed latent space. It uses CLAP (Contrastive Language-Audio Pretraining) for text conditioning and a VAE for efficient latent space learning.
Architecture components:
- CLAP Encoder: Contrastive text encoder that aligns text with audio features
- VAE: Variational autoencoder that compresses mel spectrograms to latent space
- U-Net Denoiser: Predicts noise to be removed at each diffusion step
- HiFi-GAN Vocoder: Converts mel spectrograms to audio waveforms
For Beginners: AudioLDM creates realistic audio from your descriptions:
How it works:
- You describe the sound you want ("a cat meowing")
- CLAP encodes your text into an audio-aligned representation
- The diffusion process generates a latent audio representation
- The VAE decoder converts latents to mel spectrogram
- HiFi-GAN vocoder converts the spectrogram to audio
Key features:
- General audio and music generation
- Environmental sounds, speech, music
- Controllable through text prompts
- High-quality 16kHz or 48kHz output
Usage:
var model = new AudioLDMModel<float>(options);
var audio = model.GenerateAudio("A dog barking in a park");
Reference: "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Liu et al., 2023
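The five steps above map onto a single GenerateAudio call. A minimal sketch, assuming an already-configured architecture (the default AudioLDMOptions is used as-is; its individual properties are not documented on this page):

```csharp
// Sketch of the full pipeline: text -> CLAP embedding -> latent diffusion
// -> VAE decode to mel spectrogram -> HiFi-GAN vocoder -> waveform.
var model = new AudioLDMModel<float>(architecture, new AudioLDMOptions());

Tensor<float> audio = model.GenerateAudio(
    prompt: "a cat meowing",
    durationSeconds: 5,
    numInferenceSteps: 100,
    guidanceScale: 3);
```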
Constructors
AudioLDMModel(NeuralNetworkArchitecture<T>, AudioLDMOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates an AudioLDM model using native layers for training from scratch.
public AudioLDMModel(NeuralNetworkArchitecture<T> architecture, AudioLDMOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- options (AudioLDMOptions?): AudioLDM configuration options.
- tokenizer (ITokenizer?): Optional tokenizer. If null, creates a CLAP-compatible tokenizer.
- optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optional optimizer. Defaults to AdamW.
- lossFunction (ILossFunction<T>?): Optional loss function. Defaults to MSE.
Remarks
For Beginners: Use this constructor when:
- Training AudioLDM from scratch (requires significant data)
- Fine-tuning on custom audio types
- Research and experimentation
For most use cases, load pretrained ONNX models instead.
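A minimal from-scratch construction might look like the sketch below; every argument except architecture can be omitted to take the documented defaults (CLAP-compatible tokenizer, AdamW, MSE):

```csharp
// Minimal setup: null options/tokenizer/optimizer/lossFunction fall back
// to the documented defaults (CLAP tokenizer, AdamW optimizer, MSE loss).
var model = new AudioLDMModel<float>(architecture);

// Or override individual pieces, e.g. a custom loss function:
var customModel = new AudioLDMModel<float>(
    architecture,
    options: null,
    tokenizer: null,
    optimizer: null,
    lossFunction: myLossFunction); // myLossFunction: an ILossFunction<float> you supply
```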
AudioLDMModel(NeuralNetworkArchitecture<T>, string, string, string, string, ITokenizer, AudioLDMOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates an AudioLDM model using pretrained ONNX models for inference.
public AudioLDMModel(NeuralNetworkArchitecture<T> architecture, string clapEncoderPath, string vaePath, string unetPath, string vocoderPath, ITokenizer tokenizer, AudioLDMOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- clapEncoderPath (string): Path to the CLAP text encoder ONNX model.
- vaePath (string): Path to the VAE ONNX model.
- unetPath (string): Path to the U-Net denoiser ONNX model.
- vocoderPath (string): Path to the HiFi-GAN vocoder ONNX model.
- tokenizer (ITokenizer): CLAP tokenizer for text processing (required).
- options (AudioLDMOptions?): AudioLDM configuration options.
- optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optional optimizer for fine-tuning.
- lossFunction (ILossFunction<T>?): Optional loss function.
Exceptions
- ArgumentException
Thrown when required paths are empty.
- FileNotFoundException
Thrown when model files don't exist.
- ArgumentNullException
Thrown when tokenizer is null.
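Loading pretrained components might look like this sketch; the file paths are placeholders, and the tokenizer must be supplied up front because passing null throws ArgumentNullException:

```csharp
// Paths below are placeholders. All four ONNX files must exist on disk
// (otherwise FileNotFoundException), and the tokenizer must be non-null.
var model = new AudioLDMModel<float>(
    architecture,
    clapEncoderPath: "models/clap_text_encoder.onnx",
    vaePath: "models/vae.onnx",
    unetPath: "models/unet.onnx",
    vocoderPath: "models/hifigan.onnx",
    tokenizer: clapTokenizer); // clapTokenizer: an ITokenizer created beforehand
```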
Properties
MaxDurationSeconds
Gets the maximum duration of audio that can be generated.
public double MaxDurationSeconds { get; }
Property Value
- double
SampleRate
Gets the sample rate of generated audio.
public int SampleRate { get; }
Property Value
- int
SupportsAudioContinuation
Gets whether this model supports audio continuation.
public bool SupportsAudioContinuation { get; }
Property Value
- bool
SupportsAudioInpainting
Gets whether this model supports audio inpainting.
public bool SupportsAudioInpainting { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public bool SupportsTextToMusic { get; }
Property Value
- bool
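The capability flags and limits above can gate calls before generation. A small sketch (the mono [samples] output layout assumed in the last line is not confirmed by this page):

```csharp
// Check capabilities before calling the corresponding method.
if (model.SupportsTextToMusic)
{
    // Clamp the requested duration to what the model can produce.
    double duration = Math.Min(10.0, model.MaxDurationSeconds);
    var music = model.GenerateMusic("calm piano melody", durationSeconds: duration);

    // Expected sample count, assuming a mono [samples] waveform layout.
    int expectedSamples = (int)(duration * model.SampleRate);
}
```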
Methods
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues existing audio by extending it.
public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
Parameters
- inputAudio (Tensor<T>)
- prompt (string?)
- extensionSeconds (double)
- numInferenceSteps (int)
- seed (int?)
Returns
- Tensor<T>
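A sketch of extending an existing clip by five seconds; the layout of inputAudio is assumed to match what the model's preprocessing expects:

```csharp
// Extend an existing waveform by 5 seconds, steering the continuation
// with an optional prompt and fixing the seed for reproducibility.
Tensor<float> extended = model.ContinueAudio(
    inputAudio: existingAudio, // existingAudio: a Tensor<float> waveform you already have
    prompt: "the melody continues and fades out",
    extensionSeconds: 5,
    numInferenceSteps: 100,
    seed: 42);
```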
CreateNewInstance()
Creates a new instance for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
- reader (BinaryReader)
Dispose(bool)
Disposes of model resources.
protected override void Dispose(bool disposing)
Parameters
- disposing (bool)
GenerateAudio(string, string?, double, int, double, int?)
Generates audio from a text description.
public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string)
- negativePrompt (string?)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
Returns
- Tensor<T>
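A typical call, using a negative prompt to steer away from unwanted content and a fixed seed for reproducibility; guidanceScale controls classifier-free guidance strength (higher values follow the prompt more closely, potentially at some cost in naturalness):

```csharp
// guidanceScale trades prompt adherence against naturalness;
// seed makes the otherwise stochastic diffusion sampling repeatable.
Tensor<float> audio = model.GenerateAudio(
    prompt: "a dog barking in a park",
    negativePrompt: "music, speech",
    durationSeconds: 5,
    numInferenceSteps: 100,
    guidanceScale: 3,
    seed: 1234);
```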
GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)
Generates audio asynchronously.
public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)
Parameters
- prompt (string)
- negativePrompt (string?)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
- cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
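Because diffusion sampling runs numInferenceSteps denoising passes, generation can be slow; the async variant keeps callers responsive and supports cancellation. A sketch:

```csharp
// Cancel generation if it exceeds a 2-minute budget.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
try
{
    Tensor<float> audio = await model.GenerateAudioAsync(
        "ocean waves at night",
        durationSeconds: 10,
        cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
    // The time budget elapsed and generation was cancelled.
}
```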
GenerateMusic(string, string?, double, int, double, int?)
Generates music from a text description.
public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string)
- negativePrompt (string?)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
Returns
- Tensor<T>
GetDefaultOptions()
Gets default generation options.
public AudioGenerationOptions<T> GetDefaultOptions()
Returns
- AudioGenerationOptions<T>
GetModelMetadata()
Gets model metadata.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the neural network layers following the golden standard pattern.
protected override void InitializeLayers()
InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)
Fills in missing or masked sections of audio.
public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)
Parameters
- audio (Tensor<T>)
- mask (Tensor<T>)
- prompt (string?)
- numInferenceSteps (int)
- seed (int?)
Returns
- Tensor<T>
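Inpainting regenerates only the regions selected by mask. A sketch; the mask convention (1 = regenerate, 0 = keep) and the Tensor<T> constructor/indexer usage shown are assumptions to check against the implementation:

```csharp
// Build a mask over the damaged span. The 1 = regenerate / 0 = keep
// convention is assumed, not confirmed by this page.
var mask = new Tensor<float>(audio.Shape); // zero-initialized: keep everything by default
for (int i = damagedStart; i < damagedEnd; i++)
    mask[i] = 1f;                          // regenerate only this span

Tensor<float> repaired = model.InpaintAudio(
    audio, mask,
    prompt: "smooth ambient background", // optional guidance for the filled region
    numInferenceSteps: 100);
```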
PostprocessOutput(Tensor<T>)
Postprocesses model output.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
- modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
- input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
- rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
- writer (BinaryWriter)
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
- input (Tensor<T>)
- expectedOutput (Tensor<T>)
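A minimal training-loop sketch; what input and expectedOutput hold (e.g. noisy latents and target noise for the diffusion MSE objective) is an assumption not specified on this page:

```csharp
// Minimal loop. What the tensors contain (e.g. noisy latents vs. target
// noise for the diffusion objective) is an assumption, not documented here.
foreach (var (input, target) in trainingBatches) // IEnumerable<(Tensor<float>, Tensor<float>)>
{
    model.Train(input, target); // forward pass, loss, backprop, optimizer update
}
```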
UpdateParameters(Vector<T>)
Updates model parameters.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parametersVector<T>