Class AudioDiffusionModelBase<T>
Base class for audio diffusion models that generate sound and music.
public abstract class AudioDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
LatentDiffusionModelBase<T>
AudioDiffusionModelBase<T>
Remarks
This abstract base class provides common functionality for all audio diffusion models, including text-to-audio generation, text-to-music, text-to-speech, and audio transformation.
For Beginners: This is the foundation for audio generation models like AudioLDM. It extends latent diffusion to work with audio by converting sound to spectrograms (visual representations of sound) and back.
How audio diffusion works:
1. Audio is converted to a mel spectrogram (a frequency-vs-time image)
2. The spectrogram is encoded to latent space (like images)
3. Diffusion denoising happens in latent space
4. The result is decoded to a spectrogram
5. A vocoder converts the spectrogram back to audio
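The two ends of this pipeline are exposed directly by this class. A minimal sketch tracing the steps, assuming `model` is an instance of some concrete subclass and `inputWaveform` an existing [batch, samples] tensor (both hypothetical here):

```csharp
// Tracing the five steps with members documented on this page.
// 'model' is a concrete AudioDiffusionModelBase<float> subclass instance and
// 'inputWaveform' an existing [batch, samples] tensor; both are assumed.

// Step 1: waveform -> mel spectrogram.
Tensor<float> mel = model.WaveformToMelSpectrogram(inputWaveform);

// Steps 2-4 (encode to latent space, denoise, decode back to a spectrogram)
// run inside the generation methods such as GenerateFromText and AudioToAudio.

// Step 5: vocoder, converting the mel spectrogram back to a waveform.
Tensor<float> audio = model.MelSpectrogramToWaveform(mel);
```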
Constructors
AudioDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, double, int)
Initializes a new instance of the AudioDiffusionModelBase class.
protected AudioDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler.
sampleRate (int): Audio sample rate in Hz.
defaultDurationSeconds (double): Default audio duration, in seconds.
melChannels (int): Number of mel spectrogram channels.
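Because the class is abstract, it must be subclassed before use. A minimal hypothetical sketch; note that abstract members inherited from LatentDiffusionModelBase<T> (not listed on this page) would also need implementations in real code:

```csharp
// Hypothetical minimal subclass. Only the abstract capability flags documented
// on this page are overridden; inherited abstract members from
// LatentDiffusionModelBase<T> would also need overriding in practice.
public class SimpleTextToAudioModel : AudioDiffusionModelBase<float>
{
    public SimpleTextToAudioModel()
        : base(options: null, scheduler: null,
               sampleRate: 16000, defaultDurationSeconds: 10, melChannels: 64)
    {
    }

    public override bool SupportsTextToAudio => true;
    public override bool SupportsTextToMusic => false;
    public override bool SupportsTextToSpeech => false;
    public override bool SupportsAudioToAudio => false;
}
```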
Properties
DefaultDurationSeconds
Gets the default duration of generated audio in seconds.
public virtual double DefaultDurationSeconds { get; }
Property Value
- double
FFTSize
Gets the FFT window size.
public virtual int FFTSize { get; protected set; }
Property Value
- int
HopLength
Gets the hop length for spectrogram computation.
public virtual int HopLength { get; protected set; }
Property Value
- int
Remarks
Hop length is the number of audio samples between successive frames. Lower values = higher time resolution but more computation. Typical values: 256, 512, 1024.
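For intuition, a quick calculation of the resulting time resolution, using the constructor's default sample rate and an assumed typical hop length:

```csharp
int sampleRate = 16000;  // constructor default, in Hz
int hopLength = 256;     // assumed typical value

double secondsPerFrame = (double)hopLength / sampleRate;  // 0.016 s per frame
int framesIn10Seconds = 10 * sampleRate / hopLength;      // 625 frames
```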
MaxFrequency
Gets the maximum frequency for mel filterbank.
public virtual double MaxFrequency { get; protected set; }
Property Value
- double
MelChannels
Gets the number of mel spectrogram channels used.
public virtual int MelChannels { get; }
Property Value
- int
Remarks
Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.
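The mel scale itself is standard background rather than part of this API. A sketch of the common HTK-style conversion, with MinFrequency and MaxFrequency defining the band edges:

```csharp
using System;

// Background sketch (not part of this API): the HTK-style mel scale.
// Mel bin edges are spaced evenly in mel space between MinFrequency and
// MaxFrequency, giving finer resolution at low frequencies.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);
```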
MinFrequency
Gets the minimum frequency for mel filterbank.
public virtual double MinFrequency { get; protected set; }
Property Value
- double
SampleRate
Gets the sample rate of generated audio.
public virtual int SampleRate { get; }
Property Value
- int
Remarks
Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates = better quality but more computation.
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
public abstract bool SupportsAudioToAudio { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public abstract bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public abstract bool SupportsTextToMusic { get; }
Property Value
- bool
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
public abstract bool SupportsTextToSpeech { get; }
Property Value
- bool
Methods
AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)
Transforms existing audio based on a text prompt.
public virtual Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
inputAudio (Tensor<T>): The input audio waveform.
prompt (string): Text description of the transformation.
negativePrompt (string): What to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed audio waveform.
Remarks
For Beginners: This changes existing audio:
- "Make it sound like it's underwater"
- "Add reverb like a large hall"
- "Change the voice to sound younger"
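A hedged usage sketch; `model` and `dryAudio` are assumed to exist:

```csharp
// Hypothetical usage: apply a reverb-like transformation to existing audio.
// 'model' is a concrete AudioDiffusionModelBase<float> subclass instance and
// 'dryAudio' an existing waveform tensor; both are assumed.
Tensor<float> wetAudio = model.AudioToAudio(
    inputAudio: dryAudio,
    prompt: "add reverb like a large hall",
    negativePrompt: "distortion",
    strength: 0.5,          // 0.0 keeps the input as-is, 1.0 nearly ignores it
    numInferenceSteps: 100,
    guidanceScale: 3,
    seed: 42);              // fixed seed for reproducibility
```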
CombineTextAndSpeakerEmbeddings(Tensor<T>, Tensor<T>)
Combines text embedding with speaker embedding for TTS.
protected virtual Tensor<T> CombineTextAndSpeakerEmbeddings(Tensor<T> textEmbedding, Tensor<T> speakerEmbedding)
Parameters
textEmbedding (Tensor<T>)
speakerEmbedding (Tensor<T>)
Returns
- Tensor<T>
ConcatenateAudio(Tensor<T>, Tensor<T>)
Concatenates two audio waveforms.
protected virtual Tensor<T> ConcatenateAudio(Tensor<T> a, Tensor<T> b)
Parameters
a (Tensor<T>)
b (Tensor<T>)
Returns
- Tensor<T>
ConcatenateLatents(Tensor<T>, Tensor<T>)
Concatenates two latent tensors along the time dimension.
protected virtual Tensor<T> ConcatenateLatents(Tensor<T> a, Tensor<T> b)
Parameters
a (Tensor<T>)
b (Tensor<T>)
Returns
- Tensor<T>
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues/extends audio from a given clip.
public virtual Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
Parameters
inputAudio (Tensor<T>): The audio to continue from.
prompt (string): Optional text guidance for continuation.
extensionSeconds (double): How many seconds to add.
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Extended audio waveform (original + continuation).
Remarks
For Beginners: This extends audio by generating new material that follows on naturally:
- Input: 5 seconds of a song
- Output: the original 5 seconds plus 10 more that fit naturally
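A hedged usage sketch matching the example above; `model` and `clip` are assumed:

```csharp
// Hypothetical usage: extend a 5-second clip by 10 seconds of matching audio.
Tensor<float> extended = model.ContinueAudio(
    inputAudio: clip,            // existing waveform tensor (assumed)
    prompt: "continue the melody",
    extensionSeconds: 10,
    numInferenceSteps: 100,
    seed: 7);
// 'extended' holds the original clip followed by the generated continuation.
```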
EstimateSpeechDuration(string, double)
Estimates speech duration based on text length and speaking rate.
protected virtual double EstimateSpeechDuration(string text, double speakingRate)
Parameters
text (string)
speakingRate (double)
Returns
- double
ExtractLatentContext(Tensor<T>)
Extracts context from the end of latent representation for continuation.
protected virtual Tensor<T> ExtractLatentContext(Tensor<T> latents)
Parameters
latents (Tensor<T>)
Returns
- Tensor<T>
ExtractSpeakerEmbedding(Tensor<T>)
Gets speaker embeddings from a reference audio clip (for voice cloning).
public virtual Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
referenceAudio (Tensor<T>): Reference audio waveform.
Returns
- Tensor<T>
Speaker embedding tensor.
GenerateFromText(string, string?, double?, int, double, int?)
Generates audio from a text description.
public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string): What to avoid in the audio.
durationSeconds (double?): Length of audio to generate.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.
Remarks
For Beginners: This creates sound from a description:
- prompt: "A thunderstorm with rain" → thunder and rain sounds
- prompt: "Acoustic guitar strumming" → guitar music
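A hedged usage sketch; `model` is assumed to be a concrete subclass instance:

```csharp
// Hypothetical usage: generate 10 seconds of ambient sound from text.
Tensor<float> audio = model.GenerateFromText(
    prompt: "a thunderstorm with rain",
    negativePrompt: "music, speech",
    durationSeconds: 10,         // null falls back to DefaultDurationSeconds
    numInferenceSteps: 100,
    guidanceScale: 3,            // higher = follows the prompt more literally
    seed: 1234);
```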
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text description.
public virtual Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string): What to avoid in the music.
durationSeconds (double?): Length of music to generate.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
MelSpectrogramToWaveform(Tensor<T>)
Converts mel spectrogram back to audio waveform.
public virtual Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)
Parameters
melSpectrogram (Tensor<T>): Mel spectrogram [batch, channels, melBins, timeFrames].
Returns
- Tensor<T>
Audio waveform [batch, samples].
PredictNoiseWithContext(Tensor<T>, int, Tensor<T>, Tensor<T>?)
Predicts noise with context from previous audio.
protected virtual Tensor<T> PredictNoiseWithContext(Tensor<T> latents, int timestep, Tensor<T> context, Tensor<T>? promptEmbedding)
Parameters
latents (Tensor<T>)
timestep (int)
context (Tensor<T>)
promptEmbedding (Tensor<T>)
Returns
- Tensor<T>
TextToSpeech(string, Tensor<T>?, double, int, int?)
Synthesizes speech from text (text-to-speech).
public virtual Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)
Parameters
text (string): The text to speak.
speakerEmbedding (Tensor<T>): Optional speaker embedding for voice cloning.
speakingRate (double): Speed multiplier (1.0 = normal).
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
For Beginners: This makes the computer "read" text out loud:
- Input: "Hello, how are you today?"
- Output: audio of someone saying those words
- speakerEmbedding: makes it sound like a specific person
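A hedged sketch combining this method with ExtractSpeakerEmbedding for voice cloning; `model` and `referenceClip` are assumed:

```csharp
// Hypothetical usage: clone a voice from a reference clip, then synthesize.
// 'referenceClip' is an existing waveform tensor of the target speaker.
Tensor<float> speaker = model.ExtractSpeakerEmbedding(referenceClip);

Tensor<float> speech = model.TextToSpeech(
    text: "Hello, how are you today?",
    speakerEmbedding: speaker,   // pass null for the model's default voice
    speakingRate: 1.0,           // 1.0 = normal speed
    numInferenceSteps: 50,
    seed: null);
```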
WaveformToMelSpectrogram(Tensor<T>)
Converts audio waveform to mel spectrogram.
public virtual Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)
Parameters
waveform (Tensor<T>): Audio waveform [batch, samples].
Returns
- Tensor<T>
Mel spectrogram [batch, channels, melBins, timeFrames].
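This method and MelSpectrogramToWaveform are approximate inverses; a round trip is lossy because the mel projection discards phase and fine frequency detail. A hedged sketch, with `model` and `waveform` assumed:

```csharp
// Round trip: waveform -> mel spectrogram -> waveform (lossy reconstruction).
// 'waveform' is an existing [batch, samples] tensor; 'model' is assumed.
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);   // [batch, channels, melBins, timeFrames]
Tensor<float> rebuilt = model.MelSpectrogramToWaveform(mel);    // [batch, samples]
// 'rebuilt' approximates 'waveform'; exact samples differ because phase
// information is not preserved by the mel representation.
```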