Class AudioDiffusionModelBase<T>

Namespace
AiDotNet.Diffusion
Assembly
AiDotNet.dll

Base class for audio diffusion models that generate sound and music.

public abstract class AudioDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
LatentDiffusionModelBase<T>
AudioDiffusionModelBase<T>
Implements
ILatentDiffusionModel<T>
IAudioDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

This abstract base class provides common functionality for all audio diffusion models, including text-to-audio, text-to-music, and text-to-speech generation, as well as audio-to-audio transformation.

For Beginners: This is the foundation for audio generation models like AudioLDM. It extends latent diffusion to work with audio by converting sound to spectrograms (visual representations of sound) and back.

How audio diffusion works:

1. Audio is converted to a mel spectrogram (a frequency-vs-time image).
2. The spectrogram is encoded to latent space (like images).
3. Diffusion denoising happens in latent space.
4. The result is decoded to a spectrogram.
5. A vocoder converts the spectrogram back to audio.
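
The sketch below shows how these steps surface in the public API. It assumes a hypothetical concrete subclass (here called MyAudioModel), since AudioDiffusionModelBase<T> itself is abstract; loading and playing audio are out of scope.

using AiDotNet.Diffusion;

// Hypothetical concrete subclass; the abstract base cannot be instantiated directly.
AudioDiffusionModelBase<float> model = new MyAudioModel();

// Steps 1-5 are wrapped by a single call for text-to-audio generation:
Tensor<float> waveform = model.GenerateFromText(
    prompt: "A thunderstorm with rain",
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 42);

// Steps 1 and 5 are also exposed directly for custom pipelines:
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);  // waveform -> spectrogram
Tensor<float> audio = model.MelSpectrogramToWaveform(mel);     // spectrogram -> waveform (vocoder)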

Constructors

AudioDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, double, int)

Initializes a new instance of the AudioDiffusionModelBase class.

protected AudioDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

sampleRate int

Audio sample rate in Hz.

defaultDurationSeconds double

Default duration of generated audio, in seconds.

melChannels int

Number of mel spectrogram channels.
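
Because the constructor is protected, it is called from a derived type. A minimal sketch under that assumption (the subclass name is hypothetical, and any further abstract members inherited from the base hierarchy are omitted):

public class MyMusicModel : AudioDiffusionModelBase<float>
{
    public MyMusicModel()
        : base(options: null,              // null presumably selects default options
               scheduler: null,            // null presumably selects the default scheduler
               sampleRate: 22050,          // a common choice for music
               defaultDurationSeconds: 10,
               melChannels: 80)
    {
    }

    // The capability flags are abstract and must be supplied by the subclass:
    public override bool SupportsTextToAudio => true;
    public override bool SupportsTextToMusic => true;
    public override bool SupportsTextToSpeech => false;
    public override bool SupportsAudioToAudio => false;
}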

Properties

DefaultDurationSeconds

Gets the default duration of generated audio in seconds.

public virtual double DefaultDurationSeconds { get; }

Property Value

double

FFTSize

Gets the FFT window size.

public virtual int FFTSize { get; protected set; }

Property Value

int

HopLength

Gets the hop length for spectrogram computation.

public virtual int HopLength { get; protected set; }

Property Value

int

Remarks

Hop length is the number of audio samples between successive frames. Lower values = higher time resolution but more computation. Typical values: 256, 512, 1024.
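
For example, the number of time frames in the spectrogram is roughly the sample count divided by the hop length; a sketch of the arithmetic:

int sampleRate = 16000;                      // samples per second
int hopLength = 256;                         // samples between successive frames
int totalSamples = sampleRate * 10;          // 10 s of audio = 160,000 samples
int timeFrames = totalSamples / hopLength;   // 160,000 / 256 = 625 frames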

MaxFrequency

Gets the maximum frequency for mel filterbank.

public virtual double MaxFrequency { get; protected set; }

Property Value

double

MelChannels

Gets the number of mel spectrogram channels used.

public virtual int MelChannels { get; }

Property Value

int

Remarks

Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.
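
The bands are spaced on the mel scale, which compresses high frequencies the way human hearing does. A sketch using the common HTK-style formula (these helper functions are illustrative, not part of this API):

// HTK-style conversions between frequency (Hz) and the mel scale.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

// A filterbank with MelChannels bands spaces the band centers evenly in mel
// between MinFrequency and MaxFrequency, then maps them back to Hz.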

MinFrequency

Gets the minimum frequency for mel filterbank.

public virtual double MinFrequency { get; protected set; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

public virtual int SampleRate { get; }

Property Value

int

Remarks

Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates = better quality but more computation.

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

public abstract bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public abstract bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public abstract bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

public abstract bool SupportsTextToSpeech { get; }

Property Value

bool

Methods

AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms existing audio based on a text prompt.

public virtual Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

inputAudio Tensor<T>

The input audio waveform.

prompt string

Text description of the transformation.

negativePrompt string

What to avoid.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform.

Remarks

For Beginners: This changes existing audio:

- "Make it sound like it's underwater"
- "Add reverb like a large hall"
- "Change the voice to sound younger"
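
A usage sketch, assuming model is an instance of a concrete subclass and inputAudio is a waveform tensor loaded elsewhere:

Tensor<float> transformed = model.AudioToAudio(
    inputAudio: inputAudio,
    prompt: "Add reverb like a large hall",
    negativePrompt: "distortion, clipping",
    strength: 0.5,               // in [0, 1]; higher = stronger transformation
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 42);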

CombineTextAndSpeakerEmbeddings(Tensor<T>, Tensor<T>)

Combines text embedding with speaker embedding for TTS.

protected virtual Tensor<T> CombineTextAndSpeakerEmbeddings(Tensor<T> textEmbedding, Tensor<T> speakerEmbedding)

Parameters

textEmbedding Tensor<T>
speakerEmbedding Tensor<T>

Returns

Tensor<T>

ConcatenateAudio(Tensor<T>, Tensor<T>)

Concatenates two audio waveforms.

protected virtual Tensor<T> ConcatenateAudio(Tensor<T> a, Tensor<T> b)

Parameters

a Tensor<T>
b Tensor<T>

Returns

Tensor<T>

ConcatenateLatents(Tensor<T>, Tensor<T>)

Concatenates two latent tensors along the time dimension.

protected virtual Tensor<T> ConcatenateLatents(Tensor<T> a, Tensor<T> b)

Parameters

a Tensor<T>
b Tensor<T>

Returns

Tensor<T>

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues/extends audio from a given clip.

public virtual Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Extended audio waveform (original + continuation).

Remarks

For Beginners: This extends audio by generating more that follows:

- Input: 5 seconds of a song
- Output: The original 5 seconds plus 10 more seconds that fit naturally
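
A usage sketch, assuming model is a concrete subclass instance and clip holds the waveform to extend:

Tensor<float> extended = model.ContinueAudio(
    inputAudio: clip,
    prompt: "continue the melody in the same style",  // optional guidance
    extensionSeconds: 10.0,
    numInferenceSteps: 100,
    seed: 42);
// The result is the original clip followed by the generated continuation.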

EstimateSpeechDuration(string, double)

Estimates speech duration based on text length and speaking rate.

protected virtual double EstimateSpeechDuration(string text, double speakingRate)

Parameters

text string
speakingRate double

Returns

double

ExtractLatentContext(Tensor<T>)

Extracts context from the end of latent representation for continuation.

protected virtual Tensor<T> ExtractLatentContext(Tensor<T> latents)

Parameters

latents Tensor<T>

Returns

Tensor<T>

ExtractSpeakerEmbedding(Tensor<T>)

Gets speaker embeddings from a reference audio clip (for voice cloning).

public virtual Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio waveform.

Returns

Tensor<T>

Speaker embedding tensor.

GenerateFromText(string, string?, double?, int, double, int?)

Generates audio from a text description.

public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the audio.

durationSeconds double?

Length of audio to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.

Remarks

For Beginners: This creates sound from a description:

- prompt: "A thunderstorm with rain" → thunder and rain sounds
- prompt: "Acoustic guitar strumming" → guitar music
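
A usage sketch; passing null for durationSeconds presumably falls back to DefaultDurationSeconds:

Tensor<float> audio = model.GenerateFromText(
    prompt: "Acoustic guitar strumming",
    negativePrompt: "noise, static",
    durationSeconds: null,       // presumably uses DefaultDurationSeconds
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 1234);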

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text description.

public virtual Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the music.

durationSeconds double?

Length of music to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
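
A usage sketch; availability depends on the model's SupportsTextToMusic flag:

if (model.SupportsTextToMusic)
{
    Tensor<float> music = model.GenerateMusic(
        prompt: "Upbeat jazz piano trio with a swing rhythm",
        negativePrompt: "vocals, low quality",
        durationSeconds: 30.0,
        numInferenceSteps: 100,
        guidanceScale: 3.0,
        seed: 7);
}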

MelSpectrogramToWaveform(Tensor<T>)

Converts mel spectrogram back to audio waveform.

public virtual Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)

Parameters

melSpectrogram Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].

Returns

Tensor<T>

Audio waveform [batch, samples].

PredictNoiseWithContext(Tensor<T>, int, Tensor<T>, Tensor<T>?)

Predicts noise with context from previous audio.

protected virtual Tensor<T> PredictNoiseWithContext(Tensor<T> latents, int timestep, Tensor<T> context, Tensor<T>? promptEmbedding)

Parameters

latents Tensor<T>
timestep int
context Tensor<T>
promptEmbedding Tensor<T>

Returns

Tensor<T>

TextToSpeech(string, Tensor<T>?, double, int, int?)

Synthesizes speech from text (text-to-speech).

public virtual Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)

Parameters

text string

The text to speak.

speakerEmbedding Tensor<T>

Optional speaker embedding for voice cloning.

speakingRate double

Speed multiplier (1.0 = normal).

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

For Beginners: This makes the computer "read" text out loud:

- Input: "Hello, how are you today?"
- Output: Audio of someone saying those words
- speakerEmbedding: Makes it sound like a specific person
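
A usage sketch combining ExtractSpeakerEmbedding with TextToSpeech for voice cloning; referenceAudio is a short waveform of the target voice, loaded elsewhere:

// Derive a speaker embedding from reference audio (voice cloning).
Tensor<float> speaker = model.ExtractSpeakerEmbedding(referenceAudio);

Tensor<float> speech = model.TextToSpeech(
    text: "Hello, how are you today?",
    speakerEmbedding: speaker,   // null presumably selects a default voice
    speakingRate: 1.0,           // 1.0 = normal speed
    numInferenceSteps: 50,
    seed: 42);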

WaveformToMelSpectrogram(Tensor<T>)

Converts audio waveform to mel spectrogram.

public virtual Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)

Parameters

waveform Tensor<T>

Audio waveform [batch, samples].

Returns

Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].
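
A round-trip sketch of the two conversion methods; because mel spectrograms discard phase, the vocoder-based reconstruction approximates rather than bit-matches the input:

// waveform: [batch, samples] audio tensor, loaded elsewhere.
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);
// mel has shape [batch, channels, melBins, timeFrames].

Tensor<float> reconstructed = model.MelSpectrogramToWaveform(mel);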