Class AudioDiffusionModelBase<T>

Namespace
AiDotNet.Diffusion
Assembly
AiDotNet.dll

Base class for audio diffusion models that generate sound and music.

public abstract class AudioDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
LatentDiffusionModelBase<T>
AudioDiffusionModelBase<T>
Implements
ILatentDiffusionModel<T>
IAudioDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

This abstract base class provides common functionality for all audio diffusion models, including text-to-audio, text-to-music, and text-to-speech generation, as well as audio-to-audio transformation.

For Beginners: This is the foundation for audio generation models like AudioLDM. It extends latent diffusion to work with audio by converting sound to spectrograms (visual representations of sound) and back.

How audio diffusion works:

1. Audio is converted to a mel spectrogram (a frequency-vs-time image).
2. The spectrogram is encoded to latent space (like images).
3. Diffusion denoising happens in latent space.
4. The result is decoded to a spectrogram.
5. A vocoder converts the spectrogram back to audio.
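
The sketch below shows how these steps surface in the public API. It assumes a hypothetical concrete subclass (here called MyAudioModel), since AudioDiffusionModelBase<T> itself is abstract; loading and playing audio are out of scope.

using AiDotNet.Diffusion;

// Hypothetical concrete subclass; the abstract base cannot be instantiated directly.
AudioDiffusionModelBase<float> model = new MyAudioModel();

// Steps 1-5 are wrapped by a single call for text-to-audio generation:
Tensor<float> waveform = model.GenerateFromText(
    prompt: "A thunderstorm with rain",
    durationSeconds: 5.0,
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 42);

// Steps 1 and 5 are also exposed directly for custom pipelines:
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);  // waveform -> spectrogram
Tensor<float> audio = model.MelSpectrogramToWaveform(mel);     // spectrogram -> waveform (vocoder)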

Constructors

AudioDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, double, int)

Initializes a new instance of the AudioDiffusionModelBase class.

protected AudioDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int sampleRate = 16000, double defaultDurationSeconds = 10, int melChannels = 64)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

sampleRate int

Audio sample rate in Hz.

defaultDurationSeconds double

Default duration of generated audio, in seconds.

melChannels int

Number of mel spectrogram channels.
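
Because the constructor is protected, it is called from a derived type. A minimal sketch under that assumption (the subclass name is hypothetical, and any further abstract members inherited from the base hierarchy are omitted):

public class MyMusicModel : AudioDiffusionModelBase<float>
{
    public MyMusicModel()
        : base(options: null,              // null presumably selects default options
               scheduler: null,            // null presumably selects the default scheduler
               sampleRate: 22050,          // a common choice for music
               defaultDurationSeconds: 10,
               melChannels: 80)
    {
    }

    // The capability flags are abstract and must be supplied by the subclass:
    public override bool SupportsTextToAudio => true;
    public override bool SupportsTextToMusic => true;
    public override bool SupportsTextToSpeech => false;
    public override bool SupportsAudioToAudio => false;
}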

Properties

DefaultDurationSeconds

Gets the default duration of generated audio in seconds.

public virtual double DefaultDurationSeconds { get; }

Property Value

double

FFTSize

Gets the FFT window size.

public virtual int FFTSize { get; protected set; }

Property Value

int

HopLength

Gets the hop length for spectrogram computation.

public virtual int HopLength { get; protected set; }

Property Value

int

Remarks

Hop length is the number of audio samples between successive frames. Lower values = higher time resolution but more computation. Typical values: 256, 512, 1024.
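
For example, the number of time frames in the spectrogram is roughly the sample count divided by the hop length; a sketch of the arithmetic:

int sampleRate = 16000;                      // samples per second
int hopLength = 256;                         // samples between successive frames
int totalSamples = sampleRate * 10;          // 10 s of audio = 160,000 samples
int timeFrames = totalSamples / hopLength;   // 160,000 / 256 = 625 frames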

MaxFrequency

Gets the maximum frequency for mel filterbank.

public virtual double MaxFrequency { get; protected set; }

Property Value

double

MelChannels

Gets the number of mel spectrogram channels used.

public virtual int MelChannels { get; }

Property Value

int

Remarks

Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.
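
The bands are spaced on the mel scale, which compresses high frequencies the way human hearing does. A sketch using the common HTK-style formula (these helper functions are illustrative, not part of this API):

// HTK-style conversions between frequency (Hz) and the mel scale.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

// A filterbank with MelChannels bands spaces the band centers evenly in mel
// between MinFrequency and MaxFrequency, then maps them back to Hz.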

MinFrequency

Gets the minimum frequency for mel filterbank.

public virtual double MinFrequency { get; protected set; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

public virtual int SampleRate { get; }

Property Value

int

Remarks

Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates = better quality but more computation.

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

public abstract bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public abstract bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public abstract bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

public abstract bool SupportsTextToSpeech { get; }

Property Value

bool

Methods

AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms existing audio based on a text prompt.

public virtual Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

inputAudio Tensor<T>

The input audio waveform.

prompt string

Text description of the transformation.

negativePrompt string

What to avoid.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform.

Remarks

For Beginners: This changes existing audio:

- "Make it sound like it's underwater"
- "Add reverb like a large hall"
- "Change the voice to sound younger"
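
A usage sketch, assuming model is an instance of a concrete subclass and inputAudio is a waveform tensor loaded elsewhere:

Tensor<float> transformed = model.AudioToAudio(
    inputAudio: inputAudio,
    prompt: "Add reverb like a large hall",
    negativePrompt: "distortion, clipping",
    strength: 0.5,               // in [0, 1]; higher = stronger transformation
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 42);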

CombineTextAndSpeakerEmbeddings(Tensor<T>, Tensor<T>)

Combines text embedding with speaker embedding for TTS.

protected virtual Tensor<T> CombineTextAndSpeakerEmbeddings(Tensor<T> textEmbedding, Tensor<T> speakerEmbedding)

Parameters

textEmbedding Tensor<T>
speakerEmbedding Tensor<T>

Returns

Tensor<T>

ConcatenateAudio(Tensor<T>, Tensor<T>)

Concatenates two audio waveforms.

protected virtual Tensor<T> ConcatenateAudio(Tensor<T> a, Tensor<T> b)

Parameters

a Tensor<T>
b Tensor<T>

Returns

Tensor<T>

ConcatenateLatents(Tensor<T>, Tensor<T>)

Concatenates two latent tensors along the time dimension.

protected virtual Tensor<T> ConcatenateLatents(Tensor<T> a, Tensor<T> b)

Parameters

a Tensor<T>
b Tensor<T>

Returns

Tensor<T>

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues/extends audio from a given clip.

public virtual Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Extended audio waveform (original + continuation).

Remarks

For Beginners: This extends audio by generating more that follows:

- Input: 5 seconds of a song
- Output: The original 5 seconds plus 10 more seconds that fit naturally
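
A usage sketch, assuming model is a concrete subclass instance and clip holds the waveform to extend:

Tensor<float> extended = model.ContinueAudio(
    inputAudio: clip,
    prompt: "continue the melody in the same style",  // optional guidance
    extensionSeconds: 10.0,
    numInferenceSteps: 100,
    seed: 42);
// The result is the original clip followed by the generated continuation.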

EstimateSpeechDuration(string, double)

Estimates speech duration based on text length and speaking rate.

protected virtual double EstimateSpeechDuration(string text, double speakingRate)

Parameters

text string
speakingRate double

Returns

double

ExtractLatentContext(Tensor<T>)

Extracts context from the end of latent representation for continuation.

protected virtual Tensor<T> ExtractLatentContext(Tensor<T> latents)

Parameters

latents Tensor<T>

Returns

Tensor<T>

ExtractSpeakerEmbedding(Tensor<T>)

Gets speaker embeddings from a reference audio clip (for voice cloning).

public virtual Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio waveform.

Returns

Tensor<T>

Speaker embedding tensor.

GenerateFromText(string, string?, double?, int, double, int?)

Generates audio from a text description.

public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the audio.

durationSeconds double?

Length of audio to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.

Remarks

For Beginners: This creates sound from a description:

- prompt: "A thunderstorm with rain" → thunder and rain sounds
- prompt: "Acoustic guitar strumming" → guitar music
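
A usage sketch; passing null for durationSeconds presumably falls back to DefaultDurationSeconds:

Tensor<float> audio = model.GenerateFromText(
    prompt: "Acoustic guitar strumming",
    negativePrompt: "noise, static",
    durationSeconds: null,       // presumably uses DefaultDurationSeconds
    numInferenceSteps: 100,
    guidanceScale: 3.0,
    seed: 1234);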

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text description.

public virtual Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the music.

durationSeconds double?

Length of music to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
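
A usage sketch; availability depends on the model's SupportsTextToMusic flag:

if (model.SupportsTextToMusic)
{
    Tensor<float> music = model.GenerateMusic(
        prompt: "Upbeat jazz piano trio with a swing rhythm",
        negativePrompt: "vocals, low quality",
        durationSeconds: 30.0,
        numInferenceSteps: 100,
        guidanceScale: 3.0,
        seed: 7);
}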

MelSpectrogramToWaveform(Tensor<T>)

Converts mel spectrogram back to audio waveform.

public virtual Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)

Parameters

melSpectrogram Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].

Returns

Tensor<T>

Audio waveform [batch, samples].

PredictNoiseWithContext(Tensor<T>, int, Tensor<T>, Tensor<T>?)

Predicts noise with context from previous audio.

protected virtual Tensor<T> PredictNoiseWithContext(Tensor<T> latents, int timestep, Tensor<T> context, Tensor<T>? promptEmbedding)

Parameters

latents Tensor<T>
timestep int
context Tensor<T>
promptEmbedding Tensor<T>

Returns

Tensor<T>

TextToSpeech(string, Tensor<T>?, double, int, int?)

Synthesizes speech from text (text-to-speech).

public virtual Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)

Parameters

text string

The text to speak.

speakerEmbedding Tensor<T>

Optional speaker embedding for voice cloning.

speakingRate double

Speed multiplier (1.0 = normal).

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

For Beginners: This makes the computer "read" text out loud:

- Input: "Hello, how are you today?"
- Output: Audio of someone saying those words
- speakerEmbedding: Makes it sound like a specific person
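
A usage sketch combining ExtractSpeakerEmbedding with TextToSpeech for voice cloning; referenceAudio is a short waveform of the target voice, loaded elsewhere:

// Derive a speaker embedding from reference audio (voice cloning).
Tensor<float> speaker = model.ExtractSpeakerEmbedding(referenceAudio);

Tensor<float> speech = model.TextToSpeech(
    text: "Hello, how are you today?",
    speakerEmbedding: speaker,   // null presumably selects a default voice
    speakingRate: 1.0,           // 1.0 = normal speed
    numInferenceSteps: 50,
    seed: 42);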

WaveformToMelSpectrogram(Tensor<T>)

Converts audio waveform to mel spectrogram.

public virtual Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)

Parameters

waveform Tensor<T>

Audio waveform [batch, samples].

Returns

Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].
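
A round-trip sketch of the two conversion methods; because mel spectrograms discard phase, the vocoder-based reconstruction approximates rather than bit-matches the input:

// waveform: [batch, samples] audio tensor, loaded elsewhere.
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);
// mel has shape [batch, channels, melBins, timeFrames].

Tensor<float> reconstructed = model.MelSpectrogramToWaveform(mel);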