Interface IAudioDiffusionModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for audio diffusion models that generate sound and music.

public interface IAudioDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Audio diffusion models apply diffusion processes to generate audio content, including music, speech, sound effects, and more. They typically operate on audio spectrograms or mel-spectrograms in latent space.

For Beginners: Audio diffusion models work similarly to image diffusion, but instead of generating pictures, they create sounds.

How audio diffusion works:

  1. Audio is converted to a spectrogram (a visual representation of sound)
  2. Diffusion happens on this spectrogram (just like image diffusion)
  3. The spectrogram is converted back to audio

Types of audio generation:

  • Text-to-Audio: "A dog barking in a park" → audio clip
  • Text-to-Music: "Upbeat jazz piano" → music track
  • Text-to-Speech: Text → spoken voice
  • Audio-to-Audio: Transform existing audio (voice conversion, style transfer)

Key challenges:

  • Temporal coherence (sounds must flow naturally)
  • Frequency relationships (harmonics, rhythm)
  • Long-range dependencies (verse-chorus structure in music)

This interface extends IDiffusionModel<T> with audio-specific operations.
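
For example, a caller can probe the capability flags and audio configuration before choosing a generation method. The sketch below is illustrative rather than part of the library; it uses only members documented on this page, and the class name AudioModelProbe is hypothetical.

  using System;
  using AiDotNet.Interfaces;

  public static class AudioModelProbe
  {
      // Report what this model can do and how it is configured.
      public static void Describe(IAudioDiffusionModel<float> model)
      {
          Console.WriteLine($"Sample rate:      {model.SampleRate} Hz");
          Console.WriteLine($"Mel channels:     {model.MelChannels}");
          Console.WriteLine($"Default duration: {model.DefaultDurationSeconds} s");
          Console.WriteLine($"Text-to-audio:    {model.SupportsTextToAudio}");
          Console.WriteLine($"Text-to-music:    {model.SupportsTextToMusic}");
          Console.WriteLine($"Text-to-speech:   {model.SupportsTextToSpeech}");
          Console.WriteLine($"Audio-to-audio:   {model.SupportsAudioToAudio}");
      }
  }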

Properties

DefaultDurationSeconds

Gets the default duration of generated audio in seconds.

double DefaultDurationSeconds { get; }

Property Value

double

MelChannels

Gets the number of mel spectrogram channels used.

int MelChannels { get; }

Property Value

int

Remarks

Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.

SampleRate

Gets the sample rate of generated audio.

int SampleRate { get; }

Property Value

int

Remarks

Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates give better quality but require more computation.
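
For example, clip length in samples follows directly from the sample rate (illustrative arithmetic; model is any instance of this interface):

  // given: IAudioDiffusionModel<float> model
  // samples = duration (s) * sample rate (Hz)
  int samplesFor3s = (int)(3.0 * model.SampleRate);  // at 22050 Hz: 66150 samples

  // duration (s) = samples / sample rate (Hz)
  double seconds = 88200.0 / 44100.0;                // 88200 samples at 44100 Hz = 2.0 s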

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

bool SupportsTextToSpeech { get; }

Property Value

bool

Methods

AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms existing audio based on a text prompt.

Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

inputAudio Tensor<T>

The input audio waveform.

prompt string

Text description of the transformation.

negativePrompt string

What to avoid in the transformed audio.

strength double

Transformation strength (0.0-1.0). Higher values deviate further from the input audio.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform.

Remarks

For Beginners: This changes existing audio:

  • "Make it sound like it's underwater"
  • "Add reverb like a large hall"
  • "Change the voice to sound younger"
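
A hedged usage sketch (model and inputAudio are assumed to exist; inputAudio is a waveform tensor shaped [batch, samples], loaded elsewhere):

  // given: IAudioDiffusionModel<float> model, Tensor<float> inputAudio
  if (model.SupportsAudioToAudio)
  {
      Tensor<float> transformed = model.AudioToAudio(
          inputAudio,
          prompt: "make it sound like it's underwater",
          strength: 0.5,           // low values stay close to the input; high values regenerate more
          numInferenceSteps: 100,
          guidanceScale: 3.0,
          seed: 1234);             // fixed seed for a reproducible result
  }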

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues/extends audio from a given clip.

Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Extended audio waveform (original + continuation).

Remarks

For Beginners: This extends audio by generating a continuation that follows naturally:

  • Input: 5 seconds of a song
  • Output: The original 5 seconds plus 10 more seconds that fit naturally
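
A hedged sketch of extending a clip (model and inputAudio assumed; inputAudio is, say, 5 seconds of a song):

  // given: IAudioDiffusionModel<float> model, Tensor<float> inputAudio
  Tensor<float> extended = model.ContinueAudio(
      inputAudio,
      prompt: "keep the same tempo and instruments",
      extensionSeconds: 10.0,
      numInferenceSteps: 100,
      seed: 7);

  // 'extended' holds the original samples followed by the continuation, so its
  // length grows by roughly 10.0 * model.SampleRate samples.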

ExtractSpeakerEmbedding(Tensor<T>)

Gets speaker embeddings from a reference audio clip (for voice cloning).

Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio waveform.

Returns

Tensor<T>

Speaker embedding tensor.
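
A hedged sketch of the voice-cloning flow this method enables together with TextToSpeech(string, Tensor<T>?, double, int, int?) below (model and referenceAudio are assumed):

  // given: IAudioDiffusionModel<float> model, Tensor<float> referenceAudio
  // Derive a speaker embedding from a few seconds of reference audio,
  // then synthesize new text in that voice.
  Tensor<float> speaker = model.ExtractSpeakerEmbedding(referenceAudio);

  Tensor<float> clonedSpeech = model.TextToSpeech(
      "Hello, how are you today?",
      speakerEmbedding: speaker,   // omit (null) to use the model's default voice
      speakingRate: 1.0,
      numInferenceSteps: 50);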

GenerateFromText(string, string?, double?, int, double, int?)

Generates audio from a text description.

Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the audio.

durationSeconds double?

Length of audio to generate, in seconds. If null, the model's default duration is used.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.

Remarks

For Beginners: This creates sound from a description:

  • prompt: "A thunderstorm with rain" → Thunder and rain sounds
  • prompt: "Acoustic guitar strumming" → Guitar music
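
A hedged usage sketch (model is assumed; the prompt mirrors the example above):

  // given: IAudioDiffusionModel<float> model
  Tensor<float> storm = model.GenerateFromText(
      "A thunderstorm with rain",
      negativePrompt: "music, speech",  // steer the output away from unwanted content
      durationSeconds: 8.0,             // null uses the model's default duration
      numInferenceSteps: 100,
      guidanceScale: 3.0,
      seed: 42);                        // fixed seed for a reproducible clip

  // Result shape: [batch, samples], roughly 8.0 * model.SampleRate samples long.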

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text description.

Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the music.

durationSeconds double?

Length of music to generate, in seconds. If null, the model's default duration is used.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
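
A hedged usage sketch (model assumed):

  // given: IAudioDiffusionModel<float> model
  if (model.SupportsTextToMusic)
  {
      Tensor<float> track = model.GenerateMusic(
          "Upbeat jazz piano",
          negativePrompt: "vocals",
          durationSeconds: 30.0,   // a 30-second track
          guidanceScale: 3.0,
          seed: 99);
  }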

MelSpectrogramToWaveform(Tensor<T>)

Converts mel spectrogram back to audio waveform.

Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)

Parameters

melSpectrogram Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].

Returns

Tensor<T>

Audio waveform [batch, samples].

TextToSpeech(string, Tensor<T>?, double, int, int?)

Synthesizes speech from text (text-to-speech).

Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)

Parameters

text string

The text to speak.

speakerEmbedding Tensor<T>

Optional speaker embedding for voice cloning.

speakingRate double

Speed multiplier (1.0 = normal).

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

For Beginners: This makes the computer "read" text out loud:

  • Input: "Hello, how are you today?"
  • Output: Audio of someone saying those words
  • speakerEmbedding: Makes it sound like a specific person
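
A hedged sketch of plain synthesis in the default voice (model assumed; see ExtractSpeakerEmbedding(Tensor<T>) above for the voice-cloning variant):

  // given: IAudioDiffusionModel<float> model
  if (model.SupportsTextToSpeech)
  {
      Tensor<float> speech = model.TextToSpeech(
          "Hello, how are you today?",
          speakingRate: 1.2);      // 1.0 = normal speed; here 20% faster
  }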

WaveformToMelSpectrogram(Tensor<T>)

Converts audio waveform to mel spectrogram.

Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)

Parameters

waveform Tensor<T>

Audio waveform [batch, samples].

Returns

Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].
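
A hedged sketch of the round trip between the two representations (model and waveform assumed). The inversion is typically approximate, since the mel spectrogram does not store phase information:

  // given: IAudioDiffusionModel<float> model, Tensor<float> waveform ([batch, samples])
  Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);
  // mel shape: [batch, channels, melBins, timeFrames]; melBins should match model.MelChannels.

  Tensor<float> reconstructed = model.MelSpectrogramToWaveform(mel);
  // 'reconstructed' approximates the original waveform rather than matching it exactly.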