Interface IAudioDiffusionModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for audio diffusion models that generate sound and music.

public interface IAudioDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Audio diffusion models apply diffusion processes to generate audio content, including music, speech, sound effects, and more. They typically operate on audio spectrograms or mel-spectrograms in latent space.

For Beginners: Audio diffusion models work similarly to image diffusion, but instead of generating pictures, they create sounds.

How audio diffusion works:

  1. Audio is converted to a spectrogram (a visual representation of sound)
  2. Diffusion happens on this spectrogram (just like image diffusion)
  3. The spectrogram is converted back to audio

Types of audio generation:

  • Text-to-Audio: "A dog barking in a park" → audio clip
  • Text-to-Music: "Upbeat jazz piano" → music track
  • Text-to-Speech: Text → spoken voice
  • Audio-to-Audio: Transform existing audio (voice conversion, style transfer)

Key challenges:

  • Temporal coherence (sounds must flow naturally)
  • Frequency relationships (harmonics, rhythm)
  • Long-range dependencies (verse-chorus structure in music)

This interface extends IDiffusionModel<T> with audio-specific operations.
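
For example, a caller can probe the capability flags and audio configuration before choosing a generation method. The sketch below is illustrative rather than part of the library; it uses only members documented on this page, and the class name AudioModelProbe is hypothetical.

  using System;
  using AiDotNet.Interfaces;

  public static class AudioModelProbe
  {
      // Report what this model can do and how it is configured.
      public static void Describe(IAudioDiffusionModel<float> model)
      {
          Console.WriteLine($"Sample rate:      {model.SampleRate} Hz");
          Console.WriteLine($"Mel channels:     {model.MelChannels}");
          Console.WriteLine($"Default duration: {model.DefaultDurationSeconds} s");
          Console.WriteLine($"Text-to-audio:    {model.SupportsTextToAudio}");
          Console.WriteLine($"Text-to-music:    {model.SupportsTextToMusic}");
          Console.WriteLine($"Text-to-speech:   {model.SupportsTextToSpeech}");
          Console.WriteLine($"Audio-to-audio:   {model.SupportsAudioToAudio}");
      }
  }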

Properties

DefaultDurationSeconds

Gets the default duration of generated audio in seconds.

double DefaultDurationSeconds { get; }

Property Value

double

MelChannels

Gets the number of mel spectrogram channels used.

int MelChannels { get; }

Property Value

int

Remarks

Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.

SampleRate

Gets the sample rate of generated audio.

int SampleRate { get; }

Property Value

int

Remarks

Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates give better quality but require more computation.
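
For example, clip length in samples follows directly from the sample rate (illustrative arithmetic; model is any instance of this interface):

  // given: IAudioDiffusionModel<float> model
  // samples = duration (s) * sample rate (Hz)
  int samplesFor3s = (int)(3.0 * model.SampleRate);  // at 22050 Hz: 66150 samples

  // duration (s) = samples / sample rate (Hz)
  double seconds = 88200.0 / 44100.0;                // 88200 samples at 44100 Hz = 2.0 s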

SupportsAudioToAudio

Gets whether this model supports audio-to-audio transformation.

bool SupportsAudioToAudio { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

bool SupportsTextToMusic { get; }

Property Value

bool

SupportsTextToSpeech

Gets whether this model supports text-to-speech generation.

bool SupportsTextToSpeech { get; }

Property Value

bool

Methods

AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)

Transforms existing audio based on a text prompt.

Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

inputAudio Tensor<T>

The input audio waveform.

prompt string

Text description of the transformation.

negativePrompt string

What to avoid in the transformed audio.

strength double

Transformation strength (0.0-1.0). Higher values deviate further from the input audio.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed audio waveform.

Remarks

For Beginners: This changes existing audio:

  • "Make it sound like it's underwater"
  • "Add reverb like a large hall"
  • "Change the voice to sound younger"
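
A hedged usage sketch (model and inputAudio are assumed to exist; inputAudio is a waveform tensor shaped [batch, samples], loaded elsewhere):

  // given: IAudioDiffusionModel<float> model, Tensor<float> inputAudio
  if (model.SupportsAudioToAudio)
  {
      Tensor<float> transformed = model.AudioToAudio(
          inputAudio,
          prompt: "make it sound like it's underwater",
          strength: 0.5,           // low values stay close to the input; high values regenerate more
          numInferenceSteps: 100,
          guidanceScale: 3.0,
          seed: 1234);             // fixed seed for a reproducible result
  }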

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues/extends audio from a given clip.

Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Extended audio waveform (original + continuation).

Remarks

For Beginners: This extends audio by generating a continuation that follows naturally:

  • Input: 5 seconds of a song
  • Output: The original 5 seconds plus 10 more seconds that fit naturally
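
A hedged sketch of extending a clip (model and inputAudio assumed; inputAudio is, say, 5 seconds of a song):

  // given: IAudioDiffusionModel<float> model, Tensor<float> inputAudio
  Tensor<float> extended = model.ContinueAudio(
      inputAudio,
      prompt: "keep the same tempo and instruments",
      extensionSeconds: 10.0,
      numInferenceSteps: 100,
      seed: 7);

  // 'extended' holds the original samples followed by the continuation, so its
  // length grows by roughly 10.0 * model.SampleRate samples.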

ExtractSpeakerEmbedding(Tensor<T>)

Gets speaker embeddings from a reference audio clip (for voice cloning).

Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)

Parameters

referenceAudio Tensor<T>

Reference audio waveform.

Returns

Tensor<T>

Speaker embedding tensor.
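
A hedged sketch of the voice-cloning flow this method enables together with TextToSpeech(string, Tensor<T>?, double, int, int?) below (model and referenceAudio are assumed):

  // given: IAudioDiffusionModel<float> model, Tensor<float> referenceAudio
  // Derive a speaker embedding from a few seconds of reference audio,
  // then synthesize new text in that voice.
  Tensor<float> speaker = model.ExtractSpeakerEmbedding(referenceAudio);

  Tensor<float> clonedSpeech = model.TextToSpeech(
      "Hello, how are you today?",
      speakerEmbedding: speaker,   // omit (null) to use the model's default voice
      speakingRate: 1.0,
      numInferenceSteps: 50);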

GenerateFromText(string, string?, double?, int, double, int?)

Generates audio from a text description.

Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the audio.

durationSeconds double?

Length of audio to generate, in seconds. If null, the model's default duration is used.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.

Remarks

For Beginners: This creates sound from a description:

  • prompt: "A thunderstorm with rain" → Thunder and rain sounds
  • prompt: "Acoustic guitar strumming" → Guitar music
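
A hedged usage sketch (model is assumed; the prompt mirrors the example above):

  // given: IAudioDiffusionModel<float> model
  Tensor<float> storm = model.GenerateFromText(
      "A thunderstorm with rain",
      negativePrompt: "music, speech",  // steer the output away from unwanted content
      durationSeconds: 8.0,             // null uses the model's default duration
      numInferenceSteps: 100,
      guidanceScale: 3.0,
      seed: 42);                        // fixed seed for a reproducible clip

  // Result shape: [batch, samples], roughly 8.0 * model.SampleRate samples long.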

GenerateMusic(string, string?, double?, int, double, int?)

Generates music from a text description.

Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the music.

durationSeconds double?

Length of music to generate, in seconds. If null, the model's default duration is used.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
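
A hedged usage sketch (model assumed):

  // given: IAudioDiffusionModel<float> model
  if (model.SupportsTextToMusic)
  {
      Tensor<float> track = model.GenerateMusic(
          "Upbeat jazz piano",
          negativePrompt: "vocals",
          durationSeconds: 30.0,   // a 30-second track
          guidanceScale: 3.0,
          seed: 99);
  }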

MelSpectrogramToWaveform(Tensor<T>)

Converts mel spectrogram back to audio waveform.

Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)

Parameters

melSpectrogram Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].

Returns

Tensor<T>

Audio waveform [batch, samples].

TextToSpeech(string, Tensor<T>?, double, int, int?)

Synthesizes speech from text (text-to-speech).

Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)

Parameters

text string

The text to speak.

speakerEmbedding Tensor<T>

Optional speaker embedding for voice cloning.

speakingRate double

Speed multiplier (1.0 = normal).

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Audio waveform tensor.

Remarks

For Beginners: This makes the computer "read" text out loud:

  • Input: "Hello, how are you today?"
  • Output: Audio of someone saying those words
  • speakerEmbedding: Makes it sound like a specific person
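
A hedged sketch of plain synthesis in the default voice (model assumed; see ExtractSpeakerEmbedding(Tensor<T>) above for the voice-cloning variant):

  // given: IAudioDiffusionModel<float> model
  if (model.SupportsTextToSpeech)
  {
      Tensor<float> speech = model.TextToSpeech(
          "Hello, how are you today?",
          speakingRate: 1.2);      // 1.0 = normal speed; here 20% faster
  }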

WaveformToMelSpectrogram(Tensor<T>)

Converts audio waveform to mel spectrogram.

Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)

Parameters

waveform Tensor<T>

Audio waveform [batch, samples].

Returns

Tensor<T>

Mel spectrogram [batch, channels, melBins, timeFrames].
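
A hedged sketch of the round trip between the two representations (model and waveform assumed). The inversion is typically approximate, since the mel spectrogram does not store phase information:

  // given: IAudioDiffusionModel<float> model, Tensor<float> waveform ([batch, samples])
  Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);
  // mel shape: [batch, channels, melBins, timeFrames]; melBins should match model.MelChannels.

  Tensor<float> reconstructed = model.MelSpectrogramToWaveform(mel);
  // 'reconstructed' approximates the original waveform rather than matching it exactly.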