Interface IAudioDiffusionModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for audio diffusion models that generate sound and music.
public interface IAudioDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Audio diffusion models apply diffusion processes to generate audio content, including music, speech, sound effects, and more. They typically operate on audio spectrograms or mel-spectrograms in latent space.
For Beginners: Audio diffusion models work much like image diffusion models, but instead of generating pictures, they create sounds.
How audio diffusion works:
- Audio is converted to a spectrogram (visual representation of sound)
- Diffusion happens on this spectrogram (just like image diffusion)
- The spectrogram is converted back to audio
Types of audio generation:
- Text-to-Audio: "A dog barking in a park" → audio clip
- Text-to-Music: "Upbeat jazz piano" → music track
- Text-to-Speech: Text → spoken voice
- Audio-to-Audio: Transform existing audio (voice conversion, style transfer)
Key challenges:
- Temporal coherence (sounds must flow naturally)
- Frequency relationships (harmonics, rhythm)
- Long-range dependencies (verse-chorus structure in music)
This interface extends IDiffusionModel<T> with audio-specific operations.
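Since implementations vary in which generation modes they support, a typical caller checks the capability flags before invoking a mode-specific method. A minimal sketch, assuming some concrete implementation of this interface is passed in:
void GenerateIfSupported(IAudioDiffusionModel<float> model)
{
    // Only call a mode-specific method when the implementation advertises support.
    if (model.SupportsTextToMusic)
    {
        Tensor<float> track = model.GenerateMusic("Upbeat jazz piano");
    }
}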
Properties
DefaultDurationSeconds
Gets the default duration of generated audio in seconds.
double DefaultDurationSeconds { get; }
Property Value
double
MelChannels
Gets the number of mel spectrogram channels used.
int MelChannels { get; }
Property Value
int
Remarks
Mel spectrograms divide the frequency range into perceptual bands. Common values: 64, 80, or 128 mel bins.
SampleRate
Gets the sample rate of generated audio.
int SampleRate { get; }
Property Value
int
Remarks
Common values: 16000 Hz (speech), 22050 Hz (music), 44100 Hz (high quality). Higher sample rates give better quality but require more computation.
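For example, the length of a generated waveform buffer follows directly from this property and DefaultDurationSeconds (a rough sketch; model is any instance of this interface, as in the earlier example):
int totalSamples = (int)(model.SampleRate * model.DefaultDurationSeconds);
// e.g., 22050 Hz * 10.0 s = 220,500 samples per channel.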
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
bool SupportsAudioToAudio { get; }
Property Value
bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
bool SupportsTextToAudio { get; }
Property Value
bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
bool SupportsTextToMusic { get; }
Property Value
bool
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
bool SupportsTextToSpeech { get; }
Property Value
bool
Methods
AudioToAudio(Tensor<T>, string, string?, double, int, double, int?)
Transforms existing audio based on a text prompt.
Tensor<T> AudioToAudio(Tensor<T> inputAudio, string prompt, string? negativePrompt = null, double strength = 0.5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
inputAudio (Tensor<T>): The input audio waveform.
prompt (string): Text description of the transformation.
negativePrompt (string): What to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed audio waveform.
Remarks
For Beginners: This changes existing audio:
- "Make it sound like it's underwater"
- "Add reverb like a large hall"
- "Change the voice to sound younger"
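A minimal usage sketch; inputAudio is assumed to hold a previously loaded waveform tensor:
Tensor<float> transformed = model.AudioToAudio(
    inputAudio,
    prompt: "sounds like it is underwater",
    strength: 0.5,              // 0.0-1.0; lower values stay closer to the original
    numInferenceSteps: 100,
    guidanceScale: 3.0);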
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues/extends audio from a given clip.
Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
Parameters
inputAudio (Tensor<T>): The audio to continue from.
prompt (string): Optional text guidance for the continuation.
extensionSeconds (double): How many seconds to add.
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Extended audio waveform (original + continuation).
Remarks
For Beginners: This extends audio by generating more material that follows it:
- Input: 5 seconds of a song
- Output: the original 5 seconds plus 10 more seconds that fit naturally
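A sketch of that scenario, assuming clip holds the first 5 seconds:
Tensor<float> extended = model.ContinueAudio(
    inputAudio: clip,
    prompt: "keep the same tempo and instrumentation",
    extensionSeconds: 10.0,
    seed: 123);
// extended holds the original clip followed by the generated continuation.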
ExtractSpeakerEmbedding(Tensor<T>)
Gets speaker embeddings from a reference audio clip (for voice cloning).
Tensor<T> ExtractSpeakerEmbedding(Tensor<T> referenceAudio)
Parameters
referenceAudio (Tensor<T>): Reference audio waveform.
Returns
- Tensor<T>
Speaker embedding tensor.
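A sketch of the voice-cloning workflow this enables together with TextToSpeech below (the LoadWaveform helper is hypothetical; use whatever audio loading your application provides):
Tensor<float> reference = LoadWaveform("speaker_sample.wav"); // hypothetical loader
Tensor<float> speaker = model.ExtractSpeakerEmbedding(reference);
Tensor<float> cloned = model.TextToSpeech(
    "Hello, how are you today?",
    speakerEmbedding: speaker);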
GenerateFromText(string, string?, double?, int, double, int?)
Generates audio from a text description.
Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string): What to avoid in the audio.
durationSeconds (double?): Length of audio to generate.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [batch, samples] or [batch, channels, samples] for stereo.
Remarks
For Beginners: This creates sound from a description:
- prompt: "A thunderstorm with rain" → thunder and rain sounds
- prompt: "Acoustic guitar strumming" → guitar music
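A minimal call, relying on the documented defaults for the remaining parameters:
Tensor<float> audio = model.GenerateFromText(
    prompt: "A thunderstorm with rain",
    negativePrompt: "music, speech",
    durationSeconds: 8.0,
    seed: 7);
// Shape: [batch, samples], or [batch, channels, samples] for stereo output.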
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text description.
Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string): What to avoid in the music.
durationSeconds (double?): Length of music to generate.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
Music generation may use specialized models tuned for musical content, with better handling of melody, harmony, and rhythm.
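A usage sketch; all values shown are illustrative:
Tensor<float> track = model.GenerateMusic(
    prompt: "Upbeat jazz piano with brushed drums",
    negativePrompt: "vocals, distortion",
    durationSeconds: 30.0,
    guidanceScale: 3.0,
    seed: 99);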
MelSpectrogramToWaveform(Tensor<T>)
Converts mel spectrogram back to audio waveform.
Tensor<T> MelSpectrogramToWaveform(Tensor<T> melSpectrogram)
Parameters
melSpectrogram (Tensor<T>): Mel spectrogram [batch, channels, melBins, timeFrames].
Returns
- Tensor<T>
Audio waveform [batch, samples].
TextToSpeech(string, Tensor<T>?, double, int, int?)
Synthesizes speech from text (text-to-speech).
Tensor<T> TextToSpeech(string text, Tensor<T>? speakerEmbedding = null, double speakingRate = 1, int numInferenceSteps = 50, int? seed = null)
Parameters
text (string): The text to speak.
speakerEmbedding (Tensor<T>): Optional speaker embedding for voice cloning.
speakingRate (double): Speed multiplier (1.0 = normal).
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
For Beginners: This makes the computer "read" text out loud:
- Input: "Hello, how are you today?"
- Output: audio of someone saying those words
- speakerEmbedding: makes it sound like a specific person
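A minimal call in the default voice; pass a speaker embedding (see ExtractSpeakerEmbedding above) to clone a specific voice:
Tensor<float> speech = model.TextToSpeech(
    "Hello, how are you today?",
    speakingRate: 0.9,          // slightly slower than normal
    numInferenceSteps: 50);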
WaveformToMelSpectrogram(Tensor<T>)
Converts audio waveform to mel spectrogram.
Tensor<T> WaveformToMelSpectrogram(Tensor<T> waveform)
Parameters
waveform (Tensor<T>): Audio waveform [batch, samples].
Returns
- Tensor<T>
Mel spectrogram [batch, channels, melBins, timeFrames].
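The two conversions are approximate inverses. A round trip is lossy because the mel spectrogram stores magnitudes only and discards phase, which must be reconstructed on the way back (a sketch; waveform is assumed to hold a loaded clip):
Tensor<float> mel = model.WaveformToMelSpectrogram(waveform);       // [batch, channels, melBins, timeFrames]
Tensor<float> reconstructed = model.MelSpectrogramToWaveform(mel);  // [batch, samples]
// reconstructed approximates waveform; bit-exact recovery is not expected.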