Interface IAudioGenerator<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for audio generation models that create audio from text descriptions or other conditions.
public interface IAudioGenerator<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
Audio generation models create sounds, music, and audio effects from various inputs. Unlike TTS which focuses on speech, audio generators can produce any type of sound including music, environmental sounds, and sound effects.
For Beginners: Audio generation is like having an artist who can create any sound you describe.
How audio generation works:
- You provide a description ("A dog barking in a park")
- The model generates audio features that match the description
- The features are converted to playable audio
Types of audio generation:
- Text-to-Audio: "Thunder during a storm" creates thunder sounds
- Text-to-Music: "Upbeat jazz piano" creates music
- Audio Inpainting: Fill in missing parts of audio
- Audio Continuation: Extend existing audio naturally
Common use cases:
- Video game sound effects
- Film and media production
- Music composition assistance
- Podcast and content creation
This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.
Properties
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
Remarks
When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.
MaxDurationSeconds
Gets the maximum duration of audio that can be generated in seconds.
double MaxDurationSeconds { get; }
Property Value
SampleRate
Gets the sample rate of generated audio.
int SampleRate { get; }
Property Value
Remarks
Common values: 16000 Hz (low quality), 22050 Hz (medium), 44100 Hz (CD quality).
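Because the sample rate determines how many waveform samples represent one second of audio, it can be used to estimate the size of a generated tensor before calling the model. A minimal sketch, assuming `CreateModel()` stands in for whatever concrete implementation you construct (this interface does not specify one):

```csharp
// `model` is assumed to be some concrete IAudioGenerator<float> implementation;
// CreateModel() is a hypothetical factory, not part of this interface.
IAudioGenerator<float> model = CreateModel();

double durationSeconds = 5.0;
int expectedSamples = (int)(model.SampleRate * durationSeconds);
// At 44100 Hz, a 5-second mono clip holds 220500 samples.
```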
SupportsAudioContinuation
Gets whether this model supports audio continuation.
bool SupportsAudioContinuation { get; }
Property Value
SupportsAudioInpainting
Gets whether this model supports audio inpainting.
bool SupportsAudioInpainting { get; }
Property Value
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
bool SupportsTextToAudio { get; }
Property Value
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
bool SupportsTextToMusic { get; }
Property Value
Methods
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues existing audio to extend it naturally.
Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
Parameters
- inputAudio (Tensor&lt;T&gt;): The audio to continue from.
- prompt (string?): Optional text guidance for the continuation.
- extensionSeconds (double): How many seconds of audio to add.
- numInferenceSteps (int): Number of generation steps.
- seed (int?): Random seed for reproducibility.
Returns
- Tensor<T>
Extended audio waveform (original + continuation).
Remarks
For Beginners: This extends audio by generating additional sound that follows on naturally.
- Input: 5 seconds of guitar
- Output: the original clip plus 10 more seconds in the same style
Exceptions
- NotSupportedException
Thrown if continuation is not supported.
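Since continuation is optional, callers should check `SupportsAudioContinuation` before invoking this method. A usage sketch, assuming `model` is a concrete `IAudioGenerator<float>` and `guitarClip` is an existing waveform tensor (both hypothetical names):

```csharp
// Extend an existing clip, guarding against models without continuation support.
if (model.SupportsAudioContinuation)
{
    Tensor<float> extended = model.ContinueAudio(
        inputAudio: guitarClip,               // e.g. 5 seconds of guitar
        prompt: "continue in the same style", // optional guidance
        extensionSeconds: 10,
        seed: 42);                            // fixed seed for reproducibility
    // `extended` holds the original clip followed by the generated continuation.
}
```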
GenerateAudio(string, string?, double, int, double, int?)
Generates audio from a text description.
Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string): Text description of the desired audio.
- negativePrompt (string?): What to avoid in the generated audio.
- durationSeconds (double): Length of audio to generate, in seconds.
- numInferenceSteps (int): Number of generation steps (more = higher quality).
- guidanceScale (double): How closely to follow the prompt (higher = more literal).
- seed (int?): Random seed for reproducibility.
Returns
- Tensor<T>
Generated audio waveform tensor [samples] or [channels, samples].
Remarks
For Beginners: This creates sound effects or ambient audio from descriptions.
- prompt: "Ocean waves crashing on a beach" creates wave sounds
- prompt: "Birds chirping in a forest" creates bird sounds
- negativePrompt: "No human voices" prevents speech in the output
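The examples above can be sketched as a call using the defaults from the signature. `model` is assumed to be a concrete `IAudioGenerator<float>` implementation, which this interface does not name:

```csharp
// Generate a 5-second sound effect from a text prompt.
Tensor<float> waveform = model.GenerateAudio(
    prompt: "Ocean waves crashing on a beach",
    negativePrompt: "No human voices",
    durationSeconds: 5,
    numInferenceSteps: 100,  // more steps = higher quality, slower generation
    guidanceScale: 3,        // higher = follows the prompt more literally
    seed: 42);               // fixed seed makes the output reproducible
```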
GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)
Generates audio from a text description asynchronously.
Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)
Parameters
- prompt (string): Text description of the desired audio.
- negativePrompt (string?): What to avoid in the generated audio.
- durationSeconds (double): Length of audio to generate, in seconds.
- numInferenceSteps (int): Number of generation steps.
- guidanceScale (double): How closely to follow the prompt.
- seed (int?): Random seed for reproducibility.
- cancellationToken (CancellationToken): Cancellation token for the async operation.
Returns
- Task<Tensor<T>>
Generated audio waveform tensor.
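Because generation can take long enough to block a UI thread, the async variant accepts a cancellation token. A sketch of cancelling after a timeout, assuming `model` is a concrete `IAudioGenerator<float>`:

```csharp
// Generate audio asynchronously, cancelling if it takes longer than two minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

Tensor<float> waveform = await model.GenerateAudioAsync(
    "Thunder during a storm",
    durationSeconds: 8,
    cancellationToken: cts.Token);
```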
GenerateMusic(string, string?, double, int, double, int?)
Generates music from a text description.
Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string): Text description of the desired music.
- negativePrompt (string?): What to avoid in the generated music.
- durationSeconds (double): Length of music to generate, in seconds.
- numInferenceSteps (int): Number of generation steps.
- guidanceScale (double): How closely to follow the prompt.
- seed (int?): Random seed for reproducibility.
Returns
- Tensor<T>
Generated music waveform tensor.
Remarks
For Beginners: This creates music from descriptions.
- prompt: "Relaxing piano melody" creates piano music
- prompt: "Energetic rock guitar riff" creates rock music
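Not every implementation supports music generation, so the capability flag should be checked first. A sketch, with `model` standing in for any concrete `IAudioGenerator<float>`:

```csharp
// Generate 10 seconds of music, guarding on the capability flag.
if (model.SupportsTextToMusic)
{
    Tensor<float> music = model.GenerateMusic(
        prompt: "Relaxing piano melody",
        durationSeconds: 10,
        seed: 7);
}
```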
GetDefaultOptions()
Gets generation options for advanced control.
AudioGenerationOptions<T> GetDefaultOptions()
Returns
- AudioGenerationOptions<T>
The default generation options for this model.
InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)
Fills in missing or masked sections of audio.
Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)
Parameters
- audio (Tensor&lt;T&gt;): Audio containing the sections to fill.
- mask (Tensor&lt;T&gt;): Mask tensor indicating which samples to regenerate (1 = regenerate, 0 = keep).
- prompt (string?): Optional text guidance for inpainting.
- numInferenceSteps (int): Number of generation steps.
- seed (int?): Random seed for reproducibility.
Returns
- Tensor<T>
Audio with masked sections filled in.
Remarks
For Beginners: This fills in gaps in audio, like photo inpainting but for sound.
- Input: Audio with a 2-second gap (maybe someone coughed)
- Output: Audio with the gap filled seamlessly
Exceptions
- NotSupportedException
Thrown if inpainting is not supported.
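The mask has the same shape as the audio, with 1 marking samples to regenerate. A sketch of filling a 2-second gap; `model` and `damagedClip` are hypothetical, and the element-wise `Tensor<T>` construction and indexing shown here are assumptions about the tensor API, not guaranteed by this interface:

```csharp
// Fill a 2-second gap starting at second 3. The mask marks samples to regenerate.
int gapStart = model.SampleRate * 3;
int gapLength = model.SampleRate * 2;

var mask = new Tensor<float>(damagedClip.Shape); // assumed to initialize to 0 (keep)
for (int i = gapStart; i < gapStart + gapLength; i++)
    mask[i] = 1f;                                // 1 = regenerate this sample

if (model.SupportsAudioInpainting)
{
    Tensor<float> repaired = model.InpaintAudio(damagedClip, mask);
}
```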