Interface IAudioGenerator<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for audio generation models that create audio from text descriptions or other conditions.

public interface IAudioGenerator<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Audio generation models create sounds, music, and audio effects from various inputs. Unlike text-to-speech (TTS), which focuses on speech, audio generators can produce any type of sound, including music, environmental sounds, and sound effects.

For Beginners: Audio generation is like having an artist who can create any sound you describe.

How audio generation works:

  1. You provide a description ("A dog barking in a park")
  2. The model generates audio features that match the description
  3. The features are converted to playable audio

Types of audio generation:

  • Text-to-Audio: "Thunder during a storm" creates thunder sounds
  • Text-to-Music: "Upbeat jazz piano" creates music
  • Audio Inpainting: Fill in missing parts of audio
  • Audio Continuation: Extend existing audio naturally

Common use cases:

  • Video game sound effects
  • Film and media production
  • Music composition assistance
  • Podcast and content creation

This interface extends IFullModel<T, TInput, TOutput> for Tensor-based audio processing.

Properties

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for inference. When false, the model can be trained from scratch using the neural network infrastructure.

MaxDurationSeconds

Gets the maximum duration of audio that can be generated in seconds.

double MaxDurationSeconds { get; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

int SampleRate { get; }

Property Value

int

Remarks

Common values: 16000 Hz (low quality), 22050 Hz (medium), 44100 Hz (CD quality).
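The sample rate fixes the number of samples per second of audio, so the length of a generated waveform follows directly from it. A minimal sketch of the arithmetic (plain C#, no AiDotNet types involved):

```csharp
// Samples in a clip = sample rate (samples/second) * duration (seconds).
int sampleRate = 44100;          // CD quality
double durationSeconds = 5.0;
int sampleCount = (int)(sampleRate * durationSeconds); // 220500 samples
```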

SupportsAudioContinuation

Gets whether this model supports audio continuation.

bool SupportsAudioContinuation { get; }

Property Value

bool

SupportsAudioInpainting

Gets whether this model supports audio inpainting.

bool SupportsAudioInpainting { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

bool SupportsTextToMusic { get; }

Property Value

bool
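Since an implementation may support only a subset of these modes, callers can branch on the capability flags before choosing a generation method. A sketch, assuming `generator` is an `IAudioGenerator<float>` obtained elsewhere:

```csharp
if (generator.SupportsTextToMusic)
{
    // Preferred path: a model trained specifically for music.
    Tensor<float> music = generator.GenerateMusic("Upbeat jazz piano");
}
else if (generator.SupportsTextToAudio)
{
    // Fallback: general text-to-audio generation.
    Tensor<float> audio = generator.GenerateAudio("Upbeat jazz piano");
}
```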

Methods

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues existing audio to extend it naturally.

Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Number of generation steps.

seed int?

Random seed for reproducibility.

Returns

Tensor<T>

Extended audio waveform (original + continuation).

Remarks

For Beginners: This extends audio by generating more audio that follows on naturally.

  • Input: 5 seconds of guitar
  • Output: the original clip + 10 more seconds in the same style

Exceptions

NotSupportedException

Thrown if continuation is not supported.
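A usage sketch based on the signature above, assuming `generator` is an `IAudioGenerator<float>` and `inputAudio` holds a 5-second guitar clip:

```csharp
if (!generator.SupportsAudioContinuation)
    throw new NotSupportedException("This model cannot continue audio.");

// Extend the clip by 10 seconds in the same style.
Tensor<float> extended = generator.ContinueAudio(
    inputAudio,
    prompt: "acoustic guitar, same tempo",   // optional guidance
    extensionSeconds: 10,
    seed: 42);                               // fixed seed for reproducibility
```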

GenerateAudio(string, string?, double, int, double, int?)

Generates audio from a text description.

Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the generated audio.

durationSeconds double

Length of audio to generate.

numInferenceSteps int

Number of generation steps (more = higher quality).

guidanceScale double

How closely to follow the prompt (higher = more literal).

seed int?

Random seed for reproducibility.

Returns

Tensor<T>

Generated audio waveform tensor [samples] or [channels, samples].

Remarks

For Beginners: This creates sound effects or ambient audio from descriptions.

  • prompt: "Ocean waves crashing on a beach" creates wave sounds
  • prompt: "Birds chirping in a forest" creates bird sounds
  • negativePrompt: "No human voices" prevents speech in the output
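Putting the parameters together, a hedged usage sketch (the `generator` instance is assumed to be an `IAudioGenerator<float>` obtained elsewhere):

```csharp
// Five seconds of beach ambience, steered away from speech.
Tensor<float> waves = generator.GenerateAudio(
    prompt: "Ocean waves crashing on a beach",
    negativePrompt: "No human voices",
    durationSeconds: 5,
    numInferenceSteps: 100, // more steps = higher quality, slower
    guidanceScale: 3,       // higher = follows the prompt more literally
    seed: 42);              // fixed seed for reproducible output
```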

GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)

Generates audio from a text description asynchronously.

Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

What to avoid in the generated audio.

durationSeconds double

Length of audio to generate.

numInferenceSteps int

Number of generation steps.

guidanceScale double

How closely to follow the prompt.

seed int?

Random seed for reproducibility.

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task<Tensor<T>>

Generated audio waveform tensor.
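A sketch of the asynchronous variant with cancellation, assuming `generator` is an `IAudioGenerator<float>` and the call runs inside an async method:

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

// Generate without blocking the caller; give up after 2 minutes.
Tensor<float> audio = await generator.GenerateAudioAsync(
    "Birds chirping in a forest",
    cancellationToken: cts.Token);
```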

GenerateMusic(string, string?, double, int, double, int?)

Generates music from a text description.

Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the generated music.

durationSeconds double

Length of music to generate.

numInferenceSteps int

Number of generation steps.

guidanceScale double

How closely to follow the prompt.

seed int?

Random seed for reproducibility.

Returns

Tensor<T>

Generated music waveform tensor.

Remarks

For Beginners: This creates music from descriptions.

  • prompt: "Relaxing piano melody" creates piano music
  • prompt: "Energetic rock guitar riff" creates rock music
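A usage sketch based on the signature above (`generator` is assumed to be an `IAudioGenerator<float>`):

```csharp
// Ten seconds of piano; the negative prompt steers away from percussion.
Tensor<float> piano = generator.GenerateMusic(
    prompt: "Relaxing piano melody",
    negativePrompt: "drums, percussion",
    durationSeconds: 10,
    seed: 7); // fixed seed so the same melody can be regenerated
```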

GetDefaultOptions()

Gets the default generation options for advanced control.

AudioGenerationOptions<T> GetDefaultOptions()

Returns

AudioGenerationOptions<T>

InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)

Fills in missing or masked sections of audio.

Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)

Parameters

audio Tensor<T>

Audio with sections to fill.

mask Tensor<T>

Mask tensor indicating which samples to regenerate (1 = regenerate, 0 = keep).

prompt string

Optional text guidance for inpainting.

numInferenceSteps int

Number of generation steps.

seed int?

Random seed for reproducibility.

Returns

Tensor<T>

Audio with masked sections filled in.

Remarks

For Beginners: This fills in gaps in audio, like photo inpainting but for sound.

  • Input: Audio with a 2-second gap (maybe someone coughed)
  • Output: Audio with the gap filled seamlessly

Exceptions

NotSupportedException

Thrown if inpainting is not supported.
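A sketch of building the mask and calling the method, assuming `generator` is an `IAudioGenerator<float>`, `audio` is a mono [samples] tensor with a gap, and that `Tensor<T>` exposes a shape-based constructor and an element indexer (both assumptions, not confirmed by this page):

```csharp
if (!generator.SupportsAudioInpainting)
    throw new NotSupportedException("This model cannot inpaint audio.");

// Mask has the same length as the audio: 1 = regenerate, 0 = keep.
// Regenerate a 2-second gap starting at the 3-second mark.
var mask = new Tensor<float>(audio.Shape); // assumed zero-initialized
int gapStart = 3 * generator.SampleRate;
int gapEnd = gapStart + 2 * generator.SampleRate;
for (int i = gapStart; i < gapEnd; i++)
    mask[i] = 1f;

Tensor<float> repaired = generator.InpaintAudio(audio, mask,
    prompt: "continuous street ambience"); // optional guidance
```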