
Class AudioGenModel<T>

Namespace
AiDotNet.Audio.AudioGen
Assembly
AiDotNet.dll

AudioGen model for generating audio from text descriptions using neural audio codecs.

public class AudioGenModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioGenModel<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IAudioGenerator<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

AudioGen uses a language model approach to generate audio from text prompts. The architecture consists of three main components:

  1. Text Encoder: Converts text prompts to embeddings (typically T5-based)
  2. Audio Language Model: Generates discrete audio codes autoregressively
  3. Audio Decoder (EnCodec): Converts audio codes back to waveforms

For Beginners: AudioGen is fundamentally different from Text-to-Speech (TTS):

TTS vs AudioGen:

  • TTS: Converts specific words to speech ("Hello world" -> spoken words "Hello world")
  • AudioGen: Creates sounds matching a description ("dog barking" -> actual bark sound)

How it works:

  1. Your text prompt ("a cat meowing softly") is encoded into a numerical representation
  2. A language model generates a sequence of "audio tokens" (like words, but for sound)
  3. The EnCodec decoder converts these tokens back into actual audio waveforms

Why discrete audio codes?

  • Raw audio has too many samples (32,000 per second!)
  • EnCodec compresses audio to ~50 tokens per second
  • This makes the language model's job much easier

Common use cases:

  • Sound effect generation for games/films
  • Creating ambient soundscapes
  • Generating audio for multimedia content
  • Rapid prototyping of audio concepts

Limitations:

  • Cannot generate intelligible speech (use TTS for that)
  • Quality depends on training data
  • May struggle with very specific or unusual sounds

Reference: "AudioGen: Textually Guided Audio Generation" by Kreuk et al., 2022

Constructors

AudioGenModel(NeuralNetworkArchitecture<T>, AudioGenModelSize, int, double, double, double, int, double, double, int, int, int, int, int, int, int, int, int?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an AudioGen network using native library layers for training from scratch.

public AudioGenModel(NeuralNetworkArchitecture<T> architecture, AudioGenModelSize modelSize = AudioGenModelSize.Medium, int sampleRate = 32000, double durationSeconds = 5, double maxDurationSeconds = 30, double temperature = 1, int topK = 250, double topP = 0, double guidanceScale = 3, int channels = 1, int textHiddenDim = 0, int lmHiddenDim = 0, int numLmLayers = 0, int numHeads = 0, int numCodebooks = 4, int codebookSize = 1024, int maxTextLength = 256, int? seed = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

modelSize AudioGenModelSize

Model size variant (default: Medium).

sampleRate int

Output sample rate in Hz (default: 32000 for AudioGen).

durationSeconds double

Default generation duration in seconds (default: 5.0).

maxDurationSeconds double

Maximum generation duration in seconds (default: 30.0).

temperature double

Sampling temperature - higher values produce more random output (default: 1.0).

topK int

Top-k sampling: only the k most probable tokens are considered (default: 250).

topP double

Top-p (nucleus) sampling threshold (default: 0.0 = disabled).

guidanceScale double

Classifier-free guidance scale (default: 3.0).

channels int

Number of audio channels, 1=mono, 2=stereo (default: 1).

textHiddenDim int

Text encoder hidden dimension. If 0, uses model size default.

lmHiddenDim int

Language model hidden dimension. If 0, uses model size default.

numLmLayers int

Number of language model layers. If 0, uses model size default.

numHeads int

Number of attention heads. If 0, uses model size default.

numCodebooks int

Number of EnCodec codebooks (default: 4).

codebookSize int

Size of each codebook vocabulary (default: 1024).

maxTextLength int

Maximum text sequence length (default: 256).

seed int?

Random seed for reproducibility. Null for non-deterministic generation.

tokenizer ITokenizer

Optional tokenizer for text processing. If null, creates a T5-style default.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for training.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

For Beginners: Use this constructor when training AudioGen from scratch.

Training your own AudioGen:

  1. You need paired data: (text descriptions, audio clips)
  2. Audio is pre-encoded to discrete codes using EnCodec
  3. The model learns to predict audio codes from text

This is computationally expensive and requires significant data. For most use cases, load pretrained ONNX models instead.
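
If you do train from scratch, a rough construction sketch follows. The architecture value and the explicit settings are illustrative, not recommendations; omitted parameters keep their defaults:

// Construct an AudioGen model with native layers for training from scratch.
var audioGen = new AudioGenModel<float>(
    architecture,                        // your NeuralNetworkArchitecture<float> configuration
    modelSize: AudioGenModelSize.Medium, // size variant; hidden dimensions default from this
    sampleRate: 32000,
    numCodebooks: 4,                     // must match the EnCodec setup used to encode your audio
    codebookSize: 1024,
    seed: 42);

Training then proceeds by repeatedly calling Train with paired text and audio-code tensors; see Train below.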

AudioGenModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, AudioGenModelSize, int, double, double, double, int, double, double, int, int?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates an AudioGen network using pretrained ONNX models.

public AudioGenModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string languageModelPath, string audioDecoderPath, ITokenizer tokenizer, AudioGenModelSize modelSize = AudioGenModelSize.Medium, int sampleRate = 32000, double durationSeconds = 5, double maxDurationSeconds = 30, double temperature = 1, int topK = 250, double topP = 0, double guidanceScale = 3, int channels = 1, int? seed = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

textEncoderPath string

Path to the text encoder ONNX model.

languageModelPath string

Path to the audio language model ONNX model.

audioDecoderPath string

Path to the audio decoder (EnCodec) ONNX model.

tokenizer ITokenizer

Tokenizer for text processing. REQUIRED - must match the text encoder (typically T5).

modelSize AudioGenModelSize

Model size variant (default: Medium).

sampleRate int

Output sample rate in Hz (default: 32000 for AudioGen).

durationSeconds double

Default generation duration in seconds (default: 5.0).

maxDurationSeconds double

Maximum generation duration in seconds (default: 30.0).

temperature double

Sampling temperature - higher values produce more random output (default: 1.0).

topK int

Top-k sampling: only the k most probable tokens are considered (default: 250).

topP double

Top-p (nucleus) sampling threshold (default: 0.0 = disabled).

guidanceScale double

Classifier-free guidance scale (default: 3.0).

channels int

Number of audio channels, 1=mono, 2=stereo (default: 1).

seed int?

Random seed for reproducibility. Null for non-deterministic generation.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

When loading pretrained ONNX models, you MUST provide a tokenizer that matches the text encoder. AudioGen uses T5-based text encoders, so use a T5 tokenizer:

// Load a T5 tokenizer that matches the pretrained text encoder.
var tokenizer = await AutoTokenizer.FromPretrainedAsync("t5-base");
// Construct the model from the three ONNX components.
var audioGen = new AudioGenModel<float>(architecture, encoderPath, lmPath, decoderPath, tokenizer);

Properties

IsReady

Gets whether the model is ready for inference.

public bool IsReady { get; }

Property Value

bool

MaxDurationSeconds

Gets the maximum duration of audio that can be generated in seconds.

public double MaxDurationSeconds { get; }

Property Value

double

ModelSize

Gets the model size variant.

public AudioGenModelSize ModelSize { get; }

Property Value

AudioGenModelSize

SupportsAudioContinuation

Gets whether this model supports audio continuation.

public bool SupportsAudioContinuation { get; }

Property Value

bool

SupportsAudioInpainting

Gets whether this model supports audio inpainting.

public bool SupportsAudioInpainting { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public bool SupportsTextToMusic { get; }

Property Value

bool

Methods

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues existing audio to extend it naturally.

public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

The existing audio waveform to extend.

prompt string

Optional text description guiding the continuation.

extensionSeconds double

Duration of additional audio to generate in seconds.

numInferenceSteps int

Number of inference steps (not used in autoregressive generation).

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

The extended audio waveform tensor.
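
A brief usage sketch, assuming existingClip is a waveform tensor at the model's sample rate (for example, one returned by GenerateAudio); the prompt and values are illustrative:

// Extend an existing clip by 5 seconds, optionally guided by a text prompt.
Tensor<float> extended = audioGen.ContinueAudio(
    inputAudio: existingClip,
    prompt: "rain continuing steadily",  // null lets the model continue unguided
    extensionSeconds: 5.0,
    seed: 42);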

CreateNewInstance()

Creates a new instance of this model for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes the model and releases resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

GenerateAudio(string, string?, double, int, double, int?)

Generates audio from a text description.

public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

Optional negative prompt for classifier-free guidance.

durationSeconds double

Duration of audio to generate in seconds.

numInferenceSteps int

Number of inference steps (not used in autoregressive generation).

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

Generated audio waveform tensor.
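
A short usage sketch (the prompts and values are illustrative):

// Generate 4 seconds of a sound effect, steering away from unwanted content
// with a negative prompt and fixing the seed for reproducibility.
Tensor<float> sfx = audioGen.GenerateAudio(
    prompt: "footsteps on gravel",
    negativePrompt: "music, speech",
    durationSeconds: 4.0,
    seed: 1234);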

GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)

Generates audio from a text description asynchronously.

public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)

Parameters

prompt string

Text description of the desired audio.

negativePrompt string

Optional negative prompt for classifier-free guidance.

durationSeconds double

Duration of audio to generate in seconds.

numInferenceSteps int

Number of inference steps (not used in autoregressive generation).

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed for reproducibility.

cancellationToken CancellationToken

Token for cancelling the asynchronous operation.

Returns

Task<Tensor<T>>

A task that resolves to the generated audio waveform tensor.
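
A usage sketch showing cancellation; the timeout value is illustrative:

// Generate asynchronously so the calling thread is not blocked; cancel after 2 minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
Tensor<float> audio = await audioGen.GenerateAudioAsync(
    prompt: "ocean waves crashing on rocks",
    durationSeconds: 8.0,
    cancellationToken: cts.Token);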

GenerateMusic(string, string?, double, int, double, int?)

Generates music from a text description.

public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt for classifier-free guidance.

durationSeconds double

Duration of music to generate in seconds (default: 10.0).

numInferenceSteps int

Number of inference steps (not used in autoregressive generation).

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

Generated audio waveform tensor.
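
A usage sketch; the prompt is illustrative. Check SupportsTextToMusic before relying on this path, since AudioGen is primarily a sound-effect model:

// Generate a 10-second musical clip from a text description.
if (audioGen.SupportsTextToMusic)
{
    Tensor<float> music = audioGen.GenerateMusic(
        prompt: "calm acoustic guitar melody, slow tempo",
        durationSeconds: 10.0,
        seed: 7);
}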

GetDefaultOptions()

Gets default generation options.

public AudioGenerationOptions<T> GetDefaultOptions()

Returns

AudioGenerationOptions<T>

GetModelMetadata()

Gets metadata about the model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the neural network layers following the golden standard pattern.

protected override void InitializeLayers()

InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)

Fills in missing or masked sections of audio.

public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)

Parameters

audio Tensor<T>

The audio waveform containing the sections to fill in.

mask Tensor<T>

Mask indicating which sections of the audio to regenerate.

prompt string

Optional text description guiding the inpainted content.

numInferenceSteps int

Number of inference steps (not used in autoregressive generation).

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

The audio waveform tensor with the masked sections filled in.
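
A rough sketch, assuming damagedClip and regionMask are tensors of matching shape and that the mask marks the sections to regenerate (the exact mask convention is an assumption here):

// Fill in a masked region, optionally guiding the new content with a prompt.
Tensor<float> repaired = audioGen.InpaintAudio(
    audio: damagedClip,
    mask: regionMask,
    prompt: "birdsong in a forest",
    seed: 99);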

PostprocessOutput(Tensor<T>)

Postprocesses model output into the final result format.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>
expectedOutput Tensor<T>
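
A hedged sketch of a training loop, following the from-scratch constructor remarks above: the inputs are assumed to be encoded text prompts and the targets EnCodec codes produced offline from the paired audio clips; actual tensor shapes depend on your architecture and data pipeline:

// One pass over pre-prepared (text, audio-code) pairs.
// trainingPairs is an illustrative IEnumerable<(Tensor<float>, Tensor<float>)>.
foreach (var (textTensor, audioCodeTensor) in trainingPairs)
{
    audioGen.Train(textTensor, audioCodeTensor);
}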

UpdateParameters(Vector<T>)

Updates model parameters by applying gradient descent.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Remarks

Applies the simple gradient descent update rule: params = params - learning_rate * gradients.

For Beginners: This is how the model learns!

During training:

  1. The model makes predictions
  2. We calculate how wrong it was (loss)
  3. We compute gradients (which direction to adjust each parameter)
  4. This method applies those adjustments to make the model better

The learning rate controls how big each adjustment is:

  • Too big: Model learns fast but may overshoot optimal values
  • Too small: Model learns slowly but more precisely
  • Default (0.001): A good starting point for most tasks
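
A minimal sketch of the update rule described above, with plain double arrays standing in for the library's vector type:

// params = params - learning_rate * gradients, applied element by element.
// parameters and gradients are double[] of equal length (illustrative stand-ins).
double learningRate = 0.001;  // the default starting point mentioned above
for (int i = 0; i < parameters.Length; i++)
{
    parameters[i] -= learningRate * gradients[i];
}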