Class AudioGenModel<T>
AudioGen model for generating audio from text descriptions using neural audio codecs.
public class AudioGenModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
AudioGen uses a language model approach to generate audio from text prompts. The architecture consists of three main components:
- Text Encoder: Converts text prompts to embeddings (typically T5-based)
- Audio Language Model: Generates discrete audio codes autoregressively
- Audio Decoder (EnCodec): Converts audio codes back to waveforms
For Beginners: AudioGen is fundamentally different from Text-to-Speech (TTS):
TTS vs AudioGen:
- TTS: Converts specific words to speech ("Hello world" -> spoken words "Hello world")
- AudioGen: Creates sounds matching a description ("dog barking" -> actual bark sound)
How it works:
- Your text prompt ("a cat meowing softly") is encoded into a numerical representation
- A language model generates a sequence of "audio tokens" (like words, but for sound)
- The EnCodec decoder converts these tokens back into actual audio waveforms
Why discrete audio codes?
- Raw audio has too many samples (32,000 per second!)
- EnCodec compresses audio to ~50 tokens per second
- This makes the language model's job much easier
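The size reduction described above can be checked with a little arithmetic. This is a minimal sketch, assuming the "~50 tokens per second" figure means about 50 EnCodec frames per second with one code per codebook in each frame, and using the constructor defaults (32,000 Hz, 4 codebooks, 5 seconds). The variable names are illustrative only and are not part of the library.

```csharp
using System;

// Assumed figures: taken from the text above and the constructor defaults.
const int sampleRate = 32000;       // raw audio samples per second
const int framesPerSecond = 50;     // approximate EnCodec frame rate (assumption)
const int numCodebooks = 4;         // codes emitted per frame
const double durationSeconds = 5.0; // default generation length

int rawSamples = (int)(sampleRate * durationSeconds);   // samples in the waveform
int frames = (int)(framesPerSecond * durationSeconds);  // autoregressive steps
int totalCodes = frames * numCodebooks;                 // discrete codes overall
double stepReduction = (double)rawSamples / frames;     // samples covered by one step

Console.WriteLine($"Raw samples:     {rawSamples}");
Console.WriteLine($"Codec frames:    {frames}");
Console.WriteLine($"Discrete codes:  {totalCodes}");
Console.WriteLine($"Samples per step: {stepReduction}");
```

Under these assumptions a 5-second clip is 160,000 samples but only 250 frames (1,000 codes), so the language model predicts a sequence several hundred times shorter than the raw waveform.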
Common use cases:
- Sound effect generation for games/films
- Creating ambient soundscapes
- Generating audio for multimedia content
- Rapid prototyping of audio concepts
Limitations:
- Cannot generate intelligible speech (use TTS for that)
- Quality depends on training data
- May struggle with very specific or unusual sounds
Reference: "AudioGen: Textually Guided Audio Generation" by Kreuk et al., 2022
Constructors
AudioGenModel(NeuralNetworkArchitecture<T>, AudioGenModelSize, int, double, double, double, int, double, double, int, int, int, int, int, int, int, int, int?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates an AudioGen network using native library layers for training from scratch.
public AudioGenModel(NeuralNetworkArchitecture<T> architecture, AudioGenModelSize modelSize = AudioGenModelSize.Medium, int sampleRate = 32000, double durationSeconds = 5, double maxDurationSeconds = 30, double temperature = 1, int topK = 250, double topP = 0, double guidanceScale = 3, int channels = 1, int textHiddenDim = 0, int lmHiddenDim = 0, int numLmLayers = 0, int numHeads = 0, int numCodebooks = 4, int codebookSize = 1024, int maxTextLength = 256, int? seed = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
modelSize (AudioGenModelSize): Model size variant (default: Medium).
sampleRate (int): Output sample rate in Hz (default: 32000 for AudioGen).
durationSeconds (double): Default generation duration in seconds (default: 5.0).
maxDurationSeconds (double): Maximum generation duration in seconds (default: 30.0).
temperature (double): Sampling temperature; higher values produce more random output (default: 1.0).
topK (int): Top-k sampling; only the k most likely tokens are considered (default: 250).
topP (double): Top-p (nucleus) sampling threshold (default: 0.0 = disabled).
guidanceScale (double): Classifier-free guidance scale (default: 3.0).
channels (int): Number of audio channels, 1 = mono, 2 = stereo (default: 1).
textHiddenDim (int): Text encoder hidden dimension. If 0, uses the model size default.
lmHiddenDim (int): Language model hidden dimension. If 0, uses the model size default.
numLmLayers (int): Number of language model layers. If 0, uses the model size default.
numHeads (int): Number of attention heads. If 0, uses the model size default.
numCodebooks (int): Number of EnCodec codebooks (default: 4).
codebookSize (int): Size of each codebook vocabulary (default: 1024).
maxTextLength (int): Maximum text sequence length (default: 256).
seed (int?): Random seed for reproducibility. Null for non-deterministic generation.
tokenizer (ITokenizer): Optional tokenizer for text processing. If null, creates a T5-style default.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for training.
lossFunction (ILossFunction<T>): Optional loss function.
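To make the temperature and topK parameters more concrete, here is a minimal, self-contained sketch of temperature-scaled top-k sampling over a vector of logits. It illustrates the general technique only and is not taken from the library's source; SampleToken and the demo variable names are hypothetical, and top-p filtering is omitted for brevity. The sketch assumes temperature is greater than zero.

```csharp
using System;
using System.Linq;

// Hypothetical helper: draws one token index from logits using temperature and top-k.
static int SampleToken(double[] logits, double temperature, int topK, Random rng)
{
    // Keep only the topK highest-scoring tokens (topK <= 0 keeps all; an assumption of this sketch).
    int k = topK > 0 ? Math.Min(topK, logits.Length) : logits.Length;
    int[] candidates = Enumerable.Range(0, logits.Length)
        .OrderByDescending(i => logits[i])
        .Take(k)
        .ToArray();

    // Softmax weights over temperature-scaled logits; subtracting the max keeps Exp stable.
    double maxLogit = logits[candidates[0]];
    double[] weights = candidates
        .Select(i => Math.Exp((logits[i] - maxLogit) / temperature))
        .ToArray();
    double total = weights.Sum();

    // Draw from the resulting categorical distribution.
    double r = rng.NextDouble() * total;
    double cumulative = 0.0;
    for (int j = 0; j < candidates.Length; j++)
    {
        cumulative += weights[j];
        if (r < cumulative) return candidates[j];
    }
    return candidates[candidates.Length - 1]; // guard against floating-point rounding
}

var demoRng = new Random(42);
double[] demoLogits = { 2.0, 1.0, 0.5, -1.0 };
int token = SampleToken(demoLogits, temperature: 1.0, topK: 2, rng: demoRng);
Console.WriteLine(token); // index 0 or 1, because only the two highest logits survive top-k
```

Lower temperatures sharpen the distribution toward the highest logit, while higher temperatures flatten it, which matches the "more random output" behaviour described for the temperature parameter.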
Remarks
For Beginners: Use this constructor when training AudioGen from scratch.
Training your own AudioGen:
- You need paired data: (text descriptions, audio clips)
- Audio is pre-encoded to discrete codes using EnCodec
- The model learns to predict audio codes from text
This is computationally expensive and requires significant data. For most use cases, load pretrained ONNX models instead.
AudioGenModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, AudioGenModelSize, int, double, double, double, int, double, double, int, int?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates an AudioGen network using pretrained ONNX models.
public AudioGenModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string languageModelPath, string audioDecoderPath, ITokenizer tokenizer, AudioGenModelSize modelSize = AudioGenModelSize.Medium, int sampleRate = 32000, double durationSeconds = 5, double maxDurationSeconds = 30, double temperature = 1, int topK = 250, double topP = 0, double guidanceScale = 3, int channels = 1, int? seed = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
textEncoderPath (string): Path to the text encoder ONNX model.
languageModelPath (string): Path to the audio language model ONNX model.
audioDecoderPath (string): Path to the audio decoder (EnCodec) ONNX model.
tokenizer (ITokenizer): Tokenizer for text processing. Required; it must match the text encoder (typically T5).
modelSize (AudioGenModelSize): Model size variant (default: Medium).
sampleRate (int): Output sample rate in Hz (default: 32000 for AudioGen).
durationSeconds (double): Default generation duration in seconds (default: 5.0).
maxDurationSeconds (double): Maximum generation duration in seconds (default: 30.0).
temperature (double): Sampling temperature; higher values produce more random output (default: 1.0).
topK (int): Top-k sampling; only the k most likely tokens are considered (default: 250).
topP (double): Top-p (nucleus) sampling threshold (default: 0.0 = disabled).
guidanceScale (double): Classifier-free guidance scale (default: 3.0).
channels (int): Number of audio channels, 1 = mono, 2 = stereo (default: 1).
seed (int?): Random seed for reproducibility. Null for non-deterministic generation.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for fine-tuning.
lossFunction (ILossFunction<T>): Optional loss function.
Remarks
When loading pretrained ONNX models, you MUST provide a tokenizer that matches the text encoder. AudioGen uses T5-based text encoders, so use a T5 tokenizer:
var tokenizer = await AutoTokenizer.FromPretrainedAsync("t5-base");
var audioGen = new AudioGenModel<float>(architecture, encoderPath, lmPath, decoderPath, tokenizer);
Properties
IsReady
Gets whether the model is ready for inference.
public bool IsReady { get; }
Property Value
- bool
MaxDurationSeconds
Gets the maximum duration of audio that can be generated in seconds.
public double MaxDurationSeconds { get; }
Property Value
- double
ModelSize
Gets the model size variant.
public AudioGenModelSize ModelSize { get; }
Property Value
- AudioGenModelSize
SupportsAudioContinuation
Gets whether this model supports audio continuation.
public bool SupportsAudioContinuation { get; }
Property Value
- bool
SupportsAudioInpainting
Gets whether this model supports audio inpainting.
public bool SupportsAudioInpainting { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public bool SupportsTextToMusic { get; }
Property Value
- bool
Methods
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues existing audio to extend it naturally.
public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
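Parameters
inputAudio (Tensor<T>)
prompt (string)
extensionSeconds (double)
numInferenceSteps (int)
seed (int?)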
Returns
- Tensor<T>
CreateNewInstance()
Creates a new instance of this model for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader)
Dispose(bool)
Disposes the model and releases resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool)
GenerateAudio(string, string?, double, int, double, int?)
Generates audio from a text description.
public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired audio.
negativePrompt (string): Optional negative prompt for classifier-free guidance.
durationSeconds (double): Duration of audio to generate in seconds.
numInferenceSteps (int): Number of inference steps (not used in autoregressive generation).
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed for reproducibility.
Returns
- Tensor<T>
Generated audio waveform tensor.
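The guidanceScale and negativePrompt parameters refer to classifier-free guidance. The sketch below shows the commonly used blending rule, guided = unconditional + scale * (conditional - unconditional), applied to two logit vectors. It is an illustration of the general technique under the assumption that the model combines logits this way; ApplyGuidance and the demo variable names are hypothetical and do not appear in the library.

```csharp
using System;

// Hypothetical helper: blends conditional and unconditional logits with a guidance scale.
static double[] ApplyGuidance(double[] conditional, double[] unconditional, double guidanceScale)
{
    if (conditional.Length != unconditional.Length)
        throw new ArgumentException("Logit vectors must have the same length.");

    var guided = new double[conditional.Length];
    for (int i = 0; i < guided.Length; i++)
    {
        // Move away from the unconditional prediction, toward the prompt-conditioned one.
        guided[i] = unconditional[i] + guidanceScale * (conditional[i] - unconditional[i]);
    }
    return guided;
}

double[] condLogits = { 2.0, 0.0 };   // logits given the text prompt
double[] uncondLogits = { 1.0, 1.0 }; // logits given an empty (or negative) prompt
double[] guidedLogits = ApplyGuidance(condLogits, uncondLogits, 3.0);
Console.WriteLine(string.Join(", ", guidedLogits)); // values 4 and -2
```

A scale of 1 reproduces the conditional logits unchanged; larger values push the output harder toward the prompt, usually at some cost in diversity.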
GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)
Generates audio from a text description asynchronously.
public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)
Parameters
prompt (string)
negativePrompt (string)
durationSeconds (double)
numInferenceSteps (int)
guidanceScale (double)
seed (int?)
cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
GenerateMusic(string, string?, double, int, double, int?)
Generates music from a text description.
public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string)
negativePrompt (string)
durationSeconds (double)
numInferenceSteps (int)
guidanceScale (double)
seed (int?)
Returns
- Tensor<T>
GetDefaultOptions()
Gets default generation options.
public AudioGenerationOptions<T> GetDefaultOptions()
Returns
- AudioGenerationOptions<T>
GetModelMetadata()
Gets metadata about the model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the neural network layers following the golden standard pattern.
protected override void InitializeLayers()
InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)
Fills in missing or masked sections of audio.
public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)
Parameters
audio (Tensor<T>)
mask (Tensor<T>)
prompt (string)
numInferenceSteps (int)
seed (int?)
Returns
- Tensor<T>
PostprocessOutput(Tensor<T>)
Postprocesses model output into the final result format.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter)
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>)
expectedOutput (Tensor<T>)
UpdateParameters(Vector<T>)
Updates model parameters by applying gradient descent.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>)
Remarks
Applies the simple gradient descent update rule: params = params - learning_rate * gradients.
For Beginners: This is how the model learns!
During training:
- The model makes predictions
- We calculate how wrong it was (loss)
- We compute gradients (which direction to adjust each parameter)
- This method applies those adjustments to make the model better
The learning rate controls how big each adjustment is:
- Too big: Model learns fast but may overshoot optimal values
- Too small: Model learns slowly but more precisely
- Default (0.001): A good starting point for most tasks
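The update rule and the effect of the learning rate can be tried on a toy problem. This is a minimal sketch using plain double arrays rather than the library's Vector<T>; GradientDescentStep and Minimize are hypothetical helpers written for this illustration, and the objective f(x) = (x - 3)^2 is chosen only because its minimum (x = 3) is easy to check.

```csharp
using System;

// Hypothetical helper: params = params - learning_rate * gradients, applied in place.
static void GradientDescentStep(double[] parameters, double[] gradients, double learningRate)
{
    for (int i = 0; i < parameters.Length; i++)
        parameters[i] -= learningRate * gradients[i];
}

// Minimizes f(x) = (x - 3)^2 starting from x = 0; the gradient is 2 * (x - 3).
static double Minimize(double learningRate, int steps)
{
    double[] x = { 0.0 };
    for (int s = 0; s < steps; s++)
    {
        double[] grad = { 2.0 * (x[0] - 3.0) };
        GradientDescentStep(x, grad, learningRate);
    }
    return x[0];
}

Console.WriteLine($"lr = 0.1:   x = {Minimize(0.1, 100)}");   // settles very close to 3
Console.WriteLine($"lr = 0.001: x = {Minimize(0.001, 100)}"); // still far from 3 after 100 steps
Console.WriteLine($"lr = 1.1:   x = {Minimize(1.1, 50)}");    // overshoots further on every step
```

On this problem each step multiplies the distance to the minimum by (1 - 2 * learning rate), which is why 0.1 converges quickly, 0.001 converges slowly, and 1.1 diverges. Real networks behave less predictably, so these thresholds should not be read as general guidance.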