Class MusicGenModel<T>

Namespace
AiDotNet.Audio.MusicGen
Assembly
AiDotNet.dll

Meta's MusicGen model for generating music from text descriptions.

public class MusicGenModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
MusicGenModel<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

MusicGen is a state-of-the-art text-to-music generation model from Meta AI Research. It uses a single-stage transformer language model that operates directly on EnCodec audio codes, generating high-quality music from text descriptions.

Architecture components:

  1. Text Encoder: T5-based encoder that converts text prompts to embeddings
  2. Language Model: Transformer decoder that generates audio codes autoregressively
  3. EnCodec Decoder: Neural audio codec that converts discrete codes to waveforms

For Beginners: MusicGen creates original music from your descriptions:

How it works:

  1. You describe the music you want ("upbeat jazz piano")
  2. The text encoder understands your description
  3. The language model generates a sequence of "music tokens"
  4. The EnCodec decoder converts tokens to actual audio

Key features:

  • Up to 30 seconds of high-quality 32 kHz audio
  • Multiple genres and styles
  • Control over instruments, tempo, mood
  • Stereo output option

Usage:

var model = new MusicGenModel<float>(architecture, options);
var audio = model.GenerateMusic("Calm piano melody with soft strings");

Reference: "Simple and Controllable Music Generation" by Copet et al., Meta AI, 2023

Constructors

MusicGenModel(NeuralNetworkArchitecture<T>, MusicGenOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a MusicGen model using native layers for training from scratch.

public MusicGenModel(NeuralNetworkArchitecture<T> architecture, MusicGenOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

options MusicGenOptions

MusicGen configuration options.

tokenizer ITokenizer

Optional tokenizer. If null, creates T5-compatible tokenizer.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer. Defaults to AdamW.

lossFunction ILossFunction<T>

Optional loss function. Defaults to CrossEntropy.

Remarks

For Beginners: Use this constructor when:

  • Training MusicGen from scratch (requires significant data)
  • Fine-tuning on custom music styles
  • Research and experimentation

For most use cases, load pretrained ONNX models instead.
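A minimal construction sketch for this path. It assumes a `NeuralNetworkArchitecture<float>` instance (`architecture`) and training tensors have already been prepared elsewhere; the parameterless `MusicGenOptions` constructor shown here is also an assumption.

```csharp
// Sketch only: `architecture`, `inputBatch`, and `expectedOutputBatch`
// are assumed to be prepared elsewhere. Omitted arguments fall back to
// the documented defaults (T5-compatible tokenizer, AdamW, CrossEntropy).
var options = new MusicGenOptions();
var model = new MusicGenModel<float>(architecture, options);

// Train on (input, expectedOutput) tensor pairs.
model.Train(inputBatch, expectedOutputBatch);
```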

MusicGenModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, MusicGenOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a MusicGen model using pretrained ONNX models for inference.

public MusicGenModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string languageModelPath, string encodecDecoderPath, ITokenizer tokenizer, MusicGenOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

textEncoderPath string

Path to the T5 text encoder ONNX model.

languageModelPath string

Path to the transformer LM ONNX model.

encodecDecoderPath string

Path to the EnCodec decoder ONNX model.

tokenizer ITokenizer

T5 tokenizer for text processing (REQUIRED).

options MusicGenOptions

MusicGen configuration options.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.

Exceptions

ArgumentException

Thrown when required paths are empty.

FileNotFoundException

Thrown when model files don't exist.

ArgumentNullException

Thrown when tokenizer is null.
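A loading sketch for the pretrained path. The ONNX file names and the tokenizer-creation helper are placeholders, not part of this API:

```csharp
// Sketch only: file paths and LoadT5Tokenizer() are hypothetical.
ITokenizer tokenizer = LoadT5Tokenizer();
var model = new MusicGenModel<float>(
    architecture,
    textEncoderPath: "text_encoder.onnx",
    languageModelPath: "language_model.onnx",
    encodecDecoderPath: "encodec_decoder.onnx",
    tokenizer: tokenizer); // required; null throws ArgumentNullException
```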

Properties

MaxDurationSeconds

Gets the maximum duration of audio that can be generated.

public double MaxDurationSeconds { get; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

public int SampleRate { get; }

Property Value

int

SupportsAudioContinuation

Gets whether this model supports audio continuation.

public bool SupportsAudioContinuation { get; }

Property Value

bool

SupportsAudioInpainting

Gets whether this model supports audio inpainting.

public bool SupportsAudioInpainting { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public bool SupportsTextToMusic { get; }

Property Value

bool
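Because the capability flags are exposed as properties, callers can branch on them before invoking a generation method. A sketch, assuming `model` is an already-constructed `MusicGenModel<float>`:

```csharp
// Sketch only: check capabilities before calling generation methods.
if (model.SupportsTextToMusic)
{
    var audio = model.GenerateMusic("Calm piano melody");
}

if (!model.SupportsAudioInpainting)
{
    // InpaintAudio is not supported by MusicGen; avoid calling it.
}
```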

Methods

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues existing audio by extending it.

public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>

Audio to continue from.

prompt string

Optional text guidance for continuation.

extensionSeconds double

How many seconds to add.

numInferenceSteps int

Not used in autoregressive generation.

seed int?

Random seed.

Returns

Tensor<T>

Extended audio (original + continuation).
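A usage sketch, assuming `existingAudio` is a `Tensor<float>` already holding audio at the model's sample rate:

```csharp
// Sketch only: `model` and `existingAudio` are assumed to exist.
Tensor<float> extended = model.ContinueAudio(
    existingAudio,
    prompt: "continue with a gentle string section",
    extensionSeconds: 5,
    seed: 42);
// `extended` contains the original audio followed by the continuation.
```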

CreateNewInstance()

Creates a new instance for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes of resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

GenerateAudio(string, string?, double, int, double, int?)

Generates audio from a text description.

public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?

Returns

Tensor<T>

Remarks

MusicGen is optimized for music, not general audio. For best results, use GenerateMusic instead.

GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)

Generates audio asynchronously.

public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?
cancellationToken CancellationToken

Returns

Task<Tensor<T>>
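An async usage sketch with cancellation, assuming an already-constructed `model`:

```csharp
// Sketch only: cancel generation if it runs longer than 60 seconds.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
Tensor<float> audio = await model.GenerateAudioAsync(
    "rain on a tin roof",
    durationSeconds: 5,
    cancellationToken: cts.Token);
```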

GenerateMusic(string, string?, double, int, double, int?)

Generates music from a text description.

public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

What to avoid in the generated music.

durationSeconds double

Duration of music to generate (max 30s).

numInferenceSteps int

Not used in autoregressive generation.

guidanceScale double

How closely to follow the prompt.

seed int?

Random seed for reproducibility.

Returns

Tensor<T>

Generated music waveform tensor.
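A fuller call sketch showing the optional parameters, assuming an already-constructed `model`:

```csharp
// Sketch only: negativePrompt steers generation away from unwanted
// content; a fixed seed makes the output reproducible.
Tensor<float> music = model.GenerateMusic(
    "upbeat jazz piano with brushed drums",
    negativePrompt: "vocals",
    durationSeconds: 10,      // capped at 30 seconds
    guidanceScale: 3.0,
    seed: 1234);
```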

GetDefaultOptions()

Gets default generation options.

public AudioGenerationOptions<T> GetDefaultOptions()

Returns

AudioGenerationOptions<T>

GetModelMetadata()

Gets model metadata.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the neural network layers following the golden standard pattern.

protected override void InitializeLayers()

Remarks

This method follows the AiDotNet golden standard pattern:

  1. First, check if the user provided custom layers via Architecture.Layers
  2. If custom layers exist, use them (allows full customization)
  3. Otherwise, use LayerHelper.CreateDefaultMusicGenLayers() for the standard architecture

For Beginners: This gives you flexibility:

  • Want standard MusicGen? Just create the model; it auto-configures.
  • Want a custom architecture? Pass your own layers in the Architecture.

InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)

Inpainting is not supported by MusicGen.

public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)

Parameters

audio Tensor<T>
mask Tensor<T>
prompt string
numInferenceSteps int
seed int?

Returns

Tensor<T>

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>
expectedOutput Tensor<T>

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>