Class MusicGenModel<T>
Meta's MusicGen model for generating music from text descriptions.
public class MusicGenModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance
AudioNeuralNetworkBase<T> → MusicGenModel<T>
Remarks
MusicGen is a state-of-the-art text-to-music generation model from Meta AI Research. It uses a single-stage transformer language model that operates directly on EnCodec audio codes, generating high-quality music from text descriptions.
Architecture components:
- Text Encoder: T5-based encoder that converts text prompts to embeddings
- Language Model: Transformer decoder that generates audio codes autoregressively
- EnCodec Decoder: Neural audio codec that converts discrete codes to waveforms
For Beginners: MusicGen creates original music from your descriptions:
How it works:
- You describe the music you want ("upbeat jazz piano")
- The text encoder understands your description
- The language model generates a sequence of "music tokens"
- The EnCodec decoder converts tokens to actual audio
Key features:
- Up to 30 seconds of high-quality 32 kHz audio
- Multiple genres and styles
- Control over instruments, tempo, mood
- Stereo output option
Usage:
var model = new MusicGenModel<float>(options);
var audio = model.GenerateMusic("Calm piano melody with soft strings");
Reference: "Simple and Controllable Music Generation" by Copet et al., Meta AI, 2023
Constructors
MusicGenModel(NeuralNetworkArchitecture<T>, MusicGenOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a MusicGen model using native layers for training from scratch.
public MusicGenModel(NeuralNetworkArchitecture<T> architecture, MusicGenOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
options (MusicGenOptions?): MusicGen configuration options.
tokenizer (ITokenizer?): Optional tokenizer. If null, creates a T5-compatible tokenizer.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optional optimizer. Defaults to AdamW.
lossFunction (ILossFunction<T>?): Optional loss function. Defaults to CrossEntropy.
Remarks
For Beginners: Use this constructor when:
- Training MusicGen from scratch (requires significant data)
- Fine-tuning on custom music styles
- Research and experimentation
For most use cases, load pretrained ONNX models instead.
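Example (a minimal sketch; the default-constructed architecture is illustrative, and the exact configuration depends on your project):
// Hypothetical architecture setup; adjust to your data and layer configuration.
var architecture = new NeuralNetworkArchitecture<float>();
// Null options, tokenizer, optimizer, and loss function fall back to the
// documented defaults (T5-compatible tokenizer, AdamW, CrossEntropy).
var model = new MusicGenModel<float>(architecture);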
MusicGenModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, MusicGenOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a MusicGen model using pretrained ONNX models for inference.
public MusicGenModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string languageModelPath, string encodecDecoderPath, ITokenizer tokenizer, MusicGenOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
textEncoderPath (string): Path to the T5 text encoder ONNX model.
languageModelPath (string): Path to the transformer LM ONNX model.
encodecDecoderPath (string): Path to the EnCodec decoder ONNX model.
tokenizer (ITokenizer): T5 tokenizer for text processing (required).
options (MusicGenOptions?): MusicGen configuration options.
optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?): Optional optimizer for fine-tuning.
lossFunction (ILossFunction<T>?): Optional loss function.
Exceptions
- ArgumentException
Thrown when required paths are empty.
- FileNotFoundException
Thrown when model files don't exist.
- ArgumentNullException
Thrown when tokenizer is null.
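Example (a sketch; the file paths are placeholders, and architecture and t5Tokenizer stand in for objects you construct yourself):
var model = new MusicGenModel<float>(
    architecture,
    textEncoderPath: "models/text_encoder.onnx",       // placeholder path
    languageModelPath: "models/language_model.onnx",   // placeholder path
    encodecDecoderPath: "models/encodec_decoder.onnx", // placeholder path
    tokenizer: t5Tokenizer); // must be non-null, otherwise ArgumentNullException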
Properties
MaxDurationSeconds
Gets the maximum duration of audio that can be generated.
public double MaxDurationSeconds { get; }
Property Value
- double
SampleRate
Gets the sample rate of generated audio.
public int SampleRate { get; }
Property Value
- int
SupportsAudioContinuation
Gets whether this model supports audio continuation.
public bool SupportsAudioContinuation { get; }
Property Value
- bool
SupportsAudioInpainting
Gets whether this model supports audio inpainting.
public bool SupportsAudioInpainting { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public bool SupportsTextToMusic { get; }
Property Value
- bool
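These capability flags let calling code check what the model supports before invoking the corresponding method, for example:
if (model.SupportsTextToMusic)
{
    var music = model.GenerateMusic("lo-fi hip hop beat");
}
// SupportsAudioInpainting is false for MusicGen; see InpaintAudio below.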
Methods
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues existing audio by extending it.
public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
Parameters
inputAudio (Tensor<T>): Audio to continue from.
prompt (string?): Optional text guidance for the continuation.
extensionSeconds (double): How many seconds to add.
numInferenceSteps (int): Not used in autoregressive generation.
seed (int?): Random seed.
Returns
- Tensor<T>
Extended audio (original + continuation).
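Example (a sketch; existingAudio stands in for a waveform tensor you already have):
// Extend an existing clip by 5 seconds, guided by an optional text prompt.
var extended = model.ContinueAudio(
    existingAudio,
    prompt: "continue with a gentle string section",
    extensionSeconds: 5,
    seed: 42);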
CreateNewInstance()
Creates a new instance for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader)
Dispose(bool)
Disposes of resources.
protected override void Dispose(bool disposing)
Parameters
disposing (bool)
GenerateAudio(string, string?, double, int, double, int?)
Generates audio from a text description.
public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string)
negativePrompt (string?)
durationSeconds (double)
numInferenceSteps (int)
guidanceScale (double)
seed (int?)
Returns
- Tensor<T>
Remarks
MusicGen is optimized for music, not general audio. For best results, use GenerateMusic instead.
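The call shape matches GenerateMusic; for example (the prompt is illustrative):
var audio = model.GenerateAudio("rain falling on a tin roof", durationSeconds: 5);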
GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)
Generates audio asynchronously.
public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)
Parameters
prompt (string)
negativePrompt (string?)
durationSeconds (double)
numInferenceSteps (int)
guidanceScale (double)
seed (int?)
cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
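Example (a sketch showing cancellation; the two-minute timeout is illustrative):
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
var audio = await model.GenerateAudioAsync(
    "ambient synth pad",
    durationSeconds: 10,
    cancellationToken: cts.Token);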
GenerateMusic(string, string?, double, int, double, int?)
Generates music from a text description.
public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
prompt (string): Text description of the desired music.
negativePrompt (string?): What to avoid in the generated music.
durationSeconds (double): Duration of music to generate (max 30 s).
numInferenceSteps (int): Not used in autoregressive generation.
guidanceScale (double): How closely to follow the prompt.
seed (int?): Random seed for reproducibility.
Returns
- Tensor<T>
Generated music waveform tensor.
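Example (parameter values are illustrative):
// A fixed seed makes the output reproducible across runs.
var music = model.GenerateMusic(
    "Calm piano melody with soft strings",
    negativePrompt: "drums, distortion",
    durationSeconds: 15,
    guidanceScale: 3.0,
    seed: 1234);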
GetDefaultOptions()
Gets default generation options.
public AudioGenerationOptions<T> GetDefaultOptions()
Returns
- AudioGenerationOptions<T>
GetModelMetadata()
Gets model metadata.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the neural network layers following the golden standard pattern.
protected override void InitializeLayers()
Remarks
This method follows the AiDotNet golden standard pattern:
1. First, check if the user provided custom layers via Architecture.Layers.
2. If custom layers exist, use them (allows full customization).
3. Otherwise, use LayerHelper.CreateDefaultMusicGenLayers() for the standard architecture.
For Beginners: This gives you flexibility:
- Want standard MusicGen? Just create the model; it auto-configures.
- Want custom architecture? Pass your own layers in the Architecture.
InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)
Inpainting is not supported by MusicGen.
public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)
Parameters
audio (Tensor<T>)
mask (Tensor<T>)
prompt (string?)
numInferenceSteps (int)
seed (int?)
Returns
- Tensor<T>
PostprocessOutput(Tensor<T>)
Postprocesses model output.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter)
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>)
expectedOutput (Tensor<T>)
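Example (a minimal sketch; what the two tensors contain, e.g. tokenized text features and target EnCodec codes, is an assumption about the training setup):
// inputBatch and targetCodes stand in for real training tensors.
model.Train(inputBatch, targetCodes);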
UpdateParameters(Vector<T>)
Updates model parameters.
public override void UpdateParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>)