Class MusicGenModel<T>
MusicGen - Diffusion-based music generation model with advanced musical controls.
public class MusicGenModel<T> : AudioDiffusionModelBase<T>, ILatentDiffusionModel<T>, IAudioDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T - The numeric type used for calculations.
- Inheritance: AudioDiffusionModelBase<T> -> MusicGenModel<T>
Examples
// Create a MusicGen model
var musicGen = new MusicGenModel<float>();

// Generate electronic music at a specific BPM
var edm = musicGen.GenerateMusicWithTempo(
    prompt: "Energetic electronic dance music with synthesizers",
    bpm: 128,
    durationSeconds: 30.0,
    numInferenceSteps: 200);

// Generate melody-conditioned music
// (originalMelody is a previously loaded reference waveform tensor)
var variation = musicGen.GenerateFromMelody(
    melodyAudio: originalMelody,
    prompt: "Jazz version with saxophone",
    preservationStrength: 0.7);
Remarks
MusicGenModel is a specialized diffusion model for music generation that provides fine-grained control over musical characteristics including:
- Text-to-Music: Generate music from natural language descriptions
- Melody Conditioning: Guide generation with a reference melody
- Rhythm/Beat Conditioning: Generate music following a specific rhythm pattern
- Tempo Control: Generate at specific BPM (beats per minute)
- Key/Scale Guidance: Influence the musical key of generated content
- Style Transfer: Transform existing music to different styles
For Beginners: This model generates music with precise control:
Example prompts:
- "Upbeat electronic dance music at 128 BPM" -> EDM track
- "Sad piano ballad in A minor" -> emotional piano piece
- "Funky bass groove with drums" -> funk rhythm section
- "Orchestral film score, epic and dramatic" -> cinematic music
Advanced controls:
- BPM: Set exact tempo (60-200 BPM typical)
- Key: Major/minor keys (C major, A minor, etc.)
- Instruments: Specify or exclude instruments
- Style: Jazz, rock, classical, electronic, etc.
Technical specifications:
- Sample rate: 32 kHz (high-quality music)
- Latent channels: 16 (more capacity for musical structure)
- Mel channels: 128
- Duration: Up to 60 seconds
- Guidance scale: 3.0-7.0 typical
Constructors
MusicGenModel()
Initializes a new MusicGen model with default parameters.
public MusicGenModel()
MusicGenModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, AudioVAE<T>?, IConditioningModule<T>?, MusicGenSize, int, double, int?)
Initializes a new MusicGen model with custom parameters.
public MusicGenModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, AudioVAE<T>? musicVAE = null, IConditioningModule<T>? textConditioner = null, MusicGenSize modelSize = MusicGenSize.Medium, int sampleRate = 32000, double defaultDurationSeconds = 30, int? seed = null)
Parameters
options DiffusionModelOptions<T> - Configuration options for the diffusion model.
scheduler INoiseScheduler<T> - Optional custom scheduler.
unet UNetNoisePredictor<T> - Optional custom U-Net noise predictor.
musicVAE AudioVAE<T> - Optional custom music VAE.
textConditioner IConditioningModule<T> - Optional text conditioning module.
modelSize MusicGenSize - Model size variant.
sampleRate int - Audio sample rate in Hz.
defaultDurationSeconds double - Default music duration.
seed int? - Optional random seed.
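Example (a minimal construction sketch; only MusicGenSize.Medium is referenced on this page, and any components left null fall back to the documented defaults):

// Build a MusicGen model with an explicit size, sample rate, and seed;
// the scheduler, U-Net, VAE, and conditioner default when omitted.
var customMusicGen = new MusicGenModel<float>(
    modelSize: MusicGenSize.Medium,
    sampleRate: 32000,            // MUSICGEN_SAMPLE_RATE
    defaultDurationSeconds: 20.0, // shorter default clips
    seed: 42);                    // fixed seed for reproducible output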
Fields
DEFAULT_BPM
Default BPM for music generation.
public const int DEFAULT_BPM = 120
MUSICGEN_BASE_CHANNELS
MusicGen U-Net base channels.
public const int MUSICGEN_BASE_CHANNELS = 512
MUSICGEN_CONTEXT_DIM
Context dimension for conditioning.
public const int MUSICGEN_CONTEXT_DIM = 1536
MUSICGEN_LATENT_CHANNELS
MusicGen latent space channels (larger for musical structure).
public const int MUSICGEN_LATENT_CHANNELS = 16
MUSICGEN_MAX_DURATION
Maximum supported duration in seconds.
public const double MUSICGEN_MAX_DURATION = 60
MUSICGEN_MEL_CHANNELS
MusicGen mel spectrogram channels.
public const int MUSICGEN_MEL_CHANNELS = 128
MUSICGEN_SAMPLE_RATE
MusicGen default sample rate for high-quality music.
public const int MUSICGEN_SAMPLE_RATE = 32000
Properties
Conditioner
Gets the conditioning module (optional, for conditioned generation).
public override IConditioningModule<T>? Conditioner { get; }
LatentChannels
Gets the number of latent channels.
public override int LatentChannels { get; }
Remarks
Typically 4 for Stable Diffusion models; MusicGen uses 16 latent channels (MUSICGEN_LATENT_CHANNELS) to capture additional musical structure.
MelodyEncoder
Gets the melody encoder for melody conditioning.
public MelodyEncoder<T> MelodyEncoder { get; }
ModelSize
Gets the model size variant.
public MusicGenSize ModelSize { get; }
MusicVAE
Gets the music VAE for direct access.
public AudioVAE<T> MusicVAE { get; }
Property Value
- AudioVAE<T>
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
public override INoisePredictor<T> NoisePredictor { get; }
ParameterCount
Gets the number of parameters in the model.
public override int ParameterCount { get; }
Remarks
This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.
RhythmEncoder
Gets the rhythm encoder for beat conditioning.
public RhythmEncoder<T> RhythmEncoder { get; }
SupportsAudioToAudio
Gets whether this model supports audio-to-audio transformation.
public override bool SupportsAudioToAudio { get; }
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public override bool SupportsTextToAudio { get; }
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public override bool SupportsTextToMusic { get; }
SupportsTextToSpeech
Gets whether this model supports text-to-speech generation.
public override bool SupportsTextToSpeech { get; }
VAE
Gets the VAE model used for encoding and decoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
Clone()
Creates a deep copy of the model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
A new instance with the same parameters.
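Example (a brief sketch; the cast assumes the returned copy is the same concrete MusicGenModel<T> type, which the deep-copy contract implies):

var original = new MusicGenModel<float>();

// Clone returns IDiffusionModel<T>; cast back to reach MusicGen-specific members.
var copy = (MusicGenModel<float>)original.Clone();

// The copy carries the same parameters, so later changes to one instance
// (for example via SetParameters) leave the other untouched.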
ContinueMusic(Tensor<T>, string?, double, double, int, double, int?)
Generates music continuation from an audio prompt.
public virtual Tensor<T> ContinueMusic(Tensor<T> audioPrompt, string? textPrompt = null, double continuationDurationSeconds = 15, double overlapSeconds = 2, int numInferenceSteps = 150, double guidanceScale = 4, int? seed = null)
Parameters
audioPrompt Tensor<T> - Audio to continue from.
textPrompt string - Optional text guidance for continuation.
continuationDurationSeconds double - Duration of continuation.
overlapSeconds double - Overlap with original for smooth transition.
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Continued audio waveform.
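Example (a usage sketch, assuming a MusicGenModel<float> instance named musicGen; the intro tensor comes from an earlier GenerateMusic call):

// Generate a 30-second opening section, then extend it by 15 seconds.
var intro = musicGen.GenerateMusic(
    prompt: "Warm lo-fi hip-hop with vinyl crackle",
    durationSeconds: 30.0);

var extended = musicGen.ContinueMusic(
    audioPrompt: intro,
    textPrompt: "Keep the same groove, add a mellow electric piano",
    continuationDurationSeconds: 15.0,
    overlapSeconds: 2.0, // blend region for a smooth transition
    seed: 7);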
DeepCopy()
Creates a deep copy of this object.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
GenerateFromMelody(Tensor<T>, string, double, string?, int, double, int?)
Generates music conditioned on a reference melody.
public virtual Tensor<T> GenerateFromMelody(Tensor<T> melodyAudio, string prompt, double preservationStrength = 0.6, string? negativePrompt = null, int numInferenceSteps = 200, double guidanceScale = 5, int? seed = null)
Parameters
melodyAudio Tensor<T> - Reference melody audio.
prompt string - Text description for the style/arrangement.
preservationStrength double - How closely to follow the melody (0.0-1.0).
negativePrompt string - Optional negative prompt.
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
For Beginners: Melody conditioning lets you:
- Create covers: Keep melody, change style
- Add accompaniment: Keep melody, generate instruments
- Style transfer: Transform melody to different genre
Preservation strength:
- 0.3-0.5: Use melody as loose guide
- 0.5-0.7: Balance melody with new elements
- 0.7-0.9: Closely follow original melody
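Example (a sketch of the strength ranges above, assuming a MusicGenModel<float> instance named musicGen and a preloaded melody tensor named melody):

// Loose reinterpretation: the melody is only a guide.
var loose = musicGen.GenerateFromMelody(melody,
    prompt: "Ambient electronic rework",
    preservationStrength: 0.4);

// Faithful cover: keep the tune, change the arrangement.
var faithful = musicGen.GenerateFromMelody(melody,
    prompt: "String quartet arrangement",
    preservationStrength: 0.85);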
GenerateFromRhythm(Tensor<T>, string, double, string?, int, double, int?)
Generates music conditioned on a rhythm/beat pattern.
public virtual Tensor<T> GenerateFromRhythm(Tensor<T> rhythmAudio, string prompt, double rhythmStrength = 0.5, string? negativePrompt = null, int numInferenceSteps = 200, double guidanceScale = 5, int? seed = null)
Parameters
rhythmAudio Tensor<T> - Reference rhythm/percussion audio.
prompt string - Text description for the melody/harmony.
rhythmStrength double - How closely to follow the rhythm (0.0-1.0).
negativePrompt string - Optional negative prompt.
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
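Example (a usage sketch, assuming a MusicGenModel<float> instance named musicGen and a drum-loop tensor named drumLoop loaded elsewhere):

// Keep the drum loop's groove, but generate new melody and harmony on top.
var track = musicGen.GenerateFromRhythm(
    rhythmAudio: drumLoop,
    prompt: "Funky slap bass and clavinet over the existing beat",
    rhythmStrength: 0.6,      // follow the beat fairly closely
    negativePrompt: "vocals", // steer away from sung content
    seed: 11);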
GenerateMusic(string, string?, double?, int, double, int?)
Generates music from a text prompt.
public override Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 5, int? seed = null)
Parameters
prompt string - Text description of the desired music.
negativePrompt string - Optional negative prompt.
durationSeconds double? - Duration of music to generate.
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor [1, samples].
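Example (a basic text-to-music sketch, assuming a MusicGenModel<float> instance named musicGen; the negative prompt is optional and only steers generation away from unwanted elements):

var score = musicGen.GenerateMusic(
    prompt: "Orchestral film score, epic and dramatic",
    negativePrompt: "distorted, low quality",
    durationSeconds: 45.0, // must not exceed MUSICGEN_MAX_DURATION (60 s)
    numInferenceSteps: 200,
    guidanceScale: 5.0,
    seed: 3);
// The result has shape [1, samples] at the model's 32 kHz sample rate.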
GenerateMusicWithTempo(string, int, string?, double?, int, double, int?)
Generates music with specific tempo (BPM) control.
public virtual Tensor<T> GenerateMusicWithTempo(string prompt, int bpm, string? negativePrompt = null, double? durationSeconds = null, int numInferenceSteps = 200, double guidanceScale = 5, int? seed = null)
Parameters
prompt string - Text description of the desired music.
bpm int - Target beats per minute (60-200 typical).
negativePrompt string - Optional negative prompt.
durationSeconds double? - Duration of music to generate.
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
For Beginners: BPM (Beats Per Minute) controls the tempo:
Common BPM ranges:
- 60-80: Slow ballads, ambient
- 80-100: Hip-hop, R&B
- 100-120: Pop, house
- 120-140: Techno, trance
- 140-180: Drum and bass, dubstep
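Example (a sketch pairing the ranges above with tempo-controlled calls, assuming a MusicGenModel<float> instance named musicGen):

// Slow ambient piece around 70 BPM.
var ambient = musicGen.GenerateMusicWithTempo(
    prompt: "Calm ambient pads with soft piano",
    bpm: 70);

// Driving techno around 130 BPM.
var techno = musicGen.GenerateMusicWithTempo(
    prompt: "Hypnotic techno with a pounding kick",
    bpm: 130);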
GetParameters()
Gets the parameters that can be optimized.
public override Vector<T> GetParameters()
Returns
- Vector<T>
SetParameters(Vector<T>)
Sets the model parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters Vector<T> - The parameter vector to set.
Remarks
This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that need to update parameters iteratively.
If the length of parameters does not match ParameterCount, an ArgumentException is thrown.
Exceptions
- ArgumentException
Thrown when the length of parameters does not match ParameterCount.
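Example (a round-trip sketch, assuming a MusicGenModel<float> instance named musicGen; an external optimizer would modify the vector in between):

// Read the current parameter vector.
var parameters = musicGen.GetParameters();

// ... an optimizer would update parameters here ...

// Write the vector back; its length must still equal ParameterCount,
// otherwise SetParameters throws an ArgumentException.
musicGen.SetParameters(parameters);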