Enum MusicGenModelSize

Namespace
AiDotNet.Audio.MusicGen
Assembly
AiDotNet.dll

Specifies the size variant of the MusicGen model.

public enum MusicGenModelSize

Fields

Large = 2

Large model variant (3.3B parameters).

  • Text encoder: 1024 hidden dim
  • LM: 2048 hidden dim, 48 layers, 16 heads
  • Highest quality, requires significant GPU memory

Medium = 1

Medium model variant (1.5B parameters). Default choice.

  • Text encoder: 768 hidden dim
  • LM: 1536 hidden dim, 24 layers, 16 heads
  • Good balance of quality and speed

Melody = 3

Melody model variant (1.5B parameters).

  • Same architecture as Medium
  • Additionally conditioned on melody input
  • Can generate music that follows a given melody

Small = 0

Small model variant (300M parameters).

  • Text encoder: 256 hidden dim
  • LM: 1024 hidden dim, 24 layers, 16 heads
  • Fast inference, suitable for real-time applications

Stereo = 4

Stereo model variant (1.5B parameters).

  • Same architecture as Medium
  • Generates stereo audio output
  • Uses additional codebook for left/right channel separation
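
The field descriptions above can be collected into a lookup for quick comparison. The following is a minimal sketch, not part of AiDotNet: `MusicGenSpec` and `MusicGenSpecs.For` are hypothetical names, the enum is redeclared locally only so the snippet compiles on its own, and the numbers are copied from this page rather than read from the library.

```csharp
using System;

// Local mirror of the documented enum so the sketch compiles on its own;
// in a real project, reference AiDotNet.Audio.MusicGen.MusicGenModelSize instead.
public enum MusicGenModelSize
{
    Small = 0,
    Medium = 1,
    Large = 2,
    Melody = 3,
    Stereo = 4
}

// Hypothetical helper type (assumption, not an AiDotNet API).
public readonly record struct MusicGenSpec(
    string Parameters,
    int TextEncoderHiddenDim,
    int LmHiddenDim,
    int LmLayers,
    int LmHeads);

public static class MusicGenSpecs
{
    // Returns the figures listed on this page for each variant.
    public static MusicGenSpec For(MusicGenModelSize size) => size switch
    {
        MusicGenModelSize.Small => new MusicGenSpec("300M", 256, 1024, 24, 16),

        // Melody and Stereo are documented as sharing the Medium architecture.
        MusicGenModelSize.Medium
            or MusicGenModelSize.Melody
            or MusicGenModelSize.Stereo => new MusicGenSpec("1.5B", 768, 1536, 24, 16),

        MusicGenModelSize.Large => new MusicGenSpec("3.3B", 1024, 2048, 48, 16),

        _ => throw new ArgumentOutOfRangeException(nameof(size), size, "Unknown model size.")
    };
}
```

For example, `MusicGenSpecs.For(MusicGenModelSize.Large).LmLayers` evaluates to 48, and the Melody and Stereo entries compare equal to the Medium entry because record structs use value equality.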

Remarks

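MusicGen comes in several variants that trade quality against computational cost. Larger models produce higher-quality music but require more memory and compute. The Melody and Stereo variants share the Medium architecture (1.5B parameters) and differ in conditioning and output format rather than in size.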

For Beginners: Think of the model sizes as quality levels:

  • Small: Fast generation, good for experimentation (300M parameters)
  • Medium: Balanced quality and speed (1.5B parameters)
  • Large: Best quality, requires more resources (3.3B parameters)

Start with Small for testing and use Large for final production.
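
Because the recommended size differs between testing and production, it can be convenient to read it from configuration instead of hard-coding it. The sketch below is one possible approach using only the standard `Enum.TryParse` and `Enum.IsDefined` methods. `ModelSizeChooser.Parse` is a hypothetical helper, the enum is redeclared locally so the snippet is self-contained, and falling back to Medium is an assumption based on this page calling Medium the default choice.

```csharp
using System;

// Local mirror of the documented enum so the sketch compiles on its own;
// in a real project, reference AiDotNet.Audio.MusicGen.MusicGenModelSize instead.
public enum MusicGenModelSize { Small = 0, Medium = 1, Large = 2, Melody = 3, Stereo = 4 }

public static class ModelSizeChooser
{
    // Hypothetical helper (not an AiDotNet API): converts a configuration string
    // such as "small" or "Large" into an enum value, falling back to Medium.
    public static MusicGenModelSize Parse(string? text)
    {
        // Enum.TryParse also accepts numeric strings such as "7", so
        // Enum.IsDefined is used to reject values that are not named members.
        if (Enum.TryParse<MusicGenModelSize>(text, ignoreCase: true, out var size)
            && Enum.IsDefined(typeof(MusicGenModelSize), size))
        {
            return size;
        }

        return MusicGenModelSize.Medium;
    }
}
```

A test configuration could then supply `"Small"` and a production configuration `"Large"`, with missing or unrecognized values resolving to Medium.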