Enum MusicGenModelSize
Specifies the size variant of the MusicGen model.
public enum MusicGenModelSize
Fields
Large = 2
Large model variant (3.3B parameters).
- Text encoder: 1024 hidden dim
- LM: 2048 hidden dim, 48 layers, 16 heads
- Highest quality, requires significant GPU memory
Medium = 1
Medium model variant (1.5B parameters). Default choice.
- Text encoder: 768 hidden dim
- LM: 1536 hidden dim, 24 layers, 16 heads
- Good balance of quality and speed
Melody = 3
Melody model variant (1.5B parameters).
- Same architecture as Medium
- Additionally conditioned on melody input
- Can generate music that follows a given melody
Small = 0
Small model variant (300M parameters).
- Text encoder: 256 hidden dim
- LM: 1024 hidden dim, 24 layers, 16 heads
- Fast inference, suitable for real-time applications
Stereo = 4
Stereo model variant (1.5B parameters).
- Same architecture as Medium
- Generates stereo audio output
- Uses an additional codebook for left/right channel separation
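The field descriptions above can be collected into a single lookup. The following is a minimal sketch, assuming C# 9 or later. `MusicGenSizeInfo` and `GetDims` are hypothetical names used only for illustration and are not part of the documented API, and the enum is redeclared locally so the snippet compiles on its own.

```csharp
using System;

// Local copy of the documented enum so this sketch is self-contained.
// In real code, reference the library's MusicGenModelSize instead.
public enum MusicGenModelSize { Small = 0, Medium = 1, Large = 2, Melody = 3, Stereo = 4 }

// Hypothetical helper class; not part of the documented API.
public static class MusicGenSizeInfo
{
    // Returns the dimensions listed in the field descriptions for each variant.
    public static (int TextEncoderHidden, int LmHidden, int LmLayers, int LmHeads)
        GetDims(MusicGenModelSize size) => size switch
    {
        MusicGenModelSize.Small => (256, 1024, 24, 16),
        // Melody and Stereo are documented as sharing the Medium architecture.
        MusicGenModelSize.Medium or MusicGenModelSize.Melody or MusicGenModelSize.Stereo
            => (768, 1536, 24, 16),
        MusicGenModelSize.Large => (1024, 2048, 48, 16),
        _ => throw new ArgumentOutOfRangeException(nameof(size), size, "Unknown model size"),
    };
}
```

For example, `MusicGenSizeInfo.GetDims(MusicGenModelSize.Melody).LmHidden` evaluates to 1536, the same value as for `Medium`.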
Remarks
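MusicGen is available in several sizes that trade output quality against computational cost. Larger models generally produce higher-quality music but require more memory and compute. The Melody and Stereo values are not additional sizes: both share the Medium architecture and add melody conditioning and stereo output, respectively.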
For Beginners: Think of model sizes like different quality levels:
- Small: Fast generation, good for experimentation (300M parameters)
- Medium: Balanced quality and speed (1.5B parameters)
- Large: Best quality, requires more resources (3.3B parameters)
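To make "requires more resources" concrete, the parameter counts above give a rough lower bound on memory: weights alone need about (parameter count × bytes per parameter). The sketch below is only an estimate under that assumption. It ignores activations, caches, and any other runtime overhead. `MusicGenMemoryEstimate` and its methods are hypothetical names, not part of the documented API, and the enum is redeclared locally so the snippet compiles on its own.

```csharp
using System;

// Local copy of the documented enum so this sketch is self-contained.
public enum MusicGenModelSize { Small = 0, Medium = 1, Large = 2, Melody = 3, Stereo = 4 }

// Hypothetical helper class; not part of the documented API.
public static class MusicGenMemoryEstimate
{
    // Approximate parameter counts as stated in the enum documentation.
    public static long ApproxParameterCount(MusicGenModelSize size) => size switch
    {
        MusicGenModelSize.Small => 300_000_000L,
        MusicGenModelSize.Medium => 1_500_000_000L,
        MusicGenModelSize.Melody => 1_500_000_000L,
        MusicGenModelSize.Stereo => 1_500_000_000L,
        MusicGenModelSize.Large => 3_300_000_000L,
        _ => throw new ArgumentOutOfRangeException(nameof(size), size, "Unknown model size"),
    };

    // Weight storage only, in gigabytes (10^9 bytes).
    // bytesPerParameter: 4 for 32-bit floats, 2 for 16-bit floats.
    public static double WeightGigabytes(MusicGenModelSize size, int bytesPerParameter = 4)
        => ApproxParameterCount(size) * (double)bytesPerParameter / 1e9;
}
```

Under these assumptions, the Small variant needs roughly 1.2 GB for 32-bit weights, Medium roughly 6 GB, and Large roughly 13.2 GB. Storing weights as 16-bit floats halves each figure.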