Enum AudioLDMModelSize
Specifies the size variant of the AudioLDM model.
public enum AudioLDMModelSize
Fields
Base = 1
Base model variant (740M parameters). Default choice.
- CLAP encoder: 512 hidden dim
- U-Net: 512 base channels, 8 attention heads
- Good balance of quality and speed
Large = 2
Large model variant (1.5B parameters).
- CLAP encoder: 768 hidden dim
- U-Net: 768 base channels, 8 attention heads
- Highest quality, requires significant GPU memory
Music = 4
Music-specialized variant.
- Fine-tuned on music datasets
- Better instrument separation
- Improved musical coherence
Small = 0
Small model variant (345M parameters).
- CLAP encoder: 256 hidden dim
- U-Net: 320 base channels, 4 attention heads
- Fast inference, suitable for experimentation
V2 = 3
AudioLDM-2 variant with improved architecture.
- Uses GPT-2 style text encoder
- Improved CLAP conditioning
- Better audio-text alignment
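Collecting the field values above, the enum declaration implied by this reference can be reconstructed as:

```csharp
/// <summary>Specifies the size variant of the AudioLDM model.</summary>
public enum AudioLDMModelSize
{
    /// <summary>Small model variant (345M parameters).</summary>
    Small = 0,

    /// <summary>Base model variant (740M parameters). Default choice.</summary>
    Base = 1,

    /// <summary>Large model variant (1.5B parameters).</summary>
    Large = 2,

    /// <summary>AudioLDM-2 variant with improved architecture.</summary>
    V2 = 3,

    /// <summary>Music-specialized variant.</summary>
    Music = 4
}
```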
Remarks
AudioLDM (Audio Latent Diffusion Model) comes in several size variants that trade off output quality against computational requirements. All variants use latent diffusion in a compressed audio representation space.
For Beginners: Think of model sizes like different quality levels:
- Small: Fast generation, good for experimentation (345M parameters)
- Base: Balanced quality and speed (740M parameters)
- Large: Best quality, requires more resources (1.5B parameters)
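A minimal sketch of how a caller might map each size to the hyperparameters listed in the field descriptions above. The `ModelConfig` record and `GetConfig` helper are illustrative, not part of the library's API; the reference does not list hyperparameters for V2 or Music, so those fall back to the Base values here as a placeholder:

```csharp
// Enum repeated here so the sketch compiles standalone.
public enum AudioLDMModelSize { Small = 0, Base = 1, Large = 2, V2 = 3, Music = 4 }

// Hypothetical container for the per-variant hyperparameters in this reference.
public readonly record struct ModelConfig(int ClapHiddenDim, int UNetBaseChannels, int AttentionHeads);

public static class AudioLDMConfigs
{
    // Values taken from the field descriptions above.
    public static ModelConfig GetConfig(AudioLDMModelSize size) => size switch
    {
        AudioLDMModelSize.Small => new ModelConfig(256, 320, 4),
        AudioLDMModelSize.Large => new ModelConfig(768, 768, 8),
        // Base; also a placeholder for V2 and Music, whose dims are not documented here.
        _ => new ModelConfig(512, 512, 8),
    };
}
```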