Enum AudioLDMModelSize

Namespace
AiDotNet.Audio.AudioLDM
Assembly
AiDotNet.dll

Specifies the size variant of the AudioLDM model.

public enum AudioLDMModelSize

Fields

Base = 1

Base model variant (740M parameters). This is the default choice.

  • CLAP encoder: 512 hidden dim
  • U-Net: 512 base channels, 8 attention heads
  • Good balance of quality and speed

Large = 2

Large model variant (1.5B parameters).

  • CLAP encoder: 768 hidden dim
  • U-Net: 768 base channels, 8 attention heads
  • Highest quality, requires significant GPU memory

Music = 4

Music-specialized variant.

  • Fine-tuned on music datasets
  • Better instrument separation
  • Improved musical coherence

Small = 0

Small model variant (345M parameters).

  • CLAP encoder: 256 hidden dim
  • U-Net: 320 base channels, 4 attention heads
  • Fast inference, suitable for experimentation

V2 = 3

AudioLDM-2 variant with improved architecture.

  • Uses GPT-2 style text encoder
  • Improved CLAP conditioning
  • Better audio-text alignment

Remarks

AudioLDM (Audio Latent Diffusion Model) comes in several size variants that balance output quality against computational requirements. All variants perform latent diffusion in a compressed audio representation space.

For Beginners: Think of model sizes like different quality levels:

  • Small: Fast generation, good for experimentation (345M parameters)
  • Base: Balanced quality and speed (740M parameters)
  • Large: Best quality, requires more resources (1.5B parameters)
Start with Small for testing; use Large for final production output.
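The selection guidance above can be sketched in a few lines of C#. Note that Recommend is a hypothetical helper written for illustration; only the AudioLDMModelSize enum and its values come from this page, and how the chosen size is consumed depends on the rest of the AiDotNet API.

```csharp
using AiDotNet.Audio.AudioLDM;

class AudioLDMSizeDemo
{
    // Map a workflow stage to a recommended size, following the
    // guidance in the Remarks: Small for experimentation, Large
    // for production, Music when generating music specifically.
    static AudioLDMModelSize Recommend(bool production, bool musicFocused)
    {
        if (musicFocused)
            return AudioLDMModelSize.Music;   // fine-tuned on music datasets

        return production
            ? AudioLDMModelSize.Large         // 1.5B params, highest quality
            : AudioLDMModelSize.Small;        // 345M params, fast iteration
    }

    static void Main()
    {
        // During experimentation, prefer the fast Small variant.
        System.Console.WriteLine(Recommend(production: false, musicFocused: false));
    }
}
```

If neither the Small nor Large trade-off fits, Base (740M parameters) is the documented default middle ground, and V2 selects the improved AudioLDM-2 architecture rather than a different parameter count.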