Enum StableAudioModelSize

Namespace
AiDotNet.Audio.StableAudio
Assembly
AiDotNet.dll

Specifies the size variant of the Stable Audio model.

public enum StableAudioModelSize

Fields

Base = 1

Base model variant (800M parameters). Default choice.

  • T5 encoder: 768 hidden dim
  • DiT: 1024 hidden dim, 24 blocks
  • Good balance of quality and speed

Large = 2

Large model variant (1.5B parameters).

  • T5 encoder: 1024 hidden dim
  • DiT: 1536 hidden dim, 32 blocks
  • Highest quality, requires significant GPU memory

Open = 3

Stable Audio Open variant.

  • Open-source model with permissive license
  • Optimized for music generation
  • Based on Base architecture

Small = 0

Small model variant (300M parameters).

  • T5 encoder: 256 hidden dim
  • DiT: 512 hidden dim, 12 blocks
  • Fast inference, suitable for experimentation

V2 = 4

Stable Audio 2.0 variant.

  • Improved architecture with better coherence
  • Extended duration support (up to 3 minutes)
  • Enhanced stereo output
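
The dimensions listed for each field can be collected into a lookup, which is handy when estimating memory needs before loading a model. The following is a minimal sketch, not part of AiDotNet: `StableAudioDims` and `StableAudioSizeInfo` are hypothetical names, and the enum is redeclared locally only so the snippet compiles on its own. Open reuses the Base figures because it is documented as based on the Base architecture; V2 returns null because no dimensions are listed for it.

```csharp
using System;

// Local mirror of the documented enum so this sketch is self-contained.
// In a real project, reference AiDotNet.Audio.StableAudio.StableAudioModelSize instead.
public enum StableAudioModelSize { Small = 0, Base = 1, Large = 2, Open = 3, V2 = 4 }

// Hypothetical container for the figures listed under Fields (not an AiDotNet type).
public sealed record StableAudioDims(int T5HiddenDim, int DitHiddenDim, int DitBlocks);

public static class StableAudioSizeInfo
{
    // Returns the documented dimensions, or null when the page lists none (V2).
    public static StableAudioDims? GetDims(StableAudioModelSize size) => size switch
    {
        StableAudioModelSize.Small => new StableAudioDims(256, 512, 12),
        StableAudioModelSize.Base  => new StableAudioDims(768, 1024, 24),
        StableAudioModelSize.Large => new StableAudioDims(1024, 1536, 32),
        // Assumption: Open shares the Base dimensions ("Based on Base architecture").
        StableAudioModelSize.Open  => new StableAudioDims(768, 1024, 24),
        StableAudioModelSize.V2    => null,
        _ => throw new ArgumentOutOfRangeException(nameof(size), size, "Unknown model size"),
    };
}
```

Calling `StableAudioSizeInfo.GetDims(StableAudioModelSize.Large)` yields the 1024 / 1536 / 32 triple from the Large entry. The default arm throws, so an integer cast to an undefined enum value is rejected instead of silently mapped.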

Remarks

Stable Audio is a latent diffusion model by Stability AI for high-quality audio generation. It uses a Diffusion Transformer (DiT) architecture instead of U-Net for improved quality and supports variable-length audio generation.

For Beginners: Think of the model sizes as different quality levels:

  • Small: Fast generation, good for experimentation (300M parameters)
  • Base: Balanced quality and speed (800M parameters)
  • Large: Best quality, requires more resources (1.5B parameters)
  • Open: Open-source variant with permissive license
  • V2: Stable Audio 2.0 variant with extended duration support (up to 3 minutes)

Start with Small for testing; use Large for production if you have the GPU memory for it.
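
Switching between a test size and a production size is often driven by configuration. The sketch below shows one way to read the size from a string and fall back to Base, which this page describes as the default choice. `StableAudioSizeParser` and `ParseOrDefault` are hypothetical helpers, not AiDotNet APIs, and the enum is redeclared locally only to keep the snippet self-contained.

```csharp
using System;

// Local mirror of the documented enum so this sketch is self-contained.
// In a real project, reference AiDotNet.Audio.StableAudio.StableAudioModelSize instead.
public enum StableAudioModelSize { Small = 0, Base = 1, Large = 2, Open = 3, V2 = 4 }

public static class StableAudioSizeParser
{
    // Parses a name such as "small" or "Large" (case-insensitive).
    // Falls back to Base for null, unknown names, or out-of-range numbers.
    public static StableAudioModelSize ParseOrDefault(string? text)
    {
        // Enum.TryParse also accepts numeric strings like "7", so the result
        // is checked with Enum.IsDefined before it is trusted.
        if (Enum.TryParse<StableAudioModelSize>(text, ignoreCase: true, out var size)
            && Enum.IsDefined(typeof(StableAudioModelSize), size))
        {
            return size;
        }

        return StableAudioModelSize.Base;
    }
}
```

With this helper, a test configuration can specify "Small" and a production configuration "Large", while a missing or mistyped value resolves to Base rather than an undefined enum value.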