Class StableAudioModel<T>

Namespace
AiDotNet.Audio.StableAudio
Assembly
AiDotNet.dll

Stable Audio model for generating high-quality audio from text descriptions.

public class StableAudioModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioNeuralNetworkBase<T>
StableAudioModel<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

Stable Audio is Stability AI's state-of-the-art audio generation model that uses latent diffusion with a Diffusion Transformer (DiT) architecture for high-quality music and sound effects generation.

Architecture components:

  1. T5 Text Encoder: Encodes text prompts into conditioning embeddings
  2. VAE: Compresses audio to and from the latent space (44.1 kHz audio becomes a 21.5 Hz latent sequence)
  3. DiT (Diffusion Transformer): Predicts noise using transformer blocks
  4. Timing Conditioning: Encodes duration and timing information

For Beginners: Stable Audio turns a text description into professional-quality audio.

How it works:

  1. You describe the audio you want ("upbeat electronic track")
  2. T5 encodes your text into embeddings
  3. Duration and timing are encoded as conditioning
  4. DiT diffusion generates latent audio representations
  5. VAE decoder converts latents to 44.1kHz stereo audio

Key features:

  • CD-quality 44.1kHz stereo output
  • Variable-length generation (up to 3 minutes)
  • Music and sound effects generation
  • Timing-aware conditioning

Usage:

// Construct from a configured architecture and Stable Audio options, then generate audio.
var model = new StableAudioModel<float>(architecture, options);
var audio = model.GenerateAudio("Energetic rock music with electric guitar");

Reference: "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion" by Evans et al., 2024

Constructors

StableAudioModel(NeuralNetworkArchitecture<T>, StableAudioOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a Stable Audio model using native layers for training from scratch.

public StableAudioModel(NeuralNetworkArchitecture<T> architecture, StableAudioOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

options StableAudioOptions

Stable Audio configuration options.

tokenizer ITokenizer

Optional tokenizer. If null, creates T5-compatible tokenizer.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer. Defaults to AdamW.

lossFunction ILossFunction<T>

Optional loss function. Defaults to MSE.

Remarks

For Beginners: Use this constructor when:

  • Training Stable Audio from scratch (requires significant data and compute)
  • Fine-tuning on custom audio types
  • Research and experimentation

For most use cases, load pretrained ONNX models instead.
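
Example (illustrative): constructing the model for training from scratch, assuming an already-configured NeuralNetworkArchitecture<float> instance named architecture. Leaving the remaining arguments null uses the documented defaults.

// Sketch: train-from-scratch construction. `architecture` is assumed to be configured elsewhere.
var model = new StableAudioModel<float>(
    architecture,
    options: null,         // optional Stable Audio configuration
    tokenizer: null,       // null => a T5-compatible tokenizer is created
    optimizer: null,       // null => AdamW
    lossFunction: null);   // null => MSE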

StableAudioModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, StableAudioOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a Stable Audio model using pretrained ONNX models for inference.

public StableAudioModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string vaePath, string ditPath, ITokenizer tokenizer, StableAudioOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

textEncoderPath string

Path to the T5 text encoder ONNX model.

vaePath string

Path to the VAE ONNX model.

ditPath string

Path to the DiT denoiser ONNX model.

tokenizer ITokenizer

T5 tokenizer for text processing. Required; an ArgumentNullException is thrown if it is null.

options StableAudioOptions

Stable Audio configuration options.

optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.

Exceptions

ArgumentException

Thrown when required paths are empty.

FileNotFoundException

Thrown when model files don't exist.

ArgumentNullException

Thrown when tokenizer is null.
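
Example (illustrative): inference-only construction from pretrained ONNX exports. The file paths and the t5Tokenizer variable are placeholders, and architecture is assumed to be configured elsewhere.

// Sketch: inference from pretrained ONNX components. Paths are hypothetical;
// `t5Tokenizer` is an ITokenizer you supply (it must not be null).
var model = new StableAudioModel<float>(
    architecture,
    textEncoderPath: "models/t5_encoder.onnx",
    vaePath: "models/stable_audio_vae.onnx",
    ditPath: "models/stable_audio_dit.onnx",
    tokenizer: t5Tokenizer);
var audio = model.GenerateAudio("Warm ambient pad with slow swells");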

Properties

MaxDurationSeconds

Gets the maximum duration of audio that can be generated.

public double MaxDurationSeconds { get; }

Property Value

double

SampleRate

Gets the sample rate of generated audio.

public int SampleRate { get; }

Property Value

int

SupportsAudioContinuation

Gets whether this model supports audio continuation.

public bool SupportsAudioContinuation { get; }

Property Value

bool

SupportsAudioInpainting

Gets whether this model supports audio inpainting.

public bool SupportsAudioInpainting { get; }

Property Value

bool

SupportsTextToAudio

Gets whether this model supports text-to-audio generation.

public bool SupportsTextToAudio { get; }

Property Value

bool

SupportsTextToMusic

Gets whether this model supports text-to-music generation.

public bool SupportsTextToMusic { get; }

Property Value

bool

Methods

ContinueAudio(Tensor<T>, string?, double, int, int?)

Continues existing audio by extending it.

public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)

Parameters

inputAudio Tensor<T>
prompt string
extensionSeconds double
numInferenceSteps int
seed int?

Returns

Tensor<T>
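
Example (illustrative): extending existing audio by ten seconds. existingAudio is a placeholder Tensor<float> holding audio already in the model's expected format.

// Sketch: continue `existingAudio` for 10 more seconds, guided by an optional prompt.
var extended = model.ContinueAudio(
    existingAudio,
    prompt: "continue the melody into a soft outro",
    extensionSeconds: 10,
    numInferenceSteps: 100,
    seed: 42);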

CreateNewInstance()

Creates a new instance for cloning.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

Dispose(bool)

Disposes of model resources.

protected override void Dispose(bool disposing)

Parameters

disposing bool

GenerateAudio(string, string?, double, int, double, int?)

Generates audio from a text description.

public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?

Returns

Tensor<T>
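
Example (illustrative): text-to-audio generation using the optional negative prompt, guidance scale, and seed. The values shown are illustrative rather than recommended settings.

// Sketch: 8 seconds of audio, steered away from unwanted content and seeded
// for reproducibility.
var audio = model.GenerateAudio(
    prompt: "rain on a tin roof with distant thunder",
    negativePrompt: "music, speech",
    durationSeconds: 8,
    numInferenceSteps: 100,
    guidanceScale: 3,
    seed: 1234);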

GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)

Generates audio asynchronously.

public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?
cancellationToken CancellationToken

Returns

Task<Tensor<T>>
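
Example (illustrative): asynchronous generation with a cancellation window; assumes the call runs inside an async method.

// Sketch: generate asynchronously and allow cancellation after 5 minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var audio = await model.GenerateAudioAsync(
    prompt: "ambient pad with slow attack",
    durationSeconds: 15,
    cancellationToken: cts.Token);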

GenerateMusic(string, string?, double, int, double, int?)

Generates music from a text description.

public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)

Parameters

prompt string
negativePrompt string
durationSeconds double
numInferenceSteps int
guidanceScale double
seed int?

Returns

Tensor<T>
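
Example (illustrative):

// Sketch: 30 seconds of music with a fixed seed for reproducibility.
var music = model.GenerateMusic(
    prompt: "lo-fi hip hop beat with vinyl crackle, 80 BPM",
    durationSeconds: 30,
    seed: 7);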

GetDefaultOptions()

Gets default generation options.

public AudioGenerationOptions<T> GetDefaultOptions()

Returns

AudioGenerationOptions<T>

GetModelMetadata()

Gets model metadata.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

InitializeLayers()

Initializes the neural network layers following the golden standard pattern.

protected override void InitializeLayers()

InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)

Fills in missing or masked sections of audio.

public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)

Parameters

audio Tensor<T>
mask Tensor<T>
prompt string
numInferenceSteps int
seed int?

Returns

Tensor<T>
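
Example (illustrative): both tensors are placeholders; mask marks the section to regenerate, following whatever convention the library defines.

// Sketch: fill in the masked region of `damagedAudio`, optionally guided by a prompt.
var repaired = model.InpaintAudio(
    damagedAudio,
    mask,
    prompt: "smooth transition matching the surrounding music",
    numInferenceSteps: 100);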

PostprocessOutput(Tensor<T>)

Postprocesses model output.

protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)

Parameters

modelOutput Tensor<T>

Returns

Tensor<T>

Predict(Tensor<T>)

Makes a prediction using the model.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

PreprocessAudio(Tensor<T>)

Preprocesses raw audio for model input.

protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)

Parameters

rawAudio Tensor<T>

Returns

Tensor<T>

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

Train(Tensor<T>, Tensor<T>)

Trains the model on input data.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>
expectedOutput Tensor<T>

UpdateParameters(Vector<T>)

Updates model parameters.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>