Class StableAudioModel<T>
- Namespace
- AiDotNet.Audio.StableAudio
- Assembly
- AiDotNet.dll
Stable Audio model for generating high-quality audio from text descriptions.
public class StableAudioModel<T> : AudioNeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IAudioGenerator<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
- T: The numeric type used for calculations.
- Inheritance
- StableAudioModel<T>
Remarks
Stable Audio is Stability AI's audio generation model. It uses latent diffusion with a Diffusion Transformer (DiT) architecture to generate music and sound effects.
Architecture components:
- T5 Text Encoder: Encodes text prompts into conditioning embeddings
- VAE: Compresses audio to and from latent space (44.1 kHz audio to a 21.5 Hz latent sequence)
- DiT (Diffusion Transformer): Predicts noise using transformer blocks
- Timing Conditioning: Encodes duration and timing information
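To make the VAE compression figure concrete, the arithmetic below works out the implied downsampling ratio and the number of latent frames for a clip. It is a minimal sketch that uses only the two rates quoted above; the `LatentMath` class and its method names are hypothetical and not part of AiDotNet.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet: arithmetic on the rates quoted above.
public static class LatentMath
{
    public const double AudioSampleRateHz = 44100.0; // 44.1 kHz audio
    public const double LatentRateHz = 21.5;         // latent frames per second

    // Audio samples represented by one latent frame (about 2051 with these rounded rates).
    public static double SamplesPerLatentFrame() => AudioSampleRateHz / LatentRateHz;

    // Latent frames needed to cover a clip of the given duration, rounded up.
    public static int LatentFrames(double durationSeconds) =>
        (int)Math.Ceiling(durationSeconds * LatentRateHz);
}
```

For a 10 second clip, `LatentMath.LatentFrames(10)` gives 215 frames, so the DiT works on a sequence a few hundred steps long rather than 441,000 raw samples per channel.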
For Beginners: Stable Audio turns a text description into finished audio.
How it works:
- You describe the audio you want ("upbeat electronic track")
- T5 encodes your text into embeddings
- Duration and timing are encoded as conditioning
- DiT diffusion generates latent audio representations
- VAE decoder converts latents to 44.1kHz stereo audio
Key features:
- CD-quality 44.1kHz stereo output
- Variable-length generation (up to 3 minutes)
- Music and sound effects generation
- Timing-aware conditioning
Usage:
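// architecture is a NeuralNetworkArchitecture<float> and options a StableAudioOptions, both created beforehand.
var model = new StableAudioModel<float>(architecture, options);
var audio = model.GenerateAudio("Energetic rock music with electric guitar");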
Reference: "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion" by Evans et al., 2024
Constructors
StableAudioModel(NeuralNetworkArchitecture<T>, StableAudioOptions?, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a Stable Audio model using native layers for training from scratch.
public StableAudioModel(NeuralNetworkArchitecture<T> architecture, StableAudioOptions? options = null, ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- options (StableAudioOptions): Stable Audio configuration options.
- tokenizer (ITokenizer): Optional tokenizer. If null, a T5-compatible tokenizer is created.
- optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer. Defaults to AdamW.
- lossFunction (ILossFunction<T>): Optional loss function. Defaults to MSE.
Remarks
For Beginners: Use this constructor when:
- Training Stable Audio from scratch (requires significant data and compute)
- Fine-tuning on custom audio types
- Research and experimentation
For most use cases, load pretrained ONNX models instead.
StableAudioModel(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, StableAudioOptions?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a Stable Audio model using pretrained ONNX models for inference.
public StableAudioModel(NeuralNetworkArchitecture<T> architecture, string textEncoderPath, string vaePath, string ditPath, ITokenizer tokenizer, StableAudioOptions? options = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- textEncoderPath (string): Path to the T5 text encoder ONNX model.
- vaePath (string): Path to the VAE ONNX model.
- ditPath (string): Path to the DiT denoiser ONNX model.
- tokenizer (ITokenizer): T5 tokenizer for text processing. Required.
- options (StableAudioOptions): Stable Audio configuration options.
- optimizer (IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for fine-tuning.
- lossFunction (ILossFunction<T>): Optional loss function.
Exceptions
- ArgumentException
Thrown when required paths are empty.
- FileNotFoundException
Thrown when model files don't exist.
- ArgumentNullException
Thrown when tokenizer is null.
Properties
MaxDurationSeconds
Gets the maximum duration of audio that can be generated.
public double MaxDurationSeconds { get; }
Property Value
- double
SampleRate
Gets the sample rate of generated audio.
public int SampleRate { get; }
Property Value
- int
SupportsAudioContinuation
Gets whether this model supports audio continuation.
public bool SupportsAudioContinuation { get; }
Property Value
- bool
SupportsAudioInpainting
Gets whether this model supports audio inpainting.
public bool SupportsAudioInpainting { get; }
Property Value
- bool
SupportsTextToAudio
Gets whether this model supports text-to-audio generation.
public bool SupportsTextToAudio { get; }
Property Value
- bool
SupportsTextToMusic
Gets whether this model supports text-to-music generation.
public bool SupportsTextToMusic { get; }
Property Value
- bool
Methods
ContinueAudio(Tensor<T>, string?, double, int, int?)
Continues existing audio by extending it.
public Tensor<T> ContinueAudio(Tensor<T> inputAudio, string? prompt = null, double extensionSeconds = 5, int numInferenceSteps = 100, int? seed = null)
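Parameters
- inputAudio (Tensor<T>)
- prompt (string)
- extensionSeconds (double)
- numInferenceSteps (int)
- seed (int?)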
Returns
- Tensor<T>
CreateNewInstance()
Creates a new instance for cloning.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
- reader (BinaryReader)
Dispose(bool)
Disposes of model resources.
protected override void Dispose(bool disposing)
Parameters
- disposing (bool)
GenerateAudio(string, string?, double, int, double, int?)
Generates audio from a text description.
public Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string)
- negativePrompt (string)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
Returns
- Tensor<T>
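The `guidanceScale` parameter is the kind of setting used by classifier-free guidance in diffusion models. Whether `StableAudioModel<T>` combines predictions exactly this way is an assumption, since the implementation is not shown here. The sketch below uses a hypothetical `GuidanceSketch.Combine` helper on plain arrays to show the standard formula, in which the unconditional noise prediction is pushed toward the prompt-conditioned one by the scale factor.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet: standard classifier-free guidance on plain arrays.
public static class GuidanceSketch
{
    // guided[i] = uncond[i] + scale * (cond[i] - uncond[i])
    public static double[] Combine(double[] uncond, double[] cond, double scale)
    {
        if (uncond.Length != cond.Length)
            throw new ArgumentException("Predictions must have the same length.");

        var guided = new double[uncond.Length];
        for (int i = 0; i < guided.Length; i++)
        {
            guided[i] = uncond[i] + scale * (cond[i] - uncond[i]);
        }
        return guided;
    }
}
```

With a scale of 1 the result equals the conditioned prediction, and larger values follow the prompt more strongly. In many diffusion pipelines a negative prompt supplies the `uncond` branch; treat that as an assumption for this class as well.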
GenerateAudioAsync(string, string?, double, int, double, int?, CancellationToken)
Generates audio asynchronously.
public Task<Tensor<T>> GenerateAudioAsync(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null, CancellationToken cancellationToken = default)
Parameters
- prompt (string)
- negativePrompt (string)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
- cancellationToken (CancellationToken)
Returns
- Task<Tensor<T>>
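Because the method accepts a `CancellationToken`, a caller can bound how long generation is allowed to run. The sketch below shows the usual .NET timeout pattern against a stand-in task; `CancellationSketch` and `FakeGenerateAsync` are hypothetical and only simulate work with `Task.Delay`, so nothing here reflects how the model itself observes the token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in, not part of AiDotNet: demonstrates the timeout pattern only.
public static class CancellationSketch
{
    // Simulates a long-running generation call; it only waits and honors the token.
    public static async Task<string> FakeGenerateAsync(
        TimeSpan work, CancellationToken cancellationToken = default)
    {
        // Task.Delay throws TaskCanceledException if the token is cancelled first.
        await Task.Delay(work, cancellationToken);
        return "done";
    }

    // Runs the stand-in with a timeout and reports whether it finished or was cancelled.
    public static async Task<string> RunWithTimeoutAsync(TimeSpan work, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout); // cancels itself after the timeout
        try
        {
            return await FakeGenerateAsync(work, cts.Token);
        }
        catch (OperationCanceledException)
        {
            return "cancelled";
        }
    }
}
```

With the real model, `cts.Token` would be passed as the `cancellationToken` argument of `GenerateAudioAsync`. How promptly cancellation takes effect between diffusion steps is not documented here.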
GenerateMusic(string, string?, double, int, double, int?)
Generates music from a text description.
public Tensor<T> GenerateMusic(string prompt, string? negativePrompt = null, double durationSeconds = 10, int numInferenceSteps = 100, double guidanceScale = 3, int? seed = null)
Parameters
- prompt (string)
- negativePrompt (string)
- durationSeconds (double)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
Returns
- Tensor<T>
GetDefaultOptions()
Gets default generation options.
public AudioGenerationOptions<T> GetDefaultOptions()
Returns
- AudioGenerationOptions<T>
GetModelMetadata()
Gets model metadata.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
InitializeLayers()
Initializes the neural network layers following the golden standard pattern.
protected override void InitializeLayers()
InpaintAudio(Tensor<T>, Tensor<T>, string?, int, int?)
Fills in missing or masked sections of audio.
public Tensor<T> InpaintAudio(Tensor<T> audio, Tensor<T> mask, string? prompt = null, int numInferenceSteps = 100, int? seed = null)
Parameters
- audio (Tensor<T>)
- mask (Tensor<T>)
- prompt (string)
- numInferenceSteps (int)
- seed (int?)
Returns
- Tensor<T>
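The `mask` argument identifies which part of the audio should be filled in. The source does not state the mask convention or the expected tensor shape, so the sketch below is only an illustration under loud assumptions: one value per sample, with 1 marking samples to regenerate and 0 marking samples to keep. `MaskSketch.BuildMask` is a hypothetical helper that works on a plain `float[]`; converting the result to a `Tensor<T>` is left out because tensor construction is not documented here.

```csharp
using System;

// Hypothetical helper, not part of AiDotNet. Assumed convention: 1 = regenerate, 0 = keep.
public static class MaskSketch
{
    // Builds a per-sample mask that is 1 inside [startSeconds, endSeconds) and 0 elsewhere.
    public static float[] BuildMask(
        int totalSamples, int sampleRate, double startSeconds, double endSeconds)
    {
        if (totalSamples < 0) throw new ArgumentOutOfRangeException(nameof(totalSamples));
        if (sampleRate <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRate));
        if (endSeconds < startSeconds)
            throw new ArgumentException("endSeconds must not be less than startSeconds.");

        // Convert times to sample indices and clamp them to the clip.
        int start = Math.Clamp((int)Math.Round(startSeconds * sampleRate), 0, totalSamples);
        int end = Math.Clamp((int)Math.Round(endSeconds * sampleRate), 0, totalSamples);

        var mask = new float[totalSamples]; // initialized to 0 (keep)
        for (int i = start; i < end; i++)
        {
            mask[i] = 1f; // mark for regeneration
        }
        return mask;
    }
}
```

For example, `MaskSketch.BuildMask(44100 * 10, 44100, 2.0, 4.0)` marks seconds 2 to 4 of a 10 second mono clip.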
PostprocessOutput(Tensor<T>)
Postprocesses model output.
protected override Tensor<T> PostprocessOutput(Tensor<T> modelOutput)
Parameters
- modelOutput (Tensor<T>)
Returns
- Tensor<T>
Predict(Tensor<T>)
Makes a prediction using the model.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
- input (Tensor<T>)
Returns
- Tensor<T>
PreprocessAudio(Tensor<T>)
Preprocesses raw audio for model input.
protected override Tensor<T> PreprocessAudio(Tensor<T> rawAudio)
Parameters
- rawAudio (Tensor<T>)
Returns
- Tensor<T>
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
- writer (BinaryWriter)
Train(Tensor<T>, Tensor<T>)
Trains the model on input data.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
- input (Tensor<T>)
- expectedOutput (Tensor<T>)
UpdateParameters(Vector<T>)
Updates model parameters.
public override void UpdateParameters(Vector<T> parameters)
Parameters
- parameters (Vector<T>)