Class RiffusionModel<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

Riffusion model for music generation via spectrogram diffusion.

public class RiffusionModel<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
LatentDiffusionModelBase<T>
RiffusionModel<T>
Implements
ILatentDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create a Riffusion model
var riffusion = new RiffusionModel<float>();

// Generate music from text
var spectrogram = riffusion.GenerateSpectrogram(
    prompt: "jazz piano solo, smooth and relaxing",
    durationSeconds: 5.0);

// Convert to audio
var audio = riffusion.SpectrogramToAudio(spectrogram);

// Interpolate between two styles
var interpolated = riffusion.InterpolateStyles(
    promptA: "upbeat electronic dance music",
    promptB: "calm ambient soundscape",
    alpha: 0.5);

Remarks

Riffusion generates music by treating audio spectrograms as images and using Stable Diffusion to generate them. The resulting spectrograms are then converted back to audio using the Griffin-Lim algorithm or neural vocoders.

For Beginners: Riffusion creates music by first generating a "picture" of the sound (spectrogram), then converting that picture back into actual audio.

How it works:

  1. You describe the music you want: "jazz piano solo"
  2. Riffusion generates a spectrogram (visual representation of sound)
  3. The spectrogram is converted to playable audio

Key features:

  • Text-to-music generation
  • Style interpolation (blend two music styles)
  • Real-time streaming generation
  • Works with any Stable Diffusion checkpoint

What makes it unique:

  • Treats audio generation as an image generation problem
  • Can leverage all SD techniques: ControlNet, img2img, etc.
  • Fast inference compared to autoregressive music models

Technical details:

  • Uses mel-spectrograms with specific parameters
  • Typically 512x512 spectrogram images
  • Griffin-Lim or neural vocoder for audio reconstruction
  • Supports seed-based interpolation for smooth transitions
  • Compatible with LoRA adapters for style transfer
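As a worked example of how spectrogram geometry relates to clip length (the sample rate and hop length below are illustrative values, not guaranteed defaults of SpectrogramConfig):

```csharp
// Illustrative arithmetic only: the actual SpectrogramConfig values may differ.
int sampleRate = 44100;  // audio samples per second
int hopLength = 441;     // samples advanced per spectrogram column
int numFrames = 512;     // spectrogram width in columns

double secondsPerFrame = (double)hopLength / sampleRate;  // 0.01 s per column
double clipSeconds = numFrames * secondsPerFrame;         // ~5.12 s per 512-frame image
Console.WriteLine($"{clipSeconds:F2} s per {numFrames}-frame spectrogram");
```

This is why a single 512x512 spectrogram corresponds to a clip of roughly five seconds; longer durations require generating and stitching multiple images.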

Reference: Based on the Riffusion project (riffusion.com).

Constructors

RiffusionModel()

Initializes a new instance of RiffusionModel with default parameters.

public RiffusionModel()

RiffusionModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, StandardVAE<T>?, IConditioningModule<T>?, SpectrogramConfig?, int?)

Initializes a new instance of RiffusionModel with custom parameters.

public RiffusionModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, StandardVAE<T>? vae = null, IConditioningModule<T>? conditioner = null, SpectrogramConfig? spectrogramConfig = null, int? seed = null)

Parameters

options DiffusionModelOptions<T>

Configuration options.

scheduler INoiseScheduler<T>

Optional custom scheduler.

unet UNetNoisePredictor<T>

Optional custom U-Net.

vae StandardVAE<T>

Optional custom VAE.

conditioner IConditioningModule<T>

Optional text conditioning module.

spectrogramConfig SpectrogramConfig

Spectrogram configuration.

seed int?

Optional random seed.
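A minimal construction sketch using the custom-parameter overload. SpectrogramConfig is shown with its parameterless constructor only, since its members are not documented here; every argument other than seed falls back to the model's defaults when omitted:

```csharp
// Sketch: construct with a fixed seed for reproducible generation.
// All unspecified components (scheduler, U-Net, VAE, conditioner)
// fall back to their defaults.
var model = new RiffusionModel<float>(
    spectrogramConfig: new SpectrogramConfig(),
    seed: 42);
```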

Properties

Conditioner

Gets the conditioning module (optional, for conditioned generation).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

LatentChannels

Gets the number of latent channels.

public override int LatentChannels { get; }

Property Value

int

Remarks

Typically 4 for Stable Diffusion models.

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.

SpectrogramConfiguration

Gets the spectrogram configuration.

public SpectrogramConfig SpectrogramConfiguration { get; }

Property Value

SpectrogramConfig

VAE

Gets the VAE model used for encoding and decoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Creates a deep copy of the model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

A new instance with the same parameters.

DeepCopy()

Creates a deep copy of this object.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

GenerateAudio(string, string?, double, int, double?, int?)

Generates audio directly from a text prompt.

public virtual Tensor<T> GenerateAudio(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt.

durationSeconds double

Desired audio duration in seconds.

numInferenceSteps int

Denoising steps.

guidanceScale double?

Guidance scale.

seed int?

Random seed.

Returns

Tensor<T>

Audio waveform tensor.
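A usage sketch: GenerateAudio combines spectrogram generation and audio conversion in one call, so no separate SpectrogramToAudio step is needed (the prompt and argument values here are illustrative):

```csharp
// One call from prompt to waveform; fixing the seed makes the
// result reproducible across runs.
var riffusion = new RiffusionModel<float>();
var audio = riffusion.GenerateAudio(
    prompt: "lo-fi hip hop beat with vinyl crackle",
    durationSeconds: 5.0,
    numInferenceSteps: 50,
    seed: 42);
```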

GenerateSpectrogram(string, string?, double, int, double?, int?)

Generates a spectrogram from a text prompt.

public virtual Tensor<T> GenerateSpectrogram(string prompt, string? negativePrompt = null, double durationSeconds = 5, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

prompt string

Text description of the desired music.

negativePrompt string

Optional negative prompt.

durationSeconds double

Desired audio duration in seconds.

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated spectrogram tensor.
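A sketch showing the negativePrompt parameter, which the class-level example omits (prompt text is illustrative):

```csharp
// The negative prompt steers generation away from unwanted qualities,
// mirroring classifier-free guidance in image models.
var riffusion = new RiffusionModel<float>();
var spectrogram = riffusion.GenerateSpectrogram(
    prompt: "acoustic guitar fingerpicking",
    negativePrompt: "distortion, noise",
    durationSeconds: 5.0);
var audio = riffusion.SpectrogramToAudio(spectrogram);
```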

GetParameters()

Gets the parameters that can be optimized.

public override Vector<T> GetParameters()

Returns

Vector<T>

InterpolateStyles(string, string, double, double, int, double?, int?)

Interpolates between two music styles.

public virtual Tensor<T> InterpolateStyles(string promptA, string promptB, double alpha = 0.5, double durationSeconds = 5, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

promptA string

First style description.

promptB string

Second style description.

alpha double

Interpolation factor (0 = promptA, 1 = promptB).

durationSeconds double

Audio duration in seconds.

numInferenceSteps int

Denoising steps.

guidanceScale double?

Guidance scale.

seed int?

Random seed.

Returns

Tensor<T>

Interpolated spectrogram.
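A sketch of sweeping alpha to build a gradual transition between two styles (prompts and step size are illustrative). Reusing one seed across the sweep keeps each frame structurally aligned with its neighbors, which is what makes the transition smooth:

```csharp
// Sweep alpha from pure style A (0.0) to pure style B (1.0);
// each interpolated spectrogram can be converted to audio and
// the clips concatenated into one evolving track.
var riffusion = new RiffusionModel<float>();
for (double alpha = 0.0; alpha <= 1.0; alpha += 0.25)
{
    var frame = riffusion.InterpolateStyles(
        promptA: "solo cello, melancholic",
        promptB: "bright synth arpeggios",
        alpha: alpha,
        seed: 7); // same seed keeps the transition coherent
    var audio = riffusion.SpectrogramToAudio(frame);
}
```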

SetParameters(Vector<T>)

Sets the model parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to set.

Remarks

This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.

Exceptions

ArgumentException

Thrown when the length of parameters does not match ParameterCount.
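A sketch of the get/modify/set round trip an optimizer would perform. The vector passed back must have exactly ParameterCount elements or SetParameters throws:

```csharp
// Round trip: read the current parameters, then write them back.
// Any vector whose length differs from model.ParameterCount
// causes SetParameters to throw ArgumentException.
var model = new RiffusionModel<float>();
var parameters = model.GetParameters(); // length == model.ParameterCount
model.SetParameters(parameters);        // valid: lengths match
```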

SpectrogramToAudio(Tensor<T>)

Converts a spectrogram to audio waveform.

public virtual Tensor<T> SpectrogramToAudio(Tensor<T> spectrogram)

Parameters

spectrogram Tensor<T>

Input spectrogram tensor.

Returns

Tensor<T>

Audio waveform tensor.