Class AudioProcessor<T>

Namespace: AiDotNet.Diffusion.Audio
Assembly: AiDotNet.dll

Complete audio processing pipeline for diffusion-based audio generation.

public class AudioProcessor<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
AudioProcessor<T>

Remarks

This class combines STFT, Mel spectrogram, and Griffin-Lim into a unified pipeline for audio analysis and synthesis. It's designed for use with diffusion models like Riffusion that generate spectrograms.

For Beginners: This is your one-stop shop for working with audio in diffusion models. It handles:

  • Converting audio waveforms to spectrograms (for training/conditioning)
  • Converting spectrograms back to audio (for generation)
  • Normalizing and denormalizing spectrograms

Typical workflow for Riffusion-style generation:

var processor = new AudioProcessor<float>(sampleRate: 44100);

// Encode reference audio to latent space (via spectrogram)
var spectrogram = processor.AudioToSpectrogram(referenceAudio);
var normalized = processor.NormalizeSpectrogram(spectrogram);

// ... diffusion model generates new spectrogram ...

// Decode generated spectrogram back to audio
var denormalized = processor.DenormalizeSpectrogram(generatedSpec);
var audio = processor.SpectrogramToAudio(denormalized);

Constructors

AudioProcessor(int, int, int, int, double, double?, double, double, int)

Initializes a new audio processor with Riffusion-compatible defaults.

public AudioProcessor(int sampleRate = 44100, int nFft = 2048, int hopLength = 512, int nMels = 512, double fMin = 0, double? fMax = null, double minDb = -100, double maxDb = 20, int griffinLimIterations = 60)

Parameters

sampleRate int

Audio sample rate in Hz (default: 44100).

nFft int

FFT size (default: 2048).

hopLength int

Hop length (default: 512).

nMels int

Number of Mel bins (default: 512 for Riffusion).

fMin double

Minimum frequency (default: 0).

fMax double?

Maximum frequency (default: sampleRate/2).

minDb double

Minimum dB for normalization (default: -100).

maxDb double

Maximum dB for normalization (default: 20).

griffinLimIterations int

Griffin-Lim iterations (default: 60).

Remarks

For Beginners: Default parameters are optimized for Riffusion-style generation at a 44.1 kHz sample rate. For speech processing, you might use:

  • sampleRate: 16000 or 22050
  • nMels: 80 (common for speech)
  • nFft: 1024 or 512

Properties

GriffinLim

Gets the Griffin-Lim processor.

public GriffinLim<T> GriffinLim { get; }

Property Value

GriffinLim<T>

HopLength

Gets the hop length.

public int HopLength { get; }

Property Value

int

MelSpectrogram

Gets the Mel spectrogram processor.

public MelSpectrogram<T> MelSpectrogram { get; }

Property Value

MelSpectrogram<T>

NFft

Gets the FFT size.

public int NFft { get; }

Property Value

int

NumMels

Gets the number of Mel bins.

public int NumMels { get; }

Property Value

int

STFT

Gets the STFT processor.

public ShortTimeFourierTransform<T> STFT { get; }

Property Value

ShortTimeFourierTransform<T>

SampleRate

Gets the sample rate.

public int SampleRate { get; }

Property Value

int

Methods

AudioToImageSpectrogram(Tensor<T>, int, int)

Creates a spectrogram suitable for image-based diffusion models.

public Tensor<T> AudioToImageSpectrogram(Tensor<T> audio, int targetWidth = 512, int targetHeight = 512)

Parameters

audio Tensor<T>

Input audio tensor.

targetWidth int

Target spectrogram width (time dimension).

targetHeight int

Target spectrogram height (frequency dimension).

Returns

Tensor<T>

Resized spectrogram tensor [height, width].

Remarks

For Beginners: Diffusion models often expect fixed-size inputs like 512x512 or 1024x1024. This method creates a spectrogram and resizes it to match those dimensions.
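The resizing step is conceptually simple index remapping. The sketch below is not AiDotNet's implementation; it is a minimal nearest-neighbor resize in Python/NumPy, assuming a 2D spectrogram laid out as [time, frequency] or [frequency, time].

```python
import numpy as np

def resize_nearest(spec: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Nearest-neighbor resize of a 2D spectrogram to [target_h, target_w]."""
    rows = (np.arange(target_h) * spec.shape[0] / target_h).astype(int)
    cols = (np.arange(target_w) * spec.shape[1] / target_w).astype(int)
    return spec[np.ix_(rows, cols)]

spec = np.random.rand(431, 128)          # e.g. ~5 s of audio, 128 Mel bins
image = resize_nearest(spec, 512, 512)   # fixed size for the diffusion model
```

A production implementation would more likely use bilinear interpolation to avoid blocky artifacts, but the shape contract is the same.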

AudioToSpectrogram(Tensor<T>)

Converts audio waveform to a normalized Mel spectrogram.

public Tensor<T> AudioToSpectrogram(Tensor<T> audio)

Parameters

audio Tensor<T>

Audio waveform tensor.

Returns

Tensor<T>

Normalized Mel spectrogram [numFrames, nMels] in range [0, 1].

Remarks

For Beginners: This converts audio into a 2D image-like representation that can be processed by image-based diffusion models.
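The standard pipeline behind a method like this is: Hann-windowed magnitude STFT, a triangular Mel filterbank, conversion to dB, then squashing into [0, 1]. The Python sketch below illustrates that pipeline with the class defaults (nFft=2048, hopLength=512, nMels=512, minDb=-100, maxDb=20); the exact windowing, padding, and filterbank conventions inside AiDotNet may differ.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def audio_to_spectrogram(audio, sr=44100, n_fft=2048, hop=512, n_mels=512,
                         min_db=-100.0, max_db=20.0):
    # 1. Magnitude STFT: Hann-windowed frames -> rfft
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i*hop : i*hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))           # [n_frames, n_fft//2 + 1]

    # 2. Triangular Mel filterbank, linearly spaced on the Mel scale up to sr/2
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):  fb[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):  fb[i, k] = (hi - k) / max(hi - c, 1)
    mel = mag @ fb.T                                     # [n_frames, n_mels]

    # 3. Amplitude -> dB, then normalize into [0, 1]
    db = 20.0 * np.log10(np.maximum(mel, 1e-10))
    return np.clip((db - min_db) / (max_db - min_db), 0.0, 1.0)

audio = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)  # 1 s of 440 Hz
spec = audio_to_spectrogram(audio)                          # [83, 512] in [0, 1]
```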

DenormalizeSpectrogram(Tensor<T>)

Denormalizes a [0, 1] spectrogram back to dB.

public Tensor<T> DenormalizeSpectrogram(Tensor<T> normalizedSpectrogram)

Parameters

normalizedSpectrogram Tensor<T>

Normalized spectrogram.

Returns

Tensor<T>

Spectrogram in dB.

DurationToFrames(double)

Computes the number of frames for a given duration.

public int DurationToFrames(double durationSeconds)

Parameters

durationSeconds double

Duration in seconds.

Returns

int

Number of spectrogram frames.

DurationToSamples(double)

Computes the number of samples for a given duration.

public int DurationToSamples(double durationSeconds)

Parameters

durationSeconds double

Duration in seconds.

Returns

int

Number of audio samples.

FramesToDuration(int)

Computes the duration of audio from spectrogram dimensions.

public double FramesToDuration(int numFrames)

Parameters

numFrames int

Number of spectrogram frames.

Returns

double

Duration in seconds.
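The three conversion methods above presumably reduce to hop-length arithmetic. This Python sketch shows the likely relationships under the class defaults (sampleRate=44100, hopLength=512); the exact frame count can differ by one depending on the library's padding/centering convention.

```python
SAMPLE_RATE, HOP_LENGTH = 44100, 512

def duration_to_samples(seconds: float) -> int:
    return int(seconds * SAMPLE_RATE)

def duration_to_frames(seconds: float) -> int:
    # one spectrogram frame per hop of the sliding STFT window
    return 1 + duration_to_samples(seconds) // HOP_LENGTH

def frames_to_duration(num_frames: int) -> float:
    return num_frames * HOP_LENGTH / SAMPLE_RATE

print(duration_to_samples(5.0))   # 220500
print(duration_to_frames(5.0))    # 431
print(frames_to_duration(431))    # ~5.004 s
```

Note that frames-to-duration and duration-to-frames are only inverses up to hop-length rounding.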

GetMelFrequencyAxis()

Gets the frequency axis values for a Mel spectrogram.

public double[] GetMelFrequencyAxis()

Returns

double[]

Array of center frequencies in Hz for each Mel bin.
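The center frequencies of a Mel filterbank are evenly spaced on the Mel scale, not in Hz. The sketch below assumes the common HTK-style Mel formula (2595 · log10(1 + f/700)); AiDotNet may use a different Mel variant (e.g. Slaney-style), so treat this as an illustration of the shape of the axis, not the library's exact values.

```python
import numpy as np

def mel_center_frequencies(n_mels=512, f_min=0.0, f_max=22050.0):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 points define the filters; the inner points are the centers
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(mels[1:-1])

centers = mel_center_frequencies()  # 512 frequencies, dense at low Hz
```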

GetTimeAxis(int)

Gets the time axis values for a spectrogram.

public double[] GetTimeAxis(int numFrames)

Parameters

numFrames int

Number of spectrogram frames.

Returns

double[]

Array of time values in seconds for each frame.

NormalizeAudio(Tensor<T>, double)

Normalizes audio to a peak amplitude.

public Tensor<T> NormalizeAudio(Tensor<T> audio, double targetPeak = 0.95)

Parameters

audio Tensor<T>

Input audio tensor.

targetPeak double

Target peak amplitude (default: 0.95).

Returns

Tensor<T>

Normalized audio tensor.
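Peak normalization scales the whole waveform so its largest absolute sample equals the target peak. A minimal Python sketch of the idea (not AiDotNet's code; note the guard for all-zero input, whose behavior in the library is not documented here):

```python
import numpy as np

def normalize_audio(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    peak = np.max(np.abs(audio))
    if peak == 0:                      # avoid dividing silence by zero
        return audio
    return audio * (target_peak / peak)

loud = np.array([0.1, -2.0, 1.5])      # clipping: peak magnitude is 2.0
out = normalize_audio(loud)            # peak magnitude is now 0.95
```

A target just below 1.0 (the default 0.95) leaves headroom so later processing does not clip.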

NormalizeSpectrogram(Tensor<T>)

Normalizes a dB spectrogram to [0, 1] range.

public Tensor<T> NormalizeSpectrogram(Tensor<T> dbSpectrogram)

Parameters

dbSpectrogram Tensor<T>

Spectrogram in dB.

Returns

Tensor<T>

Normalized spectrogram in [0, 1].
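Normalization and denormalization are a linear map between the dB range [minDb, maxDb] and [0, 1]. The Python sketch below uses the class defaults (-100 dB, +20 dB) and assumes values outside the range are clipped; clipped values are not recovered on the way back.

```python
import numpy as np

MIN_DB, MAX_DB = -100.0, 20.0

def normalize_spectrogram(db):
    return np.clip((db - MIN_DB) / (MAX_DB - MIN_DB), 0.0, 1.0)

def denormalize_spectrogram(norm):
    return norm * (MAX_DB - MIN_DB) + MIN_DB

db = np.array([-120.0, -100.0, -40.0, 20.0])
norm = normalize_spectrogram(db)        # [0. , 0. , 0.5, 1. ]
back = denormalize_spectrogram(norm)    # [-100., -100., -40., 20.]  (-120 was clipped)
```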

PadOrTruncate(Tensor<T>, int)

Pads or truncates audio to a specific length.

public Tensor<T> PadOrTruncate(Tensor<T> audio, int targetLength)

Parameters

audio Tensor<T>

Input audio tensor.

targetLength int

Target length in samples.

Returns

Tensor<T>

Audio tensor of the specified length (zero-padded if shorter, truncated if longer).

PadOrTruncate(Tensor<T>, int, T)

Pads or truncates audio to a specific length.

public Tensor<T> PadOrTruncate(Tensor<T> audio, int targetLength, T padValue)

Parameters

audio Tensor<T>

Input audio tensor.

targetLength int

Target length in samples.

padValue T

Value to use for padding.

Returns

Tensor<T>

Audio tensor of the specified length (padded with padValue if shorter, truncated if longer).
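Both overloads follow the same pattern: cut the audio if it is too long, append a fill value if it is too short. A minimal Python sketch of that pattern (not the library's code):

```python
import numpy as np

def pad_or_truncate(audio: np.ndarray, target_length: int,
                    pad_value: float = 0.0) -> np.ndarray:
    if len(audio) >= target_length:
        return audio[:target_length]           # truncate
    pad = np.full(target_length - len(audio), pad_value, dtype=audio.dtype)
    return np.concatenate([audio, pad])        # pad at the end

x = np.array([1.0, 2.0, 3.0])
padded = pad_or_truncate(x, 5)   # [1., 2., 3., 0., 0.]
cut = pad_or_truncate(x, 2)      # [1., 2.]
```

This is useful before AudioToSpectrogram when a model expects a fixed number of frames.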

SpectrogramToAudio(Tensor<T>, int?)

Converts a normalized spectrogram back to audio.

public Tensor<T> SpectrogramToAudio(Tensor<T> spectrogram, int? length = null)

Parameters

spectrogram Tensor<T>

Normalized spectrogram [numFrames, nMels] in range [0, 1].

length int?

Expected output length in samples (optional).

Returns

Tensor<T>

Audio waveform tensor.

Remarks

For Beginners: This takes a spectrogram (e.g., generated by a diffusion model) and converts it back to an audio waveform that can be played.
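The hard part of this inversion is that a magnitude spectrogram has no phase, which is why the class uses Griffin-Lim: iteratively guess a phase, resynthesize, re-analyze, and keep only the target magnitude each round. The sketch below is a generic Griffin-Lim loop built on SciPy's STFT, purely for illustration; AudioProcessor additionally has to undo the [0, 1] normalization, the dB scaling, and the Mel filterbank before this phase-recovery step.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=60, nperseg=512):
    """Recover a waveform from an STFT magnitude by iterative phase estimation."""
    rng = np.random.default_rng(0)
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg)       # back to the time domain
        _, _, rebuilt = stft(x, nperseg=nperseg)  # re-analyze the estimate
        # keep the target magnitude, adopt the re-analyzed phase
        spec = magnitude * np.exp(1j * np.angle(rebuilt))
    _, x = istft(spec, nperseg=nperseg)
    return x

# magnitude taken from a known signal, so frame shapes stay consistent
tone = np.sin(2 * np.pi * 220 * np.arange(8192) / 8000)
_, _, Z = stft(tone, nperseg=512)
recovered = griffin_lim(np.abs(Z))
```

More iterations (the class defaults to 60) trade synthesis time for fewer phase artifacts in the output.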