Class AudioProcessor<T>
Complete audio processing pipeline for diffusion-based audio generation.
public class AudioProcessor<T>
Type Parameters
T: The numeric type used for calculations.
Inheritance
object → AudioProcessor<T>
Remarks
This class combines STFT, Mel spectrogram, and Griffin-Lim into a unified pipeline for audio analysis and synthesis. It's designed for use with diffusion models like Riffusion that generate spectrograms.
For Beginners: This is your one-stop shop for working with audio in diffusion models. It handles:
- Converting audio waveforms to spectrograms (for training/conditioning)
- Converting spectrograms back to audio (for generation)
- Normalizing and denormalizing spectrograms
Typical workflow for Riffusion-style generation:

```csharp
var processor = new AudioProcessor<float>(sampleRate: 44100);

// Encode reference audio to a normalized spectrogram (for training/conditioning).
// AudioToSpectrogram already returns values normalized to [0, 1].
var spectrogram = processor.AudioToSpectrogram(referenceAudio);

// ... diffusion model generates a new spectrogram ...

// Decode the generated (still-normalized) spectrogram back to audio.
var audio = processor.SpectrogramToAudio(generatedSpec);
```
Constructors
AudioProcessor(int, int, int, int, double, double?, double, double, int)
Initializes a new audio processor with Riffusion-compatible defaults.
public AudioProcessor(int sampleRate = 44100, int nFft = 2048, int hopLength = 512, int nMels = 512, double fMin = 0, double? fMax = null, double minDb = -100, double maxDb = 20, int griffinLimIterations = 60)
Parameters
sampleRate (int): Audio sample rate in Hz (default: 44100).
nFft (int): FFT size (default: 2048).
hopLength (int): Hop length in samples (default: 512).
nMels (int): Number of Mel bins (default: 512 for Riffusion).
fMin (double): Minimum frequency in Hz (default: 0).
fMax (double?): Maximum frequency in Hz (default: sampleRate/2).
minDb (double): Minimum dB for normalization (default: -100).
maxDb (double): Maximum dB for normalization (default: 20).
griffinLimIterations (int): Griffin-Lim iterations (default: 60).
Remarks
For Beginners: Default parameters are optimized for Riffusion-style generation at a 44.1 kHz sample rate. For speech processing, you might use:
- sampleRate: 16000 or 22050
- nMels: 80 (common for speech)
- nFft: 1024 or 512
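As a sketch, a speech-oriented configuration following those suggestions might look like this (the specific values are illustrative choices, not library defaults):

```csharp
// Speech-oriented configuration (illustrative values).
// A hop length of nFft/4 is a common STFT convention (assumption, not a requirement).
var speechProcessor = new AudioProcessor<float>(
    sampleRate: 22050,
    nFft: 1024,
    hopLength: 256,
    nMels: 80);
```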
Properties
GriffinLim
Gets the Griffin-Lim processor.
public GriffinLim<T> GriffinLim { get; }
Property Value
- GriffinLim<T>
HopLength
Gets the hop length.
public int HopLength { get; }
Property Value
- int
MelSpectrogram
Gets the Mel spectrogram processor.
public MelSpectrogram<T> MelSpectrogram { get; }
Property Value
- MelSpectrogram<T>
NFft
Gets the FFT size.
public int NFft { get; }
Property Value
- int
NumMels
Gets the number of Mel bins.
public int NumMels { get; }
Property Value
- int
STFT
Gets the STFT processor.
public ShortTimeFourierTransform<T> STFT { get; }
Property Value
- ShortTimeFourierTransform<T>
SampleRate
Gets the sample rate.
public int SampleRate { get; }
Property Value
- int
Methods
AudioToImageSpectrogram(Tensor<T>, int, int)
Creates a spectrogram suitable for image-based diffusion models.
public Tensor<T> AudioToImageSpectrogram(Tensor<T> audio, int targetWidth = 512, int targetHeight = 512)
Parameters
audio (Tensor<T>): Input audio tensor.
targetWidth (int): Target spectrogram width (time dimension).
targetHeight (int): Target spectrogram height (frequency dimension).
Returns
- Tensor<T>
Resized spectrogram tensor [height, width].
Remarks
For Beginners: Diffusion models often expect fixed-size inputs like 512x512 or 1024x1024. This method creates a spectrogram and resizes it to match those dimensions.
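A minimal usage sketch, assuming `audio` is a mono waveform tensor loaded elsewhere:

```csharp
var processor = new AudioProcessor<float>(sampleRate: 44100);

// Produce a 512x512 spectrogram "image" for an image-based diffusion model.
Tensor<float> image = processor.AudioToImageSpectrogram(audio, targetWidth: 512, targetHeight: 512);
// image now has shape [512, 512] (height = frequency, width = time).
```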
AudioToSpectrogram(Tensor<T>)
Converts audio waveform to a normalized Mel spectrogram.
public Tensor<T> AudioToSpectrogram(Tensor<T> audio)
Parameters
audio (Tensor<T>): Audio waveform tensor.
Returns
- Tensor<T>
Normalized Mel spectrogram [numFrames, nMels] in range [0, 1].
Remarks
For Beginners: This converts audio into a 2D image-like representation that can be processed by image-based diffusion models.
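A short sketch, again assuming `audio` is a waveform tensor loaded elsewhere:

```csharp
var processor = new AudioProcessor<float>();

// The result has shape [numFrames, nMels] with values already in [0, 1],
// so no separate normalization step is needed before feeding it to a model.
Tensor<float> spec = processor.AudioToSpectrogram(audio);
```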
DenormalizeSpectrogram(Tensor<T>)
Denormalizes a [0, 1] spectrogram back to dB.
public Tensor<T> DenormalizeSpectrogram(Tensor<T> normalizedSpectrogram)
Parameters
normalizedSpectrogram (Tensor<T>): Normalized spectrogram.
Returns
- Tensor<T>
Spectrogram in dB.
DurationToFrames(double)
Computes the number of frames for a given duration.
public int DurationToFrames(double durationSeconds)
Parameters
durationSeconds (double): Duration in seconds.
Returns
- int
Number of spectrogram frames.
DurationToSamples(double)
Computes the number of samples for a given duration.
public int DurationToSamples(double durationSeconds)
Parameters
durationSeconds (double): Duration in seconds.
Returns
- int
Number of audio samples.
FramesToDuration(int)
Computes the duration of audio from spectrogram dimensions.
public double FramesToDuration(int numFrames)
Parameters
numFrames (int): Number of spectrogram frames.
Returns
- double
Duration in seconds.
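These three helpers follow standard STFT bookkeeping: samples scale with the sample rate, and frames scale with the hop length (the exact frame count depends on the framing convention, so treat the arithmetic below as an approximation). A sketch with the defaults:

```csharp
var processor = new AudioProcessor<float>(sampleRate: 44100, hopLength: 512);

int samples = processor.DurationToSamples(5.0);   // likely 5 s * 44100 Hz = 220500 samples
int frames  = processor.DurationToFrames(5.0);    // roughly samples / hopLength, ~430 frames
double secs = processor.FramesToDuration(frames); // inverse mapping, back to ~5.0 s
```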
GetMelFrequencyAxis()
Gets the frequency axis values for a Mel spectrogram.
public double[] GetMelFrequencyAxis()
Returns
- double[]
Array of center frequencies in Hz for each Mel bin.
GetTimeAxis(int)
Gets the time axis values for a spectrogram.
public double[] GetTimeAxis(int numFrames)
Parameters
numFrames (int): Number of spectrogram frames.
Returns
- double[]
Array of time values in seconds for each frame.
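Together, the two axis helpers give you everything needed to label a spectrogram plot. A sketch (the `Shape` indexing on `Tensor<T>` is an assumption about its API; substitute however your tensor type exposes its frame count):

```csharp
var processor = new AudioProcessor<float>();
Tensor<float> spec = processor.AudioToSpectrogram(audio);

// Time along one axis, Mel-bin center frequencies along the other.
double[] timeAxis = processor.GetTimeAxis(spec.Shape[0]); // assumes Shape[0] is the frame count
double[] freqAxis = processor.GetMelFrequencyAxis();      // one center frequency (Hz) per Mel bin
```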
NormalizeAudio(Tensor<T>, double)
Normalizes audio to a peak amplitude.
public Tensor<T> NormalizeAudio(Tensor<T> audio, double targetPeak = 0.95)
Parameters
audio (Tensor<T>): Input audio tensor.
targetPeak (double): Target peak amplitude (default: 0.95).
Returns
- Tensor<T>
Normalized audio tensor.
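A one-line usage sketch, assuming `quietAudio` is a low-level recording loaded elsewhere:

```csharp
// Scale the waveform so its loudest sample sits at 0.95,
// leaving a little headroom below full scale to avoid clipping.
var normalized = processor.NormalizeAudio(quietAudio, targetPeak: 0.95);
```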
NormalizeSpectrogram(Tensor<T>)
Normalizes a dB spectrogram to [0, 1] range.
public Tensor<T> NormalizeSpectrogram(Tensor<T> dbSpectrogram)
Parameters
dbSpectrogram (Tensor<T>): Spectrogram in dB.
Returns
- Tensor<T>
Normalized spectrogram in [0, 1].
PadOrTruncate(Tensor<T>, int)
Pads or truncates audio to a specific length.
public Tensor<T> PadOrTruncate(Tensor<T> audio, int targetLength)
Parameters
audio (Tensor<T>): Input audio tensor.
targetLength (int): Target length in samples.
Returns
- Tensor<T>
Audio tensor of specified length (padded with zeros).
PadOrTruncate(Tensor<T>, int, T)
Pads or truncates audio to a specific length.
public Tensor<T> PadOrTruncate(Tensor<T> audio, int targetLength, T padValue)
Parameters
audio (Tensor<T>): Input audio tensor.
targetLength (int): Target length in samples.
padValue (T): Value to use for padding.
Returns
- Tensor<T>
Audio tensor of specified length.
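A sketch combining this with DurationToSamples to batch clips at a fixed length (`clip` is assumed to be a waveform tensor loaded elsewhere):

```csharp
var processor = new AudioProcessor<float>(sampleRate: 44100);

// Fix every clip to exactly 5 seconds before batching.
int targetLength = processor.DurationToSamples(5.0);
var fixedAudio = processor.PadOrTruncate(clip, targetLength);         // pads with zeros
var silenced   = processor.PadOrTruncate(clip, targetLength, 0.0f);   // explicit pad value
```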
SpectrogramToAudio(Tensor<T>, int?)
Converts a normalized spectrogram back to audio.
public Tensor<T> SpectrogramToAudio(Tensor<T> spectrogram, int? length = null)
Parameters
spectrogram (Tensor<T>): Normalized spectrogram [numFrames, nMels] in range [0, 1].
length (int?): Expected output length in samples (optional).
Returns
- Tensor<T>
Audio waveform tensor.
Remarks
For Beginners: This takes a spectrogram (e.g., generated by a diffusion model) and converts it back to an audio waveform that can be played.
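A decode sketch, assuming `generatedSpec` is a normalized spectrogram produced by a diffusion model:

```csharp
// Passing an explicit length trims the Griffin-Lim reconstruction
// to the expected duration instead of whatever the frame math yields.
int expectedSamples = processor.DurationToSamples(5.0);
Tensor<float> waveform = processor.SpectrogramToAudio(generatedSpec, length: expectedSamples);
```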