Class MelSpectrogram<T>

Namespace
AiDotNet.Diffusion.Audio
Assembly
AiDotNet.dll

Computes Mel spectrograms from audio signals.

public class MelSpectrogram<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
MelSpectrogram<T>
Inherited Members

Remarks

The Mel spectrogram is a representation of audio that approximates how humans hear. It applies the Mel scale, which spaces frequencies according to perceived pitch rather than physical frequency.

For Beginners: Human hearing doesn't perceive pitch linearly. We can easily tell 100 Hz from 200 Hz, but 10,000 Hz and 10,100 Hz sound almost the same to us. The Mel scale accounts for this.

A Mel spectrogram:

  1. Computes the power spectrogram using the short-time Fourier transform (STFT)
  2. Applies a bank of triangular filters spaced on the Mel scale
  3. Optionally takes the log to compress the dynamic range

This representation is commonly used for:

  • Speech recognition (like Whisper)
  • Music generation (like Riffusion)
  • Audio classification
  • Speaker verification

Usage:

var melSpec = new MelSpectrogram<float>(
    sampleRate: 44100,
    nMels: 128,
    nFft: 2048
);
var mel = melSpec.Forward(audioSignal);
// mel.Shape = [numFrames, nMels]

Constructors

MelSpectrogram(int, int, int, int?, double, double?, IWindowFunction<T>?, bool, double, double)

Initializes a new Mel spectrogram processor.

public MelSpectrogram(int sampleRate = 22050, int nMels = 128, int nFft = 2048, int? hopLength = null, double fMin = 0, double? fMax = null, IWindowFunction<T>? windowFunction = null, bool logMel = true, double refDb = 1, double minDb = -80)

Parameters

sampleRate int

Audio sample rate in Hz (default: 22050).

nMels int

Number of Mel frequency bins (default: 128).

nFft int

FFT size (default: 2048).

hopLength int?

Hop length between frames, in samples (default: nFft/4).

fMin double

Minimum frequency in Hz (default: 0).

fMax double?

Maximum frequency in Hz (default: sampleRate/2).

windowFunction IWindowFunction<T>

Window function to use (default: HanningWindow, the Hann window, a common choice for audio).

logMel bool

Whether to apply log compression (default: true).

refDb double

Reference value for dB conversion (default: 1.0).

minDb double

Minimum dB value floor (default: -80).

Remarks

For Beginners:

  • sampleRate: Must match your audio file's sample rate
  • nMels: More bins = more frequency detail (128 is common for music, 80 for speech)
  • nFft: Larger = more frequency resolution, less time resolution
  • fMin/fMax: Filter out frequencies outside your range of interest
  • windowFunction: Reduces spectral leakage. Hann (default) is the industry standard.
  • logMel: Log compression makes the representation more perceptually uniform
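The trade-off between nFft and time resolution can be made concrete with a little arithmetic. The following is a minimal sketch and not part of AiDotNet: the class and method names (StftResolutionSketch, BinWidthHz, DefaultHopLength, HopDurationMs) are hypothetical, and only the nFft/4 default hop length is taken from the parameter list above.

```csharp
using System;

// Hypothetical helper for illustration; not an AiDotNet type.
public static class StftResolutionSketch
{
    // Width of one FFT frequency bin in Hz: larger nFft gives narrower bins.
    public static double BinWidthHz(int sampleRate, int nFft) => (double)sampleRate / nFft;

    // Default hop length as documented above: nFft / 4 samples.
    public static int DefaultHopLength(int nFft) => nFft / 4;

    // Time between the starts of consecutive frames, in milliseconds.
    public static double HopDurationMs(int sampleRate, int hopLength) => 1000.0 * hopLength / sampleRate;
}
```

With the default sampleRate of 22050 and nFft of 2048, each bin is about 10.77 Hz wide and the default hop of 512 samples is about 23.2 ms. Doubling nFft halves the bin width but doubles the span of audio each frame covers.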

Properties

NumMels

Gets the number of Mel bins.

public int NumMels { get; }

Property Value

int

STFT

Gets the underlying STFT instance, which holds the STFT parameters.

public ShortTimeFourierTransform<T> STFT { get; }

Property Value

ShortTimeFourierTransform<T>

SampleRate

Gets the sample rate.

public int SampleRate { get; }

Property Value

int

Methods

DbToPower(Tensor<T>)

Converts dB spectrogram back to power.

public Tensor<T> DbToPower(Tensor<T> db)

Parameters

db Tensor<T>

dB spectrogram.

Returns

Tensor<T>

Power spectrogram.
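To illustrate how the dB and power representations relate, here is a scalar sketch. It assumes the common convention dB = 10 * log10(power / reference) with a floor at minDb. The exact formula AiDotNet uses is not stated on this page, so the class DbConversionSketch and both methods are hypothetical stand-ins rather than the library's implementation.

```csharp
using System;

// Hypothetical scalar sketch; the library operates on Tensor<T>.
public static class DbConversionSketch
{
    // Assumed convention: dB = 10 * log10(power / reference), clamped below at minDb.
    public static double PowerToDb(double power, double reference = 1.0, double minDb = -80.0)
    {
        // Guard against log10(0) for silent bins.
        double safePower = Math.Max(power, double.Epsilon);
        double db = 10.0 * Math.Log10(safePower / reference);
        return Math.Max(db, minDb);
    }

    // Inverse of the assumed convention: power = reference * 10^(dB / 10).
    public static double DbToPower(double db, double reference = 1.0)
        => reference * Math.Pow(10.0, db / 10.0);
}
```

Under this assumption the round trip is exact only for values above the floor: anything clamped to minDb comes back as the floor power (1e-8 for -80 dB with a reference of 1), not the original value.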

Forward(Tensor<T>)

Computes the Mel spectrogram of an audio signal.

public Tensor<T> Forward(Tensor<T> signal)

Parameters

signal Tensor<T>

Input audio signal.

Returns

Tensor<T>

Mel spectrogram tensor [numFrames, nMels].

Remarks

GPU Acceleration: When a GPU is available, this method uses IEngine.MelSpectrogram for hardware-accelerated processing of the entire pipeline (STFT + Mel filterbank + dB conversion).

FromPowerSpectrogram(Tensor<T>)

Computes a Mel spectrogram from a precomputed power spectrogram.

public Tensor<T> FromPowerSpectrogram(Tensor<T> powerSpectrogram)

Parameters

powerSpectrogram Tensor<T>

Power spectrogram [numFrames, numFreqs].

Returns

Tensor<T>

Mel spectrogram tensor [numFrames, nMels].
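The shapes above imply a matrix product between the power spectrogram and the filterbank. The sketch below shows that projection on plain double[,] arrays. MelProjectionSketch and Apply are hypothetical names, the optional log step is left out, and the library's own implementation may differ.

```csharp
using System;

// Hypothetical sketch on plain arrays; the library operates on Tensor<T>.
public static class MelProjectionSketch
{
    // power: [numFrames, numFreqs], filterbank: [nMels, numFreqs] -> result: [numFrames, nMels]
    public static double[,] Apply(double[,] power, double[,] filterbank)
    {
        int numFrames = power.GetLength(0);
        int numFreqs = power.GetLength(1);
        int nMels = filterbank.GetLength(0);
        if (filterbank.GetLength(1) != numFreqs)
            throw new ArgumentException("Filterbank and spectrogram must share the frequency axis.");

        var mel = new double[numFrames, nMels];
        for (int t = 0; t < numFrames; t++)
        {
            for (int m = 0; m < nMels; m++)
            {
                // Weighted sum of the frame's power over the bins covered by filter m.
                double sum = 0.0;
                for (int f = 0; f < numFreqs; f++)
                    sum += filterbank[m, f] * power[t, f];
                mel[t, m] = sum;
            }
        }
        return mel;
    }
}
```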

GetFilterbank()

Gets the Mel filterbank matrix.

public Tensor<T> GetFilterbank()

Returns

Tensor<T>

Filterbank matrix [nMels, numFreqs].
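As a rough picture of what a [nMels, numFreqs] matrix of triangular filters contains, the sketch below builds one from a list of edge frequencies. It rests on several assumptions that this page does not confirm: a one-sided spectrum with nFft/2 + 1 bins, unit-height triangles with no area normalization, and the hypothetical names TriangularFilterbankSketch and Build.

```csharp
using System;

// Hypothetical sketch; normalization and bin layout in AiDotNet may differ.
public static class TriangularFilterbankSketch
{
    // edgesHz holds nMels + 2 strictly ascending frequencies: filter m rises from
    // edgesHz[m] to its peak at edgesHz[m + 1] and falls to zero at edgesHz[m + 2].
    public static double[,] Build(double[] edgesHz, int sampleRate, int nFft)
    {
        int nMels = edgesHz.Length - 2;
        int numFreqs = nFft / 2 + 1; // one-sided spectrum (assumed)
        var fb = new double[nMels, numFreqs];

        for (int m = 0; m < nMels; m++)
        {
            double left = edgesHz[m], center = edgesHz[m + 1], right = edgesHz[m + 2];
            for (int k = 0; k < numFreqs; k++)
            {
                double freq = (double)k * sampleRate / nFft; // frequency of FFT bin k
                double rising = (freq - left) / (center - left);
                double falling = (right - freq) / (right - center);
                // The smaller slope gives the triangle; negatives fall outside it.
                fb[m, k] = Math.Max(0.0, Math.Min(rising, falling));
            }
        }
        return fb;
    }
}
```

Each row is non-zero only between its left and right edges, so adjacent filters overlap by half their width.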

GetMelCenterFrequencies()

Computes the frequency (in Hz) for each Mel bin center.

public double[] GetMelCenterFrequencies()

Returns

double[]

Array of center frequencies.
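One common way to obtain such center frequencies is to space points evenly on the Mel scale between fMin and fMax and convert them back to Hz. The sketch below does that using the two conversion formulas documented on this page. Whether AiDotNet places its centers exactly this way is an assumption, and MelCentersSketch and CenterFrequencies are hypothetical names.

```csharp
using System;

// Hypothetical sketch; not the library's GetMelCenterFrequencies.
public static class MelCentersSketch
{
    public static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
    public static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

    // Places nMels + 2 points evenly on the Mel scale between fMin and fMax
    // and returns the nMels interior points, converted back to Hz.
    public static double[] CenterFrequencies(int nMels, double fMin, double fMax)
    {
        double melMin = HzToMel(fMin);
        double melMax = HzToMel(fMax);
        double step = (melMax - melMin) / (nMels + 1);

        var centers = new double[nMels];
        for (int m = 0; m < nMels; m++)
            centers[m] = MelToHz(melMin + (m + 1) * step);
        return centers;
    }
}
```

Because the spacing is uniform in Mels, the gaps between consecutive centers grow in Hz as frequency increases.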

HzToMel(double)

Converts frequency in Hz to Mel scale.

public static double HzToMel(double hz)

Parameters

hz double

Frequency in Hz.

Returns

double

Frequency in Mels.

Remarks

Uses the formula: mel = 2595 * log10(1 + hz / 700)
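To see what this formula does to equal steps in Hz, the following standalone sketch reimplements it and measures the distance in Mels between two frequencies. HzToMelSketch and MelGap are hypothetical names used only for illustration.

```csharp
using System;

// Standalone reimplementation of the documented formula, for illustration.
public static class HzToMelSketch
{
    // mel = 2595 * log10(1 + hz / 700)
    public static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);

    // Distance in Mels between two frequencies given in Hz.
    public static double MelGap(double lowHz, double highHz) => HzToMel(highHz) - HzToMel(lowHz);
}
```

MelGap(100, 200) is roughly 133 Mels, while MelGap(10000, 10100) is roughly 10 Mels: the same 100 Hz step counts for far less at high frequencies. The constants are chosen so that 1000 Hz maps to approximately 1000 Mels.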

InvertMelToMagnitude(Tensor<T>, bool?)

Inverts a Mel spectrogram to an approximate magnitude spectrogram.

public Tensor<T> InvertMelToMagnitude(Tensor<T> melSpec, bool? isDb = null)

Parameters

melSpec Tensor<T>

Mel spectrogram (linear or dB).

isDb bool?

Whether the input is in dB. When null (the default), the input is treated as dB if logMel was enabled.

Returns

Tensor<T>

Approximate magnitude spectrogram.

Remarks

For Beginners: This is an approximate inversion because the Mel filterbank is not perfectly invertible. The result can be used with Griffin-Lim to reconstruct audio.

MelToHz(double)

Converts frequency in Mels to Hz.

public static double MelToHz(double mel)

Parameters

mel double

Frequency in Mels.

Returns

double

Frequency in Hz.

Remarks

Uses the formula: hz = 700 * (10^(mel / 2595) - 1)
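This formula is the algebraic inverse of the one used by HzToMel. Solving the HzToMel formula for hz shows the steps:

```latex
\begin{align*}
\text{mel} &= 2595 \,\log_{10}\!\left(1 + \frac{\text{hz}}{700}\right) \\
\frac{\text{mel}}{2595} &= \log_{10}\!\left(1 + \frac{\text{hz}}{700}\right) \\
10^{\text{mel}/2595} &= 1 + \frac{\text{hz}}{700} \\
\text{hz} &= 700 \left(10^{\text{mel}/2595} - 1\right)
\end{align*}
```

Mathematically, converting a frequency to Mels and back therefore returns the original value; in code the round trip holds up to floating-point rounding.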