Class MelSpectrogram<T>

Namespace: AiDotNet.Diffusion.Audio

Assembly: AiDotNet.dll

Computes Mel spectrograms from audio signals.

public class MelSpectrogram<T>

Type Parameters

T: The numeric type used for calculations.

Inheritance: object

MelSpectrogram<T>

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Remarks

The Mel spectrogram is a representation of audio that mimics human hearing. It applies the Mel scale, which spaces frequencies according to how humans perceive pitch rather than by their physical frequency.

For Beginners: Human hearing doesn't perceive pitch linearly: we can easily tell 100 Hz from 200 Hz, but 10,000 Hz and 10,100 Hz sound almost the same to us. The Mel scale accounts for this.
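As a rough illustration of this point, the sketch below applies the conversion formula documented under HzToMel (mel = 2595 * log10(1 + hz / 700)) to both pairs of frequencies. The local functions are stand-ins written for this example, not the library's implementation, and MelGap is a hypothetical helper.

```csharp
using System;

// Stand-in for the documented formula: mel = 2595 * log10(1 + hz / 700).
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);

// Hypothetical helper: gap, in Mels, between two frequencies given in Hz.
static double MelGap(double lowHz, double highHz) => HzToMel(highHz) - HzToMel(lowHz);

double lowPair = MelGap(100, 200);        // 100 Hz apart, low in the spectrum
double highPair = MelGap(10_000, 10_100); // 100 Hz apart, high in the spectrum

Console.WriteLine($"100 Hz -> 200 Hz:       {lowPair:F1} Mels");
Console.WriteLine($"10,000 Hz -> 10,100 Hz: {highPair:F1} Mels");
```

Both pairs are 100 Hz apart, yet the low pair is separated by more than ten times as many Mels as the high pair.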

A Mel spectrogram is computed in three steps:

Compute the power spectrogram using the STFT
Apply a bank of triangular filters on the Mel scale
Optionally take the log to compress dynamic range

This representation is commonly used for:

Speech recognition (like Whisper)
Music generation (like Riffusion)
Audio classification
Speaker verification

Usage:

var melSpec = new MelSpectrogram<float>(
    sampleRate: 44100,
    nMels: 128,
    nFft: 2048
);
var mel = melSpec.Forward(audioSignal);
// mel.Shape = [numFrames, nMels]

Constructors

MelSpectrogram(int, int, int, int?, double, double?, IWindowFunction<T>?, bool, double, double)

Initializes a new Mel spectrogram processor.

public MelSpectrogram(int sampleRate = 22050, int nMels = 128, int nFft = 2048, int? hopLength = null, double fMin = 0, double? fMax = null, IWindowFunction<T>? windowFunction = null, bool logMel = true, double refDb = 1, double minDb = -80)

Parameters

sampleRate int: Audio sample rate in Hz (default: 22050).
nMels int: Number of Mel frequency bins (default: 128).
nFft int: FFT size (default: 2048).
hopLength int?: Hop length between frames (default: nFft/4).
fMin double: Minimum frequency in Hz (default: 0).
fMax double?: Maximum frequency in Hz (default: sampleRate/2).
windowFunction IWindowFunction<T>: Window function to use (default: HanningWindow - industry standard for audio).
logMel bool: Whether to apply log compression (default: true).
refDb double: Reference value for dB conversion (default: 1.0).
minDb double: Minimum dB value floor (default: -80).

Remarks

For Beginners:

sampleRate: Must match your audio file's sample rate.
nMels: More bins = more frequency detail (128 is common for music, 80 for speech).
nFft: Larger = more frequency resolution, less time resolution.
fMin/fMax: Filter out frequencies outside your range of interest.
windowFunction: Reduces spectral leakage. Hann (default) is the industry standard.
logMel: Log compression makes the representation more perceptually uniform.
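The nFft trade-off can be made concrete with a little arithmetic on the constructor defaults. This sketch does not call the library. It assumes a real-input FFT of size nFft produces nFft / 2 + 1 frequency bins, which is the usual convention but is not stated on this page.

```csharp
using System;

int sampleRate = 22050;    // constructor default
int nFft = 2048;           // constructor default
int hopLength = nFft / 4;  // documented default when hopLength is null

// Assumption: a real-input FFT of size nFft yields nFft / 2 + 1 frequency bins.
int numFreqs = nFft / 2 + 1;

double binWidthHz = (double)sampleRate / nFft;   // frequency resolution per bin
double windowMs = 1000.0 * nFft / sampleRate;    // audio spanned by one analysis window
double hopMs = 1000.0 * hopLength / sampleRate;  // time step between frames

Console.WriteLine($"numFreqs:  {numFreqs}");
Console.WriteLine($"bin width: {binWidthHz:F2} Hz");
Console.WriteLine($"window:    {windowMs:F1} ms");
Console.WriteLine($"hop:       {hopMs:F1} ms");
```

With the defaults, each bin covers about 10.77 Hz, each window spans about 92.9 ms, and frames advance by about 23.2 ms. Doubling nFft halves the bin width but doubles the window span.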

Properties

NumMels

Gets the number of Mel bins.

public int NumMels { get; }

Property Value

int

STFT

Gets the ShortTimeFourierTransform<T> instance that holds the STFT parameters.

public ShortTimeFourierTransform<T> STFT { get; }

Property Value

ShortTimeFourierTransform<T>

SampleRate

Gets the sample rate.

public int SampleRate { get; }

Property Value

int

Methods

DbToPower(Tensor<T>)

Converts dB spectrogram back to power.

public Tensor<T> DbToPower(Tensor<T> db)

Parameters

db Tensor<T>: dB spectrogram.

Returns

Tensor<T>: Power spectrogram.
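This page does not give the conversion formula. The sketch below assumes the conventional definition, db = 10 * log10(power / reference) with a floor at minDb, and inverts it as power = reference * 10^(db / 10). It works on scalars rather than Tensor<T>. PowerToDb and the 1e-10 guard against log(0) are hypothetical, and the library's exact behavior may differ.

```csharp
using System;

// Assumption: conventional power-to-dB mapping, floored at minDb.
// The 1e-10 guard against log(0) is a choice made for this sketch.
static double PowerToDb(double power, double reference = 1.0, double minDb = -80.0)
{
    double db = 10.0 * Math.Log10(Math.Max(power, 1e-10) / reference);
    return Math.Max(db, minDb);
}

// Inverse of the mapping above: power = reference * 10^(db / 10).
static double DbToPower(double db, double reference = 1.0) =>
    reference * Math.Pow(10.0, db / 10.0);

double original = 0.25;
double asDb = PowerToDb(original);
double recovered = DbToPower(asDb);
Console.WriteLine($"{original} -> {asDb:F2} dB -> {recovered}");

// Values clipped at the floor cannot be recovered: silence comes back as 1e-8.
double fromSilence = DbToPower(PowerToDb(0.0));
Console.WriteLine($"0 -> {PowerToDb(0.0)} dB -> {fromSilence}");
```

Under these assumptions the round trip is exact only for values above the minDb floor.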

Forward(Tensor<T>)

Computes the Mel spectrogram of an audio signal.

public Tensor<T> Forward(Tensor<T> signal)

Parameters

signal Tensor<T>: Input audio signal.

Returns

Tensor<T>: Mel spectrogram tensor [numFrames, nMels].

Remarks

GPU Acceleration: When GPU is available, this method uses IEngine.MelSpectrogram for hardware-accelerated processing of the entire pipeline (STFT + Mel filterbank + dB conversion).

FromPowerSpectrogram(Tensor<T>)

Computes Mel spectrogram from a pre-computed power spectrogram.

public Tensor<T> FromPowerSpectrogram(Tensor<T> powerSpectrogram)

Parameters

powerSpectrogram Tensor<T>: Power spectrogram [numFrames, numFreqs].

Returns

Tensor<T>: Mel spectrogram tensor [numFrames, nMels].
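Given the documented shapes, the core of this step can be sketched as a matrix product between each power frame and the filterbank. The example uses plain double[,] arrays instead of Tensor<T>, leaves out any log compression, and ApplyFilterbank is a hypothetical helper rather than a library method.

```csharp
using System;

// Multiplies each power frame by a Mel filterbank.
// power:      [numFrames, numFreqs]
// filterbank: [nMels, numFreqs]
// result:     [numFrames, nMels]
static double[,] ApplyFilterbank(double[,] power, double[,] filterbank)
{
    int numFrames = power.GetLength(0);
    int numFreqs = power.GetLength(1);
    int nMels = filterbank.GetLength(0);
    if (filterbank.GetLength(1) != numFreqs)
        throw new ArgumentException("Filterbank and spectrogram disagree on numFreqs.");

    var result = new double[numFrames, nMels];
    for (int t = 0; t < numFrames; t++)
    {
        for (int m = 0; m < nMels; m++)
        {
            double sum = 0.0;
            for (int k = 0; k < numFreqs; k++)
                sum += filterbank[m, k] * power[t, k];
            result[t, m] = sum;
        }
    }
    return result;
}

// One frame with four frequency bins, reduced to two Mel bins.
double[,] powerFrames = { { 1.0, 2.0, 3.0, 4.0 } };
double[,] toyBank =
{
    { 1.0, 0.5, 0.0, 0.0 },
    { 0.0, 0.5, 1.0, 0.5 },
};
double[,] melFrames = ApplyFilterbank(powerFrames, toyBank);

// melFrames[0, 0] = 1.0*1 + 0.5*2             = 2
// melFrames[0, 1] = 0.5*2 + 1.0*3 + 0.5*4     = 6
Console.WriteLine($"{melFrames[0, 0]}, {melFrames[0, 1]}");
```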

GetFilterbank()

Gets the Mel filterbank matrix.

public Tensor<T> GetFilterbank()

Returns

Tensor<T>: Filterbank matrix [nMels, numFreqs].
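To show what a matrix of this shape can look like, the sketch below builds triangular filters from a list of edge frequencies. It is an illustration under stated assumptions, not the library's code: BuildTriangularFilterbank is hypothetical, the triangles are unnormalized with a peak of 1, and numFreqs is taken to be nFft / 2 + 1.

```csharp
using System;

// Builds triangular filters over FFT bin frequencies.
// edgesHz holds nMels + 2 ascending frequencies: filter m rises from edgesHz[m]
// to a peak of 1 at edgesHz[m + 1], then falls back to 0 at edgesHz[m + 2].
// Assumption: unnormalized triangles; a real implementation may rescale each filter.
static double[,] BuildTriangularFilterbank(double[] edgesHz, int sampleRate, int nFft)
{
    int nMels = edgesHz.Length - 2;
    int numFreqs = nFft / 2 + 1;   // assumed bin count of a real-input FFT
    var bank = new double[nMels, numFreqs];

    for (int m = 0; m < nMels; m++)
    {
        double left = edgesHz[m], center = edgesHz[m + 1], right = edgesHz[m + 2];
        for (int k = 0; k < numFreqs; k++)
        {
            double freq = (double)k * sampleRate / nFft;   // frequency of bin k
            double rising = (freq - left) / (center - left);
            double falling = (right - freq) / (right - center);
            bank[m, k] = Math.Max(0.0, Math.Min(rising, falling));
        }
    }
    return bank;
}

// Toy setup: an 8 Hz sample rate with nFft = 8 puts the bins at 0, 1, 2, 3, 4 Hz.
double[] edges = { 0.0, 1.0, 2.0, 4.0 };   // two filters
double[,] filters = BuildTriangularFilterbank(edges, sampleRate: 8, nFft: 8);

for (int m = 0; m < filters.GetLength(0); m++)
{
    for (int k = 0; k < filters.GetLength(1); k++)
        Console.Write($"{filters[m, k]:F1} ");
    Console.WriteLine();
}
```

In practice the edge frequencies would be spaced evenly on the Mel scale, so the triangles become wider toward higher frequencies.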

GetMelCenterFrequencies()

Computes the frequency (in Hz) for each Mel bin center.

public double[] GetMelCenterFrequencies()

Returns

double[]: Array of center frequencies.
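One common way to place the centers, assumed here rather than confirmed by this page, is to space nMels + 2 points evenly on the Mel scale between fMin and fMax and take the interior points as bin centers. The sketch uses the documented HzToMel and MelToHz formulas as local stand-ins. MelCenters is a hypothetical helper.

```csharp
using System;

// Stand-ins for the documented conversion formulas.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

// Assumption: nMels + 2 points evenly spaced in Mels between fMin and fMax;
// the nMels interior points are returned as the bin centers, in Hz.
static double[] MelCenters(int nMels, double fMin, double fMax)
{
    double melMin = HzToMel(fMin);
    double melMax = HzToMel(fMax);
    double step = (melMax - melMin) / (nMels + 1);

    var centers = new double[nMels];
    for (int m = 0; m < nMels; m++)
        centers[m] = MelToHz(melMin + (m + 1) * step);
    return centers;
}

double[] centersHz = MelCenters(nMels: 8, fMin: 0.0, fMax: 11025.0);
for (int i = 0; i < centersHz.Length; i++)
    Console.WriteLine($"bin {i}: {centersHz[i]:F1} Hz");
```

Because the points are evenly spaced in Mels, the gaps between neighboring centers grow when expressed in Hz.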

HzToMel(double)

Converts frequency in Hz to Mel scale.

public static double HzToMel(double hz)

Parameters

hz double: Frequency in Hz.

Returns

double: Frequency in Mels.

Remarks

Uses the formula: mel = 2595 * log10(1 + hz / 700)

InvertMelToMagnitude(Tensor<T>, bool?)

Inverts a Mel spectrogram to approximate magnitude spectrogram.

public Tensor<T> InvertMelToMagnitude(Tensor<T> melSpec, bool? isDb = null)

Parameters

melSpec Tensor<T>: Mel spectrogram (linear or dB).
isDb bool?: Whether the input is in dB (default: true if logMel was enabled).

Returns

Tensor<T>: Approximate magnitude spectrogram.

Remarks

For Beginners: This is an approximate inversion because the Mel filterbank is not perfectly invertible. The result can be used with Griffin-Lim to reconstruct audio.
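A toy case shows why the inversion can only be approximate: the filterbank maps many frequency bins onto fewer Mel bins, so different spectra can produce identical Mel values. The 2 x 4 filterbank below is made up for illustration (rectangular rather than triangular), and ToMel is a hypothetical helper.

```csharp
using System;

// Made-up filterbank: two Mel bins, each summing a pair of the four frequency bins.
double[,] toyBank =
{
    { 1.0, 1.0, 0.0, 0.0 },
    { 0.0, 0.0, 1.0, 1.0 },
};

// Hypothetical helper: applies the filterbank to one spectrum frame.
static double[] ToMel(double[,] bank, double[] spectrum)
{
    var mel = new double[bank.GetLength(0)];
    for (int m = 0; m < mel.Length; m++)
        for (int k = 0; k < spectrum.Length; k++)
            mel[m] += bank[m, k] * spectrum[k];
    return mel;
}

double[] spectrumA = { 4.0, 0.0, 1.0, 3.0 };
double[] spectrumB = { 1.0, 3.0, 2.0, 2.0 };

double[] melA = ToMel(toyBank, spectrumA);   // both Mel bins equal 4
double[] melB = ToMel(toyBank, spectrumB);   // both Mel bins equal 4

Console.WriteLine($"A: [{string.Join(", ", melA)}]");
Console.WriteLine($"B: [{string.Join(", ", melB)}]");
```

The two spectra differ in every bin, but their Mel representations match, so no inverse can tell them apart from the Mel values alone.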

MelToHz(double)

Converts frequency in Mels to Hz.

public static double MelToHz(double mel)

Parameters

mel double: Frequency in Mels.

Returns

double: Frequency in Hz.

Remarks

Uses the formula: hz = 700 * (10^(mel / 2595) - 1)
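The two documented formulas are algebraic inverses of each other. The sketch below checks that numerically with local stand-ins for HzToMel and MelToHz, so it runs without the library. The test frequencies are arbitrary.

```csharp
using System;

// Stand-ins for the documented formulas.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

double[] testHz = { 0.0, 440.0, 1000.0, 8000.0 };
double worstError = 0.0;

foreach (double f in testHz)
{
    double mels = HzToMel(f);
    double roundTrip = MelToHz(mels);
    worstError = Math.Max(worstError, Math.Abs(roundTrip - f));
    Console.WriteLine($"{f,7:F1} Hz -> {mels,7:F1} Mels -> {roundTrip,7:F1} Hz");
}

Console.WriteLine($"worst round-trip error: {worstError:E2} Hz");
```

With these constants, 1000 Hz maps to very nearly 1000 Mels, and the round-trip error is limited to floating-point rounding.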
