Class MelSpectrogram<T>
Computes Mel spectrograms from audio signals.
public class MelSpectrogram<T>
Type Parameters
T: The numeric type used for calculations.
Inheritance
- MelSpectrogram<T>
Remarks
A Mel spectrogram is a time-frequency representation of audio that approximates human pitch perception. It uses the Mel scale, which spaces frequencies according to perceived pitch rather than physical frequency.
For Beginners: Human hearing does not perceive pitch linearly. The difference between 100 Hz and 200 Hz is easy to hear, but 10,000 Hz and 10,100 Hz sound almost the same. The Mel scale accounts for this.
Computing a Mel spectrogram involves three steps:
- Compute the power spectrogram using the short-time Fourier transform (STFT)
- Apply a bank of triangular filters spaced on the Mel scale
- Optionally take the log to compress the dynamic range (see logMel)
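The filter bank is the central piece of the steps above. As a rough illustration, the following self-contained C# sketch builds triangular filters whose edges are evenly spaced on the Mel scale, using the Hz/Mel formulas documented for HzToMel and MelToHz. The class name MelFilterbankSketch, the nFft / 2 + 1 frequency-bin count, and the absence of any per-filter normalization are assumptions made for this example; the library's own filterbank may differ.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the documented API.
public static class MelFilterbankSketch
{
    // Conversion formulas as documented for HzToMel and MelToHz.
    public static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
    public static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

    // Returns an [nMels, nFft / 2 + 1] matrix of triangular weights in [0, 1].
    // Assumes fMin < fMax so that neighbouring edges never coincide.
    public static double[,] Build(int sampleRate, int nFft, int nMels, double fMin, double fMax)
    {
        int numFreqs = nFft / 2 + 1; // assumed one-sided spectrum size

        // nMels + 2 edge points, evenly spaced on the Mel scale, converted back to Hz.
        var edgesHz = new double[nMels + 2];
        double melMin = HzToMel(fMin);
        double melMax = HzToMel(fMax);
        for (int i = 0; i < edgesHz.Length; i++)
        {
            edgesHz[i] = MelToHz(melMin + (melMax - melMin) * i / (nMels + 1));
        }

        var weights = new double[nMels, numFreqs];
        for (int m = 0; m < nMels; m++)
        {
            double left = edgesHz[m];
            double center = edgesHz[m + 1];
            double right = edgesHz[m + 2];
            for (int k = 0; k < numFreqs; k++)
            {
                double freq = (double)k * sampleRate / nFft; // frequency of FFT bin k
                double rising = (freq - left) / (center - left);
                double falling = (right - freq) / (right - center);
                // Triangle: rises from left to center, falls from center to right, zero elsewhere.
                weights[m, k] = Math.Max(0.0, Math.Min(rising, falling));
            }
        }
        return weights;
    }
}
```

Calling MelFilterbankSketch.Build(22050, 2048, 128, 0.0, 11025.0) yields a 128 x 1025 matrix under these assumptions. Low-frequency filters are narrow and high-frequency filters are wide, which is how the Mel scale's spacing shows up in practice.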
This representation is commonly used for:
- Speech recognition (like Whisper)
- Music generation (like Riffusion)
- Audio classification
- Speaker verification
Usage:
var melSpec = new MelSpectrogram<float>(
    sampleRate: 44100,
    nMels: 128,
    nFft: 2048
);
// audioSignal: a Tensor<float> of audio samples recorded at 44100 Hz
var mel = melSpec.Forward(audioSignal);
// mel.Shape = [numFrames, nMels]
Constructors
MelSpectrogram(int, int, int, int?, double, double?, IWindowFunction<T>?, bool, double, double)
Initializes a new Mel spectrogram processor.
public MelSpectrogram(int sampleRate = 22050, int nMels = 128, int nFft = 2048, int? hopLength = null, double fMin = 0, double? fMax = null, IWindowFunction<T>? windowFunction = null, bool logMel = true, double refDb = 1, double minDb = -80)
Parameters
sampleRate (int): Audio sample rate in Hz (default: 22050).
nMels (int): Number of Mel frequency bins (default: 128).
nFft (int): FFT size (default: 2048).
hopLength (int?): Hop length between frames (default: nFft/4).
fMin (double): Minimum frequency in Hz (default: 0).
fMax (double?): Maximum frequency in Hz (default: sampleRate/2).
windowFunction (IWindowFunction<T>): Window function to use (default: HanningWindow - industry standard for audio).
logMel (bool): Whether to apply log compression (default: true).
refDb (double): Reference value for dB conversion (default: 1.0).
minDb (double): Floor applied to dB values (default: -80).
Remarks
For Beginners:
- sampleRate: Must match your audio file's sample rate
- nMels: More bins = more frequency detail (128 is common for music, 80 for speech)
- nFft: Larger = more frequency resolution, less time resolution
- fMin/fMax: Filter out frequencies outside your range of interest
- windowFunction: Reduces spectral leakage. Hann (default) is the industry standard.
- logMel: Log compression makes the representation more perceptually uniform
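The trade-off behind nFft can be made concrete with a little arithmetic. The sketch below is a hedged illustration rather than library code: StftResolutionSketch and its method names are hypothetical, and the only behavior taken from this page is the nFft/4 default for hopLength.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the documented API.
public static class StftResolutionSketch
{
    // Spacing between adjacent FFT bins, in Hz. Smaller means finer frequency detail.
    public static double BinWidthHz(int sampleRate, int nFft) => (double)sampleRate / nFft;

    // Duration covered by one analysis window, in milliseconds. Larger means blurrier timing.
    public static double WindowMs(int sampleRate, int nFft) => 1000.0 * nFft / sampleRate;

    // Time step between consecutive frames, in milliseconds.
    // A null hopLength falls back to nFft / 4, the documented default.
    public static double HopMs(int sampleRate, int nFft, int? hopLength = null)
        => 1000.0 * (hopLength ?? nFft / 4) / sampleRate;
}
```

With the default sampleRate of 22050 and nFft of 2048, BinWidthHz gives roughly 10.8 Hz per bin, WindowMs roughly 93 ms, and HopMs roughly 23 ms. Raising nFft to 8192 narrows the bins to about 2.7 Hz but stretches each window to about 372 ms.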
Properties
NumMels
Gets the number of Mel bins.
public int NumMels { get; }
Property Value
- int
STFT
Gets the STFT parameters.
public ShortTimeFourierTransform<T> STFT { get; }
Property Value
- ShortTimeFourierTransform<T>
SampleRate
Gets the sample rate.
public int SampleRate { get; }
Property Value
- int
Methods
DbToPower(Tensor<T>)
Converts dB spectrogram back to power.
public Tensor<T> DbToPower(Tensor<T> db)
Parameters
db (Tensor<T>): dB spectrogram.
Returns
- Tensor<T>
Power spectrogram.
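This page does not state the exact conversion formula. For orientation, the sketch below uses the conventional decibel definition for power quantities, dB = 10 * log10(power / reference), and its inverse. Treating refDb as that reference and minDb as an absolute floor are assumptions, as are the names DbConversionSketch, PowerToDb, and refValue; the library may floor or scale values differently.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the documented API.
public static class DbConversionSketch
{
    // Power to decibels, clamped from below at minDb (assumed to be an absolute floor).
    public static double PowerToDb(double power, double refValue = 1.0, double minDb = -80.0)
    {
        double db = 10.0 * Math.Log10(power / refValue);
        return Math.Max(db, minDb);
    }

    // Decibels back to power. Values that were clamped to the floor cannot be recovered exactly.
    public static double DbToPower(double db, double refValue = 1.0)
        => refValue * Math.Pow(10.0, db / 10.0);
}
```

Under these assumptions, a round trip is lossless only for values above the floor: anything clamped to minDb comes back as the floor's power equivalent rather than the original value.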
Forward(Tensor<T>)
Computes the Mel spectrogram of an audio signal.
public Tensor<T> Forward(Tensor<T> signal)
Parameters
signal (Tensor<T>): Input audio signal.
Returns
- Tensor<T>
Mel spectrogram tensor [numFrames, nMels].
Remarks
GPU Acceleration: When GPU is available, this method uses IEngine.MelSpectrogram for hardware-accelerated processing of the entire pipeline (STFT + Mel filterbank + dB conversion).
FromPowerSpectrogram(Tensor<T>)
Computes Mel spectrogram from a pre-computed power spectrogram.
public Tensor<T> FromPowerSpectrogram(Tensor<T> powerSpectrogram)
Parameters
powerSpectrogram (Tensor<T>): Power spectrogram [numFrames, numFreqs].
Returns
- Tensor<T>
Mel spectrogram tensor [numFrames, nMels].
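Given the documented shapes, the projection from a power spectrogram [numFrames, numFreqs] through a filterbank [nMels, numFreqs] amounts to a matrix product over the frequency axis. The following sketch shows that product on plain arrays. MelProjectionSketch and Project are hypothetical names, and the sketch covers only the linear projection; whether the method also applies log compression when logMel is enabled is not stated here.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the documented API.
public static class MelProjectionSketch
{
    // power: [numFrames, numFreqs]; filterbank: [nMels, numFreqs]; result: [numFrames, nMels].
    public static double[,] Project(double[,] power, double[,] filterbank)
    {
        int numFrames = power.GetLength(0);
        int numFreqs = power.GetLength(1);
        int nMels = filterbank.GetLength(0);
        if (filterbank.GetLength(1) != numFreqs)
        {
            throw new ArgumentException("Filterbank and spectrogram must share the frequency axis.");
        }

        var mel = new double[numFrames, nMels];
        for (int t = 0; t < numFrames; t++)
        {
            for (int m = 0; m < nMels; m++)
            {
                // Weighted sum of the frame's power over the bins covered by filter m.
                double sum = 0.0;
                for (int k = 0; k < numFreqs; k++)
                {
                    sum += power[t, k] * filterbank[m, k];
                }
                mel[t, m] = sum;
            }
        }
        return mel;
    }
}
```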
GetFilterbank()
Gets the Mel filterbank matrix.
public Tensor<T> GetFilterbank()
Returns
- Tensor<T>
Filterbank matrix [nMels, numFreqs].
GetMelCenterFrequencies()
Computes the frequency (in Hz) for each Mel bin center.
public double[] GetMelCenterFrequencies()
Returns
- double[]
Array of center frequencies.
HzToMel(double)
Converts frequency in Hz to Mel scale.
public static double HzToMel(double hz)
Parameters
hz (double): Frequency in Hz.
Returns
- double
Frequency in Mels.
Remarks
Uses the formula: mel = 2595 * log10(1 + hz / 700)
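To see what this formula does to equal-sized steps in Hz, the standalone sketch below reimplements it and measures an interval in Mels. MelScaleSketch and MelSpan are hypothetical names introduced for the example; only the formula itself comes from this page.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the documented API.
public static class MelScaleSketch
{
    // mel = 2595 * log10(1 + hz / 700), the formula documented above.
    public static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);

    // Width of the interval [lowHz, highHz] measured in Mels.
    public static double MelSpan(double lowHz, double highHz) => HzToMel(highHz) - HzToMel(lowHz);
}
```

MelSpan(100, 200) comes out at about 133 Mels, while MelSpan(10000, 10100) is only about 10 Mels. The same 100 Hz step therefore covers more than ten times as much of the Mel scale at the low end as at the high end.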
InvertMelToMagnitude(Tensor<T>, bool?)
Inverts a Mel spectrogram to approximate magnitude spectrogram.
public Tensor<T> InvertMelToMagnitude(Tensor<T> melSpec, bool? isDb = null)
Parameters
melSpec (Tensor<T>): Mel spectrogram (linear or dB).
isDb (bool?): Whether the input is in dB. When null (the default), the input is treated as dB if logMel was enabled.
Returns
- Tensor<T>
Approximate magnitude spectrogram.
Remarks
For Beginners: This is an approximate inversion because the Mel filterbank is not perfectly invertible. The result can be used with Griffin-Lim to reconstruct audio.
MelToHz(double)
Converts frequency in Mels to Hz.
public static double MelToHz(double mel)
Parameters
mel (double): Frequency in Mels.
Returns
- double
Frequency in Hz.
Remarks
Uses the formula: hz = 700 * (10^(mel / 2595) - 1)
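This formula is the algebraic inverse of the one documented for HzToMel. The short derivation below makes that explicit, writing m for the value in Mels and f for the frequency in Hz (symbols chosen here for brevity):

```latex
\begin{align*}
m &= 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \\
\frac{m}{2595} &= \log_{10}\!\left(1 + \frac{f}{700}\right) \\
10^{m/2595} &= 1 + \frac{f}{700} \\
f &= 700\left(10^{m/2595} - 1\right)
\end{align*}
```

Because each step is reversible for f >= 0, converting Hz to Mels and back returns the original frequency up to floating-point rounding.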