Class LossScaler<T>

Namespace
AiDotNet.MixedPrecision
Assembly
AiDotNet.dll

Implements dynamic loss scaling for mixed-precision training to prevent gradient underflow.

public class LossScaler<T>

Type Parameters

T

The numeric type of the loss and gradient values (for example, float or double).

Inheritance
object → LossScaler<T>

Examples

// Create a loss scaler (these arguments match the defaults)
var scaler = new LossScaler<float>(
    initialScale: 65536.0,
    dynamicScaling: true
);

// In training loop:
float loss = lossFunction.Compute(predictions, targets);
float scaledLoss = scaler.ScaleLoss(loss);

// Backpropagation with scaled loss...
var gradients = model.Backward(scaledLoss);

// Unscale and check for overflow
if (scaler.UnscaleGradientsAndCheck(gradients))
{
    // Safe to update parameters
    optimizer.Update(parameters, gradients);
}
else
{
    // Skip this update due to gradient overflow
    Console.WriteLine($"Gradient overflow, scale reduced to {scaler.Scale}");
}

Remarks

For Beginners: Loss scaling is a technique used in mixed-precision training to prevent very small gradient values from becoming zero (underflow) when using 16-bit precision.

The problem:

  • FP16 (Half) can only represent magnitudes from about 6e-8 (the smallest subnormal) up to 65504
  • During training, gradients are often very small (e.g., 1e-10)
  • Small gradients underflow to zero in FP16, stopping learning
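The underflow described above is easy to reproduce with .NET's built-in Half type; a standalone sketch using only the base class library (not AiDotNet itself):

```csharp
using System;

// A typical small gradient magnitude that FP32 handles easily...
float gradient = 1e-10f;

// ...underflows to zero when converted to FP16 (Half),
// because Half cannot represent magnitudes below ~6e-8.
Half asHalf = (Half)gradient;
Console.WriteLine(asHalf); // 0

// Scaling by 2^16 first keeps the value inside Half's range.
Half scaled = (Half)(gradient * 65536f);
Console.WriteLine(scaled == (Half)0); // False: the information survives
```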

The solution:

  • Scale the loss by a large factor (e.g., 2^16 = 65536) before backpropagation
  • This makes gradients larger, preventing underflow
  • Unscale gradients back to their original values before parameter updates

Dynamic scaling:

  • Automatically adjusts the scale factor during training
  • Increases scale when gradients are stable (no overflow)
  • Decreases scale when gradients overflow (become infinity/NaN)

Technical Details: The algorithm follows NVIDIA's approach:

  1. Start with a large initial scale (default: 2^16 = 65536)
  2. If no overflow occurs for N steps, increase the scale by the growth factor (default: 2.0)
  3. If overflow is detected, decrease the scale by the backoff factor (default: 0.5) and skip the update
  4. Monitor consecutive successful updates for scale adjustment
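The adjustment rule above can be sketched in a few lines. This is a minimal standalone illustration of the algorithm, not the library's actual implementation; the Step function and its overflow parameter are hypothetical names:

```csharp
using System;

// Minimal sketch of NVIDIA-style dynamic loss-scaling state.
double scale = 65536.0;             // 2^16 initial scale
const double GrowthFactor = 2.0;
const double BackoffFactor = 0.5;
const int GrowthInterval = 2000;
const double MinScale = 1.0;
const double MaxScale = 16777216.0; // 2^24
int stepsSinceOverflow = 0;

// Called once per training step after checking the gradients.
void Step(bool overflow)
{
    if (overflow)
    {
        // Overflow: back off and skip the parameter update.
        scale = Math.Max(MinScale, scale * BackoffFactor);
        stepsSinceOverflow = 0;
    }
    else if (++stepsSinceOverflow >= GrowthInterval)
    {
        // Stable for GrowthInterval steps: try a larger scale.
        scale = Math.Min(MaxScale, scale * GrowthFactor);
        stepsSinceOverflow = 0;
    }
}

Step(overflow: true);
Console.WriteLine(scale); // 32768
```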

Constructors

LossScaler(double, bool, int, double, double, double, double)

Initializes a new instance of the LossScaler class.

public LossScaler(double initialScale = 65536, bool dynamicScaling = true, int growthInterval = 2000, double growthFactor = 2, double backoffFactor = 0.5, double minScale = 1, double maxScale = 16777216)

Parameters

initialScale double

Initial loss scale factor (default: 65536 = 2^16).

dynamicScaling bool

Enable dynamic scale adjustment (default: true).

growthInterval int

Number of successful updates before scaling up (default: 2000).

growthFactor double

Factor to grow scale by (default: 2.0).

backoffFactor double

Factor to reduce scale by (default: 0.5).

minScale double

Minimum scale value (default: 1.0).

maxScale double

Maximum scale value (default: 2^24 = 16777216).

Remarks

For Beginners: Default values follow NVIDIA's mixed-precision training recommendations:

  • An initial scale of 2^16 works well for most models
  • A growth interval of 2000 prevents oscillation
  • A growth factor of 2.0 and a backoff factor of 0.5 balance exploration
  • Min/max bounds prevent extreme scale values
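One consequence of these defaults: the scale can double at most once every 2000 clean steps, so growing from the initial 2^16 to the 2^24 ceiling takes eight doublings, i.e. at least 16,000 overflow-free updates. A quick standalone check of that arithmetic:

```csharp
using System;

double scale = 65536.0;             // default initialScale (2^16)
const double MaxScale = 16777216.0; // default maxScale (2^24)
const int GrowthInterval = 2000;

int steps = 0;
while (scale < MaxScale)
{
    scale *= 2.0;            // one growth event...
    steps += GrowthInterval; // ...per 2000 overflow-free updates
}
Console.WriteLine(steps); // 16000
```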

Properties

BackoffFactor

Factor by which to multiply the scale when decreasing (default: 0.5).

public double BackoffFactor { get; set; }

Property Value

double

DynamicScaling

Whether to use dynamic loss scaling.

public bool DynamicScaling { get; set; }

Property Value

bool

GrowthFactor

Factor by which to multiply the scale when increasing (default: 2.0).

public double GrowthFactor { get; set; }

Property Value

double

GrowthInterval

Number of consecutive iterations without overflow before increasing scale.

public int GrowthInterval { get; set; }

Property Value

int

MaxScale

Maximum allowed scale value to prevent excessive growth.

public double MaxScale { get; set; }

Property Value

double

MinScale

Minimum allowed scale value to prevent excessive reduction.

public double MinScale { get; set; }

Property Value

double

OverflowRate

Gets the fraction of updates skipped due to overflow (SkippedUpdates / TotalUpdates).

public double OverflowRate { get; }

Property Value

double

Scale

Current loss scale factor.

public double Scale { get; }

Property Value

double

SkippedUpdates

Gets the number of updates skipped due to overflow.

public int SkippedUpdates { get; }

Property Value

int

TotalUpdates

Gets the total number of updates attempted.

public int TotalUpdates { get; }

Property Value

int

Methods

DetectOverflow(Tensor<T>)

Checks if any gradient in a tensor has overflowed.

public bool DetectOverflow(Tensor<T> gradients)

Parameters

gradients Tensor<T>

The tensor of gradients to check.

Returns

bool

True if any gradient is NaN or infinity; otherwise, false.
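The check itself is a scan for non-finite values. For float gradients it amounts to the following; a standalone equivalent over a plain array, not the library's own code:

```csharp
using System;
using System.Linq;

// Returns true if any gradient is NaN or infinity.
static bool DetectOverflow(float[] gradients) =>
    gradients.Any(g => float.IsNaN(g) || float.IsInfinity(g));

Console.WriteLine(DetectOverflow(new[] { 0.1f, -2.5f }));                  // False
Console.WriteLine(DetectOverflow(new[] { 0.1f, float.PositiveInfinity })); // True
```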

DetectOverflow(Vector<T>)

Checks if any gradient in a vector has overflowed.

public bool DetectOverflow(Vector<T> gradients)

Parameters

gradients Vector<T>

The vector of gradients to check.

Returns

bool

True if any gradient is NaN or infinity; otherwise, false.

HasOverflow(T)

Checks if a single value has overflowed (is NaN or infinity).

public bool HasOverflow(T value)

Parameters

value T

The value to check.

Returns

bool

True if the value is NaN or infinity; otherwise, false.

Reset(double?)

Resets the statistics and scale to initial values.

public void Reset(double? newInitialScale = null)

Parameters

newInitialScale double?

Optional new initial scale value.

ScaleLoss(T)

Scales the loss value to prevent gradient underflow.

public T ScaleLoss(T loss)

Parameters

loss T

The original loss value.

Returns

T

The scaled loss value.

Remarks

For Beginners: This multiplies your loss by the scale factor. The scaled loss is used for backpropagation, which makes all gradients proportionally larger.

ToString()

Gets a summary of the loss scaler's current state.

public override string ToString()

Returns

string

A string describing the current state.

UnscaleGradient(T)

Unscales a single gradient value.

public T UnscaleGradient(T gradient)

Parameters

gradient T

The scaled gradient value.

Returns

T

The unscaled gradient value.

UnscaleGradients(Tensor<T>)

Unscales all gradients in a tensor.

public void UnscaleGradients(Tensor<T> gradients)

Parameters

gradients Tensor<T>

The tensor of scaled gradients.

Remarks

For Beginners: This divides all gradient values by the scale factor, returning them to their true magnitudes for parameter updates.
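Because the scale is a power of two, dividing by it is exact in floating point, so unscaling recovers precisely the gradient that backpropagation would have produced without scaling. A standalone sketch of the round trip:

```csharp
using System;

double scale = 65536.0;
float trueGradient = 1e-4f;

// Backprop through a scaled loss yields scaled gradients...
float scaledGradient = (float)(trueGradient * scale);

// ...and unscaling divides them back to their true magnitude.
float unscaled = (float)(scaledGradient / scale);
Console.WriteLine(unscaled == trueGradient); // True: power-of-two scaling is lossless
```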

UnscaleGradients(Vector<T>)

Unscales all gradients in a vector.

public void UnscaleGradients(Vector<T> gradients)

Parameters

gradients Vector<T>

The vector of scaled gradients.

UnscaleGradientsAndCheck(Tensor<T>)

Unscales gradients and checks for overflow, updating the scale factor if dynamic scaling is enabled.

public bool UnscaleGradientsAndCheck(Tensor<T> gradients)

Parameters

gradients Tensor<T>

The tensor of scaled gradients.

Returns

bool

True if gradients are valid and update can proceed; false if overflow detected and update should be skipped.

Remarks

For Beginners: This is the main method to use in your training loop. It performs three steps:

  1. Unscales the gradients (divides by the scale factor)
  2. Checks if any gradients are NaN or infinity
  3. Adjusts the scale factor if dynamic scaling is enabled

If overflow is detected, you should skip the parameter update for this step.
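The three steps combine into a single gate in the training loop. A standalone sketch of the pattern, where a float array stands in for Tensor<T> and the scale-adjustment constants mirror the documented defaults (this is an illustration of the behavior, not the library's code):

```csharp
using System;
using System.Linq;

double scale = 65536.0;
int goodSteps = 0;

// Returns true when the (now unscaled) gradients are safe to apply.
bool UnscaleAndCheck(float[] gradients)
{
    // 1. Unscale in place.
    for (int i = 0; i < gradients.Length; i++)
        gradients[i] = (float)(gradients[i] / scale);

    // 2. Check for NaN/infinity.
    bool overflow = gradients.Any(g => float.IsNaN(g) || float.IsInfinity(g));

    // 3. Adjust the scale (dynamic scaling).
    if (overflow) { scale = Math.Max(1.0, scale * 0.5); goodSteps = 0; }
    else if (++goodSteps >= 2000) { scale = Math.Min(16777216.0, scale * 2.0); goodSteps = 0; }

    return !overflow;
}

Console.WriteLine(UnscaleAndCheck(new[] { 1f, float.NaN })); // False: skip this update
Console.WriteLine(scale);                                    // 32768
```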

UnscaleGradientsAndCheck(Vector<T>)

Unscales gradients and checks for overflow (vector version).

public bool UnscaleGradientsAndCheck(Vector<T> gradients)

Parameters

gradients Vector<T>

The vector of scaled gradients.

Returns

bool

True if gradients are valid; false if overflow detected.