
Class LAMBOptimizerOptions<T, TInput, TOutput>

Namespace
AiDotNet.Models.Options
Assembly
AiDotNet.dll

Configuration options for the LAMB (Layer-wise Adaptive Moments for Batch training) optimization algorithm.

public class LAMBOptimizerOptions<T, TInput, TOutput> : GradientBasedOptimizerOptions<T, TInput, TOutput>

Type Parameters

T
TInput
TOutput
Inheritance
OptimizationAlgorithmOptions<T, TInput, TOutput>
GradientBasedOptimizerOptions<T, TInput, TOutput>
LAMBOptimizerOptions<T, TInput, TOutput>

Examples

int batchSize = 8192;
var options = new LAMBOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    InitialLearningRate = 0.00176 * Math.Sqrt(batchSize),  // Square-root scaling for LAMB
    Beta1 = 0.9,
    Beta2 = 0.999,
    WeightDecay = 0.01,
    BatchSize = batchSize
};
var optimizer = new LAMBOptimizer<float, Matrix<float>, Vector<float>>(model, options);

Remarks

LAMB combines Adam's adaptive learning rates (first and second moment estimates) with LARS's layer-wise trust ratio scaling. This enables training with extremely large batch sizes (up to 32K) while maintaining training stability and accuracy.

For Beginners: LAMB is designed for training large models (like BERT, transformers) with very large batch sizes. It combines:

  • From Adam: Adaptive learning rates that adjust per-parameter based on gradient history
  • From LARS: Layer-wise scaling that stabilizes large batch training

The result is an optimizer that can train at batch sizes of 16K-32K while achieving the same accuracy as training with small batches, just much faster; a sketch of the per-layer update appears below.

Based on the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" by You et al. (2019).
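
The per-layer update described above can be sketched as follows. This is an illustrative sketch of the math, not the library's internal implementation; the variable names (weights, grad, m, v, step) and the sample values are hypothetical.

// Illustrative sketch of one LAMB step for a single layer, using plain arrays.
double beta1 = 0.9, beta2 = 0.999, eps = 1e-6, lr = 0.001, weightDecay = 0.01;
double maxTrustRatio = 10.0;
int step = 1;                                     // 1-based update count
double[] weights = { 0.50, -0.30, 0.80 };         // current layer weights (sample values)
double[] grad    = { 0.10, -0.20, 0.05 };         // gradient for this layer (sample values)
double[] m = new double[weights.Length];          // first moment (momentum)
double[] v = new double[weights.Length];          // second moment
double[] update = new double[weights.Length];

for (int i = 0; i < weights.Length; i++)
{
    // Adam-style moment estimates with bias correction.
    m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
    v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
    double mHat = m[i] / (1 - Math.Pow(beta1, step));
    double vHat = v[i] / (1 - Math.Pow(beta2, step));

    // Adam direction plus decoupled (AdamW-style) weight decay.
    update[i] = mHat / (Math.Sqrt(vHat) + eps) + weightDecay * weights[i];
}

// LARS-style layer-wise trust ratio: ||w|| / ||update||, clipped for stability.
double weightNorm = 0, updateNorm = 0;
for (int i = 0; i < weights.Length; i++)
{
    weightNorm += weights[i] * weights[i];
    updateNorm += update[i] * update[i];
}
weightNorm = Math.Sqrt(weightNorm);
updateNorm = Math.Sqrt(updateNorm);
double trustRatio = weightNorm > 0 && updateNorm > 0
    ? Math.Min(weightNorm / updateNorm, maxTrustRatio)
    : 1.0;

// Final update: Adam direction scaled by the layer-wise trust ratio.
for (int i = 0; i < weights.Length; i++)
    weights[i] -= lr * trustRatio * update[i];

The constants in this sketch correspond to the Beta1, Beta2, Epsilon, WeightDecay, and MaxTrustRatio properties documented below.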

Properties

BatchSize

Gets or sets the batch size for mini-batch gradient descent.

public int BatchSize { get; set; }

Property Value

int

A positive integer, defaulting to 8192 for large batch training.

Remarks

For Beginners: LAMB is designed for very large batch sizes. The default of 8192 is typical for BERT/transformer pretraining. You can go up to 32768 with proper learning rate scaling.

Beta1

Gets or sets the exponential decay rate for the first moment estimates (momentum).

public double Beta1 { get; set; }

Property Value

double

The beta1 value, defaulting to 0.9.

Remarks

For Beginners: Beta1 controls the momentum/smoothing of gradient estimates. A value of 0.9 is standard and works well for most applications.

Beta2

Gets or sets the exponential decay rate for the second moment estimates.

public double Beta2 { get; set; }

Property Value

double

The beta2 value, defaulting to 0.999.

Remarks

For Beginners: Beta2 controls how the optimizer adapts the learning rate based on historical gradient magnitudes. The default of 0.999 works well for most cases.

ClipTrustRatio

Gets or sets whether to clip the trust ratio to prevent extreme scaling.

public bool ClipTrustRatio { get; set; }

Property Value

bool

True to enable clipping (default), false to disable.

Remarks

The trust ratio ||w|| / ||r|| can sometimes become very large, causing instability. Clipping limits the ratio to [0, max_trust_ratio] for more stable training.

For Beginners: Keeping this enabled (default) prevents training from becoming unstable when layer weights are much larger than their updates.
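
As a numeric illustration (the norms below are hypothetical, not taken from a real run):

// A layer whose weight norm is 50 and whose update norm is 2 would get a raw
// trust ratio of 25; with clipping and MaxTrustRatio = 10.0 the applied ratio is 10.
double weightNorm = 50.0, updateNorm = 2.0, maxTrustRatio = 10.0;
double trustRatio = Math.Min(weightNorm / updateNorm, maxTrustRatio);  // 10.0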

Epsilon

Gets or sets a small constant added to denominators to prevent division by zero.

public double Epsilon { get; set; }

Property Value

double

The epsilon value, defaulting to 1e-6.

Remarks

For Beginners: This is a tiny safety value to prevent numerical issues. LAMB typically uses 1e-6 (slightly larger than Adam's 1e-8) for better stability.

ExcludeBiasFromWeightDecay

Gets or sets whether to exclude bias and normalization parameters from weight decay.

public bool ExcludeBiasFromWeightDecay { get; set; }

Property Value

bool

True to exclude biases from weight decay (default), false to apply to all.

Remarks

Following best practices for transformer training, bias terms and normalization layer parameters (BatchNorm, LayerNorm) are typically excluded from weight decay.

For Beginners: Bias terms are small and don't benefit from weight decay. Keeping this true (default) follows established best practices.

InitialLearningRate

Gets or sets the base learning rate for the LAMB optimizer.

public override double InitialLearningRate { get; set; }

Property Value

double

The learning rate, defaulting to 0.001.

Remarks

For Beginners: Unlike LARS which uses linear scaling, LAMB typically uses square root scaling: LR = base_lr * sqrt(batch_size / 256). The default of 0.001 is a good starting point for transformer models.
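
A worked example of the square root scaling rule from the remarks, assuming an options instance like the one in the Examples section (the batch size of 16384 is only an illustration):

// LR = base_lr * sqrt(batch_size / 256); with base_lr = 0.001 and batch size 16384,
// sqrt(16384 / 256) = sqrt(64) = 8, so the scaled learning rate is 0.008.
double baseLr = 0.001;
int batchSize = 16384;
options.InitialLearningRate = baseLr * Math.Sqrt(batchSize / 256.0);  // 0.008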

LayerBoundaries

Gets or sets the layer size boundaries for layer-wise scaling.

public int[]? LayerBoundaries { get; set; }

Property Value

int[]

Array of layer sizes that define boundaries between layers for LAMB scaling.

Remarks

LAMB applies different scaling factors to different layers. This array defines the cumulative sizes of parameters that belong to each layer.

For Beginners: This tells LAMB where one layer ends and another begins. If not set, all parameters are treated as one layer, which is less optimal.
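
For example, under the cumulative-size interpretation described above, a model whose flattened parameters contain three layers of 4096, 1024, and 256 values (hypothetical sizes) could be configured as:

// Each entry marks where a layer ends within the flattened parameter vector (assumed semantics).
options.LayerBoundaries = new[] { 4096, 4096 + 1024, 4096 + 1024 + 256 };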

MaxTrustRatio

Gets or sets the maximum trust ratio when clipping is enabled.

public double MaxTrustRatio { get; set; }

Property Value

double

The maximum trust ratio, defaulting to 10.0.

Remarks

For Beginners: This limits how much the layer-wise scaling can amplify updates. The default of 10.0 is well-tested for transformer training.

SkipTrustRatioLayers

Gets or sets which layers should skip trust ratio scaling and use only Adam updates.

public int[]? SkipTrustRatioLayers { get; set; }

Property Value

int[]

Array of layer indices to skip trust ratio scaling for.

Remarks

Some layers (particularly embedding layers) may work better without trust ratio scaling. These layers use only the Adam update without the layer-wise scaling factor.

For Beginners: Embedding layers in transformers often benefit from being excluded from trust ratio scaling. Set this to skip those layers.
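
For instance, if the token embedding happens to be layer index 0 in your model (an assumption made purely for illustration), it can be excluded like this:

// Layer index 0 (assumed to be the embedding layer) uses plain Adam updates only.
options.SkipTrustRatioLayers = new[] { 0 };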

UseBiasCorrection

Gets or sets whether to use bias correction for the moment estimates.

public bool UseBiasCorrection { get; set; }

Property Value

bool

True to enable bias correction (default), false to disable.

Remarks

Bias correction adjusts for the fact that moment estimates are initialized to zero, which would otherwise cause them to be biased toward zero early in training.

For Beginners: Always keep this enabled (default). It's essential for correct Adam-style moment estimates, especially at the start of training.
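
A quick numeric illustration of why the correction matters at the start of training (values are illustrative):

// At step 1 with Beta1 = 0.9 and a gradient of 1.0, the raw first moment is only
// m = (1 - 0.9) * 1.0 = 0.1, biased toward its zero initialization.
// Dividing by (1 - Beta1^step) restores the expected magnitude.
double beta1 = 0.9, grad = 1.0;
int step = 1;
double m = (1 - beta1) * grad;                   // 0.1
double mHat = m / (1 - Math.Pow(beta1, step));   // 1.0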

WarmupEpochs

Gets or sets the number of warmup epochs for learning rate warmup.

public int WarmupEpochs { get; set; }

Property Value

int

The number of warmup epochs, defaulting to 1.

Remarks

Learning rate warmup gradually increases the learning rate from 0 to the target value. LAMB typically uses shorter warmup than LARS due to its adaptive nature.

For Beginners: LAMB is more stable than LARS, so it needs less warmup. 1 epoch of warmup is typically sufficient for most cases.
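
A minimal sketch of warmup, assuming the schedule ramps the learning rate linearly over WarmupEpochs epochs (the helper below is hypothetical, not a library API):

// Ramps the effective learning rate from 0 up to the target over the warmup period.
double WarmupLearningRate(double targetLr, int epoch, int warmupEpochs) =>
    epoch < warmupEpochs
        ? targetLr * (epoch + 1) / warmupEpochs
        : targetLr;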

WeightDecay

Gets or sets the weight decay coefficient.

public double WeightDecay { get; set; }

Property Value

double

The weight decay coefficient, defaulting to 0.01.

Remarks

For Beginners: Weight decay prevents model weights from growing too large. LAMB applies decoupled weight decay (like AdamW) for better regularization. A value of 0.01 is typical for transformer training.
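
To make the "decoupled" part concrete, here is a tiny sketch with illustrative values: the decay term is added to the update itself rather than folded into the gradient, so it is not rescaled by the adaptive second-moment denominator.

// Illustrative values only, not a real training state.
double weight = 0.5, mHat = 0.2, vHat = 0.04, eps = 1e-6, weightDecay = 0.01;
double adamDirection = mHat / (Math.Sqrt(vHat) + eps);    // ~1.0
double update = adamDirection + weightDecay * weight;     // ~1.005 (decay added outside the Adam term)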