Class LAMBOptimizerOptions<T, TInput, TOutput>
Configuration options for the LAMB (Layer-wise Adaptive Moments for Batch training) optimization algorithm.
public class LAMBOptimizerOptions<T, TInput, TOutput> : GradientBasedOptimizerOptions<T, TInput, TOutput>
Type Parameters
- T
- TInput
- TOutput
- Inheritance
OptimizationAlgorithmOptions<T, TInput, TOutput>
GradientBasedOptimizerOptions<T, TInput, TOutput>
LAMBOptimizerOptions<T, TInput, TOutput>
- Inherited Members
Examples
int batchSize = 8192;
var options = new LAMBOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    InitialLearningRate = 0.00176 * Math.Sqrt(batchSize), // Square root scaling for LAMB
    Beta1 = 0.9,
    Beta2 = 0.999,
    WeightDecay = 0.01,
    BatchSize = batchSize
};
var optimizer = new LAMBOptimizer<float, Matrix<float>, Vector<float>>(model, options);
Remarks
LAMB combines Adam's adaptive learning rates (first and second moment estimates) with LARS's layer-wise trust ratio scaling. This enables training with extremely large batch sizes (up to 32K) while maintaining training stability and accuracy.
For Beginners: LAMB is designed for training large models (like BERT and other transformers) with very large batch sizes. It combines:
- From Adam: Adaptive learning rates that adjust per-parameter based on gradient history
- From LARS: Layer-wise scaling that stabilizes large batch training
Based on the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" by You et al. (2019).
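The update rule can be sketched as follows. This is a minimal illustration of one LAMB step for a single layer, not this library's internal implementation; the array names (w, g, m, v) and the helper code are illustrative.
// One LAMB step for a single layer (illustrative; requires System and System.Linq).
// w = layer weights, g = layer gradient, m/v = per-parameter moment estimates, step counts from 1.
static void LambStep(double[] w, double[] g, double[] m, double[] v, int step,
    double lr, double beta1, double beta2, double eps, double weightDecay)
{
    var r = new double[w.Length];
    for (int i = 0; i < w.Length; i++)
    {
        m[i] = beta1 * m[i] + (1 - beta1) * g[i];                     // first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];              // second moment
        double mHat = m[i] / (1 - Math.Pow(beta1, step));             // bias correction
        double vHat = v[i] / (1 - Math.Pow(beta2, step));
        r[i] = mHat / (Math.Sqrt(vHat) + eps) + weightDecay * w[i];   // Adam update + decoupled decay
    }
    double wNorm = Math.Sqrt(w.Sum(x => x * x));
    double rNorm = Math.Sqrt(r.Sum(x => x * x));
    double trust = (wNorm > 0 && rNorm > 0) ? wNorm / rNorm : 1.0;    // layer-wise trust ratio (LARS-style)
    for (int i = 0; i < w.Length; i++)
        w[i] -= lr * trust * r[i];                                    // scaled parameter update
}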
Properties
BatchSize
Gets or sets the batch size for mini-batch gradient descent.
public int BatchSize { get; set; }
Property Value
- int
A positive integer, defaulting to 8192 for large batch training.
Remarks
For Beginners: LAMB is designed for very large batch sizes. The default of 8192 is typical for BERT/transformer pretraining. You can go up to 32768 with proper learning rate scaling.
Beta1
Gets or sets the exponential decay rate for the first moment estimates (momentum).
public double Beta1 { get; set; }
Property Value
- double
The beta1 value, defaulting to 0.9.
Remarks
For Beginners: Beta1 controls the momentum/smoothing of gradient estimates. A value of 0.9 is standard and works well for most applications.
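Conceptually, Beta1 drives an exponential moving average of gradients (m and gradient are illustrative per-parameter values, not part of this API):
m = Beta1 * m + (1 - Beta1) * gradient; // higher Beta1 = smoother but slower-reacting momentum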
Beta2
Gets or sets the exponential decay rate for the second moment estimates.
public double Beta2 { get; set; }
Property Value
- double
The beta2 value, defaulting to 0.999.
Remarks
For Beginners: Beta2 controls how the optimizer adapts the learning rate based on historical gradient magnitudes. The default of 0.999 works well for most cases.
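Conceptually, Beta2 drives a running average of squared gradients that scales each parameter's step (illustrative, ignoring bias correction):
v = Beta2 * v + (1 - Beta2) * gradient * gradient; // history of gradient magnitudes
update = m / (Math.Sqrt(v) + Epsilon);             // large historical gradients shrink the step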
ClipTrustRatio
Gets or sets whether to clip the trust ratio to prevent extreme scaling.
public bool ClipTrustRatio { get; set; }
Property Value
- bool
True to enable clipping (default), false to disable.
Remarks
The trust ratio ||w|| / ||r|| (the ratio of a layer's weight norm to its update norm) can sometimes become very large, causing instability. Clipping limits the ratio to [0, MaxTrustRatio] for more stable training.
For Beginners: Keeping this enabled (default) prevents training from becoming unstable when layer weights are much larger than their updates.
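A sketch of the clipping, using the MaxTrustRatio property documented below (weightNorm and updateNorm stand for illustrative layer norms):
double trustRatio = Math.Min(weightNorm / updateNorm, options.MaxTrustRatio); // bounded scaling factor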
Epsilon
Gets or sets a small constant added to denominators to prevent division by zero.
public double Epsilon { get; set; }
Property Value
- double
The epsilon value, defaulting to 1e-6.
Remarks
For Beginners: This is a tiny safety value to prevent numerical issues. LAMB typically uses 1e-6 (slightly larger than Adam's 1e-8) for better stability.
ExcludeBiasFromWeightDecay
Gets or sets whether to exclude bias and normalization parameters from weight decay.
public bool ExcludeBiasFromWeightDecay { get; set; }
Property Value
- bool
True to exclude biases from weight decay (default), false to apply to all.
Remarks
Following best practices for transformer training, bias terms and normalization layer parameters (BatchNorm, LayerNorm) are typically excluded from weight decay.
For Beginners: Bias terms are small and don't benefit from weight decay. Keeping this true (default) follows established best practices.
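For example, a typical transformer-style configuration, reusing the options object from the example above:
options.WeightDecay = 0.01;                // decay applied to weight matrices
options.ExcludeBiasFromWeightDecay = true; // biases and LayerNorm/BatchNorm parameters are left undecayed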
InitialLearningRate
Gets or sets the base learning rate for the LAMB optimizer.
public override double InitialLearningRate { get; set; }
Property Value
- double
The learning rate, defaulting to 0.001.
Remarks
For Beginners: Unlike LARS, which uses linear scaling, LAMB typically uses square-root scaling: LR = base_lr * sqrt(batch_size / 256). The default of 0.001 is a good starting point for transformer models.
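A minimal sketch of that square-root scaling rule (the variable names are illustrative; the base batch size of 256 comes from the formula above):
double baseLearningRate = 0.001;
int batchSize = 8192;
options.InitialLearningRate = baseLearningRate * Math.Sqrt(batchSize / 256.0); // ≈ 0.0057 for a batch of 8192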
LayerBoundaries
Gets or sets the layer size boundaries for layer-wise scaling.
public int[]? LayerBoundaries { get; set; }
Property Value
- int[]
Array of layer sizes that define boundaries between layers for LAMB scaling.
Remarks
LAMB applies different scaling factors to different layers. This array defines the cumulative sizes of parameters that belong to each layer.
For Beginners: This tells LAMB where one layer ends and another begins. If not set, all parameters are treated as a single layer, which reduces the benefit of layer-wise scaling.
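For example, assuming the array holds cumulative end offsets per layer (one reading of the "cumulative sizes" described above), a hypothetical three-layer fully connected model could be configured as:
// Hypothetical model: 784x128 weights + 128 biases, 128x64 + 64, 64x10 + 10.
options.LayerBoundaries = new[]
{
    784 * 128 + 128,                                  // end of layer 1 = 100,480 parameters
    784 * 128 + 128 + 128 * 64 + 64,                  // end of layer 2 = 108,736
    784 * 128 + 128 + 128 * 64 + 64 + 64 * 10 + 10    // end of layer 3 = 109,386
};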
MaxTrustRatio
Gets or sets the maximum trust ratio when clipping is enabled.
public double MaxTrustRatio { get; set; }
Property Value
- double
The maximum trust ratio, defaulting to 10.0.
Remarks
For Beginners: This limits how much the layer-wise scaling can amplify updates. The default of 10.0 is well-tested for transformer training.
SkipTrustRatioLayers
Gets or sets which layers should skip trust ratio scaling and use only Adam updates.
public int[]? SkipTrustRatioLayers { get; set; }
Property Value
- int[]
Array of layer indices to skip trust ratio scaling for.
Remarks
Some layers (particularly embedding layers) may work better without trust ratio scaling. These layers use only the Adam update without the layer-wise scaling factor.
For Beginners: Embedding layers in transformers often benefit from being excluded from trust ratio scaling. Set this to skip those layers.
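For example, assuming the token embedding is layer 0 in a given model (the index is purely illustrative):
options.SkipTrustRatioLayers = new[] { 0 }; // embedding layer falls back to the plain Adam update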
UseBiasCorrection
Gets or sets whether to use bias correction for the moment estimates.
public bool UseBiasCorrection { get; set; }
Property Value
- bool
True to enable bias correction (default), false to disable.
Remarks
Bias correction adjusts for the fact that moment estimates are initialized to zero, which would otherwise cause them to be biased toward zero early in training.
For Beginners: Always keep this enabled (default). It's essential for correct Adam-style moment estimates, especially at the start of training.
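The standard Adam-style correction this refers to looks like the following (illustrative per-parameter values, with step counting from 1):
double mHat = m / (1 - Math.Pow(Beta1, step)); // correction is large on early steps...
double vHat = v / (1 - Math.Pow(Beta2, step)); // ...and fades toward 1 as training progresses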
WarmupEpochs
Gets or sets the number of warmup epochs for learning rate warmup.
public int WarmupEpochs { get; set; }
Property Value
- int
The number of warmup epochs, defaulting to 1.
Remarks
Learning rate warmup gradually increases the learning rate from 0 to the target value. LAMB typically uses shorter warmup than LARS due to its adaptive nature.
For Beginners: LAMB is more stable than LARS, so it needs less warmup. 1 epoch of warmup is typically sufficient for most cases.
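A sketch of a linear warmup schedule driven by this property (the scheduling code and currentEpoch are illustrative, not the library's internals):
double WarmupScale(int epoch, int warmupEpochs) =>
    epoch < warmupEpochs ? epoch / (double)warmupEpochs : 1.0;   // ramps from 0 up to the full rate
double effectiveLr = options.InitialLearningRate * WarmupScale(currentEpoch, options.WarmupEpochs);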
WeightDecay
Gets or sets the weight decay coefficient.
public double WeightDecay { get; set; }
Property Value
- double
The weight decay coefficient, defaulting to 0.01.
Remarks
For Beginners: Weight decay prevents model weights from growing too large. LAMB applies decoupled weight decay (like AdamW) for better regularization. A value of 0.01 is typical for transformer training.
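In decoupled (AdamW-style) decay, the penalty is added to the update rather than folded into the gradient, so it is not rescaled by the adaptive denominator. An illustrative per-parameter form:
update = mHat / (Math.Sqrt(vHat) + Epsilon) + WeightDecay * weight; // decay term bypasses the moment estimates
weight -= learningRate * trustRatio * update;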