Class LAMBOptimizerOptions<T, TInput, TOutput>
Configuration options for the LAMB (Layer-wise Adaptive Moments for Batch training) optimization algorithm.
public class LAMBOptimizerOptions<T, TInput, TOutput> : GradientBasedOptimizerOptions<T, TInput, TOutput>
Type Parameters
- T
- TInput
- TOutput
- Inheritance
OptimizationAlgorithmOptions<T, TInput, TOutput>
GradientBasedOptimizerOptions<T, TInput, TOutput>
LAMBOptimizerOptions<T, TInput, TOutput>
- Inherited Members
Examples
int batchSize = 8192;
var options = new LAMBOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    InitialLearningRate = 0.00176 * Math.Sqrt(batchSize), // Square root scaling for LAMB
    Beta1 = 0.9,
    Beta2 = 0.999,
    WeightDecay = 0.01,
    BatchSize = batchSize
};
var optimizer = new LAMBOptimizer<float, Matrix<float>, Vector<float>>(model, options);
Remarks
LAMB combines Adam's adaptive learning rates (first and second moment estimates) with LARS's layer-wise trust ratio scaling. This enables training with extremely large batch sizes (up to 32K) while maintaining training stability and accuracy.
For Beginners: LAMB is designed for training large models (like BERT and other transformers) with very large batch sizes. It combines:
- From Adam: Adaptive learning rates that adjust per-parameter based on gradient history
- From LARS: Layer-wise scaling that stabilizes large batch training
Based on the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" by You et al. (2019).
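The update rule can be sketched as follows. This is a minimal illustration of one LAMB step for a single layer, not this library's internal implementation; the array names (w, g, m, v) and the helper code are illustrative.
// One LAMB step for a single layer (illustrative; requires System and System.Linq).
// w = layer weights, g = layer gradient, m/v = per-parameter moment estimates, step counts from 1.
static void LambStep(double[] w, double[] g, double[] m, double[] v, int step,
    double lr, double beta1, double beta2, double eps, double weightDecay)
{
    var r = new double[w.Length];
    for (int i = 0; i < w.Length; i++)
    {
        m[i] = beta1 * m[i] + (1 - beta1) * g[i];                     // first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];              // second moment
        double mHat = m[i] / (1 - Math.Pow(beta1, step));             // bias correction
        double vHat = v[i] / (1 - Math.Pow(beta2, step));
        r[i] = mHat / (Math.Sqrt(vHat) + eps) + weightDecay * w[i];   // Adam update + decoupled decay
    }
    double wNorm = Math.Sqrt(w.Sum(x => x * x));
    double rNorm = Math.Sqrt(r.Sum(x => x * x));
    double trust = (wNorm > 0 && rNorm > 0) ? wNorm / rNorm : 1.0;    // layer-wise trust ratio (LARS-style)
    for (int i = 0; i < w.Length; i++)
        w[i] -= lr * trust * r[i];                                    // scaled parameter update
}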
Properties
BatchSize
Gets or sets the batch size for mini-batch gradient descent.
public int BatchSize { get; set; }
Property Value
- int
A positive integer, defaulting to 8192 for large batch training.
Remarks
For Beginners: LAMB is designed for very large batch sizes. The default of 8192 is typical for BERT/transformer pretraining. You can go up to 32768 with proper learning rate scaling.
Beta1
Gets or sets the exponential decay rate for the first moment estimates (momentum).
public double Beta1 { get; set; }
Property Value
- double
The beta1 value, defaulting to 0.9.
Remarks
For Beginners: Beta1 controls the momentum/smoothing of gradient estimates. A value of 0.9 is standard and works well for most applications.
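Conceptually, Beta1 drives an exponential moving average of gradients (m and gradient are illustrative per-parameter values, not part of this API):
m = Beta1 * m + (1 - Beta1) * gradient; // higher Beta1 = smoother but slower-reacting momentum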
Beta2
Gets or sets the exponential decay rate for the second moment estimates.
public double Beta2 { get; set; }
Property Value
- double
The beta2 value, defaulting to 0.999.
Remarks
For Beginners: Beta2 controls how the optimizer adapts the learning rate based on historical gradient magnitudes. The default of 0.999 works well for most cases.
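Conceptually, Beta2 drives a running average of squared gradients that scales each parameter's step (illustrative, ignoring bias correction):
v = Beta2 * v + (1 - Beta2) * gradient * gradient; // history of gradient magnitudes
update = m / (Math.Sqrt(v) + Epsilon);             // large historical gradients shrink the step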
ClipTrustRatio
Gets or sets whether to clip the trust ratio to prevent extreme scaling.
public bool ClipTrustRatio { get; set; }
Property Value
- bool
True to enable clipping (default), false to disable.
Remarks
The trust ratio ||w|| / ||r|| (the ratio of a layer's weight norm to its update norm) can sometimes become very large, causing instability. Clipping limits the ratio to [0, MaxTrustRatio] for more stable training.
For Beginners: Keeping this enabled (default) prevents training from becoming unstable when layer weights are much larger than their updates.
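A sketch of the clipping, using the MaxTrustRatio property documented below (weightNorm and updateNorm stand for illustrative layer norms):
double trustRatio = Math.Min(weightNorm / updateNorm, options.MaxTrustRatio); // bounded scaling factor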
Epsilon
Gets or sets a small constant added to denominators to prevent division by zero.
public double Epsilon { get; set; }
Property Value
- double
The epsilon value, defaulting to 1e-6.
Remarks
For Beginners: This is a tiny safety value to prevent numerical issues. LAMB typically uses 1e-6 (slightly larger than Adam's 1e-8) for better stability.
ExcludeBiasFromWeightDecay
Gets or sets whether to exclude bias and normalization parameters from weight decay.
public bool ExcludeBiasFromWeightDecay { get; set; }
Property Value
- bool
True to exclude biases from weight decay (default), false to apply to all.
Remarks
Following best practices for transformer training, bias terms and normalization layer parameters (BatchNorm, LayerNorm) are typically excluded from weight decay.
For Beginners: Bias terms are small and don't benefit from weight decay. Keeping this true (default) follows established best practices.
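For example, a typical transformer-style configuration, reusing the options object from the example above:
options.WeightDecay = 0.01;                // decay applied to weight matrices
options.ExcludeBiasFromWeightDecay = true; // biases and LayerNorm/BatchNorm parameters are left undecayed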
InitialLearningRate
Gets or sets the base learning rate for the LAMB optimizer.
public override double InitialLearningRate { get; set; }
Property Value
- double
The learning rate, defaulting to 0.001.
Remarks
For Beginners: Unlike LARS, which uses linear scaling, LAMB typically uses square-root scaling: LR = base_lr * sqrt(batch_size / 256). The default of 0.001 is a good starting point for transformer models.
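A minimal sketch of that square-root scaling rule (the variable names are illustrative; the base batch size of 256 comes from the formula above):
double baseLearningRate = 0.001;
int batchSize = 8192;
options.InitialLearningRate = baseLearningRate * Math.Sqrt(batchSize / 256.0); // ≈ 0.0057 for a batch of 8192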
LayerBoundaries
Gets or sets the layer size boundaries for layer-wise scaling.
public int[]? LayerBoundaries { get; set; }
Property Value
- int[]
Array of layer sizes that define boundaries between layers for LAMB scaling.
Remarks
LAMB applies different scaling factors to different layers. This array defines the cumulative sizes of parameters that belong to each layer.
For Beginners: This tells LAMB where one layer ends and another begins. If not set, all parameters are treated as a single layer, which reduces the benefit of layer-wise scaling.
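For example, assuming the array holds cumulative end offsets per layer (one reading of the "cumulative sizes" described above), a hypothetical three-layer fully connected model could be configured as:
// Hypothetical model: 784x128 weights + 128 biases, 128x64 + 64, 64x10 + 10.
options.LayerBoundaries = new[]
{
    784 * 128 + 128,                                  // end of layer 1 = 100,480 parameters
    784 * 128 + 128 + 128 * 64 + 64,                  // end of layer 2 = 108,736
    784 * 128 + 128 + 128 * 64 + 64 + 64 * 10 + 10    // end of layer 3 = 109,386
};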
MaxTrustRatio
Gets or sets the maximum trust ratio when clipping is enabled.
public double MaxTrustRatio { get; set; }
Property Value
- double
The maximum trust ratio, defaulting to 10.0.
Remarks
For Beginners: This limits how much the layer-wise scaling can amplify updates. The default of 10.0 is well-tested for transformer training.
SkipTrustRatioLayers
Gets or sets which layers should skip trust ratio scaling and use only Adam updates.
public int[]? SkipTrustRatioLayers { get; set; }
Property Value
- int[]
Array of layer indices to skip trust ratio scaling for.
Remarks
Some layers (particularly embedding layers) may work better without trust ratio scaling. These layers use only the Adam update without the layer-wise scaling factor.
For Beginners: Embedding layers in transformers often benefit from being excluded from trust ratio scaling. Set this to skip those layers.
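For example, assuming the token embedding is layer 0 in a given model (the index is purely illustrative):
options.SkipTrustRatioLayers = new[] { 0 }; // embedding layer falls back to the plain Adam update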
UseBiasCorrection
Gets or sets whether to use bias correction for the moment estimates.
public bool UseBiasCorrection { get; set; }
Property Value
- bool
True to enable bias correction (default), false to disable.
Remarks
Bias correction adjusts for the fact that moment estimates are initialized to zero, which would otherwise cause them to be biased toward zero early in training.
For Beginners: Always keep this enabled (default). It's essential for correct Adam-style moment estimates, especially at the start of training.
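The standard Adam-style correction this refers to looks like the following (illustrative per-parameter values, with step counting from 1):
double mHat = m / (1 - Math.Pow(Beta1, step)); // correction is large on early steps...
double vHat = v / (1 - Math.Pow(Beta2, step)); // ...and fades toward 1 as training progresses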
WarmupEpochs
Gets or sets the number of warmup epochs for learning rate warmup.
public int WarmupEpochs { get; set; }
Property Value
- int
The number of warmup epochs, defaulting to 1.
Remarks
Learning rate warmup gradually increases the learning rate from 0 to the target value. LAMB typically uses shorter warmup than LARS due to its adaptive nature.
For Beginners: LAMB is more stable than LARS, so it needs less warmup. 1 epoch of warmup is typically sufficient for most cases.
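A sketch of a linear warmup schedule driven by this property (the scheduling code and currentEpoch are illustrative, not the library's internals):
double WarmupScale(int epoch, int warmupEpochs) =>
    epoch < warmupEpochs ? epoch / (double)warmupEpochs : 1.0;   // ramps from 0 up to the full rate
double effectiveLr = options.InitialLearningRate * WarmupScale(currentEpoch, options.WarmupEpochs);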
WeightDecay
Gets or sets the weight decay coefficient.
public double WeightDecay { get; set; }
Property Value
- double
The weight decay coefficient, defaulting to 0.01.
Remarks
For Beginners: Weight decay prevents model weights from growing too large. LAMB applies decoupled weight decay (like AdamW) for better regularization. A value of 0.01 is typical for transformer training.
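In decoupled (AdamW-style) decay, the penalty is added to the update rather than folded into the gradient, so it is not rescaled by the adaptive denominator. An illustrative per-parameter form:
update = mHat / (Math.Sqrt(vHat) + Epsilon) + WeightDecay * weight; // decay term bypasses the moment estimates
weight -= learningRate * trustRatio * update;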