Class AdamWOptimizerOptions<T, TInput, TOutput>
Configuration options for the AdamW optimization algorithm with decoupled weight decay.
public class AdamWOptimizerOptions<T, TInput, TOutput> : GradientBasedOptimizerOptions<T, TInput, TOutput>
Type Parameters
- T
- TInput
- TOutput
Inheritance
OptimizationAlgorithmOptions<T, TInput, TOutput> → GradientBasedOptimizerOptions<T, TInput, TOutput> → AdamWOptimizerOptions<T, TInput, TOutput>
Remarks
AdamW (Adam with decoupled weight decay) differs from Adam with L2 regularization. In Adam with L2, the weight-decay term is added to the gradient before the adaptive learning rate is computed, so the decay gets scaled by the moment estimates. In AdamW, weight decay is applied directly to the weights, separately from the adaptive update, which has been shown to improve generalization.
For Beginners: AdamW is an improved version of Adam that handles weight decay (a technique to prevent overfitting) in a mathematically cleaner way. The difference might seem subtle, but AdamW consistently achieves better results than Adam with L2 regularization, especially when training large models like transformers. If you're not sure which to use, AdamW is generally the better choice.
Based on the paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter.
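The contrast is easiest to see in code. The sketch below shows one update for a single parameter, assuming bias-corrected moment estimates have already been computed; all names (w, mHat, vHat, and so on) are illustrative and not part of this API:

```csharp
using System;

double w = 0.5;                  // parameter being updated
double grad = 0.1;               // raw gradient for w
double lr = 0.001;               // learning rate
double weightDecay = 0.01;       // decay coefficient
double mHat = 0.1, vHat = 0.01;  // bias-corrected moment estimates
double eps = 1e-8;

// Adam with L2 regularization: decay is folded into the gradient,
// so it passes through the adaptive scaling with everything else.
double l2Grad = grad + weightDecay * w;
// ... l2Grad would then feed the moment estimates before the update.

// AdamW: the moments are built from the raw gradient, and decay
// acts on the weight directly, outside the adaptive scaling.
w -= lr * (mHat / (Math.Sqrt(vHat) + eps) + weightDecay * w);
```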
Properties
BatchSize
Gets or sets the batch size for mini-batch gradient descent.
public int BatchSize { get; set; }
Property Value
- int
A positive integer, defaulting to 32.
Remarks
For Beginners: The batch size controls how many examples the optimizer looks at before making an update to the model. The default of 32 is a good balance for AdamW.
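As a quick back-of-the-envelope example (illustrative only, not part of this API):

```csharp
// With BatchSize = 32, a dataset of 3,200 examples produces
// 100 parameter updates per pass over the data (epoch).
int datasetSize = 3200;
int batchSize = 32;
int updatesPerEpoch = datasetSize / batchSize; // 100
```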
Beta1
Gets or sets the exponential decay rate for the first moment estimates (momentum).
public double Beta1 { get; set; }
Property Value
- double
The beta1 value, defaulting to 0.9.
Remarks
For Beginners: Beta1 controls the optimizer's momentum. A value of 0.9 means each update keeps 90% of the accumulated gradient history and mixes in 10% of the new gradient. Higher values make updates smoother but slower to adapt when the gradient direction changes.
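The recurrence this describes, sketched with illustrative names (m and g are not part of this API):

```csharp
double beta1 = 0.9;
double m = 0.0; // running first-moment (momentum) estimate
foreach (double g in new[] { 1.0, 0.8, 1.2 }) // example gradient stream
{
    // Keep 90% of the accumulated history, mix in 10% of the new gradient.
    m = beta1 * m + (1 - beta1) * g;
}
```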
Beta2
Gets or sets the exponential decay rate for the second moment estimates (adaptive learning rate).
public double Beta2 { get; set; }
Property Value
- double
The beta2 value, defaulting to 0.999.
Remarks
For Beginners: Beta2 controls how the optimizer adapts the learning rate for each parameter based on historical gradient magnitudes. The default of 0.999 works well for most cases.
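The corresponding second-moment recurrence, again sketched with illustrative names:

```csharp
double beta2 = 0.999;
double v = 0.0; // running estimate of squared gradient magnitude
foreach (double g in new[] { 1.0, 0.8, 1.2 }) // example gradient stream
{
    // Parameters with consistently large gradients accumulate a large v,
    // which shrinks their effective step size in the update.
    v = beta2 * v + (1 - beta2) * g * g;
}
```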
Epsilon
Gets or sets a small constant added to denominators to prevent division by zero.
public double Epsilon { get; set; }
Property Value
- double
The epsilon value, defaulting to 1e-8.
Remarks
For Beginners: This is a tiny safety value to prevent numerical issues. You rarely need to change this unless you experience NaN values during training.
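Epsilon sits in the denominator of the parameter update. A sketch of why it matters (mHat and vHat are illustrative names for the bias-corrected moments, not part of this API):

```csharp
using System;

double lr = 0.001, epsilon = 1e-8;
double mHat = 0.05; // bias-corrected first moment (illustrative value)
double vHat = 0.0;  // second moment can be ~0 early in training
// Without epsilon, this would divide by zero when vHat is 0.
double step = lr * mHat / (Math.Sqrt(vHat) + epsilon);
```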
InitialLearningRate
Gets or sets the initial learning rate for the AdamW optimizer.
public override double InitialLearningRate { get; set; }
Property Value
- double
The learning rate, defaulting to 0.001.
Remarks
For Beginners: The learning rate controls how big each step is during training. AdamW typically uses similar learning rates to Adam (0.001 is a good starting point). For fine-tuning pre-trained models, smaller values like 2e-5 to 5e-5 are common.
MaxBeta1
Gets or sets the maximum allowed value for Beta1.
public double MaxBeta1 { get; set; }
Property Value
- double
The maximum Beta1 value, defaulting to 0.999.
MaxBeta2
Gets or sets the maximum allowed value for Beta2.
public double MaxBeta2 { get; set; }
Property Value
- double
The maximum Beta2 value, defaulting to 0.9999.
MinBeta1
Gets or sets the minimum allowed value for Beta1.
public double MinBeta1 { get; set; }
Property Value
- double
The minimum Beta1 value, defaulting to 0.8.
MinBeta2
Gets or sets the minimum allowed value for Beta2.
public double MinBeta2 { get; set; }
Property Value
- double
The minimum Beta2 value, defaulting to 0.8.
UseAMSGrad
Gets or sets whether to apply AMSGrad variant for improved convergence guarantees.
public bool UseAMSGrad { get; set; }
Property Value
- bool
True to use the AMSGrad variant, false for standard AdamW; defaults to false.
Remarks
For Beginners: AMSGrad is a modification that maintains the maximum of past squared gradients rather than an exponential average. This can help in some cases where standard Adam/AdamW might not converge properly, though in practice the difference is often small.
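A sketch of the difference (v and vHat are illustrative names): standard AdamW uses the exponential average v directly, while AMSGrad tracks its running maximum:

```csharp
using System;

double vHat = 0.0; // running maximum of second-moment estimates
foreach (double v in new[] { 0.04, 0.09, 0.02 }) // example v values over steps
{
    // The maximum never decreases, so the effective per-parameter
    // learning rate can only shrink, which aids convergence guarantees.
    vHat = Math.Max(vHat, v);
}
```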
UseAdaptiveBetas
Gets or sets whether to automatically adjust the Beta parameters during training.
public bool UseAdaptiveBetas { get; set; }
Property Value
- bool
True to use adaptive betas, false otherwise; defaults to true.
Remarks
For Beginners: When enabled, the algorithm can automatically adjust how much it relies on past information based on training progress. This can help the optimizer adapt to different phases of learning.
WeightDecay
Gets or sets the weight decay coefficient (L2 penalty).
public double WeightDecay { get; set; }
Property Value
- double
The weight decay coefficient, defaulting to 0.01.
Remarks
Unlike L2 regularization in standard Adam, AdamW applies weight decay directly to the weights, not through the gradient. This decoupling leads to better generalization.
For Beginners: Weight decay is a regularization technique that prevents the model's weights from becoming too large, which helps prevent overfitting. A value of 0.01 is a good default. Increase it if your model overfits (training loss much lower than validation loss); decrease it if your model underfits (both losses remain high).
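Putting the properties on this page together, a typical configuration might look like the following; the type arguments are chosen purely for illustration:

```csharp
var options = new AdamWOptimizerOptions<double, double[], double[]>
{
    InitialLearningRate = 0.001, // or ~2e-5 to 5e-5 for fine-tuning
    WeightDecay = 0.01,          // raise if overfitting, lower if underfitting
    Beta1 = 0.9,
    Beta2 = 0.999,
    Epsilon = 1e-8,
    BatchSize = 32,
    UseAMSGrad = false
};
```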