Class AdamWOptimizer<T, TInput, TOutput>
- Namespace
- AiDotNet.Optimizers
- Assembly
- AiDotNet.dll
Implements the AdamW (Adam with decoupled weight decay) optimization algorithm.
public class AdamWOptimizer<T, TInput, TOutput> : GradientBasedOptimizerBase<T, TInput, TOutput>, IGradientBasedOptimizer<T, TInput, TOutput>, IOptimizer<T, TInput, TOutput>, IModelSerializer
Type Parameters
T
The numeric type used for calculations (e.g., float, double).
TInput
The type of the input data (for example, Matrix<T>).
TOutput
The type of the output data (for example, Vector<T>).
- Inheritance
-
OptimizerBase<T, TInput, TOutput>
GradientBasedOptimizerBase<T, TInput, TOutput>
AdamWOptimizer<T, TInput, TOutput>
- Implements
-
IGradientBasedOptimizer<T, TInput, TOutput>
IOptimizer<T, TInput, TOutput>
IModelSerializer
Examples
var options = new AdamWOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    LearningRate = 0.001,
    WeightDecay = 0.01,
    Beta1 = 0.9,
    Beta2 = 0.999
};
var optimizer = new AdamWOptimizer<float, Matrix<float>, Vector<float>>(model, options);
Remarks
AdamW is a variant of Adam that fixes the weight decay implementation. In standard Adam with L2 regularization, weight decay is coupled with the adaptive learning rate, which can lead to suboptimal regularization effects. AdamW decouples weight decay from the gradient-based update, applying it directly to the weights.
The key difference:
- Adam with L2 regularization: gradient = gradient + lambda * weights (then apply the Adam update)
- AdamW: weights = weights - lr * adam_update - lr * lambda * weights (weight decay applied directly to the weights, decoupled from the adaptive update)
A minimal sketch of this decoupled update follows these remarks.
For Beginners: AdamW is like Adam but handles regularization (preventing overfitting) in a smarter way. The difference might seem technical, but AdamW consistently achieves better results on tasks like training transformers and large neural networks. If you're choosing between Adam and AdamW, AdamW is generally the better choice.
Based on the paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter.
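To make the decoupled update concrete, here is a minimal sketch of a single AdamW step for one parameter, written in plain C#. It is illustrative only and does not reproduce the library's implementation; the helper name AdamWStep and the use of double arithmetic are assumptions.
using System;

// Illustrative sketch of one AdamW step for a single parameter (not the library's code).
// m, v: first and second moment estimates carried between steps; step: 1-based step count.
double AdamWStep(double w, double g, ref double m, ref double v, int step,
    double lr = 0.001, double beta1 = 0.9, double beta2 = 0.999,
    double epsilon = 1e-8, double weightDecay = 0.01)
{
    // Update biased moment estimates from the raw gradient (no L2 term is added here).
    m = beta1 * m + (1 - beta1) * g;
    v = beta2 * v + (1 - beta2) * g * g;

    // Bias correction.
    double mHat = m / (1 - Math.Pow(beta1, step));
    double vHat = v / (1 - Math.Pow(beta2, step));

    // Adam update, then decoupled weight decay applied directly to the weight.
    double adamUpdate = mHat / (Math.Sqrt(vHat) + epsilon);
    return w - lr * adamUpdate - lr * weightDecay * w;
}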
Constructors
AdamWOptimizer(IFullModel<T, TInput, TOutput>?, AdamWOptimizerOptions<T, TInput, TOutput>?)
Initializes a new instance of the AdamWOptimizer class.
public AdamWOptimizer(IFullModel<T, TInput, TOutput>? model, AdamWOptimizerOptions<T, TInput, TOutput>? options = null)
Parameters
model IFullModel<T, TInput, TOutput>
The model to optimize.
options AdamWOptimizerOptions<T, TInput, TOutput>
The options for configuring the AdamW optimizer.
Remarks
For Beginners: This sets up the AdamW optimizer with its initial configuration. The most important parameters are learning rate (how fast to learn) and weight decay (how much to regularize).
Properties
SupportsGpuUpdate
Gets whether this optimizer supports GPU-accelerated parameter updates.
public override bool SupportsGpuUpdate { get; }
Property Value
- bool
UseAMSGrad
Gets whether the AMSGrad variant is enabled.
public bool UseAMSGrad { get; }
Property Value
- bool
WeightDecay
Gets the current weight decay coefficient.
public double WeightDecay { get; }
Property Value
- double
Methods
Deserialize(byte[])
Deserializes the optimizer's state from a byte array.
public override void Deserialize(byte[] data)
Parameters
data byte[]
DisposeGpuState()
Disposes GPU-allocated optimizer state.
public override void DisposeGpuState()
GenerateGradientCacheKey(IFullModel<T, TInput, TOutput>, TInput, TOutput)
Generates a unique key for caching gradients.
protected override string GenerateGradientCacheKey(IFullModel<T, TInput, TOutput> model, TInput X, TOutput y)
Parameters
model IFullModel<T, TInput, TOutput>
X TInput
y TOutput
Returns
- string
GetOptions()
Gets the current optimizer options.
public override OptimizationAlgorithmOptions<T, TInput, TOutput> GetOptions()
Returns
- OptimizationAlgorithmOptions<T, TInput, TOutput>
InitializeAdaptiveParameters()
Initializes the adaptive parameters used by the AdamW optimizer.
protected override void InitializeAdaptiveParameters()
InitializeGpuState(int, IDirectGpuBackend)
Initializes AdamW optimizer state on the GPU.
public override void InitializeGpuState(int parameterCount, IDirectGpuBackend backend)
Parameters
parameterCount int
backend IDirectGpuBackend
Optimize(OptimizationInputData<T, TInput, TOutput>)
Performs the optimization process using the AdamW algorithm.
public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInputData<T, TInput, TOutput> inputData)
Parameters
inputData OptimizationInputData<T, TInput, TOutput>
The input data for optimization, including training data and targets.
Returns
- OptimizationResult<T, TInput, TOutput>
The result of the optimization process, including the best solution found.
Remarks
DataLoader Integration: This optimizer now uses the DataLoader batching infrastructure, which supports:
- Custom samplers (weighted, stratified, curriculum, importance, active learning)
- Reproducible shuffling via RandomSeed
- The option to drop incomplete final batches
Set these options via GradientBasedOptimizerOptions.DataSampler, ShuffleData, DropLastBatch, and RandomSeed, as sketched below.
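As a rough configuration sketch, the batching-related options named above might be set as shown below; it assumes these members are inherited by AdamWOptimizerOptions from GradientBasedOptimizerOptions, as the remarks suggest.
// Sketch only: assumes the batching members are inherited from GradientBasedOptimizerOptions.
var options = new AdamWOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    LearningRate = 0.001,
    WeightDecay = 0.01,
    ShuffleData = true,   // reproducible shuffling, controlled by RandomSeed
    RandomSeed = 42,
    DropLastBatch = true  // drop the incomplete final batch
    // DataSampler = ...  // optional custom sampler (weighted, stratified, etc.)
};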
Reset()
Resets the optimizer's internal state.
public override void Reset()
ReverseUpdate(Vector<T>, Vector<T>)
Reverses an AdamW gradient update to recover original parameters.
public override Vector<T> ReverseUpdate(Vector<T> updatedParameters, Vector<T> appliedGradients)
Parameters
updatedParameters Vector<T>
appliedGradients Vector<T>
Returns
- Vector<T>
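A hedged usage sketch, assuming that appliedGradients is the same gradient vector that was passed to the corresponding UpdateParameters call:
// Sketch (assumption): round-tripping a single update.
var updated = optimizer.UpdateParameters(parameters, gradient);
var recovered = optimizer.ReverseUpdate(updated, gradient);
// recovered is expected to approximate the original parameters vector.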
Serialize()
Serializes the optimizer's state into a byte array.
public override byte[] Serialize()
Returns
- byte[]
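For example, Serialize and Deserialize can be paired to checkpoint and later restore the optimizer's state; the snippet below is a minimal sketch using only the two methods documented here.
// Save the optimizer's internal state.
byte[] state = optimizer.Serialize();
// ... later, on an optimizer constructed with the same configuration:
optimizer.Deserialize(state);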
UpdateAdaptiveParameters(OptimizationStepData<T, TInput, TOutput>, OptimizationStepData<T, TInput, TOutput>)
Updates the adaptive parameters of the optimizer based on the current and previous optimization steps.
protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput, TOutput> currentStepData, OptimizationStepData<T, TInput, TOutput> previousStepData)
Parameters
currentStepData OptimizationStepData<T, TInput, TOutput>
previousStepData OptimizationStepData<T, TInput, TOutput>
UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput>)
Updates the optimizer's options.
protected override void UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput> options)
Parameters
options OptimizationAlgorithmOptions<T, TInput, TOutput>
UpdateParameters(Matrix<T>, Matrix<T>)
Updates a matrix of parameters using the AdamW optimization algorithm.
public override Matrix<T> UpdateParameters(Matrix<T> parameters, Matrix<T> gradient)
Parameters
parameters Matrix<T>
gradient Matrix<T>
Returns
- Matrix<T>
UpdateParameters(Vector<T>, Vector<T>)
Updates a vector of parameters using the AdamW optimization algorithm with decoupled weight decay.
public override Vector<T> UpdateParameters(Vector<T> parameters, Vector<T> gradient)
Parameters
parameters Vector<T>
The current parameter vector to be updated.
gradient Vector<T>
The gradient vector corresponding to the parameters.
Returns
- Vector<T>
The updated parameter vector.
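A minimal usage sketch; the Vector<float> array constructor shown here is assumed for illustration and may not match the actual AiDotNet API.
// Sketch: one manual AdamW step on a parameter vector (constructor usage is assumed).
var parameters = new Vector<float>(new[] { 0.5f, -0.3f, 1.2f });
var gradient = new Vector<float>(new[] { 0.1f, -0.2f, 0.05f });
parameters = optimizer.UpdateParameters(parameters, gradient);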
UpdateParametersGpu(IGpuBuffer, IGpuBuffer, int, IDirectGpuBackend)
Updates parameters on the GPU using the AdamW kernel.
public override void UpdateParametersGpu(IGpuBuffer parameters, IGpuBuffer gradients, int parameterCount, IDirectGpuBackend backend)
Parameters
parameters IGpuBuffer
gradients IGpuBuffer
parameterCount int
backend IDirectGpuBackend
UpdateSolution(IFullModel<T, TInput, TOutput>, Vector<T>)
Updates the current solution using the AdamW update rule with decoupled weight decay.
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
Parameters
currentSolution IFullModel<T, TInput, TOutput>
The current solution being optimized.
gradient Vector<T>
The calculated gradient for the current solution.
Returns
- IFullModel<T, TInput, TOutput>
A new solution with updated parameters.