Class AdamWOptimizer<T, TInput, TOutput>

Namespace
AiDotNet.Optimizers
Assembly
AiDotNet.dll

Implements the AdamW (Adam with decoupled weight decay) optimization algorithm.

public class AdamWOptimizer<T, TInput, TOutput> : GradientBasedOptimizerBase<T, TInput, TOutput>, IGradientBasedOptimizer<T, TInput, TOutput>, IOptimizer<T, TInput, TOutput>, IModelSerializer

Type Parameters

T

The numeric type used for calculations (e.g., float, double).

TInput

The type of input data consumed by the model (e.g., Matrix<T>).

TOutput

The type of output data produced by the model (e.g., Vector<T>).

Inheritance
OptimizerBase<T, TInput, TOutput>
GradientBasedOptimizerBase<T, TInput, TOutput>
AdamWOptimizer<T, TInput, TOutput>
Implements
IGradientBasedOptimizer<T, TInput, TOutput>
IOptimizer<T, TInput, TOutput>
IModelSerializer

Examples

// 'model' is assumed to be an existing IFullModel<float, Matrix<float>, Vector<float>> instance.
var options = new AdamWOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    LearningRate = 0.001,   // step size for each update
    WeightDecay = 0.01,     // strength of the decoupled weight decay
    Beta1 = 0.9,            // decay rate for the first moment estimate
    Beta2 = 0.999           // decay rate for the second moment estimate
};
var optimizer = new AdamWOptimizer<float, Matrix<float>, Vector<float>>(model, options);

Remarks

AdamW is a variant of Adam that fixes the weight decay implementation. In standard Adam with L2 regularization, weight decay is coupled with the adaptive learning rate, which can lead to suboptimal regularization effects. AdamW decouples weight decay from the gradient-based update, applying it directly to the weights.

The key difference:

- Adam with L2: gradient = gradient + lambda * weights (then apply the Adam update)
- AdamW: weights = weights - lr * adam_update - lr * lambda * weights (decoupled)

For Beginners: AdamW is like Adam but handles regularization (preventing overfitting) in a smarter way. The difference might seem technical, but AdamW consistently achieves better results on tasks like training transformers and large neural networks. If you're choosing between Adam and AdamW, AdamW is generally the better choice.

Based on the paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter.
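To make the decoupled update concrete, here is a minimal sketch of a single AdamW step written against plain double[] arrays. The method name AdamWStep and the variables m, v, and t (the 1-based step count) are illustrative only and are not part of the AiDotNet API.

// Minimal sketch of one AdamW step; m and v are the running first and second
// moment estimates, t is the 1-based step count.
static void AdamWStep(
    double[] parameters, double[] gradient,
    double[] m, double[] v, int t,
    double lr = 0.001, double beta1 = 0.9, double beta2 = 0.999,
    double epsilon = 1e-8, double weightDecay = 0.01)
{
    for (int i = 0; i < parameters.Length; i++)
    {
        // Update the biased first and second moment estimates.
        m[i] = beta1 * m[i] + (1 - beta1) * gradient[i];
        v[i] = beta2 * v[i] + (1 - beta2) * gradient[i] * gradient[i];

        // Bias-corrected moments.
        double mHat = m[i] / (1 - Math.Pow(beta1, t));
        double vHat = v[i] / (1 - Math.Pow(beta2, t));

        // Adam step plus decoupled weight decay applied directly to the weights,
        // rather than being folded into the gradient as in Adam with L2.
        parameters[i] -= lr * mHat / (Math.Sqrt(vHat) + epsilon)
                       + lr * weightDecay * parameters[i];
    }
}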

Constructors

AdamWOptimizer(IFullModel<T, TInput, TOutput>?, AdamWOptimizerOptions<T, TInput, TOutput>?)

Initializes a new instance of the AdamWOptimizer class.

public AdamWOptimizer(IFullModel<T, TInput, TOutput>? model, AdamWOptimizerOptions<T, TInput, TOutput>? options = null)

Parameters

model IFullModel<T, TInput, TOutput>

The model to optimize.

options AdamWOptimizerOptions<T, TInput, TOutput>

The options for configuring the AdamW optimizer.

Remarks

For Beginners: This sets up the AdamW optimizer with its initial configuration. The most important parameters are learning rate (how fast to learn) and weight decay (how much to regularize).
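Since the options parameter defaults to null, the optimizer can also be constructed with its default settings. A brief sketch, assuming model is an existing IFullModel<double, Matrix<double>, Vector<double>> instance:

// Falls back to the default AdamW configuration when no options are supplied.
var optimizer = new AdamWOptimizer<double, Matrix<double>, Vector<double>>(model);

To customize the learning rate and weight decay, pass an AdamWOptimizerOptions instance instead, as shown in the Examples section above.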

Properties

SupportsGpuUpdate

Gets whether this optimizer supports GPU-accelerated parameter updates.

public override bool SupportsGpuUpdate { get; }

Property Value

bool

UseAMSGrad

Gets whether AMSGrad variant is enabled.

public bool UseAMSGrad { get; }

Property Value

bool

WeightDecay

Gets the current weight decay coefficient.

public double WeightDecay { get; }

Property Value

double

Methods

Deserialize(byte[])

Deserializes the optimizer's state from a byte array.

public override void Deserialize(byte[] data)

Parameters

data byte[]

DisposeGpuState()

Disposes GPU-allocated optimizer state.

public override void DisposeGpuState()

GenerateGradientCacheKey(IFullModel<T, TInput, TOutput>, TInput, TOutput)

Generates a unique key for caching gradients.

protected override string GenerateGradientCacheKey(IFullModel<T, TInput, TOutput> model, TInput X, TOutput y)

Parameters

model IFullModel<T, TInput, TOutput>
X TInput
y TOutput

Returns

string

GetOptions()

Gets the current optimizer options.

public override OptimizationAlgorithmOptions<T, TInput, TOutput> GetOptions()

Returns

OptimizationAlgorithmOptions<T, TInput, TOutput>

InitializeAdaptiveParameters()

Initializes the adaptive parameters used by the AdamW optimizer.

protected override void InitializeAdaptiveParameters()

InitializeGpuState(int, IDirectGpuBackend)

Initializes AdamW optimizer state on the GPU.

public override void InitializeGpuState(int parameterCount, IDirectGpuBackend backend)

Parameters

parameterCount int
backend IDirectGpuBackend

Optimize(OptimizationInputData<T, TInput, TOutput>)

Performs the optimization process using the AdamW algorithm.

public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

inputData OptimizationInputData<T, TInput, TOutput>

The input data for optimization, including training data and targets.

Returns

OptimizationResult<T, TInput, TOutput>

The result of the optimization process, including the best solution found.

Remarks

DataLoader Integration: This optimizer now uses the DataLoader batching infrastructure, which supports:

- Custom samplers (weighted, stratified, curriculum, importance, active learning)
- Reproducible shuffling via RandomSeed
- An option to drop incomplete final batches

Set these options via GradientBasedOptimizerOptions.DataSampler, ShuffleData, DropLastBatch, and RandomSeed, as sketched below.
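A hedged sketch of wiring up that batching behavior, assuming AdamWOptimizerOptions<T, TInput, TOutput> inherits the GradientBasedOptimizerOptions members named above, and that model and inputData are an existing IFullModel and an already-populated OptimizationInputData<float, Matrix<float>, Vector<float>>:

var options = new AdamWOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    LearningRate = 0.001,
    WeightDecay = 0.01,
    ShuffleData = true,   // shuffle batches each epoch
    RandomSeed = 42,      // makes the shuffling reproducible
    DropLastBatch = true  // discard an incomplete final batch
};
var optimizer = new AdamWOptimizer<float, Matrix<float>, Vector<float>>(model, options);
var result = optimizer.Optimize(inputData); // returns the best solution found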

Reset()

Resets the optimizer's internal state.

public override void Reset()

ReverseUpdate(Vector<T>, Vector<T>)

Reverses an AdamW gradient update to recover original parameters.

public override Vector<T> ReverseUpdate(Vector<T> updatedParameters, Vector<T> appliedGradients)

Parameters

updatedParameters Vector<T>
appliedGradients Vector<T>

Returns

Vector<T>

Serialize()

Serializes the optimizer's state into a byte array.

public override byte[] Serialize()

Returns

byte[]
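A minimal round-trip sketch pairing Serialize() with Deserialize(byte[]) to persist and later restore the optimizer's internal state; the file path and the use of System.IO.File are illustrative only.

// Capture the optimizer's internal state (e.g., moment estimates and step count).
byte[] state = optimizer.Serialize();
System.IO.File.WriteAllBytes("adamw_state.bin", state);

// Later, restore that state into an optimizer configured the same way.
byte[] saved = System.IO.File.ReadAllBytes("adamw_state.bin");
optimizer.Deserialize(saved);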

UpdateAdaptiveParameters(OptimizationStepData<T, TInput, TOutput>, OptimizationStepData<T, TInput, TOutput>)

Updates the adaptive parameters of the optimizer based on the current and previous optimization steps.

protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput, TOutput> currentStepData, OptimizationStepData<T, TInput, TOutput> previousStepData)

Parameters

currentStepData OptimizationStepData<T, TInput, TOutput>
previousStepData OptimizationStepData<T, TInput, TOutput>

UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput>)

Updates the optimizer's options.

protected override void UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput> options)

Parameters

options OptimizationAlgorithmOptions<T, TInput, TOutput>

UpdateParameters(Matrix<T>, Matrix<T>)

Updates a matrix of parameters using the AdamW optimization algorithm.

public override Matrix<T> UpdateParameters(Matrix<T> parameters, Matrix<T> gradient)

Parameters

parameters Matrix<T>
gradient Matrix<T>

Returns

Matrix<T>

UpdateParameters(Vector<T>, Vector<T>)

Updates a vector of parameters using the AdamW optimization algorithm with decoupled weight decay.

public override Vector<T> UpdateParameters(Vector<T> parameters, Vector<T> gradient)

Parameters

parameters Vector<T>

The current parameter vector to be updated.

gradient Vector<T>

The gradient vector corresponding to the parameters.

Returns

Vector<T>

The updated parameter vector.
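For example, the vector overload can drive a hand-rolled training loop. In this sketch, parameters is a Vector<float> holding the model's current weights and ComputeGradient is a hypothetical helper that returns the gradient of your loss with respect to those weights.

// 'parameters' is a Vector<float> with the current weights, obtained elsewhere.
for (int step = 0; step < 1000; step++)
{
    Vector<float> gradient = ComputeGradient(parameters); // hypothetical helper
    parameters = optimizer.UpdateParameters(parameters, gradient);
}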

UpdateParametersGpu(IGpuBuffer, IGpuBuffer, int, IDirectGpuBackend)

Updates parameters on the GPU using the AdamW kernel.

public override void UpdateParametersGpu(IGpuBuffer parameters, IGpuBuffer gradients, int parameterCount, IDirectGpuBackend backend)

Parameters

parameters IGpuBuffer
gradients IGpuBuffer
parameterCount int
backend IDirectGpuBackend

UpdateSolution(IFullModel<T, TInput, TOutput>, Vector<T>)

Updates the current solution using the AdamW update rule with decoupled weight decay.

protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)

Parameters

currentSolution IFullModel<T, TInput, TOutput>

The current solution being optimized.

gradient Vector<T>

The calculated gradient for the current solution.

Returns

IFullModel<T, TInput, TOutput>

A new solution with updated parameters.