
Class GradientBasedOptimizerBase<T, TInput, TOutput>

Namespace
AiDotNet.Optimizers
Assembly
AiDotNet.dll

Represents a base class for gradient-based optimization algorithms.

public abstract class GradientBasedOptimizerBase<T, TInput, TOutput> : OptimizerBase<T, TInput, TOutput>, IGradientBasedOptimizer<T, TInput, TOutput>, IOptimizer<T, TInput, TOutput>, IModelSerializer

Type Parameters

T

The numeric type used for calculations, typically float or double.

TInput

The type of the input data (for example, a matrix or tensor of features).

TOutput

The type of the output data (for example, a vector of predictions or target values).
Inheritance
OptimizerBase<T, TInput, TOutput>
GradientBasedOptimizerBase<T, TInput, TOutput>
Implements
IGradientBasedOptimizer<T, TInput, TOutput>
IOptimizer<T, TInput, TOutput>

Remarks

Gradient-based optimizers use the gradient of the loss function to update the model parameters in a direction that minimizes the loss. This base class provides common functionality for various gradient-based optimization techniques.

For Beginners: Think of gradient-based optimization like finding the bottom of a valley:

  • You start at a random point on a hilly landscape (your initial model parameters)
  • You look around to see which way is steepest downhill (calculate the gradient)
  • You take a step in that direction (update the parameters)
  • You repeat this process until you reach the bottom of the valley (optimize the model)

This approach helps the model learn by gradually adjusting its parameters to minimize errors.
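
Inside a derived optimizer, one iteration of this process typically combines the protected helpers documented below. A minimal sketch, assuming the gradient inputs and current solution have already been obtained from the training data (each derived class structures its actual loop differently):

var gradient = CalculateGradient(currentSolution, xTrain, yTrain);   // which way is downhill?
gradient = ApplyGradientClipping(gradient);                          // keep the step from exploding
gradient = ApplyMomentum(gradient);                                  // blend in the previous direction
currentSolution = UpdateSolution(currentSolution, gradient);         // take the step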

Constructors

GradientBasedOptimizerBase(IFullModel<T, TInput, TOutput>?, GradientBasedOptimizerOptions<T, TInput, TOutput>)

Initializes a new instance of the GradientBasedOptimizerBase class.

protected GradientBasedOptimizerBase(IFullModel<T, TInput, TOutput>? model, GradientBasedOptimizerOptions<T, TInput, TOutput> options)

Parameters

model IFullModel<T, TInput, TOutput>

The model to optimize (can be null if set later).

options GradientBasedOptimizerOptions<T, TInput, TOutput>

Options for the gradient-based optimizer.

Remarks

For Beginners: This sets up the gradient-based optimizer with its initial settings. It's like preparing for a hike by choosing your starting point, deciding how big your steps will be, and deciding how much weight to give your previous direction when choosing your next step.

Fields

GradientCache

A cache for storing and retrieving gradients to improve performance.

protected IGradientCache<T> GradientCache

Field Value

IGradientCache<T>

GradientOptions

Options specific to gradient-based optimization algorithms.

protected GradientBasedOptimizerOptions<T, TInput, TOutput> GradientOptions

Field Value

GradientBasedOptimizerOptions<T, TInput, TOutput>

LossFunction

The loss function used to compare the predicted values against the actual values.

protected ILossFunction<T> LossFunction

Field Value

ILossFunction<T>

Regularization

The regularization technique applied to the parameters so they don't grow uncontrollably, which helps prevent overfitting.

protected IRegularization<T, TInput, TOutput> Regularization

Field Value

IRegularization<T, TInput, TOutput>

_currentEpoch

The current epoch number for scheduler tracking.

protected int _currentEpoch

Field Value

int

_currentStep

The current step (batch) number for scheduler tracking.

protected int _currentStep

Field Value

int

_gpuState

GPU-resident optimizer state. Derived classes store their optimizer-specific state here.

protected IGpuBuffer? _gpuState

Field Value

IGpuBuffer

_gpuStateInitialized

Whether GPU state has been initialized.

protected bool _gpuStateInitialized

Field Value

bool

_lastComputedGradients

The gradients computed during the last optimization step.

protected Vector<T> _lastComputedGradients

Field Value

Vector<T>

Remarks

This field stores the gradients calculated in the most recent call to CalculateGradient(). It enables external access to gradients for features like gradient clipping, distributed training (true DDP), debugging, and visualization. The value is Vector<T>.Empty() until gradients have been computed.

_learningRateScheduler

The learning rate scheduler to use for adjusting learning rate during training.

protected ILearningRateScheduler? _learningRateScheduler

Field Value

ILearningRateScheduler

Remarks

For Beginners: A learning rate scheduler automatically adjusts how fast your model learns during training. Common strategies include starting high and decreasing over time, or using warmup to slowly increase the learning rate at the beginning.

_mixedPrecisionContext

Mixed-precision training context (null if mixed-precision is disabled).

protected MixedPrecisionContext? _mixedPrecisionContext

Field Value

MixedPrecisionContext

Remarks

For Beginners: Mixed-precision training uses both 16-bit (FP16) and 32-bit (FP32) floating-point numbers during optimization. This context manages the conversion between precisions and handles loss scaling to prevent numerical issues. When enabled, this can provide:

  • 2-3x faster training on modern GPUs (V100, A100, RTX 3000+)
  • ~50% memory reduction
  • Maintained accuracy through careful precision management

_previousGradient

The gradient from the previous optimization step, used for momentum calculations.

protected Vector<T> _previousGradient

Field Value

Vector<T>

_schedulerStepMode

Specifies when to step the learning rate scheduler.

protected SchedulerStepMode _schedulerStepMode

Field Value

SchedulerStepMode

Remarks

Controls whether the scheduler updates after each batch, each epoch, or uses warmup followed by per-epoch stepping.

Properties

CurrentEpoch

Gets the current training epoch.

public int CurrentEpoch { get; }

Property Value

int

CurrentStep

Gets the current training step (batch count).

public int CurrentStep { get; }

Property Value

int

IsMixedPrecisionEnabled

Gets whether mixed-precision training is enabled for this optimizer.

public bool IsMixedPrecisionEnabled { get; }

Property Value

bool

LastComputedGradients

Gets the gradients computed during the last optimization step.

public virtual Vector<T> LastComputedGradients { get; }

Property Value

Vector<T>

Vector of gradients for each parameter. An empty vector is returned if no optimization has been performed yet.

Remarks

This property provides access to the gradients (partial derivatives) calculated during the most recent optimization. Essential for distributed training, gradient clipping, and debugging.

For Beginners: Gradients are "directions" showing how to adjust each parameter to improve the model. This property lets you see those directions after optimization runs.

Industry Standard: PyTorch, TensorFlow, and JAX all expose gradients for features like gradient clipping, true Distributed Data Parallel (DDP), and gradient compression.
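
For example, after an optimization step the gradients and their norm can be inspected for monitoring (a sketch; the surrounding training code is assumed):

var gradients = optimizer.LastComputedGradients;   // empty until an optimization step has run
var norm = optimizer.GetGradientNorm();            // L2 norm of those gradients (0 if empty)
Console.WriteLine($"Gradient norm: {norm}");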

LearningRateScheduler

Gets the current learning rate scheduler, if one is configured.

public ILearningRateScheduler? LearningRateScheduler { get; }

Property Value

ILearningRateScheduler

SchedulerStepMode

Gets the current scheduler step mode.

public SchedulerStepMode SchedulerStepMode { get; }

Property Value

SchedulerStepMode

SupportsGpuUpdate

Gets whether this optimizer supports GPU-accelerated parameter updates.

public virtual bool SupportsGpuUpdate { get; }

Property Value

bool

Remarks

For Beginners: Override this in derived classes that have GPU kernel implementations. The base class returns false since it has no specific GPU kernel.

Methods

ApplyGradientClipping(Vector<T>)

Applies gradient clipping based on the configured options.

protected virtual Vector<T> ApplyGradientClipping(Vector<T> gradient)

Parameters

gradient Vector<T>

The gradient to clip.

Returns

Vector<T>

The clipped gradient.

Remarks

For Beginners: Gradient clipping prevents training instability by limiting how large gradients can become. This is especially important for deep networks and RNNs where gradients can "explode" (become extremely large) during backpropagation.
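
Conceptually, norm-based clipping rescales a gradient whose L2 norm exceeds a threshold c to gradient * (c / norm). A standalone sketch of that idea on a plain array (illustrative only; the actual method operates on Vector<T> and is driven by the configured clipping options):

double[] gradient = { 3.0, 4.0 };                  // example gradient with L2 norm 5
double maxNorm = 1.0;                              // illustrative threshold
double norm = Math.Sqrt(gradient.Sum(g => g * g));
if (norm > maxNorm)
{
    double scale = maxNorm / norm;                 // rescale so the norm equals maxNorm
    gradient = gradient.Select(g => g * scale).ToArray();
}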

ApplyGradients(Vector<T>, IFullModel<T, TInput, TOutput>)

Applies pre-computed gradients to a model's parameters.

public virtual IFullModel<T, TInput, TOutput> ApplyGradients(Vector<T> gradients, IFullModel<T, TInput, TOutput> model)

Parameters

gradients Vector<T>

Gradients to apply (must match model parameter count)

model IFullModel<T, TInput, TOutput>

Model whose parameters should be updated

Returns

IFullModel<T, TInput, TOutput>

Model with updated parameters

Remarks

Allows applying externally-computed or modified gradients (averaged, compressed, clipped, etc.) to update model parameters. Essential for production distributed training.

For Beginners: This takes pre-calculated "directions" (gradients) and uses them to update the model. Like having a GPS tell you which way to go, this method moves you there.

Production Use Cases:

  • True DDP: average gradients across GPUs, then apply
  • Gradient Compression: compress, sync, decompress, then apply
  • Federated Learning: average gradients from clients before applying
  • Gradient Clipping: clip gradients to prevent exploding, then apply
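
For example, gradients can be computed, processed externally (clipped, compressed, or averaged), and then applied to a model that has not yet been stepped with them (a sketch; ProcessGradients and rawGradients are hypothetical placeholders for your own code):

// The model passed in must not already contain a local update from these gradients;
// see the two-vector overload below for a double-step-safe alternative.
Vector<double> processedGradients = ProcessGradients(rawGradients);
model = optimizer.ApplyGradients(processedGradients, model);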

Exceptions

ArgumentNullException

If gradients or model is null

ArgumentException

If gradient size doesn't match parameters

ApplyGradients(Vector<T>, Vector<T>, IFullModel<T, TInput, TOutput>)

Applies pre-computed gradients to explicit original parameters (double-step safe).

public virtual IFullModel<T, TInput, TOutput> ApplyGradients(Vector<T> originalParameters, Vector<T> gradients, IFullModel<T, TInput, TOutput> model)

Parameters

originalParameters Vector<T>

Pre-update parameters to start from

gradients Vector<T>

Gradients to apply

model IFullModel<T, TInput, TOutput>

Model template (only used for structure, parameters ignored)

Returns

IFullModel<T, TInput, TOutput>

New model with updated parameters

Remarks

⚠️ RECOMMENDED for Distributed Training: This overload accepts originalParameters explicitly, making it impossible to accidentally apply gradients twice. Use this in distributed optimizers where you need explicit control over which parameter state to start from.

Prevents double-stepping bug:

  • WRONG: ApplyGradients(g_avg, modelWithLocalUpdate) → double step!
  • RIGHT: ApplyGradients(originalParams, g_avg, modelTemplate) → single step!

Distributed Pattern:

  1. Save originalParams before local optimization
  2. Run local optimization → get localGradients
  3. Synchronize gradients → get avgGradients
  4. Call ApplyGradients(originalParams, avgGradients, model) → correct result!
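
A sketch of that pattern (SaveParameters and Synchronize are hypothetical placeholders for your model-access and communication code):

Vector<double> originalParams = SaveParameters(model);                   // 1. capture pre-update parameters
RunLocalOptimizationStep();                                              // 2. local step computes gradients
Vector<double> localGradients = optimizer.LastComputedGradients;
Vector<double> avgGradients = Synchronize(localGradients);               // 3. average across workers
model = optimizer.ApplyGradients(originalParams, avgGradients, model);   // 4. one correct update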

ApplyMomentum(Vector<T>)

Applies momentum to the gradient calculation.

protected virtual Vector<T> ApplyMomentum(Vector<T> gradient)

Parameters

gradient Vector<T>

The current gradient.

Returns

Vector<T>

The gradient adjusted for momentum.

Remarks

For Beginners: This method considers the direction you were moving in previously when deciding which way to go next. It's like considering your momentum when hiking - you might keep going in roughly the same direction rather than abruptly changing course.
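
Conceptually, classical momentum blends the previous direction into the current gradient. A standalone sketch of that idea on plain arrays (illustrative only; the actual blending is controlled by the momentum setting in the optimizer options):

double momentum = 0.9;                                   // illustrative momentum coefficient
double[] previous = { 0.2, -0.1 };                       // direction from the last step
double[] current  = { 0.5,  0.3 };                       // freshly computed gradient
double[] adjusted = new double[current.Length];
for (int i = 0; i < current.Length; i++)
    adjusted[i] = momentum * previous[i] + current[i];   // keep some of the old direction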

AreGradientsExploding(double)

Checks if the current gradients are exhibiting exploding gradient behavior.

public bool AreGradientsExploding(double threshold = 1000)

Parameters

threshold double

The threshold above which gradients are considered exploding. Default is 1000.

Returns

bool

True if gradients are exploding, false otherwise.

Remarks

For Beginners: This method helps detect when training is becoming unstable. If gradients become too large, it usually indicates a problem with the learning rate or model architecture that needs to be addressed.

AreGradientsVanishing(double)

Checks if the current gradients are exhibiting vanishing gradient behavior.

public bool AreGradientsVanishing(double threshold = 1E-07)

Parameters

threshold double

The threshold below which gradients are considered vanishing. Default is 1e-7.

Returns

bool

True if gradients are vanishing, false otherwise.

Remarks

For Beginners: Vanishing gradients occur when gradients become so small that learning effectively stops. This is common in deep networks and can indicate the need for techniques like residual connections, batch normalization, or different activation functions.
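
For example, both checks can be combined into a periodic stability check during training (a sketch):

if (optimizer.AreGradientsExploding())        // default threshold 1000
    Console.WriteLine("Warning: exploding gradients - consider lowering the learning rate or clipping.");
if (optimizer.AreGradientsVanishing())        // default threshold 1e-7
    Console.WriteLine("Warning: vanishing gradients - consider residual connections or different activations.");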

CalculateGradient(IFullModel<T, TInput, TOutput>, TInput, TOutput)

Calculates the gradient for the given model and input data.

protected virtual Vector<T> CalculateGradient(IFullModel<T, TInput, TOutput> solution, TInput X, TOutput y)

Parameters

solution IFullModel<T, TInput, TOutput>

The current solution.

X TInput

The input features.

y TOutput

The target values.

Returns

Vector<T>

The calculated gradient.

Remarks

For Beginners: This method calculates how steep the hill is and in which direction. It helps determine which way the optimizer should step to improve the model.

CalculateGradient(IFullModel<T, TInput, TOutput>, TInput, TOutput, int[])

Calculates the gradient for a given solution using a batch of training data.

protected virtual Vector<T> CalculateGradient(IFullModel<T, TInput, TOutput> solution, TInput xTrain, TOutput yTrain, int[] batchIndices)

Parameters

solution IFullModel<T, TInput, TOutput>

The current solution (model).

xTrain TInput

The training input data.

yTrain TOutput

The training target data.

batchIndices int[]

The indices to use for the current batch.

Returns

Vector<T>

A vector representing the gradient of the loss function with respect to the model parameters.

Remarks

For Beginners: The gradient tells us which direction to adjust our model's parameters to improve performance. It's like a compass showing the way to a better solution.

ComputeHessianEfficiently(IFullModel<T, TInput, TOutput>, OptimizationInputData<T, TInput, TOutput>)

Computes the Hessian matrix (second derivatives) more efficiently when the model supports explicit gradient computation.

protected virtual Matrix<T> ComputeHessianEfficiently(IFullModel<T, TInput, TOutput> model, OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

model IFullModel<T, TInput, TOutput>

The model to compute Hessian for.

inputData OptimizationInputData<T, TInput, TOutput>

The input data for optimization.

Returns

Matrix<T>

The Hessian matrix.

Remarks

For Beginners: The Hessian tells us how the gradient changes - it's the "curvature" of the loss landscape. This is crucial for second-order optimization methods like Newton's method.

Production Enhancement: If the model implements IGradientComputable, this method computes the Hessian by taking gradients of the gradient (using finite differences on the gradient function), which is much more efficient than the traditional double finite differences approach. This is O(n) gradient evaluations instead of O(n²) loss evaluations.

Note: For models implementing IGradientComputable with ComputeSecondOrderGradients support, true Hessian-vector products could be computed even more efficiently. This is currently a middle ground that works with any model implementing ComputeGradients.
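
Conceptually, each Hessian column is a finite difference of the gradient along one parameter direction. A standalone numerical sketch of that idea, using a simple quadratic loss instead of the library types (illustrative only):

// For L(a, b) = a*a + 3*a*b the gradient is (2a + 3b, 3a); each Hessian column
// is obtained from one extra gradient evaluation, giving O(n) gradient calls.
Func<double[], double[]> grad = p => new[] { 2 * p[0] + 3 * p[1], 3 * p[0] };
double[] theta = { 1.0, 2.0 };
double eps = 1e-6;
int n = theta.Length;
var hessian = new double[n, n];
double[] g0 = grad(theta);
for (int i = 0; i < n; i++)
{
    double[] shifted = (double[])theta.Clone();
    shifted[i] += eps;                                // perturb one parameter
    double[] gi = grad(shifted);
    for (int j = 0; j < n; j++)
        hessian[j, i] = (gi[j] - g0[j]) / eps;        // column i of the Hessian
}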

ComputeHessianFiniteDifferences(IFullModel<T, TInput, TOutput>, OptimizationInputData<T, TInput, TOutput>)

Computes the Hessian matrix using traditional finite differences (fallback method).

protected virtual Matrix<T> ComputeHessianFiniteDifferences(IFullModel<T, TInput, TOutput> model, OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

model IFullModel<T, TInput, TOutput>
inputData OptimizationInputData<T, TInput, TOutput>

Returns

Matrix<T>

Remarks

For Beginners: This is the slower but more universally applicable method. It approximates the curvature by testing small changes in parameters.

CreateBatcher(OptimizationInputData<T, TInput, TOutput>, int)

Creates a data batcher for the given optimization input data using configured sampling options.

protected OptimizationDataBatcher<T, TInput, TOutput> CreateBatcher(OptimizationInputData<T, TInput, TOutput> inputData, int batchSize)

Parameters

inputData OptimizationInputData<T, TInput, TOutput>

The optimization input data to batch.

batchSize int

The batch size for training.

Returns

OptimizationDataBatcher<T, TInput, TOutput>

An OptimizationDataBatcher configured with the optimizer's sampling options.

Remarks

For Beginners: This method creates a helper that splits your training data into smaller batches for efficient training. The batching behavior is controlled by:

  • DataSampler (if set): advanced sampling strategies like weighted/curriculum learning
  • ShuffleData: whether to randomize the order each epoch
  • DropLastBatch: whether to discard incomplete final batches
  • RandomSeed: for reproducible randomization

Example usage:

var batcher = CreateBatcher(inputData, batchSize: 32);
foreach (var (xBatch, yBatch, indices) in batcher.GetBatches())
{
    var gradient = CalculateGradient(model, xBatch, yBatch);
    model = UpdateSolution(model, gradient);
}

CreateBatcher(OptimizationInputData<T, TInput, TOutput>, int, IDataSampler)

Creates a data batcher with a custom sampler, overriding the configured options.

protected OptimizationDataBatcher<T, TInput, TOutput> CreateBatcher(OptimizationInputData<T, TInput, TOutput> inputData, int batchSize, IDataSampler sampler)

Parameters

inputData OptimizationInputData<T, TInput, TOutput>

The optimization input data to batch.

batchSize int

The batch size for training.

sampler IDataSampler

The custom sampler to use for advanced sampling strategies.

Returns

OptimizationDataBatcher<T, TInput, TOutput>

An OptimizationDataBatcher with the custom sampler.

Remarks

For Beginners: Use this when you want to try a different sampling strategy without changing the optimizer's default configuration.

Example:

// Create a curriculum learning sampler
var curriculumSampler = Samplers.Curriculum(difficulties, totalEpochs: 100);
var curriculumBatcher = CreateBatcher(inputData, batchSize: 32, sampler: curriculumSampler);

// Use balanced sampling for class imbalance
var balancedSampler = Samplers.Balanced(labels, numClasses: 10);
var balancedBatcher = CreateBatcher(inputData, batchSize: 32, sampler: balancedSampler);

CreateRegularization(GradientDescentOptimizerOptions<T, TInput, TOutput>)

Creates a regularization technique based on the provided options.

protected IRegularization<T, TInput, TOutput> CreateRegularization(GradientDescentOptimizerOptions<T, TInput, TOutput> options)

Parameters

options GradientDescentOptimizerOptions<T, TInput, TOutput>

The options specifying the regularization technique to use.

Returns

IRegularization<T, TInput, TOutput>

An instance of the specified regularization technique.

Remarks

For Beginners: This method sets up a way to prevent the model from becoming too complex. It's like adding rules to your hiking strategy to avoid taking unnecessarily complicated paths.

DisposeGpuState()

Disposes GPU-allocated optimizer state.

public virtual void DisposeGpuState()

Remarks

For Beginners: The base implementation disposes _gpuState if set. Derived classes with multiple state buffers should override.

GenerateGradientCacheKey(IFullModel<T, TInput, TOutput>, TInput, TOutput)

Generates a unique key for caching gradients.

protected virtual string GenerateGradientCacheKey(IFullModel<T, TInput, TOutput> model, TInput X, TOutput y)

Parameters

model IFullModel<T, TInput, TOutput>

The current model.

X TInput

The input features.

y TOutput

The target values.

Returns

string

A string key for caching the gradient.

Remarks

For Beginners: This method creates a unique identifier for each gradient calculation. It's like labeling each spot on the hill so you can remember what the gradient was there.

GetCurrentLearningRate()

Gets the current learning rate being used by this optimizer.

public double GetCurrentLearningRate()

Returns

double

The current learning rate.

Remarks

For Beginners: The learning rate controls how big each update step is. This value may change during training if a learning rate scheduler is configured.

GetGradientNorm()

Gets the L2 norm of the last computed gradients.

public T GetGradientNorm()

Returns

T

The gradient norm, or 0 if no gradients have been computed.

Remarks

For Beginners: The gradient norm is a measure of how "strong" the overall gradient is. Monitoring this value during training can help diagnose issues with exploding or vanishing gradients.

InitializeGpuState(int, IDirectGpuBackend)

Initializes optimizer state on the GPU for a given parameter count.

public virtual void InitializeGpuState(int parameterCount, IDirectGpuBackend backend)

Parameters

parameterCount int

Number of parameters to initialize state for.

backend IDirectGpuBackend

The GPU backend to use for memory allocation.

Remarks

For Beginners: The base implementation does nothing. Derived classes that maintain optimizer state (like momentum or adaptive learning rates) override this.

IsInWarmupPhase()

Determines whether the scheduler is currently in the warmup phase.

protected virtual bool IsInWarmupPhase()

Returns

bool

True if in warmup phase, false otherwise.

Remarks

Warmup is a technique where the learning rate starts very low and gradually increases to the base learning rate over a specified number of steps. This helps stabilize training in the early phases.

Detection Logic: For LinearWarmupScheduler, this method uses the explicit warmup step count for accurate detection. For other schedulers, warmup detection is not supported and this method returns false. The heuristic of comparing current LR to base LR was removed because it incorrectly identifies decay phases (e.g., cosine annealing) as warmup when the learning rate drops below the base learning rate.

LineSearch(IFullModel<T, TInput, TOutput>, Vector<T>, Vector<T>, OptimizationInputData<T, TInput, TOutput>)

Performs a line search to find an appropriate step size.

protected T LineSearch(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> direction, Vector<T> gradient, OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

currentSolution IFullModel<T, TInput, TOutput>

The current solution.

direction Vector<T>

The search direction.

gradient Vector<T>

The current gradient.

inputData OptimizationInputData<T, TInput, TOutput>

The input data for the optimization process.

Returns

T

The step size to use.

Remarks

For Beginners: This method determines how big of a step to take in the chosen direction. It tries to find a step size that sufficiently decreases the function value while not being too small.
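
One common way to realize this is a backtracking search that shrinks the step until a sufficient-decrease (Armijo-style) condition holds. A conceptual sketch on a one-dimensional objective (illustrative only; the library's implementation may differ):

Func<double, double> f = x => x * x;                         // toy objective
double x0 = 3.0, gradAtX0 = 2 * x0, direction = -gradAtX0;   // steepest-descent direction
double step = 1.0, c = 1e-4;                                 // initial step and sufficient-decrease constant
while (f(x0 + step * direction) > f(x0) + c * step * gradAtX0 * direction)
    step *= 0.5;                                             // shrink until the decrease is sufficient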

NotifyEpochStart(int)

Notifies the sampler that a new epoch has started (for epoch-aware samplers).

protected void NotifyEpochStart(int currentEpoch)

Parameters

currentEpoch int

The current epoch number (0-based).

Remarks

Call this at the beginning of each training epoch when using adaptive samplers like curriculum learning or self-paced learning that adjust their behavior over time.

OnBatchEnd()

Called at the end of each training batch to update scheduler state if applicable.

public virtual void OnBatchEnd()

Remarks

When to call this method: This method must be called after each batch if you are using StepPerBatch, or during the warmup phase when using WarmupThenEpoch. Failure to call this method will prevent the learning rate scheduler from advancing on a per-batch basis.

For Beginners: A batch is a small subset of your training data processed at once. Some schedulers (like warmup or cyclical learning rates) need to update after every batch for smooth, fine-grained control of the learning rate.

OnEpochEnd()

Called at the end of each training epoch to update scheduler state if applicable.

public virtual void OnEpochEnd()

Remarks

When to call this method: This method must be called at the end of each epoch if you are using StepPerEpoch or WarmupThenEpoch. Failure to call this method will prevent the learning rate scheduler from advancing, resulting in a constant learning rate throughout training.

For Beginners: An epoch is one complete pass through all your training data. Many learning rate schedules (like step decay or cosine annealing) work on an epoch basis, reducing the learning rate after each complete pass through the data.
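
For example, a training loop that drives both per-batch and per-epoch scheduler updates might look like the following sketch (the data and the batch-processing body are assumed):

for (int epoch = 0; epoch < totalEpochs; epoch++)
{
    foreach (var batch in batches)
    {
        // ... compute gradients and update parameters for this batch ...
        optimizer.OnBatchEnd();   // needed for StepPerBatch, and during warmup for WarmupThenEpoch
    }
    optimizer.OnEpochEnd();       // needed for StepPerEpoch and WarmupThenEpoch
}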

Reset()

Resets the optimizer to its initial state.

public override void Reset()

Remarks

For Beginners: This method clears all the remembered information and starts fresh. It's like wiping your map clean and starting your hike from the beginning.

ReverseUpdate(Vector<T>, Vector<T>)

Reverses a gradient update to recover original parameters.

public virtual Vector<T> ReverseUpdate(Vector<T> updatedParameters, Vector<T> appliedGradients)

Parameters

updatedParameters Vector<T>

Parameters after gradient application

appliedGradients Vector<T>

The gradients that were applied

Returns

Vector<T>

Estimated original parameters

Remarks

This base implementation uses the vanilla SGD reversal formula: params_old = params_new + learning_rate * gradients

For Adaptive Optimizers (Adam, RMSprop, etc.): This method should be overridden to account for optimizer-specific state. The base implementation is only accurate for vanilla SGD.

For Beginners: This calculates where the parameters were before a gradient update was applied. Think of it like rewinding a step you took.

StepScheduler()

Steps the learning rate scheduler and updates the current learning rate.

public double StepScheduler()

Returns

double

The new learning rate after stepping.

Remarks

This method advances the scheduler by one step and synchronizes the optimizer's learning rate with the scheduler's current value.

For Beginners: Call this method to update the learning rate according to the scheduler's policy. The scheduler will automatically adjust the learning rate based on how many steps have been taken.

UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput>)

Updates the options for the gradient-based optimizer.

protected override void UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput> options)

Parameters

options OptimizationAlgorithmOptions<T, TInput, TOutput>

The new options to apply to the optimizer.

Remarks

For Beginners: This method allows you to change the settings of the optimizer while it's running. It's like adjusting your hiking strategy mid-journey based on the terrain you encounter.

UpdateParameters(Matrix<T>, Matrix<T>)

Updates a matrix of parameters based on the calculated gradient.

public virtual Matrix<T> UpdateParameters(Matrix<T> parameters, Matrix<T> gradient)

Parameters

parameters Matrix<T>

The current parameters.

gradient Matrix<T>

The calculated gradient.

Returns

Matrix<T>

The updated parameters.

Remarks

For Beginners: This method adjusts the model's parameters to improve its performance. It's like taking a step in the direction you've determined will lead you downhill.

UpdateParameters(Tensor<T>, Tensor<T>)

Updates a tensor of parameters based on the calculated gradient.

public virtual Tensor<T> UpdateParameters(Tensor<T> parameters, Tensor<T> gradient)

Parameters

parameters Tensor<T>

The current tensor parameters.

gradient Tensor<T>

The calculated gradient tensor.

Returns

Tensor<T>

The updated tensor parameters.

Remarks

For Beginners: This method adjusts the model's parameters stored in tensor format to improve its performance. It's like taking a step in the direction you've determined will lead you downhill, but for more complex multi-dimensional data structures. Tensors are useful for representing parameters in deep neural networks where data has multiple dimensions (like images with width, height, and channels).

UpdateParameters(Vector<T>, Vector<T>)

Updates a vector of parameters based on the calculated gradient.

public virtual Vector<T> UpdateParameters(Vector<T> parameters, Vector<T> gradient)

Parameters

parameters Vector<T>

The current parameters.

gradient Vector<T>

The calculated gradient.

Returns

Vector<T>

The updated parameters.

Remarks

For Beginners: This method is similar to the matrix overload of UpdateParameters, but for when the parameters are stored as a vector instead of a matrix. It's another way of taking a step to improve the model.
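
For example, with the plain rule implied by the ReverseUpdate formula documented above (parameters_new = parameters_old - learningRate * gradient; derived optimizers such as Adam apply more elaborate rules), a single update on plain arrays looks like this sketch:

double learningRate = 0.01;
double[] parameters = { 0.50, -1.20 };
double[] gradient   = { 0.30,  0.40 };
for (int i = 0; i < parameters.Length; i++)
    parameters[i] -= learningRate * gradient[i];    // step downhill along the gradient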

UpdateParameters(List<ILayer<T>>)

Updates the parameters of the model based on the calculated gradients.

public virtual void UpdateParameters(List<ILayer<T>> layers)

Parameters

layers List<ILayer<T>>

The layers of the neural network containing the parameters to update.

Remarks

For Beginners: This method adjusts the model's parameters to improve its performance. It's like taking steps in the direction that will lead to better results, based on what we've learned from the data.

UpdateParametersGpu(IGpuBuffer, IGpuBuffer, int, IDirectGpuBackend)

Updates parameters on the GPU using optimizer-specific GPU kernels.

public virtual void UpdateParametersGpu(IGpuBuffer parameters, IGpuBuffer gradients, int parameterCount, IDirectGpuBackend backend)

Parameters

parameters IGpuBuffer

GPU buffer containing parameters to update (modified in-place).

gradients IGpuBuffer

GPU buffer containing gradients.

parameterCount int

Number of parameters.

backend IDirectGpuBackend

The GPU backend to use for execution.

Remarks

For Beginners: The base implementation throws since there's no generic GPU kernel. Derived classes that support GPU updates override this method.

UpdateSolution(IFullModel<T, TInput, TOutput>, Vector<T>)

Updates the current solution based on the calculated gradient.

protected virtual IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)

Parameters

currentSolution IFullModel<T, TInput, TOutput>

The current solution being optimized.

gradient Vector<T>

The calculated gradient.

Returns

IFullModel<T, TInput, TOutput>

A new solution with updated parameters.

Remarks

For Beginners: This method moves the model's parameters in the direction indicated by the gradient, hopefully improving the model's performance.