Class LARSOptimizerOptions<T, TInput, TOutput>

Namespace
AiDotNet.Models.Options
Assembly
AiDotNet.dll

Configuration options for the LARS (Layer-wise Adaptive Rate Scaling) optimization algorithm.

public class LARSOptimizerOptions<T, TInput, TOutput> : GradientBasedOptimizerOptions<T, TInput, TOutput>

Type Parameters

T
TInput
TOutput
Inheritance
OptimizationAlgorithmOptions<T, TInput, TOutput>
GradientBasedOptimizerOptions<T, TInput, TOutput>
LARSOptimizerOptions<T, TInput, TOutput>
Examples

var options = new LARSOptimizerOptions<float, Matrix<float>, Vector<float>>
{
    InitialLearningRate = 0.3, // Base learning rate (will be scaled per-layer)
    Momentum = 0.9,            // Standard momentum
    WeightDecay = 1e-4,        // Weight decay
    TrustCoefficient = 0.001,  // Controls layer-wise LR scaling
    BatchSize = 4096           // Large batch size for self-supervised learning (SSL)
};
var optimizer = new LARSOptimizer<float, Matrix<float>, Vector<float>>(model, options);

Remarks

LARS (Layer-wise Adaptive Rate Scaling) is designed for training with very large batch sizes (4096-32768). It automatically adapts the learning rate for each layer based on the ratio of parameter norm to gradient norm, which helps maintain stable training with large batches.

For Beginners: When training with large batches (common in self-supervised learning), regular optimizers can become unstable. LARS solves this by automatically adjusting learning rates for each layer based on how "big" the weights and gradients are. This makes training more stable and allows you to use much larger batch sizes, which speeds up training significantly.

LARS is particularly important for self-supervised learning methods like SimCLR, which achieve their best results with batch sizes of 4096-8192.

Based on the paper "Large Batch Training of Convolutional Networks" by You et al. (2017).
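
The per-layer scaling can be summarized with a short sketch. The helper below is an illustrative outline of the rule described in the paper, not the library's internal implementation; the method name and the fallback behavior for zero norms are assumptions.

// Illustrative LARS local learning rate (sketch, not the library's implementation):
//   localLr = TrustCoefficient * ||w|| / (||g|| + WeightDecay * ||w|| + Epsilon)
// The update then applies momentum:
//   v = Momentum * v + lr * localLr * (g + WeightDecay * w);  w = w - v
double ComputeLocalLearningRate(double weightNorm, double gradientNorm,
    double trustCoefficient, double weightDecay, double epsilon)
{
    // Assumed fallback: layers with a zero weight or gradient norm use a scaling of 1.0.
    if (weightNorm == 0.0 || gradientNorm == 0.0) return 1.0;
    return trustCoefficient * weightNorm / (gradientNorm + weightDecay * weightNorm + epsilon);
}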

Properties

BatchSize

Gets or sets the batch size for mini-batch gradient descent.

public int BatchSize { get; set; }

Property Value

int

A positive integer, defaulting to 4096 for large batch SSL training.

Remarks

For Beginners: LARS is specifically designed for large batch sizes. The default of 4096 is typical for self-supervised learning. You can go up to 32768 with proper learning rate warmup.

Epsilon

Gets or sets a small constant added to denominators to prevent division by zero.

public double Epsilon { get; set; }

Property Value

double

The epsilon value, defaulting to 1e-8.

Remarks

For Beginners: This is a tiny safety value to prevent numerical issues. You rarely need to change this unless you experience NaN values during training.

ExcludeBiasFromLARS

Gets or sets whether to exclude bias parameters and normalization layer parameters from LARS scaling.

public bool ExcludeBiasFromLARS { get; set; }

Property Value

bool

True to exclude biases and normalization parameters from LARS scaling (default); false to apply LARS to all parameters.

Remarks

In the original LARS paper and most implementations, bias terms and normalization layer parameters (BatchNorm, LayerNorm) are excluded from the layer-wise scaling and only use the base learning rate with momentum.

For Beginners: Some parameters like biases work better with regular learning rates rather than LARS scaling. Keeping this true (default) follows best practices.

InitialLearningRate

Gets or sets the base learning rate for the LARS optimizer.

public override double InitialLearningRate { get; set; }

Property Value

double

The learning rate, defaulting to 0.3.

Remarks

For Beginners: Unlike Adam which typically uses small learning rates (0.001), LARS uses larger base learning rates (0.1-1.0) because it automatically scales them per layer. The default of 0.3 works well for most SSL tasks. Use linear scaling: LR = base_lr * batch_size / 256.
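
As a concrete illustration of the linear scaling rule, with the default base rate of 0.3 and a batch size of 4096 (the snippet below is arithmetic only; the variable names are not part of the API, and the library is not assumed to apply this scaling automatically):

double baseLr = 0.3;                          // rate tuned for a reference batch size of 256
int batchSize = 4096;
double scaledLr = baseLr * batchSize / 256.0; // 0.3 * 16 = 4.8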

LayerBoundaries

Gets or sets the layer size boundaries for layer-wise scaling.

public int[]? LayerBoundaries { get; set; }

Property Value

int[]

Array of cumulative parameter counts that define the boundaries between layers for LARS scaling.

Remarks

LARS applies different scaling factors to different layers. This array defines the cumulative sizes of parameters that belong to each layer. If null, each parameter vector is treated as a single layer.

For Beginners: This tells LARS where one layer ends and another begins in the flattened parameter vector. If not set, LARS treats all parameters as one layer, which still works but is less optimal than true layer-wise scaling.
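
As a sketch, suppose the flattened parameter vector contains three layers with 1000, 500, and 10 parameters. Under the cumulative interpretation described above, the boundaries would be set as follows (the sizes are hypothetical; verify the exact convention against the optimizer's source):

// Hypothetical: three layers of 1000, 500, and 10 parameters,
// expressed as cumulative offsets into the flattened parameter vector.
options.LayerBoundaries = new[] { 1000, 1500, 1510 };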

Momentum

Gets or sets the momentum coefficient for the optimizer.

public double Momentum { get; set; }

Property Value

double

The momentum value, defaulting to 0.9.

Remarks

For Beginners: Momentum helps the optimizer maintain direction through noisy gradients. A value of 0.9 means 90% of the update comes from the previous direction. Higher values (up to 0.99) can help with very large batches.

SkipLARSLayers

Gets or sets which layers should skip LARS scaling and use only the base learning rate.

public int[]? SkipLARSLayers { get; set; }

Property Value

int[]

Array of layer indices to skip LARS scaling for.

Remarks

Some layers (particularly the final classifier head) may work better without LARS scaling. This array specifies which layer indices should use only the base learning rate.

For Beginners: In self-supervised learning, you typically want to skip LARS for the projection head layers that are discarded after pretraining anyway.
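
For example, if layer indices 4 and 5 correspond to a projection head that will be discarded after pretraining, they could be skipped as follows (the indices are hypothetical and depend on your model):

// Hypothetical: use only the base learning rate for layers 4 and 5 (projection head).
options.SkipLARSLayers = new[] { 4, 5 };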

TrustCoefficient

Gets or sets the LARS trust coefficient (eta).

public double TrustCoefficient { get; set; }

Property Value

double

The trust coefficient, defaulting to 0.001.

Remarks

The trust coefficient controls how much the layer-wise learning rate scaling affects the update. A smaller value means more conservative scaling, while a larger value allows larger per-layer learning rate adjustments.

For Beginners: This controls how aggressively LARS adapts learning rates per layer. The default of 0.001 is well-tested. Smaller values (0.0001) are more conservative, larger values (0.01) more aggressive. Stick with the default unless you have specific issues.

UseNesterov

Gets or sets whether to use Nesterov momentum instead of standard momentum.

public bool UseNesterov { get; set; }

Property Value

bool

True to use Nesterov momentum, false for standard momentum; defaults to false.

Remarks

For Beginners: Nesterov momentum looks ahead before computing gradients, which can help with convergence. It's slightly more complex but can improve results. The default (standard momentum) works well for most SSL tasks.

WarmupEpochs

Gets or sets the number of warmup epochs for learning rate warmup.

public int WarmupEpochs { get; set; }

Property Value

int

The number of warmup epochs, defaulting to 10.

Remarks

Learning rate warmup gradually increases the learning rate from 0 to the target value over the specified number of epochs. This helps stabilize large batch training.

For Beginners: When training with large batches, starting with a full learning rate can cause training to diverge. Warmup slowly increases the learning rate, giving the model time to stabilize. A warmup of 10 epochs is typical for SSL.
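
A common schedule is a linear ramp from 0 to the base rate over the warmup period; the helper below is an illustrative sketch, not part of this class:

// Illustrative linear warmup: ramps the learning rate from 0 to baseLr
// over warmupEpochs epochs, then returns baseLr unchanged.
double WarmupLearningRate(double baseLr, int epoch, int warmupEpochs)
{
    if (warmupEpochs <= 0 || epoch >= warmupEpochs) return baseLr;
    return baseLr * (epoch + 1) / warmupEpochs;
}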

WeightDecay

Gets or sets the weight decay coefficient.

public double WeightDecay { get; set; }

Property Value

double

The weight decay coefficient, defaulting to 1e-4.

Remarks

For Beginners: Weight decay prevents the model weights from growing too large. LARS incorporates weight decay into the layer-wise learning rate calculation, which helps with numerical stability during large batch training.