Class MetaSGDOptions<T, TInput, TOutput>

Namespace
AiDotNet.MetaLearning.Options
Assembly
AiDotNet.dll

Configuration options for the Meta-SGD (Meta Stochastic Gradient Descent) algorithm.

public class MetaSGDOptions<T, TInput, TOutput> : IMetaLearnerOptions<T>

Type Parameters

T

The numeric data type used for calculations (e.g., float, double).

TInput

The input data type (e.g., Matrix<T>, Tensor<T>).

TOutput

The output data type (e.g., Vector<T>, Tensor<T>).

Inheritance
object → MetaSGDOptions<T, TInput, TOutput>

Implements
IMetaLearnerOptions<T>

Remarks

Meta-SGD extends MAML by learning not just the model initialization but also per-parameter learning rates, momentum terms, and update directions. This effectively learns a custom optimizer configuration for each parameter, enabling highly specialized adaptation strategies.

For Beginners: Think of Meta-SGD as "learning how to learn" at the finest grain:

In standard training, you pick one learning rate for all parameters. But different parts of a neural network might benefit from different learning rates. Meta-SGD figures this out automatically by learning:

- Per-parameter learning rates: some weights need small updates, others larger
- Per-parameter momentum: some weights benefit from momentum, others don't
- Update directions: sometimes the gradient direction should be flipped or scaled

Algorithm Overview:

# For each model parameter θ_i, learn:
#   - α_i: the optimal learning rate
#   - β_i: the optimal momentum coefficient
#   - d_i: the optimal update direction/scaling

# Meta-training:
for each task in task_batch:
    adapted_params = initial_params.copy()
    velocity = zeros_like(initial_params)
    for step in range(K_inner):
        gradients = compute_gradients(adapted_params, support_set)
        for i in range(num_params):
            # Per-parameter update rule using the learned α_i, β_i, d_i
            update_i = α_i * d_i * gradients[i] + β_i * velocity[i]
            velocity[i] = update_i
            adapted_params[i] -= update_i

    query_loss = evaluate(adapted_params, query_set)

# Meta-update: optimize α_i, β_i, d_i using the gradient of query_loss

Key Insights:

1. Per-parameter optimization allows heterogeneous learning rates across layers
2. First-order method: no Hessian computation needed, much faster than second-order MAML
3. Learned optimizers reveal which parameters are important for quick adaptation
4. Can combine with various base update rules (SGD, Adam, RMSprop)

Reference: Li, Z., Zhou, F., Chen, F., & Li, H. (2017). Meta-SGD: Learning to Learn Quickly for Few-Shot Learning.

Constructors

MetaSGDOptions(IFullModel<T, TInput, TOutput>)

Initializes a new instance of the MetaSGDOptions class with the required meta-model.

public MetaSGDOptions(IFullModel<T, TInput, TOutput> metaModel)

Parameters

metaModel IFullModel<T, TInput, TOutput>

The meta-model to be trained (required).

Examples

// Create Meta-SGD options with minimal configuration
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork);
var metaSGD = new MetaSGDAlgorithm<double, Tensor<double>, Tensor<double>>(options);

// Create Meta-SGD options with a custom per-parameter optimizer
var customOptions = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UpdateRuleType = MetaSGDUpdateRuleType.Adam,
    LearnMomentum = true,
    LearnDirection = true,
    LearnAdamBetas = true,
    UseParameterGrouping = true,
    NumParameterGroups = 20
};

Exceptions

ArgumentNullException

Thrown when metaModel is null.

Properties

AdamBeta1Init

Gets or sets the initial value for Adam beta1.

public double AdamBeta1Init { get; set; }

Property Value

double

Default: 0.9 (standard Adam default).

AdamBeta2Init

Gets or sets the initial value for Adam beta2.

public double AdamBeta2Init { get; set; }

Property Value

double

Default: 0.999 (standard Adam default).

AdamEpsilonInit

Gets or sets the epsilon value for Adam numerical stability.

public double AdamEpsilonInit { get; set; }

Property Value

double

Default: 1e-8.

AdaptationSteps

Gets or sets the number of gradient steps to take during inner loop adaptation.

public int AdaptationSteps { get; set; }

Property Value

int

Default: 5 (typical for few-shot learning).

Remarks

Meta-SGD uses first-order optimization, so more adaptation steps are computationally cheaper than in MAML. However, too many steps can lead to overfitting on the support set.

CheckpointFrequency

Gets or sets how often to save checkpoints.

public int CheckpointFrequency { get; set; }

Property Value

int

Default: 1000.

DataLoader

Gets or sets the episodic data loader for sampling tasks.

public IEpisodicDataLoader<T, TInput, TOutput>? DataLoader { get; set; }

Property Value

IEpisodicDataLoader<T, TInput, TOutput>

Default: null (tasks must be provided manually to MetaTrain).

EnableCheckpointing

Gets or sets whether to save checkpoints during training.

public bool EnableCheckpointing { get; set; }

Property Value

bool

Default: false.

EvaluationFrequency

Gets or sets how often to evaluate during meta-training.

public int EvaluationFrequency { get; set; }

Property Value

int

Default: 500.

EvaluationTasks

Gets or sets the number of tasks to use for evaluation.

public int EvaluationTasks { get; set; }

Property Value

int

Default: 100.

GradientClipThreshold

Gets or sets the maximum gradient norm for gradient clipping.

public double? GradientClipThreshold { get; set; }

Property Value

double?

Default: 10.0 (prevents exploding gradients during meta-training).

Remarks

Gradient clipping is particularly important in Meta-SGD because the per-parameter learning rates can amplify gradients if they grow too large.
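For example, a tighter threshold can be combined with a cap on the learned rates if meta-training becomes unstable (a minimal sketch; the values are illustrative, and myNeuralNetwork is the placeholder model from the constructor example):

// Tighten clipping and cap the learned per-parameter rates (illustrative values only)
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    GradientClipThreshold = 1.0,  // clip gradient norm at 1.0 instead of the default 10.0
    MaxLearningRate = 0.5         // also bound the learned rates (see MaxLearningRate)
};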

InnerLearningRate

Gets or sets the learning rate for the inner loop (task adaptation).

public double InnerLearningRate { get; set; }

Property Value

double

Default: 0.01.

Remarks

In Meta-SGD, this serves as the initial value for per-parameter learning rates when using uniform initialization. During meta-training, each parameter will learn its own optimal learning rate, which may diverge from this initial value.

For Beginners: This is the starting point for all learning rates. Meta-SGD will adjust each one individually as it learns.
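As an illustration, the initial rate is often set together with the bounds the learned rates are clipped to (a sketch with illustrative values; see MinLearningRate and MaxLearningRate):

// All per-parameter rates start at 0.05 and stay within [1e-5, 0.5] during meta-training
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerLearningRate = 0.05,
    MinLearningRate = 1e-5,
    MaxLearningRate = 0.5
};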

InnerOptimizer

Gets or sets the optimizer for inner-loop adaptation.

public IGradientBasedOptimizer<T, TInput, TOutput>? InnerOptimizer { get; set; }

Property Value

IGradientBasedOptimizer<T, TInput, TOutput>

Default: null (uses the learned per-parameter optimizer).

Remarks

In Meta-SGD, this is typically not used directly because the per-parameter optimizer with learned coefficients replaces standard optimization. This is provided for compatibility with the base interface.

InnerSteps

Gets or sets the number of inner steps during meta-training.

public int InnerSteps { get; set; }

Property Value

int

Default: 5 (matches AdaptationSteps by default).

Remarks

This can be different from AdaptationSteps to allow for different behavior during training vs. adaptation. Some implementations use fewer inner steps during training for efficiency.
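For example, a configuration might use a cheaper inner loop during meta-training and more steps at adaptation time (a sketch; the values are illustrative):

// Fewer inner steps while learning the per-parameter optimizer, more at adaptation time
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerSteps = 3,       // used during meta-training
    AdaptationSteps = 10  // used when adapting to a new task
};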

LayerDecayFactor

Gets or sets the decay factor per layer depth.

public double LayerDecayFactor { get; set; }

Property Value

double

Default: 0.9.

Remarks

Only used when UseLayerWiseDecay is true. A value of 0.9 means each successive layer has 90% of the previous layer's learning rate.

LearnAdamBetas

Gets or sets whether to learn Adam beta parameters when using Adam update rule.

public bool LearnAdamBetas { get; set; }

Property Value

bool

Default: false.

Remarks

When enabled with Adam update rule, learns per-parameter beta1 and beta2 coefficients. This adds significant complexity but can improve performance.

LearnDirection

Gets or sets whether to learn per-parameter update direction signs.

public bool LearnDirection { get; set; }

Property Value

bool

Default: true (helps with gradient sign ambiguity).

Remarks

The direction parameter can flip or scale the gradient direction for each parameter. This helps when the natural gradient direction isn't optimal for fast adaptation.

Mathematical formulation: θ_i' = θ_i - α_i × d_i × ∇_θ_i L where d_i is the learned direction scaling factor.
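As a worked example of this rule (illustrative numbers, not library output): with θ_i = 0.20, α_i = 0.01, d_i = -0.5, and ∇_θ_i L = 2.0, the update is θ_i' = 0.20 - 0.01 × (-0.5) × 2.0 = 0.21, so the learned negative direction moves the parameter against the raw gradient.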

LearnLearningRate

Gets or sets whether to learn per-parameter learning rates.

public bool LearnLearningRate { get; set; }

Property Value

bool

Default: true (the core feature of Meta-SGD).

Remarks

This is the defining feature of Meta-SGD. Each parameter gets its own learned learning rate that is optimized during meta-training to enable fast adaptation to new tasks.

For Beginners: Keep this true! This is what makes Meta-SGD special.

LearnMomentum

Gets or sets whether to learn per-parameter momentum coefficients.

public bool LearnMomentum { get; set; }

Property Value

bool

Default: false (adds complexity, only enable if needed).

Remarks

When enabled, each parameter learns its own momentum coefficient. This can help with parameters that benefit from momentum-based updates but adds to the number of meta-parameters to learn.

For Beginners: Leave this false initially. Only enable if you find that per-parameter learning rates alone aren't sufficient.

LearningRateInitRange

Gets or sets the initialization range for learning rates when using random initialization.

public double LearningRateInitRange { get; set; }

Property Value

double

Default: 0.1.

Remarks

When using Random initialization, learning rates are initialized uniformly in [InnerLearningRate - range/2, InnerLearningRate + range/2].

LearningRateInitialization

Gets or sets the initialization strategy for per-parameter learning rates.

public MetaSGDLearningRateInitialization LearningRateInitialization { get; set; }

Property Value

MetaSGDLearningRateInitialization

Default: MetaSGDLearningRateInitialization.Uniform.

Remarks

Initialization strategies:

- Uniform: all learning rates start at InnerLearningRate
- Random: random values within LearningRateInitRange
- MagnitudeBased: based on parameter magnitudes
- LayerBased: different rates per layer depth
- Xavier: Xavier-style initialization
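For instance, random initialization around InnerLearningRate could be configured as follows (a sketch; the resulting range follows the description under LearningRateInitRange above):

var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    LearningRateInitialization = MetaSGDLearningRateInitialization.Random,
    InnerLearningRate = 0.01,       // center of the random range
    LearningRateInitRange = 0.005   // rates drawn from [0.0075, 0.0125]
};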

LearningRateL2Reg

Gets or sets the L2 regularization coefficient for learned learning rates.

public double LearningRateL2Reg { get; set; }

Property Value

double

Default: 0.0 (no regularization).

Remarks

Regularizing learning rates prevents them from growing too large and helps with generalization across tasks.

LearningRateSchedule

Gets or sets the learning rate schedule type.

public MetaSGDLearningRateScheduleType LearningRateSchedule { get; set; }

Property Value

MetaSGDLearningRateScheduleType

Default: MetaSGDLearningRateScheduleType.CosineAnnealing.

LossFunction

Gets or sets the loss function for training.

public ILossFunction<T>? LossFunction { get; set; }

Property Value

ILossFunction<T>

Default: null (uses model's default loss function if available).

Remarks

The loss function is used both in the inner loop (task adaptation) and outer loop (meta-optimization) to guide the learning of per-parameter optimizers.

MaxLearningRate

Gets or sets the maximum allowed per-parameter learning rate.

public double MaxLearningRate { get; set; }

Property Value

double

Default: 1.0 (prevents learning rates from becoming too large).

Remarks

Clipping learning rates to a maximum prevents unstable updates that could blow up during adaptation.

MetaBatchSize

Gets or sets the number of tasks to sample per meta-training iteration.

public int MetaBatchSize { get; set; }

Property Value

int

Default: 4 (typical meta-batch size).

Remarks

Larger batch sizes provide more stable gradients for the meta-parameters but require more memory and computation per iteration.

MetaModel

Gets or sets the meta-model to be trained. This is the only required property.

public IFullModel<T, TInput, TOutput> MetaModel { get; set; }

Property Value

IFullModel<T, TInput, TOutput>

Remarks

The model must implement IFullModel to support parameter getting/setting required for Meta-SGD's per-parameter optimization. Each parameter in the model will have its own learned learning rate, momentum, and direction coefficients.

For Beginners: This is the neural network whose parameters you want to meta-train. Meta-SGD will learn how to optimize each weight in this network.

MetaOptimizer

Gets or sets the optimizer for meta-parameter updates (outer loop).

public IGradientBasedOptimizer<T, TInput, TOutput>? MetaOptimizer { get; set; }

Property Value

IGradientBasedOptimizer<T, TInput, TOutput>

Default: null (uses built-in Adam optimizer with OuterLearningRate).

Remarks

This optimizer updates the learned per-parameter learning rates, momentums, and directions. Adam is typically used as it handles the sparse gradients from per-parameter optimization well.

MinLearningRate

Gets or sets the minimum allowed per-parameter learning rate.

public double MinLearningRate { get; set; }

Property Value

double

Default: 1e-6 (prevents learning rates from becoming too small).

Remarks

Clipping learning rates to a minimum prevents parameters from becoming "frozen" during adaptation.

NumCurvatureSamples

Gets or sets the number of samples for curvature approximation.

public int NumCurvatureSamples { get; set; }

Property Value

int

Default: 10.

Remarks

Number of random directions used for Hessian-vector product approximation. Only used when UseHessianFree is true.

NumMetaIterations

Gets or sets the total number of meta-training iterations to perform.

public int NumMetaIterations { get; set; }

Property Value

int

Default: 10000 (Meta-SGD typically needs many iterations).

Remarks

Meta-SGD often requires more iterations than MAML because it has more meta-parameters to learn (per-parameter coefficients). Monitor the validation loss to determine when to stop.

NumParameterGroups

Gets or sets the number of parameter groups.

public int NumParameterGroups { get; set; }

Property Value

int

Default: 10.

Remarks

Only used when UseParameterGrouping is true. Parameters are partitioned into this many groups, with each group sharing a single learned learning rate.

OuterLearningRate

Gets or sets the learning rate for the outer loop (meta-optimization).

public double OuterLearningRate { get; set; }

Property Value

double

Default: 0.001.

Remarks

This controls how quickly the per-parameter learning rates, momentums, and directions are updated during meta-training. A lower value provides more stable learning but slower convergence.

For Beginners: This controls how fast Meta-SGD learns the optimal per-parameter configurations. Too high causes instability, too low is slow.

ParameterSharingThreshold

Gets or sets the similarity threshold for parameter sharing.

public double ParameterSharingThreshold { get; set; }

Property Value

double

Default: 0.95.

Remarks

Only used when UseParameterSharing is true. Parameters with similarity above this threshold will share learning rate configurations.

RandomSeed

Gets or sets the random seed for reproducibility.

public int? RandomSeed { get; set; }

Property Value

int?

Default: null (non-deterministic).

ScheduleWarmupEpisodes

Gets or sets the number of warmup episodes for learning rate schedule.

public int ScheduleWarmupEpisodes { get; set; }

Property Value

int

Default: 1000.

Remarks

During warmup, the learning rate gradually increases from a small value to the target value. This can help with training stability.

TrustRegionRadius

Gets or sets the trust region radius.

public double TrustRegionRadius { get; set; }

Property Value

double

Default: 1.0.

Remarks

Maximum allowed magnitude for parameter updates. Only used when UseTrustRegion is true.

UpdateRuleType

Gets or sets the update rule type for per-parameter optimization.

public MetaSGDUpdateRuleType UpdateRuleType { get; set; }

Property Value

MetaSGDUpdateRuleType

Default: MetaSGDUpdateRuleType.SGD.

Remarks

Update Rule Types:

- SGD: standard gradient descent with learned learning rates
- SGDWithMomentum: adds learned momentum terms per parameter
- Adam: full Adam optimizer with learned beta parameters
- RMSprop: RMSprop with learned decay rates
- AdaGrad: AdaGrad with learned accumulation
- AdaDelta: AdaDelta with learned decay

For Beginners: Start with SGD and only move to more complex rules if you find SGD isn't working well for your problem.
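For example, the momentum-based rule is typically paired with learned momentum coefficients (a minimal sketch):

var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UpdateRuleType = MetaSGDUpdateRuleType.SGDWithMomentum,
    LearnMomentum = true  // learn a per-parameter β_i alongside the per-parameter α_i
};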

UseFirstOrder

Gets or sets whether to use first-order approximation.

public bool UseFirstOrder { get; set; }

Property Value

bool

Default: true (Meta-SGD is inherently first-order).

Remarks

Meta-SGD is designed as a first-order algorithm. Unlike MAML, it doesn't require computing gradients through the adaptation process. This property is always effectively true for standard Meta-SGD.

For Beginners: First-order means Meta-SGD doesn't need to compute complex second-order derivatives, making it much faster than MAML.

UseHessianFree

Gets or sets whether to use Hessian-free approximation for meta-gradients.

public bool UseHessianFree { get; set; }

Property Value

bool

Default: false.

Remarks

Hessian-free methods can provide better meta-gradient estimates at the cost of additional computation.

UseLayerWiseDecay

Gets or sets whether to apply layer-wise learning rate decay.

public bool UseLayerWiseDecay { get; set; }

Property Value

bool

Default: false.

Remarks

When enabled, deeper layers get smaller initial learning rates. This can help with training stability in very deep networks.

Formula: layer_lr = base_lr × (LayerDecayFactor ^ layer_depth)
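As a worked example: with a base rate of 0.01 and LayerDecayFactor = 0.9, layer 0 starts at 0.01, layer 1 at 0.009, layer 2 at 0.0081, and layer 3 at roughly 0.0073.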

UseLearningRateSchedule

Gets or sets whether to use a learning rate schedule for meta-learning rates.

public bool UseLearningRateSchedule { get; set; }

Property Value

bool

Default: false.

Remarks

When enabled, the meta-learning rate (outer loop) follows a schedule during training. This can help with convergence.

UseParameterGrouping

Gets or sets whether to use parameter grouping.

public bool UseParameterGrouping { get; set; }

Property Value

bool

Default: false.

Remarks

When enabled, parameters are grouped and share learning rates within groups. This reduces the number of meta-parameters and can improve generalization when the model has many parameters.

For Beginners: Enable this for very large models to reduce memory usage and potentially improve generalization.

UseParameterSharing

Gets or sets whether to use parameter sharing based on similarity.

public bool UseParameterSharing { get; set; }

Property Value

bool

Default: false.

Remarks

When enabled, parameters with similar values or gradients share learning rate configurations. This can improve sample efficiency.

UseTrustRegion

Gets or sets whether to use trust region for parameter updates.

public bool UseTrustRegion { get; set; }

Property Value

bool

Default: false.

Remarks

Trust region methods constrain the magnitude of parameter updates, which can improve stability during meta-training.

UseWarmStart

Gets or sets whether to use warm-start initialization for the optimizer.

public bool UseWarmStart { get; set; }

Property Value

bool

Default: true.

Remarks

When enabled, initializes per-parameter learning rates and other coefficients to reasonable default values based on the configuration.

Methods

Clone()

Creates a deep copy of the Meta-SGD options.

public IMetaLearnerOptions<T> Clone()

Returns

IMetaLearnerOptions<T>

A new MetaSGDOptions instance with the same configuration values.
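A typical use is branching off a variant configuration without mutating the original (a sketch; the cast assumes the returned instance is a MetaSGDOptions, as the return description states):

var baseOptions = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork);
var variant = (MetaSGDOptions<double, Tensor<double>, Tensor<double>>)baseOptions.Clone();
variant.LearnMomentum = true;  // changes the copy only; baseOptions is unaffected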

GetEffectiveParameterGroups(int)

Gets the effective number of parameter groups.

public int GetEffectiveParameterGroups(int numModelParameters)

Parameters

numModelParameters int

Number of model parameters.

Returns

int

Number of parameter groups after considering options.
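For example, with grouping enabled the returned value is bounded by NumParameterGroups (a sketch; the model size is illustrative):

var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UseParameterGrouping = true,
    NumParameterGroups = 10
};
int groups = options.GetEffectiveParameterGroups(50_000);  // expected to be 10 rather than 50,000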

GetTotalMetaParameters(int)

Gets the total number of learnable meta-parameters.

public int GetTotalMetaParameters(int numModelParameters)

Parameters

numModelParameters int

Number of model parameters.

Returns

int

Total number of meta-parameters to learn.

Remarks

The total depends on which features are enabled:

- Per-parameter learning rates: +numModelParameters
- Per-parameter momentum: +numModelParameters
- Per-parameter direction: +numModelParameters
- Adam betas (if learned): +2 * numModelParameters
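As a rough worked example based on the breakdown above (assuming no parameter grouping): for a model with 1,000 parameters and the defaults (learning rates and directions learned, momentum and Adam betas not), the total is about 2,000 meta-parameters; enabling LearnMomentum would add another 1,000 and LearnAdamBetas another 2,000.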

IsValid()

Validates that all Meta-SGD configuration options are properly set.

public bool IsValid()

Returns

bool

True if the configuration is valid for Meta-SGD training; otherwise, false.

Remarks

Validates all required hyperparameters and Meta-SGD-specific settings:

- Standard meta-learning parameters (learning rates, steps, etc.)
- Learning rate bounds and regularization
- Parameter grouping configuration
- Adam beta parameters if using Adam
- Trust region settings
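A common pattern is to validate the configuration before constructing the algorithm (a sketch; MetaSGDAlgorithm is used as in the constructor example above, and the values are illustrative):

var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerLearningRate = 0.01,
    AdaptationSteps = 5
};

if (!options.IsValid())
{
    throw new InvalidOperationException("Meta-SGD options failed validation; check learning rates and bounds.");
}

var metaSGD = new MetaSGDAlgorithm<double, Tensor<double>, Tensor<double>>(options);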