Class MetaSGDOptions<T, TInput, TOutput>
- Namespace: AiDotNet.MetaLearning.Options
- Assembly: AiDotNet.dll
Configuration options for the Meta-SGD (Meta Stochastic Gradient Descent) algorithm.
public class MetaSGDOptions<T, TInput, TOutput> : IMetaLearnerOptions<T>
Type Parameters
T: The numeric data type used for calculations (e.g., float, double).
TInput: The input data type (e.g., Matrix<T>, Tensor<T>).
TOutput: The output data type (e.g., Vector<T>, Tensor<T>).
- Inheritance
- object
- MetaSGDOptions<T, TInput, TOutput>
- Implements
- IMetaLearnerOptions<T>
Remarks
Meta-SGD extends MAML by learning not just the model initialization but also per-parameter learning rates, momentum terms, and update directions. This effectively learns a custom optimizer configuration for each parameter, enabling highly specialized adaptation strategies.
For Beginners: Think of Meta-SGD as "learning how to learn" at the finest grain:
In standard training, you pick one learning rate for all parameters. But different parts of a neural network might benefit from different learning rates. Meta-SGD figures this out automatically by learning:
- Per-parameter learning rates: Some weights need small updates, others larger
- Per-parameter momentum: Some weights benefit from momentum, others don't
- Update directions: Sometimes the gradient direction should be flipped or scaled
Algorithm Overview:
# For each model parameter θ_i, learn:
# - α_i: the optimal learning rate
# - β_i: the optimal momentum coefficient
# - d_i: the optimal update direction/scaling
# Meta-training:
for task in task_batch:
    adapted_params = initial_params.copy()
    velocity = zeros_like(adapted_params)
    for step in range(K_inner):
        gradients = compute_gradients(adapted_params, support_set)
        for i in range(num_params):
            # Per-parameter update rule with learned α_i, β_i, d_i
            velocity[i] = α_i * d_i * gradients[i] + β_i * velocity[i]
            adapted_params[i] -= velocity[i]
    query_loss = evaluate(adapted_params, query_set)
    # Meta-update: optimize α_i, β_i, d_i using the gradient of query_loss
Key Insights:
1. Per-parameter optimization allows heterogeneous learning rates across layers
2. First-order method: no Hessian computation needed, much faster than second-order MAML
3. Learned optimizers reveal which parameters are important for quick adaptation
4. Can combine with various base update rules (SGD, Adam, RMSprop)
Reference: Li, Z., Zhou, F., Chen, F., & Li, H. (2017). Meta-SGD: Learning to Learn Quickly for Few-Shot Learning.
Constructors
MetaSGDOptions(IFullModel<T, TInput, TOutput>)
Initializes a new instance of the MetaSGDOptions class with the required meta-model.
public MetaSGDOptions(IFullModel<T, TInput, TOutput> metaModel)
Parameters
metaModel (IFullModel<T, TInput, TOutput>): The meta-model to be trained (required).
Examples
// Create Meta-SGD options with minimal configuration
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork);
var metaSGD = new MetaSGDAlgorithm<double, Tensor<double>, Tensor<double>>(options);
// Create Meta-SGD options with custom per-parameter optimizer
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UpdateRuleType = MetaSGDUpdateRuleType.Adam,
    LearnMomentum = true,
    LearnDirection = true,
    LearnAdamBetas = true,
    UseParameterGrouping = true,
    NumParameterGroups = 20
};
Exceptions
- ArgumentNullException
Thrown when metaModel is null.
Properties
AdamBeta1Init
Gets or sets the initial value for Adam beta1.
public double AdamBeta1Init { get; set; }
Property Value
- double
Default: 0.9 (standard Adam default).
AdamBeta2Init
Gets or sets the initial value for Adam beta2.
public double AdamBeta2Init { get; set; }
Property Value
- double
Default: 0.999 (standard Adam default).
AdamEpsilonInit
Gets or sets the epsilon value for Adam numerical stability.
public double AdamEpsilonInit { get; set; }
Property Value
- double
Default: 1e-8.
AdaptationSteps
Gets or sets the number of gradient steps to take during inner loop adaptation.
public int AdaptationSteps { get; set; }
Property Value
- int
Default: 5 (typical for few-shot learning).
Remarks
Meta-SGD uses first-order optimization, so more adaptation steps are computationally cheaper than in MAML. However, too many steps can lead to overfitting on the support set.
CheckpointFrequency
Gets or sets how often to save checkpoints.
public int CheckpointFrequency { get; set; }
Property Value
- int
Default: 1000.
DataLoader
Gets or sets the episodic data loader for sampling tasks.
public IEpisodicDataLoader<T, TInput, TOutput>? DataLoader { get; set; }
Property Value
- IEpisodicDataLoader<T, TInput, TOutput>
Default: null (tasks must be provided manually to MetaTrain).
EnableCheckpointing
Gets or sets whether to save checkpoints during training.
public bool EnableCheckpointing { get; set; }
Property Value
- bool
Default: false.
EvaluationFrequency
Gets or sets how often to evaluate during meta-training.
public int EvaluationFrequency { get; set; }
Property Value
- int
Default: 500.
EvaluationTasks
Gets or sets the number of tasks to use for evaluation.
public int EvaluationTasks { get; set; }
Property Value
- int
Default: 100.
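For example, the checkpointing and evaluation settings are typically configured together. A minimal sketch, assuming myNeuralNetwork is your own IFullModel implementation:
// Save a checkpoint every 1000 iterations and evaluate on 100 held-out tasks every 500 iterations
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    EnableCheckpointing = true,
    CheckpointFrequency = 1000,
    EvaluationFrequency = 500,
    EvaluationTasks = 100
};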
GradientClipThreshold
Gets or sets the maximum gradient norm for gradient clipping.
public double? GradientClipThreshold { get; set; }
Property Value
- double?
Default: 10.0 (prevents exploding gradients during meta-training).
Remarks
Gradient clipping is particularly important in Meta-SGD because the per-parameter learning rates can amplify gradients if they grow too large.
InnerLearningRate
Gets or sets the learning rate for the inner loop (task adaptation).
public double InnerLearningRate { get; set; }
Property Value
- double
Default: 0.01.
Remarks
In Meta-SGD, this serves as the initial value for per-parameter learning rates when using uniform initialization. During meta-training, each parameter will learn its own optimal learning rate, which may diverge from this initial value.
For Beginners: This is the starting point for all learning rates. Meta-SGD will adjust each one individually as it learns.
InnerOptimizer
Gets or sets the optimizer for inner-loop adaptation.
public IGradientBasedOptimizer<T, TInput, TOutput>? InnerOptimizer { get; set; }
Property Value
- IGradientBasedOptimizer<T, TInput, TOutput>
Default: null (uses the learned per-parameter optimizer).
Remarks
In Meta-SGD, this is typically not used directly because the per-parameter optimizer with learned coefficients replaces standard optimization. This is provided for compatibility with the base interface.
InnerSteps
Gets or sets the number of inner steps during meta-training.
public int InnerSteps { get; set; }
Property Value
- int
Default: 5 (matches AdaptationSteps by default).
Remarks
This can be different from AdaptationSteps to allow for different behavior during training vs. adaptation. Some implementations use fewer inner steps during training for efficiency.
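A minimal sketch of using different step counts for meta-training and adaptation, assuming myNeuralNetwork is your own IFullModel implementation:
// Fewer inner steps during meta-training for speed, more steps when adapting to a new task
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerSteps = 3,
    AdaptationSteps = 10
};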
LayerDecayFactor
Gets or sets the decay factor per layer depth.
public double LayerDecayFactor { get; set; }
Property Value
- double
Default: 0.9.
Remarks
Only used when UseLayerWiseDecay is true. A value of 0.9 means each successive layer has 90% of the previous layer's learning rate.
LearnAdamBetas
Gets or sets whether to learn Adam beta parameters when using Adam update rule.
public bool LearnAdamBetas { get; set; }
Property Value
- bool
Default: false.
Remarks
When enabled with Adam update rule, learns per-parameter beta1 and beta2 coefficients. This adds significant complexity but can improve performance.
LearnDirection
Gets or sets whether to learn per-parameter update direction signs.
public bool LearnDirection { get; set; }
Property Value
- bool
Default: true (helps with gradient sign ambiguity).
Remarks
The direction parameter can flip or scale the gradient direction for each parameter. This helps when the natural gradient direction isn't optimal for fast adaptation.
Mathematical formulation: θ_i' = θ_i - α_i × d_i × ∇_θ_i L where d_i is the learned direction scaling factor.
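The sketch below illustrates this update on plain arrays; theta, alpha, direction, and gradients are hypothetical values, not members of this class:
// Illustrative per-parameter update: theta_i' = theta_i - alpha_i * d_i * grad_i
double[] theta = { 0.5, -1.2, 0.8 };      // model parameters
double[] alpha = { 0.01, 0.10, 0.05 };    // learned per-parameter learning rates
double[] direction = { 1.0, -0.5, 1.3 };  // learned direction/scaling factors d_i
double[] gradients = { 0.2, 0.4, -0.1 };  // gradients of the task loss
for (int i = 0; i < theta.Length; i++)
{
    theta[i] -= alpha[i] * direction[i] * gradients[i];
}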
LearnLearningRate
Gets or sets whether to learn per-parameter learning rates.
public bool LearnLearningRate { get; set; }
Property Value
- bool
Default: true (the core feature of Meta-SGD).
Remarks
This is the defining feature of Meta-SGD. Each parameter gets its own learned learning rate that is optimized during meta-training to enable fast adaptation to new tasks.
For Beginners: Keep this true! This is what makes Meta-SGD special.
LearnMomentum
Gets or sets whether to learn per-parameter momentum coefficients.
public bool LearnMomentum { get; set; }
Property Value
- bool
Default: false (adds complexity, only enable if needed).
Remarks
When enabled, each parameter learns its own momentum coefficient. This can help with parameters that benefit from momentum-based updates but adds to the number of meta-parameters to learn.
For Beginners: Leave this false initially. Only enable if you find that per-parameter learning rates alone aren't sufficient.
LearningRateInitRange
Gets or sets the initialization range for learning rates when using random initialization.
public double LearningRateInitRange { get; set; }
Property Value
- double
Default: 0.1.
Remarks
When using Random initialization, learning rates are initialized uniformly in [InnerLearningRate - range/2, InnerLearningRate + range/2].
LearningRateInitialization
Gets or sets the initialization strategy for per-parameter learning rates.
public MetaSGDLearningRateInitialization LearningRateInitialization { get; set; }
Property Value
- MetaSGDLearningRateInitialization
Default: MetaSGDLearningRateInitialization.Uniform.
Remarks
Initialization strategies:
- Uniform: All learning rates start at InnerLearningRate
- Random: Random values within LearningRateInitRange
- MagnitudeBased: Based on parameter magnitudes
- LayerBased: Different rates per layer depth
- Xavier: Xavier-style initialization
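A configuration sketch for Random initialization, assuming myNeuralNetwork is your own IFullModel implementation:
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerLearningRate = 0.05,
    LearningRateInitialization = MetaSGDLearningRateInitialization.Random,
    // Rates start uniformly in [0.05 - 0.02, 0.05 + 0.02] = [0.03, 0.07]
    LearningRateInitRange = 0.04
};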
LearningRateL2Reg
Gets or sets the L2 regularization coefficient for learned learning rates.
public double LearningRateL2Reg { get; set; }
Property Value
- double
Default: 0.0 (no regularization).
Remarks
Regularizing learning rates prevents them from growing too large and helps with generalization across tasks.
LearningRateSchedule
Gets or sets the learning rate schedule type.
public MetaSGDLearningRateScheduleType LearningRateSchedule { get; set; }
Property Value
- MetaSGDLearningRateScheduleType
Default: MetaSGDLearningRateScheduleType.CosineAnnealing.
LossFunction
Gets or sets the loss function for training.
public ILossFunction<T>? LossFunction { get; set; }
Property Value
- ILossFunction<T>
Default: null (uses model's default loss function if available).
Remarks
The loss function is used both in the inner loop (task adaptation) and outer loop (meta-optimization) to guide the learning of per-parameter optimizers.
MaxLearningRate
Gets or sets the maximum allowed per-parameter learning rate.
public double MaxLearningRate { get; set; }
Property Value
- double
Default: 1.0 (prevents learning rates from becoming too large).
Remarks
Clipping learning rates to a maximum prevents unstable updates that could blow up during adaptation.
MetaBatchSize
Gets or sets the number of tasks to sample per meta-training iteration.
public int MetaBatchSize { get; set; }
Property Value
- int
Default: 4 (typical meta-batch size).
Remarks
Larger batch sizes provide more stable gradients for the meta-parameters but require more memory and computation per iteration.
MetaModel
Gets or sets the meta-model to be trained. This is the only required property.
public IFullModel<T, TInput, TOutput> MetaModel { get; set; }
Property Value
- IFullModel<T, TInput, TOutput>
Remarks
The model must implement IFullModel to support parameter getting/setting required for Meta-SGD's per-parameter optimization. Each parameter in the model will have its own learned learning rate, momentum, and direction coefficients.
For Beginners: This is the neural network whose parameters you want to meta-train. Meta-SGD will learn how to optimize each weight in this network.
MetaOptimizer
Gets or sets the optimizer for meta-parameter updates (outer loop).
public IGradientBasedOptimizer<T, TInput, TOutput>? MetaOptimizer { get; set; }
Property Value
- IGradientBasedOptimizer<T, TInput, TOutput>
Default: null (uses built-in Adam optimizer with OuterLearningRate).
Remarks
This optimizer updates the learned per-parameter learning rates, momentums, and directions. Adam is typically used as it handles the sparse gradients from per-parameter optimization well.
MinLearningRate
Gets or sets the minimum allowed per-parameter learning rate.
public double MinLearningRate { get; set; }
Property Value
- double
Default: 1e-6 (prevents learning rates from becoming too small).
Remarks
Clipping learning rates to a minimum prevents parameters from becoming "frozen" during adaptation.
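Together with MaxLearningRate, this bound acts as a clamp on each learned rate. An illustrative sketch with hypothetical values:
double minLr = 1e-6, maxLr = 1.0;  // MinLearningRate and MaxLearningRate defaults
double learnedRate = 1.7;          // hypothetical rate produced during meta-training
double clamped = Math.Clamp(learnedRate, minLr, maxLr);  // clamped == 1.0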
NumCurvatureSamples
Gets or sets the number of samples for curvature approximation.
public int NumCurvatureSamples { get; set; }
Property Value
- int
Default: 10.
Remarks
Number of random directions used for Hessian-vector product approximation. Only used when UseHessianFree is true.
NumMetaIterations
Gets or sets the total number of meta-training iterations to perform.
public int NumMetaIterations { get; set; }
Property Value
- int
Default: 10000 (Meta-SGD typically needs many iterations).
Remarks
Meta-SGD often requires more iterations than MAML because it has more meta-parameters to learn (per-parameter coefficients). Monitor the validation loss to determine when to stop.
NumParameterGroups
Gets or sets the number of parameter groups.
public int NumParameterGroups { get; set; }
Property Value
- int
Default: 10.
Remarks
Only used when UseParameterGrouping is true. Parameters are partitioned into this many groups, with each group sharing a single learned learning rate.
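The grouping can be pictured as a simple index-to-group mapping. This is only an illustrative sketch of the idea, not the library's internal partitioning scheme:
// Partition 1,000,000 parameters into 10 contiguous groups; each group shares one learned rate
int numModelParameters = 1_000_000;
int numGroups = 10;
int groupSize = (numModelParameters + numGroups - 1) / numGroups;  // ceiling division
double[] groupLearningRates = new double[numGroups];               // one learned rate per group
int GroupOf(int parameterIndex) => parameterIndex / groupSize;     // group index for a parameter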
OuterLearningRate
Gets or sets the learning rate for the outer loop (meta-optimization).
public double OuterLearningRate { get; set; }
Property Value
- double
Default: 0.001.
Remarks
This controls how quickly the per-parameter learning rates, momentums, and directions are updated during meta-training. A lower value provides more stable learning but slower convergence.
For Beginners: This controls how fast Meta-SGD learns the optimal per-parameter configurations. Too high causes instability, too low is slow.
ParameterSharingThreshold
Gets or sets the similarity threshold for parameter sharing.
public double ParameterSharingThreshold { get; set; }
Property Value
- double
Default: 0.95.
Remarks
Only used when UseParameterSharing is true. Parameters with similarity above this threshold will share learning rate configurations.
RandomSeed
Gets or sets the random seed for reproducibility.
public int? RandomSeed { get; set; }
Property Value
- int?
Default: null (non-deterministic).
ScheduleWarmupEpisodes
Gets or sets the number of warmup episodes for learning rate schedule.
public int ScheduleWarmupEpisodes { get; set; }
Property Value
- int
Default: 1000.
Remarks
During warmup, the learning rate gradually increases from a small value to the target value. This can help with training stability.
TrustRegionRadius
Gets or sets the trust region radius.
public double TrustRegionRadius { get; set; }
Property Value
- double
Default: 1.0.
Remarks
Maximum allowed magnitude for parameter updates. Only used when UseTrustRegion is true.
UpdateRuleType
Gets or sets the update rule type for per-parameter optimization.
public MetaSGDUpdateRuleType UpdateRuleType { get; set; }
Property Value
- MetaSGDUpdateRuleType
Default: MetaSGDUpdateRuleType.SGD.
Remarks
Update Rule Types:
- SGD: Standard gradient descent with learned learning rates
- SGDWithMomentum: Adds learned momentum terms per parameter
- Adam: Full Adam optimizer with learned beta parameters
- RMSprop: RMSprop with learned decay rates
- AdaGrad: AdaGrad with learned accumulation
- AdaDelta: AdaDelta with learned decay
For Beginners: Start with SGD and only move to more complex rules if you find SGD isn't working well for your problem.
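A configuration sketch using the Adam update rule with learned betas, assuming myNeuralNetwork is your own IFullModel implementation:
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UpdateRuleType = MetaSGDUpdateRuleType.Adam,
    LearnAdamBetas = true,
    AdamBeta1Init = 0.9,
    AdamBeta2Init = 0.999,
    AdamEpsilonInit = 1e-8
};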
UseFirstOrder
Gets or sets whether to use first-order approximation.
public bool UseFirstOrder { get; set; }
Property Value
- bool
Default: true (Meta-SGD is inherently first-order).
Remarks
Meta-SGD is designed as a first-order algorithm. Unlike MAML, it doesn't require computing gradients through the adaptation process. This property is always effectively true for standard Meta-SGD.
For Beginners: First-order means Meta-SGD doesn't need to compute complex second-order derivatives, making it much faster than MAML.
UseHessianFree
Gets or sets whether to use Hessian-free approximation for meta-gradients.
public bool UseHessianFree { get; set; }
Property Value
- bool
Default: false.
Remarks
Hessian-free methods can provide better meta-gradient estimates at the cost of additional computation.
UseLayerWiseDecay
Gets or sets whether to apply layer-wise learning rate decay.
public bool UseLayerWiseDecay { get; set; }
Property Value
- bool
Default: false.
Remarks
When enabled, deeper layers get smaller initial learning rates. This can help with training stability in very deep networks.
Formula: layer_lr = base_lr × (LayerDecayFactor ^ layer_depth)
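A small sketch of the formula above with the default decay factor:
double baseLr = 0.01;      // InnerLearningRate
double decayFactor = 0.9;  // LayerDecayFactor
for (int depth = 0; depth < 4; depth++)
{
    double layerLr = baseLr * Math.Pow(decayFactor, depth);
    Console.WriteLine($"Layer {depth}: initial learning rate {layerLr:F5}");
    // Layer 0: 0.01000, Layer 1: 0.00900, Layer 2: 0.00810, Layer 3: 0.00729
}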
UseLearningRateSchedule
Gets or sets whether to use a learning rate schedule for meta-learning rates.
public bool UseLearningRateSchedule { get; set; }
Property Value
- bool
Default: false.
Remarks
When enabled, the meta-learning rate (outer loop) follows a schedule during training. This can help with convergence.
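A configuration sketch enabling a cosine schedule with warmup, assuming myNeuralNetwork is your own IFullModel implementation:
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    UseLearningRateSchedule = true,
    LearningRateSchedule = MetaSGDLearningRateScheduleType.CosineAnnealing,
    ScheduleWarmupEpisodes = 1000
};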
UseParameterGrouping
Gets or sets whether to use parameter grouping.
public bool UseParameterGrouping { get; set; }
Property Value
- bool
Default: false.
Remarks
When enabled, parameters are grouped and share learning rates within groups. This reduces the number of meta-parameters and can improve generalization when the model has many parameters.
For Beginners: Enable this for very large models to reduce memory usage and potentially improve generalization.
UseParameterSharing
Gets or sets whether to use parameter sharing based on similarity.
public bool UseParameterSharing { get; set; }
Property Value
- bool
Default: false.
Remarks
When enabled, parameters with similar values or gradients share learning rate configurations. This can improve sample efficiency.
UseTrustRegion
Gets or sets whether to use trust region for parameter updates.
public bool UseTrustRegion { get; set; }
Property Value
- bool
Default: false.
Remarks
Trust region methods constrain the magnitude of parameter updates, which can improve stability during meta-training.
UseWarmStart
Gets or sets whether to use warm-start initialization for the optimizer.
public bool UseWarmStart { get; set; }
Property Value
- bool
Default: true.
Remarks
When enabled, initializes per-parameter learning rates and other coefficients to reasonable default values based on the configuration.
Methods
Clone()
Creates a deep copy of the Meta-SGD options.
public IMetaLearnerOptions<T> Clone()
Returns
- IMetaLearnerOptions<T>
A new MetaSGDOptions instance with the same configuration values.
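Clone is useful for trying configuration variants without mutating the original. The cast below assumes the returned instance is a MetaSGDOptions, per the return description above, and that options is an existing MetaSGDOptions<double, Tensor<double>, Tensor<double>> instance:
// Create a variant of an existing options instance
var variant = (MetaSGDOptions<double, Tensor<double>, Tensor<double>>)options.Clone();
variant.LearnMomentum = true;  // the original options instance is unchanged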
GetEffectiveParameterGroups(int)
Gets the effective number of parameter groups.
public int GetEffectiveParameterGroups(int numModelParameters)
Parameters
numModelParameters (int): Number of model parameters.
Returns
- int
Number of parameter groups after considering options.
GetTotalMetaParameters(int)
Gets the total number of learnable meta-parameters.
public int GetTotalMetaParameters(int numModelParameters)
Parameters
numModelParameters (int): Number of model parameters.
Returns
- int
Total number of meta-parameters to learn.
Remarks
The total depends on which features are enabled:
- Per-parameter learning rates: +numModelParameters
- Per-parameter momentum: +numModelParameters
- Per-parameter direction: +numModelParameters
- Adam betas (if learning): +2*numModelParameters
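A sketch of the counting rule described above; the actual method may also account for parameter grouping:
// Meta-parameter count for a 100,000-parameter model with the default feature flags
int numModelParameters = 100_000;
bool learnLearningRate = true, learnMomentum = false, learnDirection = true, learnAdamBetas = false;
int total = 0;
if (learnLearningRate) total += numModelParameters;   // per-parameter learning rates
if (learnMomentum) total += numModelParameters;       // per-parameter momentum coefficients
if (learnDirection) total += numModelParameters;      // per-parameter direction factors
if (learnAdamBetas) total += 2 * numModelParameters;  // per-parameter beta1 and beta2
Console.WriteLine(total);                             // 200000 with these flags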
IsValid()
Validates that all Meta-SGD configuration options are properly set.
public bool IsValid()
Returns
- bool
True if the configuration is valid for Meta-SGD training; otherwise, false.
Remarks
Validates all required hyperparameters and Meta-SGD-specific settings:
- Standard meta-learning parameters (learning rates, steps, etc.)
- Learning rate bounds and regularization
- Parameter grouping configuration
- Adam beta parameters if using Adam
- Trust region settings
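A typical usage sketch before starting meta-training, assuming myNeuralNetwork is your own IFullModel implementation:
var options = new MetaSGDOptions<double, Tensor<double>, Tensor<double>>(myNeuralNetwork)
{
    InnerLearningRate = 0.01,
    OuterLearningRate = 0.001
};
if (!options.IsValid())
{
    throw new InvalidOperationException("Meta-SGD options are not configured correctly.");
}
var metaSGD = new MetaSGDAlgorithm<double, Tensor<double>, Tensor<double>>(options);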