Class LocalSGDOptimizer<T, TInput, TOutput>

Namespace
AiDotNet.DistributedTraining
Assembly
AiDotNet.dll

Implements the Local SGD distributed training optimizer, which averages model parameters across workers after each round of local optimization.

public class LocalSGDOptimizer<T, TInput, TOutput> : ShardedOptimizerBase<T, TInput, TOutput>, IShardedOptimizer<T, TInput, TOutput>, IOptimizer<T, TInput, TOutput>, IModelSerializer

Type Parameters

T

The numeric type used for calculations (for example, float or double)

TInput

The input type for the model

TOutput

The output type for the model

Inheritance
ShardedOptimizerBase<T, TInput, TOutput>
LocalSGDOptimizer<T, TInput, TOutput>
Implements
IShardedOptimizer<T, TInput, TOutput>
IOptimizer<T, TInput, TOutput>
IModelSerializer

Remarks

Strategy Overview: Local SGD allows each worker to perform multiple local optimization steps independently, then synchronizes model parameters (not gradients) across all workers using AllReduce averaging. This reduces communication frequency compared to traditional DDP while maintaining convergence. Based on "Don't Use Large Mini-Batches, Use Local SGD" (Lin et al., 2020).

For Beginners: Unlike traditional DDP which synchronizes gradients before every parameter update, Local SGD lets each worker train independently for several steps, then averages the final model parameters. Think of it like students studying independently for a week, then meeting to average their understanding, rather than checking answers after every practice problem.

Key Difference from DDP:

  • **Local SGD (this class)**: Optimize locally → Average PARAMETERS → Continue training
  • **True DDP**: Compute gradients → Average GRADIENTS → Apply averaged gradients → Continue training
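A minimal, self-contained sketch of this schedule (plain C#, not the AiDotNet API): a handful of simulated workers each take several local gradient steps on their own toy quadratic objective, then an AllReduce-style average replaces every worker's parameters. The learning rate, step counts, and toy loss are illustrative assumptions.

```csharp
using System;
using System.Linq;

class LocalSgdSketch
{
    static void Main()
    {
        double[] targets = { 1.0, 3.0, 5.0, 7.0 }; // one toy "data shard" per worker
        double[] theta   = { 0.0, 0.0, 0.0, 0.0 }; // each worker's local parameter
        const double lr = 0.1;
        const int localSteps = 5;                  // H: local steps between syncs
        const int rounds = 20;                     // number of synchronization rounds

        for (int r = 0; r < rounds; r++)
        {
            // 1. Local phase: every worker optimizes independently for H steps.
            for (int w = 0; w < theta.Length; w++)
                for (int s = 0; s < localSteps; s++)
                    theta[w] -= lr * 2.0 * (theta[w] - targets[w]); // gradient of (x - t)^2

            // 2. Sync phase: AllReduce-style averaging of PARAMETERS (not gradients).
            double average = theta.Average();
            for (int w = 0; w < theta.Length; w++)
                theta[w] = average;
        }

        // The consensus parameter approaches the minimizer of the summed loss.
        Console.WriteLine($"Consensus: {theta[0]:F3}, global optimum: {targets.Average():F3}");
    }
}
```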

Use Cases:

  • Reducing communication frequency in distributed training
  • Slower network connections where communication is expensive
  • Works with any optimizer (Adam, SGD, RMSprop, etc.)
  • Large models where parameter synchronization dominates training time

Trade-offs:

  • Memory: Each process stores full model and optimizer state
  • Communication: Very low; parameters synchronized less frequently than gradients
  • Convergence: Slightly different trajectory than DDP but reaches similar final accuracy
  • Complexity: Low; straightforward parameter averaging
  • Best for: Communication-constrained distributed training

Production Note: For true DDP (gradient averaging), use GradientCompressionOptimizer with compression ratio = 1.0, which properly averages gradients before parameter updates.

Constructors

LocalSGDOptimizer(IOptimizer<T, TInput, TOutput>, IShardingConfiguration<T>)

Creates a Local SGD optimizer that averages parameters across workers.

public LocalSGDOptimizer(IOptimizer<T, TInput, TOutput> wrappedOptimizer, IShardingConfiguration<T> config)

Parameters

wrappedOptimizer IOptimizer<T, TInput, TOutput>

The base optimizer to wrap (SGD, Adam, etc.)

config IShardingConfiguration<T>

Configuration for distributed training communication
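A hypothetical construction sketch. Only the constructor signature above comes from this page; `AdamOptimizer`, `ShardingConfiguration`, and the concrete `TInput`/`TOutput` types are illustrative stand-ins for whatever implementations your project provides.

```csharp
using AiDotNet.DistributedTraining;

// AdamOptimizer and ShardingConfiguration are hypothetical stand-ins here;
// substitute the IOptimizer<T, TInput, TOutput> and IShardingConfiguration<T>
// implementations you actually use.
IOptimizer<double, double[,], double[]> baseOptimizer =
    new AdamOptimizer<double, double[,], double[]>();
IShardingConfiguration<double> config =
    new ShardingConfiguration<double>();

// Wrap the base optimizer: it still performs the local optimization steps,
// while the wrapper averages parameters across workers at sync points.
var localSgd = new LocalSGDOptimizer<double, double[,], double[]>(baseOptimizer, config);
```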

Methods

Deserialize(byte[])

Loads a previously serialized model from binary data.

public override void Deserialize(byte[] data)

Parameters

data byte[]

The byte array containing the serialized model data.

Remarks

This method takes binary data created by the Serialize method and uses it to restore a model to its previous state.

For Beginners: This is like opening a saved file to continue your work.

When you call this method:

  • You provide the binary data (bytes) that was previously created by Serialize
  • The model rebuilds itself using this data
  • After deserializing, the model is exactly as it was when serialized
  • It's ready to make predictions without needing to be trained again

For example:

  • You download a pre-trained model file for detecting spam emails
  • You deserialize this file into your application
  • Immediately, your application can detect spam without any training
  • The model has all the knowledge that was built into it by its original creator

This is particularly useful when:

  • You want to use a model that took days to train
  • You need to deploy the same model across multiple devices
  • You're creating an application that non-technical users will use

Think of it like installing the brain of a trained expert directly into your application.
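A minimal sketch, assuming the `localSgd` instance from the constructor example and a file previously written by `Serialize()`; the file name is illustrative.

```csharp
using System.IO;

// Read bytes previously produced by Serialize() and restore the state.
byte[] savedState = File.ReadAllBytes("local-sgd-state.bin"); // illustrative path
localSgd.Deserialize(savedState);
// localSgd now carries the same learned state it had when the bytes were written.
```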

Optimize(OptimizationInputData<T, TInput, TOutput>)

Performs the optimization process to find the best parameters for a model.

public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

inputData OptimizationInputData<T, TInput, TOutput>

The data needed for optimization, including the objective function, initial parameters, and any constraints.

Returns

OptimizationResult<T, TInput, TOutput>

The result of the optimization process, including the optimized parameters and performance metrics.

Remarks

This method takes input data and attempts to find the optimal parameters that minimize or maximize the objective function.

For Beginners: This is where the actual "learning" happens. The optimizer looks at your data and tries different parameter values to find the ones that make your model perform best.

The process typically involves:

  1. Evaluating how well the current parameters perform
  2. Calculating how to change the parameters to improve performance
  3. Updating the parameters
  4. Repeating until the model performs well enough or reaches a maximum number of attempts
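A hedged usage sketch, continuing from the constructor example. This page does not document the members of `OptimizationInputData<T, TInput, TOutput>`, so its initialization is only indicated by a comment (and a parameterless constructor with settable members is assumed); consult that type's own documentation for the fields it requires.

```csharp
// Populate with your objective function, initial parameters, and constraints;
// the exact members are defined by OptimizationInputData itself (assumed here
// to be settable via an object initializer).
var inputData = new OptimizationInputData<double, double[,], double[]>
{
    // ... objective function, initial parameters, constraints ...
};

// Runs local optimization steps plus parameter averaging, then reports back.
OptimizationResult<double, double[,], double[]> result = localSgd.Optimize(inputData);
// Inspect `result` for the optimized parameters and performance metrics.
```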

Serialize()

Converts the current state of a machine learning model into a binary format.

public override byte[] Serialize()

Returns

byte[]

A byte array containing the serialized model data.

Remarks

This method captures all the essential information about a trained model and converts it into a sequence of bytes that can be stored or transmitted.

For Beginners: This is like exporting your work to a file.

When you call this method:

  • The model's current state (all its learned patterns and parameters) is captured
  • This information is converted into a compact binary format (bytes)
  • You can then save these bytes to a file, database, or send them over a network

For example:

  • After training a model to recognize cats vs. dogs in images
  • You can serialize the model to save all its learned knowledge
  • Later, you can use this saved data to recreate the model exactly as it was
  • The recreated model will make the same predictions as the original

Think of it like taking a snapshot of your model's brain at a specific moment in time.
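The matching save step, as a minimal sketch using the `localSgd` instance from the earlier examples; the file name is illustrative.

```csharp
using System.IO;

// Capture the current state as bytes and persist them for later reuse.
byte[] state = localSgd.Serialize();
File.WriteAllBytes("local-sgd-state.bin", state); // illustrative path
// The bytes can later be passed to Deserialize() to restore this exact state.
```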

SynchronizeOptimizerState()

Synchronizes optimizer state (like momentum buffers) across all processes.

public override void SynchronizeOptimizerState()

Remarks

For Beginners: Some optimizers (like Adam) keep track of past gradients to make smarter updates. This method makes sure all processes have the same optimizer state, so they stay coordinated. It's like making sure all team members are reading from the same playbook.
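A hedged sketch of one place this call is commonly useful: after each process restores a checkpoint, synchronizing the optimizer state keeps momentum and variance buffers identical across ranks before training resumes. The checkpoint path and `localSgd` instance are the illustrative ones from the earlier examples.

```csharp
using System.IO;

// Each process restores its own copy of the checkpoint...
localSgd.Deserialize(File.ReadAllBytes("local-sgd-state.bin")); // illustrative path
// ...then aligns optimizer state so every rank resumes from the same buffers.
localSgd.SynchronizeOptimizerState();
```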