Class ZeRO2Optimizer<T, TInput, TOutput>

Namespace: AiDotNet.DistributedTraining

Assembly: AiDotNet.dll

Implements the ZeRO Stage 2 optimizer, which shards gradients and optimizer states across ranks.

public class ZeRO2Optimizer<T, TInput, TOutput> : ShardedOptimizerBase<T, TInput, TOutput>, IShardedOptimizer<T, TInput, TOutput>, IOptimizer<T, TInput, TOutput>, IModelSerializer

Type Parameters

T: The numeric type
TInput: The input type for the model
TOutput: The output type for the model

Inheritance: object

ShardedOptimizerBase<T, TInput, TOutput>

ZeRO2Optimizer<T, TInput, TOutput>

Implements: IShardedOptimizer<T, TInput, TOutput>

IOptimizer<T, TInput, TOutput>

IModelSerializer

Inherited Members: ShardedOptimizerBase<T, TInput, TOutput>.NumOps

ShardedOptimizerBase<T, TInput, TOutput>.Config

ShardedOptimizerBase<T, TInput, TOutput>.WrappedOptimizer

ShardedOptimizerBase<T, TInput, TOutput>.WrappedOptimizerInternal

ShardedOptimizerBase<T, TInput, TOutput>.Rank

ShardedOptimizerBase<T, TInput, TOutput>.WorldSize

ShardedOptimizerBase<T, TInput, TOutput>.ShardingConfiguration

ShardedOptimizerBase<T, TInput, TOutput>.Optimize(OptimizationInputData<T, TInput, TOutput>)

ShardedOptimizerBase<T, TInput, TOutput>.SynchronizeOptimizerState()

ShardedOptimizerBase<T, TInput, TOutput>.SynchronizeParameters(IFullModel<T, TInput, TOutput>)

ShardedOptimizerBase<T, TInput, TOutput>.ShouldEarlyStop()

ShardedOptimizerBase<T, TInput, TOutput>.GetOptions()

ShardedOptimizerBase<T, TInput, TOutput>.LastComputedGradients

ShardedOptimizerBase<T, TInput, TOutput>.ApplyGradients(Vector<T>, IFullModel<T, TInput, TOutput>)

ShardedOptimizerBase<T, TInput, TOutput>.Reset()

ShardedOptimizerBase<T, TInput, TOutput>.Serialize()

ShardedOptimizerBase<T, TInput, TOutput>.Deserialize(byte[])

ShardedOptimizerBase<T, TInput, TOutput>.SaveModel(string)

ShardedOptimizerBase<T, TInput, TOutput>.LoadModel(string)

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Extension Methods: DistributedExtensions.AsDistributed<T, TInput, TOutput>(IOptimizer<T, TInput, TOutput>, ICommunicationBackend<T>)

DistributedExtensions.AsDistributed<T, TInput, TOutput>(IOptimizer<T, TInput, TOutput>, IShardingConfiguration<T>)

Remarks

Strategy Overview: A true ZeRO-2 implementation that uses ReduceScatter for gradient sharding. Each rank:

1. Computes local gradients on the full parameter set
2. ReduceScatter: reduces gradients and scatters them, so each rank receives one shard
3. Updates only its shard of parameters using its shard of gradients
4. AllGather: reconstructs the full parameters from the shards for the next forward pass
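The four steps above can be sketched without any communication backend by simulating every rank inside one process. This is only an illustration of the data flow, not the library's implementation: the class `Zero2StepSketch`, the method `Step`, the plain SGD update, and the choice to average (rather than sum) gradients during the reduction are all assumptions made for the example.

```csharp
using System;

public static class Zero2StepSketch
{
    // Simulates one ZeRO-2 style step for all ranks inside a single process.
    // localGrads[r] is the full-length gradient computed by rank r (step 1).
    public static double[] Step(double[] parameters, double[][] localGrads, double learningRate)
    {
        int worldSize = localGrads.Length;
        int n = parameters.Length;
        int shardSize = (n + worldSize - 1) / worldSize; // ceiling division

        var gathered = new double[n];
        for (int rank = 0; rank < worldSize; rank++)
        {
            // Index range of the shard owned by this rank.
            int start = Math.Min(rank * shardSize, n);
            int end = Math.Min(start + shardSize, n);

            for (int i = start; i < end; i++)
            {
                // Step 2 (ReduceScatter): the owning rank ends up with the
                // reduced gradient for its own shard only.
                // Assumption: the reduction averages over ranks.
                double grad = 0.0;
                for (int r = 0; r < worldSize; r++)
                {
                    grad += localGrads[r][i];
                }
                grad /= worldSize;

                // Step 3: update only this rank's shard (plain SGD as a stand-in
                // for the wrapped optimizer).
                gathered[i] = parameters[i] - learningRate * grad;
            }
        }

        // Step 4 (AllGather): writing every shard into one array stands in for
        // each rank receiving all updated shards.
        return gathered;
    }

    public static void Main()
    {
        var parameters = new double[] { 1.0, 2.0, 3.0, 4.0 };
        var localGrads = new[]
        {
            new[] { 0.5, 1.0, 1.5, 2.0 }, // gradient computed by rank 0
            new[] { 1.5, 1.0, 0.5, 0.0 }  // gradient computed by rank 1
        };

        double[] updated = Step(parameters, localGrads, 0.5);
        Console.WriteLine(string.Join(" ", updated));
    }
}
```

With these inputs the averaged gradient is 1.0 for every element, so the updated parameters are 0.5, 1.5, 2.5 and 3.5. Rank 0 owns the first two elements and rank 1 the last two; neither rank ever needs the reduced gradient for the other's shard.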

This saves memory by distributing gradient storage and parameter updates across ranks.

For Beginners: ZeRO-2 divides the work of storing gradients and optimizer state, and of updating parameters, across processes. Think of it like a team where each person is responsible for maintaining a specific section of a large document. Everyone reads the full document (forward pass), but each person only keeps the edit notes for, and updates, their assigned section (after the backward pass). Before the next iteration, they share their updated sections to reconstruct the full document.

Use Cases:

- Large models where gradient memory is significant (billions of parameters)
- You need memory savings beyond DDP (distributed data parallel)
- The network is fast enough for AllGather operations
- Works with ANY gradient-based optimizer (SGD, Adam, RMSprop, etc.)

Trade-offs:

- Memory: Very good. Gradients and optimizer states are sharded, so each rank holds 1/N of what DDP holds for them
- Communication: ReduceScatter + AllGather (vs. AllReduce for DDP)
- Synchronization: Perfect. All ranks reconstruct identical parameters
- Complexity: Moderate. Requires parameter sharding logic
- Best for: Large models with limited GPU memory

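Memory Savings vs DDP:

- DDP: Each rank stores full gradients + full optimizer state
- ZeRO-2: Each rank stores 1/N of the gradients + 1/N of the optimizer state, where N is the world size
- Per-rank gradient and optimizer-state memory therefore shrinks in proportion to 1/N as the world size grows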
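To make the comparison concrete, the following sketch estimates the per-rank memory for gradients plus optimizer state under both schemes. The byte counts are assumptions chosen for illustration (fp32 gradients at 4 bytes per parameter, and an Adam-style optimizer keeping two fp32 buffers, 8 bytes per parameter); the class and method names are hypothetical, and parameter memory itself is left out because it is not part of the comparison above.

```csharp
using System;

public static class Zero2MemorySketch
{
    // Estimated bytes per rank for gradients plus optimizer state.
    public static double GradAndStateBytesPerRank(long parameterCount, int worldSize, bool sharded)
    {
        const int gradBytesPerParam = 4;  // assumption: fp32 gradients
        const int stateBytesPerParam = 8; // assumption: two fp32 Adam moment buffers

        double total = (double)parameterCount * (gradBytesPerParam + stateBytesPerParam);

        // DDP keeps the full amount on every rank; ZeRO-2 keeps a 1/N share.
        return sharded ? total / worldSize : total;
    }

    public static void Main()
    {
        long parameterCount = 1_000_000_000; // hypothetical 1B-parameter model
        int worldSize = 8;

        double ddp = GradAndStateBytesPerRank(parameterCount, worldSize, sharded: false);
        double zero2 = GradAndStateBytesPerRank(parameterCount, worldSize, sharded: true);

        Console.WriteLine(ddp / 1e9);   // gigabytes per rank under DDP
        Console.WriteLine(zero2 / 1e9); // gigabytes per rank under ZeRO-2
    }
}
```

Under these assumptions a 1-billion-parameter model needs 12 GB per rank for gradients and optimizer state with DDP, and 1.5 GB per rank with ZeRO-2 on 8 ranks.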

Constructors

ZeRO2Optimizer(IOptimizer<T, TInput, TOutput>, IShardingConfiguration<T>)

Creates a ZeRO-2 optimizer that shards gradients and optimizer states.

public ZeRO2Optimizer(IOptimizer<T, TInput, TOutput> wrappedOptimizer, IShardingConfiguration<T> config)

Parameters

wrappedOptimizer IOptimizer<T, TInput, TOutput>: The base optimizer to wrap (any gradient-based optimizer: SGD, Adam, RMSprop, etc.)
config IShardingConfiguration<T>: Configuration for distributed training communication

Exceptions

ArgumentException: Thrown if the wrapped optimizer is not gradient-based

Methods

Deserialize(byte[])

Loads a previously serialized model from binary data.

public override void Deserialize(byte[] data)

Parameters

data byte[]: The byte array containing the serialized model data.

Remarks

This method takes binary data created by the Serialize method and uses it to restore a model to its previous state.

For Beginners: This is like opening a saved file to continue your work.

When you call this method:

You provide the binary data (bytes) that was previously created by Serialize
The model rebuilds itself using this data
After deserializing, the model is exactly as it was when serialized
It's ready to make predictions without needing to be trained again

For example:

You download a pre-trained model file for detecting spam emails
You deserialize this file into your application
Immediately, your application can detect spam without any training
The model has all the knowledge that was built into it by its original creator

This is particularly useful when:

You want to use a model that took days to train
You need to deploy the same model across multiple devices
You're creating an application that non-technical users will use

Think of it like installing the brain of a trained expert directly into your application.

Optimize(OptimizationInputData<T, TInput, TOutput>)

Performs the optimization process to find the best parameters for a model.

public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInputData<T, TInput, TOutput> inputData)

Parameters

inputData OptimizationInputData<T, TInput, TOutput>: The data needed for optimization, including the objective function, initial parameters, and any constraints.

Returns

OptimizationResult<T, TInput, TOutput>: The result of the optimization process, including the optimized parameters and performance metrics.

Remarks

This method takes input data and attempts to find the optimal parameters that minimize or maximize the objective function.

For Beginners: This is where the actual "learning" happens. The optimizer looks at your data and tries different parameter values to find the ones that make your model perform best.

The process typically involves:

Evaluating how well the current parameters perform
Calculating how to change the parameters to improve performance
Updating the parameters
Repeating until the model performs well enough or reaches a maximum number of attempts

Serialize()

Converts the current state of a machine learning model into a binary format.

public override byte[] Serialize()

Returns

byte[]: A byte array containing the serialized model data.

Remarks

This method captures all the essential information about a trained model and converts it into a sequence of bytes that can be stored or transmitted.

For Beginners: This is like exporting your work to a file.

When you call this method:

The model's current state (all its learned patterns and parameters) is captured
This information is converted into a compact binary format (bytes)
You can then save these bytes to a file, database, or send them over a network

For example:

After training a model to recognize cats vs. dogs in images
You can serialize the model to save all its learned knowledge
Later, you can use this saved data to recreate the model exactly as it was
The recreated model will make the same predictions as the original

Think of it like taking a snapshot of your model's brain at a specific moment in time.
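A minimal sketch of a file-based round trip built on the Serialize() and Deserialize(byte[]) signatures documented on this page. The helper class `CheckpointSketch` is hypothetical and takes delegates so that it stays self-contained; `File.WriteAllBytes` and `File.ReadAllBytes` are standard .NET calls. Whether each rank's serialized bytes contain only that rank's shard is not stated here, so treating the output as a per-rank file is an assumption.

```csharp
using System;
using System.IO;

public static class CheckpointSketch
{
    // Writes whatever the serializer returns to the given path.
    public static void Save(Func<byte[]> serialize, string path)
    {
        File.WriteAllBytes(path, serialize());
    }

    // Reads the bytes back and hands them to the deserializer.
    public static void Load(Action<byte[]> deserialize, string path)
    {
        deserialize(File.ReadAllBytes(path));
    }
}
```

With an existing optimizer instance, the calls would look like `CheckpointSketch.Save(optimizer.Serialize, path)` and `CheckpointSketch.Load(optimizer.Deserialize, path)`, where `optimizer` and `path` are supplied by the caller. The inherited SaveModel(string) and LoadModel(string) members listed above may already cover this case.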

SynchronizeOptimizerState()

Synchronizes optimizer state (like momentum buffers) across all processes.

public override void SynchronizeOptimizerState()

Remarks

For Beginners: Some optimizers (like Adam) keep track of past gradients to make smarter updates. This method makes sure all processes have the same optimizer state, so they stay coordinated. It's like making sure all team members are reading from the same playbook.
