Namespace AiDotNet.DistributedTraining
Classes
- AsyncSGDOptimizer<T, TInput, TOutput>
Implements Asynchronous SGD optimizer - applies parameter updates asynchronously, without strict synchronization barriers.
- CommunicationBackendBase<T>
Provides base implementation for distributed communication backends.
- CommunicationManager
Central manager for distributed communication operations.
- DDPModel<T, TInput, TOutput>
Implements DDP (Distributed Data Parallel) model wrapper for distributed training.
- DDPOptimizer<T, TInput, TOutput>
Implements true DDP (Distributed Data Parallel) optimizer - industry-standard gradient averaging (see the sketch after this list).
- DistributedExtensions
Provides extension methods for easily enabling distributed training on models and optimizers.
- ElasticOptimizer<T, TInput, TOutput>
Implements Elastic optimizer - supports dynamic worker addition/removal during training.
- FSDPModel<T, TInput, TOutput>
Implements FSDP (Fully Sharded Data Parallel) model wrapper that shards parameters across multiple processes.
- FSDPOptimizer<T, TInput, TOutput>
Implements FSDP (Fully Sharded Data Parallel) optimizer wrapper that coordinates optimization across multiple processes.
- GlooCommunicationBackend<T>
Gloo-based communication backend for CPU collective operations.
- HybridShardedModel<T, TInput, TOutput>
Implements 3D Parallelism (Hybrid Sharded) model - combines data, tensor, and pipeline parallelism.
- HybridShardedOptimizer<T, TInput, TOutput>
Implements 3D Parallelism optimizer - coordinates across data, tensor, and pipeline dimensions.
- InMemoryCommunicationBackend<T>
Provides an in-memory implementation of distributed communication for testing and single-machine scenarios.
- LocalSGDOptimizer<T, TInput, TOutput>
Implements Local SGD distributed training optimizer - averages parameters across ranks after local optimization steps.
- MPICommunicationBackend<T>
MPI.NET-based communication backend for production distributed training.
- NCCLCommunicationBackend<T>
NVIDIA NCCL-based communication backend for GPU-to-GPU communication.
- ParameterAnalyzer<T>
Analyzes model parameters and creates optimized groupings for distributed communication.
- ParameterAnalyzer<T>.ParameterGroup
Represents a group of parameters that should be communicated together.
- PipelineParallelModel<T, TInput, TOutput>
Implements Pipeline Parallel model wrapper - splits model into stages across ranks.
- PipelineParallelOptimizer<T, TInput, TOutput>
Implements Pipeline Parallel optimizer - coordinates optimization across pipeline stages.
- ShardedModelBase<T, TInput, TOutput>
Provides base implementation for distributed models with parameter sharding.
- ShardedOptimizerBase<T, TInput, TOutput>
Provides base implementation for distributed optimizers with parameter sharding.
- ShardingConfiguration<T>
Default implementation of sharding configuration for distributed training.
- TensorParallelModel<T, TInput, TOutput>
Implements Tensor Parallel model wrapper - splits individual layers across ranks (Megatron-LM style).
- TensorParallelOptimizer<T, TInput, TOutput>
Implements Tensor Parallel optimizer - coordinates updates for tensor-parallel layers.
- ZeRO1Model<T, TInput, TOutput>
Implements ZeRO Stage 1 model wrapper - shards optimizer states only.
- ZeRO1Optimizer<T, TInput, TOutput>
Implements ZeRO Stage 1 optimizer - shards optimizer states only.
- ZeRO2Model<T, TInput, TOutput>
Implements ZeRO Stage 2 model wrapper - shards optimizer states and gradients.
- ZeRO2Optimizer<T, TInput, TOutput>
Implements ZeRO Stage 2 optimizer - shards gradients and optimizer states across ranks.
- ZeRO3Model<T, TInput, TOutput>
Implements ZeRO Stage 3 model wrapper - full sharding of parameters, gradients, and optimizer states.
- ZeRO3Optimizer<T, TInput, TOutput>
Implements ZeRO Stage 3 optimizer - fully shards parameters, gradients, and optimizer states, equivalent to FSDP.
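As a concrete illustration of the DDP classes above, the following self-contained sketch simulates one data-parallel training step for four ranks: each rank computes a local gradient on its own data shard, the gradients are averaged (the effect of an AllReduce followed by division by the world size), and every rank applies the identical update so the model replicas stay in sync. It deliberately uses no AiDotNet types; all values are made up for illustration, and the DDPModel/DDPOptimizer wrappers encapsulate this pattern over a real communication backend.

```csharp
// Self-contained sketch (plain C#, no AiDotNet calls) of the data flow a
// DDP-style optimizer implements: average gradients across ranks, then apply
// the same update on every rank so all model replicas stay identical.
using System;
using System.Linq;

class DdpStepSketch
{
    static void Main()
    {
        const int worldSize = 4;        // number of ranks (processes)
        const double learningRate = 0.1;
        double[] parameters = { 1.0, -2.0, 0.5 };   // replicated on every rank

        // Each rank produces its own local gradient from its data shard.
        // The values here are made up purely for illustration.
        double[][] localGradients =
        {
            new[] { 0.20, -0.10, 0.05 },   // rank 0
            new[] { 0.10,  0.00, 0.15 },   // rank 1
            new[] { 0.30, -0.20, 0.10 },   // rank 2
            new[] { 0.00,  0.10, 0.10 },   // rank 3
        };

        // "AllReduce with averaging": sum element-wise, divide by world size.
        double[] averagedGradient = new double[parameters.Length];
        for (int i = 0; i < parameters.Length; i++)
            averagedGradient[i] = localGradients.Sum(g => g[i]) / worldSize;

        // Every rank applies the identical SGD update, keeping replicas in sync.
        for (int i = 0; i < parameters.Length; i++)
            parameters[i] -= learningRate * averagedGradient[i];

        Console.WriteLine(string.Join(", ", parameters));
        // approximately: 0.985, -1.995, 0.49 (the same result on every rank)
    }
}
```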
Interfaces
- ICommunicationBackend<T>
Defines the contract for distributed communication backends.
- IShardedModel<T, TInput, TOutput>
Defines the contract for models that support distributed training with parameter sharding.
- IShardedOptimizer<T, TInput, TOutput>
Defines the contract for optimizers that support distributed training with parameter sharding.
- IShardingConfiguration<T>
Defines the configuration contract for parameter sharding in distributed training.
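The sharding contracts above (IShardedModel, IShardedOptimizer, IShardingConfiguration) revolve around assigning each rank a slice of a flattened parameter or state vector. The sketch below shows one common layout, contiguous block sharding with any remainder going to the lowest ranks; it is illustrative arithmetic only and is not taken from AiDotNet's actual configuration logic.

```csharp
// Illustrative block-sharding arithmetic (not AiDotNet's actual implementation):
// each rank owns a contiguous slice of the flattened vector, and the remainder
// when the count does not divide evenly goes to the lowest-numbered ranks.
using System;

class ShardLayoutSketch
{
    // Returns (offset, length) of the slice owned by `rank`.
    static (int Offset, int Length) ShardRange(int parameterCount, int worldSize, int rank)
    {
        int baseSize = parameterCount / worldSize;
        int remainder = parameterCount % worldSize;
        int length = baseSize + (rank < remainder ? 1 : 0);
        int offset = rank * baseSize + Math.Min(rank, remainder);
        return (offset, length);
    }

    static void Main()
    {
        const int parameterCount = 10;
        const int worldSize = 4;
        for (int rank = 0; rank < worldSize; rank++)
        {
            var (offset, length) = ShardRange(parameterCount, worldSize, rank);
            Console.WriteLine($"rank {rank}: parameters [{offset}, {offset + length})");
        }
        // rank 0: [0, 3)   rank 1: [3, 6)   rank 2: [6, 8)   rank 3: [8, 10)
    }
}
```

The ZeRO stages listed above differ in what gets laid out this way: Stage 1 shards optimizer states, Stage 2 additionally shards gradients, and Stage 3 (like FSDP) also shards the parameters themselves.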
Enums
- ReductionOperation
Defines the supported reduction operations for collective communication.
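To make the reduction operations concrete, the sketch below simulates what an AllReduce produces for a few common reductions (sum, average, max) over per-rank vectors; after the collective, every rank holds the same combined result. The actual member names of ReductionOperation are not listed above, so none are referenced here - the example is plain C#.

```csharp
// Illustration of collective reduction semantics (plain C#, no AiDotNet types):
// an AllReduce combines one vector per rank element-wise with the chosen
// reduction, and every rank receives the same combined result.
using System;
using System.Linq;

class ReductionSketch
{
    static double[] AllReduce(double[][] perRank, Func<double[], double> reduce)
    {
        int length = perRank[0].Length;
        return Enumerable.Range(0, length)
            .Select(i => reduce(perRank.Select(v => v[i]).ToArray()))
            .ToArray();
    }

    static void Main()
    {
        double[][] perRank =
        {
            new[] { 1.0, 4.0 },   // rank 0
            new[] { 2.0, 5.0 },   // rank 1
            new[] { 3.0, 6.0 },   // rank 2
        };

        Console.WriteLine(string.Join(", ", AllReduce(perRank, v => v.Sum())));      // 6, 15
        Console.WriteLine(string.Join(", ", AllReduce(perRank, v => v.Average())));  // 2, 5
        Console.WriteLine(string.Join(", ", AllReduce(perRank, v => v.Max())));      // 3, 6
    }
}
```

Gradient averaging, as used by the DDP classes above, is either an average reduction or a sum reduction followed by division by the world size.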