Enum DistributedStrategy
Defines the distributed training strategy to use.
public enum DistributedStrategy
Fields
DDP = 1
Distributed Data Parallel (DDP) - The most common strategy. Parameters are replicated on each GPU and gradients are averaged across replicas.
Use when: Your model fits on one GPU but you want faster training.
Memory: Each GPU holds a full model copy.
Communication: Low (only gradients).
FSDP = 2
Fully Sharded Data Parallel (FSDP) - PyTorch-style full parameter sharding.
Use when: Your model is too large to fit on one GPU.
Memory: Excellent (parameters sharded).
Communication: High (AllGather for parameters).
Hybrid = 8
Hybrid - 3D parallelism combining data, tensor, and pipeline parallelism.
Use when: Training frontier models (100B+ parameters) on hundreds of GPUs.
Complexity: Very high; requires expert knowledge.
None = 0
No distributed training - Single-device training (default).
Use when: Training on a single GPU or CPU.
Memory: Full model on one device.
Communication: None.
PipelineParallel = 6
Pipeline Parallel - The model is split into stages across GPUs.
Use when: You have a very deep model with many layers.
Memory: Excellent for deep models.
TensorParallel = 7
Tensor Parallel - Individual layers are split across GPUs (Megatron-LM style).
Use when: You have very wide layers (large hidden dimensions).
Communication: High (requires a fast interconnect such as NVLink).
ZeRO1 = 3
ZeRO Stage 1 - Only optimizer states are sharded.
Use when: Optimizer state memory is the bottleneck (e.g., Adam).
Memory: Good (4-8x reduction in optimizer memory).
ZeRO2 = 4
ZeRO Stage 2 - Optimizer states and gradients are sharded.
Use when: Both optimizer and gradient memory are issues.
Memory: Very good (optimizer and gradient savings).
ZeRO3 = 5
ZeRO Stage 3 - Full sharding (equivalent to FSDP).
Use when: You prefer ZeRO terminology over FSDP.
Memory: Excellent (everything sharded).
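Taken together, the fields above correspond to the following declaration sketch. Names and numeric values are exactly as documented; the comments merely summarize each entry:

```csharp
// Sketch of the enum reconstructed from the documented field values.
public enum DistributedStrategy
{
    None = 0,             // Single-device training (default)
    DDP = 1,              // Distributed Data Parallel: replicate model, average gradients
    FSDP = 2,             // Fully Sharded Data Parallel: full parameter sharding
    ZeRO1 = 3,            // ZeRO Stage 1: optimizer states sharded
    ZeRO2 = 4,            // ZeRO Stage 2: optimizer states + gradients sharded
    ZeRO3 = 5,            // ZeRO Stage 3: full sharding (equivalent to FSDP)
    PipelineParallel = 6, // Model split into stages across GPUs
    TensorParallel = 7,   // Individual layers split across GPUs
    Hybrid = 8            // 3D parallelism: data + tensor + pipeline
}
```

Note that the numeric values roughly track increasing memory savings and implementation complexity from None through Hybrid.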
Remarks
For Beginners: These strategies determine how your model and data are split across multiple GPUs or machines. DDP (Distributed Data Parallel) is the most common and works for 90% of use cases.
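The "Use when" guidance above can be sketched as a simple decision helper. The `StrategyPicker` class, its method name, and its inputs are hypothetical illustrations, not part of the library API; the abbreviated enum redeclaration is included only so the sketch compiles standalone:

```csharp
using System;

// Abbreviated redeclaration so this sketch is self-contained.
public enum DistributedStrategy { None = 0, DDP = 1, FSDP = 2 }

// Hypothetical helper encoding the rules of thumb from the field
// descriptions: single device -> None; model fits on one GPU -> DDP;
// otherwise shard parameters with FSDP.
public static class StrategyPicker
{
    public static DistributedStrategy Pick(int gpuCount, bool modelFitsOnOneGpu)
    {
        if (gpuCount <= 1)
            return DistributedStrategy.None;  // no distribution needed
        if (modelFitsOnOneGpu)
            return DistributedStrategy.DDP;   // replicate model, average gradients
        return DistributedStrategy.FSDP;      // too large for one GPU: shard parameters
    }
}
```

Real workloads may also weigh interconnect speed and optimizer memory (see the ZeRO stages), but this captures the first-order choice most users face.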