
Enum DistributedStrategy

Namespace: AiDotNet.Enums
Assembly: AiDotNet.dll

Defines the distributed training strategy to use.

public enum DistributedStrategy
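
For reference, the members and their numeric values are reconstructed below from the Fields section; XML documentation comments and the members' source ordering may differ.

public enum DistributedStrategy
{
    None = 0,             // Single-device training (default).
    DDP = 1,              // Distributed Data Parallel.
    FSDP = 2,             // Fully Sharded Data Parallel.
    ZeRO1 = 3,            // Optimizer states sharded.
    ZeRO2 = 4,            // Optimizer states + gradients sharded.
    ZeRO3 = 5,            // Full sharding (equivalent to FSDP).
    PipelineParallel = 6, // Model split into stages across GPUs.
    TensorParallel = 7,   // Individual layers split across GPUs.
    Hybrid = 8            // 3D parallelism: data + tensor + pipeline.
}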

Fields

DDP = 1

Distributed Data Parallel (DDP) - The most common strategy. Parameters are replicated on each GPU, and gradients are averaged across replicas after each backward pass.

Use when: Your model fits on one GPU but you want faster training.

Memory: Each GPU holds a full copy of the model.

Communication: Low (only gradients).
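
A minimal sketch of the idea (not AiDotNet code): each replica computes gradients on its own data shard, then the gradients are averaged (an AllReduce collective in practice) so every replica applies the same update.

// Toy illustration of DDP's gradient averaging; real training uses an AllReduce collective.
double[][] replicaGradients =
{
    new[] { 0.10, -0.20, 0.05 },   // gradients computed on GPU 0's data shard
    new[] { 0.14, -0.18, 0.01 },   // gradients computed on GPU 1's data shard
};

double[] averaged = new double[replicaGradients[0].Length];
foreach (double[] grads in replicaGradients)
    for (int i = 0; i < averaged.Length; i++)
        averaged[i] += grads[i] / replicaGradients.Length;

// Every replica now applies 'averaged' to its own full copy of the parameters,
// so all copies stay identical after the update.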

FSDP = 2

Fully Sharded Data Parallel (FSDP) - PyTorch-style full parameter sharding.

Use when: Your model is too large to fit on one GPU.

Memory: Excellent (parameters, gradients, and optimizer states are sharded).

Communication: High (AllGather for parameters).
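
A minimal sketch of the sharding idea (not AiDotNet code): each device persistently stores only its slice of the flat parameter vector, and the full vector is rebuilt (an AllGather in practice) only while it is needed.

using System.Linq;

// Toy illustration of parameter sharding across two devices.
double[] fullParams = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 };
int devices = 2;
int shardSize = fullParams.Length / devices;

// Each device keeps only its shard, cutting per-device parameter memory.
double[][] shards = Enumerable.Range(0, devices)
    .Select(d => fullParams.Skip(d * shardSize).Take(shardSize).ToArray())
    .ToArray();

// Before a layer runs, the shards are gathered back into the full parameter vector
// (and can be released again afterwards).
double[] gathered = shards.SelectMany(s => s).ToArray();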

Hybrid = 8

Hybrid - 3D parallelism combining data, tensor, and pipeline parallelism.

Use when: Training frontier models (100B+ parameters) on hundreds of GPUs.

Complexity: Very high; requires expert knowledge.

None = 0

No distributed training - Single-device training (default).

Use when: Training on a single GPU or CPU.

Memory: Full model on one device.

Communication: None.

PipelineParallel = 6

Pipeline Parallel - Model split into stages across GPUs.

Use when: You have a very deep model with many layers.

Memory: Excellent for deep models.

TensorParallel = 7

Tensor Parallel - Individual layers split across GPUs (Megatron-LM style).

Use when: You have very wide layers (large hidden dimensions).

Communication: High (requires fast interconnect like NVLink).
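
A minimal sketch of the idea (not AiDotNet code): the weight matrix of one linear layer is split column-wise across two devices, each device computes its slice of the output from the same input, and the slices are concatenated (an AllGather in practice).

using System.Linq;

// Toy illustration of column-wise tensor parallelism for a single linear layer.
double[] MatVec(double[] x, double[,] w)
{
    var y = new double[w.GetLength(1)];
    for (int j = 0; j < w.GetLength(1); j++)
        for (int i = 0; i < x.Length; i++)
            y[j] += x[i] * w[i, j];
    return y;
}

double[] input = { 1.0, 2.0 };  // hidden size 2

// A 2x4 weight matrix split column-wise: device 0 owns output columns 0-1, device 1 owns columns 2-3.
double[,] device0Columns = { { 0.1, 0.2 }, { 0.3, 0.4 } };
double[,] device1Columns = { { 0.5, 0.6 }, { 0.7, 0.8 } };

// Each device computes its output slice independently.
double[] slice0 = MatVec(input, device0Columns);
double[] slice1 = MatVec(input, device1Columns);

// Concatenating the slices reconstructs the full layer output.
double[] fullOutput = slice0.Concat(slice1).ToArray();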

ZeRO1 = 3

ZeRO Stage 1 - Only optimizer states are sharded.

Use when: Optimizer state memory is the bottleneck (e.g., Adam).

Memory: Good (4-8x reduction in optimizer memory).

ZeRO2 = 4

ZeRO Stage 2 - Optimizer states + gradients are sharded.

Use when: Both optimizer and gradient memory are issues.

Memory: Very good (optimizer + gradient savings).

ZeRO3 = 5

ZeRO Stage 3 - Full sharding (equivalent to FSDP).

Use when: You need full sharding but prefer ZeRO terminology over FSDP.

Memory: Excellent (everything sharded).
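
The per-field guidance above can be condensed into a simple decision sketch. The helper below and its inputs are illustrative only and not part of AiDotNet; PipelineParallel and TensorParallel are noted in a comment because they depend on model shape rather than size alone.

// Illustrative summary of the "Use when" guidance; ChooseStrategy is a hypothetical
// helper, not an AiDotNet API.
static DistributedStrategy ChooseStrategy(
    bool multipleDevices,      // more than one GPU available?
    bool modelFitsOnOneGpu,    // full model fits on a single GPU?
    bool optimizerMemoryTight, // optimizer states (e.g., Adam) are the bottleneck
    bool gradientMemoryTight,  // gradient memory is also a problem
    bool frontierScale)        // 100B+ parameters on hundreds of GPUs
{
    if (!multipleDevices) return DistributedStrategy.None;
    if (frontierScale) return DistributedStrategy.Hybrid;
    if (!modelFitsOnOneGpu) return DistributedStrategy.FSDP;  // or ZeRO3
    if (optimizerMemoryTight && gradientMemoryTight) return DistributedStrategy.ZeRO2;
    if (optimizerMemoryTight) return DistributedStrategy.ZeRO1;
    // Very deep models may instead favor PipelineParallel; very wide layers, TensorParallel.
    return DistributedStrategy.DDP;  // most common case
}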

Remarks

For Beginners: These strategies determine how your model and data are split across multiple GPUs or machines. DDP (Distributed Data Parallel) is the most common and works for 90% of use cases.