# Distributed Training Tutorial
{: .no_toc }

Scale your training across multiple GPUs and nodes.
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

- TOC
{:toc}

## Overview

AiDotNet provides 10+ distributed training strategies. The main families are:

- DDP (Distributed Data Parallel): replicate the model on every GPU and split the data across them
- FSDP (Fully Sharded Data Parallel): shard model parameters across GPUs
- ZeRO (Zero Redundancy Optimizer, stages 1/2/3): partition optimizer state, gradients, and parameters to save memory
- Pipeline parallelism: split the model across GPUs by layers
- Tensor parallelism: split individual operations across GPUs

## DDP (Distributed Data Parallel)

Best for: Models that fit on a single GPU

```csharp
using AiDotNet.DistributedTraining;

// Initialize process group
var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = 4 // 4 GPUs
};
using var context = DistributedContext.Initialize(config);

// Wrap model
var ddpModel = DDP.Wrap(model);

// Training loop (same as single GPU)
for (int epoch = 0; epoch < epochs; epoch++)
{
    foreach (var batch in dataLoader)
    {
        var output = ddpModel.Forward(batch.Input);
        var loss = lossFunction.Compute(output, batch.Target);
        ddpModel.Backward(loss);
        optimizer.Step();
        optimizer.ZeroGrad();
    }
}
```
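
DDP keeps a full copy of the model on every rank, so each rank should train on a different slice of the data. Below is a minimal sketch of one way to split batches by rank, assuming a hypothetical in-memory `allBatches` list; the LINQ-based split is illustrative, not an AiDotNet API.

```csharp
using System.Linq;

// Illustrative only: give each rank a disjoint subset of the batches.
// allBatches is a hypothetical in-memory list; a real setup would use
// whatever sharded data loader your pipeline provides.
var localBatches = allBatches
    .Where((batch, index) => index % config.WorldSize == context.Rank)
    .ToList();

foreach (var batch in localBatches)
{
    var output = ddpModel.Forward(batch.Input);
    var loss = lossFunction.Compute(output, batch.Target);
    ddpModel.Backward(loss); // DDP averages gradients across ranks here
    optimizer.Step();
    optimizer.ZeroGrad();
}
```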

### Multi-Node Setup

```bash
# Node 0 (master) - ranks 0-3
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0
dotnet run

# Node 1 - with 4 GPUs per node, this node hosts ranks 4-7
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=4
dotnet run
```

## FSDP (Fully Sharded Data Parallel)

Best for: Large models that don't fit on a single GPU

```csharp
using AiDotNet.DistributedTraining.FSDP;

var fsdpConfig = new FSDPConfig<float>
{
    ShardingStrategy = ShardingStrategy.FullShard,
    MixedPrecision = new FSDPMixedPrecisionConfig
    {
        Enabled = true,
        ParameterDtype = DataType.Float32,
        ReduceDtype = DataType.Float32,
        BufferDtype = DataType.BFloat16
    },
    ActivationCheckpointing = new ActivationCheckpointingConfig
    {
        Enabled = true,
        CheckpointInterval = 2
    }
};

// Wrap model with FSDP
var fsdpModel = FSDP<float>.Wrap(model, fsdpConfig);
// Memory is now sharded across GPUs!
```

### Memory Comparison

| Strategy | 7B Model Memory/GPU |
|---|---|
| Single GPU | 28+ GB (OOM) |
| DDP (4 GPU) | 28+ GB (OOM) |
| FSDP SHARD_GRAD_OP | ~14 GB |
| FSDP FULL_SHARD | ~8 GB |
| FSDP + Checkpointing | ~5 GB |
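
The table follows from simple arithmetic: 7 billion FP32 parameters take roughly 7e9 × 4 bytes ≈ 28 GB before gradients, optimizer state, and activations are counted, and full sharding divides those persistent tensors across the participating GPUs (assumed here to be the same 4 GPUs as the DDP row). A back-of-envelope estimate, plain arithmetic rather than an AiDotNet API:

```csharp
using System;

// Back-of-envelope memory estimate; plain arithmetic, not an AiDotNet API.
long parameters = 7_000_000_000;   // 7B-parameter model
const int bytesPerParam = 4;       // FP32 weights
int worldSize = 4;                 // GPUs assumed to participate in sharding

double weightsGb = parameters * (double)bytesPerParam / 1e9;  // ~28 GB unsharded
double shardedGb = weightsGb / worldSize;                     // ~7 GB of weights per GPU

Console.WriteLine($"Unsharded weights: {weightsGb:F0} GB");
Console.WriteLine($"FULL_SHARD weights per GPU: {shardedGb:F0} GB");
// Gradients, optimizer state, and activations add to this, which is why the
// table reports ~8 GB for FULL_SHARD rather than the weights-only ~7 GB.
```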

## ZeRO Optimization

DeepSpeed-style memory optimization:

```csharp
using AiDotNet.DistributedTraining.ZeRO;

// ZeRO Stage 1: Partition optimizer states
var zero1 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage1);

// ZeRO Stage 2: + Partition gradients
var zero2 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage2);

// ZeRO Stage 3: + Partition parameters
var zero3 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage3);
```

## Pipeline Parallelism

Split model across GPUs by layers:

```csharp
using AiDotNet.DistributedTraining.Pipeline;

var pipelineConfig = new PipelineConfig
{
    NumStages = 4,
    MicroBatchSize = 4,
    NumMicroBatches = 8
};

// Define pipeline stages (which layers go where)
var stages = new[]
{
    new PipelineStage(layers: model.Layers[..6], device: 0),
    new PipelineStage(layers: model.Layers[6..12], device: 1),
    new PipelineStage(layers: model.Layers[12..18], device: 2),
    new PipelineStage(layers: model.Layers[18..], device: 3)
};

var pipelineModel = Pipeline.Wrap(model, stages, pipelineConfig);
```
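
With this configuration each pipeline step consumes NumMicroBatches × MicroBatchSize = 8 × 4 = 32 samples, and the four stages overlap work on different micro-batches so no GPU sits idle for the whole step. A sketch of the training loop, assuming the wrapped model exposes the same Forward/Backward surface as the DDP example above (an assumption; the actual pipeline API may differ):

```csharp
// Sketch only: assumes pipelineModel exposes Forward/Backward like the
// DDP example above; the actual pipeline API may differ.
foreach (var batch in dataLoader)
{
    // Internally split into 8 micro-batches of 4 flowing through stages 0-3
    var output = pipelineModel.Forward(batch.Input);
    var loss = lossFunction.Compute(output, batch.Target);
    pipelineModel.Backward(loss);
    optimizer.Step();
    optimizer.ZeroGrad();
}
```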

## Using AiModelBuilder

```csharp
var result = await new AiModelBuilder<float, Tensor<float>, Tensor<float>>()
    .ConfigureModel(largeModel)
    .ConfigureOptimizer(new AdamWOptimizer<float>())
    .ConfigureDistributedTraining(new DistributedConfig
    {
        Strategy = DistributedStrategy.FSDP,
        WorldSize = 8,
        ShardingStrategy = ShardingStrategy.FullShard
    })
    .ConfigureGpuAcceleration(new GpuAccelerationConfig
    {
        Enabled = true,
        MixedPrecision = true
    })
    .BuildAsync(trainData, trainLabels);
```

## Cloud Training

### vast.ai Setup

```csharp
// Configure for vast.ai multi-GPU instances
var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = int.Parse(Environment.GetEnvironmentVariable("WORLD_SIZE")),
    Rank = int.Parse(Environment.GetEnvironmentVariable("RANK")),
    MasterAddress = Environment.GetEnvironmentVariable("MASTER_ADDR"),
    MasterPort = int.Parse(Environment.GetEnvironmentVariable("MASTER_PORT"))
};
```

### Azure ML

```yaml
# azure-ml-config.yml
compute:
  instance_type: Standard_NC24ads_A100_v4
  instance_count: 4
distributed:
  type: PyTorch  # Uses NCCL
```

## Checkpointing

### DDP Checkpointing

```csharp
// Save (only on rank 0)
if (context.Rank == 0)
{
    model.SaveCheckpoint("checkpoint.pt");
}
context.Barrier();

// Load (all ranks)
model.LoadCheckpoint("checkpoint.pt");
```

### FSDP Checkpointing

```csharp
// FSDP handles distributed state automatically
fsdpModel.SaveCheckpoint("fsdp_checkpoint.pt");

// Load
fsdpModel.LoadCheckpoint("fsdp_checkpoint.pt");
```

## Best Practices

- Start with DDP: switch to FSDP only if the model doesn't fit on a single GPU
- Use gradient accumulation to increase the effective batch size (see below)
- Enable mixed precision for faster training and lower memory use
- Monitor GPU utilization with `nvidia-smi` or DCGM
- Profile communication to make sure NCCL is performing well

### Gradient Accumulation

```csharp
int accumulationSteps = 4;
int effectiveBatchSize = batchSize * accumulationSteps * worldSize;

int step = 0;
foreach (var batch in dataLoader)
{
    // Scale the loss so the accumulated gradient matches one large batch
    var loss = ComputeLoss(batch) / accumulationSteps;
    loss.Backward();

    // Update weights only once every accumulationSteps micro-batches
    if (++step % accumulationSteps == 0)
    {
        optimizer.Step();
        optimizer.ZeroGrad();
    }
}
```

## Troubleshooting

### NCCL Timeout

```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_TIMEOUT=300000
```

### Memory Issues

- Reduce the batch size
- Enable gradient checkpointing
- Use FSDP with `ShardingStrategy.FullShard`

### Slow Communication

```bash
# Check if using NVLink
nvidia-smi topo -m

# Use NCCL_IB_DISABLE for cloud instances without InfiniBand
export NCCL_IB_DISABLE=1
```