Class SSLDistributedConfig
- Namespace
- AiDotNet.SelfSupervisedLearning
- Assembly
- AiDotNet.dll
Configuration for distributed SSL training using DDP (Distributed Data Parallel).
public class SSLDistributedConfig
- Inheritance
- object → SSLDistributedConfig
Remarks
For Beginners: This configuration enables training across multiple GPUs or machines. DDP (Distributed Data Parallel) is the industry-standard approach used by PyTorch, TensorFlow, and JAX for distributed training.
How DDP works for SSL:
- Each GPU/worker processes its own batch of data
- Each worker computes local gradients
- Gradients are averaged across all workers (AllReduce)
- All workers apply the same averaged gradients
- All workers now have identical parameters
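The averaging step above (AllReduce) can be sketched in a few lines. This is an illustrative simulation only; real DDP averages on-device tensors through the communication backend, not plain arrays.

```csharp
// Simulate the AllReduce averaging step: each worker's local gradients
// are summed and divided by the worker count, so every worker ends up
// applying the same averaged update.
double[][] localGrads =
{
    new[] { 1.0, 2.0 },   // gradients computed on worker 0
    new[] { 3.0, 4.0 },   // gradients computed on worker 1
};
int dim = localGrads[0].Length;
var averaged = new double[dim];
foreach (var g in localGrads)
    for (int i = 0; i < dim; i++)
        averaged[i] += g[i] / localGrads.Length;
// averaged is { 2.0, 3.0 } — identical on every worker after the reduce.
```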
SSL-specific benefits:
- Contrastive methods like SimCLR benefit from larger effective batch sizes
- DDP with 4 GPUs and batch_size=1024 gives effective batch size of 4096
- Memory bank methods like MoCo can share queue across workers
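A minimal configuration sketch for the 4-GPU scenario above, using only the properties documented on this page. The `NCCL` enum member is assumed from the Backend remarks below; the trainer wiring around this object is not shown here.

```csharp
using AiDotNet.SelfSupervisedLearning;

// Sketch: 4-GPU DDP configuration. Rank must differ per worker process.
var config = new SSLDistributedConfig
{
    Enabled = true,
    WorldSize = 4,                           // number of GPUs
    Rank = 0,                                // set to 0..3, one per worker
    Backend = SSLCommunicationBackend.NCCL,  // assumed member; see Backend remarks
    SyncBatchNorm = true,                    // sync batch statistics across workers
    SharedMemoryQueue = true                 // shared MoCo queue across workers
};
// With a per-worker batch_size of 1024, the effective batch size is
// 4 * 1024 = 4096.
```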
Properties
Backend
Gets or sets the communication backend type.
public SSLCommunicationBackend Backend { get; set; }
Property Value
- SSLCommunicationBackend
Remarks
Default: SSLCommunicationBackend.InMemory
Use NCCL for multi-GPU, MPI for multi-node training.
Enabled
Gets or sets whether distributed training is enabled.
public bool Enabled { get; set; }
Property Value
- bool
Remarks
Default: false
FindUnusedParameters
Gets or sets whether to detect parameters that receive no gradients during the backward pass (equivalent to PyTorch's find_unused_parameters option).
public bool FindUnusedParameters { get; set; }
Property Value
- bool
Remarks
Default: false
Enable if some parameters don't receive gradients (e.g., frozen layers).
GradientSyncFrequency
Gets or sets the gradient synchronization frequency.
public int GradientSyncFrequency { get; set; }
Property Value
- int
Remarks
Default: 1 (sync every step)
Set to higher values for gradient accumulation across workers.
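A sketch of how a higher sync frequency interacts with the world size. Only the documented properties are used here; the effective-batch arithmetic in the comments is the assumption being illustrated.

```csharp
using AiDotNet.SelfSupervisedLearning;

// GradientSyncFrequency = 4: each worker accumulates local gradients for
// 4 steps before the AllReduce, multiplying the effective batch size again.
var config = new SSLDistributedConfig
{
    Enabled = true,
    WorldSize = 4,
    GradientSyncFrequency = 4   // sync every 4 steps instead of every step
};
// Effective batch per synchronized update:
//   WorldSize * per-worker batch * GradientSyncFrequency
//   e.g. 4 * 256 * 4 = 4096 samples.
```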
Rank
Gets or sets the rank of this worker (0-indexed).
public int Rank { get; set; }
Property Value
- int
Remarks
Default: 0
Each worker must have a unique rank from 0 to WorldSize-1.
SharedMemoryQueue
Gets or sets whether all workers share the same memory queue (for MoCo).
public bool SharedMemoryQueue { get; set; }
Property Value
- bool
Remarks
Default: true
When true, the memory queue is synchronized across workers for MoCo methods.
SyncBatchNorm
Gets or sets whether to synchronize BatchNorm statistics across workers.
public bool SyncBatchNorm { get; set; }
Property Value
- bool
Remarks
Default: true
SyncBN is important for SSL methods where batch statistics affect training.
UseGradientCompression
Gets or sets whether to use gradient compression for communication.
public bool UseGradientCompression { get; set; }
Property Value
- bool
Remarks
Default: false
Gradient compression reduces communication overhead but may affect convergence.
WorldSize
Gets or sets the number of workers (GPUs/processes) for distributed training.
public int WorldSize { get; set; }
Property Value
- int
Remarks
Default: 1
Set to the number of GPUs available for training.
Methods
GetConfiguration()
Gets the configuration as a dictionary.
public IDictionary<string, object> GetConfiguration()
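A usage sketch for exporting the configuration, e.g. for experiment logging. The dictionary's key names are not documented on this page, so the output below is printed generically rather than read by key.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.SelfSupervisedLearning;

var config = new SSLDistributedConfig { Enabled = true, WorldSize = 2 };

// Snapshot the settings as key/value pairs (exact keys are library-defined).
IDictionary<string, object> snapshot = config.GetConfiguration();
foreach (var kv in snapshot)
    Console.WriteLine($"{kv.Key} = {kv.Value}");
```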