Class SSLDistributedConfig

Namespace
AiDotNet.SelfSupervisedLearning
Assembly
AiDotNet.dll

Configuration for distributed SSL training using DDP (Distributed Data Parallel).

public class SSLDistributedConfig
Inheritance
object → SSLDistributedConfig

Remarks

For Beginners: This configuration enables training across multiple GPUs or machines. DDP (Distributed Data Parallel) is the industry-standard data-parallel approach popularized by PyTorch and implemented, under different names, by TensorFlow and JAX.

How DDP works for SSL (a runnable sketch follows the list):

  1. Each GPU/worker processes its own batch of data
  2. Each worker computes local gradients
  3. Gradients are averaged across all workers (AllReduce)
  4. All workers apply the same averaged gradients
  5. All workers now have identical parameters
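The averaging step (3) can be shown in isolation. The following is a minimal, self-contained sketch, not an AiDotNet API; the per-worker gradient values are fabricated for the example.

using System;

// Minimal in-process illustration of steps 2-5 above.
class DdpAveragingSketch
{
    static void Main()
    {
        int worldSize = 4;

        // Step 2: each worker computes local gradients from its own batch
        // (values fabricated for illustration).
        double[][] localGradients =
        {
            new[] { 0.10, -0.20 },
            new[] { 0.30,  0.00 },
            new[] { 0.20, -0.10 },
            new[] { 0.40,  0.10 }
        };

        // Step 3: AllReduce is an element-wise sum across workers followed
        // by division by worldSize, so every worker holds the same average.
        double[] averaged = new double[2];
        foreach (double[] gradients in localGradients)
            for (int i = 0; i < gradients.Length; i++)
                averaged[i] += gradients[i] / worldSize;

        // Steps 4-5: applying the identical averaged gradient on every
        // worker keeps all parameter replicas in sync.
        Console.WriteLine(string.Join(", ", averaged)); // 0.25, -0.05
    }
}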

SSL-specific benefits (a configuration sketch follows the list):

  • Contrastive methods like SimCLR benefit from larger effective batch sizes
  • DDP with 4 GPUs and a per-GPU batch size of 1024 gives an effective batch size of 4096
  • Memory bank methods like MoCo can share the queue across workers
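The effective batch size arithmetic can be expressed directly against this class. Enabled and WorldSize are documented properties; perGpuBatchSize is an illustrative local variable, not part of the configuration.

using AiDotNet.SelfSupervisedLearning;

var config = new SSLDistributedConfig
{
    Enabled = true,   // turn distributed training on (default: false)
    WorldSize = 4     // one worker per GPU
};

int perGpuBatchSize = 1024;                                  // illustrative only
int effectiveBatchSize = perGpuBatchSize * config.WorldSize; // 1024 * 4 = 4096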

Properties

Backend

Gets or sets the communication backend type.

public SSLCommunicationBackend Backend { get; set; }

Property Value

SSLCommunicationBackend

Remarks

Default: SSLCommunicationBackend.InMemory

Use NCCL for multi-GPU, MPI for multi-node training.
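A hedged sketch of choosing a backend, reusing the config instance from the earlier sketch. InMemory is the documented default; the exact NCCL and MPI member spellings on SSLCommunicationBackend are assumed from the remark above.

bool multiNode = false; // illustrative flag: true when spanning machines

// NCCL/MPI member names are assumptions based on the remark above.
config.Backend = multiNode
    ? SSLCommunicationBackend.MPI
    : SSLCommunicationBackend.NCCL;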

Enabled

Gets or sets whether distributed training is enabled.

public bool Enabled { get; set; }

Property Value

bool

Remarks

Default: false

FindUnusedParameters

Gets or sets whether to use find_unused_parameters behavior.

public bool FindUnusedParameters { get; set; }

Property Value

bool

Remarks

Default: false

Enable if some parameters don't receive gradients (e.g., frozen layers).
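For example, a sketch reusing the config instance from the earlier sketches:

// A frozen encoder contributes no gradients; enabling this keeps
// synchronization from waiting on parameters that never produce any.
config.FindUnusedParameters = true;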

GradientSyncFrequency

Gets or sets the gradient synchronization frequency.

public int GradientSyncFrequency { get; set; }

Property Value

int

Remarks

Default: 1 (sync every step)

Set to higher values for gradient accumulation across workers.
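A sketch of the accumulation arithmetic, reusing the config instance from the earlier sketches:

// Synchronize every 4 steps instead of every step. With 4 workers and a
// per-GPU batch of 1024, each synchronized update then reflects
// 4 workers * 4 steps * 1024 samples = 16,384 samples.
config.GradientSyncFrequency = 4;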

Rank

Gets or sets the rank of this worker (0-indexed).

public int Rank { get; set; }

Property Value

int

Remarks

Default: 0

Each worker must have a unique rank from 0 to WorldSize-1.
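A sketch of per-worker rank assignment; reading a RANK environment variable (System.Environment) is an assumption about the process launcher, not an AiDotNet requirement.

// Any scheme that assigns each worker a unique value in 0..WorldSize-1
// works; a RANK environment variable is one common convention.
config.WorldSize = 4;
config.Rank = int.Parse(Environment.GetEnvironmentVariable("RANK") ?? "0");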

SharedMemoryQueue

Gets or sets whether all workers share the same memory queue (for MoCo).

public bool SharedMemoryQueue { get; set; }

Property Value

bool

Remarks

Default: true

When true, the memory queue is synchronized across workers for MoCo methods.

SyncBatchNorm

Gets or sets whether to synchronize BatchNorm statistics across workers.

public bool SyncBatchNorm { get; set; }

Property Value

bool

Remarks

Default: true

Synchronized BatchNorm computes normalization statistics over the global batch instead of each worker's local batch, which matters for SSL methods (such as SimCLR) whose training is sensitive to batch statistics.

UseGradientCompression

Gets or sets whether to use gradient compression for communication.

public bool UseGradientCompression { get; set; }

Property Value

bool

Remarks

Default: false

Gradient compression reduces communication overhead but may affect convergence.

WorldSize

Gets or sets the number of workers (GPUs/processes) for distributed training.

public int WorldSize { get; set; }

Property Value

int

Remarks

Default: 1

Set to the number of GPUs available for training.

Methods

GetConfiguration()

Gets the configuration as a dictionary.

public IDictionary<string, object> GetConfiguration()

Returns

IDictionary<string, object>
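
A sketch of dumping the settings for logging, reusing the config instance from the earlier sketches; the dictionary's key names are not documented here, so the loop stays generic.

foreach (var entry in config.GetConfiguration())
    Console.WriteLine($"{entry.Key} = {entry.Value}");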