Class DDPModel<T, TInput, TOutput>

Namespace
AiDotNet.DistributedTraining
Assembly
AiDotNet.dll

Implements a DDP (Distributed Data Parallel) model wrapper for distributed training.

public class DDPModel<T, TInput, TOutput> : ShardedModelBase<T, TInput, TOutput>, IShardedModel<T, TInput, TOutput>, IFullModel<T, TInput, TOutput>, IModel<TInput, TOutput, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, TInput, TOutput>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, TInput, TOutput>>, IGradientComputable<T, TInput, TOutput>, IJitCompilable<T>

Type Parameters

T

The numeric type

TInput

The input type for the model

TOutput

The output type for the model

Inheritance
ShardedModelBase<T, TInput, TOutput>
DDPModel<T, TInput, TOutput>
Implements
IShardedModel<T, TInput, TOutput>
IFullModel<T, TInput, TOutput>
IModel<TInput, TOutput, ModelMetadata<T>>
IParameterizable<T, TInput, TOutput>
ICloneable<IFullModel<T, TInput, TOutput>>
IGradientComputable<T, TInput, TOutput>

Remarks

Strategy Overview: DDP (Distributed Data Parallel) is the most common and straightforward distributed training strategy. Each process maintains a full replica of the model. During training, gradients are synchronized across all processes using AllReduce, so every replica applies the same update and all replicas stay identical. This is the same approach as PyTorch's DistributedDataParallel.

For Beginners: This class implements DDP (Distributed Data Parallel), the simplest and most popular way to train models across multiple GPUs or machines. Unlike FSDP, which shards parameters, DDP keeps a complete copy of the model on each process. It automatically handles:

  • Keeping full model parameters on each process (no sharding)
  • Averaging gradients across all processes after the backward pass
  • Ensuring all model replicas stay synchronized

Think of it like multiple chefs each making the full recipe. After each step, they compare notes and average their learnings, so everyone stays on the same page. This is simpler than FSDP where each person only knows part of the recipe.

Use Cases:

  • Standard multi-GPU training where the model fits in single GPU memory
  • When communication is fast (NVLink, InfiniBand)
  • Simpler debugging than FSDP (full model on each process)
  • Default choice for most distributed training scenarios

Trade-offs:

  • Memory: Moderate - each process stores the full model (parameters replicated)
  • Communication: Low - only gradients are synchronized (AllReduce after backward)
  • Complexity: Low - simplest distributed strategy
  • Best for: Models that fit in single GPU memory, fast interconnects
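The memory trade-off above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of AiDotNet: the class and method names are hypothetical, and it counts parameter storage only (gradients, optimizer state and activations are ignored).

```csharp
using System;

// Hypothetical helper for illustration only; not an AiDotNet type.
public static class ReplicaMemorySketch
{
    // Bytes needed to hold the parameters on ONE process when every
    // process keeps the full model (the replicated DDP layout).
    public static long ReplicatedBytesPerProcess(long parameterCount, int bytesPerParameter)
        => parameterCount * bytesPerParameter;

    // Bytes per process if the same parameters were split evenly across
    // worldSize processes (the sharded layout DDP is contrasted with).
    // The division rounds up so an uneven last shard is still covered.
    public static long ShardedBytesPerProcess(long parameterCount, int bytesPerParameter, int worldSize)
    {
        if (worldSize <= 0) throw new ArgumentOutOfRangeException(nameof(worldSize));
        long parametersPerShard = (parameterCount + worldSize - 1) / worldSize;
        return parametersPerShard * bytesPerParameter;
    }
}
```

For one million double-precision parameters on four processes, the replicated layout holds 8,000,000 bytes of parameters per process, while an even split would hold 2,000,000 bytes per process.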

Example:

// Original model
var model = new NeuralNetworkModel<double>(...);

// Wrap it for DDP distributed training
var backend = new InMemoryCommunicationBackend<double>(rank: 0, worldSize: 4);
var config = new ShardingConfiguration<double>(backend);
var ddpModel = new DDPModel<double, Tensor<double>, Tensor<double>>(model, config);

// Now train as usual - DDP magic happens automatically!
ddpModel.Train(inputs, outputs);

Constructors

DDPModel(IFullModel<T, TInput, TOutput>, IShardingConfiguration<T>)

Creates a new DDP model wrapping an existing model.

public DDPModel(IFullModel<T, TInput, TOutput> wrappedModel, IShardingConfiguration<T> config)

Parameters

wrappedModel IFullModel<T, TInput, TOutput>

The model to wrap with DDP capabilities

config IShardingConfiguration<T>

Configuration for sharding and communication

Remarks

For Beginners: This constructor takes your existing model and makes it distributed using the DDP strategy. You provide:

  1. The model you want to make distributed
  2. A configuration that tells us how to do the distribution

The constructor automatically:

  • Ensures each process has a full copy of the model
  • Sets up communication channels for gradient synchronization
  • Prepares everything for DDP distributed training

Exceptions

ArgumentNullException

Thrown if wrappedModel or config is null.

Methods

Clone()

Creates a shallow copy of this object.

public override IFullModel<T, TInput, TOutput> Clone()

Returns

IFullModel<T, TInput, TOutput>

Deserialize(byte[])

Loads a previously serialized model from binary data.

public override void Deserialize(byte[] data)

Parameters

data byte[]

The byte array containing the serialized model data.

Remarks

This method takes binary data created by the Serialize method and uses it to restore a model to its previous state.

For Beginners: This is like opening a saved file to continue your work.

When you call this method:

  • You provide the binary data (bytes) that was previously created by Serialize
  • The model rebuilds itself using this data
  • After deserializing, the model is exactly as it was when serialized
  • It's ready to make predictions without needing to be trained again

For example:

  • You download a pre-trained model file for detecting spam emails
  • You deserialize this file into your application
  • Immediately, your application can detect spam without any training
  • The model has all the knowledge that was built into it by its original creator

This is particularly useful when:

  • You want to use a model that took days to train
  • You need to deploy the same model across multiple devices
  • You're creating an application that non-technical users will use

Think of it like installing the brain of a trained expert directly into your application.

GetModelMetadata()

Retrieves metadata and performance metrics about the trained model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

An object containing metadata and performance metrics about the trained model.

Remarks

This method provides information about the model's structure, parameters, and performance metrics.

For Beginners: Model metadata is like a report card for your machine learning model.

Just as a report card shows how well a student is performing in different subjects, model metadata shows how well your model is performing and provides details about its structure.

This information typically includes:

  • Accuracy measures: How well do the model's predictions match actual values?
  • Error metrics: How far off are the model's predictions on average?
  • Model parameters: What patterns did the model learn from the data?
  • Training information: How long did training take? How many iterations were needed?

For example, in a house price prediction model, metadata might include:

  • Average prediction error (e.g., off by $15,000 on average)
  • How strongly each feature (bedrooms, location) influences the prediction
  • How well the model fits the training data

This information helps you understand your model's strengths and weaknesses, and decide if it's ready to use or needs more training.

InitializeSharding()

Initializes DDP - no actual parameter sharding, each process keeps full parameters.

protected override void InitializeSharding()

Remarks

For Beginners: Unlike FSDP which splits parameters, DDP keeps the full model on each process. This method sets up the local shard to actually be the full parameter set.
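The difference between the two layouts can be sketched with plain arrays. This is an illustration under assumptions, not the library's implementation: the class and method names are hypothetical, and the contiguous slicing scheme is just one simple way a sharded strategy could divide parameters.

```csharp
using System;

// Hypothetical helper for illustration only; not an AiDotNet type.
public static class ShardLayoutSketch
{
    // DDP-style: the "local shard" is simply a copy of every parameter.
    public static double[] FullReplicaShard(double[] allParameters)
        => (double[])allParameters.Clone();

    // Sharded-style, for comparison: each rank keeps one contiguous slice.
    public static double[] SlicedShard(double[] allParameters, int rank, int worldSize)
    {
        if (worldSize <= 0) throw new ArgumentOutOfRangeException(nameof(worldSize));
        if (rank < 0 || rank >= worldSize) throw new ArgumentOutOfRangeException(nameof(rank));

        // Round the chunk size up so the slices cover every parameter.
        int chunk = (allParameters.Length + worldSize - 1) / worldSize;
        int start = Math.Min(rank * chunk, allParameters.Length);
        int length = Math.Min(chunk, allParameters.Length - start);

        var shard = new double[length];
        Array.Copy(allParameters, start, shard, 0, length);
        return shard;
    }
}
```

With five parameters and two processes, FullReplicaShard returns all five values on every rank, whereas SlicedShard gives rank 0 the first three values and rank 1 the remaining two.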

LoadModel(string)

Loads the model from a file.

public override void LoadModel(string filePath)

Parameters

filePath string

The path to the file containing the saved model.

Remarks

This method provides a convenient way to load a model directly from disk. It combines file I/O operations with deserialization.

For Beginners: This is like clicking "Open" in a document editor. Instead of manually reading from a file and then calling Deserialize(), this method does both steps for you.

Exceptions

FileNotFoundException

Thrown when the specified file does not exist.

IOException

Thrown when an I/O error occurs while reading from the file or when the file contains corrupted or invalid model data.

Predict(TInput)

Uses the trained model to make predictions for new input data.

public override TOutput Predict(TInput input)

Parameters

input TInput

The input data to predict on, for example a matrix where each row represents a new example and each column represents a feature.

Returns

TOutput

The predicted output, for example a vector containing the predicted value for each input example.

Remarks

After training, this method applies the learned patterns to new data to predict outcomes.

For Beginners: Prediction is when the model uses what it learned to make educated guesses about new information.

Continuing the fruit identification example from Train(TInput, TOutput):

  • After learning from many examples, the child (model) can now identify new fruits they haven't seen before
  • They look at the color, shape, and size to make their best guess

In machine learning:

  • You give the model new data it hasn't seen during training
  • The model applies the patterns it learned to make predictions
  • The output is the model's best estimate based on its training

For example, in a house price prediction model:

  • You provide features of a new house (square footage, bedrooms, location)
  • The model predicts what price that house might sell for

This method is used after training is complete, when you want to apply your model to real-world data.

SaveModel(string)

Saves the model to a file.

public override void SaveModel(string filePath)

Parameters

filePath string

The path where the model should be saved.

Remarks

This method provides a convenient way to save the model directly to disk. It combines serialization with file I/O operations.

For Beginners: This is like clicking "Save As" in a document editor. Instead of manually calling Serialize() and then writing to a file, this method does both steps for you.

Exceptions

IOException

Thrown when an I/O error occurs while writing to the file.

UnauthorizedAccessException

Thrown when the caller does not have the required permission to write to the specified file path.
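The exceptions listed for SaveModel and LoadModel are the standard .NET file I/O exceptions, so a caller can handle them the same way as for any file operation. The sketch below works on raw bytes with System.IO.File rather than on a model; the class and method names are hypothetical, and it does not claim to reflect how the library writes its files.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not an AiDotNet type.
public static class ModelFileSketch
{
    // Writes already-serialized bytes to disk. Returns false with a message
    // for the two failure kinds documented for SaveModel.
    public static bool TrySaveBytes(string filePath, byte[] serializedModel, out string error)
    {
        try
        {
            File.WriteAllBytes(filePath, serializedModel);
            error = string.Empty;
            return true;
        }
        catch (IOException ex)
        {
            error = "I/O error: " + ex.Message;
            return false;
        }
        catch (UnauthorizedAccessException ex)
        {
            error = "Permission denied: " + ex.Message;
            return false;
        }
    }

    // Reads the bytes back. File.ReadAllBytes throws FileNotFoundException
    // when the file does not exist.
    public static byte[] LoadBytes(string filePath) => File.ReadAllBytes(filePath);
}
```

Returning a flag from the save path and letting the load path throw is only one possible design; a caller that wants uniform handling could wrap LoadBytes in the same try/catch pattern.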

Serialize()

Converts the current state of a machine learning model into a binary format.

public override byte[] Serialize()

Returns

byte[]

A byte array containing the serialized model data.

Remarks

This method captures all the essential information about a trained model and converts it into a sequence of bytes that can be stored or transmitted.

For Beginners: This is like exporting your work to a file.

When you call this method:

  • The model's current state (all its learned patterns and parameters) is captured
  • This information is converted into a compact binary format (bytes)
  • You can then save these bytes to a file, database, or send them over a network

For example:

  • After training a model to recognize cats vs. dogs in images
  • You can serialize the model to save all its learned knowledge
  • Later, you can use this saved data to recreate the model exactly as it was
  • The recreated model will make the same predictions as the original

Think of it like taking a snapshot of your model's brain at a specific moment in time.
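To make the idea of "state as bytes" concrete, here is a minimal round-trip sketch for a plain parameter array. It assumes a simple made-up layout (an element count followed by the values) and hypothetical names; the actual binary format produced by Serialize is not documented here and may differ.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not an AiDotNet type.
public static class ParameterBytesSketch
{
    // Encodes the array as a 4-byte count followed by 8 bytes per value.
    public static byte[] ToBytes(double[] parameters)
    {
        using var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(parameters.Length);
            foreach (double value in parameters)
            {
                writer.Write(value);
            }
        }
        // ToArray is still valid after the writer has closed the stream.
        return stream.ToArray();
    }

    // Decodes bytes produced by ToBytes back into an equal array.
    public static double[] FromBytes(byte[] data)
    {
        using var stream = new MemoryStream(data);
        using var reader = new BinaryReader(stream);
        int count = reader.ReadInt32();
        var parameters = new double[count];
        for (int i = 0; i < count; i++)
        {
            parameters[i] = reader.ReadDouble();
        }
        return parameters;
    }
}
```

Calling FromBytes(ToBytes(values)) yields an array equal to values, which mirrors the Serialize and Deserialize contract described above: the restored state matches the state that was captured.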

SynchronizeGradients()

Synchronizes gradients across all processes using AllReduce.

public override void SynchronizeGradients()

Remarks

For Beginners: After training on local data, each process has computed gradients based on its batch. This method averages those gradients across all processes so everyone has the same update. This is the core of DDP - gradient averaging via AllReduce.
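The averaging step can be simulated in a single process. The sketch below is an assumption-laden illustration, not the library's code: the names are hypothetical, each inner array stands in for the gradient computed by one rank, and no real communication backend is involved.

```csharp
using System;

// Hypothetical helper for illustration only; not an AiDotNet type.
public static class GradientAveragingSketch
{
    // localGradients[r] is the gradient vector computed by rank r.
    // Returns one array per rank, each holding the same element-wise mean,
    // which is what AllReduce with averaging leaves on every process.
    public static double[][] AllReduceAverage(double[][] localGradients)
    {
        if (localGradients == null || localGradients.Length == 0)
            throw new ArgumentException("At least one rank is required.", nameof(localGradients));

        int worldSize = localGradients.Length;
        int length = localGradients[0].Length;
        var mean = new double[length];

        // Sum the contributions from every rank.
        foreach (double[] gradient in localGradients)
        {
            if (gradient.Length != length)
                throw new ArgumentException("All ranks must have gradients of equal length.", nameof(localGradients));
            for (int i = 0; i < length; i++)
            {
                mean[i] += gradient[i];
            }
        }

        // Divide by the number of ranks to turn the sum into an average.
        for (int i = 0; i < length; i++)
        {
            mean[i] /= worldSize;
        }

        // Hand every rank its own copy of the averaged gradient.
        var result = new double[worldSize][];
        for (int r = 0; r < worldSize; r++)
        {
            result[r] = (double[])mean.Clone();
        }
        return result;
    }
}
```

With two ranks holding gradients (1, 2) and (3, 6), both ranks end up with (2, 4). Because every replica then applies the same update to the same starting parameters, the replicas remain identical after the step.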

Train(TInput, TOutput)

Trains the model using input features and their corresponding target values.

public override void Train(TInput input, TOutput expectedOutput)

Parameters

input TInput

The input features to train on.

expectedOutput TOutput

The target values that correspond to the input.

Remarks

This method takes training data and adjusts the model's internal parameters to learn patterns in the data.

For Beginners: Training is like teaching the model by showing it examples.

Imagine teaching a child to identify fruits:

  • You show them many examples of apples, oranges, and bananas (the input features, input)
  • You tell them the correct name for each fruit (the target values, expectedOutput)
  • Over time, they learn to recognize the patterns that distinguish each fruit

In machine learning:

  • The input parameter contains features (characteristics) of your data
  • The expectedOutput parameter contains the correct answers you want the model to learn
  • During training, the model adjusts its internal calculations to get better at predicting expectedOutput from input

For example, in a house price prediction model:

  • input would contain features like square footage, number of bedrooms, location
  • expectedOutput would contain the actual sale prices of those houses

WithParameters(Vector<T>)

Creates a new instance with the specified parameters.

public override IFullModel<T, TInput, TOutput> WithParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

Returns

IFullModel<T, TInput, TOutput>