Interface ICheckpointManager<T, TInput, TOutput>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for checkpoint management systems that save and restore training state.

public interface ICheckpointManager<T, TInput, TOutput>

Type Parameters

T

The numeric data type used for calculations (e.g., float, double).

TInput

The type of input data the model consumes.

TOutput

The type of output the model produces.

Remarks

A checkpoint manager handles saving and restoring the complete state of model training, allowing you to pause and resume training, recover from failures, and track model evolution.

For Beginners: Think of checkpoints like save points in a video game. They let you:

  • Save your progress so you can come back later
  • Go back to an earlier point if something goes wrong
  • Keep the best version you've found so far

Checkpoints typically save:

  • Model parameters (weights and biases)
  • Optimizer state (momentum, learning rate schedule, etc.)
  • Training metadata (epoch number, step count)
  • Performance metrics

Why checkpoint management matters:

  • Training can be interrupted (crashes, time limits)
  • You want to keep the best model even if later training makes it worse
  • Long training runs need progress saved periodically
  • Checkpoints let you try different training strategies from the same starting point

Methods

CleanupKeepBest(string, int, MetricOptimizationDirection)

Deletes all checkpoints except the best N according to a metric.

int CleanupKeepBest(string metricName, int keepBest = 3, MetricOptimizationDirection direction = MetricOptimizationDirection.Minimize)

Parameters

metricName string

The metric to use for determining best checkpoints.

keepBest int

Number of best checkpoints to keep.

direction MetricOptimizationDirection

Whether to minimize or maximize the metric.

Returns

int

Number of checkpoints deleted.
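
The keep-best-N rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `CheckpointInfo`, `Direction`, and `KeepBest` are hypothetical stand-ins for the real checkpoint metadata, `MetricOptimizationDirection`, and `CleanupKeepBest`, and the sketch works on an in-memory list rather than stored checkpoints. It also assumes that checkpoints lacking the metric are deleted, which the real manager may handle differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory checkpoints, each with one recorded metric.
var checkpoints = new List<CheckpointInfo>
{
    new CheckpointInfo("ckpt-1", new Dictionary<string, double> { ["val_loss"] = 0.90 }),
    new CheckpointInfo("ckpt-2", new Dictionary<string, double> { ["val_loss"] = 0.40 }),
    new CheckpointInfo("ckpt-3", new Dictionary<string, double> { ["val_loss"] = 0.55 }),
    new CheckpointInfo("ckpt-4", new Dictionary<string, double> { ["val_loss"] = 0.35 }),
};

int deleted = KeepBest(checkpoints, "val_loss", keepBest: 2, Direction.Minimize);
string kept = string.Join(", ", checkpoints.Select(c => c.Id));
Console.WriteLine($"Deleted {deleted}; kept: {kept}"); // Deleted 2; kept: ckpt-2, ckpt-4

// Ranks checkpoints by the metric, keeps the top N, removes the rest,
// and returns how many were removed.
static int KeepBest(List<CheckpointInfo> all, string metricName, int keepBest, Direction direction)
{
    var withMetric = all.Where(c => c.Metrics.ContainsKey(metricName));
    var ranked = direction == Direction.Minimize
        ? withMetric.OrderBy(c => c.Metrics[metricName])
        : withMetric.OrderByDescending(c => c.Metrics[metricName]);
    var keep = ranked.Take(keepBest).Select(c => c.Id).ToHashSet();
    return all.RemoveAll(c => !keep.Contains(c.Id));
}

// Hypothetical stand-ins; not AiDotNet types.
enum Direction { Minimize, Maximize }
record CheckpointInfo(string Id, Dictionary<string, double> Metrics);
```

With `Direction.Minimize`, the two lowest `val_loss` values survive; passing `Direction.Maximize` would instead keep the two highest, which is the appropriate choice for metrics such as accuracy.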

CleanupOldCheckpoints(int)

Deletes old checkpoints, keeping only a specified number of the most recent ones.

int CleanupOldCheckpoints(int keepLast = 5)

Parameters

keepLast int

Number of recent checkpoints to keep.

Returns

int

Number of checkpoints deleted.

Remarks

For Beginners: Checkpoints take up disk space, so this helps clean up old ones while keeping your most recent saves. It's like deleting old game saves to free up space.
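
As a rough sketch of the keep-last-N rule, the example below trims a list of saved checkpoints down to the most recent five. `SavedCheckpoint` and `KeepLast` are hypothetical names introduced for illustration, and the sketch assumes that "most recent" means "highest step number"; the real manager may order checkpoints by timestamp instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Seven hypothetical checkpoints saved at steps 100, 200, ..., 700.
var saved = new List<SavedCheckpoint>();
for (int step = 100; step <= 700; step += 100)
{
    saved.Add(new SavedCheckpoint($"step-{step}", step));
}

int removed = KeepLast(saved, keepLast: 5);
Console.WriteLine($"Removed {removed}; oldest remaining: {saved[0].Id}"); // Removed 2; oldest remaining: step-300

// Keeps the keepLast checkpoints with the highest step and removes the rest.
static int KeepLast(List<SavedCheckpoint> all, int keepLast)
{
    var keep = all.OrderByDescending(c => c.Step)
                  .Take(keepLast)
                  .Select(c => c.Id)
                  .ToHashSet();
    return all.RemoveAll(c => !keep.Contains(c.Id));
}

// Hypothetical stand-in; not an AiDotNet type.
record SavedCheckpoint(string Id, int Step);
```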

ConfigureAutoCheckpointing(int, int, bool, string?)

Sets up automatic checkpointing during training.

void ConfigureAutoCheckpointing(int saveFrequency, int keepLast = 5, bool saveOnImprovement = true, string? metricName = null)

Parameters

saveFrequency int

Save every N steps.

keepLast int

Number of recent checkpoints to keep.

saveOnImprovement bool

Whether to save when the tracked metric improves.

metricName string

Metric to track for improvement-based saving.

Remarks

For Beginners: This configures automatic saving, so checkpoints are created periodically without you having to manually save them.

DeleteCheckpoint(string)

Deletes a specific checkpoint.

void DeleteCheckpoint(string checkpointId)

Parameters

checkpointId string

The ID of the checkpoint to delete.

GetCheckpointDirectory()

Gets the storage path for checkpoints.

string GetCheckpointDirectory()

Returns

string

The path of the directory where checkpoints are stored.

ListCheckpoints(string?, bool)

Lists all available checkpoints.

List<CheckpointMetadata<T>> ListCheckpoints(string? sortBy = null, bool descending = true)

Parameters

sortBy string

Optional metric to sort by.

descending bool

Whether to sort in descending order.

Returns

List<CheckpointMetadata<T>>

List of checkpoint metadata.

LoadBestCheckpoint(string, MetricOptimizationDirection)

Loads the checkpoint with the best metric value.

Checkpoint<T, TInput, TOutput>? LoadBestCheckpoint(string metricName, MetricOptimizationDirection direction)

Parameters

metricName string

The name of the metric to optimize.

direction MetricOptimizationDirection

Whether to minimize or maximize the metric.

Returns

Checkpoint<T, TInput, TOutput>

The best checkpoint, or null if none exist.

Remarks

For Beginners: This finds and loads the checkpoint where your model performed best according to a specific metric (like lowest loss or highest accuracy).

LoadCheckpoint(string)

Loads a checkpoint and restores the training state.

Checkpoint<T, TInput, TOutput> LoadCheckpoint(string checkpointId)

Parameters

checkpointId string

The ID of the checkpoint to load.

Returns

Checkpoint<T, TInput, TOutput>

A checkpoint object containing the restored state.

LoadLatestCheckpoint()

Loads the most recent checkpoint.

Checkpoint<T, TInput, TOutput>? LoadLatestCheckpoint()

Returns

Checkpoint<T, TInput, TOutput>

The latest checkpoint, or null if none exist.
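
Because the result is null when no checkpoint exists, callers usually need a fallback for a fresh run. The sketch below shows one way to pick the starting epoch. `ResumePoint` and `StartEpoch` are hypothetical stand-ins, since this page does not list the members of `Checkpoint<T, TInput, TOutput>`, and the `Epoch + 1` rule assumes the checkpoint was saved at the end of an epoch.

```csharp
#nullable enable
using System;

// No checkpoint yet: start from the first epoch.
int fresh = StartEpoch(null);

// A checkpoint saved at the end of epoch 4: resume at epoch 5.
int resumed = StartEpoch(new ResumePoint(Epoch: 4, Step: 5000));

Console.WriteLine($"{fresh} {resumed}"); // 0 5

// Returns 0 when there is nothing to resume from, otherwise the epoch after the saved one.
static int StartEpoch(ResumePoint? latest) => latest is null ? 0 : latest.Epoch + 1;

// Hypothetical stand-in for the data read back from a checkpoint.
record ResumePoint(int Epoch, int Step);
```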

SaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata>, IOptimizer<T, TInput, TOutput>, int, int, Dictionary<string, T>, Dictionary<string, object>?)

Saves a checkpoint of the current training state.

string SaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata> model, IOptimizer<T, TInput, TOutput> optimizer, int epoch, int step, Dictionary<string, T> metrics, Dictionary<string, object>? metadata = null) where TMetadata : class

Parameters

model IModel<TInput, TOutput, TMetadata>

The model to checkpoint.

optimizer IOptimizer<T, TInput, TOutput>

The optimizer state to save.

epoch int

The current training epoch.

step int

The current training step.

metrics Dictionary<string, T>

Current performance metrics.

metadata Dictionary<string, object>

Additional metadata to save with the checkpoint.

Returns

string

The unique identifier for the saved checkpoint.

Type Parameters

TMetadata

The type of model metadata.

Remarks

For Beginners: This saves everything about the current state of training so you can restore it later.

ShouldAutoSaveCheckpoint(int, double?, bool)

Determines whether an automatic checkpoint should be saved based on current configuration.

bool ShouldAutoSaveCheckpoint(int currentStep, double? metricValue = null, bool shouldMinimize = true)

Parameters

currentStep int

The current training step.

metricValue double?

Optional metric value for improvement-based checkpointing.

shouldMinimize bool

Whether the metric should be minimized (true) or maximized (false).

Returns

bool

True if a checkpoint should be saved.
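
One plausible reading of this decision, combined with the settings from ConfigureAutoCheckpointing, is "save on schedule, or when the metric improves". The sketch below illustrates that reading only; it is not the library's implementation. `AutoSavePolicy`, `ShouldSave`, and `RecordSave` are hypothetical names, with `RecordSave` loosely playing the role of UpdateAutoSaveState.

```csharp
using System;

var policy = new AutoSavePolicy(saveFrequency: 100, saveOnImprovement: true);

bool first = policy.ShouldSave(currentStep: 100, metricValue: 0.80);  // on schedule
policy.RecordSave(0.80);                                              // remember the best loss so far
bool worse = policy.ShouldSave(currentStep: 150, metricValue: 0.85);  // off schedule, loss got worse
bool better = policy.ShouldSave(currentStep: 150, metricValue: 0.70); // off schedule, loss improved

Console.WriteLine($"{first} {worse} {better}"); // True False True

// Hypothetical policy object; the real manager keeps equivalent state internally.
sealed class AutoSavePolicy
{
    private readonly int _saveFrequency;
    private readonly bool _saveOnImprovement;
    private double? _best;

    public AutoSavePolicy(int saveFrequency, bool saveOnImprovement)
    {
        _saveFrequency = saveFrequency;
        _saveOnImprovement = saveOnImprovement;
    }

    public bool ShouldSave(int currentStep, double? metricValue = null, bool shouldMinimize = true)
    {
        // Scheduled save: every _saveFrequency steps.
        bool onSchedule = _saveFrequency > 0 && currentStep > 0 && currentStep % _saveFrequency == 0;
        if (onSchedule) return true;

        // Improvement-based save: only when enabled and a metric value was supplied.
        if (!_saveOnImprovement || metricValue is null) return false;
        if (_best is null) return true;
        return shouldMinimize ? metricValue.Value < _best.Value : metricValue.Value > _best.Value;
    }

    public void RecordSave(double? metricValue, bool shouldMinimize = true)
    {
        if (metricValue is null) return;
        bool improved = _best is null
            || (shouldMinimize ? metricValue.Value < _best.Value : metricValue.Value > _best.Value);
        if (improved) _best = metricValue;
    }
}
```

Keeping the decision (`ShouldSave`) separate from the state update (`RecordSave`) mirrors the split between this method and UpdateAutoSaveState: the check has no side effects, so it can be called repeatedly without changing which value counts as the best so far.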

TryAutoSaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata>, IOptimizer<T, TInput, TOutput>, int, int, Dictionary<string, T>, double?, bool, Dictionary<string, object>?)

Attempts to save a checkpoint automatically based on the configured auto-checkpoint settings. Training facades call this method internally; you do not need to call it directly.

string? TryAutoSaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata> model, IOptimizer<T, TInput, TOutput> optimizer, int epoch, int step, Dictionary<string, T> metrics, double? metricValue = null, bool shouldMinimize = true, Dictionary<string, object>? metadata = null) where TMetadata : class

Parameters

model IModel<TInput, TOutput, TMetadata>

The model to checkpoint.

optimizer IOptimizer<T, TInput, TOutput>

The optimizer state to checkpoint.

epoch int

The current epoch.

step int

The current training step.

metrics Dictionary<string, T>

Training metrics to store with the checkpoint.

metricValue double?

Optional metric value for improvement-based checkpointing.

shouldMinimize bool

Whether the metric should be minimized (true) or maximized (false).

metadata Dictionary<string, object>

Optional additional metadata.

Returns

string

The checkpoint ID if saved, or null if no checkpoint was saved.

Type Parameters

TMetadata

The type of model metadata.

UpdateAutoSaveState(int, double?, bool)

Updates the auto-save state after a checkpoint is saved.

void UpdateAutoSaveState(int step, double? metricValue = null, bool shouldMinimize = true)

Parameters

step int

The step at which the checkpoint was saved.

metricValue double?

Optional metric value for improvement tracking.

shouldMinimize bool

Whether the metric should be minimized (true) or maximized (false).