Interface ICheckpointManager<T, TInput, TOutput>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for checkpoint management systems that save and restore training state.
public interface ICheckpointManager<T, TInput, TOutput>
Type Parameters
T: The numeric data type used for calculations (e.g., float, double).
TInput
TOutput
Remarks
A checkpoint manager handles saving and restoring the complete state of model training, allowing you to pause and resume training, recover from failures, and track model evolution.
For Beginners: Think of checkpoints like save points in a video game. They let you:
- Save your progress so you can come back later
- Go back to an earlier point if something goes wrong
- Keep the best version you've found so far
Checkpoints typically save:
- Model parameters (weights and biases)
- Optimizer state (momentum, learning rate schedule, etc.)
- Training metadata (epoch number, step count)
- Performance metrics
Why checkpoint management matters:
- Training can be interrupted (crashes, time limits)
- You want to keep the best model even if later training makes it worse
- Long training runs need progress saved periodically
- You can experiment with different training strategies starting from the same point
Methods
CleanupKeepBest(string, int, MetricOptimizationDirection)
Deletes checkpoints except the best N according to a metric.
int CleanupKeepBest(string metricName, int keepBest = 3, MetricOptimizationDirection direction = MetricOptimizationDirection.Minimize)
Parameters
metricName (string): The metric to use for determining best checkpoints.
keepBest (int): Number of best checkpoints to keep.
direction (MetricOptimizationDirection): Whether to minimize or maximize the metric.
Returns
- int
Number of checkpoints deleted.
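The selection rule behind a keep-best-N cleanup can be sketched in isolation. This is a minimal illustration, not AiDotNet's implementation: checkpoints are modelled as hypothetical `(Id, Metric)` tuples, the helper `SelectForDeletion` is invented for this example, and a `bool minimize` flag stands in for `MetricOptimizationDirection`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of AiDotNet): returns the IDs that a
// keep-best-N cleanup would delete. `minimize` stands in for
// MetricOptimizationDirection.Minimize (true) or Maximize (false).
static List<string> SelectForDeletion(
    IEnumerable<(string Id, double Metric)> checkpoints, int keepBest, bool minimize)
{
    // Order best-first; everything past the first keepBest entries is deleted.
    var bestFirst = minimize
        ? checkpoints.OrderBy(c => c.Metric)
        : checkpoints.OrderByDescending(c => c.Metric);
    return bestFirst.Skip(keepBest).Select(c => c.Id).ToList();
}

// Illustrative data: four checkpoints scored by validation loss (lower is better).
var saved = new List<(string Id, double Metric)>
{
    ("ckpt-1", 0.90), ("ckpt-2", 0.40), ("ckpt-3", 0.55), ("ckpt-4", 0.35),
};

var toDelete = SelectForDeletion(saved, keepBest: 3, minimize: true);

// prints "Would delete 1 checkpoint(s): ckpt-1"
Console.WriteLine($"Would delete {toDelete.Count} checkpoint(s): {string.Join(", ", toDelete)}");
```

The length of the returned list corresponds to the count that `CleanupKeepBest` reports. A real implementation would also have to decide how to treat checkpoints that lack the named metric, which this sketch does not address.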
CleanupOldCheckpoints(int)
Deletes old checkpoints, keeping only a specified number of the most recent ones.
int CleanupOldCheckpoints(int keepLast = 5)
Parameters
keepLast (int): Number of recent checkpoints to keep.
Returns
- int
Number of checkpoints deleted.
Remarks
For Beginners: Checkpoints take up disk space, so this helps clean up old ones while keeping your most recent saves. It's like deleting old game saves to free up space.
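A recency-based cleanup amounts to sorting by age and dropping everything beyond the newest `keepLast` entries. The sketch below assumes each checkpoint is tagged with the training step at which it was saved; the tuple shape and the helper `SelectOldForDeletion` are hypothetical and do not come from AiDotNet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of AiDotNet): given checkpoints tagged with the
// step at which they were saved, return the IDs that fall outside the
// keepLast most recent ones.
static List<string> SelectOldForDeletion(
    IEnumerable<(string Id, int Step)> checkpoints, int keepLast)
{
    return checkpoints
        .OrderByDescending(c => c.Step) // newest first
        .Skip(keepLast)                 // protect the keepLast most recent
        .Select(c => c.Id)
        .ToList();
}

var history = new List<(string Id, int Step)>
{
    ("ckpt-100", 100), ("ckpt-200", 200), ("ckpt-300", 300), ("ckpt-400", 400),
};

var stale = SelectOldForDeletion(history, keepLast: 2);
Console.WriteLine(string.Join(", ", stale)); // prints "ckpt-200, ckpt-100"
```

Whether the actual manager orders by step, by timestamp, or by something else is not stated on this page; step order is used here only because it is easy to reason about.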
ConfigureAutoCheckpointing(int, int, bool, string?)
Sets up automatic checkpointing during training.
void ConfigureAutoCheckpointing(int saveFrequency, int keepLast = 5, bool saveOnImprovement = true, string? metricName = null)
Parameters
saveFrequency (int): Save every N steps.
keepLast (int): Number of recent checkpoints to keep.
saveOnImprovement (bool): Whether to save when metric improves.
metricName (string): Metric to track for improvement-based saving.
Remarks
For Beginners: This configures automatic saving, so checkpoints are created periodically without you having to manually save them.
DeleteCheckpoint(string)
Deletes a specific checkpoint.
void DeleteCheckpoint(string checkpointId)
Parameters
checkpointId (string): The ID of the checkpoint to delete.
GetCheckpointDirectory()
Gets the storage path for checkpoints.
string GetCheckpointDirectory()
Returns
- string
The storage path for checkpoints.
ListCheckpoints(string?, bool)
Lists all available checkpoints.
List<CheckpointMetadata<T>> ListCheckpoints(string? sortBy = null, bool descending = true)
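Parameters
sortBy (string?): Optional field to sort by; defaults to null.
descending (bool): Whether to sort in descending order; defaults to true.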
Returns
- List<CheckpointMetadata<T>>
List of checkpoint metadata.
LoadBestCheckpoint(string, MetricOptimizationDirection)
Loads the checkpoint with the best metric value.
Checkpoint<T, TInput, TOutput>? LoadBestCheckpoint(string metricName, MetricOptimizationDirection direction)
Parameters
metricName (string): The name of the metric to optimize.
direction (MetricOptimizationDirection): Whether to minimize or maximize the metric.
Returns
- Checkpoint<T, TInput, TOutput>
The best checkpoint, or null if none exist.
Remarks
For Beginners: This finds and loads the checkpoint where your model performed best according to a specific metric (like lowest loss or highest accuracy).
LoadCheckpoint(string)
Loads a checkpoint and restores the training state.
Checkpoint<T, TInput, TOutput> LoadCheckpoint(string checkpointId)
Parameters
checkpointId (string): The ID of the checkpoint to load.
Returns
- Checkpoint<T, TInput, TOutput>
A checkpoint object containing the restored state.
LoadLatestCheckpoint()
Loads the most recent checkpoint.
Checkpoint<T, TInput, TOutput>? LoadLatestCheckpoint()
Returns
- Checkpoint<T, TInput, TOutput>
The latest checkpoint, or null if none exist.
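Because the result is null when no checkpoint exists, callers need a fallback for a fresh run. The sketch below shows that resume-or-start-fresh pattern with hypothetical stand-ins: an in-memory list replaces the checkpoint store, and an `(Epoch, Step)` tuple replaces `Checkpoint<T, TInput, TOutput>`, whose members are not listed on this page. It also assumes checkpoints are written at the end of an epoch, so training resumes at the following one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for LoadLatestCheckpoint(): returns the entry with the
// highest step, or null when the store is empty.
static (int Epoch, int Step)? LoadLatest(List<(int Epoch, int Step)> store)
{
    if (store.Count == 0) return null;
    return store.OrderByDescending(c => c.Step).First();
}

// Decide where training should start: after the restored epoch, or from zero.
// Assumes the checkpoint was written at the end of its epoch.
static int ResolveStartEpoch((int Epoch, int Step)? latest)
{
    return latest.HasValue ? latest.Value.Epoch + 1 : 0;
}

var emptyStore = new List<(int Epoch, int Step)>();
var usedStore = new List<(int Epoch, int Step)> { (0, 500), (1, 1000) };

Console.WriteLine(ResolveStartEpoch(LoadLatest(emptyStore))); // prints 0
Console.WriteLine(ResolveStartEpoch(LoadLatest(usedStore)));  // prints 2
```

The same null check applies to `LoadBestCheckpoint`, which is also documented as returning null when no checkpoints exist.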
SaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata>, IOptimizer<T, TInput, TOutput>, int, int, Dictionary<string, T>, Dictionary<string, object>?)
Saves a checkpoint of the current training state.
string SaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata> model, IOptimizer<T, TInput, TOutput> optimizer, int epoch, int step, Dictionary<string, T> metrics, Dictionary<string, object>? metadata = null) where TMetadata : class
Parameters
model (IModel<TInput, TOutput, TMetadata>): The model to checkpoint.
optimizer (IOptimizer<T, TInput, TOutput>): The optimizer state to save.
epoch (int): The current training epoch.
step (int): The current training step.
metrics (Dictionary<string, T>): Current performance metrics.
metadata (Dictionary<string, object>): Additional metadata to save with the checkpoint.
Returns
- string
The unique identifier for the saved checkpoint.
Type Parameters
TMetadata: The type of model metadata.
Remarks
For Beginners: This saves everything about the current state of training so you can restore it later.
ShouldAutoSaveCheckpoint(int, double?, bool)
Determines whether an automatic checkpoint should be saved based on current configuration.
bool ShouldAutoSaveCheckpoint(int currentStep, double? metricValue = null, bool shouldMinimize = true)
Parameters
currentStep (int): The current training step.
metricValue (double?): Optional metric value for improvement-based checkpointing.
shouldMinimize (bool): Whether the metric should be minimized (true) or maximized (false).
Returns
- bool
True if a checkpoint should be saved.
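The interface does not specify how this decision is made. One plausible rule, sketched below purely as an assumption, is to save on every `saveFrequency`-th step or whenever the supplied metric beats the best value seen so far. The local function `ShouldSave` and the variables `saveFrequency`, `saveOnImprovement`, and `bestMetric` are hypothetical; they mirror the parameters of `ConfigureAutoCheckpointing` but are not taken from AiDotNet's source.

```csharp
using System;

// Hypothetical settings mirroring ConfigureAutoCheckpointing's parameters.
int saveFrequency = 100;
bool saveOnImprovement = true;

// Best metric value seen so far; null until the first metric is recorded.
double? bestMetric = null;

// Assumed rule (not confirmed by the interface): save on every
// saveFrequency-th step, or when the metric beats the best value so far.
bool ShouldSave(int currentStep, double? metricValue = null, bool shouldMinimize = true)
{
    bool onSchedule = saveFrequency > 0
        && currentStep > 0
        && currentStep % saveFrequency == 0;

    bool improved = false;
    if (saveOnImprovement && metricValue.HasValue)
    {
        // The first recorded metric always counts as an improvement.
        improved = !bestMetric.HasValue
            || (shouldMinimize
                ? metricValue.Value < bestMetric.Value
                : metricValue.Value > bestMetric.Value);
    }

    return onSchedule || improved;
}

Console.WriteLine(ShouldSave(50, 0.80));  // prints True: first metric counts as an improvement
bestMetric = 0.80;                        // the kind of bookkeeping UpdateAutoSaveState would do
Console.WriteLine(ShouldSave(60, 0.95));  // prints False: off schedule and worse than 0.80
Console.WriteLine(ShouldSave(100, 0.95)); // prints True: step 100 is on schedule
```

Keeping the decision separate from the state update, as the interface does with `ShouldAutoSaveCheckpoint` and `UpdateAutoSaveState`, means the check can be called repeatedly without side effects.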
TryAutoSaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata>, IOptimizer<T, TInput, TOutput>, int, int, Dictionary<string, T>, double?, bool, Dictionary<string, object>?)
Attempts to save a checkpoint automatically based on the configured auto-checkpoint settings. Training facades call this method internally; users do not need to call it directly.
string? TryAutoSaveCheckpoint<TMetadata>(IModel<TInput, TOutput, TMetadata> model, IOptimizer<T, TInput, TOutput> optimizer, int epoch, int step, Dictionary<string, T> metrics, double? metricValue = null, bool shouldMinimize = true, Dictionary<string, object>? metadata = null) where TMetadata : class
Parameters
model (IModel<TInput, TOutput, TMetadata>): The model to checkpoint.
optimizer (IOptimizer<T, TInput, TOutput>): The optimizer state to checkpoint.
epoch (int): The current epoch.
step (int): The current training step.
metrics (Dictionary<string, T>): Training metrics to store with the checkpoint.
metricValue (double?): Optional metric value for improvement-based checkpointing.
shouldMinimize (bool): Whether the metric should be minimized (true) or maximized (false).
metadata (Dictionary<string, object>): Optional additional metadata.
Returns
- string
The checkpoint ID if saved, or null if no checkpoint was saved.
Type Parameters
TMetadata: The type of model metadata.
UpdateAutoSaveState(int, double?, bool)
Updates the auto-save state after a checkpoint is saved.
void UpdateAutoSaveState(int step, double? metricValue = null, bool shouldMinimize = true)