Interface IDataVersionControl<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for data version control systems that track dataset changes over time.

public interface IDataVersionControl<T>

Type Parameters

T

The numeric data type used for calculations (e.g., float, double).

Remarks

A data version control system manages versions of datasets used for training and evaluating models, ensuring reproducibility and traceability.

For Beginners: Think of data version control as Git for your datasets. Just as Git tracks changes to your code, data version control tracks changes to your data:

  • Records what data was used to train each model
  • Lets you go back to previous versions of datasets
  • Helps reproduce experiments with the exact same data
  • Tracks where data came from and how it was transformed

Common scenarios include:

  • Dataset updates (new examples added, errors corrected)
  • Data preprocessing changes (different normalization, feature engineering)
  • Train/validation/test splits that need to be reproduced
  • Tracking data lineage for compliance

Why data version control matters:

  • Models trained on different data versions can perform differently
  • Reproducing results requires the exact same data
  • Debugging requires knowing what data was used
  • Compliance and auditing need data traceability
  • Collaboration requires shared understanding of data versions

Methods

CompareDatasetVersions(string, string, string)

Compares two dataset versions to see what changed.

DatasetComparison<T> CompareDatasetVersions(string datasetName, string version1Hash, string version2Hash)

Parameters

datasetName string

Name of the dataset.

version1Hash string

First version hash.

version2Hash string

Second version hash.

Returns

DatasetComparison<T>

Comparison showing differences between versions.

ComputeDatasetHash(string)

Computes and stores a hash of the dataset for integrity verification.

string ComputeDatasetHash(string dataPath)

Parameters

dataPath string

Path to the dataset.

Returns

string

The computed hash.

Remarks

For Beginners: A hash is like a fingerprint for your dataset. If even one value changes, the hash will be different. This helps verify data integrity.
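
The fingerprint idea can be illustrated with a minimal sketch. It assumes SHA-256 as the hash algorithm, which the interface does not specify, and `FingerprintSketch` and `Sha256Hex` are hypothetical names rather than part of AiDotNet.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class FingerprintSketch
{
    // Returns the SHA-256 digest of the given bytes as a lowercase hex string.
    // SHA-256 is an assumption; the interface does not name an algorithm.
    public static string Sha256Hex(byte[] data)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Hashing the text "1.0,2.0,3.0" and then "1.0,2.0,3.1" with this helper produces two different 64-character strings, even though only one digit changed.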

CreateDatasetSnapshot(string, Dictionary<string, string>, string?)

Creates a snapshot of multiple related datasets together.

string CreateDatasetSnapshot(string snapshotName, Dictionary<string, string> datasets, string? description = null)

Parameters

snapshotName string

Name for the snapshot.

datasets Dictionary<string, string>

Dictionary mapping dataset names to their version hashes.

description string

Optional description of the snapshot.

Returns

string

The unique identifier for the snapshot.

Remarks

For Beginners: This captures multiple datasets at once (like train, validation, and test sets) so you can reproduce experiments that use all of them together.
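
A call that groups train, validation, and test versions might look like the sketch below. `ISnapshotCreator` is a hypothetical stand-in that copies only the documented `CreateDatasetSnapshot` signature, and `InMemorySnapshots` is a toy implementation for illustration; the snapshot name, description, and GUID-based identifier are assumptions, not AiDotNet behavior.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in that mirrors only the documented signature.
public interface ISnapshotCreator
{
    string CreateDatasetSnapshot(string snapshotName, Dictionary<string, string> datasets, string? description = null);
}

// Toy in-memory implementation, for illustration only.
public sealed class InMemorySnapshots : ISnapshotCreator
{
    public Dictionary<string, Dictionary<string, string>> Stored { get; } =
        new Dictionary<string, Dictionary<string, string>>();

    public string CreateDatasetSnapshot(string snapshotName, Dictionary<string, string> datasets, string? description = null)
    {
        // Copy the mapping so later edits by the caller do not alter the snapshot.
        Stored[snapshotName] = new Dictionary<string, string>(datasets);
        return Guid.NewGuid().ToString("N"); // assumed identifier format
    }
}

public static class SnapshotSketch
{
    // Builds the dataset-name-to-version-hash mapping and snapshots it in one call.
    public static string SnapshotSplits(ISnapshotCreator dvc, string trainHash, string valHash, string testHash)
    {
        var datasets = new Dictionary<string, string>
        {
            ["train"] = trainHash,
            ["validation"] = valHash,
            ["test"] = testHash
        };
        return dvc.CreateDatasetSnapshot("experiment-splits", datasets, "Splits used together in one experiment");
    }
}
```

Keying the dictionary by dataset name keeps the three version hashes tied to their roles, so a later lookup of the snapshot can restore all of them together.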

CreateDatasetVersion(string, string, string?, Dictionary<string, object>?, Dictionary<string, string>?)

Creates a new dataset version.

string CreateDatasetVersion(string datasetName, string dataPath, string? description = null, Dictionary<string, object>? metadata = null, Dictionary<string, string>? tags = null)

Parameters

datasetName string

Name of the dataset.

dataPath string

Path to the data file(s).

description string

Optional description of this version.

metadata Dictionary<string, object>

Additional metadata about the dataset.

tags Dictionary<string, string>

Tags for categorizing the dataset.

Returns

string

The unique identifier (version hash) for this dataset version.

Remarks

For Beginners: This saves a snapshot of your dataset with a unique identifier, like committing changes in Git.

DeleteDatasetVersion(string, string)

Deletes a specific dataset version.

void DeleteDatasetVersion(string datasetName, string versionHash)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version to delete.

GetDatasetByTag(string, string)

Gets a dataset version by its tag.

DatasetVersion<T> GetDatasetByTag(string datasetName, string tag)

Parameters

datasetName string

Name of the dataset.

tag string

The tag to look up.

Returns

DatasetVersion<T>

The dataset version with that tag.

GetDatasetForRun(string)

Gets the dataset version used by a specific training run.

DatasetVersion<T> GetDatasetForRun(string runId)

Parameters

runId string

ID of the training run.

Returns

DatasetVersion<T>

Information about the dataset version used.

GetDatasetLineage(string, string)

Gets the lineage information for a dataset version.

DatasetLineage GetDatasetLineage(string datasetName, string versionHash)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version hash.

Returns

DatasetLineage

Lineage information showing how the dataset was created.

GetDatasetSnapshot(string)

Retrieves a dataset snapshot.

DatasetSnapshot GetDatasetSnapshot(string snapshotName)

Parameters

snapshotName string

Name of the snapshot.

Returns

DatasetSnapshot

Information about all datasets in the snapshot.

GetDatasetStatistics(string, string)

Gets statistics about a dataset version.

DatasetStatistics<T> GetDatasetStatistics(string datasetName, string versionHash)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version hash.

Returns

DatasetStatistics<T>

Statistical summary of the dataset.

Remarks

For Beginners: This provides summary statistics about the dataset, such as the number of rows and columns, data types, and basic descriptive statistics.

GetDatasetVersion(string, string?)

Retrieves a specific version of a dataset.

DatasetVersion<T> GetDatasetVersion(string datasetName, string? versionHash = null)

Parameters

datasetName string

Name of the dataset.

versionHash string

The version hash to retrieve. If null, the latest version is retrieved.

Returns

DatasetVersion<T>

Information about the dataset version.

GetLatestDatasetVersion(string)

Gets the latest version of a dataset.

DatasetVersion<T> GetLatestDatasetVersion(string datasetName)

Parameters

datasetName string

Name of the dataset.

Returns

DatasetVersion<T>

The latest version of the dataset.

GetRunsUsingDataset(string, string)

Gets all training runs that used a specific dataset version.

List<string> GetRunsUsingDataset(string datasetName, string versionHash)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version hash.

Returns

List<string>

List of run IDs that used this dataset version.

LinkDatasetToRun(string, string, string, string?)

Links a dataset version to a model training run.

void LinkDatasetToRun(string datasetName, string versionHash, string runId, string? modelId = null)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version of the dataset.

runId string

ID of the training run or experiment.

modelId string

Optional ID of the model that was trained.

Remarks

For Beginners: This creates a record showing which dataset version was used to train which model, enabling full reproducibility.
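
The relationship between `LinkDatasetToRun` and `GetRunsUsingDataset` can be sketched as follows. `IRunLinks` is a hypothetical stand-in that copies the two documented signatures, and `InMemoryRunLinks` is a toy implementation; ignoring `modelId` and skipping duplicate run IDs are assumptions made for brevity.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in mirroring two documented signatures.
public interface IRunLinks
{
    void LinkDatasetToRun(string datasetName, string versionHash, string runId, string? modelId = null);
    List<string> GetRunsUsingDataset(string datasetName, string versionHash);
}

// Toy in-memory implementation, for illustration only.
public sealed class InMemoryRunLinks : IRunLinks
{
    // Maps (dataset name, version hash) to the run IDs linked to that version.
    private readonly Dictionary<(string, string), List<string>> _runs =
        new Dictionary<(string, string), List<string>>();

    public void LinkDatasetToRun(string datasetName, string versionHash, string runId, string? modelId = null)
    {
        var key = (datasetName, versionHash);
        if (!_runs.TryGetValue(key, out var runs))
        {
            runs = new List<string>();
            _runs[key] = runs;
        }
        // modelId is ignored in this sketch; duplicate links are skipped (assumption).
        if (!runs.Contains(runId)) runs.Add(runId);
    }

    public List<string> GetRunsUsingDataset(string datasetName, string versionHash)
    {
        // Return a copy so callers cannot mutate the stored list.
        return _runs.TryGetValue((datasetName, versionHash), out var runs)
            ? new List<string>(runs)
            : new List<string>();
    }
}
```

After linking runs "run-1" and "run-2" to the same version, `GetRunsUsingDataset` returns both IDs in insertion order, while a version with no links returns an empty list.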

ListDatasetVersions(string)

Lists all versions of a dataset.

List<DatasetVersionInfo<T>> ListDatasetVersions(string datasetName)

Parameters

datasetName string

Name of the dataset.

Returns

List<DatasetVersionInfo<T>>

List of all dataset versions with metadata.

ListDatasets(string?, Dictionary<string, string>?)

Lists all tracked datasets.

List<string> ListDatasets(string? filter = null, Dictionary<string, string>? tags = null)

Parameters

filter string

Optional filter expression.

tags Dictionary<string, string>

Optional tags to filter by.

Returns

List<string>

List of dataset names matching the criteria.

RecordDatasetLineage(string, string, DatasetLineage)

Records metadata about how a dataset was created or transformed.

void RecordDatasetLineage(string datasetName, string versionHash, DatasetLineage lineage)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version of the dataset.

lineage DatasetLineage

Lineage information (source datasets, transformations applied).

Remarks

For Beginners: Lineage tracks the "family history" of your dataset: where it came from, what preprocessing was applied, and so on. This is crucial for understanding and reproducing your work.

TagDatasetVersion(string, string, string)

Tags a dataset version for easy reference.

void TagDatasetVersion(string datasetName, string versionHash, string tag)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version to tag.

tag string

The tag name to assign.

Remarks

For Beginners: Tags are like bookmarks: they let you give a version a memorable name like "production-data" or "v2-cleaned" instead of using the hash.
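
The bookmark analogy can be sketched as a lookup table from tag names to version hashes. `InMemoryTags` is a toy class for illustration, and `ResolveTag` is a hypothetical helper that is not part of the interface (the documented `GetDatasetByTag` returns a `DatasetVersion<T>` instead). Letting a repeated tag move to the newer version is also an assumption.

```csharp
using System;
using System.Collections.Generic;

// Toy in-memory tag table, for illustration only.
public sealed class InMemoryTags
{
    // Maps (dataset name, tag) to the version hash the tag points at.
    private readonly Dictionary<(string, string), string> _tags =
        new Dictionary<(string, string), string>();

    // Mirrors the documented TagDatasetVersion signature.
    public void TagDatasetVersion(string datasetName, string versionHash, string tag)
    {
        // Re-using a tag moves it to the new version (assumption).
        _tags[(datasetName, tag)] = versionHash;
    }

    // Hypothetical helper, not part of the interface: returns the hash behind a tag.
    public string ResolveTag(string datasetName, string tag)
    {
        if (_tags.TryGetValue((datasetName, tag), out string hash))
        {
            return hash;
        }
        throw new KeyNotFoundException("No version of '" + datasetName + "' is tagged '" + tag + "'.");
    }
}
```

With this table, code that trains a model can ask for "production-data" and receive whichever hash currently carries that tag, without hard-coding the hash itself.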

VerifyDatasetIntegrity(string, string, string)

Compares the hash of the current data with the recorded version hash to verify that the dataset has not been modified.

bool VerifyDatasetIntegrity(string datasetName, string versionHash, string currentDataPath)

Parameters

datasetName string

Name of the dataset.

versionHash string

Version to verify.

currentDataPath string

Current location of the data.

Returns

bool

true if the data matches the recorded version; false if it has been modified.
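
One way such a check could work is sketched below. It assumes the version hash is a SHA-256 hex digest of the file contents, which the interface does not state, and `IntegritySketch`, `HashFile`, and `MatchesVersion` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class IntegritySketch
{
    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    public static string HashFile(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Returns true when the file's current hash equals the recorded version hash.
    // Assumes the version hash is a hex-encoded SHA-256 content hash.
    public static bool MatchesVersion(string versionHash, string currentDataPath)
    {
        string currentHash = HashFile(currentDataPath);
        return string.Equals(currentHash, versionHash, StringComparison.OrdinalIgnoreCase);
    }
}
```

Hashing from a stream avoids loading a large dataset file into memory at once, and the case-insensitive comparison tolerates hex digests recorded in upper case.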