Interface IDataVersionControl<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for data version control systems that track dataset changes over time.
public interface IDataVersionControl<T>
Type Parameters
- T: The numeric data type used for calculations (e.g., float, double).
Remarks
A data version control system manages versions of datasets used for training and evaluating models, ensuring reproducibility and traceability.
For Beginners: Think of data version control like Git, but for your datasets instead of code. Just like Git tracks changes to your code, data version control tracks changes to your data:
- Records what data was used to train each model
- Lets you go back to previous versions of datasets
- Helps reproduce experiments with exactly the same data
- Tracks where data came from and how it was transformed
Common scenarios include:
- Dataset updates (new examples added, errors corrected)
- Data preprocessing changes (different normalization, feature engineering)
- Train/validation/test splits that need to be reproduced
- Tracking data lineage for compliance
Why data version control matters:
- Models trained on different data versions can perform differently
- Reproducing results requires exactly the same data
- Debugging requires knowing what data was used
- Compliance and auditing need data traceability
- Collaboration requires shared understanding of data versions
Methods
CompareDatasetVersions(string, string, string)
Compares two dataset versions to see what changed.
DatasetComparison<T> CompareDatasetVersions(string datasetName, string version1Hash, string version2Hash)
Parameters
- datasetName (string): Name of the dataset.
- version1Hash (string): First version hash.
- version2Hash (string): Second version hash.
Returns
- DatasetComparison<T>
Comparison showing differences between versions.
ComputeDatasetHash(string)
Computes and stores a hash of the dataset for integrity verification.
string ComputeDatasetHash(string dataPath)
Parameters
- dataPath (string): Path to the dataset.
Returns
- string
The computed hash.
Remarks
For Beginners: A hash is like a fingerprint for your dataset. If even one value changes, the hash will be different. This helps verify data integrity.
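The fingerprint idea can be illustrated without the library. The following is a minimal sketch that hashes a file's bytes with SHA-256 from the .NET base class library. The choice of SHA-256 and the `DatasetFingerprint` helper are assumptions made for illustration; this page does not state which algorithm an implementation of ComputeDatasetHash uses.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; it is not part of AiDotNet.
public static class DatasetFingerprint
{
    // Returns the SHA-256 digest of the file's bytes as a lowercase hex string.
    public static string OfFile(string dataPath)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(dataPath);
        byte[] digest = sha.ComputeHash(stream);

        // BitConverter yields "AB-CD-..."; strip the dashes and lowercase the result.
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```

Hashing the same file twice gives the same string, while changing a single byte of the file gives a different one, which is the property that integrity checks rely on.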
CreateDatasetSnapshot(string, Dictionary<string, string>, string?)
Creates a snapshot of multiple related datasets together.
string CreateDatasetSnapshot(string snapshotName, Dictionary<string, string> datasets, string? description = null)
Parameters
- snapshotName (string): Name for the snapshot.
- datasets (Dictionary<string, string>): Dictionary mapping dataset names to their version hashes.
- description (string?): Optional description of the snapshot.
Returns
- string
The unique identifier for the snapshot.
Remarks
For Beginners: This captures multiple datasets at once (like train, validation, and test sets) so you can reproduce experiments that use all of them together.
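A minimal usage sketch, assuming an `IDataVersionControl<double>` implementation is already available as `dvc` (how to obtain one is not covered on this page). The `SnapshotExample` class, the snapshot name, and the dataset keys are illustrative placeholders.

```csharp
using System.Collections.Generic;
using AiDotNet.Interfaces;

// Hypothetical example class; only the CreateDatasetSnapshot call comes from the interface.
public static class SnapshotExample
{
    // Groups the version hashes of three related splits under one snapshot name.
    public static string SnapshotSplits(
        IDataVersionControl<double> dvc,
        string trainHash,
        string validationHash,
        string testHash)
    {
        // Keys are dataset names, values are the version hashes to capture.
        var datasets = new Dictionary<string, string>
        {
            ["train"] = trainHash,
            ["validation"] = validationHash,
            ["test"] = testHash
        };

        // Placeholder snapshot name and description.
        return dvc.CreateDatasetSnapshot(
            "baseline-splits",
            datasets,
            "Train, validation, and test splits for the baseline experiment");
    }
}
```

The returned identifier can be stored alongside experiment results, and the snapshot can later be looked up by name with GetDatasetSnapshot.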
CreateDatasetVersion(string, string, string?, Dictionary<string, object>?, Dictionary<string, string>?)
Creates a new dataset version.
string CreateDatasetVersion(string datasetName, string dataPath, string? description = null, Dictionary<string, object>? metadata = null, Dictionary<string, string>? tags = null)
Parameters
- datasetName (string): Name of the dataset.
- dataPath (string): Path to the data file(s).
- description (string?): Optional description of this version.
- metadata (Dictionary<string, object>?): Optional additional metadata about the dataset.
- tags (Dictionary<string, string>?): Optional tags for categorizing the dataset.
Returns
- string
The unique identifier (version hash) for this dataset version.
Remarks
For Beginners: This saves a snapshot of your dataset with a unique identifier, like committing changes in Git.
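A minimal usage sketch, assuming an `IDataVersionControl<double>` implementation is available as `dvc`. The `VersioningExample` class, the dataset name, the path, and the metadata and tag values are illustrative placeholders, not values defined by the library.

```csharp
using System.Collections.Generic;
using AiDotNet.Interfaces;

// Hypothetical example class; only the CreateDatasetVersion call comes from the interface.
public static class VersioningExample
{
    // Registers a new version of a dataset and returns its version hash.
    public static string RegisterCleanedData(IDataVersionControl<double> dvc)
    {
        // Free-form metadata; keys and values here are placeholders.
        var metadata = new Dictionary<string, object>
        {
            ["rows"] = 10000,
            ["source"] = "crm-export"
        };

        // Tags for categorizing the dataset; also placeholders.
        var tags = new Dictionary<string, string> { ["stage"] = "cleaned" };

        // Named arguments make the optional parameters explicit.
        return dvc.CreateDatasetVersion(
            datasetName: "customers",
            dataPath: "data/customers.csv",
            description: "Removed duplicate rows",
            metadata: metadata,
            tags: tags);
    }
}
```

Because description, metadata, and tags all default to null, the shortest valid call passes only datasetName and dataPath.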
DeleteDatasetVersion(string, string)
Deletes a specific dataset version.
void DeleteDatasetVersion(string datasetName, string versionHash)
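Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Hash of the version to delete.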
GetDatasetByTag(string, string)
Gets a dataset version by its tag.
DatasetVersion<T> GetDatasetByTag(string datasetName, string tag)
Parameters
- datasetName (string): Name of the dataset.
- tag (string): The tag assigned to the version.
Returns
- DatasetVersion<T>
The dataset version with that tag.
GetDatasetForRun(string)
Gets the dataset version used by a specific training run.
DatasetVersion<T> GetDatasetForRun(string runId)
Parameters
- runId (string): ID of the training run.
Returns
- DatasetVersion<T>
Information about the dataset version used.
GetDatasetLineage(string, string)
Gets the lineage information for a dataset version.
DatasetLineage GetDatasetLineage(string datasetName, string versionHash)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version of the dataset.
Returns
- DatasetLineage
Lineage information showing how the dataset was created.
GetDatasetSnapshot(string)
Retrieves a dataset snapshot.
DatasetSnapshot GetDatasetSnapshot(string snapshotName)
Parameters
- snapshotName (string): Name of the snapshot.
Returns
- DatasetSnapshot
Information about all datasets in the snapshot.
GetDatasetStatistics(string, string)
Gets statistics about a dataset version.
DatasetStatistics<T> GetDatasetStatistics(string datasetName, string versionHash)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version of the dataset.
Returns
- DatasetStatistics<T>
Statistical summary of the dataset.
Remarks
For Beginners: This provides summary statistics about the dataset like number of rows, columns, data types, and basic descriptive statistics.
GetDatasetVersion(string, string?)
Retrieves a specific version of a dataset.
DatasetVersion<T> GetDatasetVersion(string datasetName, string? versionHash = null)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string?): The version hash to retrieve. If null, the latest version is returned.
Returns
- DatasetVersion<T>
Information about the dataset version.
GetLatestDatasetVersion(string)
Gets the latest version of a dataset.
DatasetVersion<T> GetLatestDatasetVersion(string datasetName)
Parameters
- datasetName (string): Name of the dataset.
Returns
- DatasetVersion<T>
The latest version of the dataset.
GetRunsUsingDataset(string, string)
Gets all training runs that used a specific dataset version.
List<string> GetRunsUsingDataset(string datasetName, string versionHash)
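Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version of the dataset.
Returns
- List<string>
IDs of the training runs that used the dataset version.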
LinkDatasetToRun(string, string, string, string?)
Links a dataset version to a model training run.
void LinkDatasetToRun(string datasetName, string versionHash, string runId, string? modelId = null)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version of the dataset.
- runId (string): ID of the training run or experiment.
- modelId (string?): Optional ID of the model that was trained.
Remarks
For Beginners: This creates a record showing which dataset version was used to train which model, enabling full reproducibility.
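A minimal usage sketch, assuming an `IDataVersionControl<double>` implementation is available as `dvc`. It pairs LinkDatasetToRun with GetRunsUsingDataset; the `RunLinkExample` class is a hypothetical wrapper, and the expectation that the new run appears in the returned list is an assumption about how an implementation behaves.

```csharp
using System.Collections.Generic;
using AiDotNet.Interfaces;

// Hypothetical example class; only the two interface calls come from this page.
public static class RunLinkExample
{
    // Records that a run used a dataset version, then lists every run linked to that version.
    public static List<string> LinkAndListRuns(
        IDataVersionControl<double> dvc,
        string datasetName,
        string versionHash,
        string runId)
    {
        // modelId is optional and left at its default (null) here.
        dvc.LinkDatasetToRun(datasetName, versionHash, runId);

        // Reverse lookup: which runs used this exact version?
        return dvc.GetRunsUsingDataset(datasetName, versionHash);
    }
}
```

The lookup in the other direction, from a run ID to the dataset version it used, is GetDatasetForRun.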
ListDatasetVersions(string)
Lists all versions of a dataset.
List<DatasetVersionInfo<T>> ListDatasetVersions(string datasetName)
Parameters
- datasetName (string): Name of the dataset.
Returns
- List<DatasetVersionInfo<T>>
List of all dataset versions with metadata.
ListDatasets(string?, Dictionary<string, string>?)
Lists all tracked datasets.
List<string> ListDatasets(string? filter = null, Dictionary<string, string>? tags = null)
Parameters
- filter (string?): Optional filter expression.
- tags (Dictionary<string, string>?): Optional tags to filter by.
Returns
- List<string>
The tracked datasets that match the filter and tags.
RecordDatasetLineage(string, string, DatasetLineage)
Records metadata about how a dataset was created or transformed.
void RecordDatasetLineage(string datasetName, string versionHash, DatasetLineage lineage)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version of the dataset.
- lineage (DatasetLineage): Lineage information (source datasets, transformations applied).
Remarks
For Beginners: Lineage tracks the "family history" of your dataset - where it came from, what preprocessing was applied, etc. This is crucial for understanding and reproducing your work.
TagDatasetVersion(string, string, string)
Tags a dataset version for easy reference.
void TagDatasetVersion(string datasetName, string versionHash, string tag)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version to tag.
- tag (string): The tag name to assign.
Remarks
For Beginners: Tags are like bookmarks - they let you give a version a memorable name like "production-data" or "v2-cleaned" instead of using the hash.
VerifyDatasetIntegrity(string, string, string)
Verifies that a dataset hasn't been modified by comparing its hash.
bool VerifyDatasetIntegrity(string datasetName, string versionHash, string currentDataPath)
Parameters
- datasetName (string): Name of the dataset.
- versionHash (string): Version to verify.
- currentDataPath (string): Current location of the data.
Returns
- bool
True if the data matches the version, false if modified.
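A minimal usage sketch, assuming an `IDataVersionControl<double>` implementation is available as `dvc`. It turns the boolean result into a guard that stops a training job before it reads modified data; the `IntegrityGate` class and the choice of exception type are illustrative assumptions.

```csharp
using System;
using AiDotNet.Interfaces;

// Hypothetical guard class; only the VerifyDatasetIntegrity call comes from the interface.
public static class IntegrityGate
{
    // Throws if the data at 'currentDataPath' no longer matches the recorded version.
    public static void EnsureUnmodified(
        IDataVersionControl<double> dvc,
        string datasetName,
        string versionHash,
        string currentDataPath)
    {
        bool intact = dvc.VerifyDatasetIntegrity(datasetName, versionHash, currentDataPath);
        if (!intact)
        {
            throw new InvalidOperationException(
                $"Dataset '{datasetName}' at '{currentDataPath}' does not match version {versionHash}.");
        }
    }
}
```

Calling the guard at the start of a training or evaluation job makes a mismatch fail fast instead of silently producing results from different data.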