Class MolecularDatasetLoader<T>
Loads molecular graph datasets (ZINC, QM9) for graph-level property prediction and generation.
public class MolecularDatasetLoader<T> : GraphDataLoaderBase<T>, IGraphDataLoader<T>, IDataLoader<T>, IResettable, ICountable, IBatchIterable<GraphData<T>>
Type Parameters
T: The numeric type used for calculations, typically float or double.
Inheritance
- MolecularDatasetLoader<T>
Implements
- IDataLoader<T>
Remarks
Molecular datasets represent molecules as graphs where atoms are nodes and chemical bonds are edges. These datasets are fundamental benchmarks for graph neural networks in drug discovery and materials science.
For Beginners: Molecular graphs represent chemistry as networks.
Graph Representation of Molecules:
Water (H₂O):
- Nodes: 3 atoms (O, H, H)
- Edges: 2 bonds (O-H, O-H)
- Node features: Atom type, charge, hybridization
- Edge features: Bond type (single, double, triple)
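As a rough illustration of the representation above, water can be written down as a small node list and edge list. The `Atom`, `Bond`, and `Molecule` records and the `MoleculeSketch` helper are hypothetical names invented for this sketch; they are not the library's `GraphData<T>` type.

```csharp
using System;
using System.Collections.Generic;

// Water (H2O): one oxygen bonded to two hydrogens.
var water = MoleculeSketch.Water();

Console.WriteLine($"Nodes: {water.Atoms.Count}, Edges: {water.Bonds.Count}");
foreach (var bond in water.Bonds)
{
    Console.WriteLine($"{water.Atoms[bond.A].Element}-{water.Atoms[bond.B].Element} ({bond.Type})");
}

// Hypothetical types for illustration only (not part of the library).
public enum BondType { Single, Double, Triple, Aromatic }
public record Atom(string Element, int FormalCharge);   // node features
public record Bond(int A, int B, BondType Type);        // edge between atom indices A and B
public record Molecule(IReadOnlyList<Atom> Atoms, IReadOnlyList<Bond> Bonds);

public static class MoleculeSketch
{
    public static Molecule Water() => new(
        new[] { new Atom("O", 0), new Atom("H", 0), new Atom("H", 0) },
        new[] { new Bond(0, 1, BondType.Single), new Bond(0, 2, BondType.Single) });

    // Degree of each atom = number of bonds that touch it.
    public static int[] Degrees(Molecule m)
    {
        var degrees = new int[m.Atoms.Count];
        foreach (var b in m.Bonds)
        {
            degrees[b.A]++;
            degrees[b.B]++;
        }
        return degrees;
    }
}
```

Storing bonds as index pairs keeps the structure separate from the per-atom features, which is the usual shape graph libraries expect.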
Why model molecules as graphs?
- Structure matters: Same atoms, different arrangement = different properties
- Example: Diamond vs Graphite (both pure carbon!)
- Bonds are relationships: Like social networks, but for atoms
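- GNNs fit naturally: Message passing lets each atom update its representation from its bonded neighbors, much as an atom's behavior depends on its local chemical environment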
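To make the message-passing idea concrete, here is a minimal sketch of one aggregation round on an undirected molecular graph: each atom's new value is its own value plus the sum of its bonded neighbors' values. This is a deliberately simplified, untrained update chosen for illustration. `MessagePassingSketch` is a hypothetical name, and this is not a description of how any layer in the library is implemented.

```csharp
using System;

// Water: node 0 = O, nodes 1 and 2 = H. The single feature is the atomic number.
double[] features = { 8.0, 1.0, 1.0 };
(int Src, int Dst)[] bonds = { (0, 1), (0, 2) };

double[] updated = MessagePassingSketch.Step(features, bonds);
Console.WriteLine(string.Join(", ", updated)); // prints: 10, 9, 9

public static class MessagePassingSketch
{
    // One round: h_i' = h_i + sum of h_j over bonded neighbors j.
    public static double[] Step(double[] h, (int Src, int Dst)[] edges)
    {
        var next = (double[])h.Clone();
        foreach (var (src, dst) in edges)
        {
            next[src] += h[dst];   // message dst -> src
            next[dst] += h[src];   // message src -> dst (bonds are undirected)
        }
        return next;
    }
}
```

After one round the oxygen has seen both hydrogens, and each hydrogen has seen the oxygen; stacking more rounds spreads information further across the molecule.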
Major Molecular Datasets:
ZINC:
- Size: 250,000 drug-like molecules (subset: 12,000)
- Source: ZINC database (commercially available compounds)
- Tasks: Graph regression on constrained solubility
- Features:
  - Atoms: C, N, O, F, P, S, Cl, Br, I (28 atom types once charge and hydrogen-count variants are counted)
  - Bonds: Single, double, triple, aromatic
- Use case: Drug discovery, molecular generation
QM9:
- Size: 134,000 small organic molecules
- Source: Quantum mechanical calculations
- Tasks: Regression on 19 quantum properties
  - Energy, enthalpy, heat capacity
  - HOMO/LUMO gap (electronic properties)
  - Dipole moment, polarizability
- Atoms: C, H, N, O, F (up to 9 heavy atoms)
- Use case: Property prediction, molecular design
Constructors
MolecularDatasetLoader(MolecularDataset, int, string?, bool)
Initializes a new instance of the MolecularDatasetLoader<T> class.
public MolecularDatasetLoader(MolecularDatasetLoader<T>.MolecularDataset dataset, int batchSize = 32, string? dataPath = null, bool autoDownload = true)
Parameters
dataset (MolecularDatasetLoader<T>.MolecularDataset): Which molecular dataset to load.
batchSize (int): Number of molecules per batch.
dataPath (string?): Path to dataset files (optional; the dataset is downloaded if not found).
autoDownload (bool): Whether to automatically download the dataset if not found locally.
Remarks
Molecular datasets are loaded from SMILES strings or SDF files and converted to graph representations with appropriate features.
For Beginners: Using molecular datasets:
// Load QM9 for property prediction
var loader = new MolecularDatasetLoader<double>(
    MolecularDatasetLoader<double>.MolecularDataset.QM9,
    batchSize: 32,
    autoDownload: true);

// Load the data
await loader.LoadAsync();

// Create graph classification task
var task = loader.CreateGraphClassificationTask();

// Or for generation
var genTask = loader.CreateGraphGenerationTask();
Properties
Description
Gets a description of the dataset and its intended use.
public override string Description { get; }
Property Value
- string
Name
Gets the human-readable name of this data loader.
public override string Name { get; }
Property Value
- string
Remarks
Examples: "MNIST", "Cora Citation Network", "IMDB Reviews"
NumClasses
Gets the number of classes for classification tasks.
public override int NumClasses { get; }
Property Value
- int
Methods
CreateGraphClassificationTask(double, double, int?)
Creates a graph classification task for datasets with multiple graphs.
public override GraphClassificationTask<T> CreateGraphClassificationTask(double trainRatio = 0.8, double valRatio = 0.1, int? seed = null)
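Parameters
trainRatio (double): Defaults to 0.8.
valRatio (double): Defaults to 0.1.
seed (int?): Defaults to null.
Returns
- GraphClassificationTask<T>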
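The parameter names suggest a seeded random partition of the graphs into training, validation, and test sets. The sketch below shows one way such a split could work, assuming the test set is whatever remains after the training and validation fractions are taken. `SplitSketch` is a hypothetical helper written for this illustration and makes no claim about the library's actual splitting logic.

```csharp
using System;
using System.Linq;

var (train, val, test) = SplitSketch.Split(graphCount: 1000, trainRatio: 0.8, valRatio: 0.1, seed: 42);
Console.WriteLine($"train={train.Length}, val={val.Length}, test={test.Length}");

public static class SplitSketch
{
    public static (int[] Train, int[] Val, int[] Test) Split(
        int graphCount, double trainRatio = 0.8, double valRatio = 0.1, int? seed = null)
    {
        if (trainRatio < 0 || valRatio < 0 || trainRatio + valRatio > 1)
            throw new ArgumentException("Ratios must be non-negative and sum to at most 1.");

        // A fixed seed makes the shuffle, and therefore the split, repeatable.
        var rng = seed.HasValue ? new Random(seed.Value) : new Random();
        int[] indices = Enumerable.Range(0, graphCount).ToArray();

        // Fisher-Yates shuffle of the graph indices.
        for (int i = indices.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        int trainCount = (int)(graphCount * trainRatio);
        int valCount = (int)(graphCount * valRatio);

        // Assumption: everything not assigned to train or val becomes the test set.
        return (indices[..trainCount],
                indices[trainCount..(trainCount + valCount)],
                indices[(trainCount + valCount)..]);
    }
}
```

Splitting index arrays rather than the graphs themselves keeps the three sets disjoint by construction and avoids copying graph data.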
CreateGraphGenerationTask()
Creates a graph generation task for molecular generation.
public GraphGenerationTask<T> CreateGraphGenerationTask()
Returns
- GraphGenerationTask<T>
Graph generation task configured for molecular generation.
Remarks
For Beginners: Molecular generation with GNNs:
Goal: Create new, valid molecules with desired properties
Why it's hard:
- Validity: Generated molecules must obey chemistry rules
- Diversity: Avoid generating the same molecules repeatedly
- Novelty: Create new molecules rather than copies of the training set
- Property control: Generate molecules with specific properties
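The first three criteria above are commonly turned into simple ratios over a batch of generated molecules. The sketch below computes them over SMILES strings; `GenerationMetricsSketch` and `GenerationMetrics` are hypothetical names, and the validity check is a placeholder that only tests for balanced parentheses. A real pipeline would use a cheminformatics toolkit for validity and would canonicalize SMILES before comparing them, since plain string comparison treats two spellings of the same molecule as different. Property control is not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] training = { "CCO", "CC", "c1ccccc1" };
string[] generated = { "CCO", "CCN", "CCN", "C(C", "CC(=O)O" };

var m = GenerationMetricsSketch.Evaluate(
    generated, training, GenerationMetricsSketch.HasBalancedParentheses);
Console.WriteLine($"validity={m.Validity:F2}, uniqueness={m.Uniqueness:F2}, novelty={m.Novelty:F2}");

public record GenerationMetrics(double Validity, double Uniqueness, double Novelty);

public static class GenerationMetricsSketch
{
    public static GenerationMetrics Evaluate(
        IReadOnlyCollection<string> generated,
        IEnumerable<string> trainingSet,
        Func<string, bool> isValid)
    {
        var valid = generated.Where(isValid).ToList();      // passes the validity check
        var unique = valid.Distinct().ToList();              // duplicates removed
        var seen = new HashSet<string>(trainingSet);
        int novel = unique.Count(s => !seen.Contains(s));    // not present in the training set

        return new GenerationMetrics(
            Validity: Ratio(valid.Count, generated.Count),
            Uniqueness: Ratio(unique.Count, valid.Count),
            Novelty: Ratio(novel, unique.Count));
    }

    // Placeholder validity check (assumption): only verifies balanced parentheses.
    public static bool HasBalancedParentheses(string smiles)
    {
        int depth = 0;
        foreach (char c in smiles)
        {
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false;
        }
        return depth == 0;
    }

    private static double Ratio(int part, int whole) => whole == 0 ? 0.0 : (double)part / whole;
}
```

In this toy batch, four of five strings pass the placeholder check, three of those four are distinct, and two of those three do not appear in the training set.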
CreateLinkPredictionTask(double, double, int?)
Creates a link prediction task for predicting missing edges.
public override LinkPredictionTask<T> CreateLinkPredictionTask(double trainRatio = 0.85, double negativeRatio = 1, int? seed = null)
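Parameters
trainRatio (double): Defaults to 0.85.
negativeRatio (double): Defaults to 1.
seed (int?): Defaults to null.
Returns
- LinkPredictionTask<T>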
CreateNodeClassificationTask(double, double, int?)
Creates a node classification task with train/val/test split.
public override NodeClassificationTask<T> CreateNodeClassificationTask(double trainRatio = 0.1, double valRatio = 0.1, int? seed = null)
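Parameters
trainRatio (double): Defaults to 0.1.
valRatio (double): Defaults to 0.1.
seed (int?): Defaults to null.
Returns
- NodeClassificationTask<T>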
LoadDataCoreAsync(CancellationToken)
Core data loading implementation to be provided by derived classes.
protected override Task LoadDataCoreAsync(CancellationToken cancellationToken)
Parameters
cancellationToken (CancellationToken): Cancellation token for async operation.
Returns
- Task
A task that completes when loading is finished.
Remarks
Derived classes must implement this to perform actual data loading:
- Load from files, databases, or remote sources
- Parse and validate data format
- Store in appropriate internal structures
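The load, parse, and store steps can be sketched with a small stand-alone class that reads one SMILES string per line and honors a cancellation token. `SmilesFileLoaderSketch` is a hypothetical class written for this illustration; it does not derive from the library's base loader, and the one-SMILES-per-line file format is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Demo: write a tiny SMILES file, load it, then unload it.
string path = Path.Combine(Path.GetTempPath(), "smiles_sketch.txt");
File.WriteAllLines(path, new[] { "CCO", "", "c1ccccc1" });

var loader = new SmilesFileLoaderSketch(path);
await loader.LoadAsync();
Console.WriteLine($"Loaded {loader.Count} molecules");
loader.Unload();
Console.WriteLine($"After unload: {loader.Count}");

// Hypothetical stand-alone class for illustration only.
public class SmilesFileLoaderSketch
{
    private readonly string _path;
    private List<string>? _smiles;

    public SmilesFileLoaderSketch(string path) => _path = path;

    public int Count => _smiles?.Count ?? 0;

    public async Task LoadAsync(CancellationToken cancellationToken = default)
    {
        if (!File.Exists(_path))
            throw new FileNotFoundException("Dataset file not found.", _path);

        var parsed = new List<string>();
        using var reader = new StreamReader(_path);        // 1. load from a file
        string? line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            cancellationToken.ThrowIfCancellationRequested();
            line = line.Trim();                             // 2. parse and validate
            if (line.Length == 0) continue;                 //    (here: skip blank lines)
            parsed.Add(line);
        }
        _smiles = parsed;                                   // 3. store only after a full, successful read
    }

    // Dropping the reference lets the garbage collector reclaim the loaded data.
    public void Unload() => _smiles = null;
}
```

Assigning the parsed list only at the end means a cancelled or failed load leaves any previously loaded data untouched.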
UnloadDataCore()
Core data unloading implementation to be provided by derived classes.
protected override void UnloadDataCore()
Remarks
Derived classes should implement this to release resources:
- Clear internal data structures
- Release file handles or connections
- Allow garbage collection of loaded data