Class MolecularDatasetLoader<T>

Namespace
AiDotNet.Data.Graph
Assembly
AiDotNet.dll

Loads molecular graph datasets (ZINC, QM9) for graph-level property prediction and generation.

public class MolecularDatasetLoader<T> : GraphDataLoaderBase<T>, IGraphDataLoader<T>, IDataLoader<T>, IResettable, ICountable, IBatchIterable<GraphData<T>>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
MolecularDatasetLoader<T>

Remarks

Molecular datasets represent molecules as graphs where atoms are nodes and chemical bonds are edges. These datasets are fundamental benchmarks for graph neural networks in drug discovery and materials science.

For Beginners: Molecular graphs represent chemistry as networks.

Graph Representation of Molecules:

Water (H₂O):
- Nodes: 3 atoms (O, H, H)
- Edges: 2 bonds (O-H, O-H)
- Node features: Atom type, charge, hybridization
- Edge features: Bond type (single, double, triple)
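
The water example above can be written out as a tiny data structure. This is a minimal sketch only: `Atom`, `Bond`, `BondType`, and `WaterGraph` are hypothetical illustration types and are not AiDotNet's `GraphData<T>`, whose layout is not described on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration types; these are NOT AiDotNet's GraphData<T>.
public enum BondType { Single, Double, Triple }

public sealed record Atom(string Element, int FormalCharge);

public sealed record Bond(int From, int To, BondType Type);

public static class WaterGraph
{
    // Nodes: one entry per atom; the list index serves as the node id.
    public static List<Atom> BuildAtoms() => new()
    {
        new Atom("O", 0), // node 0
        new Atom("H", 0), // node 1
        new Atom("H", 0), // node 2
    };

    // Edges: one entry per chemical bond (undirected, stored once).
    public static List<Bond> BuildBonds() => new()
    {
        new Bond(0, 1, BondType.Single), // O-H
        new Bond(0, 2, BondType.Single), // O-H
    };

    // Degree of a node = number of bonds touching it.
    public static int Degree(IEnumerable<Bond> bonds, int node)
    {
        int degree = 0;
        foreach (var b in bonds)
        {
            if (b.From == node || b.To == node) degree++;
        }
        return degree;
    }
}
```

Calling `WaterGraph.Degree(WaterGraph.BuildBonds(), 0)` counts the bonds on the oxygen atom, which is the kind of neighborhood information a GNN aggregates during message passing.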

Why model molecules as graphs?

  • Structure matters: Same atoms, different arrangement = different properties
    • Example: Diamond vs Graphite (both pure carbon!)
  • Bonds are relationships: Like social networks, but for atoms
  • GNNs fit naturally: Message passing lets each atom update its representation from its bonded neighbors, so the local chemical environment is captured

Major Molecular Datasets:

ZINC:

  • Size: 250,000 drug-like molecules (subset: 12,000)
  • Source: ZINC database (commercially available compounds)
  • Tasks: Graph regression on constrained solubility
  • Features:
    • Atoms: 28 atom types (charge and attached-hydrogen variants of the elements C, N, O, F, P, S, Cl, Br, I)
    • Bonds: Single, double, triple, aromatic
  • Use case: Drug discovery, molecular generation

QM9:

  • Size: 134,000 small organic molecules
  • Source: Quantum mechanical calculations
  • Tasks: Regression on 19 quantum properties
    • Energy, enthalpy, heat capacity
    • HOMO/LUMO gap (electronic properties)
    • Dipole moment, polarizability
  • Atoms: C, H, N, O, F (up to 9 heavy atoms)
  • Use case: Property prediction, molecular design

Constructors

MolecularDatasetLoader(MolecularDataset, int, string?, bool)

Initializes a new instance of the MolecularDatasetLoader<T> class.

public MolecularDatasetLoader(MolecularDatasetLoader<T>.MolecularDataset dataset, int batchSize = 32, string? dataPath = null, bool autoDownload = true)

Parameters

dataset MolecularDatasetLoader<T>.MolecularDataset

Which molecular dataset to load.

batchSize int

Number of molecules per batch.

dataPath string

Path to dataset files (optional, will download if not found).

autoDownload bool

Whether to automatically download the dataset if not found locally.

Remarks

Molecular datasets are loaded from SMILES strings or SDF files and converted to graph representations with appropriate features.

For Beginners: Using molecular datasets:

// Load QM9 for property prediction
var loader = new MolecularDatasetLoader<double>(
    MolecularDatasetLoader<double>.MolecularDataset.QM9,
    batchSize: 32,
    autoDownload: true);

// Load the data
await loader.LoadAsync();

// Create graph classification task
var task = loader.CreateGraphClassificationTask();

// Or for generation
var genTask = loader.CreateGraphGenerationTask();

Properties

Description

Gets a description of the dataset and its intended use.

public override string Description { get; }

Property Value

string

Name

Gets the human-readable name of this data loader.

public override string Name { get; }

Property Value

string

Remarks

Examples: "MNIST", "Cora Citation Network", "IMDB Reviews"

NumClasses

Gets the number of classes for classification tasks.

public override int NumClasses { get; }

Property Value

int

Methods

CreateGraphClassificationTask(double, double, int?)

Creates a graph classification task for datasets with multiple graphs.

public override GraphClassificationTask<T> CreateGraphClassificationTask(double trainRatio = 0.8, double valRatio = 0.1, int? seed = null)

Parameters

trainRatio double
valRatio double
seed int?

Returns

GraphClassificationTask<T>
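
To make the roles of `trainRatio`, `valRatio`, and `seed` concrete, here is a minimal sketch of a seeded graph-level split. It is an illustration under stated assumptions, not the library's implementation: `GraphSplitSketch` is a hypothetical helper, and it assumes that whatever remains after the training and validation shares becomes the test set.

```csharp
using System;
using System.Linq;

public static class GraphSplitSketch
{
    // Returns shuffled graph indices split into train/validation/test.
    // Assumption: the remainder after train and validation becomes test.
    public static (int[] Train, int[] Val, int[] Test) Split(
        int numGraphs, double trainRatio = 0.8, double valRatio = 0.1, int? seed = null)
    {
        if (numGraphs < 0) throw new ArgumentOutOfRangeException(nameof(numGraphs));
        if (trainRatio < 0 || valRatio < 0 || trainRatio + valRatio > 1.0)
            throw new ArgumentException("Ratios must be non-negative and sum to at most 1.");

        // A fixed seed makes the shuffle, and therefore the split, reproducible.
        var rng = seed.HasValue ? new Random(seed.Value) : new Random();
        int[] indices = Enumerable.Range(0, numGraphs).ToArray();

        // Fisher-Yates shuffle.
        for (int i = indices.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        int trainCount = (int)(numGraphs * trainRatio);
        int valCount = (int)(numGraphs * valRatio);

        return (
            indices[..trainCount],
            indices[trainCount..(trainCount + valCount)],
            indices[(trainCount + valCount)..]);
    }
}
```

With the defaults shown in the signature above (0.8 and 0.1), the remaining 10% of graphs would form the test set under this assumption. Passing the same `seed` twice yields the same split, which is useful when comparing models.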

CreateGraphGenerationTask()

Creates a graph generation task for molecular generation.

public GraphGenerationTask<T> CreateGraphGenerationTask()

Returns

GraphGenerationTask<T>

Graph generation task configured for molecular generation.

Remarks

For Beginners: Molecular generation with GNNs:

Goal: Create new, valid molecules with desired properties

Why it's hard:

  • Validity: Generated molecules must obey chemistry rules
  • Diversity: Don't generate the same molecules repeatedly
  • Novelty: Create new molecules, not just copy the training set
  • Property control: Generate molecules with specific properties
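
The first three criteria above are often reported as simple ratios. The sketch below shows one common way to compute them; it is an assumption-laden illustration, not part of AiDotNet. `GenerationMetricsSketch` is a hypothetical helper, molecules are assumed to be canonical SMILES strings so that string equality means molecular identity, and the validity check is supplied by the caller because real validation needs a chemistry toolkit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GenerationMetricsSketch
{
    // Fraction of generated molecules accepted by a caller-supplied validity check.
    public static double Validity(IReadOnlyList<string> generated, Func<string, bool> isValid)
        => generated.Count == 0 ? 0.0 : generated.Count(isValid) / (double)generated.Count;

    // Fraction of generated molecules that are distinct (a simple diversity proxy).
    public static double Uniqueness(IReadOnlyList<string> generated)
        => generated.Count == 0 ? 0.0 : generated.Distinct().Count() / (double)generated.Count;

    // Fraction of distinct generated molecules that do not appear in the training set.
    public static double Novelty(IReadOnlyList<string> generated, IEnumerable<string> training)
    {
        var trainingSet = new HashSet<string>(training);
        var distinct = generated.Distinct().ToList();
        if (distinct.Count == 0) return 0.0;
        return distinct.Count(s => !trainingSet.Contains(s)) / (double)distinct.Count;
    }
}
```

Property control is not covered here, since it depends on the specific property predictor being used.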

CreateLinkPredictionTask(double, double, int?)

Creates a link prediction task for predicting missing edges.

public override LinkPredictionTask<T> CreateLinkPredictionTask(double trainRatio = 0.85, double negativeRatio = 1, int? seed = null)

Parameters

trainRatio double
negativeRatio double
seed int?

Returns

LinkPredictionTask<T>
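
The `negativeRatio` parameter suggests that non-edges are sampled alongside real edges. The following sketch shows one plausible reading, in which `negativeRatio` is the number of sampled non-edges per existing edge. This interpretation and the `NegativeSamplingSketch` helper are assumptions for illustration; the page does not describe how the library samples negatives. The input is assumed to be an undirected graph without self-loops.

```csharp
using System;
using System.Collections.Generic;

public static class NegativeSamplingSketch
{
    // Samples node pairs that are NOT edges, about negativeRatio per positive edge.
    public static List<(int, int)> Sample(
        int numNodes, IReadOnlyCollection<(int From, int To)> edges,
        double negativeRatio = 1.0, int? seed = null)
    {
        var rng = seed.HasValue ? new Random(seed.Value) : new Random();

        // Store undirected edges in canonical (min, max) order for fast lookup.
        var existing = new HashSet<(int, int)>();
        foreach (var (u, v) in edges)
            existing.Add((Math.Min(u, v), Math.Max(u, v)));

        // Cap the target by the number of non-edges that actually exist.
        long maxPairs = (long)numNodes * (numNodes - 1) / 2;
        long available = maxPairs - existing.Count;
        int target = (int)Math.Min((long)Math.Round(edges.Count * negativeRatio), available);

        // Rejection sampling: draw random pairs and keep those that are not edges.
        var negatives = new HashSet<(int, int)>();
        while (negatives.Count < target)
        {
            int u = rng.Next(numNodes), v = rng.Next(numNodes);
            if (u == v) continue;
            var pair = (Math.Min(u, v), Math.Max(u, v));
            if (!existing.Contains(pair)) negatives.Add(pair);
        }
        return new List<(int, int)>(negatives);
    }
}
```

Rejection sampling is simple and works well for sparse graphs such as molecules, where most atom pairs are not bonded. For dense graphs, enumerating the non-edges directly would be a better choice.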

CreateNodeClassificationTask(double, double, int?)

Creates a node classification task with train/val/test split.

public override NodeClassificationTask<T> CreateNodeClassificationTask(double trainRatio = 0.1, double valRatio = 0.1, int? seed = null)

Parameters

trainRatio double
valRatio double
seed int?

Returns

NodeClassificationTask<T>

LoadDataCoreAsync(CancellationToken)

Core data loading implementation to be provided by derived classes.

protected override Task LoadDataCoreAsync(CancellationToken cancellationToken)

Parameters

cancellationToken CancellationToken

Cancellation token for async operation.

Returns

Task

A task that completes when loading is finished.

Remarks

Derived classes must implement this to perform actual data loading:

  • Load from files, databases, or remote sources
  • Parse and validate data format
  • Store in appropriate internal structures
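
As a rough illustration of the load, parse, and store steps with cancellation support, the sketch below reads one record per line and checks the token between reads. `CancellableLoadSketch` is a hypothetical helper written for this page; it does not reflect how `MolecularDatasetLoader<T>` reads its files, and the one-record-per-line format is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableLoadSketch
{
    // Reads one record per non-empty line, checking for cancellation between lines.
    public static async Task<List<string>> ReadRecordsAsync(
        TextReader reader, CancellationToken cancellationToken)
    {
        var records = new List<string>();
        string? line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            // Stop promptly if the caller cancelled the load.
            cancellationToken.ThrowIfCancellationRequested();

            // Minimal validation: skip blank lines, trim surrounding whitespace.
            string trimmed = line.Trim();
            if (trimmed.Length == 0) continue;

            // Store the parsed record in an in-memory structure.
            records.Add(trimmed);
        }
        return records;
    }
}
```

Accepting a `TextReader` keeps the sketch independent of the data source, so the same loop works for a file, a network stream, or an in-memory string in tests.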

UnloadDataCore()

Core data unloading implementation to be provided by derived classes.

protected override void UnloadDataCore()

Remarks

Derived classes should implement this to release resources:

  • Clear internal data structures
  • Release file handles or connections
  • Allow garbage collection of loaded data