QHFlow: High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction


Seongsu Kim, Nayoung Kim, Dongwoo Kim, and Sungsoo Ahn @ KAIST SPML Lab (Aug 2025)

🌟 [NeurIPS '25 Spotlight] This repository contains the official implementation of QHFlow for DFT Hamiltonian prediction. The repository is still being updated.


Packages and Requirements

All code is tested and confirmed to work with Python 3.12 and CUDA 12.1. A similar environment should also work, as this project does not rely on rapidly changing packages.

# Example CUDA 12.1 with torch 2.4.1
conda create -n qhflow python=3.12 psi4 -y
conda activate qhflow

pip install pyscf==2.10.0
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric==2.3.0
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html

pip install -r requirements.txt
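
After installation, a quick way to confirm the environment is complete is to check that the key packages resolve. The helper below is stdlib-only; the package list is an assumption based on the install commands above:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages assumed from the install steps above.
required = ["torch", "torch_geometric", "torch_scatter", "pyscf"]
print(missing_packages(required) or "environment looks complete")
```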

Directory and Files

The project follows this directory structure (will be updated soon):

.
├── src/                       # Source code (Python files should be run here)
│   ├── experiment/            # Training/finetuning/inference entrypoints
│   ├── config_md17/           # MD17 configs (dataset/model)
│   ├── config_qh9/            # QH9 configs (dataset/model)
│   ├── dataset_module/        # Dataset loaders and split utilities
│   │   ├── qh9_datasets_shard.py    # Main QH9 dataset classes with LMDB sharding
│   │   ├── lmdb_shard.py            # LMDB sharding utilities for efficient data loading
│   │   ├── data_dft_utils.py        # DFT calculation utilities (overlap, Hamiltonian)
│   │   ├── ori_dataset.py           # MD17 dataset implementations
│   │   └── qh9_datasets_split.py    # Legacy dataset split utilities (deprecated)
│   ├── models/                # QHFlow / QHNet
│   ├── pl_module/             # PyTorch Lightning modules
│   ├── utils.py
│   ...
├── dataset/                   # Data root (auto or manual download)
├── _my_scripts/               # Helper scripts for dataset processing
├── requirements.txt
├── ckpts/                     # Pretrained/finetuned checkpoint files
├── README.md
...

Project setup

Dataset

MD17 is downloaded automatically, but the QH9 dataset requires manual download due to gdown instability.

To download QH9, use the commands below:

mkdir -p ./dataset/QH9Stable/raw/
gdown https://drive.google.com/uc?id=1LcEJGhB8VUGkuyb0oQ_9ANJdSkky9xMS -O ./dataset/QH9Stable/raw/QH9Stable.db

mkdir -p ./dataset/QH9Dynamic_300k/raw/
gdown https://drive.google.com/uc?id=1sbf-sFhh3ZmhXgTcN2ke_la39MaG0Yho -O ./dataset/QH9Dynamic_300k/raw/QH9Dynamic_300k.db

Processing from raw files to torch datasets runs automatically on the first training run. Alternatively, you can process the data manually with the sharding script:

python -m dataset_module.qh9_datasets_shard \
    --name=${NAME}  \
    --num_chunks=30 --chunk_idx=${DB_IDX} \
    --split=${SPLIT}

where NAME is the dataset name (QH9Stable / QH9Dynamic). Use the following SPLIT options:

  • QH9Stable: random, size_ood
  • QH9Dynamic: geometry, mol

Data is assembled automatically when the final chunk is processed.
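
To illustrate the chunking idea, the sketch below shows one way num_chunks/chunk_idx can partition sample indices into contiguous, non-overlapping ranges. This is a hypothetical helper, not the repository's actual sharding code in dataset_module/qh9_datasets_shard.py:

```python
def chunk_indices(num_samples: int, num_chunks: int, chunk_idx: int) -> range:
    """Contiguous index range handled by one chunk; earlier chunks absorb
    the remainder, so every sample is covered exactly once."""
    base, rem = divmod(num_samples, num_chunks)
    start = chunk_idx * base + min(chunk_idx, rem)
    size = base + (1 if chunk_idx < rem else 0)
    return range(start, start + size)

# e.g. 100,000 samples split into 30 chunks: each chunk gets 3333 or 3334 samples
sizes = [len(chunk_indices(100_000, 30, i)) for i in range(30)]
print(min(sizes), max(sizes))
```

Each chunk can then be processed by an independent job (one chunk_idx per invocation), which is why the final chunk triggers the automatic assembly step.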

Note

  • The legacy qh9_datasets_split.py module will be deprecated. Use qh9_datasets_shard.py for all new dataset processing operations.
  • We plan to provide pre-processed versions of all datasets for easier setup and usage.

Checkpoints

We plan to provide pre-trained model checkpoints for all datasets. Currently, we can provide checkpoints upon request. The checkpoint files are organized as follows:

MD17 Dataset:

ckpts/md17/${DATASET}/checkpoints/weights.ckpt
# ckpt=../ckpts/md17/water/checkpoints/weights.ckpt           # Example

QH9 Dataset:

ckpts/${DATASET}/${SPLIT}/checkpoints/weights.ckpt       # Pretrained
ckpts/${DATASET}/${SPLIT}-FT/checkpoints/weights.ckpt    # Finetuned

# ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt     # Example (Pretrained)
# ckpt=${ROOT}/ckpts/QH9Stable/random-FT/checkpoints/weights.ckpt  # Example (Finetuned)

Where ${DATASET} and ${SPLIT} should be replaced with the specific dataset and split names:

  • MD17 DATASET: ethanol, malondialdehyde, uracil, water
  • QH9 DATASET: QH9Stable, QH9Dynamic
    • QH9Stable SPLIT: random, size_ood
    • QH9Dynamic SPLIT: geometry, mol

To use these checkpoints, specify the path in the ckpt parameter when running inference or prediction commands. ${ROOT} is the path of this repository or the parent path of the checkpoints directory.

Usage

Prerequisites: All commands should be run from the QHFlow/src directory.

Available Datasets

  • MD17 DATASET: ethanol, malondialdehyde, uracil, water
  • QH9 DATASET: QH9Stable, QH9Dynamic
    • QH9Stable SPLIT (dataset.split): random, size_ood
    • QH9Dynamic SPLIT (dataset.split): geometry, mol

Tips

Training Tips:

  • You can enable Weights & Biases logging with wandb.mode=online
  • Training automatically resumes when interrupted and restarted.
  • Use CUDA_VISIBLE_DEVICES to specify GPU devices: CUDA_VISIBLE_DEVICES=0,1 python -m experiment.train_md17 dataset=water

Performance Tips:

  • For faster training, you can use multiple GPUs. For example, CUDA_VISIBLE_DEVICES=0,1,2,3 with strategy=ddp devices=4
  • Monitor GPU memory usage and adjust batch size if needed

Debugging Tips:

  • Check logs in the logs/ directory for detailed training information
  • Monitor validation metrics to ensure proper training progress

Training and Inference

Training from scratch

python -m experiment.train_md17 dataset=${DATASET}
python -m experiment.train_qh9  dataset=${DATASET} dataset.split=${SPLIT}

Examples:

# Train MD17 model
python -m experiment.train_md17 dataset=water

# Train QH9 model
python -m experiment.train_qh9 dataset=QH9Stable dataset.split=random

Finetuning

(Note: finetuning is currently not working; this will be fixed.) Finetuning requires a pretrained model as a starting point, specified via the original_ckpt parameter:

python -m experiment.train_qh9-finetune \
  dataset=${DATASET} \
  dataset.split=${SPLIT} \
  +original_ckpt=${PRETRAINED_CKPT}

Example:

python -m experiment.train_qh9-finetune \
  dataset=QH9Stable \
  dataset.split=random \
  +original_ckpt=../ckpts/QH9Stable/random/checkpoints/weights.ckpt

Inference

SCF acceleration measurement

python -m experiment.train_md17 \
  mode=inference \
  dataset=${DATASET} \
  ckpt=${CKPT}

python -m experiment.train_qh9 \
  mode=inference \
  dataset=${DATASET} \
  dataset.split=${SPLIT} \
  ckpt=${CKPT}

Examples:

# MD17 inference
python -m experiment.train_md17 \
  mode=inference \
  dataset=water \
  ckpt=${ROOT}/ckpts/md17/water/checkpoints/weights.ckpt

# QH9 inference
python -m experiment.train_qh9 \
  mode=inference \
  dataset=QH9Stable \
  dataset.split=random \
  ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt

Prediction (Saving the outputs)

This mode is used to predict test files and save individual Hamiltonian matrices for each sample. The predictions are saved to disk for further analysis.

Output Format:

  • Hamiltonian matrices are saved as individual files
  • Each prediction corresponds to a test sample
  • Files are organized by dataset and model configuration

python -m experiment.train_md17 \
  mode=predict \
  dataset=${DATASET} \
  ckpt=${CKPT}

python -m experiment.train_qh9 \
  mode=predict \
  dataset=${DATASET} \
  dataset.split=${SPLIT} \
  ckpt=${CKPT}

Examples:

# MD17 prediction
python -m experiment.train_md17 \
  mode=predict \
  dataset=water \
  ckpt=${ROOT}/ckpts/md17/water/checkpoints/weights.ckpt

# QH9 prediction
python -m experiment.train_qh9 \
  mode=predict \
  dataset=QH9Stable \
  dataset.split=random \
  ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt

Output Location:

  • Predictions are saved to the outputs/ directory by default

πŸ“š Citation

@article{kim2025high,
  title={High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction},
  author={Kim, Seongsu and Kim, Nayoung and Kim, Dongwoo and Ahn, Sungsoo},
  journal={arXiv preprint arXiv:2505.18817},
  year={2025}
}

πŸ–‡οΈ Acknowledgements

This project is based on the repo AIRS (QHNet).

MD17 Dataset: Revised MD17 dataset (rMD17)
QH9 Dataset: QHBench/QH9
