Seongsu Kim, Nayoung Kim, Dongwoo Kim, and Sungsoo Ahn @ KAIST SPML Lab (Aug 2025)
[NeurIPS '25 Spotlight] This repository contains an implementation of QHFlow for DFT Hamiltonian prediction. This repository is still being updated.
All code is tested and confirmed to work with Python 3.12 and CUDA 12.1. A similar environment should also work, as this project does not rely on rapidly changing packages.
```bash
# Example: CUDA 12.1 with torch 2.4.1
conda create -n qhflow python=3.12 psi4 -y
conda activate qhflow
pip install pyscf==2.10.0
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric==2.3.0
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
pip install -r requirements.txt
```

The project follows this directory structure (will be updated soon):
```
.
├── src/                            # Source code (python files should be run here)
│   ├── experiment/                 # Training/finetune/inference entrypoints
│   ├── config_md17/                # MD17 configs (dataset/model)
│   ├── config_qh9/                 # QH9 configs (dataset/model)
│   ├── dataset_module/             # Dataset loaders and split utilities
│   │   ├── qh9_datasets_shard.py   # Main QH9 dataset classes with LMDB sharding
│   │   ├── lmdb_shard.py           # LMDB sharding utilities for efficient data loading
│   │   ├── data_dft_utils.py       # DFT calculation utilities (overlap, Hamiltonian)
│   │   ├── ori_dataset.py          # MD17 dataset implementations
│   │   └── qh9_datasets_split.py   # Legacy dataset split utilities (deprecated)
│   ├── models/                     # QHFlow / QHNet
│   ├── pl_module/                  # PyTorch Lightning modules
│   ├── utils.py
│   └── ...
├── dataset/                        # Data root (auto or manual download)
├── _my_scripts/                    # Helper scripts for dataset processing
├── requirements.txt
├── ckpts/                          # Pretrained/finetuned checkpoint files
├── README.md
└── ...
```
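Before moving on to the datasets, you can sanity-check the environment installed above. This is a minimal sketch assuming the versions from the install commands; it is not a project script.

```python
# Quick sanity check of the environment set up above (a minimal sketch).
import torch

print(torch.__version__)          # should report 2.4.1 per the install commands
print(torch.cuda.is_available())  # True when the CUDA 12.1 build sees a GPU
```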
MD17 is downloaded automatically, but the QH9 dataset requires manual download due to gdown instability.
To download QH9, use the commands below:
```bash
mkdir -p ./dataset/QH9Stable/raw/
gdown https://drive.google.com/uc?id=1LcEJGhB8VUGkuyb0oQ_9ANJdSkky9xMS -O ./dataset/QH9Stable/raw/QH9Stable.db
mkdir -p ./dataset/QH9Dynamic_300k/raw/
gdown https://drive.google.com/uc?id=1sbf-sFhh3ZmhXgTcN2ke_la39MaG0Yho -O ./dataset/QH9Dynamic_300k/raw/QH9Dynamic_300k.db
```

Processing from raw files to torch datasets runs automatically on the first training run. Alternatively, you can process manually with the sharding procedure:
```bash
python -m dataset_module.qh9_datasets_shard \
    --name=${NAME} \
    --num_chunks=30 --chunk_idx=${DB_IDX} \
    --split=${SPLIT}
```

where `NAME` is the dataset name (`QH9Stable` / `QH9Dynamic`). Use the following `SPLIT` options:

- QH9Stable: `random`, `size_ood`
- QH9Dynamic: `geometry`, `mol`
Data is assembled automatically when the final chunk is processed.
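Since every chunk index must be processed before assembly, a small driver can enumerate the per-chunk commands. This is an illustrative sketch built around the command line above, not a utility shipped with the repository.

```python
# Sketch: enumerate the per-chunk sharding commands described above.
# Run the resulting commands from src/ (sequentially or in parallel);
# the dataset is assembled automatically once the final chunk finishes.
NUM_CHUNKS = 30


def shard_commands(name: str, split: str, num_chunks: int = NUM_CHUNKS) -> list[str]:
    """Build one qh9_datasets_shard command per chunk index."""
    return [
        "python -m dataset_module.qh9_datasets_shard "
        f"--name={name} --num_chunks={num_chunks} --chunk_idx={idx} --split={split}"
        for idx in range(num_chunks)
    ]


cmds = shard_commands("QH9Stable", "random")
print(len(cmds))   # 30
print(cmds[0])
```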
Note
- The legacy `qh9_datasets_split.py` module will be deprecated. Use `qh9_datasets_shard.py` for all new dataset processing operations.
- We plan to provide pre-processed datasets for all datasets to facilitate easier setup and usage.
We plan to provide pre-trained model checkpoints for all datasets. Currently, we can provide checkpoints upon request. The checkpoint files are organized as follows:
MD17 Dataset:

```bash
ckpts/md17/${DATASET}/checkpoints/weights.ckpt
# ckpt=../ckpts/md17/water/checkpoints/weights.ckpt  # Example
```

QH9 Dataset:

```bash
ckpts/${DATASET}/${SPLIT}/checkpoints/weights.ckpt     # Pretrained
ckpts/${DATASET}/${SPLIT}-FT/checkpoints/weights.ckpt  # Finetuned
# ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt     # Example (Pretrained)
# ckpt=${ROOT}/ckpts/QH9Stable/random-FT/checkpoints/weights.ckpt  # Example (Finetuned)
```

where `${DATASET}` and `${SPLIT}` should be replaced with the specific dataset and split names:
- MD17 DATASET: `ethanol`, `malondialdehyde`, `uracil`, `water`
- QH9 DATASET: `QH9Stable`, `QH9Dynamic`
- QH9Stable SPLIT: `random`, `size_ood`
- QH9Dynamic SPLIT: `geometry`, `mol`
To use these checkpoints, specify the path in the `ckpt` parameter when running inference or prediction commands. `${ROOT}` is the path of this repository, or the parent path of the `ckpts` directory.
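The layout above can be captured in a small path helper. This is an illustrative sketch, not part of the repository; `ROOT` here assumes you run from `src/`, so adjust it to your setup.

```python
from pathlib import Path

# Sketch of the checkpoint layout described above. ROOT is this repository's
# path (or the parent of ckpts/); ".." assumes you are inside src/.
ROOT = Path("..")


def md17_ckpt(dataset: str) -> Path:
    """Checkpoint path for an MD17 molecule (ethanol, water, ...)."""
    return ROOT / "ckpts" / "md17" / dataset / "checkpoints" / "weights.ckpt"


def qh9_ckpt(dataset: str, split: str, finetuned: bool = False) -> Path:
    """Checkpoint path for a QH9 dataset/split; '-FT' marks finetuned runs."""
    split_dir = f"{split}-FT" if finetuned else split
    return ROOT / "ckpts" / dataset / split_dir / "checkpoints" / "weights.ckpt"


print(md17_ckpt("water"))
print(qh9_ckpt("QH9Stable", "random", finetuned=True))
```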
Prerequisites
All commands should be run from the `QHFlow/src` directory.
Available Datasets
- MD17 DATASET: `ethanol`, `malondialdehyde`, `uracil`, `water`
- QH9 DATASET: `QH9Stable`, `QH9Dynamic`
- QH9Stable SPLIT (`dataset.split`): `random`, `size_ood`
- QH9Dynamic SPLIT (`dataset.split`): `geometry`, `mol`
Training Tips:
- You can enable Weights & Biases logging with `wandb.mode=online`.
- Training automatically resumes when interrupted and restarted.
- Use `CUDA_VISIBLE_DEVICES` to specify GPU devices: `CUDA_VISIBLE_DEVICES=0,1 python -m experiment.train_md17 dataset=water`
Performance Tips:
- For faster training, you can use multiple GPUs, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3` with `strategy=ddp devices=4`.
- Monitor GPU memory usage and adjust the batch size if needed.
Debugging Tips:
- Check logs in the `logs/` directory for detailed training information.
- Monitor validation metrics to ensure proper training progress.
```bash
python -m experiment.train_md17 dataset=${DATASET}
python -m experiment.train_qh9 dataset=${DATASET} dataset.split=${SPLIT}
```

Examples:

```bash
# Train MD17 model
python -m experiment.train_md17 dataset=water

# Train QH9 model
python -m experiment.train_qh9 dataset=QH9Stable dataset.split=random
```

(Note: this is currently not working and will be fixed.) Finetuning requires a pretrained model as a starting point, which is specified via the `original_ckpt` parameter:
```bash
python -m experiment.train_qh9-finetune \
    dataset=${DATASET} \
    dataset.split=${SPLIT} \
    +original_ckpt=${PRETRAINED_CKPT}
```

Example:

```bash
python -m experiment.train_qh9-finetune \
    dataset=QH9Stable \
    dataset.split=random \
    +original_ckpt=../ckpts/QH9Stable/random/checkpoints/weights.ckpt
```

SCF acceleration measurement
```bash
python -m experiment.train_md17 \
    mode=inference \
    dataset=${DATASET} \
    ckpt=${CKPT}

python -m experiment.train_qh9 \
    mode=inference \
    dataset=${DATASET} \
    dataset.split=${SPLIT} \
    ckpt=${CKPT}
```

Examples:
```bash
# MD17 inference
python -m experiment.train_md17 \
    mode=inference \
    dataset=water \
    ckpt=${ROOT}/ckpts/md17/water/checkpoints/weights.ckpt

# QH9 inference
python -m experiment.train_qh9 \
    mode=inference \
    dataset=QH9Stable \
    dataset.split=random \
    ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt
```

This mode runs prediction on the test files and saves an individual Hamiltonian matrix for each sample. The predictions are saved to disk for further analysis.
Output Format:
- Hamiltonian matrices are saved as individual files
- Each prediction corresponds to a test sample
- Files are organized by dataset and model configuration
```bash
python -m experiment.train_md17 \
    mode=predict \
    dataset=${DATASET} \
    ckpt=${CKPT}

python -m experiment.train_qh9 \
    mode=predict \
    dataset=${DATASET} \
    dataset.split=${SPLIT} \
    ckpt=${CKPT}
```

Examples:
```bash
# MD17 prediction
python -m experiment.train_md17 \
    mode=predict \
    dataset=water \
    ckpt=${ROOT}/ckpts/md17/water/checkpoints/weights.ckpt

# QH9 prediction
python -m experiment.train_qh9 \
    mode=predict \
    dataset=QH9Stable \
    dataset.split=random \
    ckpt=${ROOT}/ckpts/QH9Stable/random/checkpoints/weights.ckpt
```

Output Location:
- Predictions are saved in the `outputs/` directory by default.
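Once predictions are on disk, a quick sanity check on a loaded matrix is its symmetry. The loading path below is hypothetical (check your run's `outputs/` directory for the actual file names and format); the check itself is a generic sketch, not a project utility.

```python
import torch


def looks_like_hamiltonian(H: torch.Tensor, atol: float = 1e-6) -> bool:
    """A DFT Hamiltonian in a real AO basis is a symmetric square matrix."""
    return H.ndim == 2 and H.shape[0] == H.shape[1] and torch.allclose(H, H.T, atol=atol)


# H = torch.load("outputs/.../hamiltonian_0.pt")  # hypothetical path and format
H = torch.rand(24, 24)
H = 0.5 * (H + H.T)  # stand-in for a predicted Hamiltonian
print(looks_like_hamiltonian(H))  # True
```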
```bibtex
@article{kim2025high,
  title={High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction},
  author={Kim, Seongsu and Kim, Nayoung and Kim, Dongwoo and Ahn, Sungsoo},
  journal={arXiv preprint arXiv:2505.18817},
  year={2025}
}
```
This project is based on the repo AIRS (QHNet).
- MD17 Dataset: Revised MD17 dataset (rMD17)
- QH9 Dataset: QHBench/QH9