H2MOF: Predicting Hydrogen Adsorption in MOFs with Geometric Descriptors, Energy Voxels, and Molecular Point Clouds
This repository contains the code for the study “Predicting Hydrogen Adsorption in MOFs: A Comparative Analysis of Geometric Descriptors, Energy Voxels and Molecular Point Clouds”.
It compares three representation–model pairs using consistent data splits and identical evaluation:
- RF on five geometric descriptors (void_fraction, surface_area_m2g, surface_area_m2cm3, pld, lcd)
- RetNet, a compact 3D CNN on Boltzmann-transformed energy voxels
- PointNet on molecular point clouds via the AIdsorb pipeline
Requirements: Python ≥ 3.10. A GPU is recommended for the voxel and point-cloud models.
- (Optional but recommended) Install PyTorch from its official instructions for your platform/GPU.
- Install this package, using extras depending on which methods you plan to run:
```bash
# Base (RF + common utilities)
pip install -e .

# Energy voxels (adds pymoxel)
pip install -e .[voxels]

# Molecular point clouds (adds AIdsorb + Lightning)
pip install -e .[pointclouds]

# Everything
pip install -e .[all]
```

Download the hMOF JSONs from MOFX-DB (mof.tech.northwestern.edu) and place them under a folder such as `hmof_json/`, where files are named like `hMOF-<id>.json`. The included CLI builds a single labels table gathering the five descriptors and the 77 K H₂ adsorption targets at 2 and 100 bar:
```bash
python -m h2mof labels \
    --input-dir hmof_json \
    --output-csv hmof_data.csv
```

This produces a CSV indexed by `MOF_name` with columns:
- `void_fraction`, `surface_area_m2g`, `surface_area_m2cm3`, `pld`, `lcd`
- `H2_adsorption_2bar`, `H2_adsorption_100bar`
By default the project looks for hmof_data.csv in the project root (see Paths below).
Two method-specific assets are required:
Use the moxel CLI to convert CIFs to 3D potential grids (stored as .npy):
```bash
moxel hmof_cif hmof_voxels \
    --grid_size 25 \
    --cutoff 10 \
    --epsilon 50 \
    --sigma 2.5
```

Notes:
- The default RetNet assumes `--grid_size 25`. If you change this, you must also change the input dimension of the model's first `Linear` layer.
- `h2mof` expects per-MOF voxel files at `hmof_voxels/<MOF_name>.npy`.
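For intuition, the Boltzmann transform and training-set normalization that RetNet's inputs undergo can be sketched as follows. This is not the repository's exact implementation; the temperature constant and helper names are illustrative assumptions:

```python
import numpy as np

def boltzmann_transform(energy_grid: np.ndarray, temperature: float = 298.0) -> np.ndarray:
    """Map raw interaction energies E to exp(-E/T) values."""
    return np.exp(-energy_grid / temperature)

def standardize(grid: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Apply training-set statistics to any split (never refit on val/test)."""
    return (grid - mean) / std

# Example: one 25x25x25 grid, the shape produced with --grid_size 25.
grid = np.random.default_rng(0).normal(size=(25, 25, 25))
x = boltzmann_transform(grid)
mean, std = x.mean(), x.std()   # in practice: computed over the training split only
x_norm = standardize(x, mean, std)
```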
Use AIdsorb to generate per-MOF point clouds (also .npy):
```bash
aidsorb create hmof_cif --outname hmof_point_clouds
```

`h2mof` expects files at `hmof_point_clouds/<MOF_name>.npy`.
Use AIdsorb to create splits with a single source of truth for IDs:
```bash
aidsorb prepare hmof_point_clouds --split_ratio "(0.8, 0.1, 0.1)"
```

Ensure the following files exist at the project root (or set `H2MOF_ROOT` to the folder containing them): `train.json`, `validation.json`, `test.json`.
The CLI uses the intersection of available IDs across selected methods at each split, so every method sees the same MOFs when comparing performance.
Train and evaluate RF, RetNet, and PointNet with shared splits; write parity plots, residual CSVs, and a comparison bar chart:
```bash
python -m h2mof bench \
    --rf --voxels --pointclouds \
    --pressures 2 100 \
    --epochs-voxels 100 \
    --epochs-pointclouds 150
```

Key behavior:
- RF ignores `--epochs-*`; RetNet and PointNet honor them.
- Splits: IDs come from the root `train.json`, `validation.json`, and `test.json`, and are intersected across methods.
- Outputs go under `outputs/` (see below).
Generate learning curves (validation R² vs. training set size) with repeated stratified runs per size. An internal validation split is carved out of the training IDs (by default 10%) and used only for checkpoint selection.
```bash
python -m h2mof curves \
    --sizes 1000 5000 15000 30000 60000 \
    --repeats 3 \
    --rf --voxels --pointclouds \
    --pressures 2 100 \
    --val-frac 0.10 \
    --epochs-voxels 100 \
    --epochs-pointclouds 150
```

Useful options:
- `--resume --resume-dir outputs/resume` to safely continue partially completed runs
- `--sizes` and `--repeats` accept parallel lists, or a single `--repeats` value that is broadcast to all sizes
Outputs: a PNG and tidy CSV per pressure, summarizing mean/std R² per method and size.
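Aggregating repeated runs into the mean/std R² summary can be sketched with pandas. The column names below (`method`, `size`, `r2`) are hypothetical, not necessarily the actual headers of the generated CSV:

```python
import pandas as pd

# Hypothetical tidy layout: one row per (method, training size, repeat).
runs = pd.DataFrame({
    "method": ["rf", "rf", "voxels", "voxels"],
    "size":   [1000, 1000, 1000, 1000],
    "r2":     [0.61, 0.63, 0.70, 0.72],
})

# Mean/std R^2 per method and size, as summarized in the output CSV.
summary = (runs.groupby(["method", "size"])["r2"]
               .agg(["mean", "std"])
               .reset_index())
```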
Evaluate pre-trained models (trained on the full hMOF training set) on another dataset such as Tobacco.
```bash
python -m h2mof eval-external \
    --dataset-name tobacco \
    --csv tobacco_data.csv \
    --voxels-dir tobacco_voxels \
    --pc-dir tobacco_point_clouds \
    --rf --voxels --pointclouds \
    --pressures 100
```

Key paths (defaults can be changed via environment variables; see the next section):
- Labels CSV: `hmof_data.csv`
- Assets:
  - Voxels: `hmof_voxels/<MOF_name>.npy`
  - Point clouds: `hmof_point_clouds/<MOF_name>.npy`
- Splits: `train.json`, `validation.json`, `test.json`
- Models and reports (created by the CLI):
  - `outputs/models/<method>/<pressure>bar/` (method artifacts and `test_results.json`)
  - `outputs/parity_plots/<method>/<method>_parity_<pressure>bar.png`
  - `outputs/residuals/<method>/<method>_top100_residuals_<pressure>bar.csv`
  - `outputs/comparison/comparison_r2.png`
  - `outputs/learning_curves_H2_<pressure>bar.{png,csv}`
- External evaluation: `outputs/external/<dataset>/{parity_plots,residuals,comparison,models,meta}`
Paths and behavior are controlled by the following environment variables:
- `H2MOF_ROOT`: project root (defaults to the current working directory)
- `H2MOF_OUTPUTS_DIR`: outputs root (defaults to `<ROOT>/outputs`)
- `H2MOF_LOG_LEVEL`: console log level (`INFO`, `DEBUG`, …); default `INFO`
- `H2MOF_RICH`: set to `0` to disable rich console logging
- `H2MOF_NUM_WORKERS`: DataLoader workers (defaults to 8; set to 0 if you see worker issues)
Devices and precision:
- The code picks `cuda` > `mps` > `cpu` automatically and uses AMP (bf16/float16 on CUDA when available).
- PyTorch Lightning is configured to use mixed precision on GPU and 32-bit precision on CPU/MPS.
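The `cuda` > `mps` > `cpu` selection rule can be sketched as follows (an illustrative helper, not the repository's code):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```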
Reproducibility:
- All CLI commands accept `--seed` (default 42). Internally, the Python, NumPy, and PyTorch RNGs and the DataLoader workers are seeded.
- Learning-curve runs can be resumed safely with `--resume`.
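A minimal seeding helper in the spirit of the behavior described above (illustrative; the package's internal implementation may differ):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```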
- RF (geometric descriptors): scikit-learn `RandomForestRegressor` with a small grid search over depth/feature/leaf configurations, trained on the five descriptors. No GPU required.
- RetNet (energy voxels): a compact 3D CNN consuming `exp(-E/T)` grids; the training-set mean/std are computed on the Boltzmann transform and applied to all splits. Default `grid_size=25`.
- PointNet (point clouds): training and inference via AIdsorb Lightning modules, with a center transform and the classification head configured for single-output regression.
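The RF setup can be sketched with scikit-learn on synthetic descriptor data. The parameter grid below is a stand-in for "a small grid over depth/feature/leaf config", not the repository's exact grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical small grid over depth/feature/leaf settings.
param_grid = {
    "max_depth": [None, 20],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 2],
}

# Synthetic stand-in for the five geometric descriptors and one target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, 0.5, 0.2, -0.3, 0.1]) + 0.05 * rng.normal(size=200)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)
```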