H2MOF: Predicting Hydrogen Adsorption in MOFs with Geometric Descriptors, Energy Voxels, and Molecular Point Clouds
This repository contains the code for the study “Predicting Hydrogen Adsorption in MOFs: A Comparative Analysis of Geometric Descriptors, Energy Voxels and Molecular Point Clouds”.
It compares three representation–model pairs using consistent data splits and identical evaluation:
- RF on five geometric descriptors (void_fraction, surface_area_m2g, surface_area_m2cm3, pld, lcd)
- RetNet, a compact 3D CNN on Boltzmann-transformed energy voxels
- PointNet on molecular point clouds via the AIdsorb pipeline
Requirements: Python ≥ 3.10. A GPU is recommended for the voxel and point-cloud models.
- (Optional but recommended) Install PyTorch from its official instructions for your platform/GPU.
- Install this package, using extras depending on which methods you plan to run:
```bash
# Base (RF + common utilities)
pip install -e .

# Energy voxels (adds pymoxel)
pip install -e .[voxels]

# Molecular point clouds (adds AIdsorb + Lightning)
pip install -e .[pointclouds]

# Everything
pip install -e .[all]
```

Download the hMOF JSONs from MOFX-DB (mof.tech.northwestern.edu) and place them under a folder such as `hmof_json/`, where files are named like `hMOF-<id>.json`. The included CLI builds a single labels table gathering the five descriptors and the 77 K H₂ adsorption targets at 2 and 100 bar:
```bash
python -m h2mof labels \
    --input-dir hmof_json \
    --output-csv hmof_data.csv
```

This produces a CSV indexed by `MOF_name` with columns:
- `void_fraction`, `surface_area_m2g`, `surface_area_m2cm3`, `pld`, `lcd`
- `H2_adsorption_2bar`, `H2_adsorption_100bar`
By default the project looks for hmof_data.csv in the project root (see Paths below).
Two method-specific assets are required:
Use the moxel CLI to convert CIFs to 3D potential grids (stored as .npy):
```bash
moxel hmof_cif hmof_voxels \
    --grid_size 25 \
    --cutoff 10 \
    --epsilon 50 \
    --sigma 2.5
```

Notes:
- The default RetNet assumes `--grid_size 25`. If you change this, you must also change the input dimension of the model's first `Linear` layer.
- `h2mof` expects per-MOF voxel files at `hmof_voxels/<MOF_name>.npy`.
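For intuition, the Boltzmann transform and training-set normalization that RetNet's inputs undergo can be sketched as follows. This is not the repository's exact implementation; the temperature constant and helper names are illustrative assumptions:

```python
import numpy as np

def boltzmann_transform(energy_grid: np.ndarray, temperature: float = 298.0) -> np.ndarray:
    """Map raw interaction energies E to exp(-E/T) values."""
    return np.exp(-energy_grid / temperature)

def standardize(grid: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Apply training-set statistics to any split (never refit on val/test)."""
    return (grid - mean) / std

# Example: one 25x25x25 grid, the shape produced with --grid_size 25.
grid = np.random.default_rng(0).normal(size=(25, 25, 25))
x = boltzmann_transform(grid)
mean, std = x.mean(), x.std()   # in practice: computed over the training split only
x_norm = standardize(x, mean, std)
```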
Use AIdsorb to generate per-MOF point clouds (also .npy):
```bash
aidsorb create hmof_cif --outname hmof_point_clouds
```

`h2mof` expects files at `hmof_point_clouds/<MOF_name>.npy`.
Use AIdsorb to create splits with a single source of truth for IDs:
```bash
aidsorb prepare hmof_point_clouds --split_ratio "(0.8, 0.1, 0.1)"
```

Ensure the following files exist at the project root (or set `H2MOF_ROOT` to the folder containing them): `train.json`, `validation.json`, `test.json`.
The CLI uses the intersection of available IDs across selected methods at each split, so every method sees the same MOFs when comparing performance.
Train and evaluate RF, RetNet, and PointNet with shared splits; write parity plots, residual CSVs, and a comparison bar chart:
```bash
python -m h2mof bench \
    --rf --voxels --pointclouds \
    --pressures 2 100 \
    --epochs-voxels 100 \
    --epochs-pointclouds 150
```

Key behavior:
- RF ignores `--epochs-*`; RetNet and PointNet honor them.
- Splits: IDs come from the root `train.json`, `validation.json`, and `test.json`, and are intersected across methods.
- Outputs go under `outputs/` (see below).
Generate learning curves (validation R² vs. training set size) with repeated stratified runs per size. An internal validation split is carved out of the training IDs (by default 10%) and used only for checkpoint selection.
```bash
python -m h2mof curves \
    --sizes 1000 5000 15000 30000 60000 \
    --repeats 3 \
    --rf --voxels --pointclouds \
    --pressures 2 100 \
    --val-frac 0.10 \
    --epochs-voxels 100 \
    --epochs-pointclouds 150
```

Useful options:
- `--resume --resume-dir outputs/resume` to safely continue partially completed runs
- `--sizes` and `--repeats` accept parallel lists, or a single `--repeats` value that is broadcast to all sizes
Outputs: a PNG and tidy CSV per pressure, summarizing mean/std R² per method and size.
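Aggregating repeated runs into the mean/std R² summary can be sketched with pandas. The column names below (`method`, `size`, `r2`) are hypothetical, not necessarily the actual headers of the generated CSV:

```python
import pandas as pd

# Hypothetical tidy layout: one row per (method, training size, repeat).
runs = pd.DataFrame({
    "method": ["rf", "rf", "voxels", "voxels"],
    "size":   [1000, 1000, 1000, 1000],
    "r2":     [0.61, 0.63, 0.70, 0.72],
})

# Mean/std R^2 per method and size, as summarized in the output CSV.
summary = (runs.groupby(["method", "size"])["r2"]
               .agg(["mean", "std"])
               .reset_index())
```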
Evaluate pre-trained models (trained on the full hMOF training set) on another dataset such as Tobacco.
```bash
python -m h2mof eval-external \
    --dataset-name tobacco \
    --csv tobacco_data.csv \
    --voxels-dir tobacco_voxels \
    --pc-dir tobacco_point_clouds \
    --rf --voxels --pointclouds \
    --pressures 100
```

Key paths (defaults can be changed via environment variables; see the next section):
- Labels CSV: `hmof_data.csv`
- Assets:
  - Voxels: `hmof_voxels/<MOF_name>.npy`
  - Point clouds: `hmof_point_clouds/<MOF_name>.npy`
- Splits: `train.json`, `validation.json`, `test.json`
- Models and reports (created by the CLI):
  - `outputs/models/<method>/<pressure>bar/` (method artifacts and `test_results.json`)
  - `outputs/parity_plots/<method>/<method>_parity_<pressure>bar.png`
  - `outputs/residuals/<method>/<method>_top100_residuals_<pressure>bar.csv`
  - `outputs/comparison/comparison_r2.png`
  - `outputs/learning_curves_H2_<pressure>bar.{png,csv}`
- External evaluation: `outputs/external/<dataset>/{parity_plots,residuals,comparison,models,meta}`
Paths and behavior are controlled by the following environment variables:
- `H2MOF_ROOT`: project root (defaults to the current working directory)
- `H2MOF_OUTPUTS_DIR`: outputs root (defaults to `<ROOT>/outputs`)
- `H2MOF_LOG_LEVEL`: console log level (`INFO`, `DEBUG`, …); default `INFO`
- `H2MOF_RICH`: set to `0` to disable rich console logging
- `H2MOF_NUM_WORKERS`: DataLoader workers (defaults to 8; set to 0 if you see worker issues)
Devices and precision:
- The code picks `cuda` > `mps` > `cpu` automatically and uses AMP (bf16/float16 on CUDA when available).
- PyTorch Lightning is configured to use mixed precision on GPU and 32-bit precision on CPU/MPS.
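The `cuda` > `mps` > `cpu` selection rule can be sketched as follows (an illustrative helper, not the repository's code):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```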
Reproducibility:
- All CLI commands accept `--seed` (default 42). Internally, the Python, NumPy, and PyTorch RNGs and the DataLoader workers are seeded.
- Learning-curve runs can be resumed safely with `--resume`.
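A minimal seeding helper in the spirit of the behavior described above (illustrative; the package's internal implementation may differ):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```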
- RF (geometric descriptors): scikit-learn `RandomForestRegressor` with a small grid search over depth/feature/leaf configurations, trained on the five descriptors. No GPU required.
- RetNet (energy voxels): a compact 3D CNN consuming `exp(-E/T)` grids; the training-set mean/std are computed on the Boltzmann transform and applied to all splits. Default `grid_size=25`.
- PointNet (point clouds): training and inference via AIdsorb Lightning modules, with a center transform and the classification head configured for single-output regression.
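The RF setup can be sketched with scikit-learn on synthetic descriptor data. The parameter grid below is a stand-in for "a small grid over depth/feature/leaf config", not the repository's exact grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical small grid over depth/feature/leaf settings.
param_grid = {
    "max_depth": [None, 20],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 2],
}

# Synthetic stand-in for the five geometric descriptors and one target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, 0.5, 0.2, -0.3, 0.1]) + 0.05 * rng.normal(size=200)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)
```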