Alex2003D/H2MOF

H2MOF: Predicting Hydrogen Adsorption in MOFs with Geometric Descriptors, Energy Voxels, and Molecular Point Clouds

This repository contains the code for the study “Predicting Hydrogen Adsorption in MOFs: A Comparative Analysis of Geometric Descriptors, Energy Voxels and Molecular Point Clouds”.

It compares three representation–model pairs under consistent data splits and an identical evaluation protocol:

  • RF on five geometric descriptors (void_fraction, surface_area_m2g, surface_area_m2cm3, pld, lcd)
  • RetNet, a compact 3D CNN on Boltzmann-transformed energy voxels
  • PointNet on molecular point clouds via the AIdsorb pipeline

Installation

Requirements: Python ≥ 3.10. GPU is recommended for voxel and point-cloud models.

  1. (Optional but recommended) Install PyTorch following the official installation instructions for your platform/GPU.

  2. Install this package. Use extras depending on which methods you plan to run:

# Base (RF + common utilities)
pip install -e .

# Energy voxels (adds pymoxel)
pip install -e .[voxels]

# Molecular point clouds (adds AIdsorb + Lightning)
pip install -e .[pointclouds]

# Everything
pip install -e .[all]

Data and labels (77 K H₂)

Download the hMOF JSON files from MOFX-DB (mof.tech.northwestern.edu) and place them in a folder such as hmof_json/, with files named like hMOF-<id>.json. The included CLI builds a single labels table that gathers the five descriptors and the 77 K H₂ adsorption targets at 2 and 100 bar:

python -m h2mof labels \
  --input-dir hmof_json \
  --output-csv hmof_data.csv

This produces a CSV indexed by MOF_name with columns:

  • void_fraction, surface_area_m2g, surface_area_m2cm3, pld, lcd
  • H2_adsorption_2bar, H2_adsorption_100bar

By default the project looks for hmof_data.csv in the project root (see Paths below).
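Once built, the labels table can be loaded directly with pandas. The snippet below is a minimal sketch: the two sample rows are fabricated for illustration and stand in for the real hmof_data.csv.

```python
import io

import pandas as pd

# Illustrative stand-in for pd.read_csv("hmof_data.csv", index_col="MOF_name");
# the numbers below are fabricated, not real hMOF data.
sample = io.StringIO(
    "MOF_name,void_fraction,surface_area_m2g,surface_area_m2cm3,pld,lcd,"
    "H2_adsorption_2bar,H2_adsorption_100bar\n"
    "hMOF-1,0.52,1500.0,1200.0,5.1,7.8,3.2,41.0\n"
    "hMOF-2,0.61,2100.0,1400.0,6.3,9.2,2.7,47.5\n"
)
df = pd.read_csv(sample, index_col="MOF_name")

# The five geometric descriptors feed the RF model; the targets are per pressure.
features = df[["void_fraction", "surface_area_m2g", "surface_area_m2cm3", "pld", "lcd"]]
target = df["H2_adsorption_2bar"]
```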

Asset generation

Two method-specific assets are required:

1) Energy voxels (for RetNet)

Use the moxel CLI to convert CIFs to 3D potential grids (stored as .npy):

moxel hmof_cif hmof_voxels \
  --grid_size 25 \
  --cutoff 10 \
  --epsilon 50 \
  --sigma 2.5

Notes:

  • The default RetNet assumes --grid_size 25. If you change this, you must also change the first Linear layer input dimension in the model.
  • h2mof expects per-MOF voxel files at hmof_voxels/<MOF_name>.npy.
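A quick sanity check on the shape contract above can be sketched as follows; the zero array stands in for a grid loaded from hmof_voxels/ (the file name is illustrative):

```python
import numpy as np

GRID_SIZE = 25  # must match the --grid_size passed to moxel

# Stand-in for: voxel = np.load("hmof_voxels/hMOF-1.npy")
voxel = np.zeros((GRID_SIZE, GRID_SIZE, GRID_SIZE), dtype=np.float32)

# If this fails, regenerate the voxels or adjust the first Linear layer in RetNet.
assert voxel.shape == (GRID_SIZE,) * 3, f"unexpected grid shape {voxel.shape}"
```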

2) Molecular point clouds (for PointNet)

Use AIdsorb to generate per-MOF point clouds (also .npy):

aidsorb create hmof_cif --outname hmof_point_clouds

h2mof expects files at hmof_point_clouds/<MOF_name>.npy.

Data splits (consistent IDs across methods)

Use AIdsorb to create splits with a single source of truth for IDs:

aidsorb prepare hmof_point_clouds --split_ratio "(0.8, 0.1, 0.1)"

Ensure the following files exist at the project root (or set H2MOF_ROOT to the folder containing them):

  • train.json
  • validation.json
  • test.json

For each split, the CLI uses the intersection of IDs available across the selected methods, so every method sees the same MOFs and performance comparisons are fair.
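The intersection logic can be sketched as below. This assumes each split file stores a list of MOF IDs (the exact schema is AIdsorb's; adjust if it differs), and the helper name and example IDs are hypothetical:

```python
def shared_ids(split_ids, available_per_method):
    """Keep only IDs that are in the split AND have assets for every selected method."""
    ids = set(split_ids)
    for available in available_per_method.values():
        ids &= set(available)
    return sorted(ids)

# Illustrative: only hMOF-2 has both a voxel grid and a point cloud.
train = ["hMOF-1", "hMOF-2", "hMOF-3"]
available = {
    "voxels": ["hMOF-1", "hMOF-2"],
    "pointclouds": ["hMOF-2", "hMOF-3"],
}
print(shared_ids(train, available))  # → ['hMOF-2']
```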

Quickstart: benchmark all methods

Train and evaluate RF, RetNet, and PointNet with shared splits; write parity plots, residual CSVs, and a comparison bar chart:

python -m h2mof bench \
  --rf --voxels --pointclouds \
  --pressures 2 100 \
  --epochs-voxels 100 \
  --epochs-pointclouds 150

Key behavior:

  • RF ignores --epochs-*; RetNet/PointNet honor them.
  • Splits: IDs come from the root train.json, validation.json, test.json and are intersected across methods.
  • Outputs go under outputs/ (see below).

Learning curves

Generate learning curves (validation R² vs. training set size) with repeated stratified runs per size. An internal validation split is carved out of the training IDs (by default 10%) and used only for checkpoint selection.

python -m h2mof curves \
  --sizes 1000 5000 15000 30000 60000 \
  --repeats 3 \
  --rf --voxels --pointclouds \
  --pressures 2 100 \
  --val-frac 0.10 \
  --epochs-voxels 100 \
  --epochs-pointclouds 150

Useful options:

  • --resume --resume-dir outputs/resume to continue partially completed runs safely
  • --repeats accepts either a list parallel to --sizes or a single value, which is broadcast to all sizes

Outputs: a PNG and tidy CSV per pressure, summarizing mean/std R² per method and size.

External evaluation (e.g., Tobacco dataset)

Evaluate pre-trained models (trained on the full hMOF training set) on another dataset such as Tobacco:

python -m h2mof eval-external \
  --dataset-name tobacco \
  --csv tobacco_data.csv \
  --voxels-dir tobacco_voxels \
  --pc-dir tobacco_point_clouds \
  --rf --voxels --pointclouds \
  --pressures 100

Outputs and directory structure

Key paths (defaults can be changed via environment variables; see next section):

  • Labels CSV: hmof_data.csv
  • Assets:
    • Voxels: hmof_voxels/<MOF_name>.npy
    • Point clouds: hmof_point_clouds/<MOF_name>.npy
  • Splits: train.json, validation.json, test.json
  • Models and reports (created by CLI):
    • outputs/models/<method>/<pressure>bar/ (method artifacts and test_results.json)
    • outputs/parity_plots/<method>/<method>_parity_<pressure>bar.png
    • outputs/residuals/<method>/<method>_top100_residuals_<pressure>bar.csv
    • outputs/comparison/comparison_r2.png
    • outputs/learning_curves_H2_<pressure>bar.{png,csv}
    • External evaluation: outputs/external/<dataset>/{parity_plots,residuals,comparison,models,meta}

Configuration, devices, and reproducibility

Paths and behavior are controlled by the following environment variables:

  • H2MOF_ROOT: project root (defaults to current working directory)
  • H2MOF_OUTPUTS_DIR: outputs root (defaults to <ROOT>/outputs)
  • H2MOF_LOG_LEVEL: console log level (INFO, DEBUG, …), default INFO
  • H2MOF_RICH: set 0 to disable rich console logging
  • H2MOF_NUM_WORKERS: DataLoader workers (defaults to 8; set to 0 if you see worker issues)
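For example, a typical shell configuration might look like this (the paths are illustrative):

```shell
# Point h2mof at the project data and tune its behavior.
export H2MOF_ROOT="$HOME/h2mof_project"        # where hmof_data.csv and the split JSONs live
export H2MOF_OUTPUTS_DIR="$H2MOF_ROOT/outputs" # same as the default, shown for clarity
export H2MOF_LOG_LEVEL=DEBUG                   # verbose console logging
export H2MOF_NUM_WORKERS=0                     # fall back to single-process data loading
```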

Devices and precision:

  • The code picks cuda > mps > cpu automatically and uses AMP (bf16/float16 on CUDA when available).
  • PyTorch Lightning is configured to use mixed precision on GPU and 32-bit on CPU/MPS.

Reproducibility:

  • All CLI commands accept --seed (default 42). Internally, Python/NumPy/PyTorch RNGs and DataLoader workers are seeded.
  • Learning-curve runs can be resumed safely with --resume.

Method notes

  • RF (geometric descriptors): scikit-learn RandomForestRegressor with a small grid over depth/feature/leaf config; trained on the five descriptors. No GPU required.
  • RetNet (energy voxels): compact 3D CNN consuming exp(-E/T) grids; training-set mean/std computed on the Boltzmann transform and applied to all splits. Default grid_size=25.
  • PointNet (point clouds): training and inference via AIdsorb Lightning modules; center transform, classification head configured for single-output regression.
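The RetNet preprocessing described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the temperature value and array shapes are assumptions, and the key point is that the mean/std are computed on the training split only and reused for validation/test.

```python
import numpy as np

T = 298.0  # temperature in the Boltzmann factor; illustrative assumption

def boltzmann(energy_grid, temperature=T):
    """Boltzmann-transform an energy grid: exp(-E/T)."""
    return np.exp(-energy_grid / temperature)

# Fake "training set" of 8 grids standing in for loaded voxel files.
rng = np.random.default_rng(42)
train = boltzmann(rng.normal(size=(8, 25, 25, 25)))

# Statistics from the training split only...
mean, std = train.mean(), train.std()

def normalize(energy_grid):
    # ...applied identically to train, validation, and test grids.
    return (boltzmann(energy_grid) - mean) / std

normalized = normalize(rng.normal(size=(25, 25, 25)))
```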
