This repository contains an implementation of the molecular graph deep sets (MolSets) model for molecular mixture properties, associated with our paper Learning molecular mixture property using chemistry-aware graph neural network.
models.py and dmpnn.py contain implementations of MolSets with standard graph convolutions and DMPNN, respectively.
main.py, main_dmpnn.py, and predict.py are for evaluation and prediction; see Usage for details.
data_utils.py is for processing molecular graph data.
data/ provides datasets used in the paper.
Details on datasets:
data_compiled.csvcontains cleaned raw data from the dataset curated in ACS Cent. Sci. 2023, 9, 2, 206–216.prepare_data.pyis for processing the raw data, e.g., converting SMILES to graphs.data_list.pklcontains processed data from the dataset.- An integer index;
- A list of solvent molecular graphs in
torch_geometric.data.Dataformat; - A list of solvent molecular weights (g/mol);
- A list of solvent weight fractions;
- Salt molality (mol/kg);
- Salt molecular graph;
- Logarithm conductivity at 298 K (log S/cm).
data_df_stats.pklorganizes the data with some statistics inpandas.DataFrameformat.all_bin_candidates.pklcontains the candidates (equal weight binary molecular mixture + 1 m salt) for virtual screening. Organized in the same way asdata_list.pkl.
results provides model checkpoints and saves files generated in runs.
*Note: Git LFS is required to download the .pkl files properly. Please download them here if you do not have Git LFS.
**Data handling is not yet optimized for efficiency. Contributions are welcome!
MolSets requires the following packages:
- PyTorch >= 2.0
- PyG (
torch_geometric) - PyTorch Scatter (only for DMPNN)
The environment can be set up by running
conda env create -f environment.yml
However, there may be package compatibility issues that need manual corrections. CUDA and GPU-enabled versions of PyTorch and PyG are required to run on GPUs.
Use main.py to train the MolSets model (with standard graph convolutions) or evaluate it on a dataset. Set the hyperparameters in hyperpars, and the data path in dataset, then run
(screen) python main.py
and see the results. Training may take minutes to hours depending on the device and data size. For the model with DMPNN, use main_dmpnn.py instead, following similar procedures.
Use predict.py to make inferences on candidate mixtures with a trained model. Specify the path to the candidate data file in candidate_data and the model checkpoint file in model.load_. Information about training data is needed if feature normalization is used, as in data_utils.py.
After setup, run
python predict.py
and the predictions will be written in a .csv file.
If you find this code useful, please consider citing the following paper:
@article{zhang2024molsets,
author = {Zhang, Hengrui and Lai, Tianxing and Chen, Jie and Manthiram, Arumugam and Rondinelli, James M. and Chen, Wei},
title = {Learning molecular mixture property using chemistry-aware graph neural network},
journal = {PRX Energy},
year = {2024},
volume = {3},
number = {2},
pages = {023006},
doi = {10.1103/PRXEnergy.3.023006}
}
