This is the official code repository for the paper "Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction".
Kinetic GROVER Multi-Task (KERMT) is a pretrained graph neural network model for molecular property prediction.
KERMT is an enhanced reimplementation of the GROVER model. The KERMT implementation uses PyTorch Distributed Data Parallel (DDP) for distributed pretraining, automates hyperparameter tuning, and accelerates finetuning and prediction using cuik-molmaker.
This implementation is based on the original GROVER implementation and paper.
We recommend running the model in a Docker container. For developers, we provide the Dockerfile used to build the container.
git clone https://github.com/NVIDIA-Digital-Bio/KERMT.git
cd KERMT
docker build --rm -t kermt:latest -f Dockerfile .
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/data:/data \
    -v /path/to/reference_pretrained_models:/reference_pretrained_models \
    -it --name kermt kermt:latest
source /softwares/miniconda3/etc/profile.d/conda.sh && conda activate kermt
cd code
export PYTHONPATH=$PWD
export CUBLAS_WORKSPACE_CONFIG=:4096:8 # for deterministic results

# Create conda environment
cd KERMT
conda env create -n kermt -f environment.yml
conda activate kermt
# Install cuik-molmaker
pip install cuik_molmaker==0.1.1 --index-url https://pypi.nvidia.com/rdkit-2025.03.2_torch-2.7.1/

The pretrained models can be downloaded from the following links.
Prepare data by generating task labels for the functional group prediction task. The SMILES strings should be in a CSV file with a column named smiles; see tests/data/smis_only.csv for an example.
python scripts/save_features.py --data_path tests/data/smis_only.csv \
--save_path tests/data/smis_only.npz \
--features_generator fgtasklabel \
    --restart
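
For reference, here is a minimal Python sketch that writes an input file in the expected format (the file name my_molecules.csv is hypothetical):

# Minimal sketch of the expected input format: a CSV with a "smiles" column.
# The file name my_molecules.csv is illustrative.
import csv

rows = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
with open("my_molecules.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["smiles"])
    writer.writerows([[smi] for smi in rows])
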
The atom/bond contextual property (vocabulary) is extracted by scripts/build_vocab.py.

python scripts/build_vocab.py --data_path tests/data/smis_only.csv \
--vocab_save_folder tests/data/smis_only \
    --dataset_name smis_only

The outputs of this script are vocabulary dicts of atoms and bonds, smis_only_atom_vocab.pkl and smis_only_bond_vocab.pkl, respectively. For more options for contextual property extraction, please refer to scripts/build_vocab.py.
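
To sanity-check the output, the vocabularies can be unpickled and inspected. This is a hedged sketch: the file locations follow from the --vocab_save_folder and --dataset_name arguments above, and the pickles may wrap a project-specific vocab class, so run it inside the kermt environment:

# Hypothetical inspection of the generated vocabularies; run inside the kermt
# environment in case the pickles reference project-specific classes.
import pickle

for name in ("atom", "bond"):
    path = f"tests/data/smis_only/smis_only_{name}_vocab.pkl"
    with open(path, "rb") as f:
        vocab = pickle.load(f)
    size = len(vocab) if hasattr(vocab, "__len__") else "unknown"
    print(f"{name} vocab: {type(vocab).__name__}, size: {size}")
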
Split pretraining data and features into smaller files for memory efficiency.
python scripts/split_data.py --data_path tests/data/smis_only.csv \
--features_path tests/data/smis_only.npz \
--sample_per_file 100 \
    --output_path tests/data/smis_only

It is recommended to set sample_per_file to a larger value for big datasets.
The output dataset folder will look like this:
smis_only
|- feature # the semantic motif labels
|- graph # the smiles
|- summary.txt
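
A quick way to verify the split is to count the shards on disk; a minimal sketch, assuming the layout shown above:

# Hypothetical check that graph and feature shards line up after splitting.
import os

root = "tests/data/smis_only"
n_graph = len(os.listdir(os.path.join(root, "graph")))
n_feature = len(os.listdir(os.path.join(root, "feature")))
print(f"{n_graph} graph shards, {n_feature} feature shards")
with open(os.path.join(root, "summary.txt")) as f:
    print(f.read())
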
For pretraining on multiple GPUs, set the number of available GPUs as WORLD_SIZE. Since pretraining datasets are large, it is recommended to use a large batch size and keep GPU memory utilization near full for maximum efficiency. This example shows how to pretrain on 2 GPUs using a prepared pretraining dataset in the tests/data/pretrain directory.
WORLD_SIZE=2 python pretrain_ddp.py \
--train_data_path tests/data/pretrain/train_9k \
--val_data_path tests/data/pretrain/val_1k \
--save_dir model/pretrain \
--atom_vocab_path tests/data/pretrain/pretrain_atom_vocab.pkl \
--bond_vocab_path tests/data/pretrain/pretrain_bond_vocab.pkl \
--batch_size 256 --dropout 0.1 --depth 6 --num_attn_head 4 --hidden_size 800 \
--epochs 100 --init_lr 1E-5 --max_lr 1.5E-4 --final_lr 1E-5 --warmup_epochs 20 \
    --weight_decay 1E-7 --activation PReLU --backbone gtrans \
    --embedding_output_type both --tensorboard --save_interval 100 \
    --use_cuikmolmaker_featurization
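
For context, WORLD_SIZE follows the standard PyTorch DDP convention. The sketch below shows the usual spawn-and-init pattern such a script builds on; it is illustrative, not the actual pretrain_ddp.py internals:

# Illustrative PyTorch DDP launch pattern (not the actual pretrain_ddp.py code):
# one process per GPU, each joining a NCCL process group.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = torch.nn.Linear(800, 800).cuda(rank)  # stand-in for the GROVER backbone
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop with a DistributedSampler would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = int(os.environ.get("WORLD_SIZE", torch.cuda.device_count()))
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
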
For preparing your own pretraining dataset, please run the data preparation, vocabulary generation, and data splitting sections above.

The dataset for finetuning should be organized into three .csv files for the train, validation, and test sets. Each .csv file should contain a column named smiles and one column per prediction task. See tests/data/finetune/ for examples.
Given a labelled molecular dataset, additional molecular features required for finetuning the existing pretrained model can be precomputed. The feature matrix is stored as .npz. This example shows how to precompute normalized RDKit 2D features for the training dataset; repeat this step for the validation and test datasets.
python scripts/save_features.py --data_path tests/data/finetune/train.csv \
--save_path tests/data/finetune/train.npz \
--features_generator rdkit_2d_normalized \
    --restart
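
Before finetuning, it is worth checking that the precomputed feature matrix has one row per molecule. A minimal sketch (the archive key is discovered rather than assumed):

# Hypothetical sanity check: one feature row per molecule in the CSV.
import csv
import numpy as np

archive = np.load("tests/data/finetune/train.npz", allow_pickle=True)
print("arrays in archive:", archive.files)  # discover keys rather than assume them
features = archive[archive.files[0]]
with open("tests/data/finetune/train.csv") as f:
    n_molecules = sum(1 for _ in csv.DictReader(f))
print(f"feature matrix shape: {features.shape}, molecules in CSV: {n_molecules}")
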
The pretrained model can then be finetuned on the labelled dataset:

python main.py finetune \
    --data_path tests/data/finetune/train.csv \
--separate_val_path tests/data/finetune/val.csv \
--separate_test_path tests/data/finetune/test.csv \
--save_dir test_run/finetune \
--checkpoint_path reference_pretrained_models/grover_base.pt \
--dataset_type regression \
--split_type scaffold_balanced \
--ensemble_size 1 \
--num_folds 1 \
--no_features_scaling \
--ffn_hidden_size 700 \
--ffn_num_layers 3 \
--bond_drop_rate 0.1 \
--epochs 2 \
--metric mae \
--self_attention \
--dist_coff 0.15 \
--max_lr 1e-4 \
--final_lr 2e-5 \
--dropout 0.0 \
--use_cuikmolmaker_featurization \
--features_generator rdkit_2d_normalized_cuik_molmaker \
    --rdkit2D_normalization_type fast

The --use_cuikmolmaker_featurization flag enables cuik-molmaker for computing atom and bond features. Additionally, normalized RDKit 2D features can also be computed with cuik-molmaker by setting --features_generator rdkit_2d_normalized_cuik_molmaker. The --rdkit2D_normalization_type flag specifies the type of normalization applied to the RDKit 2D features.
Hyperparameter optimization for finetuning can be run with main_hpo.py:

python main_hpo.py finetune --data_path tests/data/finetune/train.csv \
--features_path path/to/train.npz \
--separate_val_path tests/data/finetune/val.csv \
--separate_val_features_path path/to/val.npz \
--separate_test_path tests/data/finetune/test.csv \
--separate_test_features_path path/to/test.npz \
--save_dir finetune_hpo/ \
--checkpoint_path reference_pretrained_models/grover_base.pt \
--dataset_type regression \
--split_type scaffold_balanced \
--ensemble_size 1 \
--num_folds 1 \
--no_features_scaling \
--weight_decay 5e-06 \
--fine_tune_coff 1.0 \
--epochs 100 \
--n_trials 100 \
--metric mae \
    --self_attention

The number of trials can be set with the --n_trials flag, and the number of epochs per trial with the --epochs flag.
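
For intuition, --n_trials maps onto the kind of Optuna-style search loop sketched below; the objective here is a toy stand-in, and the real objective and search space live in main_hpo.py:

# Toy Optuna-style search loop illustrating what --n_trials controls; the real
# objective (a full finetuning run returning validation MAE) lives in main_hpo.py.
import optuna

def objective(trial: optuna.Trial) -> float:
    max_lr = trial.suggest_float("max_lr", 1e-5, 1e-3, log=True)  # illustrative
    dropout = trial.suggest_float("dropout", 0.0, 0.5)            # search space
    return max_lr * 1e3 + dropout  # toy stand-in for validation MAE

study = optuna.create_study(direction="minimize")  # MAE is minimized
study.optimize(objective, n_trials=100)            # corresponds to --n_trials 100
print(study.best_params)
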
A finetuned model can be used to make predictions on target molecules.
python main.py predict \
--data_path tests/data/finetune/test.csv \
--checkpoint_dir path/to/finetuned_model/ \
--no_features_scaling \
    --output path/to/predictions.csv

- GPUs are required for pretraining, finetuning, and prediction. Multiple GPUs can be used for distributed pretraining. NVIDIA GPUs with at least 32 GB of VRAM and Volta or newer architectures are recommended.
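
As a final sanity check, the predictions can be compared against the held-out labels; a minimal sketch, assuming predictions.csv mirrors the task columns of test.csv:

# Hypothetical post-processing: report per-task MAE, assuming predictions.csv
# contains the same task columns as the labelled test set.
import pandas as pd

preds = pd.read_csv("path/to/predictions.csv")
truth = pd.read_csv("tests/data/finetune/test.csv")
task_cols = [c for c in truth.columns if c != "smiles"]
mae = (preds[task_cols] - truth[task_cols]).abs().mean()
print(mae)
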
