A Multi-objective Framework for Universal Protein–Ligand Interaction Prediction
Article · GitHub · Zenodo
Training Datasets · Trained Models
In this study, we propose and implement DeepRLI, an interaction prediction framework that is universally applicable across various tasks. Its core innovation is a multi-objective learning strategy that treats scoring, docking, and screening as separate optimization goals: the deep learning model has three relatively independent downstream readout networks, each of which can be optimized separately to enhance the task specificity of its output. The model incorporates an improved graph transformer with a cosine envelope constraint, integrates a novel physical information module, and introduces a new contrastive learning strategy. With these designs, DeepRLI demonstrates strong comprehensive performance across applications such as binding affinity prediction, binding pose prediction, and virtual screening, showcasing its potential in practical drug development.
The architecture of DeepRLI is illustrated in Figure 1.
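To make the multi-objective idea concrete, the sketch below shows a shared encoder feeding three independent readout heads. This is a hypothetical minimal illustration, not the published DeepRLI architecture: the layer shapes, the encoder stand-in, and the head definitions are all placeholders.

```python
import torch
import torch.nn as nn

class MultiObjectiveReadout(nn.Module):
    """Illustrative sketch of a shared encoder with three relatively
    independent readout networks (scoring, docking, screening).
    Sizes and layers are placeholders, not the real DeepRLI model."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # Stand-in for the graph-transformer encoder output projection.
        self.encoder = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # Three downstream readout heads that can be optimized separately.
        self.scoring_head = nn.Linear(hidden_dim, 1)
        self.docking_head = nn.Linear(hidden_dim, 1)
        self.screening_head = nn.Linear(hidden_dim, 1)

    def forward(self, graph_embedding: torch.Tensor) -> dict:
        h = self.encoder(graph_embedding)
        return {
            "scoring": self.scoring_head(h),
            "docking": self.docking_head(h),
            "screening": self.screening_head(h),
        }

model = MultiObjectiveReadout()
out = model(torch.randn(4, 64))  # batch of 4 pooled graph embeddings
```

Because the heads are separate modules on top of a shared representation, each objective can receive its own loss without forcing a single output to serve all three tasks.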
A Python virtual environment needs to be created in advance.
- Import from the .yml file

conda env create -n deeprli -f environment.yml
- Create step by step
conda create -n deeprli python=3.11
conda activate deeprli
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c dglteam/label/cu118 dgl==1.1.2.cu118
conda install -c conda-forge rdkit==2023.09.2
- Clone the repository
git clone https://github.com/fairydance/DeepRLI.git
- Set the environment variable
export PYTHONPATH="${REPO_ROOT}/src:${PYTHONPATH}"
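If setting PYTHONPATH in the shell is inconvenient, the same effect can be achieved from inside Python. The snippet below demonstrates the mechanism against a throwaway stand-in for ${REPO_ROOT}; with a real clone, only the `sys.path.insert` line is needed.

```python
import sys
import tempfile
from pathlib import Path

# Throwaway stand-in for ${REPO_ROOT}; with a real clone, skip this setup
# and point repo_root at the cloned DeepRLI directory instead.
repo_root = Path(tempfile.mkdtemp())
(repo_root / "src" / "deeprli").mkdir(parents=True)
(repo_root / "src" / "deeprli" / "__init__.py").write_text("")

# Equivalent of: export PYTHONPATH="${REPO_ROOT}/src:${PYTHONPATH}"
sys.path.insert(0, str(repo_root / "src"))

import deeprli  # the package is now resolvable
```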
For both training and inference, structural data needs to be preprocessed into graph data. Running the preprocessing task first requires building a preset directory structure as follows:
${DATA_ROOT_DIR}
├── index
└── raw
The raw directory should contain the structure files of ligands and proteins, and the index directory provides index files for the data to be processed. The examples/data directory in this repository provides data examples for reference.
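A short helper can create this layout programmatically before you copy in structure and index files. This is only a convenience sketch; the directory names come from the tree above.

```python
import tempfile
from pathlib import Path

def make_data_layout(data_root: Path) -> None:
    """Create the preset ${DATA_ROOT_DIR} layout expected by preprocessing:
    an index/ directory for index files and a raw/ directory for
    ligand and protein structure files."""
    (data_root / "index").mkdir(parents=True, exist_ok=True)
    (data_root / "raw").mkdir(parents=True, exist_ok=True)

# Demo with a temporary location standing in for ${DATA_ROOT_DIR}.
data_root = Path(tempfile.mkdtemp()) / "my_dataset"
make_data_layout(data_root)
```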
The script for this job is located in the ${REPO_ROOT}/src/deeprli/preprocess directory. Run it as below:
python preprocess.py \
  --data-root "${DATA_ROOT_DIR}" \
  --data-index "${DATA_INDEX_FILE}" \
  --ligand-file-types "sdf,mol2" \
  --dist-cutoff 6.5

The path of ${DATA_INDEX_FILE} in the above command is relative to ${DATA_ROOT_DIR}. After execution, each processed complex is stored under ${DATA_ROOT_DIR}/processed, and finally a file containing all processed complex data is packaged and saved in the ${DATA_ROOT_DIR}/compiled directory. The "data_file" input for training and inference is this packaged file. In addition, the script outputs an index file at the end that lists the successfully processed data.
Note that the ${DATA_INDEX_FILE} should contain only the processed data needed for the subsequent training or inference task. If any items listed in the ${DATA_INDEX_FILE} cannot be found in the ${DATA_ROOT_DIR}/processed folder, the data preprocessing will be re-executed.
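The consistency check described above can be sketched as follows. The one-entry-per-line index format and the `.pt` extension for processed items are assumptions made for illustration, not details confirmed by the repository.

```python
import tempfile
from pathlib import Path

def find_unprocessed(data_root: Path, index_file: Path) -> list:
    """Return index entries with no matching file under
    ${DATA_ROOT_DIR}/processed. Any hit would trigger re-preprocessing.
    Assumes one entry per line and a .pt file per processed complex."""
    processed = data_root / "processed"
    entries = [ln.strip() for ln in index_file.read_text().splitlines() if ln.strip()]
    return [e for e in entries if not (processed / f"{e}.pt").exists()]

# Demo with a throwaway layout: one processed entry, one missing.
data_root = Path(tempfile.mkdtemp())
(data_root / "processed").mkdir()
(data_root / "processed" / "1abc.pt").touch()
index_file = data_root / "index.txt"
index_file.write_text("1abc\n2xyz\n")
missing = find_unprocessed(data_root, index_file)
```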
The script for model training is in the ${REPO_ROOT}/src/deeprli/train directory. The necessary inputs can be provided either as command-line parameters or through a JSON-formatted configuration file; command-line parameters take higher priority.
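This precedence rule (command-line values override the configuration file) can be sketched like so. The merging helper and the argument subset are illustrative, not the training script's actual implementation; the key names mirror the example configuration.

```python
import argparse
import json

def load_settings(argv: list, config_text: str) -> dict:
    """Merge a JSON config with CLI arguments; CLI values take priority."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int)
    parser.add_argument("--batch", type=int)
    parser.add_argument("--initial-lr", dest="initial_lr", type=float)
    # Keep only the flags actually given on the command line.
    cli = {k: v for k, v in vars(parser.parse_args(argv)).items() if v is not None}
    settings = json.loads(config_text)
    settings.update(cli)  # command-line parameters win over the file
    return settings

config = '{"epoch": 1000, "batch": 6, "initial_lr": 0.0002}'
merged = load_settings(["--batch", "32"], config)
```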
python train.py --config "${CONFIG_FILE}"

An example of the configuration file is as follows:
{
"train_data_root": "${TRAIN_DATA_ROOT}",
"train_data_index": "${TRAIN_DATA_INDEX}",
"train_data_files": "${TRAIN_DATA_FILES}",
"epoch": 1000,
"batch": 6,
"initial_lr": 0.0002,
"lr_reduction_factor": 0.5,
"lr_reduction_patience": 15,
"min_lr": 1e-6,
"weight_decay": 0,
"f_dropout_rate": 0.0,
"g_dropout_rate": 0.0,
"hidden_dim": 64,
"num_attention_heads": 8,
"use_layer_norm": false,
"use_batch_norm": true,
"use_residual": true,
"gpu_id": 0,
"enable_data_parallel": false,
"use_all_train_data": true,
"save_path": "${SAVE_PATH}"
}

The script for the inference task is in the ${REPO_ROOT}/src/deeprli/infer directory. It accepts parameters in the same way as the training script above.
python infer.heavy.py --config "${CONFIG_FILE}"

An example of the configuration file is as follows:
{
"model": "/path/to/trained_model.state_dict.pth",
"model_format": "state_dict",
"data_root": "${DATA_ROOT}",
"data_index": "${DATA_INDEX}",
"data_file": "${DATA_FILE}",
"batch": 32,
"gpu_id": 0,
"save_path": "${SAVE_PATH}"
}

Distributed under the MIT License. See LICENSE.txt for more information.
- Haoyu Lin (developer) - hylin@pku.edu.cn
- Jianfeng Pei (supervisor) - jfpei@pku.edu.cn
- Luhua Lai (supervisor) - lhlai@pku.edu.cn
We would like to express our gratitude to all members of Luhua Lai's group for their valuable suggestions and insights. We also acknowledge the support of computing resources provided by the high-performance computing platform at the Peking-Tsinghua Center for Life Sciences, Peking University.