PTransIPs: Identification of SARS-CoV-2 phosphorylation sites based on protein pretrained model embedding and transformer [Paper]
- 1. Setup
- 2. Generate two pretrained embeddings
- 3. Training PTransIPs Model
- 4. Evaluate the model performance on the independent test set
- 5. Some Visualization Analysis
Note: We recommend using Python 3.9 for PTransIPs and using conda to manage your environments!
To get started, simply install conda and run:
git clone https://github.com/StatXzy7/PTransIPs.git
conda create --name PTransIPs python==3.9
...
pip install -r requirements.txt

(If you wish to skip the embedding-generation step below: we have already uploaded the complete embeddings for Y sites in the ./embedding/ folder. For S/T sites, you may download the complete embeddings from All PTransIPs pretrained embeddings and place them under the directory ./embedding/.)
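Optionally, you can sanity-check the environment before moving on. The snippet below is a minimal sketch that assumes requirements.txt installs torch and transformers (used for the pretrained embeddings); adjust it to whatever your environment actually contains.

```python
# Minimal environment check (a sketch; assumes torch and transformers are
# installed via requirements.txt -- adjust to your actual dependencies).
import sys
import torch
import transformers

print("Python:", sys.version.split()[0])  # expect 3.9.x
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```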
The original fasta/csv sequence files already exist in ./data/.
To generate the sequence pretrained embedding, run ./src/pretrained_embedding_generate.py directly:

python src/pretrained_embedding_generate.py

The code generates embeddings for Y sites by default; to generate them for S/T sites instead, comment out the Y-site part and uncomment the S/T-site part before running.
You may also refer to ProtTrans for detailed explanations.
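For reference, the snippet below is a minimal sketch of how ProtT5 sequence embeddings are typically produced with ProtTrans; the checkpoint name, batching, and output handling in pretrained_embedding_generate.py may differ.

```python
# A minimal ProtT5 embedding sketch (assumptions: the Rostlab/prot_t5_xl_half_uniref50-enc
# checkpoint and a toy sequence; the actual script's model and I/O may differ).
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).to(device).eval()

seq = "MKTAYIAKQR"                           # toy peptide, for illustration only
seq = " ".join(re.sub(r"[UZOB]", "X", seq))  # ProtT5 expects space-separated residues
inputs = tokenizer(seq, return_tensors="pt").to(device)

with torch.no_grad():
    emb = model(**inputs).last_hidden_state  # shape: (1, len + 1, 1024)
print(emb.shape)
```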
To generate the structure embedding, first clone the EMBER2 project. After copying the file ./src/structure_embedding_generate.py into the EMBER2 folder, you may run:
git clone https://github.com/kWeissenow/EMBER2.git
cp src/structure_embedding_generate.py EMBER2/
python EMBER2/structure_embedding_generate.py -i "data/Y-train.fa" -o "EMBER2/output"
python EMBER2/structure_embedding_generate.py -i "data/Y-test.fa" -o "EMBER2/output"Here, structure_embedding_generate.py is set to generate embeddings for Y sites as default, if you attempt to do that for S/T sites, you may run as follows after modify the codes by commenting Y sites' part and uncommenting S/T sites' part!
python EMBER2/structure_embedding_generate.py -i "data/ST-train.fa" -o "EMBER2/output"
python EMBER2/structure_embedding_generate.py -i "data/ST-test.fa" -o "EMBER2/output"You may also refer to EMBER2 for detailed explanations.
(If you wish to skip this step: you may Download the PTransIPs model directly. Remember to place the files under ./model/Y_train or ./model/ST_train so that you can proceed to the evaluation step directly.)
Run ./src/train.py to train the PTransIPs model defined in ./src/PTransIPs_model.py.
Important parameters are:
- --Y: train the model on Y sites.
- --ST: train the model on ST sites.
- --device: specify which GPU to train the model on (pass an integer; default is cuda:0).
Example: Train PTransIPs on ST sites with default GPU:
python src/train.py --ST

Run ./src/model_performance_evaluate.py to evaluate the model performance on the independent test set.
Important parameters are:
- --Y: evaluate the model trained on Y sites.
- --ST: evaluate the model trained on ST sites.
- --path: the path of the model to evaluate; if you trained with the default code, specify ./model/Y_train for Y sites and ./model/ST_train for ST sites. (This argument cannot be empty!)
Example: Evaluate PTransIPs model trained on Y sites with default path:
python src/model_performance_evaluate.py \
--Y \
--path ./model/Y_train

The files path/PTransIPs_test_prob.npy and path/PTransIPs_text_result.txt will be created, containing the prediction probabilities and the performance results of PTransIPs, respectively (where path/ depends on which sites you chose).
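If you want to inspect the saved predictions yourself, a small script along these lines can be used. It assumes PTransIPs_test_prob.npy stores positive-class probabilities for the test sequences and uses ./model/Y_train as an example path.

```python
# Inspect saved predictions (a sketch; assumes PTransIPs_test_prob.npy stores
# positive-class probabilities -- adapt the path and threshold as needed).
import numpy as np

probs = np.load("model/Y_train/PTransIPs_test_prob.npy")
preds = (probs > 0.5).astype(int)  # illustrative 0.5 cutoff
print("probabilities:", probs.shape, "| predicted positives:", int(preds.sum()))
```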
You can view the results directly in the uploaded files under the directory figures/umap_pdf.
Run ./src/umap_test_Y.py or ./src/umap_test_ST.py to generate UMAP visualization figures. Remember to modify the model path to the one you want to visualize.
python src/umap_test_Y.py
python src/umap_test_ST.py

Run ./src/Generate_tfseq_Y.py or ./src/Generate_tfseq_ST.py to generate sequences for Two Sample Logo analysis. Remember to modify the model path to the one you want to visualize.
python src/Generate_tfseq_Y.py
python src/Generate_tfseq_ST.py

Please feel free to email us at ziyangxu0205@gmail.com or haitian.zhong@cripac.ia.ac.cn. If you find this work useful in your own research, please consider citing our work:
@ARTICLE{xu2024ptransips,
author={Xu, Ziyang and Zhong, Haitian and He, Bingrui and Wang, Xueying and Lu, Tianchi},
journal={IEEE Journal of Biomedical and Health Informatics},
title={PTransIPs: Identification of Phosphorylation Sites Enhanced by Protein PLM Embeddings},
year={2024},
volume={},
number={},
pages={1-10},
keywords={Proteins;Protein engineering;Amino acids;Training;Biological system modeling;Data models;Vectors;Phosphorylation sites;protein pre-trained language model;CNN;Transformer},
doi={10.1109/JBHI.2024.3377362}}