PTransIPs: Identification of SARS-CoV-2 phosphorylation sites based on protein pretrained model embedding and transformer [Paper]
- 1. Setup
- 2. Generate two pretrained embeddings
- 3. Training PTransIPs Model
- 4. Evaluate the model performance on the independent test set
- 5. Some Visualization Analysis
Note: We recommend using Python 3.9 for PTransIPs and using conda to manage your environments!
To get started, simply install conda and run:
git clone https://github.com/StatXzy7/PTransIPs.git
conda create --name PTransIPs python==3.9
...
pip install -r requirements.txt

(If you wish to skip the embedding-generation step below: we have already uploaded the complete embeddings for Y sites in the ./embedding/ folder. For S/T sites, you may download the complete embeddings from All PTransIPs pretrained embeddings and place them under the directory ./embedding/.)
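Optionally, you can sanity-check the environment before moving on. The snippet below is a minimal sketch that assumes requirements.txt installs torch and transformers (used for the pretrained embeddings); adjust it to whatever your environment actually contains.

```python
# Minimal environment check (a sketch; assumes torch and transformers are
# installed via requirements.txt -- adjust to your actual dependencies).
import sys
import torch
import transformers

print("Python:", sys.version.split()[0])  # expect 3.9.x
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```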
The original fasta/csv sequence files already exist in ./data/.
To generate the sequence pretrained embedding, run ./src/pretrained_embedding_generate.py directly:

python src/pretrained_embedding_generate.py

The code generates embeddings for Y sites by default; to generate them for S/T sites instead, comment out the Y-site part and uncomment the S/T-site part before running.
You may also refer to ProtTrans for detailed explanations.
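For reference, the snippet below is a minimal sketch of how ProtT5 sequence embeddings are typically produced with ProtTrans; the checkpoint name, batching, and output handling in pretrained_embedding_generate.py may differ.

```python
# A minimal ProtT5 embedding sketch (assumptions: the Rostlab/prot_t5_xl_half_uniref50-enc
# checkpoint and a toy sequence; the actual script's model and I/O may differ).
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).to(device).eval()

seq = "MKTAYIAKQR"                           # toy peptide, for illustration only
seq = " ".join(re.sub(r"[UZOB]", "X", seq))  # ProtT5 expects space-separated residues
inputs = tokenizer(seq, return_tensors="pt").to(device)

with torch.no_grad():
    emb = model(**inputs).last_hidden_state  # shape: (1, len + 1, 1024)
print(emb.shape)
```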
To generate the structure embedding, first clone the EMBER2 project. After copying the file ./src/structure_embedding_generate.py into the EMBER2 folder, you may run:
git clone https://github.com/kWeissenow/EMBER2.git
cp src/structure_embedding_generate.py EMBER2/
python EMBER2/structure_embedding_generate.py -i "data/Y-train.fa" -o "EMBER2/output"
python EMBER2/structure_embedding_generate.py -i "data/Y-test.fa" -o "EMBER2/output"Here, structure_embedding_generate.py is set to generate embeddings for Y sites as default, if you attempt to do that for S/T sites, you may run as follows after modify the codes by commenting Y sites' part and uncommenting S/T sites' part!
python EMBER2/structure_embedding_generate.py -i "data/ST-train.fa" -o "EMBER2/output"
python EMBER2/structure_embedding_generate.py -i "data/ST-test.fa" -o "EMBER2/output"You may also refer to EMBER2 for detailed explanations.
(If you wish to skip this step: you may Download the PTransIPs model directly. Remember to place the files under ./model/Y_train or ./model/ST_train so that you can proceed to the evaluation step directly.)
Run ./src/train.py to train the PTransIPs model defined in ./src/PTransIPs_model.py.
Important parameters are:
- --Y: train the model on Y sites.
- --ST: train the model on ST sites.
- --device: specify which GPU to train the model on (pass an integer; default is cuda:0).
Example: Train PTransIPs on ST sites with default GPU:
python src/train.py --ST

Run ./src/model_performance_evaluate.py to evaluate the model performance on the independent test set.
Important parameters are:
- --Y: evaluate the model trained on Y sites.
- --ST: evaluate the model trained on ST sites.
- --path: the path of the model to evaluate; if you trained with the default code, specify ./model/Y_train for Y sites and ./model/ST_train for ST sites. (This argument cannot be empty!)
Example: Evaluate PTransIPs model trained on Y sites with default path:
python src/model_performance_evaluate.py \
--Y \
--path ./model/Y_train

The files path/PTransIPs_test_prob.npy and path/PTransIPs_text_result.txt will be created, containing the prediction probabilities and the performance results of PTransIPs, respectively (where path/ depends on which sites you chose).
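If you want to inspect the saved predictions yourself, a small script along these lines can be used. It assumes PTransIPs_test_prob.npy stores positive-class probabilities for the test sequences and uses ./model/Y_train as an example path.

```python
# Inspect saved predictions (a sketch; assumes PTransIPs_test_prob.npy stores
# positive-class probabilities -- adapt the path and threshold as needed).
import numpy as np

probs = np.load("model/Y_train/PTransIPs_test_prob.npy")
preds = (probs > 0.5).astype(int)  # illustrative 0.5 cutoff
print("probabilities:", probs.shape, "| predicted positives:", int(preds.sum()))
```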
You can view the results directly in the uploaded files under the directory figures/umap_pdf.
Run ./src/umap_test_Y.py or ./src/umap_test_ST.py to generate UMAP visualization figures. Remember to modify the model path to the one you want to visualize.
python src/umap_test_Y.py
python src/umap_test_ST.py

Run ./src/Generate_tfseq_Y.py or ./src/Generate_tfseq_ST.py to generate sequences for Two Sample Logo analysis. Remember to modify the model path to the one you want to visualize.
python src/Generate_tfseq_Y.py
python src/Generate_tfseq_ST.py

Please feel free to email us at ziyangxu0205@gmail.com or haitian.zhong@cripac.ia.ac.cn. If you find this work useful in your own research, please consider citing our work:
@ARTICLE{xu2024ptransips,
author={Xu, Ziyang and Zhong, Haitian and He, Bingrui and Wang, Xueying and Lu, Tianchi},
journal={IEEE Journal of Biomedical and Health Informatics},
title={PTransIPs: Identification of Phosphorylation Sites Enhanced by Protein PLM Embeddings},
year={2024},
volume={},
number={},
pages={1-10},
keywords={Proteins;Protein engineering;Amino acids;Training;Biological system modeling;Data models;Vectors;Phosphorylation sites;protein pre-trained language model;CNN;Transformer},
doi={10.1109/JBHI.2024.3377362}}