Description
Please work to debug the step1_pdb_process.py script.
Script: https://github.com/lemaslab/CAMP/blob/master/data_prepare/step1_pdb_process.py
Input Data (RCSB PDB): Download the FASTA sequences from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz, along with the corresponding PDB structure files.
Programs: PLIP
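As a starting point for debugging, it may help to confirm that pdb_seqres.txt is being read correctly. The following is a minimal sketch, not the logic of step1_pdb_process.py. It assumes the usual header layout of that file (`>101m_A mol:protein length:154  MYOGLOBIN` followed by one sequence line). The names `SeqresRecord`, `parse_seqres`, and `peptide_candidates`, and the 50-residue cutoff, are hypothetical and only for illustration.

```python
import gzip
import io
import re
from typing import Iterable, Iterator, List, NamedTuple, TextIO


class SeqresRecord(NamedTuple):
    pdb_id: str
    chain: str
    mol: str
    length: int
    sequence: str


# Assumed header layout: ">101m_A mol:protein length:154  MYOGLOBIN"
_HEADER = re.compile(r"^>([0-9A-Za-z]+)_(\S+)\s+mol:(\S+)\s+length:(\d+)")


def parse_seqres(handle: TextIO) -> Iterator[SeqresRecord]:
    """Yield one record per header/sequence pair in a pdb_seqres-style file."""
    header = None
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            header = _HEADER.match(line)  # None if the header is malformed
        elif header is not None and line:
            pdb_id, chain, mol, length = header.groups()
            yield SeqresRecord(pdb_id, chain, mol, int(length), line)
            header = None


def peptide_candidates(records: Iterable[SeqresRecord],
                       max_len: int = 50) -> List[SeqresRecord]:
    """Keep protein chains no longer than max_len (hypothetical cutoff)."""
    return [r for r in records if r.mol == "protein" and r.length <= max_len]


def open_seqres(path: str) -> TextIO:
    """Open the gzipped download as text."""
    return gzip.open(path, "rt")


# Tiny in-memory example instead of the real download.
SAMPLE = (
    ">101m_A mol:protein length:154  MYOGLOBIN\n"
    "MVLSEGEWQLVLHVWAKVEA\n"
    ">1abc_B mol:protein length:9  PEPTIDE\n"
    "CDEFGHIKL\n"
    ">1abc_C mol:na length:4  DNA\n"
    "ACGT\n"
)
records = list(parse_seqres(io.StringIO(SAMPLE)))
short_chains = peptide_candidates(records)
```

For the real file, `parse_seqres(open_seqres("pdb_seqres.txt.gz"))` would stream the records without loading everything into memory.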
For each peptide-protein pair, the peptide sequence was obtained directly from the RCSB PDB, with binding residues marked by PepBDB, and the protein sequence was obtained by mapping to UniProt [12]. We first downloaded all complexes containing peptides as ligands from the RCSB PDB released by September 2019. Then we used the Protein Ligand Interaction Predictor (PLIP) program [10] (http://github.com/ssalentin/plip) to extract the interacting peptide and protein chains from the complex structures. Given a complex structure, PLIP recognizes seven types of non-covalent interactions: hydrogen bonds, hydrophobic interactions, pi-stacking, pi-cation interactions, salt bridges, water bridges and halogen bonds. A peptide residue and a protein residue sharing at least one non-covalent interaction were considered an interacting pair. We then retrieved the corresponding interaction labels from PepBDB [11], a structure database of peptide-protein complexes derived from the RCSB Protein Data Bank (PDB) [3–5], which records the peptide residues involved in hydrogen bonds and hydrophobic contacts with the partner proteins. The peptide binding residues detected by PepBDB were then mapped to the peptide sequences (annotated from the RCSB PDB) using an alignment tool based on the Smith-Waterman algorithm [21] (https://github.com/mengyao/Complete-Striped-SmithWaterman-Library). To ensure high data quality, we kept only those peptide sequences with at least 80% matched residues. In total, we collected 7,233 peptide-protein pairs with 3,318 distinct protein sequences and 5,283 distinct peptide sequences, and 90.99% of the pairs had labels of peptide binding residues.
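The label-mapping step described above can be sketched as follows. This is only an illustration under stated assumptions, not the script's actual implementation: it uses a plain-Python Smith-Waterman with a linear gap penalty rather than the SSW library, it assumes "matched residues" means identical aligned residues divided by the PDB peptide length, and the function names and scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are hypothetical.

```python
from typing import List, Optional, Tuple


def smith_waterman_pairs(query: str, target: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> List[Tuple[int, int]]:
    """Return aligned (query_index, target_index) pairs of the best local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            cell = max(0, score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # Trace back from the best cell until the local alignment score drops to zero.
    i, j = best_pos
    pairs = []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def transfer_labels(pdb_peptide: str, pepbdb_peptide: str,
                    pepbdb_flags: str, min_fraction: float = 0.8) -> Optional[List[int]]:
    """Map PepBDB binding flags ('0'/'1' per residue) onto the PDB peptide.

    Returns None when fewer than min_fraction of the PDB peptide residues
    are identically matched (assumed reading of the 80% rule).
    """
    if not pdb_peptide:
        return None
    pairs = smith_waterman_pairs(pdb_peptide, pepbdb_peptide)
    identical = [(q, t) for q, t in pairs if pdb_peptide[q] == pepbdb_peptide[t]]
    if len(identical) / len(pdb_peptide) < min_fraction:
        return None
    labels = [0] * len(pdb_peptide)
    for q, t in identical:
        if pepbdb_flags[t] == "1":
            labels[q] = 1
    return labels


# The PepBDB sequence lacks the first residue of the PDB sequence (9 of 10 matched).
kept = transfer_labels("ACDEFGHIKL", "CDEFGHIKL", "010000010")
dropped = transfer_labels("ACDEFGHIKL", "WWWWWWWWW", "000000000")
```

In this example `kept` carries the two binding flags shifted by one position to account for the missing leading residue, while `dropped` is `None` because nothing aligns. Comparing such per-pair outcomes against the script's output may help locate where pairs are lost.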