Jerryaa98/Protein-strands-generation

Project In Molecular Biology

De Novo Strands Generation

A machine learning project to generate new beta strands, with potential applications in synthetic biology and drug discovery.

Students: Jerry Abu Ayoub & Mario Barbara

Prerequisites

pip install biopython
pip install biotite
pip install esm

cd scripts/
./download_stride.sh

Cloning the repository

Clone the repository and change into its directory:

git clone https://github.com/Jerryaa98/Protein-strands-generation.git

cd Protein-strands-generation

Install and preprocess data

Download all the protein data files (PDB) from the ECOD database and preprocess them with this Python script:

python preprocesser.py --load_pdb True --save_faulty_ids True

Filter Outliers:

We removed all strands with length greater than 25 or less than 5, based on the strand length distribution below.

BetaStrandLength.png
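The length filter described above can be sketched in a few lines. This is a minimal illustration, not the code in preprocesser.py: the names `MIN_LEN`, `MAX_LEN` and `filter_strands` are hypothetical, and strands are assumed to be plain amino-acid strings.

```python
# Illustrative bounds taken from the text: strands shorter than 5
# or longer than 25 are treated as outliers.
MIN_LEN = 5
MAX_LEN = 25


def filter_strands(strands):
    """Return only the strands whose length lies in [MIN_LEN, MAX_LEN]."""
    return [s for s in strands if MIN_LEN <= len(s) <= MAX_LEN]


# Toy input (made-up sequences): one too short, one kept, one too long.
strands = ["VKV", "TVKVYFD", "A" * 30]
kept = filter_strands(strands)
print(kept)
```

Both bounds are inclusive here, since the text removes only strands strictly outside the 5 to 25 range.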

When preprocessing finishes, there should be a folder containing all the PDB files, a second folder containing the secondary structure files extracted with STRIDE, a strands.json file with all the meta information, and a file listing all the faulty sequence IDs.

Generation using ESM3

Mask the strands and generate new ones using ESM3. Generation is iterative: each beta strand is masked in turn, the model predicts a sequence for it, and the newly generated sequence is then used as a prior for the next generation step.

python strands_generation.py
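The iterative masking loop can be illustrated without loading the model. The sketch below is not taken from strands_generation.py: `mask_span`, `generate_sequentially` and `toy_fill` are hypothetical names, strand positions are assumed to be half-open `(start, end)` index pairs, and the `_` mask character is an assumption about how the prompt is passed to the model. In a real run, `fill_masked` would wrap the ESM3 call; here a toy stand-in fills every masked position with valine.

```python
MASK = "_"  # assumption: the mask character expected by the model wrapper


def mask_span(sequence, start, end):
    """Replace residues in the half-open range [start, end) with mask characters."""
    return sequence[:start] + MASK * (end - start) + sequence[end:]


def generate_sequentially(sequence, strand_spans, fill_masked):
    """Mask one strand at a time and let `fill_masked` predict it.

    The output of each step becomes the input of the next, so strands
    generated earlier act as context (a prior) for later ones.
    """
    current = sequence
    for start, end in strand_spans:
        prompt = mask_span(current, start, end)
        current = fill_masked(prompt)
        if len(current) != len(sequence):
            raise ValueError("the model must return a sequence of the same length")
    return current


def toy_fill(prompt):
    """Stand-in for the model: fills every masked position with valine."""
    return prompt.replace(MASK, "V")


# Made-up sequence and strand positions, for illustration only.
result = generate_sequentially("MKTAYIAKQR", [(2, 5), (7, 9)], toy_fill)
print(result)
```

Passing the predictor in as a function keeps the loop independent of any particular model API.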

By the end, each protein should have its own folder containing all the sequences generated for it as new PDB files.

There are multiple generation strategies; by default the script uses 'sequential' generation.
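One plausible reading is that a strategy determines the order in which strands are masked and regenerated. The sketch below is written under that assumption and does not reproduce the repository's implementation: `strand_order` is a hypothetical helper, and the 'reverse' and 'random' interpretations are guesses based on the strategy names that appear in the result analysis. The NxLoop strategy is left out because its behaviour is not described here.

```python
import random


def strand_order(n_strands, strategy="sequential", seed=None):
    """Return the order in which strand indices would be regenerated.

    Assumed interpretations (not confirmed by the repository):
      sequential: first strand to last
      reverse:    last strand to first
      random:     a shuffled order, reproducible for a given seed
    """
    indices = list(range(n_strands))
    if strategy == "sequential":
        return indices
    if strategy == "reverse":
        return indices[::-1]
    if strategy == "random":
        rng = random.Random(seed)  # local generator, leaves global state untouched
        rng.shuffle(indices)
        return indices
    raise ValueError(f"unknown strategy: {strategy!r}")


print(strand_order(4))
print(strand_order(4, "reverse"))
print(strand_order(4, "random", seed=0))
```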

There is also an option to generate using different seeds and temperatures; to do so, run the following script:

python hyperparam_analysis.py
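A sweep over seeds and temperatures amounts to running generation once per combination. The following is a minimal sketch of how such a grid could be enumerated; the values and the `hyperparam_grid` helper are hypothetical and are not taken from hyperparam_analysis.py.

```python
from itertools import product

# Hypothetical values, chosen only for illustration.
seeds = [0, 1, 2]
temperatures = [0.5, 1.0]


def hyperparam_grid(seeds, temperatures):
    """Return one configuration dict per (seed, temperature) combination."""
    return [
        {"seed": seed, "temperature": temperature}
        for seed, temperature in product(seeds, temperatures)
    ]


grid = hyperparam_grid(seeds, temperatures)
for config in grid:
    # A real run would call the generation routine with these settings.
    print(config)
```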

Result analysis

Compare the newly generated strands with the general space of known sequences using three metrics.

  • Structural Similarity (SVD)
  • Sequence Identity (Percentage %)
  • Sequence Similarity (BLOSUM Matrix)

python generation_comp.py
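Of the three metrics, sequence identity is the simplest to illustrate. The sketch below assumes the two strands are already aligned and of equal length, and computes the percentage of matching positions; `sequence_identity` is a hypothetical helper and may differ from what generation_comp.py does.

```python
def sequence_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Made-up strands differing at one of five positions.
identity = sequence_identity("VKVYF", "VKAYF")
print(identity)
```

For sequences of different lengths, an alignment step would be needed before a comparison like this is meaningful.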

Two examples of strategy comparison for the same protein:

  • Random generation e1af6A1.png

  • Reverse generation e1af6A1.png

An example of the NxLoop strategy with N = 10: e1qj8A1.png
