Project In Molecular Biology

De Novo Strands Generation

A machine learning project that generates new beta strands, with potential applications in synthetic biology and drug discovery.

Students: Jerry Abu Ayoub & Mario Barbara

Prerequisites

pip install biopython
pip install biotite
pip install esm

cd scripts/
./download_stride.sh

Cloning the repository

Clone the repository and change into its directory:

git clone https://github.com/Jerryaa98/Protein-strands-generation.git

cd Protein-strands-generation

Install and preprocess data

Download all the protein structure files (PDB) from the ECOD database and preprocess the data using this Python script:

python preprocesser.py --load_pdb True --save_faulty_ids True

Filter outliers:

Based on the strand length distribution below, we removed all strands longer than 25 residues or shorter than 5 residues.

When preprocessing finishes, there should be a folder containing all the PDB files, a second folder containing the secondary structure files extracted with STRIDE, a strands.json file with all the meta information, and a file listing all the faulty sequence IDs.
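The length filter described above can be sketched as follows. This is a minimal illustration, not the code in preprocesser.py: the record layout (a dict with "id" and "sequence" keys) and the function name filter_strands_by_length are assumptions, and the real strands.json schema may differ.

```python
def filter_strands_by_length(strands, min_len=5, max_len=25):
    """Keep strands whose length lies in [min_len, max_len], inclusive.

    Strands shorter than 5 or longer than 25 residues are treated as outliers.
    """
    return [s for s in strands if min_len <= len(s["sequence"]) <= max_len]


# Hypothetical records, for illustration only.
example_strands = [
    {"id": "a", "sequence": "VKV"},        # 3 residues: dropped
    {"id": "b", "sequence": "TVKVYFE"},    # 7 residues: kept
    {"id": "c", "sequence": "A" * 30},     # 30 residues: dropped
]

kept = filter_strands_by_length(example_strands)
print([s["id"] for s in kept])
```

The bounds are inclusive, so strands of exactly 5 or 25 residues are retained.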

Generation using ESM3

Mask the strands and generate new ones using ESM3. Generation is iterative: each beta strand is masked in turn, the model predicts a sequence for it, and the newly generated sequence is then used as a prior for the next generation step.

python strands_generation.py
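The iterative masking loop can be sketched without the model itself. In the sketch below the predictor is a stand-in function, not ESM3; the mask symbol, the (start, end) strand coordinates, and the names generate_iteratively and toy_predict are all assumptions made for illustration.

```python
MASK = "_"  # placeholder mask symbol (assumption)


def generate_iteratively(sequence, strands, predict):
    """Mask each strand in turn and fill it with the predictor's output.

    `strands` is a list of (start, end) half-open index pairs.
    `predict` maps a masked sequence to a full sequence of the same length.
    Each newly generated strand stays in place, so it is part of the
    context (the prior) when the next strand is masked.
    """
    current = sequence
    for start, end in strands:
        masked = current[:start] + MASK * (end - start) + current[end:]
        predicted = predict(masked)
        if len(predicted) != len(sequence):
            raise ValueError("predictor must preserve sequence length")
        # Keep the unmasked context fixed; take only the new strand residues.
        current = current[:start] + predicted[start:end] + current[end:]
    return current


def toy_predict(masked):
    """Stand-in for the model: fills every masked position with valine."""
    return masked.replace(MASK, "V")


result = generate_iteratively("AAAKKKAAAKKK", [(3, 6), (9, 12)], toy_predict)
print(result)
```

Swapping toy_predict for a call into the real model would keep the loop unchanged, since the loop only relies on the predictor returning a sequence of the same length.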

When the script finishes, there should be a folder for each protein containing all the sequences generated for it, saved as new PDB files.

There are multiple generation strategies; by default, the script uses 'sequential' generation.
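One way to think about a generation strategy is as the order in which strands are masked. The sketch below covers three orderings under that assumption; the function name strand_order and the exact strategy strings other than 'sequential' are hypothetical and may not match the options in strands_generation.py.

```python
import random


def strand_order(n_strands, strategy="sequential", seed=None):
    """Return the order in which strand indices are visited."""
    order = list(range(n_strands))
    if strategy == "sequential":
        return order                      # first strand to last
    if strategy == "reverse":
        return order[::-1]                # last strand to first
    if strategy == "random":
        rng = random.Random(seed)         # seeded for reproducibility
        rng.shuffle(order)
        return order
    raise ValueError(f"unknown strategy: {strategy!r}")


print(strand_order(4))
print(strand_order(4, "reverse"))
print(strand_order(4, "random", seed=0))
```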

There is also an option to generate using different seeds and temperatures. To do so, run the following script:

python hyperparam_analysis.py
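A sweep over seeds and temperatures amounts to enumerating every combination of the two. The sketch below shows that enumeration only; the function name hyperparam_grid and the example values are illustrative assumptions, not the settings used by hyperparam_analysis.py.

```python
from itertools import product


def hyperparam_grid(seeds, temperatures):
    """Return one configuration per (seed, temperature) combination."""
    return [
        {"seed": seed, "temperature": temperature}
        for seed, temperature in product(seeds, temperatures)
    ]


# Illustrative values only.
grid = hyperparam_grid(seeds=[0, 1, 2], temperatures=[0.5, 1.0])
for config in grid:
    print(config)
```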

Result analysis

Compare the newly generated strands with the general space of known sequences using three metrics:

Structural Similarity (SVD)
Sequence Identity (Percentage %)
Sequence Similarity (BLOSUM Matrix)

python generation_comp.py
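Of the three metrics, sequence identity is the simplest to state precisely: the percentage of positions at which two sequences carry the same residue. The sketch below assumes the two strands have equal length and are already aligned position by position; generation_comp.py may handle alignment and gaps differently.

```python
def sequence_identity(a, b):
    """Percentage of positions where two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Four of five positions match.
print(sequence_identity("VKVYF", "VKIYF"))
```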

Two examples of strategy comparisons for the same protein:

Random generation
Reverse generation

An example of the NxLoop strategy with N = 10:
