A machine learning project to generate new Beta strands with potential applications in synthetic biology and drug discovery.
pip install biopython
pip install biotite
pip install esm
cd scripts/
./download_stride.sh
Clone the repository and change the working directory:
git clone https://github.com/Jerryaa98/Protein-strands-generation.git
cd Protein-strands-generation
Download all the protein data files (PDB) from the ECOD database and preprocess the data using this Python script:
python preprocesser.py --load_pdb True --save_faulty_ids True
Filter Outliers:
We removed all strands longer than 25 residues or shorter than 5 residues, based on the strand length distribution below.
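The length filter above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `preprocesser.py`: the function name `filter_outliers` and the example strands are hypothetical, and only the bounds 5 and 25 come from the text.

```python
MIN_LEN = 5   # shortest strand kept (bound stated above)
MAX_LEN = 25  # longest strand kept (bound stated above)


def filter_outliers(strands, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only strands whose length lies in [min_len, max_len]."""
    return [s for s in strands if min_len <= len(s) <= max_len]


# Hypothetical example strands, not taken from the dataset.
example = ["VKV", "TVKVYF", "A" * 30]
print(filter_outliers(example))  # ['TVKVYF']
```

Both bounds are inclusive here, since only strands longer than 25 or shorter than 5 are removed.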
When finished, there should be a folder containing all the PDB files, another folder containing all the secondary structure files extracted using STRIDE, a strands.json file with all the meta information, and a file listing all the faulty sequence IDs.
Mask the strands and generate new strands using ESM3. Generation proceeds by iteratively masking each beta strand, using the model to predict a sequence for it, and then using the newly generated sequence as a prior for the next generation step.
python strands_generation.py
By the end, there should be a folder for each protein containing all of its generated sequences as new PDB files.
There are multiple strategies for generation; by default, the script uses 'sequential' generation.
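The sequential strategy described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the ESM3 call is replaced by a hypothetical `predict` callable, and the mask character, the `(start, end)` strand representation, and all names are chosen for this sketch rather than taken from `strands_generation.py`.

```python
from typing import Callable, List, Tuple

MASK = "_"  # placeholder mask character; an assumption for this sketch


def generate_sequential(
    sequence: str,
    strands: List[Tuple[int, int]],
    predict: Callable[[str], str],
) -> str:
    """Mask each strand in turn and fill it with the predictor's output.

    strands holds (start, end) index pairs with end exclusive.
    predict takes a masked sequence and returns a full sequence of the
    same length; it stands in for the model call, which is not shown.
    """
    current = sequence
    for start, end in strands:
        masked = current[:start] + MASK * (end - start) + current[end:]
        predicted = predict(masked)
        if len(predicted) != len(current):
            raise ValueError("predictor must preserve sequence length")
        # Only the masked span is taken from the prediction; the updated
        # sequence then serves as the prior for the next strand.
        current = current[:start] + predicted[start:end] + current[end:]
    return current


def dummy_predict(masked: str) -> str:
    """Demonstration-only predictor: fills every mask with valine."""
    return masked.replace(MASK, "V")


print(generate_sequential("MKTAYIAKQR", [(1, 4), (6, 9)], dummy_predict))
# MVVVYIVVVR
```

Because each iteration starts from the sequence produced by the previous one, earlier generated strands are visible to the predictor when later strands are filled in.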
There is also an option to generate using different seeds and temperatures; to do so, run the following script:
python hyperparam_analysis.py
Compare the newly generated strands with the general space of known sequences using three metrics:
- Structural Similarity (SVD)
- Sequence Identity (Percentage %)
- Sequence Similarity (BLOSUM Matrix)
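The two sequence-based metrics can be illustrated with a short sketch. It assumes equal-length, already aligned strands, and the function names and toy score table are hypothetical; how `generation_comp.py` aligns sequences and which BLOSUM variant it uses are not stated here.

```python
def sequence_identity(a: str, b: str) -> float:
    """Percentage of aligned positions with identical residues."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def similarity_score(a: str, b: str, matrix):
    """Sum of substitution scores over aligned positions.

    matrix is anything indexable as matrix[x, y], for example a dict
    keyed by residue pairs.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[x, y] for x, y in zip(a, b))


# Toy scores for illustration only; these are not BLOSUM values.
toy = {("V", "V"): 4, ("V", "I"): 3, ("I", "V"): 3, ("I", "I"): 4}

print(sequence_identity("VIVI", "VVVI"))      # 75.0
print(similarity_score("VIVI", "VVVI", toy))  # 15
```

With Biopython installed, a matrix loaded through `Bio.Align.substitution_matrices.load("BLOSUM62")` should be usable in place of the toy table, since it can be indexed by a pair of residue letters.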
python generation_comp.py
Two examples of strategy comparisons for the same protein:




