- Code base, termed "Site Saturation Mutagenesis Landscape Analysis (SSMuLA)", for our paper "Evaluation of Machine Learning-Assisted Directed Evolution Across Diverse Combinatorial Landscapes"
- Data and results can be found on Zenodo

- For the overall environment `SSMuLA`:

  ```
  conda env create -f SSMuLA.yml
  ```

- Then install EVmutation from the `develop` branch after the environment is created
- For the ESM-IF environment:

  ```
  conda create -n inverse python=3.9
  conda activate inverse
  conda install pytorch cudatoolkit=11.3 -c pytorch
  conda install pyg -c pyg -c conda-forge
  conda install pip
  pip install biotite
  pip install git+https://github.com/facebookresearch/esm.git
  ```

  or install the ESM-IF environment `esmif`:

  ```
  conda env create -f esmif.yml
  ```
- For the CoVES environment `coves`:

  ```
  conda env create -f coves.yml
  ```

- For installing the Triad command line, see the instructions here
- For running ESM-2 fine-tuning simulations, use the `finetune.yml` environment:

  ```
  conda env create -f finetune.yml
  ```

- Frozen environments can be found in `envs/frozen`
- The `data/` folder is organized by protein type. Each protein directory contains:
  - `.fasta`: FASTA file for the parent sequence
  - `.pdb`: PDB file for the parent structure
  - `.model`: EVmutation model file
  - `fitness_landscape/`: folder containing CSV files for all fitness landscapes for this protein type, each listing amino acid substitutions and their corresponding fitness values from the original sources
  - `scale2max/`: folder containing processed fitness CSV files returned from the `process_all` function in the `SSMuLA.fitness_process_vis` module, where the maximum fitness value is normalized to 1 for each landscape
- Landscapes are summarized in the table below and described in detail in the paper:
| Landscape | PDB ID | Sites |
|---|---|---|
| ParD2 | 6X0A | I61, L64, K80 |
| ParD3 | 5CEG | D61, K64, E80 |
| GB1 | 2GI9 | V39, D40, G41, V54 |
| DHFR | 6XG5 | A26, D27, L28 |
| T7 | 1CEZ | N748, R756, Q758 |
| TEV | 1LVM | T146, D148, H167, S170 |
| TrpB3A | 8VHH | A104, E105, T106 |
| TrpB3B | | E105, T106, G107 |
| TrpB3C | | T106, G107, A108 |
| TrpB3D | | T117, A118, A119 |
| TrpB3E | | F184, G185, S186 |
| TrpB3F | | L162, I166, Y301 |
| TrpB3G | | V227, S228, Y301 |
| TrpB3H | | S228, G230, S231 |
| TrpB3I | | Y182, V183, F184 |
| TrpB4 | | V183, F184, V227, S228 |
- Run `python -m tests.test_preprocess`; refer to the test file and the script documentation for further details
- Processed with `fitness_process_vis`:
  - Rename columns to `AAs`, `AA1`, `AA2`, `AA3`, `AA4`, `fitness`; add `active` if not already there and add `muts` columns
  - Scale to `max` (with an option to scale to `parent`)
  - Processed data saved in the `scale2max` folder
  - The landscape stats will be saved
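As a concrete illustration of the scale-to-max step, the sketch below normalizes a toy landscape so that its best variant has fitness 1. This is illustrative only and not the actual `fitness_process_vis` code; the `scale_to_max` helper and the dictionary representation are assumptions for the example.

```python
# Illustrative sketch of the scale-to-max normalization (not the actual
# SSMuLA.fitness_process_vis implementation): divide every fitness value
# by the landscape maximum so the best variant has fitness 1.
def scale_to_max(landscape):
    """landscape: dict mapping variant string (AAs) -> raw fitness."""
    max_fit = max(landscape.values())
    return {aas: fit / max_fit for aas, fit in landscape.items()}

raw = {"VDGV": 8.8, "WWLA": 0.1, "ADGV": 4.4}  # toy GB1-style 4-site variants
scaled = scale_to_max(raw)
print(scaled["VDGV"])  # parent scales to 1.0
```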
- Run `python -m tests.local_optima`; refer to the test file and the script documentation for further details
- Calculate local optima with the `calc_local_optima` function in `SSMuLA.local_optima`
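For intuition, a variant is a local optimum when no single-substitution (Hamming distance 1) neighbor in the landscape has higher fitness. The toy sketch below illustrates that definition; it is a conceptual example, not the `SSMuLA.local_optima` implementation.

```python
# Conceptual sketch of the local-optimum check (not SSMuLA.local_optima):
# a variant is a local optimum if no Hamming-distance-1 neighbor in the
# landscape has strictly higher fitness.
def is_local_optimum(variant, landscape):
    for neighbor, fit in landscape.items():
        hd = sum(a != b for a, b in zip(variant, neighbor))
        if hd == 1 and fit > landscape[variant]:
            return False
    return True

landscape = {"AA": 1.0, "AB": 0.5, "BA": 0.8, "BB": 0.9}
optima = [v for v in landscape if is_local_optimum(v, landscape)]
print(optima)  # ['AA', 'BB']
```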
- Run `python -m tests.pairwise_epistasis`; refer to the test file and the script documentation for further details
- Calculate pairwise epistasis with the `calc_all_pairwise_epistasis` function in `SSMuLA.pairwise_epistasis`
  - Start from all active variants scaled to max fitness, without post filtering
  - Initial results will be saved under the default path `results/pairwise_epistasis` (corresponding to the `active_start` subfolder in the Zenodo repo)
  - Post-process the output with the `plot_pairwise_epistasis` function in `SSMuLA.pairwise_epistasis`
  - Post-processed results will be saved under the default path `results/pairwise_epistasis_dets` with summary files (corresponding to the `processed` subfolder), and `results/pairwise_epistasis_vis` for each landscape with a master summary file across all landscapes (`pairwise_epistasis_summary.csv`)
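As background for what the pairwise epistasis calculation measures, the sketch below shows the standard deviation-from-additivity form: under an additive null model, the expected double-mutant fitness is the sum of the single-mutant effects. This is a conceptual example under that assumed null model, not the `SSMuLA.pairwise_epistasis` code, whose exact definition (e.g. fitness scale) may differ.

```python
# Conceptual sketch of pairwise epistasis between two positions (not the
# SSMuLA.pairwise_epistasis implementation): under an additive null model,
# the expected double-mutant fitness is f_A + f_B - f_WT, and epistasis is
# the deviation of the observed double mutant from that expectation.
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    expected = f_a + f_b - f_wt
    return f_ab - expected

# Toy values: each single mutation adds 0.2, but together they add only 0.1
eps = pairwise_epistasis(f_wt=1.0, f_a=1.2, f_b=1.2, f_ab=1.1)
print(eps)  # negative => antagonistic epistasis
```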
- The current pipeline runs EVmutation and ESM together, and then appends the remaining predictors
- All EVmutation predictions are run with EVcouplings
  - All settings remain default
  - Model parameters in the `.model` files are downloaded and renamed
- The logits will be generated and saved in the output folder
- Run `python -m tests.test_ev_esm`; refer to the test file and the script documentation for further details
- Directly calculated from `n_mut`
- For Hamming distance testing, run `python -m tests.hamming_distance` to deploy `run_hd_avg_fit` and `run_hd_avg_metric` from `SSMuLA.calc_hd`; refer to the test file and the script documentation for further details
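For reference, the Hamming distance used throughout (for cutoffs and ensembles) is simply the number of positions at which two equal-length variants differ; a minimal sketch:

```python
# Minimal illustration of the Hamming distance between two variants:
# the number of positions at which two equal-length sequences differ.
def hamming_distance(seq1, seq2):
    assert len(seq1) == len(seq2), "variants must be the same length"
    return sum(a != b for a, b in zip(seq1, seq2))

print(hamming_distance("VDGV", "VDGA"))  # 1
print(hamming_distance("VDGV", "WLGA"))  # 3
```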
- Run `python -m tests.test_esmif`; refer to the test file and the script documentation for further details
- Generate the input FASTA files with `get_all_mutfasta` from `SSMuLA.zs_data` to be used in ESM-IF
- Set up the `inverse` environment for ESM-IF as described in the installation section above, or use `conda env create -f esmif.yml`
- Within the `esmif` folder, in the new environment, run:

  ```
  ./esmif.sh
  ```

- ESM-IF results will be saved in the same directory as the `esmif.sh` script
- Follow the instructions in the CoVES repository
- Prepare input data in the `coves_data` folder
- Run `run_all_coves` from `SSMuLA.run_coves` to get all scores
- Append scores with `append_all_coves_scores` from `SSMuLA.run_coves`
- Prep the mutation file in `.mut` format, such as `A_1A+A_2A+A_3A+A_4A`, with the `TriadGenMutFile` class in `SSMuLA.triad_prepost`
- Run `python -m tests.test_triad_pre`; refer to the test file and the script documentation for further details
- With the `triad-2.1.3` local command line:
  - Prepare structure with `2prep_structures.sh`
  - Run `3getfixed.sh`
- Parse results with the `ParseTriadResults` class in `SSMuLA.triad_prepost`
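To make the `.mut` format concrete, the hypothetical helper below (not part of SSMuLA; `TriadGenMutFile` handles this in practice) expands a variant string into the `chain_positionAA` entries joined by `+`, as in the `A_1A+A_2A+A_3A+A_4A` example above:

```python
# Hypothetical helper (not part of SSMuLA) expanding a variant string into
# the Triad .mut format: "<chain>_<position><amino acid>" joined by "+".
def to_mut_string(variant, positions, chain="A"):
    """variant: e.g. "VDGV"; positions: matching residue numbers."""
    return "+".join(f"{chain}_{pos}{aa}" for pos, aa in zip(positions, variant))

# GB1 sites from the landscape table, chain A
print(to_mut_string("VDGV", [39, 40, 41, 54]))  # A_39V+A_40D+A_41G+A_54V
```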
- Run `python -m tests.test_zs`; refer to the test file and the script documentation for further details
- Run `de_simulations` and visualise with `plot_de_simulations`
- Run `python -m tests.test_de` and `python -m tests.test_de_vis`; refer to the test file and the script documentation for further details
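For intuition, one common way to simulate directed evolution on a complete landscape is a greedy single-step walk: from the current variant, evaluate all Hamming-distance-1 neighbors and move to the best one until no neighbor improves fitness. The sketch below illustrates that scheme only; it is not the `de_simulations` code, which supports the specific DE strategies described in the paper.

```python
# Conceptual sketch of a greedy single-step directed-evolution walk
# (not the SSMuLA de_simulations implementation): repeatedly move to the
# best Hamming-distance-1 neighbor until reaching a local optimum.
def greedy_walk(start, landscape):
    current = start
    while True:
        neighbors = [v for v in landscape
                     if sum(a != b for a, b in zip(v, current)) == 1]
        best = max(neighbors, key=lambda v: landscape[v], default=current)
        if landscape[best] <= landscape[current]:
            return current  # no improving neighbor: local optimum reached
        current = best

landscape = {"AA": 0.2, "AB": 0.5, "BA": 0.1, "BB": 0.9}
print(greedy_walk("AA", landscape))  # climbs AA -> AB -> BB
```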
- Use the `MLDE_lite` environment
- For using learned ESM embeddings, first run `gen_all_learned_emb` from `SSMuLA.gen_learned_emb`; otherwise skip this step
- Run `python -m tests.test_gen_learned_emb`
- Run `run_all_mlde_parallelized` from `SSMuLA.mlde_lite` to run simulations
- Run `python -m tests.test_mlde`
- Important options include:
  - `n_mut_cutoffs`: list of integers for Hamming distance cutoff options, where `[0]` means none and `[2]` means a Hamming distance cutoff of two for the ensemble
  - `zs_predictors`: list of strings for zero-shot predictors, i.e. `["none", "Triad", "ev", "esm"]`, where `none` means no focused training and thus default MLDE runs; the list can be extended for non-Hamming-distance ensembles, including `["Triad-esmif", "Triad-ev", "Triad-esm", "two-best"]`
  - `ft_lib_fracs`: list of floats for the fraction of the library to use for focused training, i.e. `[0.5, 0.25, 0.125]`
  - `encoding`: list of strings for encoding options, i.e. `["one-hot"] + DEFAULT_LEARNED_EMB_COMBO`
  - `model_classes`: list of strings for model classes, i.e. `["boosting", "ridge"]`
  - `n_samples`: list of integers for the number of training samples to use, i.e. `[96, 384]`
  - `n_split`: integer for the number of splits for cross-validation, i.e. `5`
  - `n_replicate`: integer for the number of replicates for each model, i.e. `50`
  - `n_tops`: list of integers for the number of variants used to test the prediction, i.e. `[96, 384]`
- Refer to the test file and the script documentation for further details
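To illustrate how `zs_predictors` and `ft_lib_fracs` interact in focused training, the sketch below ranks variants by a zero-shot score, keeps the top fraction of the library, and samples the training set from that subset. The helper and its names are assumptions for illustration, not the `SSMuLA.mlde_lite` implementation.

```python
# Illustrative sketch of focused training (not SSMuLA.mlde_lite): rank
# variants by a zero-shot score, keep the top fraction of the library
# (cf. ft_lib_fracs), and draw the training samples from that subset.
import random

def focused_training_library(zs_scores, frac, n_samples, seed=0):
    """zs_scores: dict variant -> zero-shot score (higher = better)."""
    ranked = sorted(zs_scores, key=zs_scores.get, reverse=True)
    focused = ranked[: max(1, int(len(ranked) * frac))]
    rng = random.Random(seed)
    return rng.sample(focused, min(n_samples, len(focused)))

zs = {f"V{i}": s for i, s in enumerate([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])}
train = focused_training_library(zs, frac=0.5, n_samples=2)
print(train)  # two variants drawn from the top-scoring half
```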
- Run `MLDESum` from `SSMuLA.mlde_analysis` to get the summary dataframe and optional visualization
- Run `python -m tests.test_mlde_vis`
- See details in the alde4ssmula repository
- Run `aggregate_alde_df` from `SSMuLA.alde_analysis` to get the summary dataframe
- Run `python -m tests.test_alde`
- Run `train_predict_per_protein` from `SSMuLA.plm_finetune` for ESM-2 LoRA fine-tuning simulations
- All notebooks in `fig_notebooks` are used to reproduce the figures in the paper, with files downloaded from Zenodo