kalininalab/StructGuy_evaluation

StructGuy Evaluation Repository

Setup:

Step 1: Install StructMAn

(estimated time: X, required disk space: X, required memory:) Follow the instructions in the StructMAn repository.

Step 2: Install StructGuy

(estimated time: X, required disk space: 1 TB, required memory: 800 GB) Follow the instructions in the StructGuy repository.

Step 3: Clone the Evaluation repository

git clone https://github.com/kalininalab/StructGuy_evaluation.git

This will create the StructGuy_evaluation directory in the current directory.

Reproduce the MAVE goldstandard (MGS) dataset:

Step 1: Install MaveTools-Fork

First, clone our fork of the MaveTools package:

git clone https://github.com/AlexanderGress/MaveTools.git

This will create a directory called MaveTools in the current directory. From there, run the pip installation with the StructMAn/StructGuy conda environment active (conda activate [environment name (e.g. structman)]):

pip install MaveTools/

Step 2: Run the goldstandard dataset generation script

Go into the StructGuy_evaluation/training_dataset_generation/ directory and call:

python make_gold_standard.py

(estimated time: X, required disk space: X)
This script downloads the MaveDB and ProteinGym substitutions datasets and stores them in StructGuy_evaluation/datasets/. It then uses those two resources to generate the scaled and filtered goldstandard dataset intended for training supervised machine learning methods.

Note

To be further utilized by StructGuy, the dataset needs to be featurized.

Generate features for a dataset:

Whether a dataset is used for training or for prediction, a corresponding feature table has to be calculated. The first step is to calculate structural features by applying the StructMAn annotation pipeline. To this end, the dataset needs to be prepared so that StructMAn can process it, which is explained in this tutorial.

Tip

We provide a toy dataset in StructGuy_evaluation/datasets/Toy_example/toy_example.fasta. It is in a format ready to be processed by StructMAn. It contains the MAVE data for five proteins, the minimum number of proteins required for the five-fold cross-validation in the hyperparameter optimization.
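To illustrate why five proteins are the minimum, the following minimal sketch assigns whole proteins to folds so that every fold holds out at least one protein. It assumes the cross-validation is split protein-wise; the function name `assign_protein_folds` and the protein identifiers are hypothetical and not part of StructGuy.

```python
def assign_protein_folds(protein_ids, n_folds=5):
    """Assign each protein to one fold (round-robin), keeping proteins intact."""
    unique_ids = sorted(set(protein_ids))
    if len(unique_ids) < n_folds:
        # With fewer proteins than folds, at least one fold would be empty.
        raise ValueError(
            f"need at least {n_folds} proteins, got {len(unique_ids)}"
        )
    return {pid: index % n_folds for index, pid in enumerate(unique_ids)}


if __name__ == "__main__":
    # Hypothetical identifiers standing in for the five toy proteins.
    folds = assign_protein_folds(["P1", "P2", "P3", "P4", "P5"])
    print(folds)
```

With exactly five proteins, each fold contains one protein; with four, the sketch raises an error because one fold would have nothing to validate on.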

Calling StructMAn:

structman -i [path to dataset] -n [number of threads]
  • -i Path to a StructMAn-readable dataset file.
  • -n Provides the maximal number of threads that should be used.

Tip

For processing the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:

structman -i toy_example.fasta

StructMAn generates a config file named [name of dataset].structguy_project.conf in the corresponding output directory.
It is required for the subsequent calls of StructGuy.
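Judging from the example paths used in this README, the config file appears to land in Output/[name of dataset]/ relative to the directory in which structman was called. The sketch below derives that path from a dataset file name. The layout is an assumption inferred from those examples, not taken from the StructMAn documentation, and `expected_project_conf` is a hypothetical helper.

```python
from pathlib import Path


def expected_project_conf(dataset_path, working_dir="."):
    """Guess where StructMAn writes the project config for a dataset.

    Assumption: the layout is Output/<name>/<name>.structguy_project.conf,
    inferred from the example paths in this README, not from StructMAn itself.
    """
    name = Path(dataset_path).stem  # "toy_example.fasta" -> "toy_example"
    return Path(working_dir) / "Output" / name / f"{name}.structguy_project.conf"


if __name__ == "__main__":
    print(expected_project_conf("toy_example.fasta").as_posix())
```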

Calling the non-structural feature generation script:

structguy generate_features -i [path to structguy_project.conf] -n [number of threads]
  • -i Path to the structguy_project.conf file produced by StructMAn.
  • -n Provides the maximal number of threads that should be used.

Tip

For processing the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:

structguy generate_features -i Output/toy_example/toy_example.structguy_project.conf

Train a model:

With and without hyperparameter optimization

Train StructGuy

Tip

The easiest way to use StructGuy is to download the model we trained in (add_link_to_publication_later) from Hugging Face.

Without Hyperparameter Optimization

structguy build_model -i [path to name_of_dataset.structguy_project.conf] --nocv --nohpo --hp [path to a hyperparameter list] -n [number of threads]
  • -i Path to the structguy_project.conf file produced by StructMAn.
  • --nocv Skips any cross-validation setups and directly trains on the full dataset.
  • --nohpo Skips the hyperparameter optimization.
  • --hp Path to a file with a list of hyperparameters. Optional; if not given, default parameters are used. An example can be found here: StructGuy_evaluation/configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf
  • -n Provides the maximal number of threads that should be used.
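For scripted runs, the call can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in this README; the helper name `build_model_command` and its `optimize` switch are hypothetical. It prints the command rather than running it.

```python
import shlex


def build_model_command(project_conf, hyperparameters=None, threads=None,
                        optimize=False):
    """Assemble the `structguy build_model` call as a list of arguments."""
    command = ["structguy", "build_model", "-i", str(project_conf)]
    if not optimize:
        # Skip cross-validation and hyperparameter optimization.
        command += ["--nocv", "--nohpo"]
    if hyperparameters is not None:
        command += ["--hp", str(hyperparameters)]
    if threads is not None:
        command += ["-n", str(threads)]
    return command


if __name__ == "__main__":
    command = build_model_command(
        "Output/toy_example/toy_example.structguy_project.conf",
        hyperparameters="../../configs_and_parameters/"
                        "structguy_hyperparameters_for_goldstandard.conf",
        threads=4,
    )
    # Print instead of executing; pass the list to subprocess.run to run it.
    print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems when paths contain spaces.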

Tip

For training a model on the toy example with the original set of hyperparameters, go to StructGuy_evaluation/datasets/Toy_example/ and call:

structguy build_model -i Output/toy_example/toy_example.structguy_project.conf --nocv --nohpo --hp ../../configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf

With Hyperparameter Optimization

Warning

This will consume great amounts of computing resources and time.

structguy build_model -i [path to name_of_dataset.structguy_project.conf] --hp [path to a hyperparameter list] -n [number of threads]
  • -i Path to the structguy_project.conf file produced by StructMAn.
  • --hp Path to a file with a list of hyperparameters that are used as the starting point for the hyperparameter optimization. Optional; if not given, default parameters are used. An example can be found here: StructGuy_evaluation/configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf
  • -n Provides the maximal number of threads that should be used.

Tip

For optimizing the hyperparameters and training a model on the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:

structguy build_model -i Output/toy_example/toy_example.structguy_project.conf

Predict a dataset:

structguy predict -i [path to name_of_dataset.structguy_project.conf] -m [path to model.dump file] -n [number of threads]
  • -i Path to the structguy_project.conf file produced by StructMAn.
  • -m Path to an already trained model, either generated by structguy build_model or downloaded from Hugging Face.
  • -n Provides the maximal number of threads that should be used.

Predict ProteinGym substitutions

Note

This section describes the preparations necessary for the "Comparison to unsupervised model from the ProteinGym benchmark" evaluation from the paper. This step is computationally expensive and can be omitted; we provide the predictions in StructGuy_evaluation/benchmarks/proteingym_substitutions_predicted_by_structguy.tsv.gz

Step 1:

Go to the StructGuy_evaluation/evaluations/ directory and call:

python prepare_proteingym_for_structguy.py

This downloads the ProteinGym substitutions dataset and writes the StructMAn-readable input file to StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/proteingym_substitutions.fasta

Step 2:

Go to the StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/ directory and call:

structman -i proteingym_substitutions.fasta

This will generate an Output directory and, within it, the file StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf.

Step 3:

Call the StructGuy feature generation pipeline:

structguy generate_features -i Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf

The dataset is now fully featurized and ready for the StructGuy prediction process.

Step 4:

Call the StructGuy prediction pipeline:

structguy predict -i Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf -m [path to model.dump file]
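Steps 2 to 4 can be chained in a small driver script. The sketch below is one possible way to do this, using only the commands shown in this section; the helper names and the `dry_run` switch are hypothetical, and "model.dump" is a placeholder for the path to a trained model. It is meant to be run from the dataset directory named in Step 2.

```python
import shlex
import subprocess


def proteingym_pipeline(model_path, name="proteingym_substitutions"):
    """Return the three commands of Steps 2 to 4 in execution order."""
    conf = f"Output/{name}/{name}.structguy_project.conf"
    return [
        ["structman", "-i", f"{name}.fasta"],
        ["structguy", "generate_features", "-i", conf],
        ["structguy", "predict", "-i", conf, "-m", str(model_path)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the chain as soon as one step fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "model.dump" is a placeholder for the path to a trained model.
    run_pipeline(proteingym_pipeline("model.dump"))
```

By default the script only prints the commands, which makes it easy to check the paths before committing to a long run.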

Predict ProteinGym Clinical Substitutions

Note

This section describes the preparations necessary for the "Application to ProteinGym clinical substitutions" evaluation from the paper. This step is computationally expensive and can be omitted; we provide the predictions in StructGuy_evaluation/benchmarks/proteingym_clinical_substitutions_predicted_by_structguy.tsv.gz

Step 1:

Go to the StructGuy_evaluation/evaluations/ directory and call:

python prepare_clinvar_for_structguy.py

This downloads the ProteinGym clinical substitutions dataset and writes the StructMAn-readable input file to StructGuy_evaluation/datasets/ProteinGym_ClinVar/pg_clinvar.fasta

Step 2:

Go to the StructGuy_evaluation/datasets/ProteinGym_ClinVar/ directory and call:

structman -i pg_clinvar.fasta

This will generate an Output directory and, within it, the file StructGuy_evaluation/datasets/ProteinGym_ClinVar/Output/pg_clinvar/pg_clinvar.structguy_project.conf.

Step 3:

Call the StructGuy feature generation pipeline:

structguy generate_features -i Output/pg_clinvar/pg_clinvar.structguy_project.conf

The dataset is now fully featurized and ready for the StructGuy prediction process.

Step 4:

Call the StructGuy prediction pipeline:

structguy predict -i Output/pg_clinvar/pg_clinvar.structguy_project.conf -m [path to model.dump file]

Evaluations

Generalization vs. Unsupervised ProteinGym Benchmark

Note

This section corresponds to the "Comparison to unsupervised model from the ProteinGym benchmark" evaluation from the paper.

Optional Step 0: Predict ProteinGym Substitutions with StructGuy

Perform the steps explained in the "Predict ProteinGym substitutions" section.

Note

If successful, this step generates the StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/Output/proteingym_substitutions/predictions.tsv file, which is then used as the basis for the evaluation. If this file is not present, StructGuy_evaluation/benchmarks/proteingym_substitutions_predicted_by_structguy.tsv.gz is used automatically.
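The fallback described in this note can be pictured as follows. This is an illustrative sketch of the selection logic only, not the code of the evaluation script; `select_predictions` is a hypothetical helper and the repository root is passed in as an argument.

```python
from pathlib import Path


def select_predictions(repo_root):
    """Return the locally generated predictions if present, else the bundled file."""
    root = Path(repo_root)
    local = (root / "datasets" / "proteingym_substitutions_for_structguy"
             / "Output" / "proteingym_substitutions" / "predictions.tsv")
    bundled = (root / "benchmarks"
               / "proteingym_substitutions_predicted_by_structguy.tsv.gz")
    # Prefer freshly generated predictions over the precomputed ones.
    return local if local.is_file() else bundled


if __name__ == "__main__":
    print(select_predictions("StructGuy_evaluation"))
```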

Step 1: Call Evaluation Script

Go to the StructGuy_evaluation/evaluations/ directory and call:

python generalization_vs_unsupervised_benchmark.py

This will generate the generalization_vs_unsupervised.tsv and generalization_vs_unsupervised_old_pg.tsv results tables.

Evaluation on ProteinGym Clinical Substitutions

Note

This section corresponds to the "Application to ProteinGym clinical substitutions" evaluation from the paper.

Optional Step 0: Predict ProteinGym Clinical Substitutions with StructGuy

Perform the steps explained in the "Predict ProteinGym Clinical Substitutions" section.

Note

If successful, this step generates the StructGuy_evaluation/datasets/ProteinGym_ClinVar/Output/pg_clinvar/predictions.tsv file, which is then used as the basis for the evaluation. If this file is not present, StructGuy_evaluation/benchmarks/proteingym_clinical_substitutions_predicted_by_structguy.tsv.gz is used automatically.

Step 1: Call Evaluation Script

Go to the StructGuy_evaluation/evaluations/ directory and call:

python evaluate_proteingym_clinvar.py

This will print the average protein-wise AUC value for the StructGuy predictions on the ProteinGym clinical substitutions dataset to the terminal.
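For readers unfamiliar with the metric, the sketch below shows one way an average protein-wise AUC can be computed: a ROC AUC per protein, then the mean over proteins. It is not the code of evaluate_proteingym_clinvar.py. The record layout, the convention that label 1 marks the positive class and receives higher scores, and the choice to skip proteins with only one class are all assumptions made for illustration.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        return None  # AUC is undefined when one class is missing
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def mean_proteinwise_auc(records):
    """records: iterable of (protein_id, label, score) tuples."""
    by_protein = defaultdict(lambda: ([], []))
    for protein_id, label, score in records:
        by_protein[protein_id][0].append(label)
        by_protein[protein_id][1].append(score)
    aucs = [roc_auc(labels, scores) for labels, scores in by_protein.values()]
    aucs = [auc for auc in aucs if auc is not None]
    if not aucs:
        raise ValueError("no protein has both classes")
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    # Hypothetical records: (protein, label, predicted score).
    records = [
        ("A", 1, 0.9), ("A", 1, 0.8), ("A", 0, 0.1), ("A", 0, 0.85),
        ("B", 1, 0.7), ("B", 0, 0.2),
    ]
    print(mean_proteinwise_auc(records))
```

Averaging per protein gives every protein the same weight, regardless of how many variants it contributes.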

Evaluation on ProteinGym Supervised Benchmark

Note

This section corresponds to the "Comparison to supervised models from the ProteinGym benchmark" evaluation from the paper.

Go to the StructGuy_evaluation/evaluations/ directory and call:

python supervised_benchmark.py [overwrite]
  • When called without overwrite, the precalculated results from StructGuy_evaluation/evaluations/supervised_evalution.tsv are used.
  • When called with overwrite, the complete supervised benchmark is rerun, overwriting StructGuy_evaluation/evaluations/supervised_evalution.tsv in the process.

Warning

Calling this with overwrite will train and test 3255 individual models and is therefore computationally expensive.

The script will generate the StructGuy_evaluation/evaluations/supervised_benchmark_mean_rhos.tsv file that contains the results for the benchmark.
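The file name suggests that the reported values are mean correlation coefficients. As a hedged illustration, the sketch below computes Spearman's rho for several score lists and averages them. That the benchmark uses Spearman's rho is an assumption here, and the helper names are hypothetical; this is not the code of supervised_benchmark.py.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_rho(pairs):
    """Average rho over several (measured, predicted) score lists."""
    rhos = [spearman_rho(measured, predicted) for measured, predicted in pairs]
    return sum(rhos) / len(rhos)


if __name__ == "__main__":
    # Hypothetical measured and predicted scores for two test sets.
    pairs = [
        ([1.0, 2.0, 3.0, 4.0], [0.1, 0.4, 0.3, 0.9]),
        ([0.5, 0.1, 0.9], [0.6, 0.2, 0.7]),
    ]
    print(mean_rho(pairs))
```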
