The provided code has been tested on the Booster partition of Leonardo, the pre-exascale Tier-0 EuroHPC supercomputer at CINECA, and on the DGX partition of Orfeo, the supercomputer hosted at AREA Science Park.
Specifically, we tested the finetuning on the following architectures:
Leonardo Booster one-node configuration:
- Processors: single socket 32-core Intel Xeon Platinum 8358 CPU, 2.60GHz (Ice Lake)
- RAM: 512 GB DDR4 3200 MHz
- Accelerators: 4x NVIDIA custom Ampere A100 GPU 64GB HBM2e, NVLink 3.0 (200GB/s)
- Network: 2 x dual port HDR100 per node (400Gbps/node)
- All nodes are interconnected through an NVIDIA Mellanox network (DragonFly+ topology).
Orfeo DGX one-node configuration:
- Processors: 2 x 64-core AMD EPYC 7H12 (2.6 GHz base, 3.3 GHz boost)
- RAM: 1024 GB DDR4 3200 MT/s
- Accelerators: 8x NVIDIA Ampere A100 SXM GPU 40GB HBM2, NVLink 3.0
The software stack on Leonardo is as follows:
$> srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=8 --gres=gpu:4 -p boost_usr_prod --mem=450GB --time 02:50:00 --pty /bin/bash
$> module load python cuda nvhpc
Regarding the operating system, we tested the code with the following release:
$> cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.7 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
As far as CUDA is concerned, we tested the code with the following configuration:
$> nvidia-smi | grep CUDA
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
Finally, we tested the code with the following Python version:
$> python3 --version
Python 3.11.6
The dependencies are listed in the requirements.txt file.
The suggested procedure to install the dependencies on Leonardo is the following:
$> module load python cuda nvhpc
$> python3 -m venv PLM4Muts_venv --system-site-packages
$> source PLM4Muts_venv/bin/activate
$> pip install -r requirements.txt
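After the installation, one can quickly check that PyTorch (which we assume is among the dependencies in requirements.txt) sees the GPUs:
$> python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"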
On Leonardo, pre-trained weights can be downloaded only on the login node.
Furthermore, the home directory is limited to 50GB per user.
For these reasons, we downloaded the weights into the src/models/models_cache/ directory using the download_weights.job and src/download_weights.py scripts.
Data must be in CSV format. The following columns must be complete and specified in the header:
- 'pdb_id', 'code', 'pos', 'wt_seq', 'mut_seq', 'wt_msa', 'ddg'
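For illustration only, a minimal database file with these columns might look as follows (the values below are hypothetical, including the format of the wt_msa entry; check the example datasets shipped with the repository, e.g. datasets/Inference, for the exact conventions):
pdb_id,code,pos,wt_seq,mut_seq,wt_msa,ddg
1A7V,A4G,4,MKTALE,MKTGLE,MSA_s13/1A7V,-0.52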
For training, create a directory associated with the experiment in datasets/train_name containing the following subdirectories: train, test and validation.
Each subdirectory must contain a databases/db_name.csv file and an MSA_train_name directory with the wild-type MSAs.
For inference, only the test set is needed. A sketch of the layout is given below.
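For example, a training experiment directory might be laid out as follows (an illustrative sketch based on the structure described above; file names are placeholders):
train_name/
├── train
│   ├── databases
│   │   └── db_train_name.csv
│   └── MSA_train_name
├── validation
│   ├── databases
│   │   └── db_val_name.csv
│   └── MSA_train_name
└── test
    ├── databases
    │   └── db_test_name.csv
    └── MSA_train_name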
For training, associate each experiment with a directory, for example runs/experiment_name.
This must contain:
- the script to launch the program (similar to finetuning.job for systems with a Slurm scheduler)
- the config.yaml file reporting the paths and parameters specific to the run.
The following example shows the required fields in config.yaml:
output_dir: "runs/experiment_name"
dataset_dir: "datasets/train_name"
model: "MSA_Finetuning"
learning_rate: 1.0e-4
max_epochs: 20
loss_fn: "L1"
max_length: 1024
seeds: [10, 11, 12]
optimizer:
  name: "AdamW"
  weight_decay: 0.01
  momentum: 0.
MSA:
  max_tokens: 16000
snapshot_file: "runs/experiment_name/snapshots/MSA_Finetuning.pt"
The possible options for the model field are:
- "MSA_Finetuning"
- "ESM2_Finetuning"
- "PROST5_Finetuning"
- "MSA_Baseline"
- "ESM2_Baseline"
- "PROST5_Baseline"
The possible options for the loss_fn field are:
- "L1"
- "MSE"
The max_length parameter regulates the maximum length of the sequences that can be loaded in memory; that is, sequences longer than max_length are discarded.
The seeds field consists of a list of three integers. The final seed per GPU is computed according to:
seed = seeds[0] * (seeds[1] + seeds[2] * GPU_rank)
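For example, with seeds: [10, 11, 12] as above, the per-GPU seeds can be checked with a quick snippet (illustrative only, not part of the repository):
# per-GPU seed following the formula above
seeds = [10, 11, 12]
for gpu_rank in range(4):
    print(gpu_rank, seeds[0] * (seeds[1] + seeds[2] * gpu_rank))
# GPU ranks 0..3 get seeds 110, 230, 350, 470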
The possible options for the optimizer field are:
- "Adam"
- "AdamW"
- "SGD"
For AdamW and SGD, one can also specify the weight_decay parameter.
For SGD, one can optionally specify the momentum parameter.
The max_tokens parameter is effective only for the MSA models and specifies the maximum number of tokens that can be loaded in memory (that is, length x depth).
Too large a value of max_tokens can result in memory issues such as CUDA out-of-memory errors.
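Since max_tokens bounds the product length x depth, one can estimate how many MSA rows fit for a given sequence length. A back-of-the-envelope check (illustrative only, not repository code):
# with max_tokens = 16000 and aligned sequences of length 1024,
# at most floor(16000 / 1024) = 15 MSA rows fit in memory
max_tokens = 16000
seq_length = 1024
print(max_tokens // seq_length)  # 15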
Outputs can be found in runs/experiment_name/results/ and consist of the following files:
- db_test_name.res: selected epoch, rmse, mae, corr, p-value for the test set
- db_test_name_labels_preds.diffs: predicted and experimental values for each sequence in the test set
- epochs_statistics.csv: summary of results at all epochs for test, validation and training
- early_stopping_epoch.log: selected epoch
- test_db_test_name_metrics.log: rmse, mae, corr on the test set for all training epochs
- train_db_train_name_metrics.log: rmse, mae, corr on the cross-validated training set for all training epochs
- val_db_val_name_metrics.log: rmse, mae, corr on the validation set for all training epochs
- seeds.log: seeds parameter used for the run
- epochs_rsme.png, epochs_corr.png, epochs_mae.png: plots of the corresponding metrics across epochs
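The CSV outputs can be inspected with standard tools; for example, assuming a comma-separated layout, a quick look at the per-epoch summary with pandas:
$> python3 -c "import pandas as pd; print(pd.read_csv('runs/experiment_name/results/epochs_statistics.csv').head())"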
Weights associated with the MSA fine-tuned model, trained with the Megascale data (S155329), are available on Zenodo.
To download the trained weights:
$> wget https://zenodo.org/records/14026821/files/MSA_Finetuning.zip
$> unzip MSA_Finetuning.zip
To perform an inference, have a look at the runs/Inference_MSA_Finetuning and datasets/Inference folders.
As an example, we provide a dataset where we consider 13 mutations of the 1A7V protein. The dataset files are organized as follows:
Inference/
└── test
├── databases
│ └── db_s13.csv
├── MSA_s13
│ └── 1A7V
└── translated_databases
    └── tb_s13.csv
We have generated translated_databases/tb_s13.csv by means of the src/ProstT5TranslationDDP.py program (see, for instance, runs/S1465_Translate/translateS1465.sh for more details).
In runs/Inference_MSA_Finetuning we provide a config.yaml file where we specify, for the MSA model:
output_dir: "runs/Inference_MSA_Finetuning"
dataset_dir: "datasets/Inference"
model: "MSA_Finetuning"
max_length: 1024
MSA:
  max_tokens: 16000
snapshot_file: "runs/S1413_MSA_Finetuning/snapshots/MSA_Finetuning.pt"
Here, MSA_Finetuning.pt can be either the weights downloaded from Zenodo (see the Model weights on Zenodo section) or a newly trained model resulting from the Training section.
More generally, we have
output_dir: "path_to_your_output_dir"
dataset_dir: "path_to_your_dataset_dir"
model: "model_name" ["MSA_Finetuning" or "ESM2_Finetuning" or "PROST5_Finetuning" or "MSA_Baseline" or "ESM2_Baseline" or "PROST5_Baseline"]
max_length: max_length_of_the_sequences_that_can_be_loaded_in_memory
MSA:
  max_tokens: max_number_of_tokens_that_can_be_loaded_in_memory [only for "MSA_Finetuning" or "MSA_Baseline"]
snapshot_file: "path_to_your_model_weights.pt"
To perform the inference, we provide a Slurm job template in runs/Inference_MSA_Finetuning/inference.job, to be adjusted according to your needs.
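For reference, a minimal batch script in the spirit of the interactive resources requested above might look like the sketch below; the launch line is a placeholder, so refer to inference.job for the actual entry point and arguments:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --partition=boost_usr_prod
#SBATCH --mem=450GB
#SBATCH --time=02:50:00

module load python cuda nvhpc
source PLM4Muts_venv/bin/activate
# placeholder: replace with the actual command from inference.job
srun python3 <inference_script>.py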
This work is licensed under multiple licenses.
Because keeping this section up to date is challenging, here is a brief summary as of July 2024:
- All original source code and scripts are licensed under AGPL-3.0-or-later.
- All documentation, data, and images are licensed under CC-BY-4.0.
- Some configuration files (.gitignore, requirements.txt, REUSE.toml, reuse.spdx) are licensed under CC0-1.0.
- Some code (preprocess_data/scripts/reformat.pl) borrowed from HH-suite version 3.0.0 is licensed under GPL-3.0-or-later.
For more accurate information, check the REUSE.toml file or the SPDX license list in reuse.spdx.
Cuturello F., Celoria M., Ansuini A., Cazzaniga A. (2024). Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models. Bioinformatics, Volume 40, Issue 7, July 2024, btae447, https://doi.org/10.1093/bioinformatics/btae447
@article{10.1093/bioinformatics/btae447,
author = {Cuturello, Francesca and Celoria, Marco and Ansuini, Alessio and Cazzaniga, Alberto},
title = "{Enhancing predictions of protein stability changes induced by single mutations using MSA-based language models}",
journal = {Bioinformatics},
volume = {40},
number = {7},
pages = {btae447},
year = {2024},
month = {07},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae447},
url = {https://doi.org/10.1093/bioinformatics/btae447},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/7/btae447/58644482/btae447.pdf},
}