Training scripts for MAPLE Graphormer models for publication
- Install the package in editable mode with pip. Training requires CUDA 12.2 or later.
conda env create -f deeptrain-2.yml
conda activate deeptrain-2
pip install -e .
- Set Up Weights & Biases (wandb)
- Follow the official quickstart guide to configure Weights & Biases for experiment tracking.
Download and extract datasets.zip and raw_data.zip from the accompanying Zenodo repository and place their contents in this directory. Then, run the modules below to pre-generate the required graphs for training.
from omnicons import datasetprep
datasetprep.prepare_ms1_graphs()
datasetprep.prepare_ms2_graphs()
datasetprep.prep_msdial_dataset()
- Masked Language Modeling (MLM) – MS1 signals are randomly masked, and the model is trained to predict properties of the masked metabolites.
save.py converts DeepSpeed checkpoints into the standard PyTorch checkpoint format.
cd training/MS1Former/MLMTraining
CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
python save.py
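The masking step described above can be illustrated with a small, framework-free sketch. It is not the code in train.py: the function name `mask_signals`, the `None` placeholder, the 25% masking rate, and the use of m/z values as the masked signal are all assumptions made for illustration.

```python
import random

MASK_VALUE = None  # hypothetical placeholder marking a masked MS1 signal


def mask_signals(signals, mask_fraction=0.15, seed=0):
    """Randomly mask a fraction of the signals in one sample.

    Returns (masked_inputs, targets), where targets maps each masked
    position to its original value. A training loss would be computed
    only at those positions.
    """
    if not signals:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one signal, never more than are available.
    n_masked = min(len(signals), max(1, round(len(signals) * mask_fraction)))
    masked_positions = rng.sample(range(len(signals)), n_masked)
    masked_inputs = list(signals)  # copy so the caller's list is untouched
    targets = {}
    for pos in masked_positions:
        targets[pos] = signals[pos]
        masked_inputs[pos] = MASK_VALUE
    return masked_inputs, targets


# Illustrative m/z values, not taken from the dataset.
mz_values = [101.02, 153.07, 212.11, 287.15, 344.20, 401.26, 455.31, 512.35]
masked_inputs, targets = mask_signals(mz_values, mask_fraction=0.25)
```

In this sketch the model would receive `masked_inputs` and be scored only on the positions listed in `targets`.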
- Taxonomy-Supervised Classification – Predicts taxonomic ranks (from phylum to genus) from spectral embeddings using parallel classification heads. This step fine-tunes the MS1Former model previously trained with MLM.
export.py converts the trained PyTorch model into TorchScript format for deployment.
cd training/MS1Former/TaxonomyTraining
CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
python save.py
python export.py
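One common way to train parallel classification heads is to compute a separate cross-entropy loss per taxonomic rank and sum them. The sketch below shows that idea in plain Python. The rank names used in the example, the logits, and the unweighted sum are assumptions; the actual loss in train.py may differ.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed in a stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def multi_head_loss(logits_by_rank, labels_by_rank):
    """Sum the per-rank losses.

    Each head scores the same embedding independently, so the heads
    only interact through the shared encoder that produced it.
    """
    return sum(
        cross_entropy(logits_by_rank[rank], labels_by_rank[rank])
        for rank in logits_by_rank
    )


# Hypothetical outputs of two heads for a single spectrum.
logits_by_rank = {"phylum": [2.0, 0.0], "genus": [0.0, 0.0, 0.0]}
labels_by_rank = {"phylum": 0, "genus": 1}
loss = multi_head_loss(logits_by_rank, labels_by_rank)
```

Here the phylum head is fairly confident and correct, while the genus head is uniform over three classes, so most of the summed loss comes from the genus term.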
- Masked Language Modeling (MLM) – MS2 fragments and neutral losses are randomly masked, and the model is trained to predict their corresponding masses from the remaining context.
cd training/MS2Former/MLMTraining
CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
python save.py
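A neutral loss is the difference between the precursor m/z and a fragment m/z. The sketch below derives neutral losses from a fragment list and assembles a combined token list of the kind a masking step could operate on. The function names and the `(kind, mass)` token layout are assumptions for illustration, not the representation used by MS2Former.

```python
def neutral_losses(precursor_mz, fragment_mzs):
    """Return precursor minus fragment for every fragment below the precursor."""
    return [round(precursor_mz - mz, 4) for mz in fragment_mzs if mz < precursor_mz]


def build_tokens(precursor_mz, fragment_mzs):
    """Combine fragments and neutral losses into one list of (kind, mass) tokens."""
    tokens = [("fragment", mz) for mz in fragment_mzs]
    tokens += [("neutral_loss", nl) for nl in neutral_losses(precursor_mz, fragment_mzs)]
    return tokens


# Illustrative values: a 300.0 precursor with three fragments.
precursor_mz = 300.0
fragments = [282.0, 150.0, 100.0]
losses = neutral_losses(precursor_mz, fragments)
tokens = build_tokens(precursor_mz, fragments)
```

With these values the first neutral loss is 18.0, which is the kind of mass difference the model would be asked to recover when that token is masked.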
- Chemotype-Supervised Classification – Predicts biosynthetic classes from fragmentation embeddings by fine-tuning the MS2Former model previously trained with MLM. This step uses an augmented dataset generated via graph label propagation.
cd training/MS2Former/ChemotypeTraining
CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
python save.py
python export.py
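Graph label propagation can take several forms. The sketch below uses a simple synchronous majority vote, in which unlabeled nodes adopt the most common label among their already-labeled neighbours. This is an assumed variant chosen for brevity, not necessarily the procedure used to build the augmented dataset, and the class names are placeholders.

```python
from collections import Counter


def propagate_labels(adjacency, seed_labels, max_iters=10):
    """Spread labels from seed nodes to unlabeled neighbours by majority vote."""
    labels = dict(seed_labels)
    for _ in range(max_iters):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in labels:
                continue  # seeds and previously assigned labels stay fixed
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                # Iterate labels in sorted order so ties break deterministically.
                updates[node] = max(sorted(votes), key=lambda lab: votes[lab])
        if not updates:
            break  # nothing left that can be reached from a labeled node
        labels.update(updates)
    return labels


# Hypothetical spectral similarity graph with two labeled seeds.
adjacency = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "d": ["e"], "e": ["d"],
    "f": [],  # isolated node, never receives a label
}
seeds = {"a": "polyketide", "e": "terpene"}
augmented = propagate_labels(adjacency, seeds)
```

Nodes with no path to a seed, such as `f` here, remain unlabeled and would be left out of the augmented training set in this sketch.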
- Molecular Similarity-Supervised Classification – Predicts Tanimoto similarity bins for pairwise MS2 spectra using an external compound dataset from MS-DIAL. Trained in parallel with the chemotype dataset to preserve underlying biochemical organization.
cd training/MS2Former/TanimotoTraining
CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
python save.py
python export.py
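The Tanimoto similarity of two fingerprints is the size of their intersection divided by the size of their union. The sketch below computes it on sets of "on" bit indices and maps the score to a class label. The five equal-width bins and the example fingerprints are assumptions; the bin edges used in TanimotoTraining are not stated here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_bin(score, n_bins=5):
    """Map a score in [0, 1] to an equal-width bin index in 0..n_bins-1."""
    # A score of exactly 1.0 is clamped into the last bin.
    return min(int(score * n_bins), n_bins - 1)


# Hypothetical fingerprints for the compounds behind two MS2 spectra.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 7}
score = tanimoto(fp_a, fp_b)   # 3 shared bits out of 4 total
label = similarity_bin(score)  # class index a pairwise head would predict
```

Framing the target as a bin index turns similarity prediction into an ordinary classification problem over spectrum pairs.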