maple-graphormer-training

Training scripts for the MAPLE Graphormer models, released alongside the publication.

Installation

Training-Only Installation

Install the package in editable (symlinked) mode with pip. Training requires CUDA 12.2 or later.

    conda env create -f deeptrain-2.yml
    conda activate deeptrain-2
    pip install -e .
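
To confirm that the new environment meets the CUDA 12.2 minimum, a small check along the following lines can help. This is a minimal sketch, not part of the repository: cuda_meets_minimum is a hypothetical helper, and the script assumes PyTorch is installed by deeptrain-2.yml.

```python
def cuda_meets_minimum(version, minimum=(12, 2)):
    """Return True if a CUDA version string such as "12.2" is >= minimum."""
    if not version:
        return False  # CPU-only builds report no CUDA version
    parts = [int(p) for p in version.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)  # treat "12" as "12.0"
    return tuple(parts) >= tuple(minimum)


if __name__ == "__main__":
    try:
        import torch  # assumed to be provided by the conda environment
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("CUDA build:", torch.version.cuda)
        print("meets minimum:", cuda_meets_minimum(torch.version.cuda))
        print("GPU visible:", torch.cuda.is_available())
```

Note that torch.version.cuda reports the CUDA version PyTorch was built against, which may differ from the driver or toolkit installed on the machine.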

Set Up Weights & Biases (wandb)
- Follow the official quickstart guide to configure Weights & Biases for experiment tracking.

Dataset Preparation

Download and extract datasets.zip and raw_data.zip from the accompanying Zenodo repository and place their contents in this directory. Then run the functions below to pre-generate the graphs required for training.

    from omnicons import datasetprep

    datasetprep.prepare_ms1_graphs()
    datasetprep.prepare_ms2_graphs()
    datasetprep.prep_msdial_dataset()
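
The extraction step that precedes these calls can also be scripted. The sketch below uses only the Python standard library; extract_archives is a hypothetical helper, not part of omnicons, and it assumes both archives have already been downloaded into the current directory.

```python
import zipfile
from pathlib import Path


def extract_archives(archives, dest="."):
    """Extract each zip archive into dest and return the extracted member names."""
    dest = Path(dest)
    extracted = []
    for archive in archives:
        archive = Path(archive)
        if not archive.is_file():
            raise FileNotFoundError(f"{archive} not found; download it from Zenodo first")
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted
```

For example, extract_archives(["datasets.zip", "raw_data.zip"]) unpacks both archives in place, after which the datasetprep functions can be run.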

MS1Former Training

Masked Language Modeling (MLM) – MS¹ signals are randomly masked, and the model is trained to predict properties of the masked metabolites. save.py converts DeepSpeed checkpoints into standard PyTorch checkpoint format.

    cd training/MS1Former/MLMTraining
    CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
    python save.py
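
The masking idea behind this objective can be illustrated in a few lines of plain Python. This is a generic sketch, not the code in train.py: the MASK placeholder, the 15% masking rate, and the function name mask_signals are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_signals(signals, mask_prob=0.15, seed=None):
    """Randomly hide entries of a sequence and record what was hidden.

    Returns (masked, targets): masked is a copy of signals with some entries
    replaced by MASK, and targets maps each masked position to its original
    value, which is what the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked = list(signals)
    targets = {}
    for i, value in enumerate(signals):
        if rng.random() < mask_prob:
            targets[i] = value
            masked[i] = MASK
    return masked, targets
```

During training, the loss would be computed only at the positions listed in targets, so the model has to infer the hidden entries from the unmasked context.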

Taxonomy-Supervised Classification – Predicts taxonomic ranks (from phylum to genus) from spectral embeddings using parallel classification heads. This step fine-tunes the MS1Former model previously trained with MLM. export.py converts the trained PyTorch model into TorchScript format for deployment.

    cd training/MS1Former/TaxonomyTraining
    CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
    python save.py
    python export.py
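
To make the "parallel classification heads" idea concrete, the sketch below scores one shared embedding with a separate linear head per taxonomic rank and sums the per-rank cross-entropy losses. It is written in plain Python for readability rather than PyTorch, and it is an illustration only: the intermediate ranks in RANKS, the linear heads, and the unweighted sum are assumptions, not details taken from train.py.

```python
import math

# Assumed rank list; the source only states "from phylum to genus".
RANKS = ("phylum", "class", "order", "family", "genus")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply one linear head: one output logit per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def multi_head_loss(embedding, heads, labels):
    """Sum the cross-entropy losses of independent heads on a shared embedding.

    heads maps rank -> (weights, bias); labels maps rank -> true class index.
    """
    total = 0.0
    for rank, (weights, bias) in heads.items():
        probs = softmax(linear(weights, bias, embedding))
        total += -math.log(probs[labels[rank]])
    return total
```

Because every head reads the same embedding, gradients from all ranks flow back into the shared encoder, which is the usual motivation for this kind of multi-task setup.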

MS2Former Training

Masked Language Modeling (MLM) – MS² fragments and neutral losses are randomly masked, and the model is trained to predict their corresponding masses from the remaining context.

    cd training/MS2Former/MLMTraining
    CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
    python save.py
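
For readers unfamiliar with the term, a neutral loss is the mass difference between the precursor ion and a fragment ion. The sketch below shows that arithmetic under the simplifying assumption of singly charged ions; neutral_losses is a hypothetical helper, and how the repository derives these values is not shown in this README.

```python
def neutral_losses(precursor_mz, fragment_mzs):
    """Return precursor minus fragment m/z for each fragment below the precursor.

    Assumes singly charged ions, so m/z differences equal mass differences.
    """
    return [precursor_mz - mz for mz in fragment_mzs if mz < precursor_mz]
```

For a precursor at m/z 300.0 with fragments at 100.0 and 250.0, the neutral losses are 200.0 and 50.0.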

Chemotype-Supervised Classification – Predicts biosynthetic classes from fragmentation embeddings by fine-tuning the MS2Former model previously trained with MLM. This step uses an augmented dataset generated via graph label propagation.

    cd training/MS2Former/ChemotypeTraining
    CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
    python save.py
    python export.py
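
The README does not describe how the label propagation is performed. As a rough illustration of the general technique only, the sketch below spreads labels from annotated nodes to unlabeled neighbors by synchronous majority vote; the function name, the tie-breaking rule, and the iteration cap are assumptions and may differ from what was used to build the augmented dataset.

```python
from collections import Counter


def propagate_labels(edges, seed_labels, max_iters=10):
    """Spread labels across an undirected graph by neighbor majority vote.

    edges is an iterable of (node, node) pairs; seed_labels maps some nodes to
    a label. Seed labels are never overwritten. Nodes with no path to a
    labeled node stay unlabeled and are absent from the result.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = dict(seed_labels)
    for _ in range(max_iters):
        updates = {}
        for node in sorted(neighbors):
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                # Most votes wins; ties are broken alphabetically by label.
                updates[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if not updates:
            break  # nothing changed, so propagation has converged
        labels.update(updates)
    return labels
```

Each pass labels only the nodes adjacent to already-labeled ones, so labels move outward one hop per iteration.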

Molecular Similarity-Supervised Classification – Predicts Tanimoto similarity bins for pairwise MS² spectra using an external compound dataset from MS-DIAL. Trained in parallel with the chemotype dataset to preserve underlying biochemical organization.

    cd training/MS2Former/TanimotoTraining
    CUDA_VISIBLE_DEVICES=0 python train.py -logger_entity new_user
    python save.py
    python export.py
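
Tanimoto similarity between two molecular fingerprints is the size of their intersection divided by the size of their union. The sketch below computes it on sets of "on" bit indices and maps the score to a discrete bin, which is the kind of target a similarity-bin classifier predicts. The ten equal-width bins and both function names are assumptions; the README does not state the binning scheme.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_bin(score, n_bins=10):
    """Map a score in [0, 1] to an equal-width bin index in [0, n_bins - 1]."""
    return min(int(score * n_bins), n_bins - 1)
```

For example, fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a similarity of 0.5, which falls in bin 5 of 10.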

