SIMBA: Spectral Identification of Molecule Bio-Analogues

SIMBA is a transformer-based neural network that accurately predicts chemical structural similarity from tandem mass spectrometry (MS/MS) spectra. Unlike traditional methods relying on heuristic metrics (e.g., modified cosine similarity), SIMBA directly models structural differences, enabling precise analog identification in metabolomics.

SIMBA predicts two interpretable metrics:

Substructure Edit Distance: Number of molecular graph edits required to convert one molecule into another.
Maximum Common Edge Substructure (MCES) Distance: Number of bond modifications required to achieve molecular equivalence.
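
To make the MCES idea concrete, the following is a minimal brute-force sketch for very small molecules. It assumes the common definition of the distance as (bonds in A) + (bonds in B) minus twice the number of bonds in the largest shared substructure, ignores bond orders and hydrogens, and uses hand-written atom lists and bond sets. The function names and the toy molecules are illustrative and are not part of SIMBA; exhaustive search like this is only feasible for a handful of atoms.

```python
from itertools import permutations


def mces_distance(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force MCES distance between two tiny molecular graphs.

    atoms_*: list of element symbols, indexed by atom.
    bonds_*: set of frozenset({i, j}) atom-index pairs (bond order ignored).
    """
    # Map the smaller molecule into the larger one.
    if len(atoms_a) > len(atoms_b):
        atoms_a, bonds_a, atoms_b, bonds_b = atoms_b, bonds_b, atoms_a, bonds_a

    best_common = 0
    # Try every injective assignment of A's atoms to B's atoms.
    for image in permutations(range(len(atoms_b)), len(atoms_a)):
        common = 0
        for bond in bonds_a:
            i, j = tuple(bond)
            same_elements = (
                atoms_a[i] == atoms_b[image[i]] and atoms_a[j] == atoms_b[image[j]]
            )
            if same_elements and frozenset((image[i], image[j])) in bonds_b:
                common += 1
        best_common = max(best_common, common)

    # Bonds outside the shared substructure must be removed or added.
    return len(bonds_a) + len(bonds_b) - 2 * best_common


def bond_set(*pairs):
    """Build a set of undirected bonds from (i, j) index pairs."""
    return {frozenset(p) for p in pairs}


# Heavy-atom skeletons: ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = (["C", "C", "O"], bond_set((0, 1), (1, 2)))
propanol = (["C", "C", "C", "O"], bond_set((0, 1), (1, 2), (2, 3)))

print(mces_distance(*ethanol, *ethanol))   # 0
print(mces_distance(*ethanol, *propanol))  # 1
```

Ethanol and 1-propanol differ by a single C-C bond, so the sketch reports a distance of 1.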

🚀 Quickstart

Requirements

Python 3.11.7
UV (recommended) or Conda

Installation

Option 1: UV (Recommended - Fastest ⚡)

Install UV:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Set up SIMBA (~2-5 minutes):

# Clone the repository
git clone https://github.com/bittremieux-lab/simba.git
cd simba

# Create virtual environment and install dependencies
uv sync

# Activate the environment
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate     # Windows

For Jupyter notebooks:

# Install notebook dependencies
uv sync --extra notebooks

# Register the kernel
python -m ipykernel install --user --name=simba --display-name="SIMBA (UV)"

To use notebooks in VS Code:

Open any .ipynb file in the notebooks/ folder
Click "Select Kernel" in the top-right corner
Choose "SIMBA (UV)" or "Python 3.11 (.venv: venv)"
If the kernel doesn't appear, reload the VS Code window (Cmd+Shift+P on macOS or Ctrl+Shift+P on Windows/Linux → "Developer: Reload Window")

Option 2: Conda (Alternative)

# Create and activate environment
conda env create -f environment.yml
conda activate simba

# Install the module
pip install -e .

Note for macOS users: also install xz with Homebrew:

brew install xz

🔎 Computing Structural Similarities

We provide a pretrained SIMBA model trained on spectra from MassSpecGym. The model operates in positive ionization mode for protonated adducts.

Usage Example

Follow the Run Inference Notebook for a comprehensive tutorial:

Runtime: < 10 minutes (including model/data download)
Example data: data folder.
Supported format: .mgf
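
For readers unfamiliar with the .mgf format, the sketch below parses it with the standard library only. It relies on the general MGF layout (BEGIN IONS / END IONS blocks, KEY=value headers, then "m/z intensity" peak lines). The function name `parse_mgf` is illustrative, and the `ADDUCT` header used in the filter is an assumption: header names vary between data sources. In practice a dedicated MS library is the more robust choice; SIMBA's own loader is not shown here.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectrum dictionaries."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "mz": [], "intensity": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by an optional intensity.
                fields = line.split()
                current["mz"].append(float(fields[0]))
                current["intensity"].append(
                    float(fields[1]) if len(fields) > 1 else 0.0
                )
    return spectra


example = """\
BEGIN IONS
TITLE=example_spectrum
PEPMASS=181.0707
CHARGE=1+
ADDUCT=[M+H]+
91.0542 100.0
119.0491 45.5
END IONS
"""

spectra = parse_mgf(example)
# Assumption: the file stores the adduct under an ADDUCT header.
protonated = [s for s in spectra if s["params"].get("ADDUCT") == "[M+H]+"]
print(len(spectra), len(protonated), spectra[0]["mz"])
```

Filtering on the adduct mirrors the scope of the pretrained model, which targets protonated adducts in positive ionization mode.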

Performance

Using an Apple M3 Pro (36 GB RAM):

Embedding computation: ~100,000 spectra in ~1 minute
Similarity computation: 1 query vs. 100,000 spectra in ~10 seconds

SIMBA caches computed embeddings, which significantly speeds up repeated searches against the same library.
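
The idea behind embedding caching can be sketched as follows. This is not SIMBA's implementation: the function `cached_embeddings`, the content-hash cache key, and the pickle storage are all assumptions chosen for illustration, and `fake_embed` stands in for the expensive model forward pass.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_embeddings(spectra_path, compute_fn, cache_dir):
    """Return embeddings for a file, computing them only on a cache miss."""
    spectra_path = Path(spectra_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on file content so an edited library is re-embedded.
    digest = hashlib.sha256(spectra_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    embeddings = compute_fn(spectra_path)
    with cache_file.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


calls = []


def fake_embed(path):
    # Hypothetical stand-in for the model; records how often it is invoked.
    calls.append(path)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "library.mgf"
    library.write_text("BEGIN IONS\nEND IONS\n")
    first = cached_embeddings(library, fake_embed, Path(tmp) / "cache")
    second = cached_embeddings(library, fake_embed, Path(tmp) / "cache")

print(first == second, len(calls))  # True 1
```

The second call is served from disk, so the embedding function runs only once.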

🔬 Analog Discovery Using SIMBA

Modern metabolomics relies on tandem mass spectrometry (MS/MS) to identify unknown compounds by comparing their spectra against large reference libraries. SIMBA enables analog discovery, that is, finding structurally related molecules, by predicting two complementary, interpretable metrics directly from spectra.

Usage Example

Perform analog discovery to find structurally similar molecules using one of the following methods:

Option 1: CLI Command (Recommended)

simba analog-discovery \
  --model-path /path/to/model.ckpt \
  --query-spectra /path/to/query.mgf \
  --reference-spectra /path/to/reference_library.mgf \
  --output-dir /path/to/output \
  --query-index 0 \
  --top-k 10 \
  --device cpu \
  --compute-ground-truth

Parameters:

--model-path: Path to trained SIMBA model checkpoint (.ckpt file)
--query-spectra: Path to query spectra file (.mgf or .pkl format)
--reference-spectra: Path to reference library spectra file (.mgf or .pkl format)
--output-dir: Directory where results will be saved
--query-index: Index of the query spectrum to analyze (default: 0)
--top-k: Number of top matches to return (default: 10)
--device: Hardware device: cpu or gpu (default: cpu)
--batch-size: Batch size for processing (default: 32)
--cache-embeddings / --no-cache-embeddings: Cache embeddings for faster repeated searches (default: True)
--use-gnps-format / --no-use-gnps-format: Whether spectra files use GNPS format (default: False)
--compute-ground-truth: Compute ground truth edit distance and MCES for validation
--save-rankings: Save complete ranking matrix to file

Output: The command generates several files in the output directory:

results.json: Summary of top matches with predictions and ground truth
matches.csv: Detailed table of all matches
query_molecule.png: Structure of the query molecule
match_N_molecule.png: Structures of matched molecules
mirror_plot_match_N.png: Mirror plots comparing query and matched spectra
rankings.npy: Complete ranking matrix (if --save-rankings is used)
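
When post-processing a run, it can help to pair each structure image with its mirror plot. The sketch below does this using only the file-name patterns listed above (`match_N_molecule.png` and `mirror_plot_match_N.png`). The helper name `collect_match_images` is hypothetical, and the demonstration creates empty placeholder files rather than real SIMBA output.

```python
import re
import tempfile
from pathlib import Path


def collect_match_images(output_dir):
    """Map each match number N to its structure image and mirror plot."""
    output_dir = Path(output_dir)
    matches = {}
    for path in output_dir.glob("match_*_molecule.png"):
        found = re.fullmatch(r"match_(\d+)_molecule\.png", path.name)
        if found:
            # Reuse the number exactly as written in the file name.
            mirror = output_dir / f"mirror_plot_match_{found.group(1)}.png"
            matches[int(found.group(1))] = {
                "molecule": path,
                "mirror_plot": mirror if mirror.exists() else None,
            }
    return dict(sorted(matches.items()))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Placeholder files standing in for a real output directory.
    for name in ["match_1_molecule.png", "mirror_plot_match_1.png",
                 "match_2_molecule.png"]:
        (out / name).touch()
    found_matches = collect_match_images(out)
    summary = {n: files["mirror_plot"] is not None
               for n, files in found_matches.items()}

print(summary)  # {1: True, 2: False}
```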

Example workflow:

# Find analogs for spectrum #5 in your query file
simba analog-discovery \
  --model-path ~/models/best_model.ckpt \
  --query-spectra ~/data/casmi2022.mgf \
  --reference-spectra ~/data/massspecgym_library.mgf \
  --output-dir ~/results/analog_discovery \
  --query-index 5 \
  --top-k 20 \
  --compute-ground-truth

Option 2: Jupyter Notebook (Interactive)

For interactive exploration, use the Run Analog Discovery Notebook.

The notebook demonstrates:

Loading a pretrained SIMBA model and MS/MS data
Computing distance matrices between query and reference spectra
Extracting top analogs for a given query
Comparing predictions against ground truth and visualizing matches

Note: Both methods produce equivalent results. The CLI command is recommended for automated workflows and batch processing, while the notebook is better for interactive analysis and visualization.
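
For batch processing, one possible approach is to generate one CLI call per query index. The sketch below builds the argument lists using only the flags documented above. The helper name, the file paths, and the choice of a separate output directory per query are assumptions made for illustration.

```python
import shlex


def build_analog_discovery_cmd(model_path, query_spectra, reference_spectra,
                               output_dir, query_index, top_k=10, device="cpu"):
    """Build the argument list for one `simba analog-discovery` call."""
    return [
        "simba", "analog-discovery",
        "--model-path", str(model_path),
        "--query-spectra", str(query_spectra),
        "--reference-spectra", str(reference_spectra),
        "--output-dir", str(output_dir),
        "--query-index", str(query_index),
        "--top-k", str(top_k),
        "--device", device,
    ]


# Placeholder paths; one output directory per query (an assumption, so that
# results from different queries are kept apart).
commands = [
    build_analog_discovery_cmd(
        "model.ckpt", "query.mgf", "library.mgf",
        f"results/query_{i}", query_index=i, top_k=20,
    )
    for i in range(3)
]

for cmd in commands:
    # Print for inspection; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the command as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.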

📚 Training Your Custom SIMBA Model

SIMBA supports training custom models using your own MS/MS datasets in .mgf format.

Step 1: Generate Training Data

Preprocess your MS/MS spectral data using one of the following methods:

Option 1: CLI Command (Recommended)

simba preprocess \
  --spectra-path /path/to/your/spectra.mgf \
  --workspace /path/to/preprocessed_data \
  --max-spectra-train 10000 \
  --mapping-file-name mapping_unique_smiles.pkl \
  --num-workers 0

Parameters:

--spectra-path: Path to input spectra file (.mgf format)
--workspace: Directory where preprocessed data will be saved
--max-spectra-train: Maximum number of spectra to process for training (default: 10000). Set to a large number to process all spectra
--max-spectra-val: Maximum number of spectra for validation (default: 1000000)
--max-spectra-test: Maximum number of spectra for testing (default: 1000000)
--mapping-file-name: Filename for the mapping file (default: mapping_unique_smiles.pkl)
--num-workers: Number of worker processes for parallel computation (default: 0)
--val-split: Fraction of data for validation (default: 0.1)
--test-split: Fraction of data for testing (default: 0.1)
--overwrite: Overwrite existing preprocessing files

Option 2: Python Script (Legacy)

python preprocessing_scripts/final_generation_data.py  \
   --spectra_path=/path/to/your/spectra.mgf   \
   --workspace=/path/to/preprocessed_data/  \
   --MAX_SPECTRA_TRAIN=10000 \
   --mapping_file_name=mapping_unique_smiles.pkl  \
   --PREPROCESSING_NUM_WORKERS=0

Parameters:

spectra_path: Path to the input spectra file
workspace: Directory where the calculated distances will be saved
MAX_SPECTRA_TRAIN: Maximum number of spectra to process. Set to a large number to avoid discarding spectra
mapping_file_name: Name of the file that stores the mapping from spectra to unique compounds
PREPROCESSING_NUM_WORKERS: Number of worker processes to use (default: 0)

Note: Both methods produce identical results. The preprocessing computes:

Edit distance between molecular structures
MCES (Maximum Common Edge Substructure) distance
Train/validation/test splits

The output includes the file mapping_unique_smiles.pkl, which records the mapping between unique compounds and their spectra. A compound can have several spectra, and this file stores which spectra belong to which compound.
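
The compound-to-spectra relationship can be illustrated with a small grouping sketch. It is not the layout of the actual pickle file: the function name and the toy SMILES list are hypothetical, and the sketch compares raw SMILES strings, whereas a real pipeline would typically canonicalize them first.

```python
from collections import defaultdict


def group_spectra_by_compound(smiles_per_spectrum):
    """Group spectrum indexes by the SMILES string of their compound."""
    groups = defaultdict(list)
    for spectrum_index, smiles in enumerate(smiles_per_spectrum):
        groups[smiles].append(spectrum_index)
    return dict(groups)


# Five spectra measured from three distinct compounds (toy data).
smiles_per_spectrum = ["CCO", "CCN", "CCO", "c1ccccc1", "CCO"]
mapping = group_spectra_by_compound(smiles_per_spectrum)
print(mapping)  # {'CCO': [0, 2, 4], 'CCN': [1], 'c1ccccc1': [3]}
```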

Output

Numpy arrays with indexes and structural similarity metrics
Pickle file (mapping_unique_smiles.pkl) mapping spectra indexes to SMILES structures

Accessing Data Mapping

import pickle

with open('/path/to/output_dir/mapping_unique_smiles.pkl', 'rb') as f:
    data = pickle.load(f)

mol_train = data['molecule_pairs_train']
print(mol_train.df_smiles)

The df_smiles dataframe maps the indexes of unique compounds to the original spectra that were loaded.
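
Only the molecule_pairs_train key is shown above, so it can be useful to list what a pickle contains before indexing into it. The helper below is a generic sketch (the name `summarize_pickle` is hypothetical), demonstrated on a stand-in file. A real mapping file most likely holds SIMBA objects, so it would need to be loaded in an environment where the simba package is importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path


def summarize_pickle(path):
    """Return {key: type name} for a pickled dictionary."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict, got {type(data).__name__}")
    return {key: type(value).__name__ for key, value in data.items()}


# Stand-in file for demonstration; not a real SIMBA mapping file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.pkl"
    with demo.open("wb") as f:
        pickle.dump({"molecule_pairs_train": [1, 2, 3]}, f)
    summary = summarize_pickle(demo)

print(summary)  # {'molecule_pairs_train': 'list'}
```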

Step 2: Model Training

Train your SIMBA model using one of the following methods:

Option 1: CLI Command (Recommended)

simba train \
  --checkpoint-dir /path/to/checkpoints/ \
  --preprocessing-dir /path/to/preprocessed_data/ \
  --preprocessing-pickle mapping_unique_smiles.pkl \
  --epochs 10 \
  --accelerator cpu \
  --batch-size 32 \
  --num-workers 0 \
  --learning-rate 0.0001 \
  --val-check-interval 10000

Parameters:

--checkpoint-dir: Directory where the trained model will be saved
--preprocessing-dir: Directory where preprocessing files are stored
--preprocessing-pickle: Filename of the mapping pickle file
--epochs: Number of training epochs (default: 10)
--accelerator: Hardware accelerator: cpu or gpu (default: cpu)
--batch-size: Batch size for training and validation (default: 32)
--num-workers: Number of data loading workers (default: 0)
--learning-rate: Learning rate for the optimizer (default: 0.0001)
--val-check-interval: Validation check frequency in training steps (default: 10000)
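
Because --val-check-interval is expressed in training steps, its practical meaning depends on the batch size and the dataset size. The arithmetic below is a sketch under the assumption that one training step processes exactly one batch (no gradient accumulation). The dataset size of one million samples is a made-up example, not a SIMBA figure.

```python
import math


def validation_schedule(num_training_samples, batch_size, val_check_interval):
    """Relate a step-based validation interval to samples and epochs."""
    steps_per_epoch = math.ceil(num_training_samples / batch_size)
    samples_between_checks = val_check_interval * batch_size
    epochs_between_checks = val_check_interval / steps_per_epoch
    return steps_per_epoch, samples_between_checks, epochs_between_checks


# Batch size and interval are the documented defaults; the sample count is
# hypothetical.
steps, samples, epochs = validation_schedule(
    num_training_samples=1_000_000, batch_size=32, val_check_interval=10_000
)
print(steps, samples, epochs)  # 31250 320000 0.32
```

With these numbers, validation would run roughly three times per epoch. On a much smaller dataset the same interval could exceed one epoch, in which case a lower value may be more appropriate.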

Option 2: Python Script (Legacy)

python training_scripts/final_training.py  \
  --CHECKPOINT_DIR=/path/to/checkpoints/ \
  --PREPROCESSING_PICKLE_FILE=mapping_unique_smiles.pkl \
  --PREPROCESSING_DIR_TRAIN=/path/to/preprocessed_data/ \
  --TRAINING_NUM_WORKERS=0  \
  --ACCELERATOR=cpu  \
  --epochs=10 \
  --VAL_CHECK_INTERVAL=10000

Parameters:

CHECKPOINT_DIR: Directory where the trained model will be saved
PREPROCESSING_DIR_TRAIN: Directory where the preprocessing files are stored
PREPROCESSING_PICKLE_FILE: Filename of the mapping pickle file
ACCELERATOR: cpu or gpu
epochs: Number of training epochs
VAL_CHECK_INTERVAL: Validation performance is checked every N training steps

Note: Both methods produce identical results and use the mapping file produced in Step 1. The preprocessing directory (PREPROCESSING_DIR_TRAIN / --preprocessing-dir) must be the same directory in which the preprocessing files were generated. The best-performing model (lowest validation loss) is saved in the checkpoint directory (CHECKPOINT_DIR / --checkpoint-dir).

Step 3: Model Inference & Evaluation

Run inference on test data using your trained model with one of the following methods:

Option 1: CLI Command (Recommended)

simba inference \
  --checkpoint-dir /path/to/checkpoints/ \
  --preprocessing-dir /path/to/preprocessed_data/ \
  --preprocessing-pickle mapping_unique_smiles.pkl \
  --batch-size 64 \
  --accelerator cpu \
  --use-last-model

Parameters:

--checkpoint-dir: Directory containing the trained model checkpoint
--preprocessing-dir: Directory where preprocessed data is stored
--preprocessing-pickle: Filename of the mapping pickle file
--batch-size: Batch size for inference (default: 64)
--accelerator: Hardware accelerator: cpu, gpu, or auto (default: auto)
--use-last-model: Use last.ckpt instead of best_model.ckpt (optional flag)
--uniformize-testing / --no-uniformize-testing: Balance edit distance classes (default: True)
--output-dir: Directory to save plots and results (default: checkpoint-dir)
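
The effect of --use-last-model can be pictured as a simple file choice between the two checkpoint names mentioned above. The helper below is a hypothetical sketch of that choice, not SIMBA's code, and the demonstration uses empty placeholder files.

```python
import tempfile
from pathlib import Path


def resolve_checkpoint(checkpoint_dir, use_last_model=False):
    """Pick last.ckpt or best_model.ckpt from a checkpoint directory."""
    name = "last.ckpt" if use_last_model else "best_model.ckpt"
    path = Path(checkpoint_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"{name} not found in {checkpoint_dir}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder checkpoint files for the demonstration.
    (Path(tmp) / "best_model.ckpt").touch()
    (Path(tmp) / "last.ckpt").touch()
    default_name = resolve_checkpoint(tmp).name
    last_name = resolve_checkpoint(tmp, use_last_model=True).name

print(default_name, last_name)  # best_model.ckpt last.ckpt
```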

Output: The command generates evaluation metrics and visualization plots:

Edit distance correlation (Spearman)
MCES/Tanimoto correlation (Spearman)
Confusion matrix for edit distance predictions (cm.png)
Hexbin plot showing prediction accuracy (hexbin_plot_*.png)
Scatter plot of predictions vs. ground truth (scatter_plot_*.png)
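
For reference, Spearman correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The standard-library sketch below shows that computation on toy numbers. It is illustrative only: the function names and data are made up, and an established statistics library is the usual choice for real evaluations.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    # Undefined (division by zero) if either input is constant.
    return cov / math.sqrt(var_x * var_y)


true_distance = [1, 2, 3, 4, 5]            # toy ground-truth edit distances
predicted = [1.2, 1.9, 3.4, 3.9, 4.8]      # toy predictions, same ordering
print(spearman(true_distance, predicted))        # 1.0
print(spearman(true_distance, predicted[::-1]))  # -1.0
```

Because only the ordering matters, predictions that are numerically off but correctly ranked still reach a correlation of 1.0.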

Option 2: Python Script (Legacy)

python inference_scripts/inference_multitasking.py \
  --CHECKPOINT_DIR=/path/to/checkpoints/ \
  --PREPROCESSING_DIR=/path/to/preprocessed_data/ \
  --PREPROCESSING_DIR_TRAIN=/path/to/preprocessed_data/ \
  --PREPROCESSING_PICKLE_FILE=mapping_unique_smiles.pkl \
  --UNIFORMIZE_DURING_TESTING=1

Parameters:

CHECKPOINT_DIR: Folder where the trained model is saved and testing results will be saved
PREPROCESSING_DIR and PREPROCESSING_DIR_TRAIN: Location where the preprocessing files are saved
PREPROCESSING_PICKLE_FILE: Mapping file
UNIFORMIZE_DURING_TESTING: Whether to balance the edit distance classes (1=True, 0=False)
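
One plausible reading of "balancing the edit distance classes" is downsampling every class to the size of the rarest one, so that frequent edit distances do not dominate the evaluation. The sketch below implements that reading; it is an assumption about the behavior, not a description of SIMBA's code, and the function name and data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_classes(labels, seed=0):
    """Return sorted indexes that keep an equal number of items per label."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Every class is reduced to the size of the smallest class.
    smallest = min(len(indexes) for indexes in by_label.values())
    rng = random.Random(seed)
    kept = []
    for indexes in by_label.values():
        kept.extend(rng.sample(indexes, smallest))
    return sorted(kept)


# Toy edit-distance labels with an uneven class distribution.
edit_distances = [0, 1, 1, 1, 2, 2, 5, 5, 5, 5]
kept = balance_classes(edit_distances)
print(Counter(edit_distances[i] for i in kept))
```

After balancing, each edit-distance class contributes the same number of pairs.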

Note: Both methods produce identical results.

🛠️ Development & Contributing

Setting Up Development Environment

# Clone and install with dev dependencies
git clone https://github.com/bittremieux-lab/simba.git
cd simba
uv sync --all-extras

# Install pre-commit hooks
uv run pre-commit install

Running Tests

# Run all tests
uv run pytest tests/ -v

# Run with coverage report
uv run pytest tests/ --cov=simba --cov-report=html --cov-report=term-missing

# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html

Code Quality

The project uses:

Ruff for linting and formatting
pytest for testing
pre-commit hooks for automated checks

# Run linter
uv run ruff check simba/

# Format code
uv run ruff format simba/

# Run pre-commit on all files
uv run pre-commit run --all-files

Pre-commit hooks automatically run on every commit and check:

Code formatting (Ruff)
Linting (Ruff)
Tests (pytest)
File formatting (trailing whitespace, line endings)
YAML/TOML syntax

📬 Contact & Support

Code repository: SIMBA GitHub
For questions, issues, or feature requests, please open an issue.

📦 Data Availability

Training and testing datasets are available at https://zenodo.org/records/15275257.
