NMIRacle (NMR-IR oracle) is a multi-modal generative framework for de novo molecular structure elucidation from spectroscopic data. The model learns to generate molecular structures (SMILES) directly from raw IR, ¹H-NMR, and ¹³C-NMR spectra through a two-stage training approach that combines count-aware fragment encodings with multi-spectral conditioning.
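To make the term "count-aware" concrete, the sketch below contrasts a binary presence encoding with a count encoding of the same fragment list. The vocabulary and fragment labels are hypothetical placeholders chosen for illustration; the actual fragment set and encoding used by NMIRacle are not specified in this document.

```python
from collections import Counter

# Hypothetical fragment vocabulary (illustrative only, not NMIRacle's real one).
VOCAB = ["C=O", "O-H", "C-N", "aromatic_ring"]


def binary_encoding(fragments):
    """Return 1 for each vocabulary fragment that occurs at least once, else 0."""
    present = set(fragments)
    return [1 if frag in present else 0 for frag in VOCAB]


def count_encoding(fragments):
    """Return the number of occurrences of each vocabulary fragment."""
    counts = Counter(fragments)
    return [counts[frag] for frag in VOCAB]


# Toy molecule with two carbonyls and one hydroxyl (made-up labels).
fragments = ["C=O", "C=O", "O-H"]
print(binary_encoding(fragments))  # [1, 1, 0, 0]
print(count_encoding(fragments))   # [2, 1, 0, 0]
```

The count vector distinguishes one carbonyl from two, which a presence-only fingerprint cannot do.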
NMIRacle expects data in the following structure:
datasets/
├── pretrain/                    # Stage 1: Fragment pre-training
│   ├── smiles.npy               # Molecular SMILES strings
│   ├── substructure_counts.h5   # Fragment compositions
│   └── split_indices.p          # Train/val/test splits
│
└── multispectra/                # Stage 2: Spectra fine-tuning
    ├── smiles.npy               # Molecular SMILES strings
    ├── spectra.h5               # Multi-modal spectra (IR, ¹H, ¹³C)
    ├── substructure_counts.h5   # Fragment compositions
    └── split_indices.p          # Train/val/test splits
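Before launching a training run, it can help to confirm that the layout above is in place. The following is a minimal sketch that only checks file names; `missing_files` is a hypothetical helper, not part of the NMIRacle package, and it does not inspect file contents.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED = {
    "pretrain": ["smiles.npy", "substructure_counts.h5", "split_indices.p"],
    "multispectra": [
        "smiles.npy",
        "spectra.h5",
        "substructure_counts.h5",
        "split_indices.p",
    ],
}


def missing_files(root):
    """Return the expected files that are absent under `root`, as relative paths."""
    root = Path(root)
    return [
        f"{subdir}/{name}"
        for subdir, names in EXPECTED.items()
        for name in names
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files("datasets")
    print("All files present" if not missing else f"Missing: {missing}")
```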
The spectra.h5 file contains the three spectra concatenated in this order:
- ir: IR spectra (1,800 features)
- hnmr: ¹H-NMR spectra (10,000 features)
- cnmr: ¹³C-NMR spectra (10,000 features)
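Given that ordering, a concatenated spectrum can be cut back into its modalities by offset. The sketch below assumes each spectrum is stored as one flat vector of 1,800 + 10,000 + 10,000 = 21,800 values; the dataset keys inside spectra.h5 are not stated here, so the example works on a plain sequence rather than reading the file, and `split_spectrum` is a hypothetical helper.

```python
# Feature counts as listed above; the order is ir, hnmr, cnmr.
SEGMENTS = [("ir", 1_800), ("hnmr", 10_000), ("cnmr", 10_000)]
TOTAL = sum(size for _, size in SEGMENTS)  # 21,800 features in total


def split_spectrum(vector):
    """Slice one concatenated spectrum into its per-modality parts."""
    if len(vector) != TOTAL:
        raise ValueError(f"expected {TOTAL} features, got {len(vector)}")
    parts, start = {}, 0
    for name, size in SEGMENTS:
        parts[name] = vector[start:start + size]
        start += size
    return parts


# Dummy vector whose values equal their own index, to make offsets visible.
parts = split_spectrum(list(range(TOTAL)))
print({name: len(values) for name, values in parts.items()})
# {'ir': 1800, 'hnmr': 10000, 'cnmr': 10000}
```

The same slicing works on a NumPy array, since it relies only on `len` and slice indexing.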
Dataset files are available at the following link.
Train the model to reconstruct molecules from count-aware fragment representations:
python -m nmiracle.train \
training_stage=sub2struct \
data/dataset=pretrain_dataset \
trainer.max_epochs=500 \
logger.wandb.enabled=true

Fine-tune with spectral conditioning (requires Stage 1 checkpoint):
python -m nmiracle.train \
training_stage=spec2struct \
data/dataset=multispectra_dataset \
data.use_ir=true \
data.use_hnmr=true \
data.use_cnmr=false \
model.pretrained_structure_model_path=/path/to/stage1/checkpoint.ckpt \
trainer.max_epochs=300 \
logger.wandb.enabled=true

python -m nmiracle.test_sub2struct \
--model_path nmiracle/ckpts/sub2struct \
--checkpoint best.ckpt \
--temperature 1.0 \
--top_k 5 \
--num_sequences 15

python -m nmiracle.test_spec2struct \
--model_path nmiracle/ckpts/spec2struct_ir_hnmr_cnmr \
--checkpoint epoch=295-val_loss=0.15.ckpt \
--temperature 1.0 \
--top_k 5 \
--num_sequences 15

We provide checkpoints of trained models at the following link.
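The test commands above take `--temperature` and `--top_k` flags. For readers unfamiliar with these decoding parameters, here is a minimal sketch of temperature-scaled top-k sampling as it is commonly defined. It is a generic illustration, not NMIRacle's own decoding code: `sample_top_k` and the example logits are hypothetical, and treating `--num_sequences 15` as 15 independent draws is an assumption.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=5, rng=random):
    """Sample one token index from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep the indices of the k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the kept logits (shifted for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Made-up next-token scores for a seven-token vocabulary.
logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5, -2.0]
samples = [sample_top_k(logits, temperature=1.0, top_k=5) for _ in range(15)]
print(samples)  # 15 indices, each drawn from the five highest-scoring tokens
```

Lower temperatures concentrate probability on the highest-scoring token, while a smaller `top_k` removes low-probability tokens from consideration entirely.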
If you use NMIRacle in your research, please cite:
@misc{ottomano2025nmiraclemultimodalgenerativemolecular,
title={NMIRacle: Multi-modal Generative Molecular Elucidation from IR and NMR Spectra},
author={Federico Ottomano and Yingzhen Li and Alex M. Ganose},
year={2025},
eprint={2512.19733},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2512.19733},
}

- Built with PyTorch and PyTorch Lightning
- Inspired by NMR2Struct and related work.
⭐ If you find NMIRacle useful, please star the repository!
