VibraCLIP is a multi-modal framework, inspired by the CLIP model [1], that integrates molecular graph representations with infrared (IR) and Raman spectra from the QM9S dataset [2] and experimental data, leveraging machine learning to capture the complex relationships between molecular structures and their vibrational spectra. By aligning these diverse data modalities in a shared representation space, VibraCLIP enables precise molecular identification, bridging the gap between spectral data and molecular interpretation.
After installing conda, run the following commands to create a new environment named vibraclip_cpu or vibraclip_gpu (depending on the chosen environment file) and install the dependencies:
conda env create -f env_gpu.yml
conda activate vibraclip_gpu
pre-commit install

We recommend using lightning instead of the deprecated pytorch_lightning, by implementing the changes suggested in the following pull request to pytorch_geometric: link.
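For illustration only, the switch mostly amounts to swapping the package import; the snippet below is a minimal sketch (the Trainer arguments are arbitrary), not a change already applied in this repository:

# Deprecated package
# import pytorch_lightning as pl

# Recommended package (same Trainer / LightningModule API)
import lightning as L

trainer = L.Trainer(max_epochs=10)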
To generate the LMDB file, first you need to place a pickle file in the data folder with all the raw information inside. We provide both the pickle file (see supplementary data) and the generation script in the scripts folder, so the user can either use our pickle file to re-generate the lmdb file or create a new pickle file from the original QM9S dataset[2].
Then, after getting the pickle file, the user needs to generate the LMDB file using the create_lmdb.py script as follows:
from vibraclip.preprocessing.graph import QM9Spectra
# Paths
data_path = "./data/qm9s_ir_raman.pkl"
db_path = "./data/qm9s_ir_raman"
# LMDB Generator
extractor = QM9Spectra(
    data_path=data_path,   # Path where the pickle file is placed
    db_path=db_path,       # Path where the LMDB file will be located
    spectra_dim=1750,      # Interpolate both IR and Raman spectra to a given dimension
)
# Run
extractor.get_lmdb()
# extractor.get_pickle()

This method automatically generates the molecular graph representations and stores the processed IR and Raman spectra, together with other metadata, inside the PyG Data object. Finally, the LMDB file is exported to the same data folder.
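Once generated, the entries can be inspected directly from Python; the snippet below is a minimal sketch, assuming each stored value is a pickled PyG Data object (the file name, key scheme, and attribute names are assumptions):

import lmdb
import pickle

# Open the generated database read-only (file name and single-file layout assumed).
env = lmdb.open("./data/qm9s_ir_raman.lmdb", subdir=False, readonly=True, lock=False)
with env.begin() as txn:
    for key, value in txn.cursor():
        data = pickle.loads(value)  # assumed: one pickled PyG Data object per molecule
        print(key, data)            # e.g., Data(x=..., edge_index=..., ir=..., raman=...)
        break
env.close()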
To train VibraCLIP, we use hydra to configure the model's hyperparameters and training settings through the config.yaml file stored in the configs folder. We provide the general configuration file config.yaml for pre-training the model and config_ft.yaml for the fine-tuning (realignment) stage with the QM9S external dataset from PubChem and experimental data. We refer the user to the yaml files for all the hyperparameters used in our work.
Please change the experiment id inside the config.yaml file to a label that tracks your experiments (e.g., id: "vibraclip_graph_ir_mass_01"). Also, inside paths, the root_dir tag should be changed to the path where vibraclip is cloned (e.g., root_dir: "/home/USER/vibraclip").
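For reference, a minimal sketch of how a hydra entry point typically consumes these settings; the decorator follows the standard hydra API, the field names simply mirror the tags mentioned above, and the actual main scripts may differ:

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Experiment identifier and root directory set in config.yaml
    # (field names assumed from the tags described above).
    print(cfg.id)
    print(cfg.paths.root_dir)

if __name__ == "__main__":
    main()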
VibraCLIP considers different scenarios for training, depending on the included modalities:
To train VibraCLIP only on the Graph-IR relationship, use the following command:
python main_ir.py --config-name config.yaml

Then, to train VibraCLIP on the Graph-IR-Raman relationships, use the following command:

python main_ir_raman.py --config-name config.yaml

Note that both models can be trained using the same config.yaml file.
The model's checkpoint files are stored automatically in the checkpoints folder, and the RetrievalAccuracy callbacks save a pickle file in the outputs folder for further analysis of the model's performance on the test dataset. These outputs can be visualized with the provided jupyter notebooks (see Evaluate vibraclip performance section).
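For custom analysis outside the notebooks, the callback output can be loaded as a regular pickle; the file name below is hypothetical and the structure of the stored object depends on the RetrievalAccuracy callback:

import pickle

# Hypothetical file name; actual names depend on the experiment id.
output_file = "./outputs/vibraclip_graph_ir_01_retrieval.pkl"

with open(output_file, "rb") as f:
    retrieval_results = pickle.load(f)

print(type(retrieval_results))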
We strongly recommend using the wandb platform to track the training/validation/testing loss functions during execution.
When training starts, a processed folder is created inside the data directory. If another training run is launched, the system automatically reuses the processed data; to force reprocessing, simply delete the processed folder.
For HPO, we use the optuna Python library to optimize both the model's architecture and the training hyperparameters. Since VibraCLIP is a multi-modal framework, we use a multi-objective optimization strategy that jointly minimizes the validation loss associated with the graph representation and the validation loss associated with the spectra. We recommend that the user reviews the main_optuna.py script before launching an HPO experiment.
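As an illustration of this setup only, a minimal multi-objective optuna sketch; the search space and the train_and_validate helper are hypothetical and do not reproduce main_optuna.py:

import optuna

def objective(trial):
    # Hypothetical search space; the real ranges live in main_optuna.py.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [128, 256, 512])

    # Hypothetical helper: trains VibraCLIP once and returns both validation
    # losses (graph branch and spectra branch).
    val_loss_graph, val_loss_spectra = train_and_validate(lr=lr, hidden_dim=hidden_dim)

    # One objective per modality branch.
    return val_loss_graph, val_loss_spectra

study = optuna.create_study(directions=["minimize", "minimize"])
study.optimize(objective, n_trials=50)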
We provide two jupyter notebooks in the notebooks folder, along with the pickle files containing all the testing data (see supplementary data section), to analyze and visualize the performance of VibraCLIP and reproduce the plots from the publication manuscript.
notebooks/vibraclip_metrics.ipynb: plots the retrieval accuracy of the test set and the chemical spaces based on TopK.
notebooks/vibraclip_plots.ipynb: the actual retrieval accuracy plots from the publication, for better comparison.
We also include a notebooks/figures folder with all the performance and molecular grids from the publication. The notebooks/outputs folder is where the callback pickle files should be placed for analysis.
Inside the Makefile there are a few handy commands to streamline cleaning tasks.
make wandb-sync # In case of using wandb offline
make clean-data # To remove the processed folder from PyG
make clean-all # To clean __pycache__ folders and other unnecessary files

The supplementary data has been published in a Zenodo repository, providing the datasets in both pickle and LMDB formats, the pre-trained VibraCLIP checkpoints for all experiments, and the output callback pickle files to ensure full reproducibility of the reported results.
To reproduce the reported results, the dataset pickle and LMDB files should be placed in the data folder, the pre-trained checkpoints in the pre_trained folder, and the outputs from the callbacks can be visualized by placing them within the notebooks/outputs folder.
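For loading a pre-trained checkpoint programmatically, the standard Lightning load_from_checkpoint classmethod applies; the import path, class name, and file name below are assumptions:

from vibraclip.models import VibraCLIP  # hypothetical import path

# Hypothetical checkpoint name; use the actual files from the Zenodo record.
ckpt_path = "pre_trained/vibraclip_graph_ir.ckpt"

model = VibraCLIP.load_from_checkpoint(ckpt_path)
model.eval()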
The authors thank the Institute of Chemical Research of Catalonia (ICIQ) Summer Fellow Program for its support. We also acknowledge the Department of Research and Universities of the Generalitat de Catalunya for funding through grant (reference: SGR-01155). Additionally, we are grateful to Dr. Georgiana Stoica and Mariona Urtasun from the ICIQ Research Support Area (Spectroscopy and Material Characterization Unit) for their valuable assistance. Computational resources were provided by the Barcelona Supercomputing Center (BSC), which we gratefully acknowledge.
VibraCLIP is released under the MIT license.
If you use this codebase in your work, please consider citing:
@article{vibraclip,
  title     = {Multi-Modal Contrastive Learning for Chemical Structure Elucidation with VibraCLIP},
  author    = {Rocabert-Oriols, Pau and Conte, Camilla Lo and Lopez, Nuria and Heras-Domingo, Javier},
  journal   = {Digital Discovery},
  volume    = {4},
  number    = {12},
  pages     = {3818--3827},
  year      = {2025},
  publisher = {Royal Society of Chemistry},
  doi       = {https://doi.org/10.1039/D5DD00269A},
}

[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sutskever, I., Learning transferable visual models from natural language supervision. ICML, 2021, 8748-8763, URL
[2] Zou, Z., Zhang, Y., Liang, L., Wei, M., Leng, J., Jiang, J., Hu, W., A deep learning model for predicting selected organic molecular spectra. Nature Computational Science, 2023, 3(11), 957-964, URL
