GitHub - ChemistryLLMs/AMORE: Code and data of "Lost in Translation: Chemical LMs and the Misunderstanding of Molecule Structures" paper

Lost in Translation: Chemical LMs and the Misunderstanding of Molecule Structures

📃 Paper from EMNLP 2024 findings

Abstract

The recent integration of chemistry with natural language processing (NLP) has advanced drug discovery. Molecule representation in language models (LMs) is crucial in enhancing chemical understanding. We propose Augmented Molecular Retrieval (♡AMORE), a flexible zero-shot framework that assesses trustworthiness of Chemical LMs of different natures: trained solely on molecules for chemical tasks and on a combined corpus of natural language texts and string-based structures. The framework relies on molecule augmentations that preserve an underlying chemical, such as kekulization and cycle replacements. The metric is based on the similarity score between distributed representations of molecules and their augmentations.

AMORE

AMORE consists of two main parts: data augmentation and ranking by embeddings distance. Augmentations can be applied to every SMILES data. Embeding distance ranking is based on FAISS and is provided both for last lidden state and all layers states. We also provide already augmented dataset files.

Experimental datasets

Experimental datasets are provided in the folder "data". CHEBI-20 folder contains test part of CHEBI-20 dataset and augmented ones, qm9 folder contains filteret isomeric molecules from qm9 dataset and its augmentations. Every dataset consist of original dataset file and four augmented files, created by using augmentation code.

Augmentations

'./code/augmentation.py' creates 4 types of augmentations:

rdkit canonicalization
explicit addition of hydrogens
kekulization
replacement of cycle identifiers by random numbers

Each augmentation type creates textual representation of molecule using SMILES syntax rules. All textual representations present the same molecule. Full description and theoretical base of each augmenttion type of are provided in the paper.

Molecule description task evaluation

"dataset_augmentation_example" is an example of data from original dataset augmentation and getting predictions. "metrics" contains example of calculating ROUGE and METEOR between original gold descriptions and predictions on our augmented datasets.

Last hiddden state evaluation

"eval_faiss.py" consist of functions for compute embeddings and metrics of all provided models. "Last_hidden_state_MRR" contains examples of calculating MRR metrics on our datasets and models. "last_hidden_state_reversed" contains examples of calculating acc@1 and acc@5 metrics on our datasets and models.

All hidden states evaluation

"hidden.py" consist of functions for compute embeddings of all layers and metrics of all provided models. "hidden_states_scifive" contains example of calculating acc@1 and acc@5 metrics on our datasets.

Repeat and reconstruct experiments

"tutorials" folder provides explonations about all evaluation parts and shows how to apply the framework to your own datasets.

References

If you use our repository, please cite the following related paper:

@inproceedings{ganeeva-etal-2024-lost,
    title = "Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures",
    author = "Ganeeva, Veronika  and
      Sakhovskiy, Andrey  and
      Khrabrov, Kuzma  and
      Savchenko, Andrey  and
      Kadurin, Artur  and
      Tutubalina, Elena",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.760",
    pages = "12994--13013",
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
code		code
data		data
images		images
tutorials		tutorials
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lost in Translation: Chemical LMs and the Misunderstanding of Molecule Structures

Abstract

AMORE

Experimental datasets

Augmentations

Molecule description task evaluation

Last hiddden state evaluation

All hidden states evaluation

Repeat and reconstruct experiments

References

About

Uh oh!

Releases

Packages

Languages

ChemistryLLMs/AMORE

Folders and files

Latest commit

History

Repository files navigation

Lost in Translation: Chemical LMs and the Misunderstanding of Molecule Structures

Abstract

AMORE

Experimental datasets

Augmentations

Molecule description task evaluation

Last hiddden state evaluation

All hidden states evaluation

Repeat and reconstruct experiments

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages