Retrosynthesis Pathway Optimization Augmented by Molecular Analogs

This repository implements an end-to-end machine learning pipeline that proposes and scores synthetic pathways for known drug molecules. The system combines a graph neural network (GNN) reaction yield predictor with a SMILES-based variational autoencoder (VAE) for molecular analog generation. Together, these components support speculative retrosynthetic planning: the pipeline suggests high-yield reaction pathways and expands the synthesis search space through analog substitution. The pipeline can be run from the retrosynthesis_model_pipeline.ipynb notebook.

Developed as the final project for CHEM 277B – Spring 2025, UC Berkeley MSSE.

Authors: Haris Saeed, Paul Graggs, Girnar Joshi, Festo Muhire

Motivation

Drug discovery is complex, costly, and time-consuming, with high failure rates. Despite many emerging technologies, finding efficient synthesis routes remains difficult.
The field is also highly competitive: many companies invest heavily, which increases pressure to innovate, reduce costs, and shorten timelines.
Using AI to explore alternative synthesis pathways could improve the availability of known drugs and lower their cost by predicting and optimizing high-yield production routes.

Components

1. The Pipeline

Given a target SMILES, this module:

Splits it into synthons using RDKit BRICS
Searches for matching or similar reactants in the dataset
Uses a trained GNN to predict the yield of forming the target from those reactants
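
The first two steps above can be sketched with RDKit as follows. This is a minimal illustration, not the repository's code: the function names, the choice of RDKit's topological fingerprint with Tanimoto similarity, and the candidate reactant list are assumptions. The yield prediction step is left out because it depends on the trained model.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import BRICS


def brics_synthons(target_smiles):
    """Split a target molecule into BRICS fragments (synthon SMILES)."""
    mol = Chem.MolFromSmiles(target_smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {target_smiles!r}")
    # BRICSDecompose returns a set of fragment SMILES whose dummy atoms
    # mark the attachment points of the broken bonds.
    return sorted(BRICS.BRICSDecompose(mol))


def most_similar_reactant(synthon_smiles, candidate_smiles):
    """Return (best_candidate, tanimoto) for one synthon against a pool.

    Assumption: similarity is Tanimoto over RDKit topological fingerprints;
    the repository may use a different fingerprint or matching rule.
    """
    query_fp = Chem.RDKFingerprint(Chem.MolFromSmiles(synthon_smiles))
    best, best_score = None, -1.0
    for smi in candidate_smiles:
        cand = Chem.MolFromSmiles(smi)
        if cand is None:
            continue  # skip unparsable dataset entries
        score = DataStructs.TanimotoSimilarity(query_fp, Chem.RDKFingerprint(cand))
        if score > best_score:
            best, best_score = smi, score
    return best, best_score


# Hypothetical reactant pool standing in for the reaction dataset.
candidates = ["CC(=O)O", "Oc1ccccc1C(=O)O", "CCO"]
for synthon in brics_synthons("CC(=O)Oc1ccccc1C(=O)O"):
    print(synthon, most_similar_reactant(synthon, candidates))
```

In the full pipeline, the matched reactants and the target would then be passed to the yield predictor.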

2. The System

Builds multi-step retrosynthetic pathways by recursively applying the pipeline to intermediate reactants.
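
A rough sketch of that recursion is shown below. The single-step pipeline is passed in as a callable (`propose_step`, a hypothetical name), and the stopping rules (a depth limit and a "purchasable" check) are assumptions, since the README does not state how the recursion terminates.

```python
def build_pathway(target, propose_step, is_purchasable, max_depth=3):
    """Recursively expand `target` into a tree of retrosynthetic steps.

    propose_step(smiles) -> (reactants, predicted_yield) or None
        Hypothetical stand-in for the single-step pipeline.
    is_purchasable(smiles) -> bool
        Assumed stopping rule: do not expand available starting materials.
    """
    leaf = {"smiles": target, "yield": None, "children": []}
    if max_depth == 0 or is_purchasable(target):
        return leaf
    step = propose_step(target)
    if step is None:
        return leaf  # no disconnection found for this intermediate
    reactants, predicted_yield = step
    children = [
        build_pathway(r, propose_step, is_purchasable, max_depth - 1)
        for r in reactants
    ]
    return {"smiles": target, "yield": predicted_yield, "children": children}


# Toy example with placeholder molecule names instead of real SMILES.
toy_steps = {"D": (["B", "C"], 80.0), "B": (["A"], 90.0)}
tree = build_pathway("D", toy_steps.get, lambda s: s in {"A", "C"})
print(tree)
```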

3. The Process

Runs the system multiple times to generate multiple unique pathways, scoring each and selecting the highest-yield route.
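
One way to picture this selection loop is sketched below. The pathway representation (a sequence of `(product, reactants, yield)` steps) and the scoring rule (overall yield as the product of step yields) are assumptions made for illustration; the repository may score routes differently.

```python
import math


def pathway_score(pathway):
    """Overall yield in percent, assuming step yields multiply."""
    return 100.0 * math.prod(step_yield / 100.0 for _, _, step_yield in pathway)


def best_pathway(generate_pathway, n_runs=10):
    """Call the (stochastic) generator n_runs times and keep the best unique route."""
    unique = {}
    for _ in range(n_runs):
        pathway = tuple(generate_pathway())
        # Two pathways count as the same route if they use the same steps,
        # regardless of reactant order or predicted yield.
        key = tuple((product, tuple(sorted(reactants))) for product, reactants, _ in pathway)
        unique.setdefault(key, pathway)
    return max(unique.values(), key=pathway_score)


# Toy generator that alternates between two hypothetical routes.
routes = [
    [("D", ["B", "C"], 80.0), ("B", ["A"], 90.0)],
    [("D", ["E"], 95.0)],
]
calls = iter(range(1000))
best = best_pathway(lambda: routes[next(calls) % 2], n_runs=6)
print(best, pathway_score(best))
```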

ML Models

Yield Predictor (GNN)

Trained on the USPTO-Applications dataset (~2M reactions)
Uses PyTorch Geometric graph objects (atom and bond features)
Architecture: Graph Attention Network (GAT) → MLP predictor
Goal: Predict reaction yield (0–100%) for reactant → product conversions

Molecular Analog Generator (SMILES-VAE)

Trained on synthons generated via BRICS from 10k SMILES
GRU-based VAE trained with cross-entropy + KL divergence loss
Analog generation via latent noise + top-k sampling
Only valid SMILES are returned; invalid strings are filtered out with RDKit
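
The pieces named above can be illustrated in plain Python, independent of the trained model. The sketch below shows the two loss terms for one token and one latent vector, latent perturbation, top-k sampling over decoder logits, and a validity filter. All function names are hypothetical, the relative weighting of the KL term is not specified here, and the validity check is passed in as a callable (in practice it would likely be something like `Chem.MolFromSmiles(s) is not None`).

```python
import math
import random


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target token."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


def kl_divergence(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def perturb_latent(z, sigma, rng):
    """Add Gaussian noise to a latent vector to move toward nearby analogs."""
    return [zi + rng.gauss(0.0, sigma) for zi in z]


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


def keep_valid(candidates, is_valid):
    """Drop generated strings that fail the validity check."""
    return [s for s in candidates if is_valid(s)]


rng = random.Random(0)
print(top_k_sample([0.1, 5.0, 0.3, 2.0], k=2, rng=rng))
print(perturb_latent([0.0, 1.0], sigma=0.1, rng=rng))
```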

