QUARC is a data-driven model for recommending agents, temperature, and equivalence ratios for organic synthesis (see paper).
Important
The QUARC models used in the paper rely on the NameRxn reaction classification codes as part of the model input. Specifically, the reaction class is encoded as a one-hot vector, requiring access to the full NameRxn code mapping.
- Users with a Pistachio license can access the 2023Q4 reaction-type mapping (2271 NameRxn classes plus Unrecognized) from the Pistachio Reaction Types.csv file on the Pistachio webapp. Alternatively, you may email xiaoqis@mit.edu to obtain the file directly.
- Users without NameRxn access can try our open-source version, which eliminates this dependency. This version is planned to be integrated into ASKCOS in the October release.
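To illustrate why the full mapping is needed, here is a minimal sketch of one-hot encoding a reaction class against a fixed list of codes. It is not QUARC's actual encoder: the function names `build_class_index` and `one_hot` are hypothetical, the three-entry code list is a toy stand-in for the 2272-entry mapping, and raising an error on unknown codes is an assumption.

```python
from typing import Dict, List, Sequence


def build_class_index(codes: Sequence[str]) -> Dict[str, int]:
    """Map each class code to a fixed position, in the order given."""
    index: Dict[str, int] = {}
    for code in codes:
        if code in index:
            raise ValueError(f"duplicate class code: {code}")
        index[code] = len(index)
    return index


def one_hot(code: str, index: Dict[str, int]) -> List[int]:
    """Return a vector of len(index) with a single 1 at the code's position."""
    if code not in index:
        # Assumption: unknown codes are an error here; the real pipeline may differ.
        raise KeyError(f"class code not in mapping: {code}")
    vector = [0] * len(index)
    vector[index[code]] = 1
    return vector


# Toy stand-in for the full mapping (placeholder codes, not the real file contents).
toy_index = build_class_index(["0.0", "1.8.7", "2.1.1"])
encoded = one_hot("1.8.7", toy_index)
```

Because both the vector length and each class's position are fixed by the mapping, a file with a different number or order of classes would produce inputs that the pretrained models were not trained on.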
If you just want to predict conditions for your reactions using the provided pretrained models:
# 1. Create conda environment
conda env create -f environment.yml -n quarc
conda activate quarc
pip install --no-deps -e .
# 2. Configure NameRxn Code Mapping (REQUIRED)
export PISTACHIO_NAMERXN_PATH="/path/to/your/Pistachio Reaction Types.csv"
# 3. Set data paths (or use the defaults in configs/quarc_config.yaml)
export DATA_ROOT="~/quarc/data"
export PROCESSED_DATA_ROOT="~/quarc/data/processed"
The 2023Q4 version of Pistachio Reaction Types.csv is required for compatibility with the pretrained models, which expect 2272 classes for the reaction class encoding. Using a different version may cause the model to fail.
sh checkpoints/download_trained_models.sh
# Get predictions using the example input file
python scripts/inference.py \
--config-path configs/ffn_pipeline.yaml \
--input data/example_input.json \
--output predictions.json \
--top-k 5
Results will be in predictions.json with recommended agents, temperatures, and amounts. Atom-mapped SMILES are required for the GNN models.
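Input files for scripts/inference.py are JSON lists of records with `rxn_smiles`, `rxn_class`, and `doc_id` fields (see Input Format). The following is a minimal sketch of writing such a file from Python. The helper names `make_record` and `write_input` are hypothetical, the separator check is only a light sanity test, and the example values are placeholders rather than a verified reaction and class pairing.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def make_record(rxn_smiles: str, rxn_class: str, doc_id: str) -> Dict[str, str]:
    """Build one input record with the three fields used in the example input."""
    # Light sanity check: reaction SMILES have the form reactants>agents>products.
    if rxn_smiles.count(">") != 2:
        raise ValueError(f"not a reaction SMILES: {rxn_smiles!r}")
    return {"rxn_smiles": rxn_smiles, "rxn_class": rxn_class, "doc_id": doc_id}


def write_input(records: List[Dict[str, str]], path: Path) -> None:
    """Write the records to disk as a JSON list."""
    path.write_text(json.dumps(records, indent=2))


# Placeholder values: the SMILES is unmapped (so not suitable for the GNN models),
# and the class code is copied from the example input, not a verified
# classification of this reaction.
records = [make_record("CC(=O)O.OCC>>CC(=O)OCC", "1.8.7", "my_reaction_1")]
out_path = Path(tempfile.mkdtemp()) / "input.json"
write_input(records, out_path)
```

The resulting file can then be passed to the inference script through `--input`.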
Input Format
[
{
"rxn_smiles": "[CH3:1][O:2][C:3]...",
"rxn_class": "1.8.7",
"doc_id": "my_reaction_1"
}
]
Model Options
# FFN models (works with any SMILES)
python scripts/inference.py \
--config-path configs/ffn_pipeline.yaml \
--input input.json \
--output predictions.json \
--top-k 5
# GNN models (requires atom-mapped SMILES)
python scripts/inference.py \
--config-path configs/gnn_pipeline.yaml \
--input input.json \
--output predictions.json \
--top-k 5
# Also supports pickle input (e.g., preprocessed test sets)
python scripts/inference.py \
--config-path configs/ffn_pipeline.yaml \
--input data/processed/overlap/overlap_test.pickle \
--output predictions.json \
--top-k 5
Note: Requires Pistachio's density data and NameRxn access
Create conda environment and install dependencies:
conda env create -f environment.yml -n quarc
conda activate quarc
pip install --no-deps -e .
Configure paths using one of the following options:
- Option 1: Environment Variables (Recommended)
  Variables can be set directly in terminal or in a .env file.
  # data paths
  export DATA_ROOT="~/quarc/data"
  export PROCESSED_DATA_ROOT="~/quarc/data/processed"
  export CHECKPOINTS_ROOT="~/quarc/checkpoints"
  export LOGS_ROOT="~/quarc/logs"
  # needed for inference
  export PISTACHIO_NAMERXN_PATH="/path/to/Pistachio Reaction Types.csv"
  # needed for preprocessing
  export PISTACHIO_DENSITY_PATH="/path/to/density.tsv"
  export RAW_DIR="/path/to/pistachio/extract"
- Option 2: Edit Configuration Files
  Edit configs/quarc_config.yaml to modify default paths.
The default values in src/quarc/settings.py are overridden by configs/quarc_config.yaml if present, then further overridden by environment variables, with each step taking precedence over the previous.
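The precedence order can be sketched as a layered lookup. This is not the code in src/quarc/settings.py: `resolve_settings` is a hypothetical helper, the config file is represented by an already-parsed dictionary, and the assumption that the file uses the same key names as the environment variables is made only for illustration.

```python
import os
from typing import Dict, Mapping, Optional

# Illustrative defaults; the real defaults live in src/quarc/settings.py.
DEFAULTS = {
    "DATA_ROOT": "~/quarc/data",
    "PROCESSED_DATA_ROOT": "~/quarc/data/processed",
}


def resolve_settings(
    defaults: Mapping[str, str],
    file_config: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Layer defaults < config file < environment variables."""
    env = os.environ if env is None else env
    resolved = dict(defaults)
    # Values from the config file (if any) replace the defaults.
    for key, value in (file_config or {}).items():
        if key in defaults:
            resolved[key] = value
    # Environment variables take precedence over both earlier layers.
    for key in defaults:
        if key in env:
            resolved[key] = env[key]
    return resolved


settings = resolve_settings(
    DEFAULTS,
    file_config={"DATA_ROOT": "/from/yaml/data", "PROCESSED_DATA_ROOT": "/from/yaml/processed"},
    env={"DATA_ROOT": "/from/env/data"},
)
```

In this example `DATA_ROOT` comes from the environment and `PROCESSED_DATA_ROOT` from the config file, since no environment variable is set for it.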
The preprocessing pipeline transforms raw Pistachio data into ReactionDatum objects that are used for training. You can configure the preprocessing pipeline in configs/preprocess_config.yaml. The dirs section contains placeholder paths that will be overridden by the environment variables. Details of the preprocessing pipeline are described here.
# Run complete preprocessing pipeline
python scripts/preprocess.py \
--config configs/preprocess_config.yaml \
--all
# Or individual steps
python scripts/preprocess.py \
--config configs/preprocess_config.yaml \
--chunk-json \
--collect-dedup \
...
Note that running the --generate-agent-class step will overwrite the agent_encoder_list.json and agent_other_dict.json that we provide in the data/processed/ directory. If you want to use the provided agent_encoder_list.json and agent_other_dict.json, you can skip the --generate-agent-class step.
Run training for each stage:
# Example: stage 1 agent model gnn
python scripts/train.py \
--stage 1 \
--model-type gnn \
--graph-hidden-size 1024 \
--depth 2 \
--hidden-size 2048 \
--n-blocks 3 \
--max-epochs 30 \
--batch-size 512 \
--max-lr 1e-3 \
--logger-name stage1_gnn \
--output-size 1376 \
--num-classes 1376 \
...
Details of training parameters can be found in src/quarc/cli/quarc_parser.py. For binned classification tasks, custom binning can be specified using the --binning-path argument. Example binning configs can be found in configs/binning_config.yaml.
For stage 1 agent prediction, the tensorboard logger only keeps track of the greedy search accuracy. You may want to perform offline beam search evaluation to select the best checkpoint.
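The gap between greedy and beam search accuracy can be seen in a toy example. The sketch below is a generic beam search over a made-up step function, not QUARC's evaluation code: `beam_search` and `toy_step` are hypothetical names, and the token probabilities are invented so that the greedy choice is not the most probable full sequence.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence_ = Tuple[str, ...]
StepFn = Callable[[Sequence_], Dict[str, float]]


def beam_search(step: StepFn, beam_width: int, max_len: int, stop: str = "<eos>") -> List[Tuple[Sequence_, float]]:
    """Return finished sequences with their log-probabilities, best first."""
    beams: List[Tuple[Sequence_, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence_, float]] = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for token, prob in step(seq).items():
                if prob > 0:
                    candidates.append((seq + (token,), logp + math.log(prob)))
        # Keep only the highest-scoring extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            (finished if seq[-1] == stop else beams).append((seq, logp))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished


def toy_step(prefix: Sequence_) -> Dict[str, float]:
    """Invented next-token probabilities for illustration only."""
    if prefix == ():
        return {"A": 0.6, "B": 0.4}
    if prefix == ("A",):
        return {"<eos>": 0.55, "C": 0.45}
    if prefix == ("B",):
        return {"<eos>": 0.9, "C": 0.1}
    return {"<eos>": 1.0}


greedy_best = beam_search(toy_step, beam_width=1, max_len=3)[0][0]
beam_best = beam_search(toy_step, beam_width=2, max_len=3)[0][0]
```

With a beam width of 1 the search commits to the locally best first token, while a width of 2 recovers the sequence with the higher overall probability, so a checkpoint ranked by greedy accuracy alone may not be the best one under beam search.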
To chain the individually trained models together, create a new pipeline config file using configs/ffn_pipeline.yaml as a template.
By default, each stage in the pipeline is assigned an equal weight of 0.25. To improve the overall performance of chained models, you can tune these weights using hyperparameter optimization with Optuna:
python scripts/optimize_weights.py \
--config-path configs/new_pipeline.yaml \
--n-trials 30 \
--sample-size 1000 \
--use-top-k 5 # use top-5 accuracy as the objective
Tip
The optimization script uses the EnumeratePredictor to generate predictions and rank them on the fly. For faster optimization using a larger sample size, you can consider switching to the PrecomputedHierarchicalPredictor, which caches model predictions to avoid redundant computations.
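To show why stage weights can change the final ranking, and why cached per-stage scores make weight tuning cheap, here is a minimal sketch. The combination rule (a weighted sum of per-stage log-probabilities) is an assumption and may not match how QUARC combines stage scores; `combined_score`, `rank`, and the candidate probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Cached per-stage probabilities for two candidate condition sets
# (invented numbers; four entries, one per pipeline stage).
cached_scores: Dict[str, Sequence[float]] = {
    "candidate_a": (0.9, 0.2, 0.5, 0.5),
    "candidate_b": (0.5, 0.6, 0.5, 0.5),
}


def combined_score(stage_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed rule: weighted sum of per-stage log-probabilities."""
    return sum(w * math.log(p) for w, p in zip(weights, stage_probs))


def rank(scores: Dict[str, Sequence[float]], weights: Sequence[float]) -> List[str]:
    """Order candidates from best to worst under the given stage weights."""
    return sorted(scores, key=lambda name: combined_score(scores[name], weights), reverse=True)


# Re-ranking only reuses the cached scores; no model is called again.
equal_ranking = rank(cached_scores, (0.25, 0.25, 0.25, 0.25))
stage1_heavy_ranking = rank(cached_scores, (0.7, 0.1, 0.1, 0.1))
```

Under equal weights `candidate_b` ranks first, while weighting the first stage more heavily puts `candidate_a` first. Since each trial only re-ranks stored scores, many weight settings can be evaluated without repeating model inference.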
The new pipeline config file can be used for inference:
python scripts/inference.py \
--config-path configs/new_pipeline.yaml \
--input data/processed/overlap/overlap_test.pickle \
--output predictions.json \
--top-k 5
If you find our code or model useful, we kindly ask that you consider citing our work in your papers.
@article{Sun2025quarc,
title={Data-Driven Recommendation of Agents, Temperature, and Equivalence Ratios for Organic Synthesis},
author={Sun, Xiaoqi and Liu, Jiannan and Mahjour, Babak and Jensen, Klavs F and Coley, Connor W},
journal={ChemRxiv},
doi={10.26434/chemrxiv-2025-4wzkh},
year={2025}
}