
SCENT: Scalable and Cost-Efficient de Novo Template-Based Molecular Generation

Code for "Scalable and Cost-Efficient de Novo Template-Based Molecular Generation" [arXiv]. Repository based on RGFN and RetroGFN.

SCENT is a follow-up to the RGFN paper [arXiv], extending RGFN with the following components:

  • Recursive Cost Guidance for the backward policy, which utilizes a machine learning model to approximate the recursive cost of backward transitions. From this, we derive two targeted strategies:
    • Synthesis Cost Guidance, which reduces synthesis cost and improves diversity in large building block settings; and
    • Decomposability Guidance, which enforces the validity of backward transitions by discouraging transitions through indecomposable molecules.
  • Exploitation Penalty, a simple yet effective regularizer that promotes exploration by penalizing repeated state-action pairs (a minimal sketch follows this list).
  • Dynamic Library mechanism that augments the initial set of building blocks with high-reward intermediates discovered during optimization. This mechanism enables full-tree synthesis and increases mode coverage in both low- and high-resource scenarios.
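To make the Exploitation Penalty concrete, here is a minimal sketch of the idea, assuming hashable state-action pairs; the class and method names below are hypothetical and do not reflect the repository's actual API:

from collections import Counter

class ExploitationPenalty:
    """Illustrative sketch: penalize state-action pairs
    proportionally to how often they have been visited."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha       # penalty strength (hypothetical default)
        self.visits = Counter()  # visit counts of (state, action) pairs

    def penalty(self, state, action) -> float:
        # The penalty grows with the number of times the pair was taken,
        # nudging the sampler away from already-exploited transitions.
        key = (state, action)
        self.visits[key] += 1
        return self.alpha * (self.visits[key] - 1)

Subtracting penalty(s, a) from, e.g., the log-reward of repeated trajectories is one simple way to plug such a term into training; the exact mechanism used in the paper may differ.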

SCENT finds many more high-reward modes than RGFN, while being drastically faster (since Decomposability Guidance replaces expensive recursive decomposability checks).

Graphical abstract

This repository allows users to reproduce experiments described in the associated paper and easily extend the framework to use custom oracles for optimization, e.g. by specifying a different protein target for the GPU-accelerated docking.

Setup

To create the conda environment, run the following commands:

conda create --name rgfn python=3.11.8 -y
conda activate rgfn

# If using CUDA (NOTE: this assumes CUDA 11.8, replace with your CUDA version):
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install dgl==2.2.1+cu118 -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html

# If using CPU:
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install dgl==1.1.2 -f https://data.dgl.ai/wheels/torch-2.3/cpu/repo.html

pip install -e .

# Might be needed for some systems:
conda install conda-forge::xorg-libxrender

# Optional development tools:
pip install pre-commit
pre-commit install
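After installation, the following quick sanity check (not part of the repository, just a convenience) confirms that torch and dgl import correctly and that CUDA is visible when expected:

# check_env.py -- installation sanity check (not part of the repo)
import torch
import dgl

print("torch:", torch.__version__)   # expect 2.3.0
print("dgl:", dgl.__version__)       # expect 2.2.1 (CUDA) or 1.1.2 (CPU)
print("CUDA available:", torch.cuda.is_available())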

For convenience, all experiments are logged to the experiments directory. If you want to store experiment results on another partition, you can create a symlink to the desired location:

ln -s <your_path> experiments

(Optional) Setup Vina-GPU-2.1

Before following the instructions below, please read these important notes:

  1. The user-specified workspace directory must be an absolute path.
  2. The Vina-GPU-2.1 docking proxy requires a system-wide installation of CUDA to work on GPU (loaded by, e.g., module load cuda/11.8).
  3. The setup.sh script installs Boost 1.83.0 into the workspace directory. If you would like to use a different version of Boost, you will need to modify the setup.sh script.

To set up QV2GPU for use in the QV2GPU rescoring proxy, run the following command:

sh external/install_qv2gpu.sh <workspace_directory>

The default installation path is Vina-GPU-2.1. If your installation lives elsewhere, you can point to it with a symlink:

ln -s <your-Vina-GPU-2.1_installation_path> Vina-GPU-2.1

(Optional) Setup GNEprop proxies

Note that GNEprop is part of a not-yet-published project, and access to the code and models is restricted. In particular, you will need access to a private GitHub repository for the instructions below to work.

To set up the environment for the GNEprop proxies, run the following commands:

pip install descriptastorus==2.6.1 "ray[tune]==2.31.0" ax-platform==0.4.0 matplotlib==3.9.0 seaborn==0.13.2 umap-learn==0.5.6 pytorch-lightning==1.9.5
pip install lightning-bolts==0.7.0 --no-deps

To download the code for the GNEprop model, run the following command:

sh external/setup_gneprop.sh

You also need to download the models manually from Google Drive and put them inside external/gneprop/models. The code will look for the checkpoints in that location.
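A quick way to confirm that the checkpoints landed in the right place (a sketch; the .ckpt extension is an assumption about the downloaded files):

from pathlib import Path

models_dir = Path("external/gneprop/models")
ckpts = sorted(models_dir.glob("*.ckpt"))  # extension is an assumption
if not ckpts:
    raise FileNotFoundError(f"No checkpoints found in {models_dir.resolve()}")
print("Found checkpoints:", [p.name for p in ckpts])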

Finally, note that GNEprop requires a system-wide installation of CUDA to work on GPU (loaded by, e.g., module load cuda/11.8).

Train

To train SCENT using the sEH proxy, run:

python train.py --cfg configs/scent_seh_proxy.gin

The script will dump the results under the experiments/scent_seh_proxy/<timestamp> directory. Our code uses the gin-config package, which allows for lightweight model configuration along with dependency injection.
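If you are unfamiliar with gin, the pattern is simple: functions and classes are marked configurable, and a .gin file binds their parameters. A minimal, self-contained illustration (deliberately unrelated to the repository's actual classes):

import gin

@gin.configurable
def make_optimizer(learning_rate=1e-3, weight_decay=0.0):
    # Defaults can be overridden from a .gin file or a binding string.
    return {"lr": learning_rate, "wd": weight_decay}

# Equivalent to the line `make_optimizer.learning_rate = 5e-4` in a .gin file.
gin.parse_config("make_optimizer.learning_rate = 5e-4")
print(make_optimizer())  # {'lr': 0.0005, 'wd': 0.0}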

Building Blocks Library

Configuration files for the SMALL, MEDIUM, and LARGE settings are available in configs/envs/settings. The building block superset used in the MEDIUM and LARGE settings can be requested from Enamine. The small_extended.gin configuration file defines the SMALL setting with additional support for 3- and 4-ary reaction templates.

Custom Building Blocks / Templates Library

To use a custom building blocks / reaction templates library, create a new configuration file in configs/envs/settings and specify the paths. If you don't have cost estimates and yields, you can set the appropriate paths to None (we also recommend disabling Synthesis Cost Guidance in that case):

cost_path = None
yield_path = None
yield_value = None
include 'configs/policies/scent_decomposability_guided.gin' # to disable Synthesis Cost Guidance
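Before pointing the configuration at a custom library, it can be worth validating the building block SMILES with RDKit. A small standalone check (the one-entry-per-line format and the file name are assumptions about your custom file):

from rdkit import Chem

with open("my_building_blocks.smi") as f:  # hypothetical file name
    smiles = [line.split()[0] for line in f if line.strip()]

invalid = [s for s in smiles if Chem.MolFromSmiles(s) is None]
print(f"{len(smiles) - len(invalid)}/{len(smiles)} building blocks parsed OK")
if invalid:
    print("First invalid entries:", invalid[:10])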

Composed Proxy

The repository provides a ComposedProxy class that allows you to aggregate multiple proxies into a single proxy. An example configuration of ComposedProxy can be found in configs/proxies/composed/qed.gin. A configuration that combines sEH and QED looks like:

include 'configs/scent_seh_proxy.gin'
include 'configs/proxies/composed/qed.gin'

The resulting proxy is defined as $proxy(x) = seh(x) \cdot qed(x)^{0.5}$.
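Conceptually, the composition multiplies the component proxy outputs raised to configured exponents. A schematic in plain Python (not the repository's ComposedProxy implementation; names are hypothetical):

def composed_proxy(x, components):
    """components: list of (proxy_fn, exponent) pairs, e.g.
    [(seh, 1.0), (qed, 0.5)] reproduces seh(x) * qed(x)**0.5."""
    value = 1.0
    for proxy_fn, exponent in components:
        value *= proxy_fn(x) ** exponent
    return value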

Project Structure

API

Under rgfn.api, the repository provides a flexible API that clearly separates the GFlowNet components. States, actions, and action spaces can be represented as arbitrary classes, which allows for easy implementation of GFlowNets with non-static environments (e.g. with dynamic action spaces).

  • env_base.py. Base class for environments. It provides a minimal and flexible interface that can be used to implement environments with dynamic action spaces. An action space is the set of possible actions that can be taken from a state in the forward (forward action space) or backward (backward action space) direction. The reward is decoupled from the environment, so the environment only describes the possible transitions between states. The environment can be reversed to enable backward sampling of trajectories.
  • policy_base.py. A base class for policies. Given the current batch of states, a policy samples corresponding actions. It also computes the log probabilities when the chosen actions and resulting states are provided.
  • sampler_base.py. A base class for samplers. A sampler samples trajectories from the environment using a policy.
  • trajectories.py. A trajectory is a sequence of states and actions sampled by a sampler using the environment and the policy. Every state has a corresponding forward and backward action space that describes the possible actions that can be taken from that state. Trajectories are stored in a batched manner. The terminal states in the trajectories are assigned rewards.
  • reward_base.py. A class representing the reward function. The reward function transforms a proxy output computed over a batch of states into the rewards used to train the policy.
  • proxy_base.py. A base class for proxies. A proxy is a function that takes a batch of states and computes values that are then used to compute the reward (a schematic example follows this list).
  • objective_base.py. A base class for GFN objectives. An objective is a function that takes a batch of trajectories and computes the loss (objective).
  • replay_buffer_base.py. A base class for replay buffers. A replay buffer stores terminal states or trajectories and can sample them in the backward direction using the provided sampler.
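To extend the framework with a custom oracle (e.g. a different docking target, as mentioned above), the natural entry point is a proxy. The following is only a schematic of the idea; the actual interface in proxy_base.py may differ, so treat the class and method names as hypothetical:

from typing import List

class MyCustomProxy:  # in practice, subclass the repository's proxy base class
    """Schematic proxy: maps a batch of states to scalar values
    that the reward function then turns into rewards."""

    def compute_proxy_output(self, states: List) -> List[float]:
        # Replace with a real oracle call, e.g. a docking score
        # or a learned property predictor.
        return [self._score(state) for state in states]

    def _score(self, state) -> float:
        raise NotImplementedError("plug in your oracle here")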

Shared

Under rgfn.shared, the repository provides shared utilities that are used across the different GFlowNet implementations, e.g. the Trajectory Balance objective, the Conditioned Trajectory Balance objective, a uniform policy, a cached proxy base class, random samplers, a reward-prioritized buffer, etc.

GFNs

Under rgfn.gfns, the repository provides the implementations of the GFlowNets.

Citation

@inproceedings{gainski2025scalable,
  title={Scalable and cost-efficient de novo template-based molecular generation},
  author={Gai{\'n}ski, Piotr and Boussif, Oussama and Shevchuk, Dmytro and Rekesh, Andrei and Parviz, Ali and Tyers, Mike and Batey, Robert A and Koziarski, Micha{\l}},
  booktitle={ICLR 2025 Workshop on Generative and Experimental Perspectives for Biomolecular Design},
  year={2025}
}
