MolbitAL is a modular active‑learning engine designed for ultra‑large ligand virtual screening. It coordinates end‑to‑end workflows, from ligand preparation and molecular docking to machine learning with acquisition, scaling efficiently to billions of compounds on local workstations or HPC clusters.
The framework plugs into AutoDock‑Vina and its variants (AutoDock‑Vina, Vina‑GPU 2.1, QVina 2.1, QVina‑W, smina, Uni‑Dock), and currently supports Random Forest (RF) and XGBoost‑Distribution (XGB‑D) surrogate models for fast affinity ranking.
(aimnet-x2d will be integrated soon.)
- Features
- Requirements
- Installation
- Quick Start
- Detailed Workflow
- Acquisition Strategies
- Datasets
- License
- Citation
- Acknowledgements
- High scalability – splits SMILES collections into tractable chunks and streams them through GPU‑accelerated docking engines.
- Modular architecture – swap docking engines, ML models, or acquisition functions via a single JSON config.
- Stateful cycles – resume active‑learning cycles seamlessly (checkpointed by `project_name`).
- Isomer/tautomer enumeration – integrates the OpenEye Toolkits.
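The stateful-cycle feature could be implemented roughly as below. This is a hypothetical sketch only: the real checkpoint layout under `project_name` may differ, and the `cycle_*` directory naming is an assumption for illustration.

```python
from pathlib import Path
import re

def latest_completed_cycle(project_name: str) -> int:
    """Return the highest cycle number found under the project folder (0 if none).

    Assumes (hypothetically) that each finished cycle leaves a `cycle_<N>`
    subdirectory inside the project folder.
    """
    root = Path(project_name)
    cycles = [
        int(m.group(1))
        for d in root.glob("cycle_*")
        if (m := re.fullmatch(r"cycle_(\d+)", d.name))
    ]
    return max(cycles, default=0)
```

A resume would then start from `latest_completed_cycle(project_name) + 1` rather than cycle 1.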
- Python ≥ 3.10
- CUDA 12+ (optional, for Vina‑GPU 2.1 and Uni‑Dock)
- OpenEye Toolkits (optional, for isomer/tautomer enumeration)
```bash
# 1. Clone the repository
git clone https://github.com/isayeblab/molbital.git

# 2. Create the conda environment
mamba env create --file environment.yml
mamba activate molbital

# 3. Install MolbitAL and docking wrappers
python setup.py install
```

Note: for AutoDock-Vina-GPU 2.1, you need to manually set the OpenCL, Boost, and CUDA toolkit paths (for details, see https://github.com/DeltaGroupNJUPT/Vina-GPU-2.1).
When your SMILES directories, fingerprints, and binding-pocket grid are ready, launch an entire active‑learning loop with one line:

```bash
python run_al.py --config run_al.json
```

Key fields in run_al.json:
| Field | Description |
|---|---|
| `project_name` | Folder where every cycle's outputs are stored |
| `smiles_dir` | Directory containing SMILES chunks (`*.smi`) |
| `fingerprint_dir` | Matching binary fingerprints (`*.txt`) |
| `receptor` | Prepared protein in PDBQT format |
| `config` | Docking box file generated by `build_autobox.py` |
| `docking_engine` | `vina`, `vina-gpu2.1`, `qvina2.1`, `qvina-w`, `smina`, or `unidock` |
| `ml_model` | `rf` or `xgbd` |
| `acquisition` | `greedy`, `lcb`, or `unc` |
| `start_cycle` / `end_cycle` | Range of AL iterations |
| `max_gpu_memory` | (Uni‑Dock) cap on GPU memory per batch |
| `workers` | Number of CPU workers for I/O‑intensive steps |
| `script_dir` | Root directory that contains the `phase_*` and `utils` folders |
If `start_cycle` > 1, MolbitAL automatically resumes from the previous checkpoint.
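For orientation, a minimal config matching the fields above could be assembled like this. All values here are illustrative placeholders, and the exact set of accepted values may differ in your install:

```python
import json

def write_minimal_config(path: str) -> dict:
    """Write an illustrative run_al.json and return the dict (values are examples)."""
    config = {
        "project_name": "pkm2_project",
        "smiles_dir": "smiles_all/",
        "fingerprint_dir": "smiles_all/",
        "receptor": "3me3_protein.pdbqt",
        "config": "auto_box.txt",
        "docking_engine": "unidock",
        "ml_model": "rf",
        "acquisition": "greedy",
        "start_cycle": 1,
        "end_cycle": 5,
        "max_gpu_memory": 2,
        "workers": 8,
        "script_dir": ".",
    }
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
    return config
```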
Below is the manual workflow, broken down by phase.
```bash
python prepare_receptor4.py -r 3me3_protein.pdb \
    -A checkhydrogens \
    -o 3me3_protein.pdbqt
```

Split libraries into N ligand subsamples (adjust `--target_num` as needed):
```bash
python utils/split_smiles.py --input total.smi \
    --output smiles/ \
    --target_num 1000
```

Optional OpenEye enumeration:
```bash
./generate_isomers.sh -i smiles/ -o smiles_isomers -m 16 -n -b
./generate_tautomers.sh -i smiles_isomers/ -o smiles_all -m 16 -p -t 30.0
```

See the bash scripts for detailed option descriptions.
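The library-splitting step performed by `utils/split_smiles.py` can be sketched as follows. This is a simplified stand-in, not the actual script: it assumes one SMILES per line, and the `chunk_*.smi` naming is hypothetical.

```python
from pathlib import Path

def split_smiles(input_smi: str, output_dir: str, target_num: int = 1000) -> int:
    """Write chunks of at most `target_num` SMILES lines; return the chunk count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = [l for l in Path(input_smi).read_text().splitlines() if l.strip()]
    chunks = [lines[i:i + target_num] for i in range(0, len(lines), target_num)]
    for idx, chunk in enumerate(chunks, start=1):
        # hypothetical chunk naming; the real script may name files differently
        (out / f"chunk_{idx:04d}.smi").write_text("\n".join(chunk) + "\n")
    return len(chunks)
```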
```bash
./generate_fingerprints.sh -i smiles_all/ -o smiles_all/
```

Generate an autobox grid:
```bash
python utils/build_autobox.py --input box.sdf \
    --output auto_box.txt
```

Run docking (Uni‑Dock example):
```bash
python phase_1/run_docking.py \
    --engine unidock \
    --receptor 3me3_protein.pdbqt \
    --ligand-index pkm2_train/train_pdbqt/ \
    --config pkm2_box.txt \
    --smiles-file pkm2_train/train_smiles.smi \
    --output-dir pkm2_train/train_docking \
    --results-csv scores.csv \
    --active \
    --extra-args --max_gpu_memory 2
```

Run docking (Vina example):
```bash
python phase_1/run_docking.py \
    --engine vina \
    --receptor 3me3_protein.pdbqt \
    --ligands-dir pkm2_train/train_pdbqt/ \
    --config pkm2_box.txt \
    --smiles-file pkm2_train/train_smiles.smi \
    --output-dir pkm2_train/train_docking \
    --results-csv scores.csv \
    --active
```

For a detailed description of Uni‑Dock options, see https://github.com/dptech-corp/Uni-Dock
Substitute `--engine` with `vina-gpu`, `smina`, `qvina`, or `qvina-w` as needed.
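Conceptually, a wrapper like `run_docking.py` dispatches each ligand to the chosen engine's binary. The sketch below is a hedged illustration only: the binary names in `ENGINE_BINARIES` and the flags are assumptions, and each engine's real CLI differs (e.g. Uni‑Dock takes batched ligand indexes rather than single ligands).

```python
from __future__ import annotations

# Assumed binary names for illustration; adjust for your install.
ENGINE_BINARIES = {
    "vina": "vina",
    "smina": "smina",
    "qvina2.1": "qvina2.1",
    "qvina-w": "qvina-w",
    "unidock": "unidock",
}

def build_command(engine: str, receptor: str, ligand: str, box_config: str) -> list[str]:
    """Return an argv list suitable for subprocess.run(...)."""
    if engine not in ENGINE_BINARIES:
        raise ValueError(f"unsupported engine: {engine}")
    return [
        ENGINE_BINARIES[engine],
        "--receptor", receptor,
        "--ligand", ligand,
        "--config", box_config,
    ]
```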
```bash
python phase_2/train_ml.py \
    --input-csv pkm2_train/train_docking/scores.csv \
    --input-fp pkm2_train/train_fingerprints.txt \
    --output-dir pkm2_train/train_ml \
    --save-predictions \
    --model rf
```

```bash
python phase_2/inference.py \
    --model pkm2_train/train_ml/ml_models_rf/rf_model.joblib \
    --scaler pkm2_train/train_ml/ml_models_rf/scaler.joblib \
    --input-smi-dir pkm2_split/ \
    --input-fp-dir pkm2_fingerprints/ \
    --output-dir pkm2_train/train_inference \
    --cycle 2 \
    --k 2.0 \
    --top-compounds 1000 \
    --train-smi pkm2_train/train_smiles.smi \
    --train-fp pkm2_train/train_fingerprints.txt \
    --active \
    --acquisition greedy
```

When `--active` is set, MolbitAL writes the next‑cycle SMILES and fingerprints automatically.
| Name | Formula | Intuition |
|---|---|---|
| Greedy | ŷ(x) | Exploit top‑ranked predictions |
| LCB | ŷ(x) − k·σ(x) | Balance exploitation and exploration via weight k |
| UNC | σ(x) | Pure exploration: sample the most uncertain compounds |
`--k` sets the uncertainty weight k (default: 2.0).
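The three rules in the table can be sketched directly. This assumes the docking convention that lower (more negative) scores are better, so greedy and LCB pick the lowest scores while UNC picks the largest σ; MolbitAL's own sign conventions may differ.

```python
import numpy as np

def acquisition_scores(mean, std, strategy="greedy", k=2.0):
    """Score each compound under one of the three acquisition rules."""
    if strategy == "greedy":
        return mean                  # exploit predicted affinity
    if strategy == "lcb":
        return mean - k * std        # optimistic lower confidence bound
    if strategy == "unc":
        return std                   # pure exploration
    raise ValueError(f"unknown acquisition: {strategy}")

def select_top(mean, std, n, strategy="greedy", k=2.0):
    """Return the indices of the n compounds to dock in the next cycle."""
    scores = acquisition_scores(np.asarray(mean), np.asarray(std), strategy, k)
    order = np.argsort(scores)       # ascending: best (lowest) first
    if strategy == "unc":
        order = order[::-1]          # most uncertain first
    return order[:n]
```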
MolbitAL is benchmarked on two datasets that mimic sparse-hit scenarios:
- PKM2 – pyruvate kinase muscle isoform 2 (546 actives, 244,679 inactives; 0.2% active)
- ALDH1 – aldehyde dehydrogenase 1 (5,363 actives, 101,874 inactives; 5% active)
Both subsets are distilled from the LIT‑PCBA open benchmark; see https://drugdesign.unistra.fr/LIT-PCBA/
MolbitAL is released under the MIT License (see LICENSE).

