MolbitAL is a modular active‑learning engine designed for ultra‑large ligand virtual screening. It coordinates end‑to‑end workflows, from ligand preparation and molecular docking to machine learning with acquisition, scaling efficiently to billions of compounds on local workstations or HPC clusters.
The framework plugs into AutoDock‑Vina and its variants (AutoDock‑Vina, Vina‑GPU 2.1, QVina 2.1, QVina‑W, smina, Uni‑Dock), and currently supports Random Forest (RF) and XGBoost‑Distribution (XGB‑D) surrogate models for fast affinity ranking.
(aimnet-x2d will be integrated soon.)
- Features
- Requirements
- Installation
- Quick Start
- Detailed Workflow
- Acquisition Strategies
- Datasets
- License
- Citation
- Acknowledgements
- High scalability – splits SMILES collections into tractable chunks and streams them through GPU‑accelerated docking engines.
- Modular architecture – swap docking engines, ML models, or acquisition functions via a single JSON config.
- Stateful cycles – resume active‑learning cycles seamlessly (checkpointed by `project_name`).
- Isomer/tautomer enumeration – integrates the OpenEye Toolkits.
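The stateful-cycle feature could be implemented roughly as below. This is a hypothetical sketch only: the real checkpoint layout under `project_name` may differ, and the `cycle_*` directory naming is an assumption for illustration.

```python
from pathlib import Path
import re

def latest_completed_cycle(project_name: str) -> int:
    """Return the highest cycle number found under the project folder (0 if none).

    Assumes (hypothetically) that each finished cycle leaves a `cycle_<N>`
    subdirectory inside the project folder.
    """
    root = Path(project_name)
    cycles = [
        int(m.group(1))
        for d in root.glob("cycle_*")
        if (m := re.fullmatch(r"cycle_(\d+)", d.name))
    ]
    return max(cycles, default=0)
```

A resume would then start from `latest_completed_cycle(project_name) + 1` rather than cycle 1.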
- Python ≥ 3.10
- CUDA 12+ (optional, for Vina‑GPU 2.1 and Uni‑Dock)
- OpenEye Toolkits (optional, for isomer/tautomer enumeration)
```bash
# 1. Clone the repository
git clone https://github.com/isayeblab/molbital.git

# 2. Create the conda environment
mamba env create --file environment.yml
mamba activate molbital

# 3. Install MolbitAL and docking wrappers
python setup.py install
```

Note: for AutoDock-Vina-GPU 2.1, you need to manually set the OpenCL, Boost, and CUDA toolkit paths (for details, see https://github.com/DeltaGroupNJUPT/Vina-GPU-2.1).
When your SMILES directories, fingerprints, and binding-pocket grid are ready, launch an entire active‑learning loop with one line:

```bash
python run_al.py --config run_al.json
```

Key fields in run_al.json:
| Field | Description |
|---|---|
| `project_name` | Folder where every cycle's outputs are stored |
| `smiles_dir` | Directory containing SMILES chunks (`*.smi`) |
| `fingerprint_dir` | Matching binary fingerprints (`*.txt`) |
| `receptor` | Prepared protein in PDBQT format |
| `config` | Docking box file generated by `build_autobox.py` |
| `docking_engine` | `vina`, `vina-gpu2.1`, `qvina2.1`, `qvina-w`, `smina`, or `unidock` |
| `ml_model` | `rf` or `xgbd` |
| `acquisition` | `greedy`, `lcb`, or `unc` |
| `start_cycle` / `end_cycle` | Range of AL iterations |
| `max_gpu_memory` | (Uni‑Dock) cap on GPU memory per batch |
| `workers` | Number of CPU workers for I/O‑intensive steps |
| `script_dir` | Root directory that contains the `phase_*` and `utils` folders |
If `start_cycle` > 1, MolbitAL automatically resumes from the previous checkpoint.
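For orientation, a minimal config matching the fields above could be assembled like this. All values here are illustrative placeholders, and the exact set of accepted values may differ in your install:

```python
import json

def write_minimal_config(path: str) -> dict:
    """Write an illustrative run_al.json and return the dict (values are examples)."""
    config = {
        "project_name": "pkm2_project",
        "smiles_dir": "smiles_all/",
        "fingerprint_dir": "smiles_all/",
        "receptor": "3me3_protein.pdbqt",
        "config": "auto_box.txt",
        "docking_engine": "unidock",
        "ml_model": "rf",
        "acquisition": "greedy",
        "start_cycle": 1,
        "end_cycle": 5,
        "max_gpu_memory": 2,
        "workers": 8,
        "script_dir": ".",
    }
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
    return config
```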
Below is the manual workflow, broken down by phase.
```bash
python prepare_receptor4.py -r 3me3_protein.pdb \
    -A checkhydrogens \
    -o 3me3_protein.pdbqt
```

Split libraries into N ligand subsamples (adjust `--target_num` as needed):
```bash
python utils/split_smiles.py --input total.smi \
    --output smiles/ \
    --target_num 1000
```

Optional OpenEye enumeration:
```bash
./generate_isomers.sh -i smiles/ -o smiles_isomers -m 16 -n -b
./generate_tautomers.sh -i smiles_isomers/ -o smiles_all -m 16 -p -t 30.0
```

See the bash scripts for detailed option descriptions.
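The library-splitting step performed by `utils/split_smiles.py` can be sketched as follows. This is a simplified stand-in, not the actual script: it assumes one SMILES per line, and the `chunk_*.smi` naming is hypothetical.

```python
from pathlib import Path

def split_smiles(input_smi: str, output_dir: str, target_num: int = 1000) -> int:
    """Write chunks of at most `target_num` SMILES lines; return the chunk count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = [l for l in Path(input_smi).read_text().splitlines() if l.strip()]
    chunks = [lines[i:i + target_num] for i in range(0, len(lines), target_num)]
    for idx, chunk in enumerate(chunks, start=1):
        # hypothetical chunk naming; the real script may name files differently
        (out / f"chunk_{idx:04d}.smi").write_text("\n".join(chunk) + "\n")
    return len(chunks)
```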
```bash
./generate_fingerprints.sh -i smiles_all/ -o smiles_all/
```

Generate an autobox grid:
```bash
python utils/build_autobox.py --input box.sdf \
    --output auto_box.txt
```

Run docking (Uni‑Dock example):
```bash
python phase_1/run_docking.py \
    --engine unidock \
    --receptor 3me3_protein.pdbqt \
    --ligand-index pkm2_train/train_pdbqt/ \
    --config pkm2_box.txt \
    --smiles-file pkm2_train/train_smiles.smi \
    --output-dir pkm2_train/train_docking \
    --results-csv scores.csv \
    --active \
    --extra-args --max_gpu_memory 2
```

Run docking (Vina example):
```bash
python phase_1/run_docking.py \
    --engine vina \
    --receptor 3me3_protein.pdbqt \
    --ligands-dir pkm2_train/train_pdbqt/ \
    --config pkm2_box.txt \
    --smiles-file pkm2_train/train_smiles.smi \
    --output-dir pkm2_train/train_docking \
    --results-csv scores.csv \
    --active
```

For a detailed description of Uni‑Dock options, see https://github.com/dptech-corp/Uni-Dock
Substitute `--engine` with `vina-gpu`, `smina`, `qvina`, or `qvina-w` as needed.
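Conceptually, a wrapper like `run_docking.py` dispatches each ligand to the chosen engine's binary. The sketch below is a hedged illustration only: the binary names in `ENGINE_BINARIES` and the flags are assumptions, and each engine's real CLI differs (e.g. Uni‑Dock takes batched ligand indexes rather than single ligands).

```python
from __future__ import annotations

# Assumed binary names for illustration; adjust for your install.
ENGINE_BINARIES = {
    "vina": "vina",
    "smina": "smina",
    "qvina2.1": "qvina2.1",
    "qvina-w": "qvina-w",
    "unidock": "unidock",
}

def build_command(engine: str, receptor: str, ligand: str, box_config: str) -> list[str]:
    """Return an argv list suitable for subprocess.run(...)."""
    if engine not in ENGINE_BINARIES:
        raise ValueError(f"unsupported engine: {engine}")
    return [
        ENGINE_BINARIES[engine],
        "--receptor", receptor,
        "--ligand", ligand,
        "--config", box_config,
    ]
```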
```bash
python phase_2/train_ml.py \
    --input-csv pkm2_train/train_docking/scores.csv \
    --input-fp pkm2_train/train_fingerprints.txt \
    --output-dir pkm2_train/train_ml \
    --save-predictions \
    --model rf
```

```bash
python phase_2/inference.py \
    --model pkm2_train/train_ml/ml_models_rf/rf_model.joblib \
    --scaler pkm2_train/train_ml/ml_models_rf/scaler.joblib \
    --input-smi-dir pkm2_split/ \
    --input-fp-dir pkm2_fingerprints/ \
    --output-dir pkm2_train/train_inference \
    --cycle 2 \
    --k 2.0 \
    --top-compounds 1000 \
    --train-smi pkm2_train/train_smiles.smi \
    --train-fp pkm2_train/train_fingerprints.txt \
    --active \
    --acquisition greedy
```

When `--active` is set, MolbitAL writes the next‑cycle SMILES and fingerprints automatically.
| Name | Formula | Intuition |
|---|---|---|
| Greedy | ŷ(x) | Exploit top‑ranked predictions |
| LCB | ŷ(x) − k·σ(x) | Balance exploitation and exploration via weight k |
| UNC | σ(x) | Pure exploration: sample the most uncertain compounds |
`--k` sets the uncertainty weight k (default: 2.0).
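The three rules in the table can be sketched directly. This assumes the docking convention that lower (more negative) scores are better, so greedy and LCB pick the lowest scores while UNC picks the largest σ; MolbitAL's own sign conventions may differ.

```python
import numpy as np

def acquisition_scores(mean, std, strategy="greedy", k=2.0):
    """Score each compound under one of the three acquisition rules."""
    if strategy == "greedy":
        return mean                  # exploit predicted affinity
    if strategy == "lcb":
        return mean - k * std        # optimistic lower confidence bound
    if strategy == "unc":
        return std                   # pure exploration
    raise ValueError(f"unknown acquisition: {strategy}")

def select_top(mean, std, n, strategy="greedy", k=2.0):
    """Return the indices of the n compounds to dock in the next cycle."""
    scores = acquisition_scores(np.asarray(mean), np.asarray(std), strategy, k)
    order = np.argsort(scores)       # ascending: best (lowest) first
    if strategy == "unc":
        order = order[::-1]          # most uncertain first
    return order[:n]
```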
MolbitAL is benchmarked on two datasets that mimic sparse-hit scenarios:
- PKM2 – pyruvate kinase muscle isoform 2 (546 actives, 244,679 inactives; 0.2% active)
- ALDH1 – aldehyde dehydrogenase 1 (5,363 actives, 101,874 inactives; 5% active)
Both subsets are distilled from the LIT‑PCBA open benchmark; see https://drugdesign.unistra.fr/LIT-PCBA/
MolbitAL is released under the MIT License (see LICENSE).

