ilkwonch/MolbitAL

MolbitAL - MOdular Ligand screening for Billion‑scale ITerative selection with Active Learning

MolbitAL is a modular active‑learning engine for ultra‑large ligand virtual screening. It coordinates end‑to‑end workflows, from ligand preparation and molecular docking to machine learning and acquisition, and scales to billion‑compound libraries on a local machine or an HPC cluster.
The framework plugs into AutoDock Vina and its variants (AutoDock‑Vina, Vina‑GPU 2.1, QVina 2.1, QVina‑W, smina, Uni‑Dock) and currently supports Random Forest (RF) and XGBoost‑Distribution (XGB‑D) surrogate models for fast affinity ranking.
(aimnet-x2d will be integrated soon.)


Features

  • High scalability – splits SMILES collections into tractable samples and streams them through GPU‑accelerated docking engines.
  • Modular architecture – switch docking engines, ML models, or acquisition functions via a single JSON config.
  • Stateful cycles – resume active‑learning cycles seamlessly (checkpointed by project_name).
  • Isomer/Tautomer enumeration – integrates the OpenEye Toolkits.

Requirements

  • Python ≥ 3.10
  • CUDA 12+ (optional, for Vina‑GPU 2.1 and Uni‑Dock)
  • OpenEye Toolkits (optional, for isomer/tautomer enumeration)

Installation

# 1. Clone the repository
git clone https://github.com/isayeblab/molbital.git
cd molbital

# 2. Create the conda environment
mamba env create --file environment.yml
mamba activate molbital

# 3. Install MolbitAL and docking wrappers
python setup.py install

Note: For Vina‑GPU 2.1, you need to set up OpenCL, Boost, and the CUDA toolkit manually. For details, see https://github.com/DeltaGroupNJUPT/Vina-GPU-2.1


Quick Start

When your SMILES directories, fingerprints, and binding pocket grid are ready, launch an entire active‑learning loop with one line:

python run_al.py --config run_al.json

Key fields in run_al.json

| Field | Description |
| --- | --- |
| project_name | Folder where every cycle’s outputs are stored |
| smiles_dir | Directory containing SMILES chunks (*.smi) |
| fingerprint_dir | Matching binary fingerprints (*.txt) |
| receptor | Prepared protein in PDBQT format |
| config | Docking box file generated by build_autobox.py |
| docking_engine | vina, vina-gpu2.1, qvina2.1, qvina-w, smina, or unidock |
| ml_model | rf or xgbd |
| acquisition | greedy, lcb, or unc |
| start_cycle / end_cycle | Range of AL iterations |
| max_gpu_memory | (Uni‑Dock) Cap on GPU memory per batch |
| workers | Number of CPU workers for I/O‑intensive steps |
| script_dir | Root directory that contains the phase_* and utils folders |

If start_cycle > 1, MolbitAL automatically resumes from the previous checkpoint.
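
A minimal sketch of assembling such a configuration from Python, assuming the field names listed above are top-level JSON keys. Only the key names come from this README; every value is a hypothetical placeholder (file names are borrowed from the manual workflow examples where possible), the value types are guesses, and the real run_al.json may expect additional fields.

```python
import json

# Hypothetical values: only the key names come from the field list above.
config = {
    "project_name": "pkm2_al",          # placeholder project folder
    "smiles_dir": "smiles_all/",
    "fingerprint_dir": "smiles_all/",
    "receptor": "3me3_protein.pdbqt",
    "config": "auto_box.txt",
    "docking_engine": "unidock",
    "ml_model": "rf",
    "acquisition": "greedy",
    "start_cycle": 1,                   # assumed to be integers
    "end_cycle": 5,
    "max_gpu_memory": 2,                # unit is not stated in this README
    "workers": 8,
    "script_dir": ".",
}


def write_config(path="run_al.json"):
    """Serialize the configuration to a JSON file and return its path."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
    return path


if __name__ == "__main__":
    print("wrote", write_config())
```

Generating the file from a script makes it easier to sweep one field (for example acquisition) across several projects while keeping the rest fixed.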


Manual Workflow

The steps below run each phase of the workflow manually.

Protein Preparation

python prepare_receptor4.py -r 3me3_protein.pdb \
                             -A checkhydrogens \
                             -o 3me3_protein.pdbqt

Ligand Preparation

Split the library into N ligand subsamples (adjust --target_num):

python utils/split_smiles.py --input total.smi \
                             --output smiles/ \
                             --target_num 1000

Optional OpenEye enumeration:

./generate_isomers.sh  -i smiles/ -o smiles_isomers  -m 16 -n -b
./generate_tautomers.sh -i smiles_isomers/ -o smiles_all -m 16 -p -t 30.0

See the bash scripts for details on each option.

Fingerprint Generation

./generate_fingerprints.sh -i smiles_all/ -o smiles_all/

Docking

Generate an autobox grid:

python utils/build_autobox.py --input box.sdf \
                              --output auto_box.txt
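
As a rough illustration of what an autobox step computes, the sketch below derives a box center and size from reference-ligand atom coordinates and prints them with the Vina-style configuration keys (center_x ... size_z). This is not the code in utils/build_autobox.py: the function names, the 4 Å padding, the hard-coded coordinates, and the assumption that the box file uses Vina-style keys are all illustrative.

```python
def autobox(coords, padding=4.0):
    """Return (center, size) of an axis-aligned box around coords.

    coords  : iterable of (x, y, z) atom positions in angstroms
    padding : extra margin added on every side (assumed value)
    """
    xs, ys, zs = zip(*coords)
    center, size = [], []
    for axis in (xs, ys, zs):
        lo, hi = min(axis), max(axis)
        center.append((lo + hi) / 2.0)
        size.append((hi - lo) + 2.0 * padding)
    return tuple(center), tuple(size)


def to_vina_config(center, size):
    """Format the box using Vina-style config keys."""
    lines = [f"center_{a} = {c:.3f}" for a, c in zip("xyz", center)]
    lines += [f"size_{a} = {s:.3f}" for a, s in zip("xyz", size)]
    return "\n".join(lines)


# Hypothetical ligand heavy-atom coordinates.
ligand = [(10.0, 2.0, -3.0), (14.0, 6.0, 1.0), (12.0, 4.0, -1.0)]
center, size = autobox(ligand)
print(to_vina_config(center, size))
```

A larger padding gives the docking engine more room to place bulky ligands at the cost of a bigger search space.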

Run docking (Uni‑Dock example):

python phase_1/run_docking.py \
       --engine unidock \
       --receptor 3me3_protein.pdbqt \
       --ligand-index pkm2_train/train_pdbqt/ \
       --config pkm2_box.txt \
       --smiles-file pkm2_train/train_smiles.smi \
       --output-dir pkm2_train/train_docking \
       --results-csv scores.csv \
       --active \
       --extra-args --max_gpu_memory 2

Run docking (Vina example):

python phase_1/run_docking.py \
       --engine vina \
       --receptor 3me3_protein.pdbqt \
       --ligands-dir pkm2_train/train_pdbqt/ \
       --config pkm2_box.txt \
       --smiles-file pkm2_train/train_smiles.smi \
       --output-dir pkm2_train/train_docking \
       --results-csv scores.csv \
       --active 

Substitute --engine with vina-gpu, smina, qvina, or qvina-w as needed. For a detailed description of the Uni‑Dock options, see https://github.com/dptech-corp/Uni-Dock

Machine Learning

python phase_2/train_ml.py \
       --input-csv pkm2_train/train_docking/scores.csv \
       --input-fp  pkm2_train/train_fingerprints.txt \
       --output-dir pkm2_train/train_ml \
       --save-predictions \
       --model rf

Inference & Acquisition

python phase_2/inference.py \
       --model pkm2_train/train_ml/ml_models_rf/rf_model.joblib \
       --scaler pkm2_train/train_ml/ml_models_rf/scaler.joblib \
       --input-smi-dir pkm2_split/ \
       --input-fp-dir  pkm2_fingerprints/ \
       --output-dir pkm2_train/train_inference \
       --cycle 2 \
       --k 2.0 \
       --top-compounds 1000 \
       --train-smi pkm2_train/train_smiles.smi \
       --train-fp  pkm2_train/train_fingerprints.txt \
       --active \
       --acquisition greedy

When --active is set, MolbitAL writes the next‑cycle SMILES and fingerprints automatically.


Acquisition Strategies

| Name | Formula | Intuition |
| --- | --- | --- |
| Greedy | ŷ(x) | Exploit top‑ranked predictions |
| LCB | ŷ(x) - k·σ(x) | Balance exploitation and exploration via the weight k |
| UNC | σ(x) | Pure exploration: sample the most uncertain compounds |

--k sets the weight k applied to the uncertainty term σ(x) (default 2).
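
The three formulas can be sketched in a few lines of plain Python. This is not the code in phase_2/inference.py: the function names and toy numbers are hypothetical, and the sketch assumes that a lower (more negative) predicted docking score is better, so Greedy and LCB pick the smallest values while UNC picks the largest σ(x).

```python
def acquisition_scores(mean, std, strategy="greedy", k=2.0):
    """Return one score per compound; a lower score means higher priority."""
    if strategy == "greedy":
        return list(mean)
    if strategy == "lcb":
        return [m - k * s for m, s in zip(mean, std)]
    if strategy == "unc":
        # Negate so the most uncertain compound sorts first.
        return [-s for s in std]
    raise ValueError(f"unknown strategy: {strategy}")


def select(mean, std, n, strategy="greedy", k=2.0):
    """Indices of the n compounds to dock in the next cycle."""
    scores = acquisition_scores(mean, std, strategy, k)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order[:n]


# Hypothetical predicted docking scores and model uncertainties.
y_hat = [-9.0, -8.5, -7.0, -6.0]
sigma = [0.1, 0.6, 0.2, 1.5]

for name in ("greedy", "lcb", "unc"):
    print(name, select(y_hat, sigma, 2, name))
```

With these toy numbers, LCB promotes compound 1 over compound 0 because its larger uncertainty outweighs its slightly worse predicted score, which is the exploration effect that k controls.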


Datasets

MolbitAL is tested on two benchmarks that mimic sparse‑hit scenarios:

  • PKM2 – Pyruvate kinase muscle isoform 2 (Active: 546, Inactive: 244679, 0.2% active)
  • ALDH1 – Aldehyde dehydrogenase 1 (Active: 5363, Inactive: 101874, 5% active)

Both subsets are drawn from the LIT‑PCBA open benchmark; see https://drugdesign.unistra.fr/LIT-PCBA/
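
The quoted active fractions follow directly from the counts listed above; a short check, using only those counts (the dictionary layout and function name are illustrative):

```python
# Counts as listed for the two benchmarks.
datasets = {
    "PKM2": {"active": 546, "inactive": 244679},
    "ALDH1": {"active": 5363, "inactive": 101874},
}


def active_fraction(active, inactive):
    """Fraction of compounds labeled active."""
    return active / (active + inactive)


for name, counts in datasets.items():
    print(f"{name}: {100 * active_fraction(**counts):.2f}% active")
```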


License

MolbitAL is released under the MIT License (see LICENSE).

