A comprehensive pipeline for curating the ChEMBL database and filtering protein-ligand complexes based on activity data, compound properties, and structural quality.
ChEMBL Curator provides a two-stage pipeline:
- Compound Filtering Stage: extracts and filters ligands from the ChEMBL database based on:
  - Activity data (Ki, Kd, IC50, EC50)
  - Activity thresholds and relations
  - Compound validity (SMILES, molecular properties)
  - Target types (single proteins)
- Protein Filtering Stage: filters targets based on structural data:
  - PDB structure availability
  - AlphaFold model availability
  - Ligand-bound structures
  - Single binding site requirement
- Flexible Configuration: JSON-based configuration or command-line options
- Robust Filtering: Multiple validity checks for activity data and compounds
- Structure-Based Filtering: Automated PDB download, alignment, and pocket analysis
- Parallel Processing: Support for multi-core processing
- Comprehensive Logging: Detailed logging at multiple levels
- Error Handling: Continues processing on errors, logs issues
- Python 3.8+
- External tools (for protein filtering):
  - pdb_get: PDB download tool (optional; falls back to RCSB web download)
  - TMalign: structure alignment tool (expected at ./bin/TMalign)
  - wget: used to download AlphaFold models
# Install the package
pip install -e .
# Or install dependencies manually
pip install click requests numpy rdkit pandas
# Check if tools are available
which wget # Required for AlphaFold downloads
ls ./bin/TMalign # Required for structure alignment
# Optional: check if pdb_get is available (faster; web download is used otherwise)
which pdb_get
# If TMalign is in a different location, create symlink
mkdir -p ./bin
ln -s /path/to/TMalign ./bin/TMalign
Note: PDB download automatically uses:
- pdb_get if available (faster; requires a local PDB database)
- RCSB web download as a fallback (works anywhere with internet access)
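The fallback behavior can be sketched in Python. This is an illustrative sketch only: `RCSB_URL`, `choose_download_route`, and `fetch_from_rcsb` are hypothetical names, and the package's actual logic lives in chembl_curator/protein_filter.py.

```python
# Illustrative sketch of the pdb_get / RCSB fallback (hypothetical helper names;
# not the package's actual implementation).
import shutil
import urllib.request
from pathlib import Path

RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"

def choose_download_route() -> str:
    """Pick pdb_get when it is on PATH, otherwise fall back to RCSB web download."""
    return "pdb_get" if shutil.which("pdb_get") else "rcsb-web"

def fetch_from_rcsb(pdb_id: str, out_dir: Path) -> Path:
    """Fallback route: fetch one structure directly from files.rcsb.org."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / f"{pdb_id.lower()}.pdb"
    urllib.request.urlretrieve(RCSB_URL.format(pdb_id=pdb_id), dest)
    return dest

print(choose_download_route())
```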
# Step 1: Compound filtering - creates curated_data_filtered/
echo "Step 1: Running compound filtering..."
chembl-curator curate --download --output curated_data_filtered --log-level INFO
# Wait for completion, then:
# Step 2: Protein filtering - adds PDB structures and filters by binding site
echo "Step 2: Running protein filtering..."
chembl-curator filter-proteins \
--curated-dir curated_data_filtered \
--n-processes 8 \
--log-level INFO
# Step 3: Check results
echo "Step 3: Checking results..."
cat curated_data_filtered/passed_targets.txt
echo "Total passed targets: $(wc -l < curated_data_filtered/passed_targets.txt)"
For testing, you might want to try a small subset first:
# Step 1: Create test directory with just a few targets
mkdir -p test_curated
cp -r curated_data_filtered/P28222 test_curated/
cp -r curated_data_filtered/Q9Y6K9 test_curated/
cp -r curated_data_filtered/P00533 test_curated/
# Step 2: Run protein filtering on test subset
chembl-curator filter-proteins \
--curated-dir test_curated \
--n-processes 1 \
--log-level DEBUG
# Step 3: Check if it worked
ls -la test_curated/P28222/pdb/
cat test_curated/passed_targets.txt
# Basic usage - download and curate with defaults
chembl-curator curate --download --output curated_data_filtered
# Using existing database
chembl-curator curate --database /path/to/chembl_34.db --output curated_data_filtered
# Using custom configuration file
chembl-curator curate --download --config config.json --output curated_data_filtered
# Override specific parameters
chembl-curator curate --download --output curated_data_filtered \
--activity-types Ki Kd IC50 \
--relations "=" "<=" "<" \
--units nM
# Create example config file
chembl-curator curate --create-config config.json
# With debug logging
chembl-curator curate --download --output curated_data_filtered --log-level DEBUG
from chembl_curator import ChEMBLCurator, CurationConfig
from pathlib import Path
# Option 1: Using default configuration
curator = ChEMBLCurator(log_level='INFO')
results = curator.run_pipeline(
database_path=None, # Will download ChEMBL
output_dir=Path('curated_data_filtered')
)
print(f"Total compounds: {results.total_compounds}")
print(f"Total proteins: {results.total_proteins}")
# Option 2: Using custom configuration
config = CurationConfig(
activity_thresholds={'nM': 1000.0, 'uM': 1.0},
activity_types=['Ki', 'Kd', 'IC50'],
units=['nM'],
min_pchembl_value=6.0,
min_confidence_score=8
)
curator = ChEMBLCurator(config=config, log_level='INFO')
results = curator.run_pipeline(
database_path=Path('chembl_34.db'),
output_dir=Path('curated_data_filtered')
)
# Sequential processing (1 core)
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 1
# Parallel processing (8 cores, recommended)
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 8
# With debug logging
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 8 --log-level DEBUG
from chembl_curator import ProteinFilter
from pathlib import Path
# Create protein filter
pf = ProteinFilter(
curated_dir=Path('curated_data_filtered'),
log_level='INFO'
)
# Run pipeline with parallel processing
passed_targets = pf.run_pipeline(n_processes=8)
print(f"Passed {len(passed_targets)} targets with single binding sites")
# Process single target
result = pf.process_target('P28222')
print(f"Target P28222 passed: {result}")
python example_scripts/run_protein_filter.py \
--curated-dir curated_data_filtered \
--n-processes 8 \
--log-level INFO
The default configuration filters for:
Activity Types: Ki, Kd, IC50, EC50
Relations: =, <=
Units: nM (nanomolar), uM (micromolar)
Activity Thresholds:
- ≤ 10000 nM (10 µM) for nM units
- ≤ 10 µM for µM units
Target Types: SINGLE PROTEIN only
Compound Filters:
- Heavy atoms: 5-80
- Valid SMILES required
Validity Filters:
- Standard flag: not required (false)
- Exclude invalid data: true
- Exclude duplicates: true
- Minimum confidence score: ≥ 6
- Minimum pChEMBL value: ≥ 5.0
Assay Types: B (binding assays only)
BAO Formats: BAO_0000357 (single protein format)
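The activity threshold and the pChEMBL floor express the same cutoff on different scales: pChEMBL is the negative log10 of the activity in mol/L, so the default 10000 nM (10 µM) cap corresponds exactly to pChEMBL ≥ 5.0. A quick check (`pchembl_from_nM` is a hypothetical helper, not part of the package):

```python
# pChEMBL = -log10(activity in mol/L); hypothetical conversion helper.
import math

def pchembl_from_nM(value_nm: float) -> float:
    """Convert an activity value in nM to its pChEMBL value."""
    return -math.log10(value_nm * 1e-9)

print(pchembl_from_nM(10000))  # 10 uM cutoff -> approximately 5.0
print(pchembl_from_nM(100))    # 100 nM -> approximately 7.0
```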
Create a JSON configuration file:
{
"activity_thresholds": {
"nM": 1000.0,
"uM": 1.0
},
"min_heavy_atoms": 5,
"max_heavy_atoms": 50,
"target_types": ["SINGLE PROTEIN"],
"activity_types": ["Ki", "Kd", "IC50"],
"relations": ["=", "<="],
"units": ["nM", "uM"],
"require_standard_flag": false,
"exclude_invalid_data": true,
"exclude_duplicates": true,
"min_confidence_score": 8,
"assay_types": ["B", "F"],
"bao_formats": ["BAO_0000357"],
"min_pchembl_value": 6.0
}
Use with:
chembl-curator curate --config config.json --download --output curated_data_filtered
curated_data_filtered/
├── P28222/ # UniProt ID
│ └── comps/
│ └── smiles/
│ ├── CHEMBL31217.smi
│ ├── CHEMBL308727.smi
│ └── ... # One .smi file per compound
├── Q9Y6K9/
│ └── comps/
│ └── smiles/
│ └── ...
└── ... # One directory per target
curated_data_filtered/
├── P28222/ # Target that passed all filters
│ ├── comps/
│ │ └── smiles/
│ │ └── *.smi
│ ├── pdb/ # Downloaded structures
│ │ ├── AF-P28222.pdb # AlphaFold model
│ │ ├── 1atp.pdb # Experimental PDB structures
│ │ ├── 2atp.pdb
│ │ └── ...
│ ├── aligned/ # Structures aligned to AF model (per chain)
│ │ ├── 1atp_A.pdb
│ │ ├── 1atp_B.pdb
│ │ ├── 2atp_A.pdb
│ │ └── ...
│ ├── pdbid.list # PDB information
│ ├── ligand_bound_pdbs.txt # Ligand-bound structures
│ └── pocket_info.csv # Pocket clustering data
├── Q9Y6K9/ # Another target that passed
│ └── ...
├── passed_targets.txt # List of targets that passed protein filtering
└── ...
pdbid.list - PDB structure information:
# PDBID method resolution chains
1ATP X-RAY DIFFRACTION 2.0 A A/B=1-250
2ATP X-RAY DIFFRACTION 1.8 A A=1-250
ligand_bound_pdbs.txt - Structures with biological ligands:
1ATP Chain A: ATP
1ATP Chain B: ATP
2ATP Chain A: ADP,MG
pocket_info.csv - Ligand pocket coordinates (from aligned structures):
PDB_ID,Chain,Aligned_File,Ligand_Name,Center_X,Center_Y,Center_Z
1ATP,A,1atp_A.pdb,ATP,12.345,23.456,34.567
1ATP,B,1atp_B.pdb,ATP,12.123,23.234,34.345
2ATP,A,2atp_A.pdb,ADP,12.100,23.200,34.300
passed_targets.txt - Final filtered targets:
P28222
Q9Y6K9
P00533
The protein filtering stage consists of 7 steps:
1. Fetch PDB information
   - Queries the UniProt REST API
   - Retrieves PDB ID, method, resolution, chain information
2. Download PDB structures
   - Tries pdb_get first (if available; faster with a local database)
   - Falls back to RCSB web download (https://files.rcsb.org)
   - Creates the pdb/ directory
3. Download the AlphaFold model
   - Downloads from https://alphafold.ebi.ac.uk/files/AF-{uniid}-F1-model_v6.pdb
   - Used as the reference for alignment
4. Identify ligand-bound structures
   - Parses HETATM records
   - Excludes non-biological ligands (ions, buffers, crystallization aids, water)
   - Checks ligand-protein contact (≤ 4 Å)
   - Identifies single-ligand or clustered structures (≤ 10 Å between ligand centers)
5. Align structures
   - Aligns to the AlphaFold model using TMalign
   - Saves only the chain of interest
6. Analyze pockets
   - Re-extracts ligand centers from the aligned structures
   - Verifies all ligands are in the same pocket (≤ 10 Å)
7. Save results
   - Keeps only single-binding-site targets
   - Writes the final list to passed_targets.txt
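The ≤ 10 Å single-pocket criterion above can be sketched as a pairwise distance check. This is a simplified stand-in for the package's clustering logic; `all_in_one_pocket` is a hypothetical name, and the coordinates are illustrative.

```python
# Sketch of the single-pocket check: every pair of ligand centers must lie
# within the 10 angstrom cutoff for the target to count as single-site.
import math

def all_in_one_pocket(centers, cutoff=10.0):
    """Return True if every pair of ligand centers is within `cutoff` angstroms."""
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) > cutoff:
                return False
    return True

# Two ligand centers less than 1 angstrom apart: one shared pocket.
centers = [(12.345, 23.456, 34.567), (12.123, 23.234, 34.345)]
print(all_in_one_pocket(centers))  # True
```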
Non-biological ligands that are filtered out:
- Water: HOH, H2O, WAT
- Ions: CA, MG, ZN, NA, CL, FE, etc.
- Buffers: MES, HEPES, TRIS, etc.
- Sugars: NAG, MAN, GLC, GAL, etc.
- Crystallization aids: PEG, GOL, EDO, SO4, etc.
See example_scripts/exclusion.py or chembl_curator/protein_filter.py for complete list.
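A minimal sketch of that exclusion check (the residue names below are only the subset listed above, not the package's complete list; `biological_ligands` is a hypothetical helper):

```python
# Filter HETATM residue names against a (partial, illustrative) exclusion set.
EXCLUDED = {
    "HOH", "H2O", "WAT",           # water
    "CA", "MG", "ZN", "NA", "CL",  # common ions
    "MES", "HEPES", "TRIS",        # buffers
    "NAG", "MAN", "GLC", "GAL",    # sugars
    "PEG", "GOL", "EDO", "SO4",    # crystallization aids
}

def biological_ligands(het_residues):
    """Keep only HETATM residue names not on the exclusion list."""
    return [name for name in het_residues if name.upper() not in EXCLUDED]

print(biological_ligands(["ATP", "HOH", "MG", "ADP"]))  # ['ATP', 'ADP']
```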
from chembl_curator import ChEMBLCurator, ProteinFilter
from pathlib import Path
# Stage 1: Compound filtering
curator = ChEMBLCurator(log_level='INFO')
results = curator.run_pipeline(output_dir=Path('curated_data'))
# Stage 2: Protein filtering
pf = ProteinFilter(curated_dir=Path('curated_data'), log_level='INFO')
passed = pf.run_pipeline(n_processes=8)
print(f"Final database: {len(passed)} targets")
from chembl_curator import ChEMBLCurator, CurationConfig
from pathlib import Path
# More stringent activity threshold
config = CurationConfig(
activity_thresholds={'nM': 100.0, 'uM': 0.1}, # 100 nM instead of 10 µM
activity_types=['Ki', 'Kd'],
units=['nM', 'uM'],
min_pchembl_value=7.0, # Higher quality data
min_confidence_score=8, # Higher confidence
max_heavy_atoms=30, # Smaller molecules
assay_types=['B'] # Binding assays only
)
curator = ChEMBLCurator(config=config, log_level='INFO')
results = curator.run_pipeline(output_dir=Path('curated_high_quality'))
from chembl_curator.protein_filter import ProteinFilter
from pathlib import Path
pf = ProteinFilter(Path('curated_data_filtered'))
# Process specific targets
targets = ['P28222', 'Q9Y6K9', 'P00533']
passed = []
for target in targets:
try:
if pf.process_target(target):
passed.append(target)
print(f"✓ {target} passed")
else:
print(f"✗ {target} filtered out")
except Exception as e:
print(f"✗ {target} error: {e}")
print(f"\nPassed: {passed}")
from pathlib import Path
import pandas as pd
# Read passed targets
with open('curated_data_filtered/passed_targets.txt') as f:
targets = [line.strip() for line in f]
print(f"Total passed targets: {len(targets)}")
# Inspect a specific target
target = targets[0]
target_dir = Path('curated_data_filtered') / target
# Count compounds
compounds = list((target_dir / 'comps' / 'smiles').glob('*.smi'))
print(f"\n{target}: {len(compounds)} compounds")
# Read PDB list
with open(target_dir / 'pdbid.list') as f:
pdbs = [line for line in f if not line.startswith('#')]
print(f"PDB structures: {len(pdbs)}")
# Read pocket info
pocket_df = pd.read_csv(target_dir / 'pocket_info.csv')
print(f"\nPocket info:")
print(pocket_df)
Compound filtering:
- Time: 10-30 minutes (depending on database size)
- Memory: ~2-4 GB
- Output: ~1000-5000 targets (depends on filters)
Protein filtering:
- Per target: 30-60 seconds
- Total (4000 targets, 8 cores): 3-6 hours
- Memory: 100-500 MB per process
- Recommended: 8-16 processes for optimal performance
Disk usage:
- ChEMBL database: ~3 GB
- Compound filtering output: ~10-50 MB
- Protein filtering output: ~500 MB - 5 GB (depends on number of PDBs)
1. PDB download issues
# pdb_get is now optional - the pipeline automatically falls back to RCSB web download
# If you want faster downloads and have access to a local PDB database:
which pdb_get
# If pdb_get is not found but you have it installed:
export PATH=/path/to/pdb_tools:$PATH
# Otherwise, web download from RCSB is used automatically (requires internet)
2. TMalign not found
# Create bin directory and symlink to TMalign
mkdir -p ./bin
ln -s /path/to/TMalign ./bin/TMalign
# Or update the path in protein_filter.py (line 460)
3. Memory issues with parallel processing
# Reduce number of processes
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 4
4. Download failures
- Check internet connection
- Some PDBs may be obsolete (pipeline continues)
- Some targets may not have AlphaFold models (filtered out)
5. Import errors
# Reinstall package
pip install -e .
# Or check Python path
export PYTHONPATH=/path/to/ChEMBL-Q:$PYTHONPATH
For detailed error messages:
# Compound filtering
chembl-curator curate --download --output test_output --log-level DEBUG
# Protein filtering
chembl-curator filter-proteins --curated-dir test_output --n-processes 1 --log-level DEBUG
ChEMBL-Q/
├── chembl_curator/ # Main package
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration classes
│ ├── curator.py # Compound filtering
│ ├── downloader.py # ChEMBL downloader
│ ├── filters.py # Activity and compound filters
│ ├── protein_filter.py # Protein filtering pipeline
│ └── utils.py # Utility functions
├── example_scripts/ # Example usage scripts
│ ├── getpdb.py
│ ├── exclusion.py
│ ├── aligntm.py
│ └── run_protein_filter.py
├── pyproject.toml # Package configuration
├── README.md # This file
├── PROTEIN_FILTERING.md # Detailed protein filtering docs
└── .gitignore
- README.md (this file): Overview and quick start guide
- PROTEIN_FILTERING.md: Detailed protein filtering pipeline documentation
- example_scripts/: Example usage scripts and reference implementations
- ChEMBL
- AlphaFold DB
- TMalign
This project is provided as-is for research purposes.