A comprehensive pipeline for curating the ChEMBL database and filtering protein-ligand complexes based on activity data, compound properties, and structural quality.
ChEMBL Curator provides a two-stage pipeline:
- Compound Filtering Stage: extracts and filters ligands from the ChEMBL database based on:
  - Activity data (Ki, Kd, IC50, EC50)
  - Activity thresholds and relations
  - Compound validity (SMILES, molecular properties)
  - Target types (single proteins)
- Protein Filtering Stage: filters targets based on structural data:
  - PDB structure availability
  - AlphaFold model availability
  - Ligand-bound structures
  - Single binding site requirement
- Flexible Configuration: JSON-based configuration or command-line options
- Robust Filtering: Multiple validity checks for activity data and compounds
- Structure-Based Filtering: Automated PDB download, alignment, and pocket analysis
- Parallel Processing: Support for multi-core processing
- Comprehensive Logging: Detailed logging at multiple levels
- Error Handling: Continues processing on errors, logs issues
- Python 3.8+
- External tools (for protein filtering):
  - pdb_get: PDB download tool (optional; falls back to RCSB web download)
  - TMalign: structure alignment tool (expected at ./bin/TMalign)
  - wget: used to download AlphaFold models
# Install the package
pip install -e .
# Or install dependencies manually
pip install click requests numpy rdkit pandas
# Check if tools are available
which wget # Required for AlphaFold downloads
ls ./bin/TMalign # Required for structure alignment
# Optional: check if pdb_get is available (faster; web download is used otherwise)
which pdb_get
# If TMalign is in a different location, create symlink
mkdir -p ./bin
ln -s /path/to/TMalign ./bin/TMalign
Note: PDB download automatically uses:
- pdb_get if available (faster; requires a local PDB database)
- RCSB web download as a fallback (works anywhere with internet access)
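The fallback behavior can be sketched in Python. This is an illustrative sketch only: `RCSB_URL`, `choose_download_route`, and `fetch_from_rcsb` are hypothetical names, and the package's actual logic lives in chembl_curator/protein_filter.py.

```python
# Illustrative sketch of the pdb_get / RCSB fallback (hypothetical helper names;
# not the package's actual implementation).
import shutil
import urllib.request
from pathlib import Path

RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"

def choose_download_route() -> str:
    """Pick pdb_get when it is on PATH, otherwise fall back to RCSB web download."""
    return "pdb_get" if shutil.which("pdb_get") else "rcsb-web"

def fetch_from_rcsb(pdb_id: str, out_dir: Path) -> Path:
    """Fallback route: fetch one structure directly from files.rcsb.org."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / f"{pdb_id.lower()}.pdb"
    urllib.request.urlretrieve(RCSB_URL.format(pdb_id=pdb_id), dest)
    return dest

print(choose_download_route())
```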
# Step 1: Compound filtering - creates curated_data_filtered/
echo "Step 1: Running compound filtering..."
chembl-curator curate --download --output curated_data_filtered --log-level INFO
# Wait for completion, then:
# Step 2: Protein filtering - adds PDB structures and filters by binding site
echo "Step 2: Running protein filtering..."
chembl-curator filter-proteins \
--curated-dir curated_data_filtered \
--n-processes 8 \
--log-level INFO
# Step 3: Check results
echo "Step 3: Checking results..."
cat curated_data_filtered/passed_targets.txt
echo "Total passed targets: $(wc -l < curated_data_filtered/passed_targets.txt)"
For testing, you might want to try a small subset first:
# Step 1: Create test directory with just a few targets
mkdir -p test_curated
cp -r curated_data_filtered/P28222 test_curated/
cp -r curated_data_filtered/Q9Y6K9 test_curated/
cp -r curated_data_filtered/P00533 test_curated/
# Step 2: Run protein filtering on test subset
chembl-curator filter-proteins \
--curated-dir test_curated \
--n-processes 1 \
--log-level DEBUG
# Step 3: Check if it worked
ls -la test_curated/P28222/pdb/
cat test_curated/passed_targets.txt
# Basic usage - download and curate with defaults
chembl-curator curate --download --output curated_data_filtered
# Using existing database
chembl-curator curate --database /path/to/chembl_34.db --output curated_data_filtered
# Using custom configuration file
chembl-curator curate --download --config config.json --output curated_data_filtered
# Override specific parameters
chembl-curator curate --download --output curated_data_filtered \
--activity-types Ki Kd IC50 \
--relations "=" "<=" "<" \
--units nM
# Create example config file
chembl-curator curate --create-config config.json
# With debug logging
chembl-curator curate --download --output curated_data_filtered --log-level DEBUG
from chembl_curator import ChEMBLCurator, CurationConfig
from pathlib import Path
# Option 1: Using default configuration
curator = ChEMBLCurator(log_level='INFO')
results = curator.run_pipeline(
database_path=None, # Will download ChEMBL
output_dir=Path('curated_data_filtered')
)
print(f"Total compounds: {results.total_compounds}")
print(f"Total proteins: {results.total_proteins}")
# Option 2: Using custom configuration
config = CurationConfig(
activity_thresholds={'nM': 1000.0, 'uM': 1.0},
activity_types=['Ki', 'Kd', 'IC50'],
units=['nM'],
min_pchembl_value=6.0,
min_confidence_score=8
)
curator = ChEMBLCurator(config=config, log_level='INFO')
results = curator.run_pipeline(
database_path=Path('chembl_34.db'),
output_dir=Path('curated_data_filtered')
)
# Sequential processing (1 core)
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 1
# Parallel processing (8 cores, recommended)
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 8
# With debug logging
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 8 --log-level DEBUG
from chembl_curator import ProteinFilter
from pathlib import Path
# Create protein filter
pf = ProteinFilter(
curated_dir=Path('curated_data_filtered'),
log_level='INFO'
)
# Run pipeline with parallel processing
passed_targets = pf.run_pipeline(n_processes=8)
print(f"Passed {len(passed_targets)} targets with single binding sites")
# Process single target
result = pf.process_target('P28222')
print(f"Target P28222 passed: {result}")
python example_scripts/run_protein_filter.py \
--curated-dir curated_data_filtered \
--n-processes 8 \
--log-level INFO
The default configuration filters for:
Activity Types: Ki, Kd, IC50, EC50
Relations: =, <=
Units: nM (nanomolar), uM (micromolar)
Activity Thresholds:
- ≤ 10000 nM (10 µM) for nM units
- ≤ 10 µM for µM units
Target Types: SINGLE PROTEIN only
Compound Filters:
- Heavy atoms: 5-80
- Valid SMILES required
Validity Filters:
- Standard flag: not required (false)
- Exclude invalid data: true
- Exclude duplicates: true
- Minimum confidence score: ≥ 6
- Minimum pChEMBL value: ≥ 5.0
Assay Types: B (binding assays only)
BAO Formats: BAO_0000357 (single protein format)
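The activity threshold and the pChEMBL floor express the same cutoff on different scales: pChEMBL is the negative log10 of the activity in mol/L, so the default 10000 nM (10 µM) cap corresponds exactly to pChEMBL ≥ 5.0. A quick check (`pchembl_from_nM` is a hypothetical helper, not part of the package):

```python
# pChEMBL = -log10(activity in mol/L); hypothetical conversion helper.
import math

def pchembl_from_nM(value_nm: float) -> float:
    """Convert an activity value in nM to its pChEMBL value."""
    return -math.log10(value_nm * 1e-9)

print(pchembl_from_nM(10000))  # 10 uM cutoff -> approximately 5.0
print(pchembl_from_nM(100))    # 100 nM -> approximately 7.0
```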
Create a JSON configuration file:
{
"activity_thresholds": {
"nM": 1000.0,
"uM": 1.0
},
"min_heavy_atoms": 5,
"max_heavy_atoms": 50,
"target_types": ["SINGLE PROTEIN"],
"activity_types": ["Ki", "Kd", "IC50"],
"relations": ["=", "<="],
"units": ["nM", "uM"],
"require_standard_flag": false,
"exclude_invalid_data": true,
"exclude_duplicates": true,
"min_confidence_score": 8,
"assay_types": ["B", "F"],
"bao_formats": ["BAO_0000357"],
"min_pchembl_value": 6.0
}
Use with:
chembl-curator curate --config config.json --download --output curated_data_filtered
curated_data_filtered/
├── P28222/ # UniProt ID
│ └── comps/
│ └── smiles/
│ ├── CHEMBL31217.smi
│ ├── CHEMBL308727.smi
│ └── ... # One .smi file per compound
├── Q9Y6K9/
│ └── comps/
│ └── smiles/
│ └── ...
└── ... # One directory per target
curated_data_filtered/
├── P28222/ # Target that passed all filters
│ ├── comps/
│ │ └── smiles/
│ │ └── *.smi
│ ├── pdb/ # Downloaded structures
│ │ ├── AF-P28222.pdb # AlphaFold model
│ │ ├── 1atp.pdb # Experimental PDB structures
│ │ ├── 2atp.pdb
│ │ └── ...
│ ├── aligned/ # Structures aligned to AF model (per chain)
│ │ ├── 1atp_A.pdb
│ │ ├── 1atp_B.pdb
│ │ ├── 2atp_A.pdb
│ │ └── ...
│ ├── pdbid.list # PDB information
│ ├── ligand_bound_pdbs.txt # Ligand-bound structures
│ └── pocket_info.csv # Pocket clustering data
├── Q9Y6K9/ # Another target that passed
│ └── ...
├── passed_targets.txt # List of targets that passed protein filtering
└── ...
pdbid.list - PDB structure information:
# PDBID method resolution chains
1ATP X-RAY DIFFRACTION 2.0 A A/B=1-250
2ATP X-RAY DIFFRACTION 1.8 A A=1-250
ligand_bound_pdbs.txt - Structures with biological ligands:
1ATP Chain A: ATP
1ATP Chain B: ATP
2ATP Chain A: ADP,MG
pocket_info.csv - Ligand pocket coordinates (from aligned structures):
PDB_ID,Chain,Aligned_File,Ligand_Name,Center_X,Center_Y,Center_Z
1ATP,A,1atp_A.pdb,ATP,12.345,23.456,34.567
1ATP,B,1atp_B.pdb,ATP,12.123,23.234,34.345
2ATP,A,2atp_A.pdb,ADP,12.100,23.200,34.300
passed_targets.txt - Final filtered targets:
P28222
Q9Y6K9
P00533
The protein filtering stage consists of 7 steps:
1. Fetch PDB information
   - Queries the UniProt REST API
   - Retrieves PDB ID, method, resolution, chain information
2. Download PDB structures
   - Tries pdb_get first (if available; faster with a local database)
   - Falls back to RCSB web download (https://files.rcsb.org)
   - Creates the pdb/ directory
3. Download the AlphaFold model
   - Downloads from https://alphafold.ebi.ac.uk/files/AF-{uniid}-F1-model_v6.pdb
   - Used as the reference for alignment
4. Identify ligand-bound structures
   - Parses HETATM records
   - Excludes non-biological ligands (ions, buffers, crystallization aids, water)
   - Checks ligand-protein contact (≤ 4 Å)
   - Identifies single-ligand or clustered structures (≤ 10 Å between ligand centers)
5. Align structures
   - Aligns to the AlphaFold model using TMalign
   - Saves only the chain of interest
6. Analyze pockets
   - Re-extracts ligand centers from the aligned structures
   - Verifies all ligands are in the same pocket (≤ 10 Å)
7. Save results
   - Keeps only single-binding-site targets
   - Writes the final list to passed_targets.txt
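The ≤ 10 Å single-pocket criterion above can be sketched as a pairwise distance check. This is a simplified stand-in for the package's clustering logic; `all_in_one_pocket` is a hypothetical name, and the coordinates are illustrative.

```python
# Sketch of the single-pocket check: every pair of ligand centers must lie
# within the 10 angstrom cutoff for the target to count as single-site.
import math

def all_in_one_pocket(centers, cutoff=10.0):
    """Return True if every pair of ligand centers is within `cutoff` angstroms."""
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) > cutoff:
                return False
    return True

# Two ligand centers less than 1 angstrom apart: one shared pocket.
centers = [(12.345, 23.456, 34.567), (12.123, 23.234, 34.345)]
print(all_in_one_pocket(centers))  # True
```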
Non-biological ligands that are filtered out:
- Water: HOH, H2O, WAT
- Ions: CA, MG, ZN, NA, CL, FE, etc.
- Buffers: MES, HEPES, TRIS, etc.
- Sugars: NAG, MAN, GLC, GAL, etc.
- Crystallization aids: PEG, GOL, EDO, SO4, etc.
See example_scripts/exclusion.py or chembl_curator/protein_filter.py for complete list.
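A minimal sketch of that exclusion check (the residue names below are only the subset listed above, not the package's complete list; `biological_ligands` is a hypothetical helper):

```python
# Filter HETATM residue names against a (partial, illustrative) exclusion set.
EXCLUDED = {
    "HOH", "H2O", "WAT",           # water
    "CA", "MG", "ZN", "NA", "CL",  # common ions
    "MES", "HEPES", "TRIS",        # buffers
    "NAG", "MAN", "GLC", "GAL",    # sugars
    "PEG", "GOL", "EDO", "SO4",    # crystallization aids
}

def biological_ligands(het_residues):
    """Keep only HETATM residue names not on the exclusion list."""
    return [name for name in het_residues if name.upper() not in EXCLUDED]

print(biological_ligands(["ATP", "HOH", "MG", "ADP"]))  # ['ATP', 'ADP']
```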
from chembl_curator import ChEMBLCurator, ProteinFilter
from pathlib import Path
# Stage 1: Compound filtering
curator = ChEMBLCurator(log_level='INFO')
results = curator.run_pipeline(output_dir=Path('curated_data'))
# Stage 2: Protein filtering
pf = ProteinFilter(curated_dir=Path('curated_data'), log_level='INFO')
passed = pf.run_pipeline(n_processes=8)
print(f"Final database: {len(passed)} targets")
from chembl_curator import ChEMBLCurator, CurationConfig
from pathlib import Path
# More stringent activity threshold
config = CurationConfig(
activity_thresholds={'nM': 100.0, 'uM': 0.1}, # 100 nM instead of 10 µM
activity_types=['Ki', 'Kd'],
units=['nM', 'uM'],
min_pchembl_value=7.0, # Higher quality data
min_confidence_score=8, # Higher confidence
max_heavy_atoms=30, # Smaller molecules
assay_types=['B'] # Binding assays only
)
curator = ChEMBLCurator(config=config, log_level='INFO')
results = curator.run_pipeline(output_dir=Path('curated_high_quality'))
from chembl_curator.protein_filter import ProteinFilter
from pathlib import Path
pf = ProteinFilter(Path('curated_data_filtered'))
# Process specific targets
targets = ['P28222', 'Q9Y6K9', 'P00533']
passed = []
for target in targets:
try:
if pf.process_target(target):
passed.append(target)
print(f"✓ {target} passed")
else:
print(f"✗ {target} filtered out")
except Exception as e:
print(f"✗ {target} error: {e}")
print(f"\nPassed: {passed}")
from pathlib import Path
import pandas as pd
# Read passed targets
with open('curated_data_filtered/passed_targets.txt') as f:
targets = [line.strip() for line in f]
print(f"Total passed targets: {len(targets)}")
# Inspect a specific target
target = targets[0]
target_dir = Path('curated_data_filtered') / target
# Count compounds
compounds = list((target_dir / 'comps' / 'smiles').glob('*.smi'))
print(f"\n{target}: {len(compounds)} compounds")
# Read PDB list
with open(target_dir / 'pdbid.list') as f:
pdbs = [line for line in f if not line.startswith('#')]
print(f"PDB structures: {len(pdbs)}")
# Read pocket info
pocket_df = pd.read_csv(target_dir / 'pocket_info.csv')
print(f"\nPocket info:")
print(pocket_df)
Compound filtering:
- Time: 10-30 minutes (depending on database size)
- Memory: ~2-4 GB
- Output: ~1000-5000 targets (depends on filters)
Protein filtering:
- Per target: 30-60 seconds
- Total (4000 targets, 8 cores): 3-6 hours
- Memory: 100-500 MB per process
- Recommended: 8-16 processes for optimal performance
Disk usage:
- ChEMBL database: ~3 GB
- Compound filtering output: ~10-50 MB
- Protein filtering output: ~500 MB - 5 GB (depends on number of PDBs)
1. PDB download issues
# pdb_get is now optional - the pipeline automatically falls back to RCSB web download
# If you want faster downloads and have access to a local PDB database:
which pdb_get
# If pdb_get is not found but you have it installed:
export PATH=/path/to/pdb_tools:$PATH
# Otherwise, web download from RCSB is used automatically (requires internet)
2. TMalign not found
# Create bin directory and symlink to TMalign
mkdir -p ./bin
ln -s /path/to/TMalign ./bin/TMalign
# Or update the path in protein_filter.py (line 460)
3. Memory issues with parallel processing
# Reduce number of processes
chembl-curator filter-proteins --curated-dir curated_data_filtered --n-processes 4
4. Download failures
- Check internet connection
- Some PDBs may be obsolete (pipeline continues)
- Some targets may not have AlphaFold models (filtered out)
5. Import errors
# Reinstall package
pip install -e .
# Or check Python path
export PYTHONPATH=/path/to/ChEMBL-Q:$PYTHONPATH
For detailed error messages:
# Compound filtering
chembl-curator curate --download --output test_output --log-level DEBUG
# Protein filtering
chembl-curator filter-proteins --curated-dir test_output --n-processes 1 --log-level DEBUG
ChEMBL-Q/
├── chembl_curator/ # Main package
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration classes
│ ├── curator.py # Compound filtering
│ ├── downloader.py # ChEMBL downloader
│ ├── filters.py # Activity and compound filters
│ ├── protein_filter.py # Protein filtering pipeline
│ └── utils.py # Utility functions
├── example_scripts/ # Example usage scripts
│ ├── getpdb.py
│ ├── exclusion.py
│ ├── aligntm.py
│ └── run_protein_filter.py
├── pyproject.toml # Package configuration
├── README.md # This file
├── PROTEIN_FILTERING.md # Detailed protein filtering docs
└── .gitignore
- README.md (this file): Overview and quick start guide
- PROTEIN_FILTERING.md: Detailed protein filtering pipeline documentation
- example_scripts/: Example usage scripts and reference implementations
- ChEMBL
- AlphaFold DB
- TMalign
This project is provided as-is for research purposes.