Skip to content

Bioinformatics toolkit for detecting colibactin (pks island) in metagenomic data using three complementary approaches: read-based mapping, assembly-based validation, and protein homology search

License

Notifications You must be signed in to change notification settings

vanbelkummax/clb-detection-toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

16 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Neonatal Antibiotic Exposure Enriches Genotoxin-Producing Gut Bacteria

Multi-Pipeline Computational Analysis of the Infant Gut Microbiome

License: MIT Python 3.8+ Conda

Max Van Belkum | Vanderbilt University Associated with: Ryan et al. (2025) - Bifidobacteria support optimal infant vaccine responses


๐Ÿ”ฌ Overview

This repository contains a comprehensive computational framework for detecting and quantifying the colibactin (pks) genotoxin biosynthesis gene cluster in infant gut microbiomes. Through three independent pipelines employing orthogonal computational approaches, we demonstrate that neonatal antibiotic exposure dramatically enriches colibactin-producing bacteria in the infant gutโ€”a finding with potential implications for long-term colorectal cancer risk.

Key Discovery

Antibiotic-exposed infants show 7ร— higher burden of genotoxin-producing bacteria 113 samples | 3 independent pipelines | Publication-quality statistics


๐ŸŽฏ Key Findings

Metric Neo-ABX (n=33) No-ABX (n=80) Fold Change P-value
CLB Burden (reads/10M) 16,176 3,504 4.6ร— 0.004
CLB+ Prevalence (>9 genes) 42.4% 21.2% 2.0ร— 0.036
Complete Island (18/18 genes) 33.3% 12.5% 2.7ร— 0.016
High Burden (โ‰ฅ50K reads/10M) 18.2% 2.5% 7.3ร— 0.008

Statistical Methods:

  • Permutation testing (100,000 iterations)
  • Fisher's exact tests with Haldane-Anscombe correction
  • Benjamini-Hochberg FDR correction
  • Cohen's d effect sizes

๐Ÿงฌ Biological Context

The colibactin (pks) island is an 18-gene biosynthetic cluster that produces a genotoxin causing:

  • DNA double-strand breaks
  • Mutational signatures SBS88/ID18 (colorectal cancer-associated)
  • Enhanced bacterial colonization when paired with adhesins

Clinical Significance: Early-life antibiotic exposure may establish a genotoxin-rich microbiome during critical developmental windows, potentially contributing to early-onset colorectal cancer (EOCRC) risk decades later.


๐Ÿ”ง Computational Approach: Three Independent Pipelines

This analysis employs multi-method validation using three orthogonal computational approaches:

Pipeline 1: Nucleotide-Level Read Mapping (bbmap)

Speed-optimized prevalence screening

  • Algorithm: Direct nucleotide alignment with bbmap v39.37
  • Threshold: 95% nucleotide identity
  • Normalization: Reads Per Million (RPM)
  • Runtime: ~2 hours for 113 samples
  • Best for: Fast screening, prevalence estimation

Key Result: 7.08ร— higher CLB burden in Neo-ABX (p=0.0046)

Pipeline 2: Protein-Level Detection (DIAMOND)

High-sensitivity functional annotation

  • Algorithm: DIAMOND blastx with multi-threshold filtering
  • Thresholds: 70%, 75%, 80%, 85%, 90%, 95%, 100% identity
  • Innovation: One-pass search + post-hoc filtering (4ร— faster)
  • Coverage: Breadth calculated as nucleotide fraction
  • Best for: Functional validation, gene-level quantification

Key Result: 4.62ร— higher CLB burden at 95% identity (p=0.00444)

Pipeline 3: Assembly-Based MAG Reconstruction

Genome-coherent co-location proof

  • Assembly: metaSPAdes on 398 metagenomes
  • Binning: MetaBAT2, MaxBin2, SemiBin2 โ†’ DAS_Tool consensus
  • QC: CheckM2 (completeness/contamination), GUNC (chimera detection)
  • Taxonomy: GTDB-Tk R220
  • Best for: Proving pks + virulence genes co-occur on same genome

Key Result: Genome-level validation of pks + adhesin toolkit co-location


๐Ÿ“Š Repository Structure

neonatal-antibiotics-colibactin-microbiome/
โ”œโ”€โ”€ README.md                           # This file
โ”œโ”€โ”€ LICENSE                             # MIT License
โ”‚
โ”œโ”€โ”€ pipelines/                          # Three independent pipelines
โ”‚   โ”œโ”€โ”€ 01_nucleotide_bbmap/           # Pipeline 1: Read-centric (bbmap)
โ”‚   โ”‚   โ”œโ”€โ”€ run_pipeline1.sh
โ”‚   โ”‚   โ”œโ”€โ”€ extract_clb_rpm.py
โ”‚   โ”‚   โ””โ”€โ”€ statistical_tests.py
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ 02_protein_diamond/            # Pipeline 2: Protein-level (DIAMOND)
โ”‚   โ”‚   โ”œโ”€โ”€ multi_threshold_clb_pipeline.py
โ”‚   โ”‚   โ”œโ”€โ”€ compute_statistics.py
โ”‚   โ”‚   โ””โ”€โ”€ generate_figures.R
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€ 03_assembly_mag/               # Pipeline 3: MAG binning
โ”‚       โ”œโ”€โ”€ run_metaspades.sh
โ”‚       โ”œโ”€โ”€ run_binning.sh
โ”‚       โ”œโ”€โ”€ check_clb_synteny.py
โ”‚       โ””โ”€โ”€ quantify_toolkit.py
โ”‚
โ”œโ”€โ”€ analysis/                           # Statistical analysis & visualization
โ”‚   โ”œโ”€โ”€ statistics/
โ”‚   โ”‚   โ”œโ”€โ”€ permutation_tests.py
โ”‚   โ”‚   โ”œโ”€โ”€ fisher_exact_tests.R
โ”‚   โ”‚   โ””โ”€โ”€ fdr_correction.py
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ visualization/
โ”‚   โ”‚   โ”œโ”€โ”€ plot_burden_violin.py
โ”‚   โ”‚   โ”œโ”€โ”€ plot_prevalence_bar.py
โ”‚   โ”‚   โ””โ”€โ”€ plot_ecdf.py
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€ results/                        # Output TSV files
โ”‚       โ”œโ”€โ”€ pipeline1_summary.tsv
โ”‚       โ”œโ”€โ”€ pipeline2_gene_details.tsv
โ”‚       โ””โ”€โ”€ statistical_tests.tsv
โ”‚
โ”œโ”€โ”€ figures/                            # Publication-quality figures
โ”‚   โ”œโ”€โ”€ fig1_burden_violin.pdf
โ”‚   โ”œโ”€โ”€ fig2_prevalence_bar.pdf
โ”‚   โ”œโ”€โ”€ fig3_gene_heatmap.pdf
โ”‚   โ””โ”€โ”€ fig4_forest_plot.pdf
โ”‚
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ metadata/
โ”‚   โ”‚   โ””โ”€โ”€ sample_metadata.tsv       # 113 samples, exposure groups
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€ reference/
โ”‚       โ”œโ”€โ”€ clb_island_proteins.faa   # 18 CLB genes (amino acid)
โ”‚       โ”œโ”€โ”€ clb_island_nucleotide.fna # IHE3034 reference
โ”‚       โ””โ”€โ”€ virulence_toolkit.faa     # Expanded virulence panel
โ”‚
โ”œโ”€โ”€ envs/                               # Conda environments
โ”‚   โ”œโ”€โ”€ pipeline1_bbmap.yml
โ”‚   โ”œโ”€โ”€ pipeline2_diamond.yml
โ”‚   โ”œโ”€โ”€ pipeline3_assembly.yml
โ”‚   โ””โ”€โ”€ analysis_stats.yml
โ”‚
โ””โ”€โ”€ docs/
    โ”œโ”€โ”€ METHODS.md                      # Detailed methodology
    โ”œโ”€โ”€ RESULTS.md                      # Complete results summary
    โ”œโ”€โ”€ INSTALLATION.md                 # Setup instructions
    โ””โ”€โ”€ CITATION.md                     # How to cite

๐Ÿš€ Quick Start

Installation

# Clone repository
git clone https://github.com/vanbelkummax/clb-detection-toolkit.git
cd clb-detection-toolkit

# Create conda environment
conda env create -f envs/pipeline2_diamond.yml
conda activate clb-pipeline

# Download reference databases
# GRCh38 for human depletion
# DIAMOND database for CLB island

Running Pipelines

Pipeline 1: Fast Screening (bbmap)

cd pipelines/01_nucleotide_bbmap
bash run_pipeline1.sh --sample SRR12345678 --threads 10

Pipeline 2: Multi-Threshold Detection (DIAMOND)

cd pipelines/02_protein_diamond
python multi_threshold_clb_pipeline.py \
    SRR12345678 \
    /data/SRR12345678_1.fq.gz \
    /data/SRR12345678_2.fq.gz \
    /output/results \
    --exposure Neo-ABX

Pipeline 3: MAG Assembly & Binning

cd pipelines/03_assembly_mag
bash run_pipeline.sh --threads 16

๐Ÿ“ˆ Results at a Glance

Burden Distribution

Statistical Summary

Primary Comparison: Neo-ABX vs No-ABX

  • Permutation test (CLB burden): p = 0.00444
  • Welch's t-test: t = 2.14 (df โ‰ˆ 37), p = 0.039
  • Mann-Whitney U: p = 0.048
  • Fisher's exact (complete island): OR = 3.43 (95% CI: 1.31-8.98), p = 0.016

Robust across definitions:

  • Standard (โ‰ฅ5 reads & โ‰ฅ10% breadth): OR = 3.43
  • Strict (โ‰ฅ99.9% breadth): OR = 3.31
  • High burden (โ‰ฅ50K reads/10M): OR = 7.42

๐Ÿ’ป Technical Highlights

Computational Skills Demonstrated

โœ… Multi-Pipeline Development Three independent approaches with orthogonal validation

โœ… Statistical Rigor Permutation testing, multiple testing correction, effect sizes

โœ… Scalability Parallel processing across 113 samples

โœ… Reproducibility Conda environments, version-controlled parameters, documented workflows

โœ… Data Visualization Publication-quality figures with matplotlib/seaborn

โœ… Algorithm Selection Appropriate method choice for sensitivity vs. specificity trade-offs

Software Stack

Core Tools:

  • BBMap v39.37 (read mapping)
  • DIAMOND v2.1.10 (protein alignment)
  • Bowtie2 v2.5.4 (human depletion)
  • metaSPAdes v3.15+ (assembly)
  • MetaBAT2, MaxBin2, SemiBin2 (binning)
  • CheckM2, GTDB-Tk (QC & taxonomy)

Statistical Analysis:

  • Python 3.8+ (numpy, scipy, pandas, statsmodels)
  • R 4.3+ (tidyverse, logistf)

Visualization:

  • matplotlib v3.7+, seaborn v0.12+

๐Ÿงฎ Methods Overview

Sample Processing

  1. Quality Trimming (BBDuk)

    • Adapter removal, Q20 trimming
    • Minimum length: 30bp
  2. Human Depletion (Bowtie2)

    • Reference: GRCh38
    • Very-sensitive-local mode
    • Result: <1% human contamination
  3. CLB Detection (DIAMOND blastx)

    • E-value: 1e-5
    • Initial identity: 70%
    • Post-hoc filtering: 70-100% (7 thresholds)
  4. Gene Detection Criteria

    • Minimum unique reads: โ‰ฅ5
    • Minimum breadth: โ‰ฅ10% (nucleotide fraction)
    • Complete island: All 18 genes detected
  5. Normalization

    • Burden: Reads per 10 million microbial reads
    • Accounts for sequencing depth variation

Statistical Framework

Permutation Testing:

  • 100,000 random permutations
  • Two-tailed test for difference in means
  • Effect size: Cohen's d
  • Seed: 42 (reproducibility)

Fisher's Exact Tests:

  • Haldane-Anscombe correction (0.5 added to all cells)
  • Odds ratios with 95% confidence intervals
  • Appropriate for small sample counts

Multiple Testing Correction:

  • Benjamini-Hochberg FDR (virulence toolkit genes)
  • ฮฑ = 0.05 significance threshold

๐Ÿ“š Documentation


๐Ÿ”— Related Work

This analysis is part of:

Ryan et al. (2025) - Bifidobacteria support optimal infant vaccine responses [Publication details to be added]

Clinical Context:

  • Early-onset colorectal cancer (EOCRC) rising in birth cohorts exposed to early antibiotics
  • Colibactin mutational signatures (SBS88/ID18) found in EOCRC tumors
  • This work provides an associative link: antibiotics โ†’ genotoxin enrichment โ†’ potential cancer risk

๐Ÿค Contributing

This repository is primarily for reproducibility and demonstration of computational methods. For questions or collaboration inquiries, please open an issue.


๐Ÿ“„ License

MIT License - See LICENSE file for details


๐Ÿ‘ค Author

Max Van Belkum Vanderbilt University GitHub: @vanbelkummax


๐Ÿ™ Acknowledgments

  • Ryan Lab - Sample collection and metadata
  • VMIC - Computational infrastructure
  • Feargal Ryan - metaSPAdes assemblies

๐Ÿ“Š Citation

If you use this code or methodology, please cite:

@article{ryan2025bifidobacteria,
  title={Bifidobacteria support optimal infant vaccine responses},
  author={Ryan, Feargal J. and others},
  journal={[Journal]},
  year={2025},
  note={CLB analysis by Max van Belkum}
}

Built with computational rigor. Validated with orthogonal methods. Designed for reproducibility.

About

Bioinformatics toolkit for detecting colibactin (pks island) in metagenomic data using three complementary approaches: read-based mapping, assembly-based validation, and protein homology search

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published