Max Van Belkum | Vanderbilt University Associated with: Ryan et al. (2025) - Bifidobacteria support optimal infant vaccine responses
This repository contains a comprehensive computational framework for detecting and quantifying the colibactin (pks) genotoxin biosynthesis gene cluster in infant gut microbiomes. Through three independent pipelines employing orthogonal computational approaches, we demonstrate that neonatal antibiotic exposure dramatically enriches colibactin-producing bacteria in the infant gutโa finding with potential implications for long-term colorectal cancer risk.
Antibiotic-exposed infants show 7ร higher burden of genotoxin-producing bacteria 113 samples | 3 independent pipelines | Publication-quality statistics
| Metric | Neo-ABX (n=33) | No-ABX (n=80) | Fold Change | P-value |
|---|---|---|---|---|
| CLB Burden (reads/10M) | 16,176 | 3,504 | 4.6ร | 0.004 |
| CLB+ Prevalence (>9 genes) | 42.4% | 21.2% | 2.0ร | 0.036 |
| Complete Island (18/18 genes) | 33.3% | 12.5% | 2.7ร | 0.016 |
| High Burden (โฅ50K reads/10M) | 18.2% | 2.5% | 7.3ร | 0.008 |
Statistical Methods:
- Permutation testing (100,000 iterations)
- Fisher's exact tests with Haldane-Anscombe correction
- Benjamini-Hochberg FDR correction
- Cohen's d effect sizes
The colibactin (pks) island is an 18-gene biosynthetic cluster that produces a genotoxin causing:
- DNA double-strand breaks
- Mutational signatures SBS88/ID18 (colorectal cancer-associated)
- Enhanced bacterial colonization when paired with adhesins
Clinical Significance: Early-life antibiotic exposure may establish a genotoxin-rich microbiome during critical developmental windows, potentially contributing to early-onset colorectal cancer (EOCRC) risk decades later.
This analysis employs multi-method validation using three orthogonal computational approaches:
Speed-optimized prevalence screening
- Algorithm: Direct nucleotide alignment with bbmap v39.37
- Threshold: 95% nucleotide identity
- Normalization: Reads Per Million (RPM)
- Runtime: ~2 hours for 113 samples
- Best for: Fast screening, prevalence estimation
Key Result: 7.08ร higher CLB burden in Neo-ABX (p=0.0046)
High-sensitivity functional annotation
- Algorithm: DIAMOND blastx with multi-threshold filtering
- Thresholds: 70%, 75%, 80%, 85%, 90%, 95%, 100% identity
- Innovation: One-pass search + post-hoc filtering (4ร faster)
- Coverage: Breadth calculated as nucleotide fraction
- Best for: Functional validation, gene-level quantification
Key Result: 4.62ร higher CLB burden at 95% identity (p=0.00444)
Genome-coherent co-location proof
- Assembly: metaSPAdes on 398 metagenomes
- Binning: MetaBAT2, MaxBin2, SemiBin2 โ DAS_Tool consensus
- QC: CheckM2 (completeness/contamination), GUNC (chimera detection)
- Taxonomy: GTDB-Tk R220
- Best for: Proving pks + virulence genes co-occur on same genome
Key Result: Genome-level validation of pks + adhesin toolkit co-location
neonatal-antibiotics-colibactin-microbiome/
โโโ README.md # This file
โโโ LICENSE # MIT License
โ
โโโ pipelines/ # Three independent pipelines
โ โโโ 01_nucleotide_bbmap/ # Pipeline 1: Read-centric (bbmap)
โ โ โโโ run_pipeline1.sh
โ โ โโโ extract_clb_rpm.py
โ โ โโโ statistical_tests.py
โ โ
โ โโโ 02_protein_diamond/ # Pipeline 2: Protein-level (DIAMOND)
โ โ โโโ multi_threshold_clb_pipeline.py
โ โ โโโ compute_statistics.py
โ โ โโโ generate_figures.R
โ โ
โ โโโ 03_assembly_mag/ # Pipeline 3: MAG binning
โ โโโ run_metaspades.sh
โ โโโ run_binning.sh
โ โโโ check_clb_synteny.py
โ โโโ quantify_toolkit.py
โ
โโโ analysis/ # Statistical analysis & visualization
โ โโโ statistics/
โ โ โโโ permutation_tests.py
โ โ โโโ fisher_exact_tests.R
โ โ โโโ fdr_correction.py
โ โ
โ โโโ visualization/
โ โ โโโ plot_burden_violin.py
โ โ โโโ plot_prevalence_bar.py
โ โ โโโ plot_ecdf.py
โ โ
โ โโโ results/ # Output TSV files
โ โโโ pipeline1_summary.tsv
โ โโโ pipeline2_gene_details.tsv
โ โโโ statistical_tests.tsv
โ
โโโ figures/ # Publication-quality figures
โ โโโ fig1_burden_violin.pdf
โ โโโ fig2_prevalence_bar.pdf
โ โโโ fig3_gene_heatmap.pdf
โ โโโ fig4_forest_plot.pdf
โ
โโโ data/
โ โโโ metadata/
โ โ โโโ sample_metadata.tsv # 113 samples, exposure groups
โ โ
โ โโโ reference/
โ โโโ clb_island_proteins.faa # 18 CLB genes (amino acid)
โ โโโ clb_island_nucleotide.fna # IHE3034 reference
โ โโโ virulence_toolkit.faa # Expanded virulence panel
โ
โโโ envs/ # Conda environments
โ โโโ pipeline1_bbmap.yml
โ โโโ pipeline2_diamond.yml
โ โโโ pipeline3_assembly.yml
โ โโโ analysis_stats.yml
โ
โโโ docs/
โโโ METHODS.md # Detailed methodology
โโโ RESULTS.md # Complete results summary
โโโ INSTALLATION.md # Setup instructions
โโโ CITATION.md # How to cite
# Clone repository
git clone https://github.com/vanbelkummax/clb-detection-toolkit.git
cd clb-detection-toolkit
# Create conda environment
conda env create -f envs/pipeline2_diamond.yml
conda activate clb-pipeline
# Download reference databases
# GRCh38 for human depletion
# DIAMOND database for CLB islandPipeline 1: Fast Screening (bbmap)
cd pipelines/01_nucleotide_bbmap
bash run_pipeline1.sh --sample SRR12345678 --threads 10Pipeline 2: Multi-Threshold Detection (DIAMOND)
cd pipelines/02_protein_diamond
python multi_threshold_clb_pipeline.py \
SRR12345678 \
/data/SRR12345678_1.fq.gz \
/data/SRR12345678_2.fq.gz \
/output/results \
--exposure Neo-ABXPipeline 3: MAG Assembly & Binning
cd pipelines/03_assembly_mag
bash run_pipeline.sh --threads 16Primary Comparison: Neo-ABX vs No-ABX
- Permutation test (CLB burden): p = 0.00444
- Welch's t-test: t = 2.14 (df โ 37), p = 0.039
- Mann-Whitney U: p = 0.048
- Fisher's exact (complete island): OR = 3.43 (95% CI: 1.31-8.98), p = 0.016
Robust across definitions:
- Standard (โฅ5 reads & โฅ10% breadth): OR = 3.43
- Strict (โฅ99.9% breadth): OR = 3.31
- High burden (โฅ50K reads/10M): OR = 7.42
โ Multi-Pipeline Development Three independent approaches with orthogonal validation
โ Statistical Rigor Permutation testing, multiple testing correction, effect sizes
โ Scalability Parallel processing across 113 samples
โ Reproducibility Conda environments, version-controlled parameters, documented workflows
โ Data Visualization Publication-quality figures with matplotlib/seaborn
โ Algorithm Selection Appropriate method choice for sensitivity vs. specificity trade-offs
Core Tools:
- BBMap v39.37 (read mapping)
- DIAMOND v2.1.10 (protein alignment)
- Bowtie2 v2.5.4 (human depletion)
- metaSPAdes v3.15+ (assembly)
- MetaBAT2, MaxBin2, SemiBin2 (binning)
- CheckM2, GTDB-Tk (QC & taxonomy)
Statistical Analysis:
- Python 3.8+ (numpy, scipy, pandas, statsmodels)
- R 4.3+ (tidyverse, logistf)
Visualization:
- matplotlib v3.7+, seaborn v0.12+
-
Quality Trimming (BBDuk)
- Adapter removal, Q20 trimming
- Minimum length: 30bp
-
Human Depletion (Bowtie2)
- Reference: GRCh38
- Very-sensitive-local mode
- Result: <1% human contamination
-
CLB Detection (DIAMOND blastx)
- E-value: 1e-5
- Initial identity: 70%
- Post-hoc filtering: 70-100% (7 thresholds)
-
Gene Detection Criteria
- Minimum unique reads: โฅ5
- Minimum breadth: โฅ10% (nucleotide fraction)
- Complete island: All 18 genes detected
-
Normalization
- Burden: Reads per 10 million microbial reads
- Accounts for sequencing depth variation
Permutation Testing:
- 100,000 random permutations
- Two-tailed test for difference in means
- Effect size: Cohen's d
- Seed: 42 (reproducibility)
Fisher's Exact Tests:
- Haldane-Anscombe correction (0.5 added to all cells)
- Odds ratios with 95% confidence intervals
- Appropriate for small sample counts
Multiple Testing Correction:
- Benjamini-Hochberg FDR (virulence toolkit genes)
- ฮฑ = 0.05 significance threshold
- METHODS.md - Detailed pipeline methodology
- RESULTS.md - Complete statistical results
- INSTALLATION.md - Setup & dependencies
- CITATION.md - How to cite this work
This analysis is part of:
Ryan et al. (2025) - Bifidobacteria support optimal infant vaccine responses [Publication details to be added]
Clinical Context:
- Early-onset colorectal cancer (EOCRC) rising in birth cohorts exposed to early antibiotics
- Colibactin mutational signatures (SBS88/ID18) found in EOCRC tumors
- This work provides an associative link: antibiotics โ genotoxin enrichment โ potential cancer risk
This repository is primarily for reproducibility and demonstration of computational methods. For questions or collaboration inquiries, please open an issue.
MIT License - See LICENSE file for details
Max Van Belkum Vanderbilt University GitHub: @vanbelkummax
- Ryan Lab - Sample collection and metadata
- VMIC - Computational infrastructure
- Feargal Ryan - metaSPAdes assemblies
If you use this code or methodology, please cite:
@article{ryan2025bifidobacteria,
title={Bifidobacteria support optimal infant vaccine responses},
author={Ryan, Feargal J. and others},
journal={[Journal]},
year={2025},
note={CLB analysis by Max van Belkum}
}Built with computational rigor. Validated with orthogonal methods. Designed for reproducibility.