This repository contains the code and analysis pipeline for evaluating the Naïve Bayes Classifier++ (NBC++) as an out-of-distribution (OOD) detector for novel taxa in metagenomic classification.
Paper: Naïve Bayes Classifier++ as an out-of-distribution detector of novel taxa preprint
NBC-novelty-detection/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── run_pipeline.sh
├── docs/
│ ├── DATA_FORMAT.md # Input file format specifications
│ └── PIPELINE.md # Detailed pipeline documentation
├── src/
│ ├── __init__.py
│ ├── config.py # Configuration - SET YOUR PATHS HERE
│ ├── create_training_sets.py
│ ├── mass_mod.py
│ ├── plot_mean_roc.py
│ ├── plot_roc_distro.py
│ └── jellyfish_gen.sh
├── examples/
│ ├── sample_lineage.csv
│ ├── sample_species_mapping.json
│ └── sample_training_list.txt
└── scripts/
└── submit_jobs.sh # SLURM job submission
# Clone the repository
git clone https://github.com/yourusername/NBC-novelty-detection.git
cd NBC-novelty-detection
# Install Python dependencies
pip install -r requirements.txt- Python 3.8+
- Jellyfish (for k-mer counting)
- SLURM (optional, for HPC parallel processing)
# 1. Configure your paths (edit src/config.py or use environment variables)
export NBC_DATA_ROOT=/path/to/your/data
export NBC_RESULTS_ROOT=/path/to/your/results
export NBC_IMAGES_ROOT=/path/to/your/images
# 2. Run the full pipeline
./run_pipeline.sh
# Or run with options
./run_pipeline.sh --skip-training --taxa order --kmer 9All paths are configured in src/config.py. You can either edit the file directly or set environment variables:
export NBC_DATA_ROOT=/path/to/your/data
export NBC_RESULTS_ROOT=/path/to/your/results
export NBC_IMAGES_ROOT=/path/to/your/imagesSee docs/DATA_FORMAT.md for detailed input file specifications, or check examples/ for sample files.
python src/create_training_sets.py <taxa_level> <trial_number>
# Example
python src/create_training_sets.py Order Trial_1./src/jellyfish_gen.sh <source_dir> <k-mer_size> <is_full_genome>
# Example
./src/jellyfish_gen.sh /path/to/genomes 9 falseSee NBC++ documentation for classification instructions.
python src/mass_mod.pypython src/plot_mean_roc.py
python src/plot_roc_distro.pyFor detailed pipeline documentation, see docs/PIPELINE.md.
Longer k-mers consistently improve discrimination between known and unknown sequences:
| K-mer Length | AUC Performance |
|---|---|
| 3-mers | ~0.5-0.6 (baseline) |
| 6-mers | Moderate improvement |
| 9-mers | Good discrimination |
| 12-mers | ~0.81 (order level) |
| 15-mers | >0.85 (best) |
Novelty thresholds remained stable between basic and extended databases. At 9-mers:
- Extended phylum threshold: -1049.62
- Basic phylum threshold: -1053.81
Applied to a real human gut metagenome (SRA ID: SRS105153):
| Model | % Classified as "Known" |
|---|---|
| Basic (NBC++) | 98.09% |
| Extended (NBC++) | 92.77% |
| Basic (Kraken2) | 8.87% |
| Extended (Kraken2) | 82.30% |
- NBC++ Source Code: GitHub
- Docker Container: Docker Hub
- Training Data: Zenodo
- Human Gut Sample: SRA SRS105153