NBC++ for Out-of-Distribution Detection of Novel Taxa

This repository contains the code and analysis pipeline for evaluating the Naïve Bayes Classifier++ (NBC++) as an out-of-distribution (OOD) detector for novel taxa in metagenomic classification.

Paper: "Naïve Bayes Classifier++ as an out-of-distribution detector of novel taxa" (preprint)

Repository Structure

NBC-novelty-detection/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── run_pipeline.sh
├── docs/
│   ├── DATA_FORMAT.md           # Input file format specifications
│   └── PIPELINE.md              # Detailed pipeline documentation
├── src/
│   ├── __init__.py
│   ├── config.py                # Configuration - SET YOUR PATHS HERE
│   ├── create_training_sets.py
│   ├── mass_mod.py
│   ├── plot_mean_roc.py
│   ├── plot_roc_distro.py
│   └── jellyfish_gen.sh
├── examples/
│   ├── sample_lineage.csv
│   ├── sample_species_mapping.json
│   └── sample_training_list.txt
└── scripts/
    └── submit_jobs.sh           # SLURM job submission

Installation

# Clone the repository
git clone https://github.com/yourusername/NBC-novelty-detection.git
cd NBC-novelty-detection

# Install Python dependencies
pip install -r requirements.txt

Prerequisites

Python 3.8+
Jellyfish (for k-mer counting)
SLURM (optional, for HPC parallel processing)

Quick Start

# 1. Configure your paths (edit src/config.py or use environment variables)
export NBC_DATA_ROOT=/path/to/your/data
export NBC_RESULTS_ROOT=/path/to/your/results
export NBC_IMAGES_ROOT=/path/to/your/images

# 2. Run the full pipeline
./run_pipeline.sh

# Or run with options
./run_pipeline.sh --skip-training --taxa order --kmer 9

Configuration

All paths are configured in src/config.py. You can either edit the file directly or set environment variables:

export NBC_DATA_ROOT=/path/to/your/data
export NBC_RESULTS_ROOT=/path/to/your/results
export NBC_IMAGES_ROOT=/path/to/your/images

See docs/DATA_FORMAT.md for detailed input file specifications, or check examples/ for sample files.
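
As a minimal sketch of how the environment variables above could override built-in defaults, the snippet below resolves each root with `os.environ`. The function name `resolve_roots` and the fallback directory names are hypothetical, and the actual logic in src/config.py may differ.

```python
import os
from pathlib import Path

# Hypothetical fallbacks, used only when the corresponding variable is unset.
_DEFAULTS = {
    "NBC_DATA_ROOT": "data",
    "NBC_RESULTS_ROOT": "results",
    "NBC_IMAGES_ROOT": "images",
}


def resolve_roots(env=None):
    """Map each variable name to a Path, preferring values found in the environment."""
    env = os.environ if env is None else env
    return {name: Path(env.get(name, default)) for name, default in _DEFAULTS.items()}


if __name__ == "__main__":
    # Print the directories the pipeline would use under this sketch.
    for name, path in resolve_roots().items():
        print(f"{name} = {path}")
```

Passing a plain dictionary as `env` makes the lookup easy to check without touching the real environment.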

Usage

Step 1: Create Training Sets

python src/create_training_sets.py <taxa_level> <trial_number>

# Example
python src/create_training_sets.py Order Trial_1

Step 2: Generate K-mer Counts

./src/jellyfish_gen.sh <source_dir> <k-mer_size> <is_full_genome>

# Example
./src/jellyfish_gen.sh /path/to/genomes 9 false
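
To make "k-mer counts" concrete, here is a small pure-Python illustration of counting overlapping k-mers in one sequence. It is not how Jellyfish or jellyfish_gen.sh work internally, and choices such as skipping windows with ambiguous bases are assumptions made for the sketch. The function name `count_kmers` is hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(sequence, k):
    """Count overlapping k-mers, skipping any window that contains a non-ACGT character."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


# Toy sequence: the final window "GTN" is skipped because of the ambiguous base.
counts = count_kmers("ACGTACGTN", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

The number of possible k-mers grows as 4^k, so a dedicated counter is the practical choice for real genomes at larger k.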

Step 3: Run NBC++ Classification

See the NBC++ documentation for classification instructions.

Step 4: Process Results

python src/mass_mod.py

Step 5: Generate Plots

python src/plot_mean_roc.py
python src/plot_roc_distro.py

For detailed pipeline documentation, see docs/PIPELINE.md.

Results

K-mer Length and Classification Accuracy

Longer k-mers consistently improve discrimination between known and unknown sequences:

K-mer Length	AUC Performance
3-mers	~0.5-0.6 (baseline)
6-mers	Moderate improvement
9-mers	Good discrimination
12-mers	~0.81 (order level)
15-mers	>0.85 (best)
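
AUC summarizes how well a score separates known from novel sequences. The sketch below assumes each read has a single score (for example, its best log-likelihood) where higher means "more likely known". It uses the rank formulation of AUC: the probability that a randomly chosen known read outscores a randomly chosen novel read, with ties counted as one half. The function name `auc_from_scores` and the toy scores are illustrative and are not results from the paper.

```python
def auc_from_scores(known_scores, novel_scores):
    """Return P(known > novel) + 0.5 * P(known == novel) over all known/novel pairs."""
    if not known_scores or not novel_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for k in known_scores:
        for n in novel_scores:
            if k > n:
                wins += 1.0
            elif k == n:
                wins += 0.5
    return wins / (len(known_scores) * len(novel_scores))


# Illustrative scores only; real values come from the classification output.
known = [-1010.0, -1025.5, -1040.2, -1060.0]
novel = [-1045.0, -1070.3, -1090.8]
print(f"AUC = {auc_from_scores(known, novel):.3f}")
```

The pairwise loop is quadratic in the number of reads, so a sort-based implementation would be preferable for full datasets.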

Threshold Stability

Novelty thresholds remained stable between basic and extended databases. At 9-mers:

Extended phylum threshold: -1049.62
Basic phylum threshold: -1053.81
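
To illustrate how thresholds like these could be applied, the sketch below labels a read "known" when its score is at or above the threshold and reports the fraction of known reads. It assumes the threshold applies to a per-read log-likelihood-style score where lower values indicate novelty. The comparison direction, the function name `fraction_known`, and the toy scores are assumptions. The last lines compute the gap between the two reported phylum thresholds.

```python
EXTENDED_PHYLUM_THRESHOLD = -1049.62  # reported above for 9-mers
BASIC_PHYLUM_THRESHOLD = -1053.81     # reported above for 9-mers


def fraction_known(scores, threshold):
    """Fraction of reads whose score is at or above the novelty threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    known = sum(1 for s in scores if s >= threshold)
    return known / len(scores)


# Illustrative scores only; they are not taken from the paper.
toy_scores = [-1001.2, -1048.0, -1051.0, -1075.4]
for label, threshold in [("extended", EXTENDED_PHYLUM_THRESHOLD),
                         ("basic", BASIC_PHYLUM_THRESHOLD)]:
    print(f"{label}: {fraction_known(toy_scores, threshold):.0%} known")

# Gap between the two thresholds, in absolute terms and relative to the basic threshold.
gap = abs(EXTENDED_PHYLUM_THRESHOLD - BASIC_PHYLUM_THRESHOLD)
relative_gap = gap / abs(BASIC_PHYLUM_THRESHOLD)
print(f"absolute gap = {gap:.2f}, relative gap = {relative_gap:.2%}")
```

Only reads whose scores fall between the two thresholds change label when switching databases, which is why a small gap suggests stable behavior.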

Human Gut Metagenome Application

The basic and extended models were applied, with both NBC++ and Kraken2, to a real human gut metagenome (SRA ID: SRS105153):

Model	% Classified as "Known"
Basic (NBC++)	98.09%
Extended (NBC++)	92.77%
Basic (Kraken2)	8.87%
Extended (Kraken2)	82.30%

Data Availability

NBC++ Source Code: GitHub
Docker Container: Docker Hub
Training Data: Zenodo
Human Gut Sample: SRA SRS105153
