Genomic Similarity via Normalized Compression Distance (NCD)

This project implements a Python pipeline that computes pairwise distances between genomic sequences using Normalized Compression Distance (NCD), a method rooted in Kolmogorov complexity and approximated here with gzip compression. It supports subsampling at the genome and taxonomy levels, parallel distance computation across processes, and top-K filtering for scalability.
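For readers new to the method, NCD is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes and xy is the concatenation of the two sequences. The following is a minimal sketch of that formula with Python's standard gzip module. The function names `compressed_length` and `ncd` are illustrative and are not taken from gziplength.py or NCD.py, whose actual interfaces may differ.

```python
import gzip


def compressed_length(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input, used as C(x)."""
    return len(gzip.compress(data, compresslevel=9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_length(bx), compressed_length(by)
    cxy = compressed_length(bx + by)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Two near-identical sequences compress well together, so their NCD is close to 0; unrelated sequences share little redundancy, so their NCD approaches 1.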

Features

Accepts input sequences in FASTA (.fasta) format
Subsampling by genome length or taxonomic hierarchy (e.g., kingdom, phylum, species)
Optional saving/loading of precomputed compressed lengths for fast reuse
Parallelized NCD calculation with ProcessPoolExecutor
Export of results as .csv file, including distances and matched labels
Support for top-K filtering (--k_min) to reduce output memory footprint
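Caching compressed lengths pays off because C(x) for a training fragment does not depend on the test sequence, so it only needs to be computed once. The sketch below illustrates the idea with a JSON file keyed by fragment. This is an assumption made for illustration: the function names and the JSON format are hypothetical, and the project itself appears to store its dataset as an .npz file.

```python
import gzip
import json
from pathlib import Path


def gzip_length(seq: str) -> int:
    """Compressed length of one sequence in bytes."""
    return len(gzip.compress(seq.encode("ascii")))


def load_or_compute_lengths(fragments, cache_path):
    """Return {fragment: compressed length}, reusing a JSON cache if present."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    # Only fragments that are not cached yet need to be compressed.
    missing = [frag for frag in fragments if frag not in cache]
    for frag in missing:
        cache[frag] = gzip_length(frag)
    if missing:
        path.write_text(json.dumps(cache))
    return {frag: cache[frag] for frag in fragments}
```

With the lengths cached, each NCD evaluation only has to compress the test sequence and the concatenated pair.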

Project Structure

.
├── NCD.py                 # Main script for training/testing
├── dataloader.py          # FASTA loader and fragment subsampling logic
├── gziplength.py          # Gzip-based compression length utility
├── utils.py               # Taxonomic parsing, sequence validation
├── train.sh               # Example training script
├── test.sh                # Example testing script
└── data/
    ├── train/             # Directory of training FASTA files
    ├── test/              # Directory of test FASTA files
    └── metadata/          # Contains `nodes.dmp` and optionally `names.dmp`

Requirements

Python 3.8+ (this project has not been tested on earlier Python versions)
Biopython
NumPy
pandas

Install dependencies with:

pip install -r requirements.txt

Example Usage

Training (with sampling and saving dataset)

./train.sh

This will:

Load and fragment genomic FASTA files in data/train
Optionally subsample by kingdom (using --subsample_mode taxa)
Save the resulting dataset as data/test_dataset.npz

Testing (using precomputed training dataset)

./test.sh

This will:

Load test sequences from data/test
Load precomputed training fragments
Compute NCD distances
Save a CSV result with test label, sorted distances, and matched training labels

Output Format

test_label,NCD,train_labels
NC_000001,0.123;0.234;0.245;...,Escherichia;Klebsiella;Salmonella;...
...

If --k_min is used, only the top-k smallest NCD values are saved for each test sequence.
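To make the row layout concrete, here is a small sketch of how one output row could be assembled, with and without top-k filtering. The helper `top_k_row` is hypothetical, and the three-decimal formatting is an assumption based on the sample row above; NCD.py may build its rows differently.

```python
import heapq


def top_k_row(test_label, distances, train_labels, k=None):
    """Build one output row, keeping only the k smallest distances if k is set."""
    pairs = zip(distances, train_labels)
    if k is None:
        kept = sorted(pairs, key=lambda p: p[0])
    else:
        # nsmallest returns the k closest matches in ascending order.
        kept = heapq.nsmallest(k, pairs, key=lambda p: p[0])
    ncd_field = ";".join(f"{d:.3f}" for d, _ in kept)
    label_field = ";".join(label for _, label in kept)
    return [test_label, ncd_field, label_field]
```

Each returned list maps onto the test_label, NCD, and train_labels columns and can be passed to `csv.writer(...).writerow`.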

Taxonomy Metadata Format

The taxonomy hierarchy is based on the NCBI taxonomic tree structure. It is required whenever taxa-level sampling is used.

The --taxa_data folder must contain:

nodes.dmp: Taxonomic tree structure (from NCBI)
names.dmp: (optional) TaxID-to-name mapping

These files are used to sample genomes by taxonomic rank (e.g., 100 fragments per species, genus, etc.)
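As an illustration of how nodes.dmp supports rank-based sampling, the sketch below parses the first three fields of each line (TaxID, parent TaxID, rank), which NCBI separates with a tab, a pipe, and a tab, and then walks up the tree to find the ancestor at a requested rank. The function names are hypothetical and are not taken from utils.py.

```python
def parse_nodes(lines):
    """Map each TaxID to (parent TaxID, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def ancestor_at_rank(tax_id, rank, nodes):
    """Walk up the tree until a node with the requested rank is found."""
    current = tax_id
    while current in nodes:
        parent, current_rank = nodes[current]
        if current_rank == rank:
            return current
        if parent == current:  # the root node is its own parent
            return None
        current = parent
    return None
```

Grouping fragments by the TaxID returned for, say, "genus" then allows a fixed number of fragments to be drawn per group.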

Notes

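In theory, NCD is symmetric and bounded between 0 and 1. With a real compressor such as gzip, symmetry is only approximate and values can slightly exceed 1.
This project uses 126 bp fragments for both training and testing.
For large datasets, use --k_min to avoid memory and storage issues: the full NCD matrix has one entry per test/training pair, so it grows with the product of the two set sizes.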
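The fragment length mentioned above can be illustrated with a short sketch that cuts a sequence into 126 bp pieces. Using non-overlapping windows and dropping a trailing remainder shorter than the fragment size are assumptions made here for simplicity; the actual logic in dataloader.py may differ.

```python
def fragment_sequence(seq, size=126):
    """Split a sequence into non-overlapping fragments of exactly `size` bases."""
    # Stop early enough that every slice is full length; leftovers are dropped.
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
```

A 400 bp sequence, for example, yields three 126 bp fragments, and the final 22 bases are discarded.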
