Ultra-fast SNP/indel-level distance calculator for core genome MLST analysis
cgDist is a high-performance Rust implementation for calculating genetic distances in bacterial genomics, specifically designed for epidemiological outbreak investigations and phylogenetic analysis.
- β‘ Ultra-fast: Parallel processing with optimized algorithms
- π― Precision: SNP/indel-level distance calculation
- π§ Flexible: Multiple hashing algorithms (CRC32, MD5, SHA256)
- π Comprehensive: Built-in comparison tools and statistical analysis
- π₯ Epidemiological: Star clustering analysis for outbreak investigation
- 𧬠Recombination Analysis: Advanced tools for detecting genetic recombination events
- πΎ Efficient: LZ4 compression for fast caching
- π Scalable: Memory-efficient processing of large datasets
- Features
- Installation
- Quick Start
- Usage
- Recombination Analysis
- Cache Inspector
- Custom Hashers Plugin System
- API Documentation
- Citation
- Related Repositories
- Support
- License
- Rust 1.70+ (install from rustup.rs)
- Python 3.8+ (for analysis scripts)
# Clone the repository
git clone https://github.com/genpat-it/cgDist.git
cd cgDist
# Build release version
cargo build --release
# The binary will be available at ./target/release/cgdistcargo install cgdist# Build Docker image
docker build -t cgdist-rs .
# Run with Docker
docker run -v $(pwd):/data cgdist-rs cgdist --help# Calculate SNP distances from cgMLST profiles
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv
# Use different distance mode
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --mode snps-indel-bases
# Use different hashing algorithm
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --hasher-type sha256
# Enable cache for faster recomputation
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --cache-file cache.lz4
# Specify number of threads
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --threads 16Create a cgdist-config.toml file:
[general]
hasher_type = "crc32"
threads = 8
cache_file = "./cache/cgdist_cache.lz4"
missing_char = "-"
[filtering]
min_loci = 0
sample_threshold = 0.0
locus_threshold = 0.0
[output]
format = "tsv"
mode = "snps"# Use configuration file
cgdist --schema schema_dir/ --profiles profiles.tsv --config cgdist-config.toml --output distances.tsvcgdist [OPTIONS]
MAIN OPTIONS:
--schema <PATH> Path to FASTA schema directory or schema file
--profiles <PATH> Path to allelic profile matrix (.tsv or .csv)
--output <FILE> Output distance matrix file
--mode <MODE> Distance mode [default: snps]
Options: snps, snps-indel-events, snps-indel-bases, hamming
--format <FORMAT> Output format [default: tsv]
Options: tsv, csv, phylip, nexus
FILTERING OPTIONS:
--min-loci <N> Minimum shared loci for distance calculation [default: 0]
--sample-threshold <VAL> Sample quality filter (0.0-1.0) [default: 0.0]
--locus-threshold <VAL> Locus quality filter (0.0-1.0) [default: 0.0]
--include-samples <REGEX> Include only samples matching regex pattern
--exclude-samples <REGEX> Exclude samples matching regex pattern
--include-loci <REGEX> Include only loci matching regex pattern
--exclude-loci <REGEX> Exclude loci matching regex pattern
--include-samples-list <FILE> Include samples from file (one per line)
--exclude-samples-list <FILE> Exclude samples from file (one per line)
--include-loci-list <FILE> Include loci from file (one per line)
--exclude-loci-list <FILE> Exclude loci from file (one per line)
ALIGNMENT OPTIONS:
--alignment-mode <MODE> Alignment mode [default: dna]
Options: dna, dna-strict, dna-permissive, custom
--match-score <N> Custom match score (enables custom mode)
--mismatch-penalty <N> Custom mismatch penalty (enables custom mode)
--gap-open <N> Custom gap open penalty (enables custom mode)
--gap-extend <N> Custom gap extend penalty (enables custom mode)
--save-alignments <FILE> Save detailed alignments to TSV file
PERFORMANCE OPTIONS:
--threads <N> Number of threads [default: auto-detect]
--cache-file <FILE> Cache file path (.lz4 extension)
--cache-note <TEXT> Note to save with cache
--cache-only Build cache only without computing distance matrix
--force-recompute Force recomputation ignoring cache
--hasher-type <TYPE> Allele hasher type [default: crc32]
Options: crc32, sha256, md5, sequence, hamming
CACHE ENRICHMENT OPTIONS:
--enrich-lengths Enrich cache with nucleotide sequence lengths from schema
--enrich-output <FILE> Output file for enriched cache [default: overwrites input cache]
RECOMBINATION DETECTION OPTIONS:
--recombination-log <FILE> Output log of allele pairs exceeding threshold
--recombination-threshold <N> Threshold for recombination detection (SNPs + InDel bases) [default: 20]
OTHER OPTIONS:
--missing-char <CHAR> Missing data character [default: -]
--no-hamming-fallback Disable Hamming fallback for SNPs mode
--stats-only Show matrix statistics only
--benchmark Measure alignment processing speed
--benchmark-duration <N> Benchmark duration in seconds [default: 15]
--dry-run Validate inputs without computation
--inspector <FILE> Inspect cache file
--config <FILE> Path to TOML configuration file
--generate-config Generate sample configuration file
--help Display usage informationSchema (FASTA directory):
- Individual FASTA files per locus
- Each file contains allele sequences
- File names correspond to locus names
Profiles (allelic profiles):
- TSV: Tab-separated values
- CSV: Comma-separated values
- Format: Sample name | Locus1 | Locus2 | ... | LocusN
- Missing data represented by configurable character (default:
-)
Cache files:
- LZ4: Compressed cache files (.lz4 or .bin extension)
- Automatic compression/decompression
- TSV: Tab-separated distance matrix (default)
- CSV: Comma-separated distance matrix
- PHYLIP: Phylogenetic analysis format
- NEXUS: Nexus format for phylogenetic tools
cgDist includes powerful tools for detecting and analyzing genetic recombination events in bacterial populations. This feature is particularly useful for epidemiological studies and understanding horizontal gene transfer.
- Mutation Density Analysis: Detects loci with high SNP/indel density indicating recombination
- Hamming Distance Filtering: Focuses analysis on genetically related samples
- Pairwise Recombination Summary: Comprehensive overview of recombination between sample pairs
- EFSA Loci Support: Compatible with standardized loci sets for food safety applications
- Distance Matrix Correction: Adjusts distances by excluding recombinant loci
The main cgdist binary can detect potential recombination events during distance calculation:
# Detect recombination events with default threshold (20 SNPs+InDel bases)
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv \
--recombination-log recombination_events.csv \
--mode snps-indel-bases
# Custom threshold (e.g., 30 SNPs+InDel bases)
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv \
--recombination-log recombination_events.csv \
--recombination-threshold 30 \
--mode snps-indel-basesOutput: CSV log with locus, sample pairs, divergence percentages, and sequence lengths
For advanced analysis with Hamming filtering and EFSA loci support:
# Build recombination analyzer
cargo build --release --bin recombination_analyzer
# Step 1: Create enriched cache with sequence lengths
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv \
--cache-file cache.bin --enrich-lengths --mode snps-indel-bases
# Step 2: Run recombination analyzer
./target/release/recombination_analyzer \
--enriched-cache cache.bin \
--profiles profiles.tsv \
--distance-matrix distances.tsv \
--output-matrix corrected_distances.tsv \
--recombination-log recombination_events.tsv \
--threshold 3.0
# Custom threshold (5% mutation density)
./target/release/recombination_analyzer \
--enriched-cache cache.bin \
--profiles profiles.tsv \
--distance-matrix distances.tsv \
--output-matrix corrected_distances.tsv \
--recombination-log recombination_events.tsv \
--threshold 5.0For Tool 1 (Built-in Detection):
- Schema: FASTA directory with allele sequences
- Profiles: TSV/CSV file with sample-locus-allele matrix
For Tool 2 (Recombination Analyzer):
- Enriched Cache:
.binfile generated with--enrich-lengthsoption - Allelic Profiles: TSV file with sample-locus-allele matrix
- Distance Matrix: Original distance matrix from cgdist
- EFSA Loci (optional): TSV file listing loci of interest
Tool 1 Output: recombination_events.csv
- Locus name
- Sample pairs
- Divergence percentage
- Sequence lengths
- SNPs and InDel counts
Tool 2 Outputs:
- Corrected Distance Matrix: Distance matrix with recombinant loci excluded
- Recombination Events Log: Detailed list of detected recombination events with:
- Sample pairs
- Locus information
- Mutation statistics (SNPs, InDels)
- Density percentages
- Sequence lengths
Tool 1 (Built-in):
--recombination-threshold: SNPs + InDel bases threshold (default: 20)
Tool 2 (Recombination Analyzer):
--threshold: Mutation density percentage (default: 3.0%)
# Option A: Quick detection during distance calculation
cgdist --schema schema/ --profiles samples.tsv --output distances.tsv \
--recombination-log events.csv --recombination-threshold 20 \
--mode snps-indel-bases
# Option B: Advanced analysis with corrected distances
# Step 1: Create enriched cache
cgdist --schema schema/ --profiles samples.tsv --output distances.tsv \
--cache-file cache.bin --enrich-lengths --mode snps-indel-bases
# Step 2: Analyze and correct
./target/release/recombination_analyzer \
--enriched-cache cache.bin \
--profiles samples.tsv \
--distance-matrix distances.tsv \
--output-matrix corrected_distances.tsv \
--recombination-log events.tsv \
--threshold 3.0- High SNP Density: > 3% suggests recombination vs point mutations
- High Indel Events: Indicates potential mobile genetic elements
- Pairwise Patterns: Multiple loci with recombination between same samples suggests related strains
- Hamming Filtering: Ensures focus on epidemiologically relevant comparisons
- Memory Usage: ~4-8GB for typical bacterial datasets (1000+ samples)
- Processing Time: 2-5 minutes for 21M cache entries on modern hardware
- Scalability: Linear with cache size, efficient for large epidemiological studies
- Outbreak Investigation: Identify recombination hotspots in transmission chains
- Evolutionary Analysis: Track horizontal gene transfer patterns
- Food Safety: Monitor recombination in foodborne pathogens
- Antimicrobial Resistance: Detect resistance gene transfer events
- Population Genomics: Understand bacterial population structure
The inspector tool provides detailed analysis of cgDist cache files, including validation, statistics, and compatibility checks.
cargo build --release --bin inspector# Show cache summary
./target/release/inspector --cache cache.lz4
# Detailed information including all loci
./target/release/inspector --cache cache.lz4 --detailed
# Show entries for specific locus
./target/release/inspector --cache cache.lz4 --show-locus locus_name
# Validate cache integrity
./target/release/inspector --cache cache.lz4 --validate
# Export cache summary to TSV
./target/release/inspector --cache cache.lz4 --export-summary summary.tsv
# Check top N loci by entry count
./target/release/inspector --cache cache.lz4 --top-loci 20# Detect alignment mode from parameters
./target/release/inspector --cache cache.lz4 --detect-mode
# Check compatibility with specific alignment parameters
./target/release/inspector --cache cache.lz4 \
--check-compatibility "5,-4,-10,-1" # match,mismatch,gap_open,gap_extend
# Quiet mode for scripting
./target/release/inspector --cache cache.lz4 --validate --quiet- Cache Validation: Verify cache file integrity before reuse
- Troubleshooting: Diagnose cache compatibility issues
- Statistics: Understand cache size and loci distribution
- Auditing: Track which alignment parameters were used
- Quality Control: Ensure cache matches expected schema
cgDist provides a powerful plugin architecture for implementing custom hashing algorithms. This is particularly useful for specialized applications or compatibility with other tools.
Create a new hasher by implementing the AlleleHasher trait:
use cgdist::hashers::{AlleleHasher, AlleleHash};
/// Example: Simple nucleotide composition hasher
#[derive(Debug)]
pub struct CompositionHasher;
impl AlleleHasher for CompositionHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
// Count nucleotides: A, T, G, C
let mut counts = [0u8; 4]; // A, T, G, C
for nucleotide in sequence.chars() {
match nucleotide.to_ascii_uppercase() {
'A' => counts[0] += 1,
'T' => counts[1] += 1,
'G' => counts[2] += 1,
'C' => counts[3] += 1,
_ => {} // Ignore ambiguous bases
}
}
// Create hash from composition: AAAAATTTTGGGGCCCC format
let hash_string = format!("A{}T{}G{}C{}",
counts[0], counts[1], counts[2], counts[3]);
AlleleHash::String(hash_string)
}
fn parse_allele(&self, allele_str: &str, missing_char: &str) -> Result<AlleleHash, String> {
if allele_str == missing_char {
Ok(AlleleHash::Missing)
} else {
// Parse composition string or return as-is
Ok(AlleleHash::String(allele_str.to_string()))
}
}
fn name(&self) -> &'static str {
"composition"
}
fn description(&self) -> &'static str {
"Nucleotide composition-based hasher (A/T/G/C counts)"
}
fn validate_sequence(&self, sequence: &str) -> Result<(), String> {
// Only allow ATGC nucleotides
for ch in sequence.chars() {
match ch.to_ascii_uppercase() {
'A' | 'T' | 'G' | 'C' | 'N' => {}
_ => return Err(format!("Invalid nucleotide: {}", ch)),
}
}
Ok(())
}
}use cgdist::hashers::HasherRegistry;
fn main() {
let mut registry = HasherRegistry::new();
// Register your custom hasher
registry.register_hasher("composition", Box::new(CompositionHasher));
// Use it like any built-in hasher
let hasher = registry.get_hasher("composition").unwrap();
let hash = hasher.hash_sequence("ATCGATCG");
println!("Hash: {}", hash); // Output: A2T2G2C2
}#[derive(Debug)]
pub struct KmerHasher {
k: usize,
}
impl KmerHasher {
pub fn new(k: usize) -> Self {
Self { k }
}
}
impl AlleleHasher for KmerHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
let mut kmers = Vec::new();
let seq_bytes = sequence.as_bytes();
if seq_bytes.len() >= self.k {
for i in 0..=(seq_bytes.len() - self.k) {
let kmer = std::str::from_utf8(&seq_bytes[i..i + self.k])
.unwrap_or("")
.to_string();
kmers.push(kmer);
}
}
kmers.sort();
let hash_string = kmers.join("|");
AlleleHash::String(hash_string)
}
// ... implement other required methods
}#[derive(Debug)]
pub struct CustomNumericHasher;
impl AlleleHasher for CustomNumericHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
// Convert sequence to custom numeric representation
let mut hash_value = 0u32;
for (i, nucleotide) in sequence.chars().enumerate() {
let base_value = match nucleotide.to_ascii_uppercase() {
'A' => 0,
'T' => 1,
'G' => 2,
'C' => 3,
_ => 0, // Default for ambiguous
};
// Simple polynomial rolling hash
hash_value = hash_value.wrapping_mul(4).wrapping_add(base_value);
}
AlleleHash::Crc32(hash_value)
}
fn parse_allele(&self, allele_str: &str, missing_char: &str) -> Result<AlleleHash, String> {
if allele_str == missing_char {
Ok(AlleleHash::Missing)
} else {
match allele_str.parse::<u32>() {
Ok(value) => Ok(AlleleHash::Crc32(value)),
Err(_) => Err(format!("Invalid numeric allele: {}", allele_str)),
}
}
}
fn name(&self) -> &'static str {
"custom-numeric"
}
fn description(&self) -> &'static str {
"Custom polynomial rolling hash for sequences"
}
}To use custom hashers with the cgdist command-line tool, you can:
- Fork and modify: Add your hasher to the registry in
src/main.rs - Configuration file: Load hashers from a configuration file
- Dynamic loading: Use Rust's plugin system (advanced)
Example integration in main.rs:
fn create_registry() -> HasherRegistry {
let mut registry = HasherRegistry::new();
// Add your custom hashers here
registry.register_hasher("composition", Box::new(CompositionHasher));
registry.register_hasher("kmer3", Box::new(KmerHasher::new(3)));
registry.register_hasher("custom-numeric", Box::new(CustomNumericHasher));
registry
}- Legacy Compatibility: Match existing tool formats
- Domain-Specific: Specialized algorithms for specific organisms
- Research: Experimental hashing strategies
- Performance: Optimized for specific hardware or datasets
- Compliance: Meet specific regulatory or institutional requirements
- Deterministic: Ensure same sequence always produces same hash
- Collision-Resistant: Minimize hash collisions for your use case
- Performance: Consider computational overhead
- Validation: Implement robust input validation
- Documentation: Provide clear usage examples and limitations
The plugin architecture makes cgDist highly extensible while maintaining backward compatibility with existing workflows.
See the complete working example:
# Run the custom hasher demonstration
cargo run --example custom_hasher
# Output shows different hashers applied to test sequences:
# π cgDist Custom Hasher Examples
# ===================================
#
# π Available Hashers:
# β’ crc32: Fast CRC32 checksum (chewBBACA compatible)
# β’ composition: Nucleotide composition-based hasher (A/T/G/C counts)
# β’ kmer3: K-mer composition hasher (sorted k-mers)
# β’ polynomial: Polynomial rolling hash for sequences
#
# 𧬠Testing hasher: composition
# Description: Nucleotide composition-based hasher (A/T/G/C counts)
# ATCGATCGATCG β A3T3G3C3
# AAATTTGGGCCC β A3T3G3C3
# ATGCATGCATGC β A3T3G3C3This example demonstrates practical implementation patterns for:
- Composition-based hashing: Count nucleotide frequencies
- K-mer analysis: Extract and sort sequence k-mers
- Polynomial hashing: Mathematical sequence encoding
- Error handling: Validation and missing data management
use cgdist::{DistanceCalculator, Config};
// Create calculator with custom config
let config = Config::new()
.hasher("crc32")
.threads(8)
.cache_enabled(true);
let calculator = DistanceCalculator::new(config);
// Calculate distances
let distances = calculator.calculate_from_file("sequences.fasta")?;import subprocess
import pandas as pd
# Run cgdist from Python
result = subprocess.run([
'cgdist',
'--schema', 'schema_dir/',
'--profiles', 'profiles.tsv',
'--output', 'distances.tsv',
'--mode', 'snps-indel-bases'
], capture_output=True, text=True)
# Check for errors
if result.returncode != 0:
print(f"Error: {result.stderr}")
else:
# Load results
distances = pd.read_csv('distances.tsv', sep='\t', index_col=0)
print(f"Distance matrix shape: {distances.shape}")
print(distances.head())If you use cgDist in your research, please cite our preprint:
de Ruvo, A.; Castelli, P.; Bucciacchio, A.; Mangone, I.; Mixao, V.; Borges, V.; Radomski, N.; Di Pasquale, A. (2025). cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing. bioRxiv. DOI: 10.1101/2025.10.16.682749
@article{deruvo2025cgdist,
title = {cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing},
author = {de Ruvo, Andrea and Castelli, Pierluigi and Bucciacchio, Andrea and Mangone, Iolanda and Mixao, VerΓ³nica and Borges, VΓtor and Radomski, Nicolas and Di Pasquale, Adriano},
year = {2025},
month = {October},
doi = {10.1101/2025.10.16.682749},
journal = {bioRxiv},
note = {Preprint. Software: https://github.com/genpat-it/cgDist}
}- π Comprehensive Study: cgDist-study - Complete epidemiological analysis and performance benchmarks
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: a.deruvo@izs.it
This project is licensed under the MIT License - see the LICENSE file for details.
Made with β€οΈ for the bioinformatics community