BacTaxID is a high-performance bacterial sub-genus classification tool written in Rust that uses sketching algorithms for rapid and accurate classification of bacterial genomes. The tool implements a BinDash-like algorithm, a sketching technique derived from BinHash, to estimate genome similarity efficiently, with better speed and accuracy than traditional MinHash approaches.
- BinDash-like Algorithm: Assigns each hash directly to a bin instead of maintaining a heap, avoiding the O(log S) heap updates of bottom-k MinHash for a sketch of size S
- Rolling-Hash Streaming: Efficient processing of FASTA files with ntHash (https://github.com/bcgsc/ntHash)
- Parallel Processing: Leverages Rayon for multi-threaded sketch comparison and analysis
- DuckDB Backend: Fast analytical database optimized for genomic data queries
- Dynamic Level Support: Configurable taxonomic levels with custom ANI thresholds
- Sketch Serialization: Efficient storage and retrieval of sketch objects using bincode
- Hierarchical Classification: Multi-level taxonomic assignment (L_0 to L_N)
- ANI-based Thresholds: Average Nucleotide Identity calculations for similarity assessment
- Graph-based Analysis: Clique detection using PetGraph for community identification
- Clique Detection: Identifies maximal cliques for taxonomic grouping
- Connectivity Analysis: Distance-based edge filtering with configurable thresholds
- Community Detection: Automatic discovery of taxonomic clusters
- Reference Management: Dynamic reference database with size-based optimization
Sketching Engine (src/sketch/sketching.rs):
- BinDash sketch generation from FASTA files
- Jaccard distance calculation converted to ANI
- SketchManager for batch operations
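The Jaccard-to-ANI step can be illustrated with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)) with ANI ≈ 1 - D. This is a minimal sketch under the assumption that a Mash-style conversion is used; the function name `jaccard_to_ani` is hypothetical and the conversion in src/sketch/sketching.rs may differ.
```rust
/// Convert a Jaccard similarity estimate `jaccard` (0.0..=1.0) into an
/// approximate ANI for k-mers of length `kmer_size`.
/// Assumption: Mash-style conversion; not taken from the BacTaxID source.
pub fn jaccard_to_ani(jaccard: f64, kmer_size: u32) -> f64 {
    if jaccard <= 0.0 {
        return 0.0; // no shared k-mers: treat the genomes as unrelated
    }
    if jaccard >= 1.0 {
        return 1.0; // identical sketches
    }
    // Mash distance: D = -(1/k) * ln(2j / (1 + j))
    let distance = -(2.0 * jaccard / (1.0 + jaccard)).ln() / f64::from(kmer_size);
    (1.0 - distance).clamp(0.0, 1.0)
}
```
With `kmer_size = 21`, a Jaccard similarity of 0.5 corresponds to an ANI of roughly 0.981, which shows why high ANI thresholds require sketches large enough to resolve small Jaccard differences.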
Database Layer (src/db/db.rs):
- Unified taxonomy storage schema
- Batch operations for high-throughput analysis
- Automatic indexing and optimization
Graph Analysis (src/graph/graph.rs):
- PetGraph-based clique detection
- Edge management and filtering
- Community analysis algorithms
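To make the edge filtering and clique detection steps concrete, the following is a minimal sketch that uses only the standard library rather than PetGraph. It filters edges by an ANI threshold and enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting). The names `build_graph` and `maximal_cliques` are hypothetical, and the algorithm in src/graph/graph.rs may differ.
```rust
use std::collections::BTreeSet;

/// Build an undirected adjacency list over `n` nodes, keeping only edges
/// whose ANI is at least `min_ani`. Node indices must be smaller than `n`.
pub fn build_graph(n: usize, edges: &[(usize, usize, f64)], min_ani: f64) -> Vec<BTreeSet<usize>> {
    let mut adj = vec![BTreeSet::new(); n];
    for &(a, b, ani) in edges {
        if a != b && ani >= min_ani {
            adj[a].insert(b);
            adj[b].insert(a);
        }
    }
    adj
}

/// Basic Bron-Kerbosch recursion: `r` is the current clique, `p` the
/// candidates that can extend it, `x` the nodes already explored.
fn bron_kerbosch(
    adj: &[BTreeSet<usize>],
    r: &mut Vec<usize>,
    mut p: BTreeSet<usize>,
    mut x: BTreeSet<usize>,
    out: &mut Vec<Vec<usize>>,
) {
    if p.is_empty() && x.is_empty() {
        out.push(r.clone()); // nothing can extend `r`, so it is maximal
        return;
    }
    let candidates: Vec<usize> = p.iter().copied().collect();
    for v in candidates {
        r.push(v);
        let p_next: BTreeSet<usize> = p.intersection(&adj[v]).copied().collect();
        let x_next: BTreeSet<usize> = x.intersection(&adj[v]).copied().collect();
        bron_kerbosch(adj, r, p_next, x_next, out);
        r.pop();
        p.remove(&v);
        x.insert(v);
    }
}

/// Enumerate all maximal cliques of the graph.
pub fn maximal_cliques(adj: &[BTreeSet<usize>]) -> Vec<Vec<usize>> {
    let mut out = Vec::new();
    let p: BTreeSet<usize> = (0..adj.len()).collect();
    bron_kerbosch(adj, &mut Vec::new(), p, BTreeSet::new(), &mut out);
    out
}
```
Requiring every pair of members to be connected is what prevents chaining: two genomes end up in the same clique only if they are directly similar, not merely linked through intermediates.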
Command Interface (src/commands/):
- CLI-based workflow management
- TOML configuration support
- Batch processing capabilities
BacTaxID uses TOML configuration files for flexible parameter management:
genus = "Escherichia"
acronym = "EC"
levels = "[0.95, 0.98, 0.99, 0.999, 0.9999]"
kmer_size = 21
sketch_size = 1000
click_size = 8
click_threshold = 0.8
reference_size = 100
Requirements:
- Rust: 2021 edition or later
- System Memory: Minimum 2GB RAM (configurable)
- Storage: SSD recommended for optimal I/O performance
- CPU: Multi-core processor recommended for parallel processing
git clone https://github.com/irycisBioinfo/BacTaxID
cd BacTaxID
cargo build --release
- Initialize Database:
./target/release/bactaxid init --config config.toml --db bacteria.db
- Add Reference Genomes:
./target/release/bactaxid update --db bacteria.db --files reference_list.txt
- Classify New Samples:
./target/release/bactaxid classify --db bacteria.db --query sample.fasta
BacTaxID implements the BinDash-like algorithm, which provides several advantages over traditional MinHash approaches:
- Direct Binning: Places each hash directly into one of S bins, avoiding the O(log S) heap updates of bottom-k MinHash
- Better Accuracy: Improved estimation of Jaccard similarities
- Memory Efficiency: Reduced memory footprint compared to traditional sketching
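The direct-binning idea in the list above can be sketched as follows. This is a simplified illustration, not the BacTaxID implementation: it uses FNV-1a as a stand-in for the ntHash rolling hash, ignores reverse complements, and omits the densification of empty bins that BinDash performs. The names `sketch` and `jaccard` are hypothetical.
```rust
/// Marker for a bin that no k-mer has landed in.
const EMPTY: u64 = u64::MAX;

/// FNV-1a hash of a k-mer (stand-in for ntHash; case-insensitive).
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in kmer {
        h ^= u64::from(b.to_ascii_uppercase());
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Build a sketch with `sketch_size` bins. Each k-mer hash selects one bin,
/// and the bin keeps the smallest value seen: constant work per k-mer, no heap.
pub fn sketch(seq: &[u8], kmer_size: usize, sketch_size: usize) -> Vec<u64> {
    let mut bins = vec![EMPTY; sketch_size];
    if kmer_size == 0 || sketch_size == 0 || seq.len() < kmer_size {
        return bins;
    }
    let s = sketch_size as u64;
    for kmer in seq.windows(kmer_size) {
        let h = hash_kmer(kmer);
        let (bin, value) = ((h % s) as usize, h / s);
        if value < bins[bin] {
            bins[bin] = value;
        }
    }
    bins
}

/// Estimate Jaccard similarity as the fraction of matching bins among the
/// bins that are filled in at least one of the two sketches.
pub fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    let (mut matches, mut used) = (0usize, 0usize);
    for (&x, &y) in a.iter().zip(b) {
        if x == EMPTY && y == EMPTY {
            continue; // jointly empty bins carry no information
        }
        used += 1;
        if x == y {
            matches += 1;
        }
    }
    if used == 0 { 0.0 } else { matches as f64 / used as f64 }
}
```
Because every sketch has exactly `sketch_size` slots, two sketches are compared position by position, which suits parallel, batch comparison.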
The tool uses a hierarchical approach with configurable ANI thresholds:
- Hierarchical Classification: Builds a self-explanatory classification across N ANI levels.
- Best Hit Analysis: Identifies the closest matches in the reference database.
- Robust Reference Database: Identifies reference communities while avoiding the chaining effect.
- Confidence Assessment: Evaluates classification reliability.
- Clique Detection: Groups similar organisms into taxonomic clusters.
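As a rough illustration of how a best hit and a set of ANI thresholds combine into a level assignment, consider the sketch below. It assumes the thresholds are sorted in ascending order, as in the `levels` configuration example, and that a query shares a group with its best hit at every level whose threshold its ANI meets. The functions `best_hit` and `deepest_level` are hypothetical and leave out the clique and reference-management logic of the actual tool.
```rust
/// Pick the reference with the highest ANI to the query (the "best hit").
pub fn best_hit<'a>(hits: &'a [(&'a str, f64)]) -> Option<(&'a str, f64)> {
    hits.iter().copied().max_by(|a, b| a.1.total_cmp(&b.1))
}

/// Return the index of the deepest level whose ANI threshold is met,
/// or None if even the first level (L_0) is not reached.
/// Assumes `levels` is sorted in ascending order.
pub fn deepest_level(ani: f64, levels: &[f64]) -> Option<usize> {
    levels
        .iter()
        .take_while(|&&threshold| ani >= threshold)
        .count()
        .checked_sub(1)
}
```
For the thresholds 0.95, 0.98, 0.99, 0.999 and 0.9999, a best-hit ANI of 0.985 reaches L_1 but not L_2, so under these assumptions the query would share its L_0 and L_1 assignment with that reference and need a new group from L_2 downward.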
We welcome contributions to BacTaxID! Please see our contributing guidelines and feel free to submit issues or pull requests.
BacTaxID is released under the GPL-3.0 license. See LICENSE file for details.
If you use BacTaxID in your research, please cite:
[Citation information will be added upon publication]
- Development Team: irycisBioinfo
- Maintainers: Val F. Lanza, Miguel Diez Fernandez de Bobadilla
- Issues: Please use GitHub Issues for bug reports and feature requests
BacTaxID - Empowering bacterial taxonomy identification through advanced computational methods and optimized performance.
