A fast, parallel k-mer counter for DNA sequences in FASTA and FASTQ files.
- Fast parallel processing using rayon and dashmap
- FASTA and FASTQ support with automatic format detection from file extension
- Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
- Flexible k-mer lengths from 1 to 32
- Handles N bases by skipping invalid k-mers
- Jellyfish-compatible output format for easy integration with existing pipelines
- Tested for accuracy against Jellyfish
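To make the canonical k-mer and N-handling bullets concrete, here is a minimal sketch of the idea. It is not kmerust's actual implementation: the function names `complement`, `canonical`, and `canonical_kmers` are hypothetical, and the sketch assumes uppercase input, treating every other character as invalid.

```rust
/// Complement of one uppercase DNA base, or None for anything else (such as N).
fn complement(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None,
    }
}

/// Canonical form of a k-mer: the lexicographically smaller of the k-mer
/// and its reverse complement. Returns None if any base is not A, C, G, or T.
fn canonical(kmer: &[u8]) -> Option<Vec<u8>> {
    let revcomp = kmer
        .iter()
        .rev()
        .map(|&b| complement(b))
        .collect::<Option<Vec<u8>>>()?;
    Some(if revcomp.as_slice() < kmer { revcomp } else { kmer.to_vec() })
}

/// Canonical k-mers of a sequence, skipping windows that contain an invalid base.
/// `k` must be at least 1 (`windows(0)` panics).
fn canonical_kmers(seq: &[u8], k: usize) -> Vec<String> {
    seq.windows(k)
        .filter_map(canonical)
        .map(|kmer| String::from_utf8(kmer).expect("k-mers are ASCII"))
        .collect()
}

fn main() {
    // Windows of "ACGNT" with k = 2 are AC, CG, GN, NT; the last two contain N.
    for kmer in canonical_kmers(b"ACGNT", 2) {
        println!("{kmer}");
    }
}
```

Because a k-mer and its reverse complement map to the same canonical string, both strands of a sequence contribute to a single count.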
cargo install kmerust

git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .

kmerust <k> <path>
- <k> - K-mer length (1-32)
- <path> - Path to a FASTA or FASTQ file (use - or omit for stdin)
- -f, --format <FORMAT> - Output format: fasta (default), tsv, json, or histogram
- -i, --input-format <FORMAT> - Input format: auto (default), fasta, or fastq
- -m, --min-count <N> - Minimum count threshold (default: 1)
- -Q, --min-quality <N> - Minimum Phred quality score for FASTQ (0-93); k-mers containing bases below this are skipped
- --save <PATH> - Save k-mer counts to a binary index file for fast querying
- -q, --quiet - Suppress informational output
- -h, --help - Print help information
- -V, --version - Print version information
Count 21-mers in a FASTA file:
kmerust 21 sequences.fa > kmers.txt

Count 21-mers in a FASTQ file (format auto-detected):
kmerust 21 reads.fq > kmers.txt

Count 5-mers:
kmerust 5 sequences.fa > kmers.txt

kmerust supports reading from stdin, enabling seamless integration with Unix pipelines:
# Pipe from another command
cat genome.fa | kmerust 21
# Decompress and count
zcat large.fa.gz | kmerust 21 > counts.tsv
# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17
# Explicit stdin marker
cat genome.fa | kmerust 21 -
# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq > counts.tsv

Use --format to choose the output format:
# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv
# JSON format
kmerust 21 sequences.fa --format json
# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta
# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram

The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:
kmerust 21 genome.fa --format histogram > spectrum.tsv

Output is tab-separated with columns count and frequency:
1 1523456 # 1.5M k-mers appear exactly once (likely errors)
2 234567 # 234K k-mers appear twice
10 45678 # 45K k-mers appear 10 times
...
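As an illustration of what "count of counts" means, the following sketch derives a spectrum from a map of k-mer counts. The `spectrum` function and the toy counts are hypothetical and only show the transformation; they are not taken from kmerust's source.

```rust
use std::collections::{BTreeMap, HashMap};

/// For each count value, record how many distinct k-mers occur that many times.
/// A BTreeMap keeps the rows sorted by count, matching the output shown above.
fn spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &count in counts.values() {
        *hist.entry(count).or_insert(0) += 1;
    }
    hist
}

fn main() {
    // Toy data: two k-mers seen once, one k-mer seen three times.
    let counts = HashMap::from([
        ("AAC".to_string(), 1),
        ("ACG".to_string(), 1),
        ("CGT".to_string(), 3),
    ]);
    for (count, frequency) in spectrum(&counts) {
        println!("{count}\t{frequency}");
    }
}
```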
For FASTQ files, use --min-quality to filter out k-mers containing low-quality bases:
# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20
# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv

K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
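The filtering rule can be sketched as follows. This is an illustration, not kmerust's code: it assumes the common Phred+33 encoding (which is consistent with the 0-93 range), and `kmers_passing_quality` is a hypothetical helper name.

```rust
/// Offset between an ASCII quality character and its Phred score (Phred+33 assumed).
const PHRED_OFFSET: u8 = 33;

/// Return the k-mers of `seq` in which every base has quality >= `min_quality`.
/// `k` must be at least 1 (`windows(0)` panics).
fn kmers_passing_quality<'a>(
    seq: &'a [u8],
    qual: &[u8],
    k: usize,
    min_quality: u8,
) -> Vec<&'a [u8]> {
    assert_eq!(seq.len(), qual.len(), "sequence and quality must have equal length");
    seq.windows(k)
        .zip(qual.windows(k))
        .filter(|(_, q)| q.iter().all(|&c| c.saturating_sub(PHRED_OFFSET) >= min_quality))
        .map(|(kmer, _)| kmer)
        .collect()
}

fn main() {
    // Under Phred+33, 'I' encodes Q40 and '#' encodes Q2.
    // With k = 2 and a threshold of 20, the windows CG and GT touch the Q2 base.
    for kmer in kmers_passing_quality(b"ACGTA", b"II#II", 2, 20) {
        println!("{}", String::from_utf8_lossy(kmer));
    }
}
```

A single low-quality base removes every k-mer that overlaps it, so one bad position can drop up to k k-mers from a read.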
For large genomes, save k-mer counts to a binary index file to avoid re-counting:
# Count and save to index
kmerust 21 genome.fa --save counts.kmix
# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix > counts.tsv

The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the .gz extension:
# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz

Use the query subcommand to look up k-mer counts from a saved index:
# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)
# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT # Reverse complement, same result

The query k-mer length must match the index's k value.
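The index description mentions CRC32 checksums for integrity verification. The sketch below shows how such a check works in general. It assumes the standard CRC-32 (IEEE 802.3) variant and says nothing about the actual layout of a .kmix file; `crc32` and `verify` are hypothetical helpers written with the standard library only.

```rust
/// Bitwise CRC-32 using the IEEE 802.3 polynomial in reflected form (0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compare a payload against a stored checksum, as a reader might before trusting it.
fn verify(payload: &[u8], stored_crc: u32) -> bool {
    crc32(payload) == stored_crc
}

fn main() {
    let payload = b"123456789";
    let stored = crc32(payload);
    println!("crc32 = {stored:#010X}");

    // Flip one bit to simulate corruption on disk.
    let mut corrupted = payload.to_vec();
    corrupted[0] ^= 0x01;
    println!("intact ok: {}", verify(payload, stored));
    println!("corrupted ok: {}", verify(&corrupted, stored));
}
```

A checksum mismatch tells the reader that the file was truncated or altered, so the counts should be regenerated rather than queried.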
kmerust supports two sequence readers via feature flags, both supporting FASTA and FASTQ:
- rust-bio (default) - Uses the rust-bio library
- needletail - Uses the needletail library
To use needletail instead:
cargo run --release --no-default-features --features needletail -- 21 sequences.fa

With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (.fa, .fasta, .fna for FASTA; .fq, .fastq for FASTQ).
Enable production features for additional capabilities:
cargo build --release --features production

Or enable individual features:
- gzip - Read gzip-compressed FASTA files (.fa.gz)
- mmap - Memory-mapped I/O for large files
- tracing - Structured logging and diagnostics
With the gzip feature, kmerust can directly read gzip-compressed files:
cargo run --release --features gzip -- 21 sequences.fa.gz

With the tracing feature, use the RUST_LOG environment variable for diagnostic output:
RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa

Output is written to stdout in FASTA-like format:
>{count}
{canonical_kmer}
Example output:
>114928
ATGCC
>289495
AATCA
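Downstream tools sometimes need to read this format back. Here is a minimal sketch of a parser for the `>{count}` / `{canonical_kmer}` line pairs shown above. The `parse_counts` function is hypothetical and not part of kmerust's API; it assumes well-formed input with no blank lines.

```rust
use std::io::{self, BufRead};

/// Parse `>{count}` / `{kmer}` line pairs into (kmer, count) tuples.
fn parse_counts<R: BufRead>(reader: R) -> io::Result<Vec<(String, u64)>> {
    let mut records = Vec::new();
    let mut lines = reader.lines();
    while let Some(header) = lines.next() {
        let header = header?;
        // The header line carries the count after a leading '>'.
        let count = header
            .strip_prefix('>')
            .and_then(|n| n.trim().parse::<u64>().ok())
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "expected a count header"))?;
        // The following line carries the k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "missing k-mer line"))??;
        records.push((kmer.trim().to_string(), count));
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    let output = b">114928\nATGCC\n>289495\nAATCA\n";
    for (kmer, count) in parse_counts(&output[..])? {
        println!("{kmer}\t{count}");
    }
    Ok(())
}
```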
kmerust can also be used as a library:
use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}

When using the builder API, you can explicitly specify the input format:
use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;
    Ok(())
}

Monitor progress during long-running operations:
use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}

For large files, use memory-mapped I/O (requires mmap feature):
use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

For memory-efficient processing:
use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

Count k-mers from any BufRead source, including stdin or in-memory data:
use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;

    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;
    Ok(())
}

kmerust uses parallel processing to efficiently count k-mers:
- Sequences are processed in parallel using rayon
- A concurrent hash map (dashmap) lets many threads update counts at once, using sharded locking instead of a single global lock
- FxHash provides fast hashing for 64-bit packed k-mers
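The 1-32 limit on k follows from packing: at 2 bits per base, 32 bases fill exactly one 64-bit integer. The sketch below shows one way to do this. The specific encoding (A=00, C=01, G=10, T=11) and the `pack` / `unpack` names are assumptions for illustration; the README does not state which encoding kmerust uses.

```rust
/// Pack a k-mer of at most 32 bases into a u64 at 2 bits per base
/// (A=00, C=01, G=10, T=11). Returns None for longer k-mers or invalid bases.
fn pack(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    kmer.iter().try_fold(0u64, |acc, &base| {
        let bits = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        Some((acc << 2) | bits)
    })
}

/// Unpack a 2-bit encoded k-mer of length `k` back into text.
fn unpack(mut packed: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    // The last base sits in the lowest two bits, so fill from the end.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("bases are ASCII")
}

fn main() {
    let packed = pack(b"ACGT").expect("valid k-mer");
    println!("{packed:#010b} -> {}", unpack(packed, 4));
}
```

Hashing and comparing a single integer is cheaper than doing the same for a string, which is why a fast integer hasher such as FxHash fits this representation well.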
MIT License - see LICENSE for details.