kmerust

A fast, parallel k-mer counter for DNA sequences in FASTA and FASTQ files.

Features

  • Fast parallel processing using rayon and dashmap
  • FASTA and FASTQ support with automatic format detection from file extension
  • Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
  • Flexible k-mer lengths from 1 to 32
  • Handles N bases by skipping invalid k-mers
  • Jellyfish-compatible output format for easy integration with existing pipelines
  • Tested for accuracy against Jellyfish
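
As a rough illustration of what canonical counting and N-skipping mean, the sketch below counts canonical k-mers in a single sequence using only the standard library. It is not kmerust's implementation: the helper names (`complement`, `canonical`, `count_canonical`) are hypothetical, and it stores k-mers as strings in a plain `HashMap` rather than in parallel, packed form.

```rust
use std::collections::HashMap;

/// Complement of one uppercase DNA base; `None` for anything else (such as `N`).
fn complement(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None,
    }
}

/// Canonical form: the lexicographically smaller of the k-mer and its
/// reverse complement. Returns `None` if any base is not A, C, G, or T.
fn canonical(kmer: &[u8]) -> Option<Vec<u8>> {
    let forward: Vec<u8> = kmer.iter().map(|b| b.to_ascii_uppercase()).collect();
    let reverse: Vec<u8> = forward
        .iter()
        .rev()
        .map(|&b| complement(b))
        .collect::<Option<Vec<u8>>>()?;
    Some(if reverse < forward { reverse } else { forward })
}

/// Count canonical k-mers in one sequence, skipping windows with invalid bases.
fn count_canonical(seq: &[u8], k: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // `windows(0)` would panic
    }
    for window in seq.windows(k) {
        if let Some(kmer) = canonical(window) {
            let key = String::from_utf8(kmer).expect("bases are ASCII");
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // ACG, CGT, and the final ACG all canonicalize to ACG;
    // the three windows that touch N are skipped.
    let counts = count_canonical(b"ACGTNACG", 3);
    for (kmer, count) in &counts {
        println!("{kmer}: {count}");
    }
}
```

Because a k-mer and its reverse complement map to the same key, both strands of a read contribute to one count.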

Installation

From crates.io

cargo install kmerust

From source

git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .

Usage

kmerust <k> <path>

Arguments

  • <k> - K-mer length (1-32)
  • <path> - Path to a FASTA or FASTQ file (use - or omit for stdin)

Options

  • -f, --format <FORMAT> - Output format: fasta (default), tsv, json, or histogram
  • -i, --input-format <FORMAT> - Input format: auto (default), fasta, or fastq
  • -m, --min-count <N> - Minimum count threshold (default: 1)
  • -Q, --min-quality <N> - Minimum Phred quality score for FASTQ (0-93); bases below this are skipped
  • --save <PATH> - Save k-mer counts to a binary index file for fast querying
  • -q, --quiet - Suppress informational output
  • -h, --help - Print help information
  • -V, --version - Print version information

Examples

Count 21-mers in a FASTA file:

kmerust 21 sequences.fa > kmers.txt

Count 21-mers in a FASTQ file (format auto-detected):

kmerust 21 reads.fq > kmers.txt

Count 5-mers:

kmerust 5 sequences.fa > kmers.txt

Unix Pipeline Integration

kmerust reads from stdin when the path is omitted or given as -, so it can be used inside Unix pipelines:

# Pipe from another command
cat genome.fa | kmerust 21

# Decompress and count
zcat large.fa.gz | kmerust 21 --format tsv > counts.tsv

# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17

# Explicit stdin marker
cat genome.fa | kmerust 21 -

# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq --format tsv > counts.tsv

Output Formats

Use --format to choose the output format:

# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv

# JSON format
kmerust 21 sequences.fa --format json

# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta

# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram

Histogram Output

The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:

kmerust 21 genome.fa --format histogram > spectrum.tsv

Output is tab-separated with columns count and frequency:

1       1523456    # 1.5M k-mers appear exactly once (likely errors)
2       234567     # 234K k-mers appear twice
10      45678      # 45K k-mers appear 10 times
...
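
The "count of counts" idea can be sketched as follows: given per-k-mer counts, tally how many distinct k-mers share each count value. This is a minimal standard-library sketch, not kmerust's code; the function name `spectrum` and the sample counts are made up for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Build a frequency spectrum: for each count value, the number of
/// distinct k-mers that occur exactly that many times.
fn spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &count in counts.values() {
        *hist.entry(count).or_insert(0) += 1;
    }
    hist
}

fn main() {
    // Hypothetical k-mer counts, for illustration only.
    let counts: HashMap<String, u64> = vec![("AAA", 1), ("ACG", 1), ("CCG", 2), ("GAT", 10)]
        .into_iter()
        .map(|(kmer, n)| (kmer.to_string(), n))
        .collect();

    // A BTreeMap iterates in ascending key order, so rows come out sorted by count.
    for (count, frequency) in spectrum(&counts) {
        println!("{count}\t{frequency}");
    }
}
```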

Quality Filtering (FASTQ)

For FASTQ files, use --min-quality to filter out k-mers containing low-quality bases:

# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20

# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv

K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
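
The filtering rule above can be sketched as below. The sketch assumes Phred+33 quality encoding (the usual convention for modern FASTQ files, consistent with the 0-93 range); the function names `phred` and `high_quality_kmers` are hypothetical and not part of kmerust.

```rust
/// Decode one FASTQ quality character to a Phred score, assuming Phred+33.
fn phred(qual_char: u8) -> u8 {
    qual_char.saturating_sub(33)
}

/// Return the k-mers of `seq` in which every base has quality >= `min_quality`.
fn high_quality_kmers<'a>(
    seq: &'a [u8],
    qual: &[u8],
    k: usize,
    min_quality: u8,
) -> Vec<&'a [u8]> {
    assert_eq!(seq.len(), qual.len(), "sequence and quality lengths must match");
    if k == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    seq.windows(k)
        .zip(qual.windows(k))
        // Keep a window only if all of its quality scores pass the threshold.
        .filter(|(_, q)| q.iter().all(|&c| phred(c) >= min_quality))
        .map(|(kmer, _)| kmer)
        .collect()
}

fn main() {
    let seq = b"ACGTAC";
    let qual = b"IIII#I"; // 'I' decodes to Q40, '#' to Q2
    // The two windows that overlap the Q2 base are dropped.
    for kmer in high_quality_kmers(seq, qual, 3, 20) {
        println!("{}", String::from_utf8_lossy(kmer));
    }
}
```

A single low-quality base removes every window that overlaps it, which is up to k k-mers.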

Index Serialization

For large genomes, save k-mer counts to a binary index file to avoid re-counting:

# Count and save to index
kmerust 21 genome.fa --save counts.kmix

# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix --format tsv > counts.tsv

The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the .gz extension:

# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz
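
To show how a checksum guards a binary file, here is a minimal sketch of CRC-32 integrity checking. The on-disk layout of the .kmix format and the exact CRC-32 variant it uses are not described here, so everything below is an assumption: the sketch uses the common IEEE polynomial and a hypothetical layout where a little-endian checksum trails the payload.

```rust
/// Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0xEDB8_8320
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

/// Append a little-endian checksum to a payload (hypothetical layout).
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Return the payload if the trailing checksum matches, otherwise `None`.
fn verify(sealed: &[u8]) -> Option<&[u8]> {
    if sealed.len() < 4 {
        return None;
    }
    let (payload, tail) = sealed.split_at(sealed.len() - 4);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) == stored {
        Some(payload)
    } else {
        None
    }
}

fn main() {
    let sealed = seal(b"example payload");
    println!("intact file accepted: {}", verify(&sealed).is_some());

    // Flip the bits of one byte to simulate corruption.
    let mut corrupted = sealed.clone();
    corrupted[0] ^= 0xFF;
    println!("corrupted file rejected: {}", verify(&corrupted).is_none());
}
```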

Querying a Saved Index

Use the query subcommand to look up k-mer counts from a saved index:

# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)

# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta  # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT  # Reverse complement, same result

The query k-mer length must match the index's k value.
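
The lookup rules (case-insensitive, canonicalized, length must equal k, 0 when absent) can be sketched against an in-memory map. This is only an illustration: the `query` and `canonical` functions and the `HashMap` index are hypothetical stand-ins, and rejecting non-ACGT characters with an error is an assumption about reasonable behavior, not a description of the real subcommand.

```rust
use std::collections::HashMap;

/// Uppercase the k-mer and return the smaller of it and its reverse complement.
/// Returns `None` if a character is not A, C, G, or T.
fn canonical(kmer: &str) -> Option<String> {
    let forward = kmer.to_ascii_uppercase();
    let reverse = forward
        .chars()
        .rev()
        .map(|c| match c {
            'A' => Some('T'),
            'C' => Some('G'),
            'G' => Some('C'),
            'T' => Some('A'),
            _ => None,
        })
        .collect::<Option<String>>()?;
    Some(if reverse < forward { reverse } else { forward })
}

/// Look up a k-mer in an index built with k-mer length `k`.
fn query(index: &HashMap<String, u64>, k: usize, kmer: &str) -> Result<u64, String> {
    if kmer.len() != k {
        return Err(format!(
            "query length {} does not match index k = {}",
            kmer.len(),
            k
        ));
    }
    let key = canonical(kmer).ok_or_else(|| format!("invalid base in query: {kmer}"))?;
    // Absent k-mers report a count of 0 rather than an error.
    Ok(index.get(&key).copied().unwrap_or(0))
}

fn main() {
    // Hypothetical 5-mer index holding one canonical k-mer.
    let mut index = HashMap::new();
    index.insert("ACGTA".to_string(), 42u64);

    for q in ["ACGTA", "acgta", "TACGT", "AAAAA", "ACG"] {
        println!("{q}: {:?}", query(&index, 5, q));
    }
}
```

Here `TACGT` is the reverse complement of `ACGTA`, so both resolve to the same stored key.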

Sequence Readers

kmerust supports two sequence readers via feature flags, rust-bio (the default) and needletail, both supporting FASTA and FASTQ.

To use needletail instead:

cargo run --release --no-default-features --features needletail -- 21 sequences.fa

With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (.fa, .fasta, .fna for FASTA; .fq, .fastq for FASTQ).
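
Extension-based detection along these lines might look like the sketch below. It is an assumption-laden illustration, not kmerust's code: the `Format` enum and `detect_format` function are hypothetical, and both the case-insensitive matching and the stripping of a trailing `.gz` are choices made for this example.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Fasta,
    Fastq,
}

/// Guess the sequence format from the file extension.
/// A trailing `.gz` is ignored, so `reads.fq.gz` is treated like `reads.fq`.
fn detect_format(path: &Path) -> Option<Format> {
    let name = path.file_name()?.to_str()?.to_ascii_lowercase();
    let stem = name.strip_suffix(".gz").unwrap_or(&name);
    let (_, ext) = stem.rsplit_once('.')?;
    match ext {
        "fa" | "fasta" | "fna" => Some(Format::Fasta),
        "fq" | "fastq" => Some(Format::Fastq),
        _ => None,
    }
}

fn main() {
    for name in ["genome.fna", "reads.fastq", "sequences.fa.gz", "notes.txt"] {
        println!("{name}: {:?}", detect_format(Path::new(name)));
    }
}
```

Stdin has no file name to inspect, so extension-based detection cannot apply to piped input.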

Production Features

Enable production features for additional capabilities:

cargo build --release --features production

Or enable individual features:

  • gzip - Read gzip-compressed FASTA files (.fa.gz)
  • mmap - Memory-mapped I/O for large files
  • tracing - Structured logging and diagnostics

Gzip Compressed Input

With the gzip feature, kmerust can directly read gzip-compressed files:

cargo run --release --features gzip -- 21 sequences.fa.gz

Tracing/Logging

With the tracing feature, use the RUST_LOG environment variable for diagnostic output:

RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa

Default Output Format

By default, output is written to stdout in a FASTA-like format, one record per canonical k-mer:

>{count}
{canonical_kmer}

Example output:

>114928
ATGCC
>289495
AATCA
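
For downstream tools, the two-line records shown above are straightforward to read back. The following is a minimal parsing sketch using only the standard library; `parse_records` is a hypothetical helper, not part of the kmerust API.

```rust
/// Parse `>{count}` / `{kmer}` line pairs into `(kmer, count)` tuples.
/// Blank lines are ignored; malformed input yields an error message.
fn parse_records(text: &str) -> Result<Vec<(String, u64)>, String> {
    let mut records = Vec::new();
    let mut lines = text.lines().filter(|line| !line.trim().is_empty());
    while let Some(header) = lines.next() {
        // The header line carries the count after a leading '>'.
        let count = header
            .strip_prefix('>')
            .ok_or_else(|| format!("expected a '>' header line, got {header:?}"))?
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("bad count in {header:?}: {e}"))?;
        // The following line carries the k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| format!("missing k-mer after {header:?}"))?;
        records.push((kmer.trim().to_string(), count));
    }
    Ok(records)
}

fn main() {
    let output = ">114928\nATGCC\n>289495\nAATCA\n";
    match parse_records(output) {
        Ok(records) => {
            for (kmer, count) in records {
                println!("{kmer}\t{count}");
            }
        }
        Err(message) => eprintln!("parse error: {message}"),
    }
}
```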

Library Usage

kmerust can also be used as a library:

use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}

Explicit Format Selection

When using the builder API, you can explicitly specify the input format:

use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;
    Ok(())
}

Progress Reporting

Monitor progress during long-running operations:

use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}

Memory-Mapped I/O

For large files, use memory-mapped I/O (requires the mmap feature):

use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

Streaming API

For memory-efficient processing:

use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

Reading from Any Source

Count k-mers from any BufRead source, including stdin or in-memory data:

use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;

    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;

    Ok(())
}

Performance

kmerust uses parallel processing to efficiently count k-mers:

  • Sequences are processed in parallel using rayon
  • A concurrent hash map (dashmap) lets many threads update counts at once; it uses fine-grained per-shard locking rather than a single global lock
  • FxHash provides fast hashing for 64-bit packed k-mers
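
The 32-base limit follows from packing: at 2 bits per base, 32 bases fill exactly one 64-bit integer. The sketch below shows one such packing. The specific encoding (A=00, C=01, G=10, T=11) and the `pack`/`unpack` helpers are assumptions for illustration; kmerust's internal encoding may differ.

```rust
/// Pack a k-mer of 1 to 32 uppercase bases into a u64, 2 bits per base.
/// Assumed encoding: A=00, C=01, G=10, T=11. Returns `None` for invalid input.
fn pack(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut bits = 0u64;
    for &base in kmer {
        let code: u64 = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        bits = (bits << 2) | code;
    }
    Some(bits)
}

/// Decode a packed k-mer of length `k` (1 to 32) back into a string.
fn unpack(bits: u64, k: usize) -> String {
    assert!((1..=32).contains(&k), "k must be between 1 and 32");
    (0..k)
        .rev() // the first base sits in the highest-order bit pair
        .map(|i| match (bits >> (2 * i)) & 0b11 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        })
        .collect()
}

fn main() {
    let packed = pack(b"ACGT").expect("valid k-mer");
    println!("ACGT packs to {packed:#010b}");
    println!("and unpacks to {}", unpack(packed, 4));
}
```

A fixed-width integer key is cheap to hash and compare, which is why a fast non-cryptographic hasher such as FxHash pairs well with this representation.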

License

MIT License - see LICENSE for details.