kmerust

A fast, parallel k-mer counter for DNA sequences in FASTA and FASTQ files.

Features

Fast parallel processing using rayon and dashmap
FASTA and FASTQ support with automatic format detection from file extension
Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
Flexible k-mer lengths from 1 to 32
Handles N bases by skipping invalid k-mers
Jellyfish-compatible output format for easy integration with existing pipelines
Tested for accuracy against Jellyfish
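
To make the canonical k-mer and N-handling behavior above concrete, here is a minimal standard-library sketch. It is not kmerust's implementation: the function names (`reverse_complement`, `canonical`, `count_canonical`) are hypothetical, and it works on plain byte strings rather than packed k-mers.

```rust
use std::collections::HashMap;

/// Reverse complement of a DNA k-mer. Returns None if any base is not
/// A, C, G, or T (for example N). Case is ignored.
fn reverse_complement(kmer: &[u8]) -> Option<Vec<u8>> {
    kmer.iter()
        .rev()
        .map(|base| match base.to_ascii_uppercase() {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None,
        })
        .collect()
}

/// Canonical form: the lexicographically smaller of the uppercased k-mer
/// and its reverse complement.
fn canonical(kmer: &[u8]) -> Option<Vec<u8>> {
    let forward = kmer.to_ascii_uppercase();
    let reverse = reverse_complement(&forward)?;
    Some(forward.min(reverse))
}

/// Count canonical k-mers in one sequence. Windows containing an invalid
/// base are skipped, so they never reach the map.
fn count_canonical(seq: &[u8], k: usize) -> HashMap<Vec<u8>, u64> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // `windows(0)` would panic
    }
    for window in seq.windows(k) {
        if let Some(kmer) = canonical(window) {
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
    counts
}
```

With this sketch, the sequence `ACGTN` at k = 2 yields `AC` twice (once from `AC`, once from its reverse complement `GT`) and `CG` once. The window `TN` is dropped.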

Installation

From crates.io

cargo install kmerust

From source

git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .

Usage

kmerust <k> <path>

Arguments

<k> - K-mer length (1-32)
<path> - Path to a FASTA or FASTQ file (use - or omit for stdin)

Options

-f, --format <FORMAT> - Output format: fasta (default), tsv, json, or histogram
-i, --input-format <FORMAT> - Input format: auto (default), fasta, or fastq
-m, --min-count <N> - Minimum count threshold (default: 1)
-Q, --min-quality <N> - Minimum Phred quality score for FASTQ (0-93); k-mers containing any base below this are skipped
--save <PATH> - Save k-mer counts to a binary index file for fast querying
-q, --quiet - Suppress informational output
-h, --help - Print help information
-V, --version - Print version information

Examples

Count 21-mers in a FASTA file:

kmerust 21 sequences.fa > kmers.txt

Count 21-mers in a FASTQ file (format auto-detected):

kmerust 21 reads.fq > kmers.txt

Count 5-mers:

kmerust 5 sequences.fa > kmers.txt

Unix Pipeline Integration

kmerust supports reading from stdin, enabling seamless integration with Unix pipelines:

# Pipe from another command
cat genome.fa | kmerust 21

# Decompress and count
zcat large.fa.gz | kmerust 21 --format tsv > counts.tsv

# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17

# Explicit stdin marker
cat genome.fa | kmerust 21 -

# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq --format tsv > counts.tsv

Output Formats

Use --format to choose the output format:

# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv

# JSON format
kmerust 21 sequences.fa --format json

# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta

# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram

Histogram Output

The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:

kmerust 21 genome.fa --format histogram > spectrum.tsv

Output is tab-separated with columns count and frequency:

1       1523456    # 1.5M k-mers appear exactly once (likely errors)
2       234567     # 234K k-mers appear twice
10      45678      # 45K k-mers appear 10 times
...
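
The "count of counts" idea can be sketched in a few lines of standard-library Rust. This is an illustration only: `frequency_spectrum` and `to_tsv` are hypothetical names, and the ascending sort order is an assumption about the output rather than something the tool documents.

```rust
use std::collections::{BTreeMap, HashMap};

/// Build a frequency spectrum: for each count value, how many distinct
/// k-mers occur exactly that many times.
fn frequency_spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut spectrum = BTreeMap::new();
    for &count in counts.values() {
        *spectrum.entry(count).or_insert(0) += 1;
    }
    spectrum
}

/// Render the spectrum as tab-separated `count<TAB>frequency` lines.
/// BTreeMap iteration is ordered by key, so rows are ascending by count.
fn to_tsv(spectrum: &BTreeMap<u64, u64>) -> String {
    spectrum
        .iter()
        .map(|(count, frequency)| format!("{count}\t{frequency}\n"))
        .collect()
}
```

For example, if two k-mers occur once and one k-mer occurs three times, the spectrum has two rows: `1<TAB>2` and `3<TAB>1`.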

Quality Filtering (FASTQ)

For FASTQ files, use --min-quality to filter out k-mers containing low-quality bases:

# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20

# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv

K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
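
As a rough sketch of this rule, the function below keeps only the k-mers whose bases all meet the threshold. It assumes Phred+33 quality encoding (the common FASTQ convention); `high_quality_kmers` is a hypothetical helper, not part of the kmerust API.

```rust
/// Return the k-mers of `seq` in which every base has a Phred quality of
/// at least `min_quality`. `qual` is the FASTQ quality line for `seq`,
/// assumed here to be Phred+33 encoded (ASCII code minus 33).
fn high_quality_kmers<'a>(
    seq: &'a [u8],
    qual: &[u8],
    k: usize,
    min_quality: u8,
) -> Vec<&'a [u8]> {
    // A record whose sequence and quality lengths differ is malformed;
    // this sketch simply yields nothing for it.
    if k == 0 || seq.len() != qual.len() {
        return Vec::new();
    }
    seq.windows(k)
        .zip(qual.windows(k))
        .filter(|(_, q)| q.iter().all(|&c| c.saturating_sub(33) >= min_quality))
        .map(|(kmer, _)| kmer)
        .collect()
}
```

With sequence `ACGTA`, qualities `IIII#` (Q40 four times, then Q2), k = 3 and a threshold of 20, the windows `ACG` and `CGT` are kept, and `GTA` is dropped because its last base falls below the threshold.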

Index Serialization

For large genomes, save k-mer counts to a binary index file to avoid re-counting:

# Count and save to index
kmerust 21 genome.fa --save counts.kmix

# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix --format tsv > counts.tsv

The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the .gz extension:

# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz

Querying a Saved Index

Use the query subcommand to look up k-mer counts from a saved index:

# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)

# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta  # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT  # Reverse complement, same result

The query k-mer length must match the index's k value.

Sequence Readers

kmerust supports two sequence readers via feature flags, both supporting FASTA and FASTQ:

rust-bio (default) - Uses the rust-bio library
needletail - Uses the needletail library

To use needletail instead:

cargo run --release --no-default-features --features needletail -- 21 sequences.fa

With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (.fa, .fasta, .fna for FASTA; .fq, .fastq for FASTQ).
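
The two detection strategies can be illustrated as follows. This is a simplified sketch, not the logic of either reader: `DetectedFormat`, `detect_from_extension`, and `detect_from_content` are hypothetical names, case-insensitive extension matching is an assumption, and the content check relies only on the fact that FASTA records start with `>` and FASTQ records with `@`.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum DetectedFormat {
    Fasta,
    Fastq,
}

/// Extension-based detection, using the extensions listed above.
fn detect_from_extension(path: &Path) -> Option<DetectedFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "fa" | "fasta" | "fna" => Some(DetectedFormat::Fasta),
        "fq" | "fastq" => Some(DetectedFormat::Fastq),
        _ => None,
    }
}

/// Content-based detection from the first byte of the input.
fn detect_from_content(data: &[u8]) -> Option<DetectedFormat> {
    match *data.first()? {
        b'>' => Some(DetectedFormat::Fasta),
        b'@' => Some(DetectedFormat::Fastq),
        _ => None,
    }
}
```

Extension-based detection has nothing to work with when the input comes from stdin, which is why the pipeline examples pass --input-format fastq explicitly. This sketch also does not handle compound extensions such as .fa.gz.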

Production Features

Enable production features for additional capabilities:

cargo build --release --features production

Or enable individual features:

gzip - Read gzip-compressed FASTA files (.fa.gz)
mmap - Memory-mapped I/O for large files
tracing - Structured logging and diagnostics

Gzip Compressed Input

With the gzip feature, kmerust can directly read gzip-compressed files:

cargo run --release --features gzip -- 21 sequences.fa.gz

Tracing/Logging

With the tracing feature, use the RUST_LOG environment variable for diagnostic output:

RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa

Output Format

By default (--format fasta), output is written to stdout in a FASTA-like format:

>{count}
{canonical_kmer}

Example output:

>114928
ATGCC
>289495
AATCA
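
If you need to consume this output from another Rust program, a parser for the two-line records might look like the sketch below. `parse_counts` is a hypothetical helper, not part of kmerust. It assumes each record is exactly a `>{count}` line followed by a k-mer line, and it ignores blank lines.

```rust
/// Parse FASTA-like `>{count}` / `{kmer}` records into (k-mer, count) pairs.
fn parse_counts(output: &str) -> Result<Vec<(String, u64)>, String> {
    let mut records = Vec::new();
    let mut lines = output.lines().filter(|line| !line.trim().is_empty());
    while let Some(header) = lines.next() {
        // The header line carries the count after a leading '>'.
        let count = header
            .strip_prefix('>')
            .ok_or_else(|| format!("expected '>' header, found {header:?}"))?
            .trim()
            .parse::<u64>()
            .map_err(|err| format!("invalid count in {header:?}: {err}"))?;
        // The next non-blank line is the k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| format!("missing k-mer after header {header:?}"))?;
        records.push((kmer.trim().to_string(), count));
    }
    Ok(records)
}
```

Applied to the example output above, this returns the pairs `("ATGCC", 114928)` and `("AATCA", 289495)` in input order.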

Library Usage

kmerust can also be used as a library:

use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}

Explicit Format Selection

When using the builder API, you can explicitly specify the input format:

use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;
    Ok(())
}

Progress Reporting

Monitor progress during long-running operations:

use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}

Memory-Mapped I/O

For large files, use memory-mapped I/O (requires mmap feature):

use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

Streaming API

For memory-efficient processing:

use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

Reading from Any Source

Count k-mers from any BufRead source, including stdin or in-memory data:

use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;

    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;

    Ok(())
}

Performance

kmerust uses parallel processing to efficiently count k-mers:

Sequences are processed in parallel using rayon
A concurrent hash map (dashmap) allows many threads to update counts at once; it shards the map so threads rarely contend for the same lock
FxHash provides fast hashing for 64-bit packed k-mers
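
The 64-bit packing mentioned above also explains the upper limit of k = 32: at 2 bits per base, 32 bases fill a `u64` exactly. The sketch below shows one possible encoding (A=00, C=01, G=10, T=11). The bit layout and the names `pack_kmer` and `unpack_kmer` are assumptions for illustration; kmerust's internal representation may differ.

```rust
/// Pack a k-mer of length 1..=32 into a u64, 2 bits per base
/// (A=00, C=01, G=10, T=11). Returns None for invalid bases or lengths.
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let bits = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift earlier bases left and append the new base in the low bits.
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Unpack the lowest `k` bases (capped at 32) back into a string.
fn unpack_kmer(packed: u64, k: usize) -> String {
    (0..k.min(32))
        .rev()
        .map(|i| match (packed >> (2 * i)) & 0b11 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        })
        .collect()
}
```

With this encoding, `ACGT` packs to the bits `00 01 10 11`, which is 27. Because A < C < G < T holds for the 2-bit codes as well, comparing two packed k-mers of the same length as integers gives the same order as comparing them lexicographically.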

License

MIT License - see LICENSE for details.
