GitHub - drnatebrown/kebab: KeBaB: k-mer based breaking for finding long maximal exact matches

K-mer Based Breaking for finding long maximal exact matches.
(Release Version 1.0.1)

KeBaB breaks nucleotide DNA patterns into pseudo-MEM fragments, filtering out subsequences which cannot overlap any maximal exact match (MEM) of some minimum match length, $L$. Using fragments results in faster MEM-finding queries. The full paper can be found on arXiv.

How-to

Compile

Creates ./kebab executable:

git clone https://github.com/drnatebrown/kebab.git
cd kebab
make

Build

Build a KeBaB index (bloom filter).

Usage: ./kebab build [OPTIONS] fasta

Positionals:
  fasta TEXT REQUIRED         Input FASTA file

Options:
  -h,--help                   Print this help message and exit
  -o,--output TEXT REQUIRED   Output prefix for index file, [PREFIX].kbb
  -k,--kmer-size UINT:POSITIVE [20] 
                              K-mer size used to populate the index
  --kmer-mode ENUM:value in {both->0,canonical->1,forward->2} OR {0,1,2} [1] 
                              K-mer strands to include in the index
  -m,--expected-kmers UINT:POSITIVE
                              Expected number of k-mers (otherwise estimated)
  -e,--fp-rate FLOAT:FLOAT in [0 - 1] [0.1] 
                              Desired false positive rate (between 0 and 1)
  -f,--hash-funcs UINT:POSITIVE
                              Number of hash functions (otherwise set to minimize index size)
  -t,--threads UINT:POSITIVE [8] 
                              Number of threads to use
  --no-rounding               Don't round to power of 2 for filter size (slower)

Note that a chosen -k affects which minimum MEM lengths are valid (see below).

Scan

Breaks sequences into fragments using KeBaB index. Fragments use [SEQ]:[START]-[END] notation where the range is 1-based and inclusive.

Usage: ./kebab scan [OPTIONS] fasta

Positionals:
  fasta TEXT REQUIRED         Patterns FASTA file

Options:
  -h,--help                   Print this help message and exit
  -i,--index TEXT REQUIRED    KeBaB index file
  -o,--output TEXT REQUIRED   Output FASTA file
  -l,--mem-length UINT:POSITIVE [25] 
                              Minimum MEM length (must be greater than k-mer size of index)
  --top-t UINT:POSITIVE       Keep only top-t longest fragments
  -s,--sort                   Sort fragments by length
  -r,--remove-overlaps        Merge overlapping fragments
  -t,--threads UINT:POSITIVE [8] 
                              Number of threads to use
  --no-prefetch               Don't prefetch k-mers to avoid latency

To ensure fragments support early stopping (e.g., top t-MEMs), use -s and do not use -r.

Example Usage

Using KeBaB

./kebab build -k 20 -e 0.1 -f 1 -o ~/data/ref_index ~/data/ref.fa
./kebab scan -o ~/data/reads.frag.fa -i ~/data/ref_index.kbb -l 40 ~/data/reads.fa

In this example, all MEMs of length at least 40 are contained in the fragments found by breaking reads using KeBaB.

MEM Finding

The fragments output by a KeBaB scan can be used with MEM-finding tools such as ropebwt3:

./ropebwt3 mem -l 40 ~/data/ropebwt3_index.fmd ~/data/reads.frag.fa > ~/data/reads.frag.mems

To verify correctness, ropefix fixes output to match that of running ropebwt3 alone, removing any fragment based notation:

./ropefix ~/data/reads.frag.mems > ~/data/reads.mems

Thirdparty

KeBaB utilizes the following third-party libraries:

kseq.h - FASTA parser
CLI11 - Command line parser
hll - HyperLogLog implementation

The k-mer hash implementation is based on the original codebase/paper of ntHash.

Academic Use

If using this tool or its ideas in an academic setting, please cite:

Brown, N. K., Depuydt, L., Zakeri, M., Alhadi, A., Allam, N., Begleiter, D., Karpagavalli, N., Khajjayam, S., Wahed, H., Gagie, T., Langmead, B. (2025). KeBaB: $k$-mer based breaking for finding long MEMs. arXiv. arXiv:2502.20338

Name		Name	Last commit message	Last commit date
Latest commit History 89 Commits
include		include
src		src
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
ropefix.c		ropefix.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How-to

Compile

Build

Scan

Example Usage

Using KeBaB

MEM Finding

Thirdparty

Academic Use

About

Uh oh!

Releases 1

Packages

Languages

License

drnatebrown/kebab

Folders and files

Latest commit

History

Repository files navigation

How-to

Compile

Build

Scan

Example Usage

Using KeBaB

MEM Finding

Thirdparty

Academic Use

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages