/\ /\ /\
/**\ /\ /**\ /**\
/****\ /\ /**\/****\ __/****\
/ \ /**\__/ / \ / \
/ /\ / \ / \ _ _ / _ \__
/ / \ / \ / ___ ___ | | | |____/ _| |_ \
/ / \/ /\ \/ / __/ _ \| |_____| '_ \ \ /\ / / __| \
/ / \/ \/\ \ | (_| (_) | |_____| |_) \ V V /| |_ \
_/__/_______/___/__\___\_\___\___/|_| |_.__/ \_/\_/ \__|______\_
Release Version 1.0.0Pangenomic index computing pseudo matching lengths and chain staistics by using multi-MUMs.
Full paper on bioRxiv.
We identify sub-runs of the BWT which correspond to multi-maximal unique matches (multi-MUMs) with respect to sequences in a collection. Multi-MUMs are identified efficiently using mumemto, then mapped to the BWT using our algorithm to build an index using Movi [1]. This supports computing matching statistics or its derivative pseudo matching lengths [2] in
Chain statistics describe which conserved region an exact match falls in. Remarkably, this allows for distinct matches to be “chained” (i.e. found to either be co-linear or not co-linear with respect to the collection) in linear time. With sufficient coverage, this improves read classification accuracy: on 64 human haplotypes of HPRC, ~80% of bases belong to a multi-MUM. This allows us to color seed hits, shown as peaks below, among those found by computing matching statistics.

Our index uses col as it abbreviates co-linearity (not column). A col is also the lowest point on a ridge between two peaks, which describes our index aptly. In construction, we carve runs from a larger range due to their heightened similarity. In application, we bridge gaps between peaks when they appear to be of the same landscape.
git clone https://github.com/drnatebrown/col-bwt.git
cd col-bwt
mkdir build && cd build
cmake ..
makePython3 is required to run the col-bwt script.
See help for available commands:
./col-bwt -hUse -h for help
./col-bwt build [-h] [-i INPUT] -o OUTPUT [-r] [-m MODE] [-s SUB_SAMPLE] [-l MIN_MUM]
[-v] [--force] [--keep] [--clean]
[fastas ...]Example:
./col-bwt build -o ./data/index_files -r -m tunnels -s 10 ./data/seq1.fa ./data/seq2.faBuilds a col-bwt stored in directory ./data/index_files, using multi-MUM tunnels and sub-sampling rate 10, over the documents seq1.fa,seq2.fa and their reverse compliment.
Input Format:
-i is an override expecting a path to a file of genomes, one per line, with their space delimited "class", e.g.
/path/to/seq1.fa 1
/path/to/seq2.fa 2
/path/to/seq3.fa 3
/path/to/seq4.fa 4If run with positional arguments, a file [PREFIX]_filelist.txt is generated. It is highly recommended to use -i with this file when running the build script again. Since many steps are skipped if intermediate files are already generated, this ensures they are to the expected dataset.
Use -h for help
col-bwt query [-h] -p PATTERN indexExample:
./col-bwt query -o -p ./data/pattern.fa ./data/index_filesGenerates pattern.fa.split.pml.bin and pattern.fa.cid.bin, the pseudo matching lengths and chain statistics respectively.
* denotes fork modified for this tool
If using col-bwt or chain statistics in an academic setting please cite:
Brown, N. K., Shivakumar, V. S., & Langmead, B. (2024). Improved pangenomic classification accuracy with chain statistics. bioRxiv. doi:10.1101/2024.10.29.620953
[1] Zakeri, M., Brown, N. K., Ahmed, O. Y., Gagie, T., & Langmead, B. (2023). Movi: a fast and cache-efficient full-text pangenome index. bioRxiv.
[2] Uwe Baier (2018). On Undetected Redundancy in the Burrows-Wheeler Transform. In Annual Symposium on Combinatorial Pattern Matching, CPM 2018, 3:1–3:15.
[3] Ahmed, O., Rossi, M., Kovaka, S., Schatz, M. C., Gagie, T., Boucher, C., & Langmead, B. (2021). Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6).