col-bwt

               /\                 /\                          /\     
              /**\           /\  /**\                        /**\    
             /****\   /\    /**\/****\                    __/****\     
            /      \ /**\__/   /      \                  /        \ 
           /  /\    /    \    /        \ _       _      /      _   \__
          /  /  \  /      \  /  ___ ___ | |     | |____/     _| |_    \ 
         /  /    \/ /\     \/  / __/ _ \| |_____| '_ \ \ /\ / / __|    \ 
        /  /      \/  \/\   \ | (_| (_) | |_____| |_) \ V  V /| |_      \ 
      _/__/_______/___/__\___\_\___\___/|_|     |_.__/ \_/\_/  \__|______\_

                                                        Release Version 1.0.0

Pangenomic index computing pseudo matching lengths and chain staistics by using multi-MUMs.

Description

Full paper on bioRxiv.

We identify sub-runs of the BWT which correspond to multi-maximal unique matches (multi-MUMs) with respect to sequences in a collection. Multi-MUMs are identified efficiently using mumemto, then mapped to the BWT using our algorithm to build an index using Movi [1]. This supports computing matching statistics or its derivative pseudo matching lengths [2] in $O(m)$-time for a pattern $P[1..m]$, at the same time outputting their chain statistics describing which multi-MUM the match occurs in. By choosing to mark sub-runs which can be tunneled [3] we construct an index in $O(r+n/d)$-space; $d$ is the number of documents (sequences) in the collection, $n$ the length of the concatenated collection, and $r$ the number of runs of its $BWT$. In pangenome contexts, $n/d$ is the average genome length, and does not grow substantially with the number of sequences.

Chain statistics describe which conserved region an exact match falls in. Remarkably, this allows for distinct matches to be “chained” (i.e. found to either be co-linear or not co-linear with respect to the collection) in linear time. With sufficient coverage, this improves read classification accuracy: on 64 human haplotypes of HPRC, ~80% of bases belong to a multi-MUM. This allows us to color seed hits, shown as peaks below, among those found by computing matching statistics.

Our index uses col as it abbreviates co-linearity (not column). A col is also the lowest point on a ridge between two peaks, which describes our index aptly. In construction, we carve runs from a larger range due to their heightened similarity. In application, we bridge gaps between peaks when they appear to be of the same landscape.

How-to

Download and Compile

git clone https://github.com/drnatebrown/col-bwt.git
cd col-bwt

mkdir build && cd build
cmake ..
make

Commands

Python3 is required to run the col-bwt script. See help for available commands:

./col-bwt -h

Build Command:

Use -h for help

./col-bwt build [-h] [-i INPUT] -o OUTPUT [-r] [-m MODE] [-s SUB_SAMPLE] [-l MIN_MUM]
                     [-v] [--force] [--keep] [--clean]
                     [fastas ...]

Example:

./col-bwt build -o ./data/index_files -r -m tunnels -s 10 ./data/seq1.fa ./data/seq2.fa

Builds a col-bwt stored in directory ./data/index_files, using multi-MUM tunnels and sub-sampling rate 10, over the documents seq1.fa,seq2.fa and their reverse compliment.

Input Format: -i is an override expecting a path to a file of genomes, one per line, with their space delimited "class", e.g.

/path/to/seq1.fa 1
/path/to/seq2.fa 2
/path/to/seq3.fa 3
/path/to/seq4.fa 4

If run with positional arguments, a file [PREFIX]_filelist.txt is generated. It is highly recommended to use -i with this file when running the build script again. Since many steps are skipped if intermediate files are already generated, this ensures they are to the expected dataset.

Query Command:

Use -h for help

col-bwt query [-h] -p PATTERN index

Example:

./col-bwt query -o -p ./data/pattern.fa ./data/index_files

Generates pattern.fa.split.pml.bin and pattern.fa.cid.bin, the pseudo matching lengths and chain statistics respectively.

Thirdparty

* denotes fork modified for this tool

Citation

If using col-bwt or chain statistics in an academic setting please cite:

Brown, N. K., Shivakumar, V. S., & Langmead, B. (2024). Improved pangenomic classification accuracy with chain statistics. bioRxiv. doi:10.1101/2024.10.29.620953

References

[1] Zakeri, M., Brown, N. K., Ahmed, O. Y., Gagie, T., & Langmead, B. (2023). Movi: a fast and cache-efficient full-text pangenome index. bioRxiv.
[2] Uwe Baier (2018). On Undetected Redundancy in the Burrows-Wheeler Transform. In Annual Symposium on Combinatorial Pattern Matching, CPM 2018, 3:1–3:15.
[3] Ahmed, O., Rossi, M., Kovaka, S., Schatz, M. C., Gagie, T., Boucher, C., & Langmead, B. (2021). Pan-genomic matching statistics for targeted nanopore sequencing. Iscience, 24(6).

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
include		include
scripts		scripts
src		src
thirdparty		thirdparty
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

col-bwt

Description

How-to

Download and Compile

Commands

Build Command:

Query Command:

Thirdparty

Citation

References

About

Uh oh!

Releases

Packages

Languages

License

drnatebrown/col-bwt

Folders and files

Latest commit

History

Repository files navigation

col-bwt

Description

How-to

Download and Compile

Commands

Build Command:

Query Command:

Thirdparty

Citation

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages