HIBF creates a very large index #370

@genomewalker

Hi

I have been trying to build an index of a large collection of microbial genomes (102,999) using HIBF, and the resulting index is much larger than the one I get when I build the same index using IBF.

The raptor version I used:

VERSION
    Last update: 2023-08-30
    Raptor version: 3.1.0-rc.1 (raptor-v3.0.0-146-gedec71b5a2c19a2203278db814b3362ddb98e9e6)
    Sharg version: 1.1.1
    SeqAn version: 3.4.0-rc.1

The layout stat file:

## ### Parameters ###
## number of user bins = 102999
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax      size
64      1.00    0.00    1.00    0.00    424.3GiB
384     1.51    3.34    1.48    4.96    630.0GiB
# Best t_max (regarding expected query runtime): 64
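
To compare the expected sizes in this table with the size of the index on disk, the two data rows can be parsed and converted to bytes. The following is a minimal Python sketch, not part of Raptor: `parse_size` and `parse_layout_stats` are hypothetical helper names, and the column order is assumed from the header line shown above.

```python
import re

# The data rows and header copied from the layout stat file above.
LAYOUT_STATS = """\
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax      size
64      1.00    0.00    1.00    0.00    424.3GiB
384     1.51    3.34    1.48    4.96    630.0GiB
"""

# Binary unit multipliers (the stat file reports sizes in GiB).
UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_size(text):
    """Convert a size string such as '424.3GiB' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)(KiB|MiB|GiB|TiB)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def parse_layout_stats(text):
    """Return one dict per data row; lines starting with '#' are skipped."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Assumed column order: tmax, c_tmax, l_tmax, m_tmax, (l*m)_tmax, size
        tmax, c_tmax, l_tmax, m_tmax, lm_tmax, size = line.split()
        rows.append({
            "tmax": int(tmax),
            "c_tmax": float(c_tmax),
            "l_tmax": float(l_tmax),
            "m_tmax": float(m_tmax),
            "lm_tmax": float(lm_tmax),
            "size_bytes": parse_size(size),
        })
    return rows


rows = parse_layout_stats(LAYOUT_STATS)
smallest = min(rows, key=lambda row: row["size_bytes"])
for row in rows:
    print(f"tmax={row['tmax']}: expected {row['size_bytes'] / 2**30:.1f} GiB")
print(f"smallest expected size at tmax={smallest['tmax']}")
```

The byte value for the chosen tmax can then be compared directly with the size of the built index file, for example as reported by `os.path.getsize`.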

The prepare, layout, and build commands I used:

raptor prepare --input genomes.lst --output genomes_k20_w20 --kmer 20 --window 20 --threads 32
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
    --determine-best-tmax --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
    --output-filename genomes_k20_w20_binning
raptor build --input genomes_k20_w20_binning --output genomes_k20_w20.index --threads 32

The final index is ~1 TB. These are the timings for building the index, which had a peak memory usage of ~3 TB:

============= Timings =============
Wall clock time [s]: 40397.13
Peak memory usage [TiB]: 2.9
Index allocation [s]: 0.00
User bin I/O avg per thread [s]: 0.00
User bin I/O sum [s]: 0.00
Merge kmer sets avg per thread [s]: 0.00
Merge kmer sets sum [s]: 0.00
Fill IBF avg per thread [s]: 0.00
Fill IBF sum [s]: 0.00
Store index [s]: 0.00

The IBF index is ~750G and required a fraction of the memory to build. Shouldn't the HIBF index be smaller than the IBF index? Any suggestions are much appreciated :-)

Thanks
Antonio
