feat(bgzf): add position tracking to MultithreadedWriter #371

Draft
nh13 wants to merge 2 commits into zaeleus:master from nh13:bgzf-position-tracking

nh13 (Contributor) commented Jan 24, 2026

Summary

Adds position tracking to MultithreadedWriter to enable building BAM indexes during multi-threaded compression.

Motivation

When using MultithreadedWriter for parallel BAM compression, there's currently no way to determine the compressed file positions needed for BAI/CSI index construction. The standard Writer allows position tracking through its synchronous API, but MultithreadedWriter compresses and writes blocks asynchronously, so a block's final file position is not yet known when its records are written to the staging buffer.

This change enables building indexes during multi-threaded writes by:

  1. Assigning sequential block numbers when blocks are sent for compression
  2. Sending notifications (via channel) when blocks complete with their final compressed positions
  3. Allowing callers to cache index entries with block numbers, then resolve positions when notifications arrive
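
The mechanism above can be sketched roughly as follows. This is a toy model, not the code in this PR: `std::sync::mpsc` stands in for whatever channel type the writer actually uses, `write_blocks` and `fake_compressed_len` are hypothetical names, the `BlockInfo` field types are assumed, and the "compression" is a made-up size function rather than real DEFLATE.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the PR's `BlockInfo`; the field types here are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockInfo {
    pub block_number: u64,
    pub compressed_start: u64,
    pub compressed_size: u64,
    pub uncompressed_size: u64,
}

/// Placeholder for real compression: returns an invented compressed length.
fn fake_compressed_len(uncompressed_len: usize) -> u64 {
    (uncompressed_len / 2 + 26) as u64
}

/// Hypothetical driver: numbers blocks sequentially on the caller side and
/// collects one notification per block from a mock writer thread.
pub fn write_blocks(blocks: Vec<Vec<u8>>) -> Vec<BlockInfo> {
    let (work_tx, work_rx) = mpsc::channel::<(u64, Vec<u8>)>();
    let (info_tx, info_rx) = mpsc::channel::<BlockInfo>();

    // Mock writer thread: tracks the running compressed position and sends a
    // notification after each block is "written".
    let writer = thread::spawn(move || {
        let mut position = 0u64;
        for (block_number, data) in work_rx {
            let compressed_size = fake_compressed_len(data.len());
            let info = BlockInfo {
                block_number,
                compressed_start: position,
                compressed_size,
                uncompressed_size: data.len() as u64,
            };
            position += compressed_size;
            if info_tx.send(info).is_err() {
                break; // receiver was dropped; nobody is listening
            }
        }
    });

    // Caller side: assign a sequential block number when a block is handed off.
    let mut next_block_number = 0u64;
    for data in blocks {
        work_tx
            .send((next_block_number, data))
            .expect("writer thread exited early");
        next_block_number += 1;
    }
    drop(work_tx); // closing the work channel lets the writer thread finish

    let infos: Vec<BlockInfo> = info_rx.iter().collect();
    writer.join().expect("writer thread panicked");
    infos
}
```

In this model each block's `compressed_start` is simply the sum of the compressed sizes of all earlier blocks, which is why the position can only be known once the preceding blocks have been written.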

Use Case

This supports the parallel BAM processing pipeline I'm building (related to #364). The workflow:

  1. Process records, caching index entries with (block_number, uncompressed_offset)
  2. Flush blocks to compression (receive block number)
  3. Receive BlockInfo notifications when blocks are written
  4. Resolve cached index entries to final (compressed_position, uncompressed_offset) virtual offsets
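
For step 4, a BGZF virtual offset packs the compressed block start into the upper 48 bits and the within-block uncompressed offset into the lower 16 bits. The sketch below shows the resolution step under that encoding. `PendingEntry`, `virtual_offset`, and `resolve` are hypothetical names for illustration, and the map of block starts is assumed to have been filled from the block notifications.

```rust
use std::collections::HashMap;

/// Hypothetical cached index entry: which block a record landed in and where
/// it starts inside that block's uncompressed data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PendingEntry {
    pub block_number: u64,
    pub uncompressed_offset: u16,
}

/// Packs a BGZF virtual offset: compressed position in the high 48 bits,
/// uncompressed offset in the low 16 bits.
pub fn virtual_offset(compressed_position: u64, uncompressed_offset: u16) -> u64 {
    (compressed_position << 16) | u64::from(uncompressed_offset)
}

/// Resolves one cached entry, or returns `None` if the notification for its
/// block has not arrived yet.
pub fn resolve(entry: PendingEntry, block_starts: &HashMap<u64, u64>) -> Option<u64> {
    block_starts
        .get(&entry.block_number)
        .map(|&start| virtual_offset(start, entry.uncompressed_offset))
}
```

Entries that resolve to `None` would stay in the cache until the matching notification is received.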

New Public API

  • BlockInfo - struct with block_number, compressed_start, compressed_size, uncompressed_size
  • BlockInfoRx - type alias for the receiver channel
  • block_info_receiver() - get the notification receiver
  • current_block_number() - next block number to be assigned
  • blocks_written() - count of blocks fully written
  • position() - current compressed file position
  • buffer_offset() - bytes in staging buffer since last flush
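
One possible reading of how these accessors relate to each other, as a single-threaded toy model. This is an interpretation of the descriptions above, not the PR's implementation: `TrackerModel` and its `stage`, `flush`, and `block_written` methods are hypothetical, and only the four getter names come from the list.

```rust
/// Toy model of the counters behind the accessors. Assumed semantics only.
#[derive(Debug, Default)]
pub struct TrackerModel {
    next_block_number: u64,
    blocks_written: u64,
    position: u64,
    buffer_offset: usize,
}

impl TrackerModel {
    /// Hypothetical: bytes are appended to the staging buffer.
    pub fn stage(&mut self, len: usize) {
        self.buffer_offset += len;
    }

    /// Hypothetical: the staging buffer is handed off for compression.
    /// Returns the block number assigned to it.
    pub fn flush(&mut self) -> u64 {
        let assigned = self.next_block_number;
        self.next_block_number += 1;
        self.buffer_offset = 0;
        assigned
    }

    /// Hypothetical: the writer finished writing one compressed block.
    pub fn block_written(&mut self, compressed_size: u64) {
        self.blocks_written += 1;
        self.position += compressed_size;
    }

    pub fn current_block_number(&self) -> u64 {
        self.next_block_number
    }

    pub fn blocks_written(&self) -> u64 {
        self.blocks_written
    }

    pub fn position(&self) -> u64 {
        self.position
    }

    pub fn buffer_offset(&self) -> usize {
        self.buffer_offset
    }
}
```

Under this reading, `current_block_number()` can run ahead of `blocks_written()` while blocks are still in flight, and `position()` only advances once a block has actually been written.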

Alternatives Considered

  • Post-hoc index building: Requires re-reading the BAM file after writing, doubling I/O for large files
  • Single-threaded Writer: Works but sacrifices the compression parallelism that MultithreadedWriter provides

Test Plan

  • Added unit test covering position tracking through multiple blocks
  • Existing tests pass

Add block-level position tracking to enable building BAM indexes during
multi-threaded BGZF compression. This follows the htslib pattern of
tracking block positions as they are written.

New public API:
- BlockInfo: Block completion info (block_number, compressed_start,
  compressed_size, uncompressed_size)
- block_info_receiver(): Get receiver for block completion notifications
- current_block_number(): Get next block number to be written
- blocks_written(): Get number of blocks fully written
- position(): Get current compressed file position
- buffer_offset(): Get current uncompressed buffer offset

The writer thread sends BlockInfo through an unbounded channel after
each block is written, allowing callers to build indexes with accurate
virtual positions without requiring per-record flushes.
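
Because the channel is unbounded, a caller can poll it between records instead of blocking on every flush. A minimal sketch of such a non-blocking drain, assuming a `std::sync::mpsc` receiver and a simplified `(block_number, compressed_start)` payload in place of the full BlockInfo; `drain_ready` is a hypothetical helper:

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

/// Moves every notification that is already queued into `block_starts`
/// without blocking. Returns how many notifications were consumed.
pub fn drain_ready(
    rx: &Receiver<(u64, u64)>,
    block_starts: &mut HashMap<u64, u64>,
) -> usize {
    let mut drained = 0;
    // `try_iter` yields queued messages and stops as soon as the queue is
    // empty, so this never waits for the writer thread.
    for (block_number, compressed_start) in rx.try_iter() {
        block_starts.insert(block_number, compressed_start);
        drained += 1;
    }
    drained
}
```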