
perf(bam): add optimized RecordBuf encoding (~4.5x faster writes) #367

Draft
nh13 wants to merge 6 commits into zaeleus:master from nh13:perf/bam-encode-optimizations

Conversation

Contributor

@nh13 nh13 commented Jan 9, 2026

Add Writer::write_record_buf() for high-throughput BAM writing when using RecordBuf.

Benchmarks

cargo bench -p noodles-bam -- writer_methods

Results on Apple M3 Max:

| Method                    | Throughput |
| ------------------------- | ---------- |
| write_alignment_record()  | 173 MiB/s  |
| write_record_buf()        | 777 MiB/s  |

Optimizations

  • Bulk sequence encoding with 16-base chunks (matches htslib)
  • Bulk quality score encoding with vectorized validation
  • Bulk CIGAR encoding with capacity pre-allocation
  • Buffer size estimation to reduce reallocations

New Public API

  • Writer::write_record_buf(&header, &record) - optimized write path
  • encode_record_buf() - low-level optimized encoder
  • encode_with_prealloc() - generic encoder with pre-allocation
  • estimate_record_size() - fast size heuristic

The generic Record trait path is unchanged.

API Design Rationale

The write_record_buf method is separate from write_alignment_record because the trait-based method uses dynamic dispatch (&dyn Record), which prevents the compiler from accessing concrete slice data for bulk operations. The specialized method is necessary to achieve these performance gains without breaking the existing generic API.
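To illustrate the dispatch issue, here is a minimal sketch (not the noodles API: the trait, struct, and function names below are hypothetical). Behind a trait object, only the iterator interface is visible, forcing per-item writes; a concrete type exposes its backing slice, so the encoder can do a single bulk append.

```rust
// Hypothetical trait standing in for the generic Record abstraction.
trait QualityScores {
    fn iter_scores(&self) -> Box<dyn Iterator<Item = u8> + '_>;
}

// Hypothetical concrete buffer type standing in for RecordBuf.
struct BufScores(Vec<u8>);

impl QualityScores for BufScores {
    fn iter_scores(&self) -> Box<dyn Iterator<Item = u8> + '_> {
        Box::new(self.0.iter().copied())
    }
}

// Generic path: through &dyn, only the iterator is reachable,
// so encoding proceeds one byte at a time.
fn encode_dyn(dst: &mut Vec<u8>, src: &dyn QualityScores) {
    for s in src.iter_scores() {
        dst.push(s); // one call per byte
    }
}

// Specialized path: the concrete slice enables a single bulk copy.
fn encode_concrete(dst: &mut Vec<u8>, src: &BufScores) {
    dst.extend_from_slice(&src.0); // effectively one memcpy
}
```

Both paths produce identical bytes; only the specialized one gives the compiler a contiguous slice to work with.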

Future Work

Similar optimizations could be applied to other formats (CRAM, VCF writers) if profiling indicates encoding is a bottleneck. This PR focuses on BAM as it's the most commonly written format in high-throughput pipelines.

nh13 added 5 commits January 8, 2026 21:52
Add estimate_record_size() and encode_with_prealloc() functions to reduce
memory reallocations during BAM record encoding.

The estimate_record_size() function provides a fast heuristic estimate of
encoded record size based on sequence length, avoiding expensive iteration
over CIGAR and auxiliary data fields. The estimate is intentionally generous
to minimize buffer reallocations.

encode_with_prealloc() uses this estimate to reserve buffer capacity before
encoding, reducing Vec reallocations during batch encoding operations.

Performance improvement: approximately 2-3% throughput increase when encoding
many records.

- Add estimate_record_size() with documentation and examples
- Add encode_with_prealloc() with documentation and examples
- Add unit tests for both functions
- Add Criterion benchmarks for encoding performance
- Export new functions from record::codec module
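A heuristic in the spirit of estimate_record_size() might look like the sketch below. The constants and exact formula here are illustrative assumptions, not noodles' own; the point is that everything derives from lengths that are O(1) to read, with a generous aux-data allowance so the estimate rarely undershoots.

```rust
// Illustrative size heuristic: all terms come from cheaply-available lengths,
// avoiding iteration over CIGAR ops or auxiliary fields.
fn estimate_record_size(name_len: usize, n_cigar_ops: usize, seq_len: usize) -> usize {
    const FIXED_FIELDS: usize = 32; // fixed-width BAM fields (refID..tlen etc.)
    const AUX_GUESS: usize = 64;    // generous allowance for auxiliary data

    FIXED_FIELDS
        + name_len + 1        // read name + NUL terminator
        + n_cigar_ops * 4     // one u32 per CIGAR op
        + (seq_len + 1) / 2   // 4-bit encoded bases, two per byte
        + seq_len             // one quality score per base
        + AUX_GUESS
}

// Reserving with the estimate before encoding avoids repeated Vec growth.
fn encode_with_prealloc(dst: &mut Vec<u8>, name_len: usize, n_ops: usize, payload: &[u8]) {
    dst.reserve(estimate_record_size(name_len, n_ops, payload.len()));
    dst.extend_from_slice(payload); // stands in for the real encode step
}
```

Overshooting slightly is fine: unused capacity is cheaper than a mid-encode reallocation.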
Add write_quality_scores_from_slice() function that bypasses trait-based
iteration for encoding quality scores, using bulk operations instead.

The optimization works by:
1. Validating all scores in a single pass (LLVM can auto-vectorize this)
2. Using extend_from_slice() for a single memcpy instead of per-byte writes

This eliminates dynamic dispatch overhead and enables better compiler
optimizations for quality score encoding.

Performance improvement: approximately 30-35% throughput increase for
quality score encoding compared to the trait-based iterator version.

- Add write_quality_scores_from_slice() with documentation
- Add unit tests including boundary validation
- Add test verifying output matches trait-based encoder
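The two-step approach described above can be sketched as follows (a simplified stand-in, not the noodles function; the error type is an assumption). The validation loop touches only the input slice, which keeps it a tight, branch-light pass LLVM can auto-vectorize, and the copy is a single bulk append.

```rust
// Sketch: validate in one pass, then append in one bulk copy.
// 0..=93 is the Phred score range representable in BAM.
fn write_quality_scores_from_slice(dst: &mut Vec<u8>, scores: &[u8]) -> Result<(), String> {
    // Pass 1: single validation sweep over the input slice.
    if let Some(&bad) = scores.iter().find(|&&s| s > 93) {
        return Err(format!("invalid quality score: {bad}"));
    }
    // Pass 2: one extend_from_slice (effectively a memcpy) instead of per-byte pushes.
    dst.extend_from_slice(scores);
    Ok(())
}
```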
Add write_sequence_from_slice() function that encodes DNA sequences using
16-base chunked processing, matching htslib's strategy for optimal
performance.

The optimization works by:
1. Processing 16 bases at a time (producing 8 output bytes per chunk)
2. Using unrolled loop operations for better CPU pipelining
3. Handling remainder bases in 2-base pairs
4. Padding final odd base with zero in lower nibble

This chunked approach provides:
- Reduced loop overhead by processing more data per iteration
- Improved CPU pipelining through unrolled operations
- Potential SIMD auto-vectorization opportunities
- Better cache locality from contiguous memory access

The 16-base chunking matches htslib's NN=16 strategy (sam.c:621-636) for
cross-tool consistency.

Performance improvement: approximately 15-20% throughput increase compared
to the per-base iterator version.

- Add write_sequence_from_slice() with comprehensive documentation
- Move CODES lookup table to module-level const for shared access
- Add unit tests for various sequence lengths and edge cases
- Add tests verifying output matches trait-based encoder
- Add tests for all 16 standard base codes
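The chunking scheme above can be sketched as below (a simplified illustration, not noodles' encoder). The base-to-code mapping follows the BAM spec's 4-bit order (=ACMGRSVTWYHKDBN); two bases pack into one byte, high nibble first, and the fixed-size inner loop over a 16-base chunk is what gives the compiler room to unroll.

```rust
// 4-bit base codes per the BAM spec order: =ACMGRSVTWYHKDBN.
fn base_code(b: u8) -> u8 {
    match b {
        b'=' => 0, b'A' => 1, b'C' => 2, b'M' => 3,
        b'G' => 4, b'R' => 5, b'S' => 6, b'V' => 7,
        b'T' => 8, b'W' => 9, b'Y' => 10, b'H' => 11,
        b'K' => 12, b'D' => 13, b'B' => 14, _ => 15, // N and anything else
    }
}

fn write_sequence_from_slice(dst: &mut Vec<u8>, seq: &[u8]) {
    dst.reserve((seq.len() + 1) / 2);

    // Main loop: 16 bases -> 8 output bytes per chunk; the fixed trip
    // count lets the compiler unroll and pipeline the inner loop.
    let mut chunks = seq.chunks_exact(16);
    for chunk in &mut chunks {
        for pair in chunk.chunks_exact(2) {
            dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
        }
    }

    // Remainder: handle leftover bases in 2-base pairs.
    let rem = chunks.remainder();
    let mut pairs = rem.chunks_exact(2);
    for pair in &mut pairs {
        dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
    }

    // Final odd base: pad the lower nibble with zero.
    if let [last] = pairs.remainder() {
        dst.push(base_code(*last) << 4);
    }
}
```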
Add write_cigar_from_slice() for bulk CIGAR encoding and encode_record_buf()
which combines all optimizations for maximum performance when encoding
RecordBuf instances.

write_cigar_from_slice():
- Bypasses trait-based iterator overhead
- Pre-reserves capacity for all CIGAR operations
- Direct slice iteration for better performance

encode_record_buf():
- Specialized encoder for RecordBuf type
- Combines all bulk encoding optimizations:
  - Buffer pre-allocation
  - Bulk sequence encoding with 16-base chunks
  - Bulk quality score encoding with vectorized validation
  - Bulk CIGAR encoding with pre-allocated capacity
- Directly accesses underlying slices instead of trait objects

Performance improvement: approximately 40-50% throughput increase compared
to the generic trait-based encoder when encoding RecordBuf instances.

- Add write_cigar_from_slice() with documentation
- Add encode_record_buf() with comprehensive documentation
- Export encode_record_buf from record::codec module
- Add unit tests verifying output matches generic encoder
- Add tests for various sequence lengths and configurations
- Add Criterion benchmarks comparing generic vs optimized paths
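The bulk CIGAR path can be sketched as below (illustrative, not the noodles function; the `(length, op_code)` tuple input is an assumption). Per the BAM spec, each op packs into a little-endian u32 with the length in the upper 28 bits and the op code (MIDNSHP=X mapping to 0..=8) in the lower 4; reserving capacity up front avoids growth inside the loop.

```rust
// Pack each CIGAR op as (len << 4 | op) into a little-endian u32,
// reserving all needed capacity before the loop.
fn write_cigar_from_slice(dst: &mut Vec<u8>, ops: &[(u32, u8)]) {
    dst.reserve(ops.len() * 4);
    for &(len, op) in ops {
        let packed = len << 4 | u32::from(op & 0x0f);
        dst.extend_from_slice(&packed.to_le_bytes());
    }
}
```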
Add a specialized write_record_buf() method to the BAM Writer that uses
the optimized encode_record_buf() function for maximum write throughput
when working with RecordBuf instances.

The standard write_alignment_record() method uses trait objects and cannot
automatically detect when a RecordBuf is being written. This new method
provides an explicit opt-in for users who want maximum performance.

Performance comparison:
- write_alignment_record(): ~170 MiB/s
- write_record_buf():       ~760 MiB/s (4.5x faster)

Usage:
```rust
let mut writer = bam::io::Writer::new(io::sink());
writer.write_header(&header)?;

// Use optimized path for RecordBuf
writer.write_record_buf(&header, &record)?;
```

- Add Writer::write_record_buf() with documentation
- Add tests verifying output matches generic writer
- Add test for round-trip read/write
- Add benchmark comparing write_alignment_record vs write_record_buf
@nh13 nh13 force-pushed the perf/bam-encode-optimizations branch from 0c3f5a3 to 46487c6 on January 9, 2026 04:54
Add test_encode_record_buf_with_oversized_cigar to verify the optimized
encoder correctly handles records with more than 65535 CIGAR operations.

Also enhance Writer::write_record_buf documentation with guidance on when
to use each encoding method.