
perf(bam): add optimized RecordBuf encoding (~4.5x faster writes) #367

Draft
nh13 wants to merge 6 commits into zaeleus:master from nh13:perf/bam-encode-optimizations

Conversation

Contributor

@nh13 nh13 commented Jan 9, 2026

Add Writer::write_record_buf() for high-throughput BAM writing when using RecordBuf.

Benchmarks

cargo bench -p noodles-bam -- writer_methods

Results on Apple M3 Max:

| Method                    | Throughput |
| ------------------------- | ---------- |
| write_alignment_record()  | 173 MiB/s  |
| write_record_buf()        | 777 MiB/s  |

Optimizations

  • Bulk sequence encoding with 16-base chunks (matches htslib)
  • Bulk quality score encoding with vectorized validation
  • Bulk CIGAR encoding with capacity pre-allocation
  • Buffer size estimation to reduce reallocations

New Public API

  • Writer::write_record_buf(&header, &record) - optimized write path
  • encode_record_buf() - low-level optimized encoder
  • encode_with_prealloc() - generic encoder with pre-allocation
  • estimate_record_size() - fast size heuristic

The generic Record trait path is unchanged.

API Design Rationale

The write_record_buf method is separate from write_alignment_record because the trait-based method uses dynamic dispatch (&dyn Record), which prevents the compiler from accessing concrete slice data for bulk operations. The specialized method is necessary to achieve these performance gains without breaking the existing generic API.
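To illustrate the dispatch issue, here is a minimal sketch (not the noodles API: the trait, struct, and function names below are hypothetical). Behind a trait object, only the iterator interface is visible, forcing per-item writes; a concrete type exposes its backing slice, so the encoder can do a single bulk append.

```rust
// Hypothetical trait standing in for the generic Record abstraction.
trait QualityScores {
    fn iter_scores(&self) -> Box<dyn Iterator<Item = u8> + '_>;
}

// Hypothetical concrete buffer type standing in for RecordBuf.
struct BufScores(Vec<u8>);

impl QualityScores for BufScores {
    fn iter_scores(&self) -> Box<dyn Iterator<Item = u8> + '_> {
        Box::new(self.0.iter().copied())
    }
}

// Generic path: through &dyn, only the iterator is reachable,
// so encoding proceeds one byte at a time.
fn encode_dyn(dst: &mut Vec<u8>, src: &dyn QualityScores) {
    for s in src.iter_scores() {
        dst.push(s); // one call per byte
    }
}

// Specialized path: the concrete slice enables a single bulk copy.
fn encode_concrete(dst: &mut Vec<u8>, src: &BufScores) {
    dst.extend_from_slice(&src.0); // effectively one memcpy
}
```

Both paths produce identical bytes; only the specialized one gives the compiler a contiguous slice to work with.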

Future Work

Similar optimizations could be applied to other formats (CRAM, VCF writers) if profiling indicates encoding is a bottleneck. This PR focuses on BAM as it's the most commonly written format in high-throughput pipelines.

nh13 added 5 commits January 8, 2026 21:52
Add estimate_record_size() and encode_with_prealloc() functions to reduce
memory reallocations during BAM record encoding.

The estimate_record_size() function provides a fast heuristic estimate of
encoded record size based on sequence length, avoiding expensive iteration
over CIGAR and auxiliary data fields. The estimate is intentionally generous
to minimize buffer reallocations.

encode_with_prealloc() uses this estimate to reserve buffer capacity before
encoding, reducing Vec reallocations during batch encoding operations.

Performance improvement: approximately 2-3% throughput increase when encoding
many records.

- Add estimate_record_size() with documentation and examples
- Add encode_with_prealloc() with documentation and examples
- Add unit tests for both functions
- Add Criterion benchmarks for encoding performance
- Export new functions from record::codec module
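A heuristic in the spirit of estimate_record_size() might look like the sketch below. The constants and exact formula here are illustrative assumptions, not noodles' own; the point is that everything derives from lengths that are O(1) to read, with a generous aux-data allowance so the estimate rarely undershoots.

```rust
// Illustrative size heuristic: all terms come from cheaply-available lengths,
// avoiding iteration over CIGAR ops or auxiliary fields.
fn estimate_record_size(name_len: usize, n_cigar_ops: usize, seq_len: usize) -> usize {
    const FIXED_FIELDS: usize = 32; // fixed-width BAM fields (refID..tlen etc.)
    const AUX_GUESS: usize = 64;    // generous allowance for auxiliary data

    FIXED_FIELDS
        + name_len + 1        // read name + NUL terminator
        + n_cigar_ops * 4     // one u32 per CIGAR op
        + (seq_len + 1) / 2   // 4-bit encoded bases, two per byte
        + seq_len             // one quality score per base
        + AUX_GUESS
}

// Reserving with the estimate before encoding avoids repeated Vec growth.
fn encode_with_prealloc(dst: &mut Vec<u8>, name_len: usize, n_ops: usize, payload: &[u8]) {
    dst.reserve(estimate_record_size(name_len, n_ops, payload.len()));
    dst.extend_from_slice(payload); // stands in for the real encode step
}
```

Overshooting slightly is fine: unused capacity is cheaper than a mid-encode reallocation.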
Add write_quality_scores_from_slice() function that bypasses trait-based
iteration for encoding quality scores, using bulk operations instead.

The optimization works by:
1. Validating all scores in a single pass (LLVM can auto-vectorize this)
2. Using extend_from_slice() for a single memcpy instead of per-byte writes

This eliminates dynamic dispatch overhead and enables better compiler
optimizations for quality score encoding.

Performance improvement: approximately 30-35% throughput increase for
quality score encoding compared to the trait-based iterator version.

- Add write_quality_scores_from_slice() with documentation
- Add unit tests including boundary validation
- Add test verifying output matches trait-based encoder
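The two-step approach described above can be sketched as follows (a simplified stand-in, not the noodles function; the error type is an assumption). The validation loop touches only the input slice, which keeps it a tight, branch-light pass LLVM can auto-vectorize, and the copy is a single bulk append.

```rust
// Sketch: validate in one pass, then append in one bulk copy.
// 0..=93 is the Phred score range representable in BAM.
fn write_quality_scores_from_slice(dst: &mut Vec<u8>, scores: &[u8]) -> Result<(), String> {
    // Pass 1: single validation sweep over the input slice.
    if let Some(&bad) = scores.iter().find(|&&s| s > 93) {
        return Err(format!("invalid quality score: {bad}"));
    }
    // Pass 2: one extend_from_slice (effectively a memcpy) instead of per-byte pushes.
    dst.extend_from_slice(scores);
    Ok(())
}
```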
Add write_sequence_from_slice() function that encodes DNA sequences using
16-base chunked processing, matching htslib's strategy for optimal
performance.

The optimization works by:
1. Processing 16 bases at a time (producing 8 output bytes per chunk)
2. Using unrolled loop operations for better CPU pipelining
3. Handling remainder bases in 2-base pairs
4. Padding final odd base with zero in lower nibble

This chunked approach provides:
- Reduced loop overhead by processing more data per iteration
- Improved CPU pipelining through unrolled operations
- Potential SIMD auto-vectorization opportunities
- Better cache locality from contiguous memory access

The 16-base chunking matches htslib's NN=16 strategy (sam.c:621-636) for
cross-tool consistency.

Performance improvement: approximately 15-20% throughput increase compared
to the per-base iterator version.

- Add write_sequence_from_slice() with comprehensive documentation
- Move CODES lookup table to module-level const for shared access
- Add unit tests for various sequence lengths and edge cases
- Add tests verifying output matches trait-based encoder
- Add tests for all 16 standard base codes
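The chunking scheme above can be sketched as below (a simplified illustration, not noodles' encoder). The base-to-code mapping follows the BAM spec's 4-bit order (=ACMGRSVTWYHKDBN); two bases pack into one byte, high nibble first, and the fixed-size inner loop over a 16-base chunk is what gives the compiler room to unroll.

```rust
// 4-bit base codes per the BAM spec order: =ACMGRSVTWYHKDBN.
fn base_code(b: u8) -> u8 {
    match b {
        b'=' => 0, b'A' => 1, b'C' => 2, b'M' => 3,
        b'G' => 4, b'R' => 5, b'S' => 6, b'V' => 7,
        b'T' => 8, b'W' => 9, b'Y' => 10, b'H' => 11,
        b'K' => 12, b'D' => 13, b'B' => 14, _ => 15, // N and anything else
    }
}

fn write_sequence_from_slice(dst: &mut Vec<u8>, seq: &[u8]) {
    dst.reserve((seq.len() + 1) / 2);

    // Main loop: 16 bases -> 8 output bytes per chunk; the fixed trip
    // count lets the compiler unroll and pipeline the inner loop.
    let mut chunks = seq.chunks_exact(16);
    for chunk in &mut chunks {
        for pair in chunk.chunks_exact(2) {
            dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
        }
    }

    // Remainder: handle leftover bases in 2-base pairs.
    let rem = chunks.remainder();
    let mut pairs = rem.chunks_exact(2);
    for pair in &mut pairs {
        dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
    }

    // Final odd base: pad the lower nibble with zero.
    if let [last] = pairs.remainder() {
        dst.push(base_code(*last) << 4);
    }
}
```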
Add write_cigar_from_slice() for bulk CIGAR encoding and encode_record_buf()
which combines all optimizations for maximum performance when encoding
RecordBuf instances.

write_cigar_from_slice():
- Bypasses trait-based iterator overhead
- Pre-reserves capacity for all CIGAR operations
- Direct slice iteration for better performance

encode_record_buf():
- Specialized encoder for RecordBuf type
- Combines all bulk encoding optimizations:
  - Buffer pre-allocation
  - Bulk sequence encoding with 16-base chunks
  - Bulk quality score encoding with vectorized validation
  - Bulk CIGAR encoding with pre-allocated capacity
- Directly accesses underlying slices instead of trait objects

Performance improvement: approximately 40-50% throughput increase compared
to the generic trait-based encoder when encoding RecordBuf instances.

- Add write_cigar_from_slice() with documentation
- Add encode_record_buf() with comprehensive documentation
- Export encode_record_buf from record::codec module
- Add unit tests verifying output matches generic encoder
- Add tests for various sequence lengths and configurations
- Add Criterion benchmarks comparing generic vs optimized paths
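The bulk CIGAR path can be sketched as below (illustrative, not the noodles function; the `(length, op_code)` tuple input is an assumption). Per the BAM spec, each op packs into a little-endian u32 with the length in the upper 28 bits and the op code (MIDNSHP=X mapping to 0..=8) in the lower 4; reserving capacity up front avoids growth inside the loop.

```rust
// Pack each CIGAR op as (len << 4 | op) into a little-endian u32,
// reserving all needed capacity before the loop.
fn write_cigar_from_slice(dst: &mut Vec<u8>, ops: &[(u32, u8)]) {
    dst.reserve(ops.len() * 4);
    for &(len, op) in ops {
        let packed = len << 4 | u32::from(op & 0x0f);
        dst.extend_from_slice(&packed.to_le_bytes());
    }
}
```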
Add a specialized write_record_buf() method to the BAM Writer that uses
the optimized encode_record_buf() function for maximum write throughput
when working with RecordBuf instances.

The standard write_alignment_record() method uses trait objects and cannot
automatically detect when a RecordBuf is being written. This new method
provides an explicit opt-in for users who want maximum performance.

Performance comparison:
- write_alignment_record(): ~170 MiB/s
- write_record_buf():       ~760 MiB/s (4.5x faster)

Usage:
```rust
let mut writer = bam::io::Writer::new(io::sink());
writer.write_header(&header)?;

// Use optimized path for RecordBuf
writer.write_record_buf(&header, &record)?;
```

- Add Writer::write_record_buf() with documentation
- Add tests verifying output matches generic writer
- Add test for round-trip read/write
- Add benchmark comparing write_alignment_record vs write_record_buf
@nh13 nh13 force-pushed the perf/bam-encode-optimizations branch from 0c3f5a3 to 46487c6 on January 9, 2026 04:54
Add test_encode_record_buf_with_oversized_cigar to verify the optimized
encoder correctly handles records with more than 65535 CIGAR operations.

Also enhance Writer::write_record_buf documentation with guidance on when
to use each encoding method.