perf(bam): add optimized RecordBuf encoding (~4.5x faster writes) #367

Draft

nh13 wants to merge 6 commits into zaeleus:master from
Conversation
Add `estimate_record_size()` and `encode_with_prealloc()` functions to reduce memory reallocations during BAM record encoding.

`estimate_record_size()` provides a fast heuristic estimate of the encoded record size based on sequence length, avoiding expensive iteration over the CIGAR and auxiliary data fields. The estimate is intentionally generous to minimize buffer reallocations. `encode_with_prealloc()` uses this estimate to reserve buffer capacity before encoding, reducing `Vec` reallocations during batch encoding operations.

Performance improvement: approximately 2-3% throughput increase when encoding many records.

- Add `estimate_record_size()` with documentation and examples
- Add `encode_with_prealloc()` with documentation and examples
- Add unit tests for both functions
- Add Criterion benchmarks for encoding performance
- Export the new functions from the `record::codec` module
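The heuristic described above can be sketched as follows. This is a minimal illustration, not noodles' actual implementation; the fixed-field width and aux-data allowance are assumed values:

```rust
// A minimal sketch of a generous size heuristic like estimate_record_size().
// FIXED_FIELDS_LEN and the aux-data allowance are illustrative assumptions.
const FIXED_FIELDS_LEN: usize = 32; // assumed: block_size prefix + fixed-width BAM fields

fn estimate_record_size(name_len: usize, seq_len: usize, n_cigar_ops: usize) -> usize {
    FIXED_FIELDS_LEN
        + name_len + 1          // read name + NUL terminator
        + n_cigar_ops * 4       // one packed u32 per CIGAR op
        + seq_len.div_ceil(2)   // bases packed two per byte
        + seq_len               // one quality score byte per base
        + 64                    // generous allowance for auxiliary data
}

fn encode_with_prealloc(dst: &mut Vec<u8>, name_len: usize, seq_len: usize, n_cigar_ops: usize) {
    // Reserve once up front so encoding appends without reallocating.
    dst.reserve(estimate_record_size(name_len, seq_len, n_cigar_ops));
    // ... field encoding would follow here ...
}

fn main() {
    // 32 + (8 + 1) + (3 * 4) + 75 + 150 + 64 = 342
    println!("{}", estimate_record_size(8, 150, 3));

    let mut buf = Vec::new();
    encode_with_prealloc(&mut buf, 8, 150, 3);
    assert!(buf.capacity() >= 342);
}
```

Because the estimate is an over-approximation rather than an exact size, computing it needs only the slice lengths, not a pass over the CIGAR or aux fields.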
Add `write_quality_scores_from_slice()`, which bypasses trait-based iteration when encoding quality scores, using bulk operations instead. The optimization works in two steps:

1. Validate all scores in a single pass (which LLVM can auto-vectorize)
2. Use `extend_from_slice()` for a single memcpy instead of per-byte writes

This eliminates dynamic dispatch overhead and enables better compiler optimizations for quality score encoding.

Performance improvement: approximately 30-35% throughput increase for quality score encoding compared to the trait-based iterator version.

- Add `write_quality_scores_from_slice()` with documentation
- Add unit tests including boundary validation
- Add a test verifying the output matches the trait-based encoder
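The two-pass shape of this optimization can be sketched as below. Names and the error type are illustrative, not noodles' exact API:

```rust
// Sketch: validate the whole slice first, then copy it in one bulk operation.
const MAX_SCORE: u8 = 93; // maximum valid Phred score in BAM

fn write_quality_scores_from_slice(dst: &mut Vec<u8>, scores: &[u8]) -> Result<(), String> {
    // Pass 1: simple single-pass validation that LLVM can auto-vectorize.
    if scores.iter().any(|&s| s > MAX_SCORE) {
        return Err("quality score out of range".into());
    }

    // Pass 2: one bulk copy (memcpy) instead of per-byte pushes.
    dst.extend_from_slice(scores);

    Ok(())
}

fn main() {
    let mut dst = Vec::new();
    write_quality_scores_from_slice(&mut dst, &[0, 40, 93]).unwrap();
    assert_eq!(dst, vec![0, 40, 93]);

    // Out-of-range scores are rejected before anything is written.
    assert!(write_quality_scores_from_slice(&mut Vec::new(), &[94]).is_err());
}
```

Splitting validation from copying is what enables both halves to be fast: the validation loop has no side effects to inhibit vectorization, and the copy is a single `memcpy`.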
Add `write_sequence_from_slice()`, which encodes DNA sequences using 16-base chunked processing, matching htslib's strategy for optimal performance. The optimization works by:

1. Processing 16 bases at a time (producing 8 output bytes per chunk)
2. Using unrolled loop operations for better CPU pipelining
3. Handling remainder bases in 2-base pairs
4. Padding a final odd base with zero in the lower nibble

This chunked approach provides:

- Reduced loop overhead from processing more data per iteration
- Improved CPU pipelining through unrolled operations
- Potential SIMD auto-vectorization opportunities
- Better cache locality from contiguous memory access

The 16-base chunking matches htslib's NN=16 strategy (sam.c:621-636) for cross-tool consistency.

Performance improvement: approximately 15-20% throughput increase compared to the per-base iterator version.

- Add `write_sequence_from_slice()` with comprehensive documentation
- Move the `CODES` lookup table to a module-level const for shared access
- Add unit tests for various sequence lengths and edge cases
- Add tests verifying the output matches the trait-based encoder
- Add tests for all 16 standard base codes
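The chunking scheme above can be sketched as follows. The lookup is simplified to a few common bases for illustration; the real table covers all 16 BAM base codes (`=ACMGRSVTWYHKDBN`), and names are illustrative:

```rust
// Sketch: pack 4-bit base codes two per byte, 16 bases (8 output bytes) per
// main-loop iteration, with the remainder handled in pairs plus padding.
fn base_code(base: u8) -> u8 {
    match base {
        b'=' => 0,
        b'A' => 1,
        b'C' => 2,
        b'G' => 4,
        b'T' => 8,
        _ => 15, // N and anything unrecognized
    }
}

fn write_sequence_from_slice(dst: &mut Vec<u8>, seq: &[u8]) {
    dst.reserve(seq.len().div_ceil(2));

    // Main loop: 16 bases -> 8 packed bytes per iteration. The fixed-size
    // inner loop gives the compiler room to unroll and vectorize.
    let mut chunks = seq.chunks_exact(16);
    for chunk in &mut chunks {
        for pair in chunk.chunks_exact(2) {
            dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
        }
    }

    // Remainder: handle leftover bases in 2-base pairs...
    let mut pairs = chunks.remainder().chunks_exact(2);
    for pair in &mut pairs {
        dst.push(base_code(pair[0]) << 4 | base_code(pair[1]));
    }

    // ...and pad a final odd base with zero in the lower nibble.
    if let [last] = pairs.remainder() {
        dst.push(base_code(*last) << 4);
    }
}

fn main() {
    let mut dst = Vec::new();
    write_sequence_from_slice(&mut dst, b"ACGT");
    assert_eq!(dst, vec![0x12, 0x48]); // A|C = 0x12, G|T = 0x48

    let mut odd = Vec::new();
    write_sequence_from_slice(&mut odd, b"ACG");
    assert_eq!(odd, vec![0x12, 0x40]); // final G padded into the high nibble
}
```

The first base of each pair lands in the high nibble, matching the BAM on-disk layout; the 16-base main loop and 2-base remainder handling mirror the structure described in the commit message.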
Add `write_cigar_from_slice()` for bulk CIGAR encoding, and `encode_record_buf()`, which combines all of the optimizations for maximum performance when encoding `RecordBuf` instances.

`write_cigar_from_slice()`:

- Bypasses trait-based iterator overhead
- Pre-reserves capacity for all CIGAR operations
- Iterates the slice directly for better performance

`encode_record_buf()`:

- Specialized encoder for the `RecordBuf` type
- Combines all bulk encoding optimizations:
  - Buffer pre-allocation
  - Bulk sequence encoding with 16-base chunks
  - Bulk quality score encoding with vectorized validation
  - Bulk CIGAR encoding with pre-allocated capacity
- Directly accesses the underlying slices instead of trait objects

Performance improvement: approximately 40-50% throughput increase over the generic trait-based encoder when encoding `RecordBuf` instances.

- Add `write_cigar_from_slice()` with documentation
- Add `encode_record_buf()` with comprehensive documentation
- Export `encode_record_buf` from the `record::codec` module
- Add unit tests verifying the output matches the generic encoder
- Add tests for various sequence lengths and configurations
- Add Criterion benchmarks comparing the generic and optimized paths
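The bulk CIGAR path can be sketched as below. The `(length, op_code)` tuple representation is a simplification for illustration; the packed layout (`op_len << 4 | op_code`, little-endian) is the standard BAM encoding:

```rust
// Sketch: reserve space for every op up front, then emit each one as the
// BAM-format packed u32 without going through a trait-object iterator.
fn write_cigar_from_slice(dst: &mut Vec<u8>, ops: &[(u32, u8)]) {
    // One reservation for all ops instead of growing per op.
    dst.reserve(ops.len() * 4);

    for &(len, op_code) in ops {
        let packed = len << 4 | u32::from(op_code & 0x0f);
        dst.extend_from_slice(&packed.to_le_bytes());
    }
}

fn main() {
    // 100M (op code 0 = M) followed by 2S (op code 4 = S).
    let mut dst = Vec::new();
    write_cigar_from_slice(&mut dst, &[(100, 0), (2, 4)]);
    assert_eq!(dst, vec![0x40, 0x06, 0x00, 0x00, 0x24, 0x00, 0x00, 0x00]);
}
```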
Add a specialized `write_record_buf()` method to the BAM `Writer` that uses the optimized `encode_record_buf()` function for maximum write throughput when working with `RecordBuf` instances.

The standard `write_alignment_record()` method uses trait objects and cannot automatically detect when a `RecordBuf` is being written. This new method provides an explicit opt-in for users who want maximum performance.

Performance comparison:

- `write_alignment_record()`: ~170 MiB/s
- `write_record_buf()`: ~760 MiB/s (4.5x faster)

Usage:

```rust
let mut writer = bam::io::Writer::new(io::sink());
writer.write_header(&header)?;

// Use the optimized path for RecordBuf.
writer.write_record_buf(&header, &record)?;
```

- Add `Writer::write_record_buf()` with documentation
- Add tests verifying the output matches the generic writer
- Add a round-trip read/write test
- Add a benchmark comparing `write_alignment_record` and `write_record_buf`
Force-pushed from 0c3f5a3 to 46487c6
Add `test_encode_record_buf_with_oversized_cigar` to verify the optimized encoder correctly handles records with more than 65535 CIGAR operations. Also enhance the `Writer::write_record_buf` documentation with guidance on when to use each encoding method.
This was referenced Jan 24, 2026
Add `Writer::write_record_buf()` for high-throughput BAM writing when using `RecordBuf`.

## Benchmarks

Results on Apple M3 Max:

| Method | Throughput |
| --- | --- |
| `write_alignment_record()` | ~170 MiB/s |
| `write_record_buf()` | ~760 MiB/s (4.5x faster) |

## Optimizations

## New Public API

- `Writer::write_record_buf(&header, &record)`: optimized write path
- `encode_record_buf()`: low-level optimized encoder
- `encode_with_prealloc()`: generic encoder with pre-allocation
- `estimate_record_size()`: fast size heuristic

The generic `Record` trait path is unchanged.

## API Design Rationale

The `write_record_buf` method is separate from `write_alignment_record` because the trait-based method uses dynamic dispatch (`&dyn Record`), which prevents the compiler from accessing concrete slice data for bulk operations. The specialized method is necessary to achieve these performance gains without breaking the existing generic API.

## Future Work
Similar optimizations could be applied to other formats (CRAM, VCF writers) if profiling indicates encoding is a bottleneck. This PR focuses on BAM as it's the most commonly written format in high-throughput pipelines.
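The dynamic-dispatch limitation behind the API design rationale can be shown in miniature. Trait and type names below are illustrative stand-ins, not noodles' actual definitions:

```rust
// With a trait object, the encoder only sees an opaque boxed iterator of
// bases, so it must process one item at a time through a vtable call.
trait Record {
    fn sequence_iter(&self) -> Box<dyn Iterator<Item = u8> + '_>;
}

// A concrete buffer type can hand out its backing slice directly,
// enabling memcpy-style bulk operations and auto-vectorization.
struct RecordBuf {
    sequence: Vec<u8>,
}

impl Record for RecordBuf {
    fn sequence_iter(&self) -> Box<dyn Iterator<Item = u8> + '_> {
        Box::new(self.sequence.iter().copied())
    }
}

impl RecordBuf {
    fn sequence_slice(&self) -> &[u8] {
        &self.sequence
    }
}

fn encode_generic(dst: &mut Vec<u8>, record: &dyn Record) {
    // Per-item dynamic dispatch; the compiler cannot see the slice.
    dst.extend(record.sequence_iter());
}

fn encode_specialized(dst: &mut Vec<u8>, record: &RecordBuf) {
    // Single bulk copy from the concrete slice.
    dst.extend_from_slice(record.sequence_slice());
}

fn main() {
    let record = RecordBuf { sequence: b"ACGT".to_vec() };

    let mut generic = Vec::new();
    encode_generic(&mut generic, &record);

    let mut specialized = Vec::new();
    encode_specialized(&mut specialized, &record);

    // Both paths produce identical bytes; only the specialized
    // path can use a single bulk copy.
    assert_eq!(generic, specialized);
}
```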