Commit 504035a
feat(parquet): make PushBuffers boundary-agnostic for prefetch IO
The `PushDecoder` (introduced in #7997, #8080) is designed to decouple IO and CPU. It holds non-contiguous byte ranges and exposes a `NeedsData`/`push_range` protocol. However, it requires each logical read to be satisfied in full by a single physical buffer: `has_range`, `get_bytes`, and `Read::read` all searched for one buffer that entirely covered the requested range.

This assumption conflates two orthogonal IO strategies:

- Coalescing: the IO layer merges adjacent requested ranges into fewer, larger fetches.
- Prefetching: the IO layer pushes data ahead of what the decoder has requested. This is an inversion of control: the IO layer speculatively fills buffers at offsets not yet requested, with arbitrary buffer sizes.

These two strategies interact poorly with the current release mechanism (`clear_ranges`), which matches buffers by exact range equality:

- Coalescing is both rewarded and punished. It is load-bearing because without it the number of physical buffers scales with the number of requested ranges, and `clear_ranges` performs an O(N×M) scan to remove consumed ranges, producing quadratic overhead on wide schemas. But it is also punished because a coalesced buffer never exactly matches any individual requested range, so `clear_ranges` silently skips it: the buffer leaks in `PushBuffers` until the decoder finishes or the caller manually calls `release_all_ranges` (#9624). This increases peak RSS proportionally to the amount of data coalesced ahead of the current row group.
- Prefetching is structurally impossible: speculatively pushed buffers will straddle future read boundaries, so the decoder cannot consume them, and `clear_ranges` cannot release them.
This commit makes `PushBuffers` boundary-agnostic, completing the prefetching story, and changes the internals to scale with buffer count instead of range count:

- Buffer stitching: `has_range`, `get_bytes`, and `Read::read` resolve logical ranges across multiple contiguous physical buffers via binary search, so the IO layer is free to push arbitrarily-sized parts without knowing future read boundaries. This matters because some IO layers can be made much more efficient when using uniform buffer sizes and vectorized reads.
- Incremental release (`release_through`): replaces `clear_ranges` with a watermark-based release that drops all buffers below a byte offset, trimming straddling buffers via zero-copy `Bytes::slice`. The decoder calls this automatically at row-group boundaries.

Benchmark results (vs baseline):

    push_decoder/1buf/1000ranges      321.9 µs  (was 323.5 µs,  −1%)
    push_decoder/1buf/10000ranges     3.26 ms   (was 3.25 ms,   +0%)
    push_decoder/1buf/100000ranges    34.9 ms   (was 34.6 ms,   +1%)
    push_decoder/1buf/500000ranges    192.2 ms  (was 185.3 ms,  +4%)
    push_decoder/Nbuf/1000ranges      363.9 µs  (was 437.2 µs, −17%)
    push_decoder/Nbuf/10000ranges     3.82 ms   (was 10.7 ms,  −64%)
    push_decoder/Nbuf/100000ranges    42.1 ms   (was 711.6 ms, −94%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
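The stitching and watermark-release mechanics described above can be sketched independently of the parquet crate. The `PushBuffers` below is a hypothetical stand-in, not the crate's actual implementation: it uses `Vec<u8>` instead of `Bytes` (so trimming copies rather than zero-copy slicing) and re-sorts on every push, but the lookup and release logic follow the same shape: binary search for the covering buffer, stitch across contiguous neighbors, and drop or trim everything below a byte-offset watermark.

```rust
/// Simplified stand-in for the crate's PushBuffers (assumption: not real API).
struct PushBuffers {
    /// (start offset in file, bytes); kept sorted by start offset
    buffers: Vec<(u64, Vec<u8>)>,
}

impl PushBuffers {
    fn new() -> Self {
        Self { buffers: Vec::new() }
    }

    /// Accept a buffer at any offset and any size (boundary-agnostic).
    fn push(&mut self, start: u64, data: Vec<u8>) {
        self.buffers.push((start, data));
        self.buffers.sort_by_key(|(s, _)| *s);
    }

    /// Stitch the logical range `start..end` from as many physical buffers
    /// as needed. Returns None if any byte in the range is missing.
    fn get_bytes(&self, mut start: u64, end: u64) -> Option<Vec<u8>> {
        let mut out = Vec::with_capacity((end - start) as usize);
        while start < end {
            // Binary search: last buffer whose start offset is <= `start`.
            let idx = self
                .buffers
                .partition_point(|(s, _)| *s <= start)
                .checked_sub(1)?;
            let (buf_start, data) = &self.buffers[idx];
            let buf_end = buf_start + data.len() as u64;
            if start >= buf_end {
                return None; // gap: `start` is not covered by any buffer
            }
            let from = (start - buf_start) as usize;
            let to = (end.min(buf_end) - buf_start) as usize;
            out.extend_from_slice(&data[from..to]);
            start = *buf_start + to as u64; // continue in the next buffer
        }
        Some(out)
    }

    /// Watermark release: drop every byte below `offset`; a buffer
    /// straddling the watermark keeps only its suffix.
    fn release_through(&mut self, offset: u64) {
        self.buffers.retain_mut(|(start, data)| {
            let end = *start + data.len() as u64;
            if end <= offset {
                false // entirely below the watermark: drop
            } else if *start < offset {
                data.drain(..(offset - *start) as usize); // trim prefix
                *start = offset;
                true
            } else {
                true
            }
        });
    }

    fn buffered_bytes(&self) -> u64 {
        self.buffers.iter().map(|(_, d)| d.len() as u64).sum()
    }
}

fn main() {
    let mut bufs = PushBuffers::new();
    // The IO layer pushes fixed-size parts, ignorant of read boundaries.
    bufs.push(4, vec![5, 6, 7, 8]);
    bufs.push(0, vec![1, 2, 3, 4]);
    // A logical read of 2..6 is stitched across both physical buffers.
    assert_eq!(bufs.get_bytes(2, 6), Some(vec![3, 4, 5, 6]));
    // Release bytes 0..5: the first buffer is dropped, the second trimmed.
    bufs.release_through(5);
    assert_eq!(bufs.buffered_bytes(), 3);
}
```

The real implementation avoids both the copy and the re-sort: it slices `Bytes` for zero-copy trimming and defers sorting to an explicit `ensure_sorted` call before decoding, which is why `get_bytes` and `get_chunks` take `&mut` in the diff below.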
Parent: 711fac8

File tree: 7 files changed, +735 −138 lines

parquet/src/arrow/push_decoder/mod.rs
125 additions & 23 deletions

@@ -326,22 +326,25 @@ impl ParquetPushDecoder {
         Ok(decode_result)
     }

-    /// Push data into the decoder for processing
+    /// Push data into the decoder for processing.
     ///
     /// This is a convenience wrapper around [`Self::push_ranges`] for pushing a
-    /// single range of data.
-    ///
-    /// Note this can be the entire file or just a part of it. If it is part of the file,
-    /// the ranges should correspond to the data ranges requested by the decoder.
-    ///
-    /// See example in [`ParquetPushDecoderBuilder`]
+    /// single range of data. See [`Self::push_ranges`] for details.
     pub fn push_range(&mut self, range: Range<u64>, data: Bytes) -> Result<(), ParquetError> {
         self.push_ranges(vec![range], vec![data])
     }

-    /// Push data into the decoder for processing
+    /// Push data into the decoder for processing.
     ///
-    /// This should correspond to the data ranges requested by the decoder
+    /// Each `(range, data)` pair associates a byte range in the Parquet file
+    /// with its contents. The pushed buffers do not need to align with the
+    /// ranges requested by [`DecodeResult::NeedsData`]: they may be smaller
+    /// (the decoder stitches adjacent buffers), larger (e.g. coalesced
+    /// fetches), or even cover offsets not yet requested (prefetch).
+    ///
+    /// The only requirement is that, by the time [`Self::try_decode`] is
+    /// called, the union of all pushed ranges must cover every byte the
+    /// decoder requested for the current decode step.
     pub fn push_ranges(
         &mut self,
         ranges: Vec<Range<u64>>,

@@ -366,13 +369,31 @@ impl ParquetPushDecoder {
         self.state.buffered_bytes()
     }

-    /// Clear any staged byte ranges currently buffered for future decode work.
+    /// Release all staged byte ranges currently buffered for future decode
+    /// work.
     ///
-    /// This clears byte ranges still owned by the decoder's internal
+    /// This releases byte ranges still owned by the decoder's internal
     /// `PushBuffers`. It does not affect any data that has already been handed
     /// off to an active [`ParquetRecordBatchReader`].
+    pub fn release_all(&mut self) {
+        self.state.release_all();
+    }
+
+    /// Use [`Self::release_all`] instead.
+    #[deprecated(since = "58.1.0", note = "Use `release_all` instead")]
     pub fn clear_all_ranges(&mut self) {
-        self.state.clear_all_ranges();
+        self.release_all();
+    }
+
+    /// Release all physical buffers that end at or before the given byte offset.
+    ///
+    /// A buffer straddling the offset is trimmed: the portion before `offset`
+    /// is dropped and the suffix is retained (zero-copy via [`Bytes::slice`]).
+    ///
+    /// This does not affect any data that has already been handed off to an
+    /// active [`ParquetRecordBatchReader`].
+    pub fn release_through(&mut self, offset: u64) {
+        self.state.release_through(offset);
     }
 }

@@ -583,16 +604,28 @@ impl ParquetDecoderState {
         }
     }

-    /// Clear any staged ranges currently buffered in the decoder.
-    fn clear_all_ranges(&mut self) {
+    fn release_all(&mut self) {
         match self {
             ParquetDecoderState::ReadingRowGroup {
                 remaining_row_groups,
-            } => remaining_row_groups.clear_all_ranges(),
+            } => remaining_row_groups.release_all(),
             ParquetDecoderState::DecodingRowGroup {
                 record_batch_reader: _,
                 remaining_row_groups,
-            } => remaining_row_groups.clear_all_ranges(),
+            } => remaining_row_groups.release_all(),
+            ParquetDecoderState::Finished => {}
+        }
+    }
+
+    fn release_through(&mut self, offset: u64) {
+        match self {
+            ParquetDecoderState::ReadingRowGroup {
+                remaining_row_groups,
+            } => remaining_row_groups.release_through(offset),
+            ParquetDecoderState::DecodingRowGroup {
+                record_batch_reader: _,
+                remaining_row_groups,
+            } => remaining_row_groups.release_through(offset),
             ParquetDecoderState::Finished => {}
         }
     }

@@ -691,8 +724,9 @@ mod test {
     /// Releasing staged ranges should free speculative buffers without affecting
     /// the active row group reader.
     #[test]
-    fn test_decoder_clear_all_ranges() {
-        let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(test_file_parquet_metadata())
+    fn test_decoder_release_all() {
+        let metadata = test_file_parquet_metadata();
+        let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(metadata.clone())
             .unwrap()
             .with_batch_size(100)
             .build()

@@ -703,14 +737,16 @@ mod test {
             .unwrap();
         assert_eq!(decoder.buffered_bytes(), test_file_len());

-        // The current row group reader is built from the prefetched bytes, but
-        // the speculative full-file range remains staged in the decoder.
+        // Building the InMemoryRowGroup for row group 0 releases buffers up
+        // to that row group's end offset. The remainder (row group 1 + footer)
+        // is still staged.
         let batch1 = expect_data(decoder.try_decode());
         assert_eq!(batch1, TEST_BATCH.slice(0, 100));
-        assert_eq!(decoder.buffered_bytes(), test_file_len());
+        let rg0_end = metadata.row_group(0).end_offset();
+        assert_eq!(decoder.buffered_bytes(), test_file_len() - rg0_end);

-        // All of the buffer is released
-        decoder.clear_all_ranges();
+        // Release everything that remains.
+        decoder.release_all();
         assert_eq!(decoder.buffered_bytes(), 0);

         // The active reader still owns the current row group's bytes, so it can

@@ -1167,6 +1203,72 @@ mod test {
         expect_finished(decoder.try_decode());
     }

+    /// Decode the file pushed as fixed-size streaming parts, simulating a
+    /// single GET request that yields part-sized buffers. Part boundaries are
+    /// intentionally misaligned with column chunk / page boundaries.
+    #[test]
+    fn test_decoder_streaming_parts() {
+        let part_size = 512usize; // misaligned with column chunks
+        let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(test_file_parquet_metadata())
+            .unwrap()
+            .build()
+            .unwrap();
+
+        // Push the entire file as fixed-size parts.
+        let file_len = TEST_FILE_DATA.len();
+        let mut offset = 0usize;
+        while offset < file_len {
+            let end = (offset + part_size).min(file_len);
+            let range = (offset as u64)..(end as u64);
+            let data = TEST_FILE_DATA.slice(offset..end);
+            decoder.push_range(range, data).unwrap();
+            offset = end;
+        }
+
+        // Decode all row groups — stitching should handle cross-part reads.
+        let batch1 = expect_data(decoder.try_decode());
+        let batch2 = expect_data(decoder.try_decode());
+        expect_finished(decoder.try_decode());
+
+        let all_output = concat_batches(&TEST_BATCH.schema(), &[batch1, batch2]).unwrap();
+        assert_eq!(all_output, *TEST_BATCH);
+    }
+
+    /// Push the entire file, decode the first row group, call `release_through`
+    /// to free its buffers, then decode the second row group.
+    #[test]
+    fn test_decoder_release_through() {
+        let metadata = test_file_parquet_metadata();
+        let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(metadata.clone())
+            .unwrap()
+            .build()
+            .unwrap();
+
+        decoder
+            .push_range(test_file_range(), TEST_FILE_DATA.clone())
+            .unwrap();
+        assert_eq!(decoder.buffered_bytes(), test_file_len());
+
+        // Decode first row group.
+        let batch1 = expect_data(decoder.try_decode());
+        assert_eq!(batch1, TEST_BATCH.slice(0, 200));
+
+        // Free everything up to the end of row group 0.
+        let rg0_end = metadata.row_group(0).end_offset();
+        decoder.release_through(rg0_end);
+        let remaining = decoder.buffered_bytes();
+        assert!(
+            remaining < test_file_len(),
+            "buffered_bytes should have decreased: {remaining} < {}",
+            test_file_len()
+        );
+
+        // Second row group should still be decodable.
+        let batch2 = expect_data(decoder.try_decode());
+        assert_eq!(batch2, TEST_BATCH.slice(200, 200));
+        expect_finished(decoder.try_decode());
+    }
+
     /// Returns a batch with 400 rows, with 3 columns: "a", "b", "c"
     ///
     /// Note c is a different types (so the data page sizes will be different)

parquet/src/arrow/push_decoder/reader_builder/data.rs
7 additions & 9 deletions

@@ -23,7 +23,6 @@ use crate::arrow::in_memory_row_group::{ColumnChunkData, FetchRanges, InMemoryRo
 use crate::errors::ParquetError;
 use crate::file::metadata::ParquetMetaData;
 use crate::file::page_index::offset_index::OffsetIndexMetaData;
-use crate::file::reader::ChunkReader;
 use crate::util::push_buffers::PushBuffers;
 use bytes::Bytes;
 use std::ops::Range;

@@ -55,7 +54,7 @@ impl DataRequest {
     }

     /// Returns the chunks from the buffers that satisfy this request
-    fn get_chunks(&self, buffers: &PushBuffers) -> Result<Vec<Bytes>, ParquetError> {
+    fn get_chunks(&self, buffers: &mut PushBuffers) -> Result<Vec<Bytes>, ParquetError> {
         self.ranges
             .iter()
             .map(|range| {

@@ -72,10 +71,12 @@ impl DataRequest {
             .collect()
     }

-    /// Create a new InMemoryRowGroup, and fill it with provided data
+    /// Create a new InMemoryRowGroup, and fill it with provided data.
     ///
-    /// Assumes that all needed data is present in the buffers
-    /// and clears any explicitly requested ranges
+    /// Assumes that all needed data is present in the buffers.
+    /// Does **not** release any buffers — the caller is responsible for
+    /// calling `PushBuffers::release_through` at the appropriate time
+    /// (typically after all phases for a row group are complete).
     pub fn try_into_in_memory_row_group<'a>(
         self,
         row_group_idx: usize,

@@ -88,7 +89,7 @@ impl DataRequest {

         let Self {
             column_chunks,
-            ranges,
+            ranges: _,
             page_start_offsets,
         } = self;

@@ -105,9 +106,6 @@ impl DataRequest {

         in_memory_row_group.fill_column_chunks(projection, page_start_offsets, chunks);

-        // Clear the ranges that were explicitly requested
-        buffers.clear_ranges(&ranges);
-
         Ok(in_memory_row_group)
     }
 }

parquet/src/arrow/push_decoder/reader_builder/mod.rs
16 additions & 3 deletions

@@ -212,9 +212,15 @@ impl RowGroupReaderBuilder {
         self.buffers.buffered_bytes()
     }

-    /// Clear any staged ranges currently buffered for future decode work.
-    pub fn clear_all_ranges(&mut self) {
-        self.buffers.clear_all_ranges();
+    /// Release all staged ranges currently buffered for future decode work.
+    pub fn release_all(&mut self) {
+        self.buffers.release_all();
+    }
+
+    /// Release all physical buffers that end at or before `offset`.
+    /// A straddling buffer is trimmed via zero-copy [`Bytes::slice`].
+    pub fn release_through(&mut self, offset: u64) {
+        self.buffers.release_through(offset);
     }

     /// take the current state, leaving None in its place.

@@ -269,6 +275,7 @@ impl RowGroupReaderBuilder {
     pub(crate) fn try_build(
         &mut self,
     ) -> Result<DecodeResult<ParquetRecordBatchReader>, ParquetError> {
+        self.buffers.ensure_sorted();
         loop {
             let current_state = self.take_state()?;
             // Try to transition the decoder.

@@ -610,6 +617,12 @@ impl RowGroupReaderBuilder {
                 &mut self.buffers,
             )?;

+            // All data for this row group has been extracted into the
+            // InMemoryRowGroup. Release physical buffers up to the end
+            // of this row group so streaming IO can reclaim memory.
+            self.buffers
+                .release_through(self.metadata.row_group(row_group_idx).end_offset());
+
             let plan = plan_builder.build();

             // if we have any cached results, connect them up

parquet/src/arrow/push_decoder/remaining.rs
9 additions & 3 deletions

@@ -70,9 +70,15 @@ impl RemainingRowGroups {
         self.row_group_reader_builder.buffered_bytes()
     }

-    /// Clear any staged ranges currently buffered for future decode work
-    pub fn clear_all_ranges(&mut self) {
-        self.row_group_reader_builder.clear_all_ranges();
+    /// Release all staged ranges currently buffered for future decode work.
+    pub fn release_all(&mut self) {
+        self.row_group_reader_builder.release_all();
+    }
+
+    /// Release all physical buffers that end at or before `offset`.
+    /// A straddling buffer is trimmed via zero-copy [`Bytes::slice`].
+    pub fn release_through(&mut self, offset: u64) {
+        self.row_group_reader_builder.release_through(offset);
     }

     /// returns [`ParquetRecordBatchReader`] suitable for reading the next

parquet/src/file/metadata/mod.rs
15 additions & 0 deletions

@@ -713,6 +713,21 @@ impl RowGroupMetaData {
         self.file_offset
     }

+    /// Returns the byte offset just past the last column chunk in this row group.
+    ///
+    /// This is the maximum of `(start + length)` across all column chunks, which
+    /// represents the first byte that is *not* part of this row group's data.
+    pub fn end_offset(&self) -> u64 {
+        self.columns
+            .iter()
+            .map(|c| {
+                let (start, len) = c.byte_range();
+                start + len
+            })
+            .max()
+            .unwrap_or(0)
+    }
+
     /// Converts this [`RowGroupMetaData`] into a [`RowGroupMetaDataBuilder`]
     pub fn into_builder(self) -> RowGroupMetaDataBuilder {
         RowGroupMetaDataBuilder(self)

parquet/src/file/metadata/push_decoder.rs
7 additions & 7 deletions

@@ -23,7 +23,6 @@ use crate::file::FOOTER_SIZE;
 use crate::file::metadata::parser::{MetadataParser, parse_column_index, parse_offset_index};
 use crate::file::metadata::{FooterTail, PageIndexPolicy, ParquetMetaData, ParquetMetaDataOptions};
 use crate::file::page_index::index_reader::acc_range;
-use crate::file::reader::ChunkReader;
 use bytes::Bytes;
 use std::ops::Range;
 use std::sync::Arc;

@@ -360,12 +359,13 @@ impl ParquetMetaDataPushDecoder {

     /// Clear any staged byte ranges currently buffered for future decode work.
     pub fn clear_all_ranges(&mut self) {
-        self.buffers.clear_all_ranges();
+        self.buffers.release_all();
     }

     /// Try to decode the metadata from the pushed data, returning the
     /// decoded metadata or an error if not enough data is available.
     pub fn try_decode(&mut self) -> Result<DecodeResult<ParquetMetaData>> {
+        self.buffers.ensure_sorted();
         let file_len = self.buffers.file_len();
         let footer_len = FOOTER_SIZE as u64;
         loop {

@@ -397,10 +397,10 @@ impl ParquetMetaDataPushDecoder {
                 return Ok(needs_range(metadata_range));
             }

-            let metadata = self.metadata_parser.decode_metadata(
-                &self.get_bytes(&metadata_range)?,
-                footer_tail.is_encrypted_footer(),
-            )?;
+            let metadata_bytes = self.get_bytes(&metadata_range)?;
+            let metadata = self
+                .metadata_parser
+                .decode_metadata(&metadata_bytes, footer_tail.is_encrypted_footer())?;
             // Note: ReadingPageIndex first checks if page indexes are needed
             // and is a no-op if not
             self.state = DecodeState::ReadingPageIndex(Box::new(metadata));

@@ -445,7 +445,7 @@ impl ParquetMetaDataPushDecoder {
     }

     /// Returns the bytes for the given range from the internal buffer
-    fn get_bytes(&self, range: &Range<u64>) -> Result<Bytes> {
+    fn get_bytes(&mut self, range: &Range<u64>) -> Result<Bytes> {
         let start = range.start;
         let raw_len = range.end - range.start;
         let len: usize = raw_len.try_into().map_err(|_| {
