21586: perf(physical-plan): optimize byte view append #309

Open

martin-augment wants to merge 2 commits into main from pr-21586-2026-04-13-11-51-50

Conversation

@martin-augment (Owner)

21586: To review by AI

@coderabbitai

coderabbitai bot commented Apr 13, 2026

Walkthrough

This pull request introduces three distinct changes. Two documentation updates define new constraints for AI agents conducting code reviews, specifically prohibiting linking to issues or pull requests and prohibiting review of AI configuration files. One implementation change optimizes ByteView handling in DataFusion's aggregation engine: it introduces vectorized append operations with deferred bulk copying for long values, replacing per-row processing loops, and adds comprehensive unit test coverage for the new behavior.


@augmentcode

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: Optimizes ByteViewGroupValueBuilder’s vectorized append path for byte/utf8 view group values to reduce per-row work when ingesting many rows.

Changes:

  • Introduces a pending-copy representation for long (non-inlined) byte views and a batched copy routine that can copy contiguous source slices in fewer operations.
  • Splits vectorized append into fast paths for “no nulls” vs “has nulls”, minimizing branching in the non-null case.
  • Updates long-view handling to rewrite stored views to point at the builder’s internal buffers after batch copying.
  • Adds unit tests covering subset/repeated-row appends, interaction with take_n, multiple batches, mid-batch flush behavior, and oversized values.

Technical Notes: The new batching logic attempts to coalesce adjacent long views from the same source buffer into a single copy into the builder’s in-progress block, while preserving correct view offsets/buffer indices.
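The coalescing described here can be sketched in isolation, using plain `Vec<u8>` buffers in place of Arrow `Buffer`s. The `PendingByteViewCopy` fields and the `batch_copy` helper below are illustrative stand-ins for the PR's internals, not the actual DataFusion code.

```rust
// Hypothetical pending-copy record for a long (non-inlined) value.
#[derive(Clone, Copy, Debug)]
struct PendingByteViewCopy {
    buffer_idx: usize, // which source data buffer holds the bytes
    offset: usize,     // byte offset of the value in that buffer
    len: usize,        // value length in bytes (> 12, i.e. not inlined)
}

/// Copy all pending long values into `dest`, coalescing entries that are
/// contiguous in the same source buffer into a single extend_from_slice.
/// Returns the new destination offset for each pending entry and the
/// number of copy operations performed.
fn batch_copy(
    source_buffers: &[Vec<u8>],
    pending: &[PendingByteViewCopy],
    dest: &mut Vec<u8>,
) -> (Vec<usize>, usize) {
    let mut new_offsets = Vec::with_capacity(pending.len());
    let mut copies = 0;
    let mut i = 0;
    while i < pending.len() {
        let first = pending[i];
        let mut end = first.offset + first.len;
        let mut j = i + 1;
        // Grow the batch while the next entry starts exactly where the
        // current batch ends, in the same source buffer.
        while j < pending.len()
            && pending[j].buffer_idx == first.buffer_idx
            && pending[j].offset == end
        {
            end += pending[j].len;
            j += 1;
        }
        let dest_start = dest.len();
        dest.extend_from_slice(&source_buffers[first.buffer_idx][first.offset..end]);
        copies += 1;
        // Each view's new offset is the batch's destination start plus its
        // relative position within the batch.
        for p in &pending[i..j] {
            new_offsets.push(dest_start + (p.offset - first.offset));
        }
        i = j;
    }
    (new_offsets, copies)
}
```

With three pending entries where the first two are adjacent in the same source buffer, the first two collapse into one copy and the third gets its own, so two copy operations replace three.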



@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.


@claude

claude bot commented Apr 13, 2026

Code review for this optimization PR. The overall approach is sound, but I have a few observations.

First, `vectorized_append_non_null_views` iterates the rows twice: once to extend `self.views` with raw source views, then again to build the pending list. The `vectorized_append_views_with_nulls` method already does this in a single pass; aligning the non-null path to that pattern would avoid the transient invalid state in which long-string views temporarily hold stale `buffer_index` and `offset` values from the source array.

Second, `batch_copy_long_views` only batches entries that are contiguous in source offset (`next_start == batch_source_end`), so in typical GROUP BY workloads with arbitrary row indices it degrades to per-item copies. This is correct, but a doc comment would help.

On test coverage: none of the new tests exercise an input StringViewArray backed by multiple source data buffers, which is the primary scenario this optimization targets; a test constructing such an input would close that gap. Similarly, there is no test for scattered non-contiguous long-string rows from the same source buffer (e.g. rows [3, 0, 2]), which is the common GROUP BY pattern and exercises the single-item batch offset calculation. A test for an empty rows slice would also document that contract.

Minor: `PendingByteViewCopy` deriving `Clone` and `Copy` is fine for a POD type.

Summary: the implementation is correct, and the flush/oversized/null paths are well covered. Main asks before merging: (1) a test with multiple source buffers in the input, (2) a test with scattered non-contiguous rows, and (3) consider the single-pass refactor in `vectorized_append_non_null_views`.
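The single-pass refactor suggested in the first observation might look roughly like this; the `View` enum and helper names below are simplified stand-ins for the arrow `ByteView` layout, not the PR's actual types.

```rust
// Simplified view: short values (len <= 12) are "inlined" and copied as-is;
// long values record (buffer_idx, offset) to be patched after the bulk copy.
#[derive(Clone, Copy, Debug, PartialEq)]
enum View {
    Inline { len: usize },
    Long { len: usize, buffer_idx: usize, offset: usize },
}

struct Pending {
    view_idx: usize,   // position in the builder's `views` to patch later
    buffer_idx: usize,
    offset: usize,
    len: usize,
}

/// Single-pass append: extend `views` and collect pending long copies in the
/// same loop, instead of iterating `rows` twice.
fn append_non_null(
    views: &mut Vec<View>,
    source_views: &[View],
    rows: &[usize],
) -> Vec<Pending> {
    let mut pending = Vec::new();
    for &row in rows {
        let v = source_views[row];
        let view_idx = views.len();
        views.push(v);
        if let View::Long { len, buffer_idx, offset } = v {
            pending.push(Pending { view_idx, buffer_idx, offset, len });
        }
    }
    pending
}
```

The pending entries record exactly which views need their `buffer_idx`/`offset` rewritten once the batched copy has run.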


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces vectorized appending for ByteViewGroupValueBuilder in the DataFusion physical plan, optimizing the handling of binary and UTF-8 views during aggregation. The changes include new methods for batching the copy of long views and several unit tests to verify correctness across various scenarios. Feedback suggests optimizing memory allocation for the pending vector in the vectorized append paths to avoid over-allocation when processing many short strings.

```rust
let start_idx = self.views.len();
self.views.extend(rows.iter().map(|&row| source_views[row]));

let mut pending = Vec::with_capacity(rows.len());
```

medium

The pending vector is allocated and its capacity is reserved to rows.len() in every call to vectorized_append_non_null_views. For large batches where most strings are short (<= 12 bytes), this results in significant over-allocation of memory that is only partially used. Consider using a smaller initial capacity or a reusable buffer to reduce allocation overhead.

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Most likely many of the rows won't be long (longer than 12 bytes), so the pending Vec won't use all of its pre-allocated slots. It would be better to allocate fewer entries, e.g. half of the rows' length, and let the Vec resize on append if needed.
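As an alternative to the half-length heuristic, the pending list could be sized exactly with one extra counting pass. This sketch is hypothetical (`pending_capacity` and `lens` are not from the PR) and assumes the 12-byte inline threshold of the Arrow view layout.

```rust
/// Values at or below this length are inlined in the view and never need a
/// pending copy (matches the Arrow ByteView inline threshold).
const INLINE_LEN: usize = 12;

/// Count the long values among the selected rows, so the pending Vec can be
/// allocated with exactly the capacity it will use. `lens` stands in for the
/// per-row value lengths taken from the source views.
fn pending_capacity(lens: &[usize], rows: &[usize]) -> usize {
    rows.iter().filter(|&&row| lens[row] > INLINE_LEN).count()
}
```

Whether the extra counting scan beats an occasional Vec resize would need benchmarking on representative batches.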

```rust
    rows: &[usize],
) {
    let source_views = array.views();
    let mut pending = Vec::with_capacity(rows.len());
```

medium

Similar to the non-null case, the pending vector here is allocated with rows.len() capacity. This can be wasteful if the number of 'long' views is small relative to the total number of rows. Reusing a buffer or using a more conservative initial capacity would be more efficient.

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Most likely many of the rows won't be long (longer than 12 bytes), so the pending Vec won't use all of its pre-allocated slots. It would be better to allocate fewer entries, e.g. half of the rows' length, and let the Vec resize on append if needed.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs (1)

247-310: Well-designed batching strategy.

The contiguous-range detection and bulk copying is a good optimization. A few notes:

  1. The flush-before-process pattern (lines 257-261) ensures buffer_index calculation is always correct
  2. Line 264 correctly handles oversized values by using max(block_size, first_len)
  3. The offset calculation (line 303-304) correctly computes: dest_start + relative_offset_within_batch

Consider adding a brief comment at the top of the method explaining the batching strategy (identify contiguous ranges from same source buffer, copy in bulk) for future maintainers.

📝 Optional: Add explanatory comment
```diff
     fn batch_copy_long_views(
         &mut self,
         source_buffers: &[Buffer],
         pending: &[PendingByteViewCopy],
     ) {
+        // Batch contiguous ranges from the same source buffer to minimize
+        // copy operations. For each batch, copy the entire range at once
+        // and update each view's buffer_index/offset to point to our buffers.
         let mut batch_start = 0;
         while batch_start < pending.len() {
```
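The offset arithmetic noted in point 3 above can be checked with a small standalone helper; the function name and values here are illustrative, not from the PR.

```rust
/// New offset for a view after its batch is copied: the batch's start in the
/// destination buffer plus the view's position relative to the batch's start
/// in the source buffer.
fn rewritten_offset(
    dest_batch_start: usize,
    batch_source_start: usize,
    source_offset: usize,
) -> usize {
    dest_batch_start + (source_offset - batch_source_start)
}
```

For example, a view 15 bytes into its source batch keeps that relative position after the batch is relocated.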


📥 Commits

Reviewing files that changed from the base of the PR and between 644be38 and 0854982.

📒 Files selected for processing (3)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs
