
21538: perf: Optimize NULL handling in StringViewArrayBuilder #311

Open
martin-augment wants to merge 3 commits into main from pr-21538-2026-04-13-13-42-06

Conversation

@martin-augment
Owner

21538: To review by AI

@coderabbitai

coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@martin-augment has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 37 minutes and 11 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 37 minutes and 11 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5cbe819-1876-4855-ac4b-4306720be618

📥 Commits

Reviewing files that changed from the base of the PR and between db7aad0 and 7e4bb00.

📒 Files selected for processing (3)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions/src/strings.rs


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Apr 13, 2026

Code Review: Optimize NULL handling in StringViewArrayBuilder

Overall this is a clean, focused refactor that removes the StringViewBuilder dependency and gives callers more control over null handling via bulk bitmap operations. The approach is sound. A few items to address:


Bug: Silent u32 overflow for large batches

In append_offset:

let offset = self.data.len() as u32;
if v.len() > 12 {
    self.data.extend_from_slice(v);
}
self.views.push(make_view(v, 0, offset));

All views use buffer_index = 0 referencing the single self.data buffer. The offset is cast with self.data.len() as u32, which silently truncates once self.data exceeds ~4 GB. Arrow's StringViewBuilder avoids this by starting a new block once one fills up (using multiple buffer indices). For typical batch sizes this won't trigger, but it's a correctness hazard worth either documenting as a known limitation or guarding against with a checked cast or capacity assertion.
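The checked-cast guard suggested above can be sketched in isolation; `checked_offset` is an illustrative name, not the PR's actual code:

```rust
// Hypothetical sketch: replace `self.data.len() as u32` with a fallible
// conversion so a buffer larger than u32::MAX bytes fails loudly instead of
// silently truncating the offset.
fn checked_offset(data_len: usize) -> Result<u32, String> {
    u32::try_from(data_len).map_err(|_| {
        format!("StringView data buffer overflow: {data_len} bytes exceeds u32::MAX")
    })
}

fn main() {
    // Normal case: the length fits in u32.
    assert_eq!(checked_offset(1024), Ok(1024));
    // Overflow case: an error is returned instead of a truncated offset.
    assert!(checked_offset(u32::MAX as usize + 1).is_err());
}
```

In the real builder this error would be propagated through DataFusion's `Result` type rather than a `String`, as the Gemini suggestion further down shows.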


Minor: Misleading SAFETY comment in finish

// SAFETY: views were constructed with correct lengths, offsets, and
// prefixes. UTF-8 validity has also been checked, if the input was
// tainted.

By the time finish is called, self.tainted is always false — it is reset at the end of every successful append_offset. The comment implies the taint check happens at finish time, but it actually happened inside append_offset. A clearer phrasing would be something like:

UTF-8 validity was checked in append_offset for any row where tainted data (e.g., binary literals) was appended.


No tests for the new implementation path

The PR replaces the internals of StringViewArrayBuilder but doesn't add tests targeting the new code paths directly. The existing integration tests in concat.rs / concat_ws.rs exercise it indirectly, but unit tests for the following cases would improve confidence:

  • All inline strings (length ≤ 12)
  • All out-of-line strings (length > 12)
  • Mixed inline and out-of-line in the same builder
  • Empty strings
  • Invalid UTF-8 rejection via tainted path
  • Correct null buffer propagation

Positive changes

  • data_capacity is now actually passed to Vec::with_capacity (previously the parameter was _data_capacity and ignored). Good fix.
  • Error message in finish now includes the actual lengths ({} vs {}), making mismatches much easier to diagnose.
  • Resetting tainted = false after each append_offset is more correct than the original, which left tainted permanently true once any binary data was seen.
  • The finish method no longer needs mut self, which is a small ergonomic improvement.

@augmentcode

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: This PR optimizes building Utf8View results by changing how StringViewArrayBuilder accumulates values and NULLs.

Changes:

  • Replaces the internal use of Arrow’s `StringViewBuilder` with manual accumulation of `u128` views plus a contiguous data buffer
  • Moves NULL handling to `finish(...)`, allowing callers to supply a precomputed `NullBuffer` instead of appending NULLs per row
  • Constructs the final `StringViewArray` via `StringViewArray::new_unchecked` and validates the provided NULL buffer length
  • Adds/updates AI agent guidance files for code review behavior

Technical Notes: UTF-8 validity is only checked for “tainted” inputs (binary/scalar sources) during append_offset.
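The tainted-path behavior described above can be sketched standalone; `validate_row` and its signature are illustrative, not the PR's actual API:

```rust
// Hypothetical sketch of the described behavior: UTF-8 is validated only when
// the row's bytes may have come from a non-string ("tainted") source such as
// a binary literal; trusted string inputs skip the check entirely.
fn validate_row(block: &[u8], tainted: &mut bool) -> Result<(), String> {
    if *tainted {
        std::str::from_utf8(block)
            .map_err(|e| format!("invalid UTF-8 in tainted input: {e}"))?;
        // Reset the flag after a successful append, as the review notes the PR does.
        *tainted = false;
    }
    Ok(())
}

fn main() {
    let mut tainted = true;
    // Invalid bytes from a tainted source are rejected.
    assert!(validate_row(&[0xff, 0xfe], &mut tainted).is_err());
    // Valid bytes pass, and the flag is cleared afterwards.
    assert!(validate_row(b"hello", &mut tainted).is_ok());
    assert!(!tainted);
}
```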



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


Comment augment review to trigger a new review at any time.

}

let v = &self.block;
let offset = self.data.len() as u32;


offset is computed with self.data.len() as u32, which will silently truncate once the accumulated buffer exceeds u32::MAX; since this is later passed into StringViewArray::new_unchecked, it could produce invalid views and potential out-of-bounds reads. Consider enforcing the 32-bit offset invariant with a checked conversion/error (similar to the other builders’ overflow handling).

Severity: high



Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed, but it should use TryFrom so an error is returned if the data length exceeds u32::MAX, preventing silent truncation and invalid results.

if v.len() > 12 {
self.data.extend_from_slice(v);
}
self.views.push(make_view(v, 0, offset));


For v.len() <= 12, most other make_view call sites pass (0, 0) for (buffer_index, offset) to create a canonical inline view; here the inline case still passes the running offset. If make_view doesn’t fully ignore offset for inlined values, this can make identical short strings hash/compare differently depending on prior appends.

Severity: medium
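The inline-view concern above can be illustrated with a self-contained sketch of the 16-byte Utf8View layout (length, then inline data or prefix/buffer-index/offset, little-endian); `sketch_view` is a made-up name, not Arrow's actual `make_view`:

```rust
// Sketch of the 16-byte Utf8View word. For strings of <= 12 bytes the data is
// stored inline after the length, so (buffer_index, offset) should be (0, 0);
// baking a running offset into the inline case would make identical short
// strings produce different view words depending on prior appends.
fn sketch_view(v: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(v.len() as u32).to_le_bytes());
    if v.len() <= 12 {
        // Inline layout: bytes 4..16 hold the string itself; the
        // buffer_index/offset arguments are intentionally ignored.
        bytes[4..4 + v.len()].copy_from_slice(v);
        let _ = (buffer_index, offset);
    } else {
        // Out-of-line layout: 4-byte prefix, then buffer index and offset.
        bytes[4..8].copy_from_slice(&v[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

fn main() {
    // Canonical inline views are bit-identical regardless of the running offset.
    assert_eq!(sketch_view(b"abc", 0, 0), sketch_view(b"abc", 0, 4096));
    // Out-of-line views legitimately differ by offset.
    assert_ne!(
        sketch_view(b"a much longer string", 0, 0),
        sketch_view(b"a much longer string", 0, 4096)
    );
}
```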



Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The make_view() helper function is used inconsistently in strings.rs, which is confusing. It would be good to synchronize the call sites, or even remove the second argument to this helper function entirely if its value is not important.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors StringViewArrayBuilder to manually manage its internal view and data buffers and introduces new guidelines for AI agents in .cursor/rules.md and AGENTS.md. A high-severity issue was identified in the append_offset function, where casting the data buffer length to a u32 could lead to silent truncation and data corruption if the buffer exceeds 4GB.

}

let v = &self.block;
let offset = self.data.len() as u32;


high

The cast self.data.len() as u32 can silently truncate if the total size of strings longer than 12 bytes exceeds 4GB. This would lead to invalid views and data corruption. It is recommended to use try_into() and handle the error, similar to how it's done in StringArrayBuilder.

        let offset: u32 = self.data.len().try_into().map_err(|_| {
            exec_datafusion_err!("StringView data buffer overflow")
        })?;

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed, but it should use TryFrom so an error is returned if the data length exceeds u32::MAX, preventing silent truncation and invalid results.

@martin-augment
Owner Author

Bug: Silent u32 overflow for large batches

In append_offset:

let offset = self.data.len() as u32;
if v.len() > 12 {
    self.data.extend_from_slice(v);
}
self.views.push(make_view(v, 0, offset));

All views use buffer_index = 0 referencing the single self.data buffer. The offset is cast with self.data.len() as u32, which silently truncates once self.data exceeds ~4 GB. Arrow's StringViewBuilder avoids this by starting a new block once one fills up (using multiple buffer indices). For typical batch sizes this won't trigger, but it's a correctness hazard worth either documenting as a known limitation or guarding against with a checked cast or capacity assertion.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed, but it should use TryFrom so an error is returned if the data length exceeds u32::MAX, preventing silent truncation and invalid results.

@martin-augment
Owner Author

Minor: Misleading SAFETY comment in finish

// SAFETY: views were constructed with correct lengths, offsets, and
// prefixes. UTF-8 validity has also been checked, if the input was
// tainted.

By the time finish is called, self.tainted is always false — it is reset at the end of every successful append_offset. The comment implies the taint check happens at finish time, but it actually happened inside append_offset. A clearer phrasing would be something like:

UTF-8 validity was checked in append_offset for any row where tainted data (e.g., binary literals) was appended.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The comment is not fully accurate; it should be reworded to explain that the UTF-8 validation already happened in append_offset().
