21538: perf: Optimize NULL handling in StringViewArrayBuilder#311
21538: perf: Optimize NULL handling in StringViewArrayBuilder#311martin-augment wants to merge 3 commits intomainfrom
StringViewArrayBuilder#311Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 37 minutes and 11 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: Optimize NULL handling in
|
🤖 Augment PR SummarySummary: This PR optimizes building Changes:
Technical Notes: UTF-8 validity is only checked for “tainted” inputs (binary/scalar sources) during 🤖 Was this summary useful? React with 👍 or 👎 |
| } | ||
|
|
||
| let v = &self.block; | ||
| let offset = self.data.len() as u32; |
There was a problem hiding this comment.
offset is computed with self.data.len() as u32, which will silently truncate once the accumulated buffer exceeds u32::MAX; since this is later passed into StringViewArray::new_unchecked, it could produce invalid views and potential out-of-bounds reads. Consider enforcing the 32-bit offset invariant with a checked conversion/error (similar to the other builders’ overflow handling).
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed but it could be done with TryFrom to return an error if the data length is bigger than u32::MAX. Prevents silent truncation and invalid results because of this.
| if v.len() > 12 { | ||
| self.data.extend_from_slice(v); | ||
| } | ||
| self.views.push(make_view(v, 0, offset)); |
There was a problem hiding this comment.
For v.len() <= 12, most other make_view call sites pass (0, 0) for (buffer_index, offset) to create a canonical inline view; here the inline case still passes the running offset. If make_view doesn’t fully ignore offset for inlined values, this can make identical short strings hash/compare differently depending on prior appends.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The make_view() helper function is used inconsistently in strings.rs and this is confusing. It would be good to synchronize the call sites or even completely remove the second argument to this helper function if its value is not important.
There was a problem hiding this comment.
Code Review
This pull request refactors StringViewArrayBuilder to manually manage its internal view and data buffers and introduces new guidelines for AI agents in .cursor/rules.md and AGENTS.md. A high-severity issue was identified in the append_offset function, where casting the data buffer length to a u32 could lead to silent truncation and data corruption if the buffer exceeds 4GB.
| } | ||
|
|
||
| let v = &self.block; | ||
| let offset = self.data.len() as u32; |
There was a problem hiding this comment.
The cast self.data.len() as u32 can silently truncate if the total size of strings longer than 12 bytes exceeds 4GB. This would lead to invalid views and data corruption. It is recommended to use try_into() and handle the error, similar to how it's done in StringArrayBuilder.
let offset: u32 = self.data.len().try_into().map_err(|_| {
exec_datafusion_err!("StringView data buffer overflow")
})?;There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed but it could be done with TryFrom to return an error if the data length is bigger than u32::MAX. Prevents silent truncation and invalid results because of this.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The offsets in Arrow are u32, so the cast is needed but it could be done with TryFrom to return an error if the data length is bigger than u32::MAX. Prevents silent truncation and invalid results because of this. |
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The comment is not fully correct and it could be improved to explain that the UTF-8 validation has already happened in append_offset() |
21538: To review by AI