21119: perf: Optimize split_part, support Utf8View#292
21119: perf: Optimize split_part, support Utf8View#292martin-augment wants to merge 3 commits intomainfrom
split_part, support Utf8View#292Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughThe changes add ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the split_part function to support Utf8View arrays, introducing optimized splitting logic for single-character delimiters and updating the benchmarking and SQL logic tests. The review comments suggest further optimizing memory allocation in the split_part_impl function by pre-allocating GenericStringBuilder with estimated capacity for Utf8 and LargeUtf8 types.
| DataType::Utf8 => split_part_for_delimiter_type!( | ||
| &args[0].as_string::<i32>(), | ||
| GenericStringBuilder::<i32>::new() | ||
| ), |
There was a problem hiding this comment.
To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.
DataType::Utf8 => {
let str_arr = &args[0].as_string::<i32>();
split_part_for_delimiter_type!(
str_arr,
GenericStringBuilder::<i32>::with_capacity(
inferred_length,
str_arr.value_data().len(),
)
)
},There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity by using the string array values length. Prevents memory resizing when adding items to the builder.
| DataType::LargeUtf8 => split_part_for_delimiter_type!( | ||
| &args[0].as_string::<i64>(), | ||
| GenericStringBuilder::<i64>::new() | ||
| ), |
There was a problem hiding this comment.
To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.
| DataType::LargeUtf8 => split_part_for_delimiter_type!( | |
| &args[0].as_string::<i64>(), | |
| GenericStringBuilder::<i64>::new() | |
| ), | |
| DataType::LargeUtf8 => { | |
| let str_arr = &args[0].as_string::<i64>(); | |
| split_part_for_delimiter_type!( | |
| str_arr, | |
| GenericStringBuilder::<i64>::with_capacity( | |
| inferred_length, | |
| str_arr.value_data().len(), | |
| ) | |
| ) | |
| }, |
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity by using the string array values length. Prevents memory resizing when adding items to the builder.
Code ReviewThis PR makes two distinct improvements: (1) adding native
|
🤖 Augment PR SummarySummary: This PR optimizes the Changes:
Technical Notes: The implementation keeps PostgreSQL-like semantics for empty delimiters (only positions 1 / -1 return data) and errors when the requested position is zero. 🤖 Was this summary useful? React with 👍 or 👎 |
| func.invoke_with_args(ScalarFunctionArgs { | ||
| args: args.clone(), | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: N_ROWS, |
There was a problem hiding this comment.
bench_split_part hard-codes N_ROWS for both positions and number_rows, which can silently desync from the actual strings/delimiters length if this helper is reused with a different row count (potentially truncating via zip). Consider deriving/passing the row count to keep the benchmark inputs consistent.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| .enumerate() | ||
| .map(|(idx, arg)| Field::new(format!("arg_{idx}"), arg.data_type(), true).into()) | ||
| .collect(); | ||
| let return_type = match args[0].data_type() { |
There was a problem hiding this comment.
The benchmark return_type falls back to DataType::Utf8 for all non-Utf8View inputs; if a LargeUtf8 case is added later, this would produce an incorrect return_field and could make invoke_with_args error or benchmark the wrong path.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! The non-Utf8View case is handled by crate::utils::utf8_to_str_type() which does support both Utf8 and LargeUtf8 types.
21119: To review by AI