21119: perf: Optimize split_part, support Utf8View #292

Open
martin-augment wants to merge 3 commits into main from pr-21119-2026-03-25-08-59-13

Conversation

@martin-augment
Owner

21119: To review by AI

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the split_part string function by introducing performance optimizations, particularly for single-character delimiters, and extending its compatibility to support Utf8View data types. The changes involve refactoring the core implementation, adding specialized helper functions for efficient string splitting, and updating benchmarks and SQL logic tests to validate the new functionality and performance gains.

Highlights

  • Performance Optimization: Optimized the split_part function, especially for single-character delimiters, by introducing specialized split_nth and rsplit_nth helper functions that leverage faster character-based splitting.
  • Utf8View Support: Extended the split_part function to natively support Utf8View data types for both input strings and output results, improving efficiency by avoiding unnecessary conversions.
  • Code Refactoring: Refactored the core split_part_impl function to be more generic, accepting various string array types and builders, and utilizing a macro for cleaner dispatch logic based on input data types.
  • Enhanced Benchmarks: Updated the split_part benchmarks with a new helper function to reduce code duplication and added new test cases specifically for Utf8View inputs and various delimiter scenarios to validate performance improvements and new functionality.
  • New SQL Logic Tests: Introduced new SQL logic tests to thoroughly verify the correct behavior of split_part with Utf8View inputs, covering different positions, empty delimiters, and ensuring Utf8View output for Utf8View input.
@coderabbitai

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 132bf44 and bad1a74.

📒 Files selected for processing (3)
  • datafusion/functions/benches/split_part.rs
  • datafusion/functions/src/string/split_part.rs
  • datafusion/sqllogictest/test_files/string/string_view.slt

Walkthrough

The changes add Utf8View support to the split_part function. The implementation now preserves DataType::Utf8View when the input string is Utf8View instead of coercing the result to Utf8. The dispatch logic is refactored into a two-stage approach: the first stage selects on the string array type, the second on the delimiter array type. New split_nth and rsplit_nth helper functions extract the nth part. Benchmarks now invoke the function through ScalarUDF via a new helper, and new SQL logic tests validate split_part behavior on Utf8View inputs.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the split_part function to support Utf8View arrays, introducing optimized splitting logic for single-character delimiters and updating the benchmarking and SQL logic tests. The review comments suggest further optimizing memory allocation in the split_part_impl function by pre-allocating GenericStringBuilder with estimated capacity for Utf8 and LargeUtf8 types.

Comment on lines +164 to +167
DataType::Utf8 => split_part_for_delimiter_type!(
    &args[0].as_string::<i32>(),
    GenericStringBuilder::<i32>::new()
),
medium

To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.

            DataType::Utf8 => {
                let str_arr = &args[0].as_string::<i32>();
                split_part_for_delimiter_type!(
                    str_arr,
                    GenericStringBuilder::<i32>::with_capacity(
                        inferred_length,
                        str_arr.value_data().len(),
                    )
                )
            },

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity by using the string array values length. Prevents memory resizing when adding items to the builder.

Comment on lines +168 to +171
DataType::LargeUtf8 => split_part_for_delimiter_type!(
    &args[0].as_string::<i64>(),
    GenericStringBuilder::<i64>::new()
),
medium

To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.

Suggested change

-           DataType::LargeUtf8 => split_part_for_delimiter_type!(
-               &args[0].as_string::<i64>(),
-               GenericStringBuilder::<i64>::new()
-           ),
+           DataType::LargeUtf8 => {
+               let str_arr = &args[0].as_string::<i64>();
+               split_part_for_delimiter_type!(
+                   str_arr,
+                   GenericStringBuilder::<i64>::with_capacity(
+                       inferred_length,
+                       str_arr.value_data().len(),
+                   )
+               )
+           },

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity by using the string array values length. Prevents memory resizing when adding items to the builder.

@claude

claude bot commented Mar 25, 2026

Code Review

This PR makes two distinct improvements: (1) adding native Utf8View output support (previously Utf8View inputs were silently downcast to Utf8 on output), and (2) a performance optimization via char-based splitting for single-byte delimiters. The overall direction is good. A few things worth discussing:

split_nth comment could be clearer (the code is correct)

The logic in split_nth / rsplit_nth is sound: &str is guaranteed to be valid UTF-8, so len() == 1 implies the single byte is in 0x00..=0x7F (ASCII), and the as char cast is valid. However, the comment "A single-byte UTF-8 string is always ASCII" is unusual phrasing. A clearer alternative: "a 1-byte valid UTF-8 sequence is always an ASCII code point (U+0000 to U+007F), so the byte-to-char cast is safe." This prevents future maintainers from wondering whether the cast is unsafe for non-ASCII delimiters; such delimiters are at least two bytes long and never reach the cast.
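
The fast path described above can be sketched with the standard library alone. This is an illustration, not the PR's code: the names split_nth and rsplit_nth come from the review, but the signatures and bodies below are assumptions, and the empty-delimiter case is assumed to be handled by the caller.

```rust
/// Sketch of a `split_nth`-style helper (signature assumed, not taken from the PR).
/// Returns the `n`-th (0-based) part of `s`, counting from the front.
/// Assumes `delim` is non-empty; the caller is expected to handle "" separately.
fn split_nth<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    if delim.len() == 1 {
        // A 1-byte valid UTF-8 sequence is always an ASCII code point
        // (U+0000 to U+007F), so the byte-to-char cast is lossless.
        let c = delim.as_bytes()[0] as char;
        s.split(c).nth(n)
    } else {
        // Multi-byte delimiters (including single non-ASCII characters)
        // take the general `&str` pattern path.
        s.split(delim).nth(n)
    }
}

/// Same idea, counting parts from the end of the string.
fn rsplit_nth<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    if delim.len() == 1 {
        let c = delim.as_bytes()[0] as char;
        s.rsplit(c).nth(n)
    } else {
        s.rsplit(delim).nth(n)
    }
}
```

Because both iterators are lazy, `.nth(n)` stops scanning as soon as the requested part is found, which is why a low position on a long string is the cheap case.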

Inconsistent capacity hints on builders

StringViewBuilder gets with_capacity(inferred_length), but GenericStringBuilder gets new() without any capacity hint:

DataType::Utf8View  => StringViewBuilder::with_capacity(inferred_length)  // has capacity
DataType::Utf8      => GenericStringBuilder::<i32>::new()                  // no capacity
DataType::LargeUtf8 => GenericStringBuilder::<i64>::new()                  // no capacity

For consistency and to avoid reallocations on large arrays, consider pre-allocating the generic builders too, e.g. GenericStringBuilder::<i32>::with_capacity(inferred_length, 0).
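
To make the capacity argument concrete, here is a std-only sketch. TinyStringBuilder is a hypothetical stand-in for an offsets-plus-values string builder, not Arrow's GenericStringBuilder, and first_parts is an illustrative helper. It relies on one observation: every output of split_part is a substring of its input, so the total input byte length is an upper bound on the output byte length.

```rust
/// Hypothetical std-only stand-in for an offsets + values string builder.
struct TinyStringBuilder {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl TinyStringBuilder {
    /// Reserve room for `items` strings totalling at most `data_bytes` bytes.
    fn with_capacity(items: usize, data_bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(items + 1);
        offsets.push(0);
        Self { offsets, values: Vec::with_capacity(data_bytes) }
    }

    fn append_value(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }
}

/// Takes the first part of each input, sizing the builder from the input.
/// Returns the builder and the values capacity it started with.
fn first_parts(inputs: &[&str], delim: char) -> (TinyStringBuilder, usize) {
    // Upper bound on output bytes: each part is a substring of its input.
    let data_bytes: usize = inputs.iter().map(|s| s.len()).sum();
    let mut builder = TinyStringBuilder::with_capacity(inputs.len(), data_bytes);
    let initial_capacity = builder.values.capacity();
    for s in inputs {
        builder.append_value(s.split(delim).next().unwrap_or(""));
    }
    (builder, initial_capacity)
}
```

If the values capacity is unchanged after all appends, the buffer was never reallocated. The trade-off is that the bound can over-allocate when the selected parts are much shorter than the inputs.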

return_type special-casing needs a comment

The new Utf8View branch in return_type should explain why it is needed — i.e., that utf8_to_str_type does not handle the Utf8View -> Utf8View mapping. Without that context, a future refactor might remove it thinking it is redundant.
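
A commented special case of this kind might look like the sketch below. StrType and generic_str_type are hypothetical stand-ins for Arrow's DataType and the shared helper; the helper's behavior of collapsing Utf8View to Utf8 is assumed from the review's description, not copied from the DataFusion source.

```rust
/// Hypothetical stand-in for the three string variants of Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Assumed behavior of the shared helper: it keeps the offset width
/// but does not map Utf8View to itself.
fn generic_str_type(input: StrType) -> StrType {
    match input {
        StrType::LargeUtf8 => StrType::LargeUtf8,
        StrType::Utf8 | StrType::Utf8View => StrType::Utf8,
    }
}

/// Return type of a split_part-like function.
fn split_part_return_type(input: StrType) -> StrType {
    match input {
        // Special case: the shared helper would downcast Utf8View to Utf8,
        // so handle it here to preserve the view type on output.
        // Do not remove this branch as "redundant".
        StrType::Utf8View => StrType::Utf8View,
        other => generic_str_type(other),
    }
}
```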

Removed benchmark scenarios

The PR drops two scenarios that stress-tested edge cases:

  • pos_last (position = last part, worst case for .nth() early-exit) — useful to confirm no regression in the non-early-exit path
  • long_strings_50_parts/pos_first — the original comment explicitly called this the "worst case for old implementation"

Consider keeping at least a utf8_single_char/pos_last case to detect future regressions.

Minor positives

  • Converting from try_for_each to a plain for loop improves readability; the return exec_err!(...) still propagates correctly from the function.
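  • Using #[expect(clippy::too_many_arguments)] over #[allow] is the right choice: if the lint stops triggering, the expectation is reported as unfulfilled (the unfulfilled_lint_expectations warning, an error only when warnings are denied), which prevents stale suppressions.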
  • The macro to collapse delimiter-type dispatch is a clean deduplication.
  • Utf8View output type preservation is correct and well-tested by the new .slt cases.

Summary: The core logic is correct. Main actionable items: (1) give GenericStringBuilder a capacity hint for consistency, (2) add an explanatory comment in return_type, (3) restore a pos_last benchmark scenario.

@augmentcode

augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: This PR optimizes the split_part string UDF and extends it to preserve Utf8View output for Utf8View inputs.

Changes:

  • Refactors `split_part` type dispatch to reduce match complexity and select builders based on the input string type.
  • Adds fast paths (`split_nth`/`rsplit_nth`) that use `str::split(char)` / `str::rsplit(char)` for single-byte delimiters.
  • Updates the UDF return type logic so `Utf8View` inputs return `Utf8View` outputs.
  • Reworks the Criterion benchmarks to use a shared helper and adds `Utf8View` benchmark scenarios.
  • Adds SQL logic tests covering `split_part` behavior with `Utf8View`, including empty delimiter and error-on-zero-position cases.

Technical Notes: The implementation keeps PostgreSQL-like semantics for empty delimiters (only positions 1 / -1 return data) and errors when the requested position is zero.
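
The semantics in the technical notes can be sketched for a single scalar input as follows. This is a stand-in for illustration, not the DataFusion UDF: the error message text is invented, and returning an empty string for out-of-range positions is an assumption based on PostgreSQL's behavior.

```rust
/// Scalar sketch of split_part semantics (not the DataFusion implementation).
/// `pos` is 1-based; negative values count from the end.
fn split_part<'a>(s: &'a str, delim: &str, pos: i64) -> Result<&'a str, String> {
    if pos == 0 {
        // Error text is illustrative only.
        return Err("field position must not be zero".to_string());
    }
    if delim.is_empty() {
        // Empty delimiter: the whole string is the only field,
        // so only positions 1 and -1 return data.
        return Ok(if pos == 1 || pos == -1 { s } else { "" });
    }
    let part = if pos > 0 {
        s.split(delim).nth((pos - 1) as usize)
    } else {
        s.rsplit(delim).nth((pos.unsigned_abs() - 1) as usize)
    };
    // Assumption: out-of-range positions yield an empty string.
    Ok(part.unwrap_or(""))
}
```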


@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

func.invoke_with_args(ScalarFunctionArgs {
    args: args.clone(),
    arg_fields: arg_fields.clone(),
    number_rows: N_ROWS,
bench_split_part hard-codes N_ROWS for both positions and number_rows, which can silently desync from the actual strings/delimiters length if this helper is reused with a different row count (potentially truncating via zip). Consider deriving/passing the row count to keep the benchmark inputs consistent.

Severity: low
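
The truncation risk mentioned in this comment comes from how zip behaves: it stops at the shorter input without reporting a mismatch. A minimal std-only sketch follows; both function names are hypothetical and are not part of the benchmark code.

```rust
/// Counts the rows a plain `zip` would actually visit.
/// `zip` stops at the shorter input, so a mismatch is silently truncated.
fn zipped_rows(strings: &[&str], positions: &[i64]) -> usize {
    strings.iter().zip(positions.iter()).count()
}

/// Hypothetical helper: derive the row count from the inputs instead of
/// a shared constant, and reject inputs whose lengths disagree.
fn checked_row_count(
    strings: &[&str],
    delimiters: &[&str],
    positions: &[i64],
) -> Result<usize, String> {
    let n = strings.len();
    if delimiters.len() != n || positions.len() != n {
        return Err(format!(
            "length mismatch: {} strings, {} delimiters, {} positions",
            n,
            delimiters.len(),
            positions.len()
        ));
    }
    Ok(n)
}
```

Deriving the count in one place means the value passed as number_rows cannot drift from the arrays being benchmarked.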

    .enumerate()
    .map(|(idx, arg)| Field::new(format!("arg_{idx}"), arg.data_type(), true).into())
    .collect();
let return_type = match args[0].data_type() {
The benchmark return_type falls back to DataType::Utf8 for all non-Utf8View inputs; if a LargeUtf8 case is added later, this would produce an incorrect return_field and could make invoke_with_args error or benchmark the wrong path.

Severity: low

Owner Author

value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! The non-Utf8View case is handled by crate::utils::utf8_to_str_type() which does support both Utf8 and LargeUtf8 types.
