21119: perf: Optimize `split_part`, support `Utf8View` by martin-augment · Pull Request #292 · martin-augment/datafusion

martin-augment · 2026-03-25T08:59:21Z

21119: To review by AI

gemini-code-assist · 2026-03-25T08:59:42Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the split_part string function by introducing performance optimizations, particularly for single-character delimiters, and extending its compatibility to support Utf8View data types. The changes involve refactoring the core implementation, adding specialized helper functions for efficient string splitting, and updating benchmarks and SQL logic tests to validate the new functionality and performance gains.

Highlights

Performance Optimization: Optimized the split_part function, especially for single-character delimiters, by introducing specialized split_nth and rsplit_nth helper functions that leverage faster character-based splitting.
Utf8View Support: Extended the split_part function to natively support Utf8View data types for both input strings and output results, improving efficiency by avoiding unnecessary conversions.
Code Refactoring: Refactored the core split_part_impl function to be more generic, accepting various string array types and builders, and utilizing a macro for cleaner dispatch logic based on input data types.
Enhanced Benchmarks: Updated the split_part benchmarks with a new helper function to reduce code duplication and added new test cases specifically for Utf8View inputs and various delimiter scenarios to validate performance improvements and new functionality.
New SQL Logic Tests: Introduced new SQL logic tests to thoroughly verify the correct behavior of split_part with Utf8View inputs, covering different positions, empty delimiters, and ensuring Utf8View output for Utf8View input.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-03-25T08:59:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 132bf44 and bad1a74.

📒 Files selected for processing (3)

datafusion/functions/benches/split_part.rs
datafusion/functions/src/string/split_part.rs
datafusion/sqllogictest/test_files/string/string_view.slt

Walkthrough

The changes add Utf8View support to the split_part function. The implementation now preserves DataType::Utf8View when the input string is Utf8View instead of coercing to UTF-8. The dispatch logic is refactored to use a two-stage approach based on string and delimiter array types, with new split_nth and rsplit_nth helper functions for extracting nth parts. Benchmarks are updated to use ScalarUDF-based invocation via a new helper, while test coverage is expanded with SQL logic tests validating split_part behavior on Utf8View inputs.


gemini-code-assist

Code Review

This pull request refactors the split_part function to support Utf8View arrays, introducing optimized splitting logic for single-character delimiters and updating the benchmarking and SQL logic tests. The review comments suggest further optimizing memory allocation in the split_part_impl function by pre-allocating GenericStringBuilder with estimated capacity for Utf8 and LargeUtf8 types.

gemini-code-assist · 2026-03-25T09:01:24Z

datafusion/functions/src/string/split_part.rs

+            DataType::Utf8 => split_part_for_delimiter_type!(
+                &args[0].as_string::<i32>(),
+                GenericStringBuilder::<i32>::new()
+            ),


To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.

Suggested change

    DataType::Utf8 => {
        let str_arr = &args[0].as_string::<i32>();
        split_part_for_delimiter_type!(
            str_arr,
            GenericStringBuilder::<i32>::with_capacity(
                inferred_length,
                str_arr.value_data().len(),
            )
        )
    },

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity, using the length of the string array's value data. This prevents reallocations as items are appended to the builder.

gemini-code-assist · 2026-03-25T09:01:24Z

datafusion/functions/src/string/split_part.rs

+            DataType::LargeUtf8 => split_part_for_delimiter_type!(
+                &args[0].as_string::<i64>(),
+                GenericStringBuilder::<i64>::new()
+            ),


To optimize memory allocation, you can pre-allocate the GenericStringBuilder with both item and data capacity. The data capacity can be estimated from the input array's value data length.

Suggested change

-            DataType::LargeUtf8 => split_part_for_delimiter_type!(
-                &args[0].as_string::<i64>(),
-                GenericStringBuilder::<i64>::new()
-            ),
+            DataType::LargeUtf8 => {
+                let str_arr = &args[0].as_string::<i64>();
+                split_part_for_delimiter_type!(
+                    str_arr,
+                    GenericStringBuilder::<i64>::with_capacity(
+                        inferred_length,
+                        str_arr.value_data().len(),
+                    )
+                )
+            },

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The result string builder could be initialized with the expected capacity, using the length of the string array's value data. This prevents reallocations as items are appended to the builder.

claude · 2026-03-25T09:03:18Z

Code Review

This PR makes two distinct improvements: (1) adding native Utf8View output support (previously Utf8View inputs were silently downcast to Utf8 on output), and (2) a performance optimization via char-based splitting for single-byte delimiters. The overall direction is good. A few things worth discussing:

`split_nth` comment could be clearer (the code is correct)

The logic in split_nth / rsplit_nth is sound — &str is guaranteed valid UTF-8, so len() == 1 implies the single byte is in 0x00..=0x7F (ASCII), and the as char cast is valid. However the comment "A single-byte UTF-8 string is always ASCII" is unusual phrasing. A clearer alternative: "a 1-byte valid UTF-8 sequence is always an ASCII code point (U+0000 to U+007F), so the byte-to-char cast is safe." This prevents future maintainers from wondering if the cast is unsafe for non-ASCII delimiters.
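The fast path discussed above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the signatures, the 0-based `n`, and the assumption that the caller has already handled empty delimiters are all hypothetical.

```rust
/// Returns the `n`-th (0-based) piece of `s` split on `delim`, counting
/// from the left. Hypothetical sketch; assumes `delim` is non-empty.
fn split_nth<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    if delim.len() == 1 {
        // A 1-byte valid UTF-8 sequence is always an ASCII code point
        // (U+0000 to U+007F), so the byte-to-char cast is safe.
        let c = delim.as_bytes()[0] as char;
        s.split(c).nth(n)
    } else {
        // Multi-byte delimiters take the general substring search.
        s.split(delim).nth(n)
    }
}

/// Same as `split_nth`, but counts pieces from the right.
fn rsplit_nth<'a>(s: &'a str, delim: &str, n: usize) -> Option<&'a str> {
    if delim.len() == 1 {
        let c = delim.as_bytes()[0] as char;
        s.rsplit(c).nth(n)
    } else {
        s.rsplit(delim).nth(n)
    }
}
```

A non-ASCII delimiter such as "é" is two bytes long, so it never reaches the cast and is handled by the general branch.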

Inconsistent capacity hints on builders

StringViewBuilder gets with_capacity(inferred_length), but GenericStringBuilder gets new() without any capacity hint:

DataType::Utf8View  => StringViewBuilder::with_capacity(inferred_length)  // has capacity
DataType::Utf8      => GenericStringBuilder::<i32>::new()                  // no capacity
DataType::LargeUtf8 => GenericStringBuilder::<i64>::new()                  // no capacity

For consistency and to avoid reallocations on large arrays, consider pre-allocating the generic builders too, e.g. GenericStringBuilder::<i32>::with_capacity(inferred_length, 0).
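To illustrate why both an item capacity and a data capacity are useful, here is a toy builder written against the standard library only. `ToyStringBuilder` and `first_fields` are hypothetical stand-ins, not Arrow's `GenericStringBuilder`; the sketch only assumes the usual layout of one offsets buffer plus one contiguous byte buffer.

```rust
/// Toy stand-in for an Arrow-style string builder (not the real type):
/// one offsets buffer plus one contiguous byte buffer.
struct ToyStringBuilder {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl ToyStringBuilder {
    /// Reserves room for `item_capacity` strings and `data_capacity` bytes.
    fn with_capacity(item_capacity: usize, data_capacity: usize) -> Self {
        let mut offsets = Vec::with_capacity(item_capacity + 1);
        offsets.push(0);
        Self { offsets, data: Vec::with_capacity(data_capacity) }
    }

    fn append_value(&mut self, v: &str) {
        self.data.extend_from_slice(v.as_bytes());
        self.offsets.push(self.data.len() as i32);
    }
}

/// Extracts the first comma-separated field of every input. Each output
/// is a substring of its input, so the total input byte length is an
/// upper bound on the output data and no reallocation is needed.
fn first_fields(inputs: &[&str]) -> ToyStringBuilder {
    let total_bytes: usize = inputs.iter().map(|s| s.len()).sum();
    let mut builder = ToyStringBuilder::with_capacity(inputs.len(), total_bytes);
    for s in inputs {
        builder.append_value(s.split(',').next().unwrap_or(""));
    }
    builder
}
```

The input byte length can overestimate the output considerably, so this trades some memory for fewer buffer growths; whether that trade is worthwhile for `split_part` is a judgment call the benchmarks would have to settle.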

`return_type` special-casing needs a comment

The new Utf8View branch in return_type should explain why it is needed — i.e., that utf8_to_str_type does not handle the Utf8View -> Utf8View mapping. Without that context, a future refactor might remove it thinking it is redundant.
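A small sketch of the kind of special case and comment being asked for. The `DataType` enum and `generic_str_type` below are hypothetical stand-ins; in particular, the assumption that the generic helper reports `Utf8` for view inputs is taken from the review comment, not verified against the real `utf8_to_str_type`.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int64,
}

/// Hypothetical stand-in for a shared helper that has no
/// `Utf8View -> Utf8View` mapping (assumed behavior).
fn generic_str_type(input: &DataType) -> Result<DataType, String> {
    match input {
        DataType::LargeUtf8 => Ok(DataType::LargeUtf8),
        DataType::Utf8 | DataType::Utf8View => Ok(DataType::Utf8),
        other => Err(format!("unsupported input type {:?}", other)),
    }
}

fn return_type(input: &DataType) -> Result<DataType, String> {
    match input {
        // Not redundant: the generic helper would report Utf8 here,
        // but the implementation produces a view array for view input.
        DataType::Utf8View => Ok(DataType::Utf8View),
        other => generic_str_type(other),
    }
}
```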

Removed benchmark scenarios

The PR drops two scenarios that stress-tested edge cases:

pos_last (position = last part, worst case for .nth() early-exit) — useful to confirm no regression in the non-early-exit path
long_strings_50_parts/pos_first — the original comment explicitly called this the "worst case for old implementation"

Consider keeping at least a utf8_single_char/pos_last case to detect future regressions.

Minor positives

Converting from try_for_each to a plain for loop improves readability; the return exec_err!(...) still propagates correctly from the function.
Using #[expect(clippy::too_many_arguments)] over #[allow] is the right choice: if the lint stops triggering, the unfulfilled expectation produces a warning (a build failure where warnings are denied), preventing stale suppressions.
The macro to collapse delimiter-type dispatch is a clean deduplication.
Utf8View output type preservation is correct and well-tested by the new .slt cases.
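The readability point about `try_for_each` versus a plain `for` loop can be seen in a reduced example. Both functions below are hypothetical and only validate positions; they are not the PR's loop body.

```rust
/// Plain `for` loop: `return` exits the enclosing function directly.
fn validate_with_loop(positions: &[i64]) -> Result<usize, String> {
    let mut checked = 0;
    for &pos in positions {
        if pos == 0 {
            return Err("field position must not be zero".to_string());
        }
        checked += 1;
    }
    Ok(checked)
}

/// Closure-based equivalent: `return` only leaves the closure, so the
/// error has to be forwarded out of `try_for_each` with `?`.
fn validate_with_try_for_each(positions: &[i64]) -> Result<usize, String> {
    let mut checked = 0;
    positions.iter().try_for_each(|&pos| {
        if pos == 0 {
            return Err("field position must not be zero".to_string());
        }
        checked += 1;
        Ok(())
    })?;
    Ok(checked)
}
```

Both versions stop at the first zero position and return the same results; the loop form simply makes the control flow visible without the closure and the trailing `?`.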

Summary: The core logic is correct. Main actionable items: (1) give GenericStringBuilder a capacity hint for consistency, (2) add an explanatory comment in return_type, (3) restore a pos_last benchmark scenario.

augmentcode · 2026-03-25T09:03:28Z

🤖 Augment PR Summary

Summary: This PR optimizes the split_part string UDF and extends it to preserve Utf8View output for Utf8View inputs.

Changes:

Refactors `split_part` type dispatch to reduce match complexity and select builders based on the input string type.
Adds fast paths (`split_nth`/`rsplit_nth`) that use `str::split(char)` / `str::rsplit(char)` for single-byte delimiters.
Updates the UDF return type logic so `Utf8View` inputs return `Utf8View` outputs.
Reworks the Criterion benchmarks to use a shared helper and adds `Utf8View` benchmark scenarios.
Adds SQL logic tests covering `split_part` behavior with `Utf8View`, including empty delimiter and error-on-zero-position cases.

Technical Notes: The implementation keeps PostgreSQL-like semantics for empty delimiters (only positions 1 / -1 return data) and errors when the requested position is zero.
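The semantics in the note above can be written down as a scalar function over a single row. This is a sketch under assumptions: the function name and error message are illustrative, and returning an empty string for out-of-range positions follows PostgreSQL's documented behavior rather than anything stated in this thread.

```rust
/// Scalar sketch of `split_part` semantics for one row (hypothetical).
/// `pos` is 1-based; negative values count from the end.
fn split_part<'a>(s: &'a str, delim: &str, pos: i64) -> Result<&'a str, String> {
    if pos == 0 {
        return Err("field position must not be zero".to_string());
    }
    if delim.is_empty() {
        // Empty delimiter: the whole string is the only field, so only
        // positions 1 and -1 return data.
        return Ok(if pos == 1 || pos == -1 { s } else { "" });
    }
    let part = if pos > 0 {
        s.split(delim).nth((pos - 1) as usize)
    } else {
        s.rsplit(delim).nth((pos.unsigned_abs() - 1) as usize)
    };
    // Assumption: out-of-range positions yield an empty string.
    Ok(part.unwrap_or(""))
}
```

For example, `split_part("a,b,c", ",", -1)` yields the last field, while `split_part("abc", "", 2)` yields an empty string because an empty delimiter leaves a single field.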


augmentcode

Review completed. 2 suggestions posted.


augmentcode · 2026-03-25T09:03:29Z

datafusion/functions/benches/split_part.rs

+                func.invoke_with_args(ScalarFunctionArgs {
+                    args: args.clone(),
+                    arg_fields: arg_fields.clone(),
+                    number_rows: N_ROWS,


bench_split_part hard-codes N_ROWS for both positions and number_rows, which can silently desync from the actual strings/delimiters length if this helper is reused with a different row count (potentially truncating via zip). Consider deriving/passing the row count to keep the benchmark inputs consistent.

Severity: low
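The truncation risk mentioned in this comment comes from how `zip` behaves on inputs of different lengths. The sketch below uses plain slices and hypothetical helper names rather than the benchmark's Arrow arrays.

```rust
/// Pairs each string with a position and counts the resulting rows.
/// `zip` stops at the shorter input, so a length mismatch silently
/// drops rows instead of failing.
fn paired_rows(strings: &[&str], positions: &[i64]) -> usize {
    strings.iter().zip(positions.iter()).count()
}

/// Hypothetical helper: derive the positions column from the actual
/// string column so the two lengths cannot drift apart.
fn positions_for(strings: &[&str], pos: i64) -> Vec<i64> {
    vec![pos; strings.len()]
}
```

Deriving the row count from the input, as `positions_for` does, removes the dependency on a separate constant; passing the count explicitly would work as well.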


augmentcode · 2026-03-25T09:03:29Z

datafusion/functions/benches/split_part.rs

+        .enumerate()
+        .map(|(idx, arg)| Field::new(format!("arg_{idx}"), arg.data_type(), true).into())
+        .collect();
+    let return_type = match args[0].data_type() {


The benchmark return_type falls back to DataType::Utf8 for all non-Utf8View inputs; if a LargeUtf8 case is added later, this would produce an incorrect return_field and could make invoke_with_args error or benchmark the wrong path.

Severity: low


value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! The non-Utf8View case is handled by crate::utils::utf8_to_str_type() which does support both Utf8 and LargeUtf8 types.

neilconway added 3 commits March 23, 2026 11:44

Revise benchmarks for split_part

5467531

split_part: Optimize, cleanup, support utf8view

d044aff

Fix clippy

bad1a74
