21238: perf: Optimize split_part for scalar args#298

Open
martin-augment wants to merge 3 commits into main from pr-21238-2026-04-02-19-37-11

Conversation

@martin-augment
Owner

21238: To review by AI

@coderabbitai

coderabbitai bot commented Apr 2, 2026

Walkthrough

This pull request optimizes the split_part function in DataFusion. The implementation adds a fast-path execution for cases where the input string is an array while delimiter and position are scalars. New helper functions using efficient delimiter-finding techniques are introduced. Benchmark code is refactored to separate scalar and array delimiter/position test cases with renamed benchmark tags. Test coverage is expanded with new sqllogictest cases for column-based split_part inputs, including error conditions, multi-character delimiters, negative positions, empty delimiters, and null value handling.


@claude

claude bot commented Apr 2, 2026

Code Review: perf: Optimize split_part for scalar args

Overall this is a well-motivated and cleanly implemented optimization. Building a memmem::Finder/FinderRev once per batch and reusing it across all rows is a genuine win for the common case where delimiter and position are query-time constants. The implementation is correct and the code is readable. A few points worth addressing:


Correctness

Empty-delimiter behavior is consistent ✓
The fast path correctly replicates the existing slow-path logic in split_part_impl (lines 443–463): return the full string for position ±1, empty string otherwise. This matches PostgreSQL semantics and the existing scalar-only tests (split_part('a,b', '', 1) returns a,b).

i64::MIN edge case ✓
The existing unit test passes i64::MIN as position. With the fast path, position.unsigned_abs() - 1 = u64::MAX >> 1, which is a valid usize on 64-bit. The rsplit_nth_finder loop exits early via return None once the delimiter isn't found, so this terminates quickly in practice.

split_nth_finder / rsplit_nth_finder byte indexing is safe ✓
start is always advanced to a position right after a delimiter (a UTF-8 boundary), so &string[start..start+pos] is always a valid UTF-8 slice.
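The semantics checked above (nth field from either end, empty-delimiter special case, out-of-range positions yielding an empty string, the i64::MIN index staying safe) can be sketched in a std-only reference function. This is illustrative only: the PR's actual helpers (split_nth_finder / rsplit_nth_finder) use memchr's memmem::Finder/FinderRev rather than std's split, and the function name here is hypothetical.

```rust
// Std-only sketch of the nth-part contract discussed above; the real PR
// amortizes the delimiter search with memmem::Finder/FinderRev instead.
fn split_part_ref<'a>(s: &'a str, delim: &str, pos: i64) -> &'a str {
    if pos == 0 {
        // The real function errors with "field position must not be zero".
        return "";
    }
    if delim.is_empty() {
        // Empty delimiter: the whole string is the single field at ±1.
        return if pos == 1 || pos == -1 { s } else { "" };
    }
    if pos > 0 {
        s.split(delim).nth((pos - 1) as usize).unwrap_or("")
    } else {
        // pos.unsigned_abs() - 1 is huge for i64::MIN (u64::MAX >> 1), but
        // nth() just exhausts the short iterator and returns None quickly.
        s.rsplit(delim)
            .nth((pos.unsigned_abs() - 1) as usize)
            .unwrap_or("")
    }
}

fn main() {
    assert_eq!(split_part_ref("a.b.c", ".", 2), "b");
    assert_eq!(split_part_ref("a.b.c", ".", -1), "c");
    assert_eq!(split_part_ref("a,b", "", 1), "a,b");
    assert_eq!(split_part_ref("a,b", "", 2), "");
    assert_eq!(split_part_ref("a.b.c", ".", i64::MIN), "");
    println!("ok");
}
```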


Potential Issues

1. try_as_str() returns None for LargeUtf8 scalars (possible gap)

let delimiter = delim_scalar.try_as_str().ok_or_else(|| {
    exec_datafusion_err!("Unsupported delimiter type {:?}", ...)
})?;

If the delimiter scalar is LargeUtf8 (which is a valid string type per the coercion signature), try_as_str() may return None and the fast path would return an error rather than falling through to the slow path. Please verify that ScalarValue::try_as_str() handles all three string scalar variants (Utf8, LargeUtf8, Utf8View). If it doesn't, this silently breaks queries with LargeUtf8 delimiter scalars instead of producing a correct result.

2. No test coverage for LargeUtf8 column type

The new .slt tests cover Utf8 and Utf8View columns, but not LargeUtf8. Since split_part_scalar has a DataType::LargeUtf8 branch, it would be good to add at least one test case:

query T
SELECT split_part(arrow_cast(column1, 'LargeUtf8'), '.', 2)
FROM (VALUES ('a.b.c')) AS t(column1)
----
b

3. Null array uses input string type, not declared return type

return Ok(ColumnarValue::Array(new_null_array(
    string_array.data_type(),  // e.g. Utf8View
    string_array.len(),
)));

return_type() mirrors the first argument's string type, so this is technically consistent. But it's worth a comment linking back to return_type() to make the intent clear for future readers.


Minor / Style

The non-Int64 position error is dead code in practice

ScalarValue::Int64(v) => *v,
other => {
    return exec_err!("Unsupported position type {:?} for split_part", ...);
}

The coercion signature enforces logical_int64() for the position argument, so other can never be reached in a well-formed query. This is fine as defensive programming, but a comment noting this would help reviewers understand why it exists.

position == Some(0) check happens before the null-destructuring let-else

This is correct (and necessary to give a good error before the null check), but the ordering is slightly surprising. A brief comment like // validate before the null-shortcircuit below would aid readability.


Benchmarks

It would be helpful to include before/after benchmark numbers in the PR description (e.g. from cargo bench --bench split_part). Even rough numbers would help reviewers gauge the win and serve as a reference for future performance work.


Summary

  • Fix or verify the try_as_str() gap for LargeUtf8 delimiter scalars (potential correctness issue)
  • Add a LargeUtf8 column test case
  • The core optimization (memmem::Finder amortization) is correct and well-implemented — happy to approve once the above are addressed

@augmentcode

augmentcode bot commented Apr 2, 2026

🤖 Augment PR Summary

Summary: This PR optimizes split_part execution when the input is a string array and both the delimiter and position are scalars.

Changes:

  • Added a scalar fast path in `SplitPartFunc::invoke_with_args` for split_part(array, scalar_delimiter, scalar_position)
  • Implemented a batch-oriented scalar path using memchr::memmem::Finder/FinderRev to amortize delimiter-search setup across rows
  • Introduced helper utilities (map_strings, split_nth_finder, rsplit_nth_finder) to compute nth/rsplit parts efficiently
  • Updated Criterion benchmarks to measure scalar-arg performance separately from array-arg behavior
  • Added sqllogictest coverage for column input with scalar delimiter/position, including multi-char delimiters, negative positions, empty delimiters, Utf8View inputs, and position=0 error behavior

Technical Notes: The fast path preserves existing semantics for empty delimiters and out-of-range positions (returning empty strings), while reducing per-row overhead by reusing substring searchers across the batch.



@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


}
};

if position == Some(0) {


split_part_scalar checks position == Some(0) before handling NULL scalar delimiter/position, so split_part(col, NULL, 0) will now error while the array/array implementation only errors when all three inputs are non-NULL (otherwise it yields NULL). Consider whether the zero-position error should be applied consistently across NULL cases / code paths to avoid surprising behavior differences.

Severity: medium



Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar) it would be good to move the check for position=0 below the check for delimiter=NULL. Prevents a different behavior when the delimiter is NULL but the position is 0.
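The reordering the feedback asks for can be sketched as a small argument-check helper: short-circuit on NULL scalars first, then validate the position, so split_part(col, NULL, 0) yields NULL like the array path instead of erroring. The function and its signature are illustrative, not the PR's actual code.

```rust
// Hypothetical sketch of the suggested check ordering: NULL handling
// before the position == 0 validation, matching the array/array path.
fn check_scalar_args(
    delimiter: Option<&str>,
    position: Option<i64>,
) -> Result<Option<(String, i64)>, String> {
    // NULL short-circuit first: any NULL scalar makes every row NULL.
    let (Some(delim), Some(pos)) = (delimiter, position) else {
        return Ok(None); // caller emits an all-null array
    };
    // Only non-NULL positions are validated.
    if pos == 0 {
        return Err("field position must not be zero".to_string());
    }
    Ok(Some((delim.to_string(), pos)))
}

fn main() {
    // NULL delimiter with position 0: NULL result, not an error.
    assert_eq!(check_scalar_args(None, Some(0)), Ok(None));
    // Non-NULL zero position still errors.
    assert!(check_scalar_args(Some(","), Some(0)).is_err());
    // Normal case passes through.
    assert_eq!(
        check_scalar_args(Some("."), Some(2)),
        Ok(Some((".".to_string(), 2)))
    );
    println!("ok");
}
```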


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the split_part function by implementing a fast path for cases where the delimiter and position are scalar values. The implementation leverages memmem::Finder and memmem::FinderRev to improve search efficiency across Utf8, LargeUtf8, and Utf8View types. Additionally, the PR updates benchmarks to cover these new scenarios and adds regression tests in sqllogictest. Review feedback recommends refining the memory pre-allocation for string builders in the fast path by using array offsets instead of the total buffer length, ensuring better efficiency when processing sliced arrays.

Comment on lines +282 to +285
GenericStringBuilder::<i32>::with_capacity(
arr.len(),
arr.value_data().len(),
),


Severity: medium

In the scalar fast path, using arr.value_data().len() for pre-allocation has the same potential for excessive memory usage when dealing with slices. Using the difference between the last and first offsets provides a precise capacity requirement.

Suggested change
-GenericStringBuilder::<i32>::with_capacity(
-    arr.len(),
-    arr.value_data().len(),
-),
+GenericStringBuilder::<i32>::with_capacity(
+    arr.len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
+),

Owner Author


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.
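The point can be illustrated without arrow-rs: a sliced string array keeps the parent's full value buffer, so value_data().len() over-allocates, while the first/last value offsets of the slice give the precise byte count. Here `offsets` stands in for arrow's value offsets; the function name is illustrative, not arrow-rs API.

```rust
// Std-only illustration of offset-based capacity for a sliced string array.
// `offsets` models arrow's value offsets; `value_data_len` models
// arr.value_data().len() on the (shared) parent buffer.
fn sliced_capacity(offsets: &[i32], slice_start: usize, slice_len: usize) -> usize {
    // Precise byte capacity for the slice: last offset minus first offset.
    (offsets[slice_start + slice_len] - offsets[slice_start]) as usize
}

fn main() {
    // Parent array ["aa", "bbb", "c"] -> offsets [0, 2, 5, 6], 6 data bytes.
    let offsets = [0i32, 2, 5, 6];
    let value_data_len = 6usize;
    // A slice covering only the middle element ("bbb"):
    let precise = sliced_capacity(&offsets, 1, 1);
    assert_eq!(precise, 3); // only 3 bytes actually needed
    assert!(precise < value_data_len); // value_data().len() would over-allocate
    println!("ok");
}
```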

Comment on lines +294 to +297
GenericStringBuilder::<i64>::with_capacity(
arr.len(),
arr.value_data().len(),
),


Severity: medium

Applying the same optimization for LargeUtf8 in the scalar fast path to ensure efficient memory usage for sliced arrays.

Suggested change
-GenericStringBuilder::<i64>::with_capacity(
-    arr.len(),
-    arr.value_data().len(),
-),
+GenericStringBuilder::<i64>::with_capacity(
+    arr.len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
+),

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


📥 Commits

Reviewing files that changed from the base of the PR and between 2c79756 and bef9f5f.

📒 Files selected for processing (3)
  • datafusion/functions/benches/split_part.rs
  • datafusion/functions/src/string/split_part.rs
  • datafusion/sqllogictest/test_files/expr.slt

Comment on lines +247 to +259
let position = match pos_scalar {
ScalarValue::Int64(v) => *v,
other => {
return exec_err!(
"Unsupported position type {:?} for split_part",
other.data_type()
);
}
};

if position == Some(0) {
return exec_err!("field position must not be zero");
}


⚠️ Potential issue | 🟠 Major

Preserve zero-row batches before rejecting position = 0.

This validation is now batch-global. When string_array.len() == 0, the old array path returned an empty result because no row was evaluated, but the fast path now raises field position must not be zero. That can regress queries after filters or joins produce empty batches.

🐛 Proposed fix
     let position = match pos_scalar {
         ScalarValue::Int64(v) => *v,
         other => {
             return exec_err!(
                 "Unsupported position type {:?} for split_part",
                 other.data_type()
             );
         }
     };
+
+    if string_array.len() == 0 {
+        return Ok(ColumnarValue::Array(new_null_array(
+            string_array.data_type(),
+            0,
+        )));
+    }
 
     if position == Some(0) {
         return exec_err!("field position must not be zero");
     }

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar) it would be good to return an empty result if there are no columns to split at all. Prevents a different behavior when there are no columns to split but the position is 0.

@martin-augment
Owner Author

1. try_as_str() returns None for LargeUtf8 scalars (possible gap)

let delimiter = delim_scalar.try_as_str().ok_or_else(|| {
    exec_datafusion_err!("Unsupported delimiter type {:?}", ...)
})?;

If the delimiter scalar is LargeUtf8 (which is a valid string type per the coercion signature), try_as_str() may return None and the fast path would return an error rather than falling through to the slow path. Please verify that ScalarValue::try_as_str() handles all three string scalar variants (Utf8, LargeUtf8, Utf8View). If it doesn't, this silently breaks queries with LargeUtf8 delimiter scalars instead of producing a correct result.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! ScalarValue::try_as_str() supports LargeUtf8 type!

@martin-augment
Owner Author

2. No test coverage for LargeUtf8 column type

The new .slt tests cover Utf8 and Utf8View columns, but not LargeUtf8. Since split_part_scalar has a DataType::LargeUtf8 branch, it would be good to add at least one test case:

query T
SELECT split_part(arrow_cast(column1, 'LargeUtf8'), '.', 2)
FROM (VALUES ('a.b.c')) AS t(column1)
----
b

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There are no SQL Logic Tests for split_part() with LargeUtf8 values. Prevents leaving code path untested and this could lead to uncaught regressions in the future.
