21238: perf: Optimize split_part for scalar args#298

Open
martin-augment wants to merge 3 commits into main from pr-21238-2026-04-02-19-37-11

Conversation

@martin-augment
Owner

21238: To review by AI

@coderabbitai

coderabbitai bot commented Apr 2, 2026

Walkthrough

This pull request optimizes the split_part function in DataFusion. The implementation adds a fast-path execution for cases where the input string is an array while delimiter and position are scalars. New helper functions using efficient delimiter-finding techniques are introduced. Benchmark code is refactored to separate scalar and array delimiter/position test cases with renamed benchmark tags. Test coverage is expanded with new sqllogictest cases for column-based split_part inputs, including error conditions, multi-character delimiters, negative positions, empty delimiters, and null value handling.


@claude

claude bot commented Apr 2, 2026

Code Review: perf: Optimize split_part for scalar args

Overall this is a well-motivated and cleanly implemented optimization. Building a memmem::Finder/FinderRev once per batch and reusing it across all rows is a genuine win for the common case where delimiter and position are query-time constants. The implementation is correct and the code is readable. A few points worth addressing:


Correctness

Empty-delimiter behavior is consistent ✓
The fast path correctly replicates the existing slow-path logic in split_part_impl (lines 443–463): return the full string for position ±1, empty string otherwise. This matches PostgreSQL semantics and the existing scalar-only tests (split_part('a,b', '', 1) returns a,b).

i64::MIN edge case ✓
The existing unit test passes i64::MIN as position. With the fast path, position.unsigned_abs() - 1 = u64::MAX >> 1, which is a valid usize on 64-bit. The rsplit_nth_finder loop exits early via return None once the delimiter isn't found, so this terminates quickly in practice.

split_nth_finder / rsplit_nth_finder byte indexing is safe ✓
start is always advanced to a position right after a delimiter (a UTF-8 boundary), so &string[start..start+pos] is always a valid UTF-8 slice.
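The semantics checked above (nth field from either end, empty-delimiter special case, out-of-range positions yielding an empty string, the i64::MIN index staying safe) can be sketched in a std-only reference function. This is illustrative only: the PR's actual helpers (split_nth_finder / rsplit_nth_finder) use memchr's memmem::Finder/FinderRev rather than std's split, and the function name here is hypothetical.

```rust
// Std-only sketch of the nth-part contract discussed above; the real PR
// amortizes the delimiter search with memmem::Finder/FinderRev instead.
fn split_part_ref<'a>(s: &'a str, delim: &str, pos: i64) -> &'a str {
    if pos == 0 {
        // The real function errors with "field position must not be zero".
        return "";
    }
    if delim.is_empty() {
        // Empty delimiter: the whole string is the single field at ±1.
        return if pos == 1 || pos == -1 { s } else { "" };
    }
    if pos > 0 {
        s.split(delim).nth((pos - 1) as usize).unwrap_or("")
    } else {
        // pos.unsigned_abs() - 1 is huge for i64::MIN (u64::MAX >> 1), but
        // nth() just exhausts the short iterator and returns None quickly.
        s.rsplit(delim)
            .nth((pos.unsigned_abs() - 1) as usize)
            .unwrap_or("")
    }
}

fn main() {
    assert_eq!(split_part_ref("a.b.c", ".", 2), "b");
    assert_eq!(split_part_ref("a.b.c", ".", -1), "c");
    assert_eq!(split_part_ref("a,b", "", 1), "a,b");
    assert_eq!(split_part_ref("a,b", "", 2), "");
    assert_eq!(split_part_ref("a.b.c", ".", i64::MIN), "");
    println!("ok");
}
```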


Potential Issues

1. try_as_str() returns None for LargeUtf8 scalars (possible gap)

let delimiter = delim_scalar.try_as_str().ok_or_else(|| {
    exec_datafusion_err!("Unsupported delimiter type {:?}", ...)
})?;

If the delimiter scalar is LargeUtf8 (which is a valid string type per the coercion signature), try_as_str() may return None and the fast path would return an error rather than falling through to the slow path. Please verify that ScalarValue::try_as_str() handles all three string scalar variants (Utf8, LargeUtf8, Utf8View). If it doesn't, this silently breaks queries with LargeUtf8 delimiter scalars instead of producing a correct result.

2. No test coverage for LargeUtf8 column type

The new .slt tests cover Utf8 and Utf8View columns, but not LargeUtf8. Since split_part_scalar has a DataType::LargeUtf8 branch, it would be good to add at least one test case:

query T
SELECT split_part(arrow_cast(column1, 'LargeUtf8'), '.', 2)
FROM (VALUES ('a.b.c')) AS t(column1)
----
b

3. Null array uses input string type, not declared return type

return Ok(ColumnarValue::Array(new_null_array(
    string_array.data_type(),  // e.g. Utf8View
    string_array.len(),
)));

return_type() mirrors the first argument's string type, so this is technically consistent. But it's worth a comment linking back to return_type() to make the intent clear for future readers.


Minor / Style

The non-Int64 position error is dead code in practice

ScalarValue::Int64(v) => *v,
other => {
    return exec_err!("Unsupported position type {:?} for split_part", ...);
}

The coercion signature enforces logical_int64() for the position argument, so other can never be reached in a well-formed query. This is fine as defensive programming, but a comment noting this would help reviewers understand why it exists.

position == Some(0) check happens before the null-destructuring let-else

This is correct (and necessary to give a good error before the null check), but the ordering is slightly surprising. A brief comment like // validate before the null-shortcircuit below would aid readability.


Benchmarks

It would be helpful to include before/after benchmark numbers in the PR description (e.g. from cargo bench --bench split_part). Even rough numbers would help reviewers gauge the win and serve as a reference for future performance work.


Summary

  • Fix or verify the try_as_str() gap for LargeUtf8 delimiter scalars (potential correctness issue)
  • Add a LargeUtf8 column test case
  • The core optimization (memmem::Finder amortization) is correct and well-implemented — happy to approve once the above are addressed

@augmentcode

augmentcode bot commented Apr 2, 2026

🤖 Augment PR Summary

Summary: This PR optimizes split_part execution when the input is a string array and both the delimiter and position are scalars.

Changes:

  • Added a scalar fast path in `SplitPartFunc::invoke_with_args` for split_part(array, scalar_delimiter, scalar_position)
  • Implemented a batch-oriented scalar path using memchr::memmem::Finder/FinderRev to amortize delimiter-search setup across rows
  • Introduced helper utilities (map_strings, split_nth_finder, rsplit_nth_finder) to compute nth/rsplit parts efficiently
  • Updated Criterion benchmarks to measure scalar-arg performance separately from array-arg behavior
  • Added sqllogictest coverage for column input with scalar delimiter/position, including multi-char delimiters, negative positions, empty delimiters, Utf8View inputs, and position=0 error behavior

Technical Notes: The fast path preserves existing semantics for empty delimiters and out-of-range positions (returning empty strings), while reducing per-row overhead by reusing substring searchers across the batch.



@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


}
};

if position == Some(0) {


split_part_scalar checks position == Some(0) before handling NULL scalar delimiter/position, so split_part(col, NULL, 0) will now error while the array/array implementation only errors when all three inputs are non-NULL (otherwise it yields NULL). Consider whether the zero-position error should be applied consistently across NULL cases / code paths to avoid surprising behavior differences.

Severity: medium



Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar) it would be good to move the check for position=0 below the check for delimiter=NULL. Prevents a different behavior when the delimiter is NULL but the position is 0.
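The reordering the feedback asks for can be sketched as a small argument-check helper: short-circuit on NULL scalars first, then validate the position, so split_part(col, NULL, 0) yields NULL like the array path instead of erroring. The function and its signature are illustrative, not the PR's actual code.

```rust
// Hypothetical sketch of the suggested check ordering: NULL handling
// before the position == 0 validation, matching the array/array path.
fn check_scalar_args(
    delimiter: Option<&str>,
    position: Option<i64>,
) -> Result<Option<(String, i64)>, String> {
    // NULL short-circuit first: any NULL scalar makes every row NULL.
    let (Some(delim), Some(pos)) = (delimiter, position) else {
        return Ok(None); // caller emits an all-null array
    };
    // Only non-NULL positions are validated.
    if pos == 0 {
        return Err("field position must not be zero".to_string());
    }
    Ok(Some((delim.to_string(), pos)))
}

fn main() {
    // NULL delimiter with position 0: NULL result, not an error.
    assert_eq!(check_scalar_args(None, Some(0)), Ok(None));
    // Non-NULL zero position still errors.
    assert!(check_scalar_args(Some(","), Some(0)).is_err());
    // Normal case passes through.
    assert_eq!(
        check_scalar_args(Some("."), Some(2)),
        Ok(Some((".".to_string(), 2)))
    );
    println!("ok");
}
```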


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the split_part function by implementing a fast path for cases where the delimiter and position are scalar values. The implementation leverages memmem::Finder and memmem::FinderRev to improve search efficiency across Utf8, LargeUtf8, and Utf8View types. Additionally, the PR updates benchmarks to cover these new scenarios and adds regression tests in sqllogictest. Review feedback recommends refining the memory pre-allocation for string builders in the fast path by using array offsets instead of the total buffer length, ensuring better efficiency when processing sliced arrays.

Comment on lines +282 to +285
GenericStringBuilder::<i32>::with_capacity(
arr.len(),
arr.value_data().len(),
),


Severity: medium

In the scalar fast path, using arr.value_data().len() for pre-allocation has the same potential for excessive memory usage when dealing with slices. Using the difference between the last and first offsets provides a precise capacity requirement.

Suggested change
-GenericStringBuilder::<i32>::with_capacity(
-    arr.len(),
-    arr.value_data().len(),
-),
+GenericStringBuilder::<i32>::with_capacity(
+    arr.len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
+),

Owner Author


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.
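The point can be illustrated without arrow-rs: a sliced string array keeps the parent's full value buffer, so value_data().len() over-allocates, while the first/last value offsets of the slice give the precise byte count. Here `offsets` stands in for arrow's value offsets; the function name is illustrative, not arrow-rs API.

```rust
// Std-only illustration of offset-based capacity for a sliced string array.
// `offsets` models arrow's value offsets; `value_data_len` models
// arr.value_data().len() on the (shared) parent buffer.
fn sliced_capacity(offsets: &[i32], slice_start: usize, slice_len: usize) -> usize {
    // Precise byte capacity for the slice: last offset minus first offset.
    (offsets[slice_start + slice_len] - offsets[slice_start]) as usize
}

fn main() {
    // Parent array ["aa", "bbb", "c"] -> offsets [0, 2, 5, 6], 6 data bytes.
    let offsets = [0i32, 2, 5, 6];
    let value_data_len = 6usize;
    // A slice covering only the middle element ("bbb"):
    let precise = sliced_capacity(&offsets, 1, 1);
    assert_eq!(precise, 3); // only 3 bytes actually needed
    assert!(precise < value_data_len); // value_data().len() would over-allocate
    println!("ok");
}
```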

Comment on lines +294 to +297
GenericStringBuilder::<i64>::with_capacity(
arr.len(),
arr.value_data().len(),
),


Severity: medium

Applying the same optimization for LargeUtf8 in the scalar fast path to ensure efficient memory usage for sliced arrays.

Suggested change
-GenericStringBuilder::<i64>::with_capacity(
-    arr.len(),
-    arr.value_data().len(),
-),
+GenericStringBuilder::<i64>::with_capacity(
+    arr.len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
+),

Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


📥 Commits

Reviewing files that changed from the base of the PR and between 2c79756 and bef9f5f.

📒 Files selected for processing (3)
  • datafusion/functions/benches/split_part.rs
  • datafusion/functions/src/string/split_part.rs
  • datafusion/sqllogictest/test_files/expr.slt

Comment on lines +247 to +259
let position = match pos_scalar {
ScalarValue::Int64(v) => *v,
other => {
return exec_err!(
"Unsupported position type {:?} for split_part",
other.data_type()
);
}
};

if position == Some(0) {
return exec_err!("field position must not be zero");
}


⚠️ Potential issue | 🟠 Major

Preserve zero-row batches before rejecting position = 0.

This validation is now batch-global. When string_array.len() == 0, the old array path returned an empty result because no row was evaluated, but the fast path now raises field position must not be zero. That can regress queries after filters or joins produce empty batches.

🐛 Proposed fix
     let position = match pos_scalar {
         ScalarValue::Int64(v) => *v,
         other => {
             return exec_err!(
                 "Unsupported position type {:?} for split_part",
                 other.data_type()
             );
         }
     };
+
+    if string_array.len() == 0 {
+        return Ok(ColumnarValue::Array(new_null_array(
+            string_array.data_type(),
+            0,
+        )));
+    }
 
     if position == Some(0) {
         return exec_err!("field position must not be zero");
     }

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar) it would be good to return an empty result if there are no columns to split at all. Prevents a different behavior when there are no columns to split but the position is 0.

@martin-augment
Owner Author

1. try_as_str() returns None for LargeUtf8 scalars (possible gap)

let delimiter = delim_scalar.try_as_str().ok_or_else(|| {
    exec_datafusion_err!("Unsupported delimiter type {:?}", ...)
})?;

If the delimiter scalar is LargeUtf8 (which is a valid string type per the coercion signature), try_as_str() may return None and the fast path would return an error rather than falling through to the slow path. Please verify that ScalarValue::try_as_str() handles all three string scalar variants (Utf8, LargeUtf8, Utf8View). If it doesn't, this silently breaks queries with LargeUtf8 delimiter scalars instead of producing a correct result.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! ScalarValue::try_as_str() supports LargeUtf8 type!

@martin-augment
Owner Author

2. No test coverage for LargeUtf8 column type

The new .slt tests cover Utf8 and Utf8View columns, but not LargeUtf8. Since split_part_scalar has a DataType::LargeUtf8 branch, it would be good to add at least one test case:

query T
SELECT split_part(arrow_cast(column1, 'LargeUtf8'), '.', 2)
FROM (VALUES ('a.b.c')) AS t(column1)
----
b

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There are no SQL Logic Tests for split_part() with LargeUtf8 values. Prevents leaving code path untested and this could lead to uncaught regressions in the future.
