21238: perf: Optimize split_part for scalar args #298
martin-augment wants to merge 3 commits into main.

Conversation
Walkthrough: This pull request optimizes the split_part function for scalar delimiter and position arguments.
Code Review:
🤖 Augment PR Summary — This PR optimizes split_part for scalar delimiter and position arguments.
Technical Notes: The fast path preserves existing semantics for empty delimiters and out-of-range positions (returning empty strings), while reducing per-row overhead by reusing substring searchers across the batch.
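The semantics described in the summary can be sketched with a std-only scalar reference. This is a hypothetical helper for illustration only; the actual fast path reuses `memmem::Finder`/`FinderRev` from the memchr crate rather than `str::split`, and the empty-delimiter behavior below is an assumption taken from the summary:

```rust
// Hypothetical scalar reference for the split_part semantics described above:
// positions are 1-based, negative positions count from the end, and
// out-of-range positions or an empty delimiter yield an empty string
// (assumptions based on the PR summary, not the actual implementation).
fn split_part(s: &str, delim: &str, pos: i64) -> String {
    // position == 0 is rejected earlier in the real code path
    assert!(pos != 0, "field position must not be zero");
    if delim.is_empty() {
        return String::new();
    }
    let fields: Vec<&str> = s.split(delim).collect();
    let n = fields.len() as i64;
    let idx = if pos > 0 { pos - 1 } else { n + pos };
    if idx < 0 || idx >= n {
        String::new()
    } else {
        fields[idx as usize].to_string()
    }
}

fn main() {
    assert_eq!(split_part("a|b|c", "|", 2), "b");
    assert_eq!(split_part("a|b|c", "|", -1), "c");
    assert_eq!(split_part("a|b|c", "|", 5), "");
    println!("ok");
}
```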
```rust
    }
};

if position == Some(0) {
```
split_part_scalar checks position == Some(0) before handling NULL scalar delimiter/position, so split_part(col, NULL, 0) will now error while the array/array implementation only errors when all three inputs are non-NULL (otherwise it yields NULL). Consider whether the zero-position error should be applied consistently across NULL cases / code paths to avoid surprising behavior differences.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar), the check for position=0 should move below the check for delimiter=NULL. This prevents divergent behavior when the delimiter is NULL but the position is 0.
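The suggested ordering can be sketched as follows (a hypothetical `validate` helper, not the PR's actual code): NULL inputs short-circuit to a NULL result before the zero-position error fires, matching the array/array path.

```rust
// Hypothetical sketch of the validation order suggested above.
// Returns Ok(None) when the result should be NULL, Ok(Some(())) when
// splitting may proceed, and Err(..) for an invalid position.
fn validate(delimiter: Option<&str>, position: Option<i64>) -> Result<Option<()>, String> {
    // Check for NULL inputs first, as the reviewer suggests ...
    if delimiter.is_none() || position.is_none() {
        return Ok(None); // NULL in, NULL out: no error raised
    }
    // ... and only then reject a zero position.
    if position == Some(0) {
        return Err("field position must not be zero".to_string());
    }
    Ok(Some(()))
}

fn main() {
    // NULL delimiter with position 0 yields NULL instead of an error.
    assert_eq!(validate(None, Some(0)), Ok(None));
    // Non-NULL inputs with position 0 still error.
    assert!(validate(Some("|"), Some(0)).is_err());
    println!("ok");
}
```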
Code Review
This pull request introduces a performance optimization for the split_part function by implementing a fast path for cases where the delimiter and position are scalar values. The implementation leverages memmem::Finder and memmem::FinderRev to improve search efficiency across Utf8, LargeUtf8, and Utf8View types. Additionally, the PR updates benchmarks to cover these new scenarios and adds regression tests in sqllogictest. Review feedback recommends refining the memory pre-allocation for string builders in the fast path by using array offsets instead of the total buffer length, ensuring better efficiency when processing sliced arrays.
```rust
GenericStringBuilder::<i32>::with_capacity(
    arr.len(),
    arr.value_data().len(),
),
```
In the scalar fast path, using arr.value_data().len() for pre-allocation has the same potential for excessive memory usage when dealing with slices. Using the difference between the last and first offsets provides a precise capacity requirement.
```diff
 GenericStringBuilder::<i32>::with_capacity(
     arr.len(),
-    arr.value_data().len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
 ),
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.
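The over-allocation can be illustrated with plain offset arithmetic (a standalone sketch; `slice_byte_len` is a hypothetical helper, not an arrow-rs API):

```rust
// For an offset-based string array, row i occupies bytes
// offsets[i]..offsets[i + 1] of the shared value buffer. A slice of the
// array therefore references only offsets[start + len] - offsets[start]
// bytes, which can be far less than the full buffer length.
fn slice_byte_len(offsets: &[i32], start: usize, len: usize) -> usize {
    (offsets[start + len] - offsets[start]) as usize
}

fn main() {
    // Three values "aa", "bbb", "c" sharing one 6-byte buffer.
    let offsets = [0, 2, 5, 6];
    // Full array: 6 bytes, as value_data().len() would report.
    assert_eq!(slice_byte_len(&offsets, 0, 3), 6);
    // A 1-row slice of the middle value needs only 3 bytes.
    assert_eq!(slice_byte_len(&offsets, 1, 1), 3);
    println!("ok");
}
```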
```rust
GenericStringBuilder::<i64>::with_capacity(
    arr.len(),
    arr.value_data().len(),
),
```
Applying the same optimization for LargeUtf8 in the scalar fast path to ensure efficient memory usage for sliced arrays.
```diff
 GenericStringBuilder::<i64>::with_capacity(
     arr.len(),
-    arr.value_data().len(),
+    (arr.value_offset(arr.len()) - arr.value_offset(0)).to_usize().unwrap(),
 ),
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocated capacity does not take into account that the array might be a slice and uses its full length. It should use the offsets to calculate the actual length.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@datafusion/functions/src/string/split_part.rs`:
- Around line 247-259: The current global validation of pos_scalar (position)
rejects position == 0 even for empty batches, causing empty-result batches to
error; update the check around position (the pos_scalar -> position match and
the if position == Some(0) test) to only enforce "field position must not be
zero" when the input batch actually contains rows (e.g., guard the zero check
with a check on the input string array length such as string_array.len() > 0 or
equivalent), so empty arrays preserve an empty result instead of raising an
error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 671c13f3-bb34-4f57-bcec-e60e8d570b44
📒 Files selected for processing (3)
- datafusion/functions/benches/split_part.rs
- datafusion/functions/src/string/split_part.rs
- datafusion/sqllogictest/test_files/expr.slt
```rust
let position = match pos_scalar {
    ScalarValue::Int64(v) => *v,
    other => {
        return exec_err!(
            "Unsupported position type {:?} for split_part",
            other.data_type()
        );
    }
};

if position == Some(0) {
    return exec_err!("field position must not be zero");
}
```
Preserve zero-row batches before rejecting position = 0.
This validation is now batch-global. When string_array.len() == 0, the old array path returned an empty result because no row was evaluated, but the fast path now raises field position must not be zero. That can regress queries after filters or joins produce empty batches.
🐛 Proposed fix

```diff
 let position = match pos_scalar {
     ScalarValue::Int64(v) => *v,
     other => {
         return exec_err!(
             "Unsupported position type {:?} for split_part",
             other.data_type()
         );
     }
 };
+
+if string_array.len() == 0 {
+    return Ok(ColumnarValue::Array(new_null_array(
+        string_array.data_type(),
+        0,
+    )));
+}
 if position == Some(0) {
     return exec_err!("field position must not be zero");
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/functions/src/string/split_part.rs` around lines 247-259: the
current global validation of pos_scalar (position)
for empty batches, causing empty-result batches to error; update the check
around position (the pos_scalar -> position match and the if position == Some(0)
test) to only enforce "field position must not be zero" when the input batch
actually contains rows (e.g., guard the zero check with a check on the input
string array length such as string_array.len() > 0 or equivalent), so empty
arrays preserve an empty result instead of raising an error.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! To keep the behavior consistent between the slow path (Array) and fast path (Scalar) it would be good to return an empty result if there are no columns to split at all. Prevents a different behavior when there are no columns to split but the position is 0.
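The guard the reviewers agree on can be sketched as follows (a hypothetical `check_position` helper; names do not mirror the PR):

```rust
// Hypothetical sketch of the empty-batch guard discussed above: the
// zero-position error is only raised when the batch actually has rows,
// so an empty input stays an empty result instead of erroring.
fn check_position(position: i64, num_rows: usize) -> Result<(), String> {
    if num_rows == 0 {
        return Ok(()); // empty batch: return an empty result, never an error
    }
    if position == 0 {
        return Err("field position must not be zero".to_string());
    }
    Ok(())
}

fn main() {
    assert!(check_position(0, 0).is_ok()); // empty batch survives position 0
    assert!(check_position(0, 3).is_err()); // non-empty batch still errors
    println!("ok");
}
```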
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! ScalarValue::try_as_str() supports LargeUtf8 type!
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There are no SQL Logic Tests for split_part() with LargeUtf8 values. Prevents leaving code path untested and this could lead to uncaught regressions in the future.