21420: perf: Optimize `split_part` for `Utf8View` by martin-augment · Pull Request #304 · martin-augment/datafusion

martin-augment · 2026-04-07T12:05:37Z

21420: To review by AI

coderabbitai · 2026-04-07T12:05:55Z

Walkthrough

This pull request makes changes across documentation and functionality. It adds AI agent constraints to .cursor/rules.md and AGENTS.md regarding code review restrictions. Additionally, it extends the split_part function implementation in datafusion to support Utf8View data types. This includes refactoring the scalar fast path with zero-copy view references, introducing new helper functions substr_view and split_view_loop, adding a new benchmark case for Utf8View with very long parts, and expanding test coverage with SQL logic tests for split_part operations on Utf8View columns.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pr-21420-2026-04-07-12-05-33

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-04-07T12:08:27Z

Code Review: Zero-copy `split_part` optimization for `Utf8View`

The core idea is sound — constructing StringView entries that reference the original array's data buffers avoids copying bytes and is a legitimate performance win for long strings. The implementation is generally correct. A few things worth addressing:

Bug Risk: Missing test coverage for the non-inline (long string) path

The SQL logic tests in string_view.slt only use short strings: "Andrew" (6B), "Xiangpeng" (9B), "Raphael" (7B) — all ≤ 12 bytes. For strings ≤ 12 bytes, StringViewArray stores them inline in the view itself, so substr_view always takes the make_view(substr.as_bytes(), 0, 0) path. The new buffer-pointer path (view.buffer_index + start_offset) is only exercised for strings > 12 bytes, but there are no correctness tests for that path.

The benchmark does use long strings (gen_string_array(N_ROWS, 5, 256, ".", true)), but benchmarks don't catch incorrect results. Please add a logic test with strings longer than 12 bytes to validate the zero-copy buffer-reference path.

Correctness: `start_offset as u32` silent truncation

In split_view_loop:

let start_offset = substr.as_ptr() as usize - string.as_ptr() as usize;
views_buf.push(substr_view(raw_view, substr, start_offset as u32));

The as u32 cast silently truncates on overflow. While this can't overflow in practice (Arrow buffer offsets are u32-bounded, so start_offset < string.len() <= u32::MAX), it would be cleaner to use a checked cast or at minimum a debug_assert!(start_offset <= u32::MAX as usize) to document the invariant and catch any future regression.

Minor: `position > i64::MAX / 2` edge case in try_into

The try_into::<usize>() on position - 1 for the positive-index branch can only fail if position == 0 (but the branch is position > 0) or if position - 1 > usize::MAX (impossible on 64-bit). The error message "exceeds maximum supported value" is fine but the path is effectively dead code. Not a bug, just FYI.

Minor: Unrelated files in the PR

.cursor/rules.md (new file) and the AGENTS.md additions are unrelated to the split_part optimization. These should be in a separate PR or removed, as they add noise to the diff.

What looks good

The unsafe block in split_part_scalar_view has a clear safety comment. The invariant holds: null entries use 0u128 (valid since they're masked by the null bitmap), return_input views are copied verbatim from valid input views, and substr_view constructs views pointing to valid subranges of the original buffers.
The substr_view function correctly handles the inline vs. non-inline cases: substrings of inline strings (≤ 12B originals) can never exceed 12B themselves, so ByteView::from(*original_view) is never called on an inline view.
Reusing split_nth_finder/rsplit_nth_finder via closures is clean — the lifetime ensures returned subslices genuinely point into the input string, making the pointer arithmetic in split_view_loop sound.
Empty-delimiter and out-of-bounds-position behaviors are correctly handled and tested.

augmentcode · 2026-04-07T12:10:20Z

🤖 Augment PR Summary

Summary: This PR optimizes split_part for Utf8View inputs by adding a zero-copy scalar fast path.

Changes:

Adds a dedicated split_part_scalar_view implementation that builds StringViewArray results by reusing the input’s underlying buffers
Introduces helpers (substr_view, split_view_loop) to construct substring views and apply pre-built memmem finders
Updates dispatch so Utf8View + scalar delimiter/position uses the new view-based implementation
Adds a benchmark case for very long Utf8View parts (256 bytes) to validate/track the optimization
Adds sqllogictest coverage for split_part on Utf8View columns (positive/negative positions, empty delimiter, delimiter-not-found) and verifies Utf8View output type

Technical Notes: The new path uses StringViewArray::new_unchecked to avoid copying substring bytes, relying on view/offset invariants for safety.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-04-07T12:10:21Z

datafusion/functions/src/string/split_part.rs

+        let string = string_view_array.value(i);
+        match split_fn(string) {
+            Some(substr) => {
+                let start_offset = substr.as_ptr() as usize - string.as_ptr() as usize;


split_view_loop computes start_offset via raw pointer subtraction and then truncates it with as u32; this relies on split_fn always returning a slice of string and the offset always fitting in u32, otherwise the derived view could become invalid (especially since the result is built with StringViewArray::new_unchecked). Consider adding a defensive assertion and/or documenting this invariant more explicitly to prevent accidental unsoundness if this helper is reused/modified later.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is not correct! Arrow types use u32 for offsets, so there is no point to use bigger types for offset related functionality.

gemini-code-assist

Code Review

This pull request introduces a zero-copy fast path for the split_part function when processing StringViewArray inputs, leveraging Utf8View to avoid redundant memory allocations. The update includes new performance benchmarks and SQL logic tests to ensure correctness. Feedback was provided to address duplicated instruction lines added to the AGENTS.md documentation.

gemini-code-assist · 2026-04-07T12:11:10Z

AGENTS.md

+The AI agents should never link to any issue or a pull request
+in any GitHub repository in the code reviews!
+
+The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
+
+The AI agents should never link to any issue or a pull request
+in any GitHub repository in the code reviews!
+
+The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
+


The added instructions for AI agents are duplicated. Please remove the redundant lines to keep the documentation clean.

Suggested change

The AI agents should never link to any issue or a pull request

in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

The AI agents should never link to any issue or a pull request

in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

The AI agents should never link to any issue or a pull request

in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 014d100. Configure here.}

cursor · 2026-04-07T12:17:59Z

AGENTS.md

+in any GitHub repository in the code reviews!
+
+The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
+


Unrelated duplicate config rules in performance optimization PR

Low Severity

This PR includes AI agent configuration rules in AGENTS.md and a new .cursor/rules.md. These changes appear unrelated to the PR's performance optimization goal, and the rules in AGENTS.md are duplicated.

^{Reviewed by Cursor Bugbot for commit 014d100. Configure here.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

datafusion/sqllogictest/test_files/string/string_view.slt (1)
957-1020: Cover the non-inline Utf8View branch in SQL tests.

All rows in test are at most 9 bytes, so these new assertions only hit the inline branch of substr_view in datafusion/functions/src/string/split_part.rs:496-506. Adding one Utf8View column case whose selected part is > 12 bytes would give correctness coverage for the new buffer-backed path too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/sqllogictest/test_files/string/string_view.slt` around lines 957 -
1020, Add a SQL test case that exercises the non-inline (buffer-backed) Utf8View
branch by ensuring at least one row in the test table produces a selected
split_part result longer than 12 bytes; update
datafusion/sqllogictest/test_files/string/string_view.slt to include a row with
column1_utf8view containing a piece >12 bytes (e.g., a long segment separated by
your chosen delimiter) and add assertions that call split_part(column1_utf8view,
'<delimiter>', <index>) so the returned substring is >12 bytes, thereby covering
the substr_view buffer-backed path referenced in split_part.rs (substr_view).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@datafusion/sqllogictest/test_files/string/string_view.slt`:
- Around line 957-1020: Add a SQL test case that exercises the non-inline
(buffer-backed) Utf8View branch by ensuring at least one row in the test table
produces a selected split_part result longer than 12 bytes; update
datafusion/sqllogictest/test_files/string/string_view.slt to include a row with
column1_utf8view containing a piece >12 bytes (e.g., a long segment separated by
your chosen delimiter) and add assertions that call split_part(column1_utf8view,
'<delimiter>', <index>) so the returned substring is >12 bytes, thereby covering
the substr_view buffer-backed path referenced in split_part.rs (substr_view).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2919c08-8f3b-4cfb-b11d-17eeaf426ac8

📥 Commits

Reviewing files that changed from the base of the PR and between 2e13a2a and 014d100.

📒 Files selected for processing (5)

.cursor/rules.md
AGENTS.md
datafusion/functions/benches/split_part.rs
datafusion/functions/src/string/split_part.rs
datafusion/sqllogictest/test_files/string/string_view.slt

coderabbitai · 2026-04-07T12:20:51Z

datafusion/functions/src/string/split_part.rs

+    // Safety: each view is either copied unchanged from the input, or built
+    // by `substr_view` from a substring that is a contiguous sub-range of the
+    // original string value stored in the input's data buffers.
+    unsafe {
+        Ok(Arc::new(StringViewArray::new_unchecked(
+            views_buf,
+            string_view_array.data_buffers().to_vec(),
+            nulls,
+        )) as ArrayRef)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd datafusion && find . -name "split_part.rs" -type f

Repository: martin-augment/datafusion

Length of output: 139

🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '470,510p'

Repository: martin-augment/datafusion

Length of output: 1787

🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '280,290p'

Repository: martin-augment/datafusion

Length of output: 636

🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '420,450p'

Repository: martin-augment/datafusion

Length of output: 1334

🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '432,520p'

Repository: martin-augment/datafusion

Length of output: 3782

🏁 Script executed:

# Let's search for the make_view function to understand how views are created rg -n "fn make_view" datafusion/functions/src/string/split_part.rs -A 10

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's check if there are any tests for split_part with Utf8View rg -n "split_part.*View\|Utf8View.*split_part" datafusion/functions/src/string/split_part.rs -B 2 -A 5

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's look at the StringViewArray implementation to understand buffer management find . -path "*/arrow*" -name "*.rs" | head -20

Repository: martin-augment/datafusion

Length of output: 511

🏁 Script executed:

# Search for StringViewArray in the codebase to understand how buffers are managed fd StringViewArray datafusion -t f 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Look at how StringViewArray::new_unchecked is used elsewhere rg "StringViewArray::new_unchecked" --type rs | head -20

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

rg "StringViewArray::new_unchecked" --type-list

Repository: martin-augment/datafusion

Length of output: 458

🏁 Script executed:

rg "StringViewArray::new_unchecked" datafusion 2>/dev/null | head -20

Repository: martin-augment/datafusion

Length of output: 492

🏁 Script executed:

# Look at make_view function rg -n "fn make_view" datafusion/functions/src/string/split_part.rs -A 8

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check the full split_view_loop and how it uses substr_view cat -n datafusion/functions/src/string/split_part.rs | sed -n '509,550p'

Repository: martin-augment/datafusion

Length of output: 1644

🏁 Script executed:

# Let's look at the entire file to understand structure better wc -l datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 118

🏁 Script executed:

# Check if there are any optimizations or checks for inline-only results elsewhere rg -n "data_buffers\|buffer_index" datafusion/functions/src/string/split_part.rs -B 1 -A 1

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Find the make_view function definition rg -n "fn make_view" datafusion/functions/src/string/ -A 15

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check ByteView struct and buffer_index behavior rg -n "struct ByteView\|buffer_index" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Look at how other StringViewArray::new_unchecked calls handle buffers (from earlier search results) cat -n datafusion/functions/src/string/common.rs | sed -n '1,50p' | grep -A 20 "StringViewArray::new_unchecked"

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Get more context around the StringViewArray::new_unchecked call in common.rs rg -n "StringViewArray::new_unchecked" datafusion/functions/src/string/common.rs -B 10 -A 5

Repository: martin-augment/datafusion

Length of output: 649

🏁 Script executed:

# Check the unicode/substr.rs usage rg -n "StringViewArray::new_unchecked" datafusion/functions/src/unicode/substr.rs -B 10 -A 5

Repository: martin-augment/datafusion

Length of output: 649

🏁 Script executed:

# Look for any buffer optimization or filtering in the codebase rg -n "is_empty\|filter.*buffer\|retain" datafusion/functions/src/string/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Search for make_view function more broadly rg -n "fn make_view" datafusion --type rs -A 10 2>/dev/null | head -40

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Search in arrow crate or related files find . -name "*.rs" -exec grep -l "fn make_view" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 115

🏁 Script executed:

# Let's look for ByteView definition and buffer_index meaning rg -n "buffer_index.*0\|inline" datafusion/functions/src/string/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check what buffer_index = 0 means in StringView/ByteView context rg -n "buffer_index" datafusion -A 2 -B 2 --type rs 2>/dev/null | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's check if there's any optimization in Arrow's StringViewArray that handles unused buffers find . -path "*/arrow*" -name "*.rs" -exec grep -l "StringViewArray\|data_buffers" {} \; 2>/dev/null | head -10

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check the comments and safety documentation around buffer_index rg -n "buffer_index = 0\|inline.*buffer\|12 bytes" datafusion/functions/src/string/ -B 3 -A 3 | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Search for make_view definition or import at the top of split_part.rs cat -n datafusion/functions/src/string/split_part.rs | head -100 | grep -i "make_view\|use\|import"

Repository: martin-augment/datafusion

Length of output: 791

🏁 Script executed:

# Look at the top of the file for imports and module structure head -50 datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 2279

🏁 Script executed:

# Check if make_view is defined inline within split_part.rs itself rg "make_view" datafusion/functions/src/string/split_part.rs -n

Repository: martin-augment/datafusion

Length of output: 270

🏁 Script executed:

# Let's check arrow crate directly find . -path "*arrow-array*" -name "*.rs" -exec grep -l "StringView" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if this is using arrow_string crate rg -n "arrow.string\|arrow_string" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's understand what make_view does by looking at arrow crate # First find arrow crate location find . -name "arrow" -type d | grep -E "^.*arrow$|/arrow$" | head -5

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Look for tests or examples using StringViewArray and inlined data rg -n "inline\|buffer_index.*0\|data_buffers" datafusion/functions/src/string/common.rs -B 3 -A 3 | head -80

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if there's a way to construct StringViewArray with empty buffers rg -n "StringViewArray.*new\|data_buffers" datafusion -A 2 -B 2 --type rs 2>/dev/null | grep -A 5 -B 5 "to_vec\|Vec::new" | head -100

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's look for the definition of StringViewArray and understand if it requires buffers even when unused rg -n "pub.*fn data_buffers\|pub.*fn new_unchecked" datafusion --type rs 2>/dev/null | head -20

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Test case: check if there are tests for split_part that might show the expected behavior rg -n "split_part" datafusion/functions/src/string/split_part.rs | grep -i "test\|#\[test\]" | head -10

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Find test module tail -200 datafusion/functions/src/string/split_part.rs | head -100

Repository: martin-augment/datafusion

Length of output: 3420

🏁 Script executed:

# Let's look at the arrow crate to understand make_view signature and buffer_index semantics find . -name "*.rs" -path "*/arrow*" -exec grep -l "pub.*fn make_view" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if there's documentation about StringView inline behavior rg -n "inline\|12 bytes\|small string" datafusion/functions/src/unicode/substr.rs -B 2 -A 2 | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Look at substr.rs to see how it handles the same pattern cat -n datafusion/functions/src/unicode/substr.rs | sed -n '200,350p'

Repository: martin-augment/datafusion

Length of output: 6304

🏁 Script executed:

# Check if there's any existing code that filters empty buffers rg -n "filter.*buffer\|empty.*buffer\|retain" datafusion/functions/src/string/ --type rs | head -20

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

# Search for any memory optimization patterns in similar code rg -n "data_buffers()" datafusion/functions/src/ -B 5 -A 5 --type rs | head -100

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

# Let's check the make_and_append_view function in substr.rs to see the full pattern rg -n "fn make_and_append_view" datafusion/functions/src/unicode/substr.rs -A 20

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's understand ByteView and buffer_index semantics better rg -n "struct ByteView\|buffer_index" datafusion -A 3 --type rs 2>/dev/null | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if inline views (buffer_index=0) need the data_buffers rg -n "buffer_index.*=.*0\|inline.*buffer" datafusion/functions/src/ -B 2 -A 2 --type rs | head -80

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

# Let's look for any comments or documentation about view inlining rg -n "12 bytes\|inline\|small string optimization" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if there's any test showing the concern - large input split into small parts rg -n "long\|large.*small\|bytes.*split" datafusion/functions/benches/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's examine the benchmark file to see if there are tests for the scenario mentioned cat -n datafusion/functions/benches/split_part.rs | head -100

Repository: martin-augment/datafusion

Length of output: 4499

🏁 Script executed:

# Check if there's documentation or issues about StringView memory management rg -n "StringView\|buffer.*retain\|memory.*issue" datafusion/functions/src/string/ --type rs -B 1 -A 1 | head -100

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

# Look at the actual implementation to confirm the inline behavior (buffer_index = 0) rg -n "make_view.*0.*0\|inline.*view" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Check if there are any TODOs or FIXMEs related to buffer management rg -n "TODO\|FIXME\|XXX\|HACK" datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Verify the make_view function signature from arrow python3 << 'EOF' # Check if we can find documentation about make_view and buffer_index semantics import subprocess result = subprocess.run(['rg', '-n', 'pub.*fn make_view', 'datafusion', '--type', 'rs'], capture_output=True, text=True) print("make_view definitions:") print(result.stdout if result.stdout else "Not found in datafusion") print(result.stderr if result.stderr else "") # Also check in arrow if available result2 = subprocess.run(['rg', '-n', 'fn make_view', '.'], capture_output=True, text=True) print("\nAll make_view occurrences:") lines = result2.stdout.split('\n')[:20] for line in lines: print(line) EOF

Repository: martin-augment/datafusion

Length of output: 248

🏁 Script executed:

# Let's look at the complete bench file to see if there are scenarios matching the concern cat -n datafusion/functions/benches/split_part.rs | tail -100

Repository: martin-augment/datafusion

Length of output: 3905

🏁 Script executed:

# Let's check what happens in other places where similar patterns occur cat -n datafusion/functions/src/string/common.rs | sed -n '200,250p'

Repository: martin-augment/datafusion

Length of output: 2178

🏁 Script executed:

# Let's trace through the logic more carefully - check split_view_loop to see all paths cat -n datafusion/functions/src/string/split_part.rs | sed -n '509,536p'

Repository: martin-augment/datafusion

Length of output: 1120

🏁 Script executed:

# Critical check: does buffer_index = 0 mean inline in StringView? python3 << 'EOF' # Let's search for documentation or comments about buffer_index in views import subprocess # Search for buffer_index documentation/comments result = subprocess.run(['rg', '-n', 'buffer_index', '.', '-B', '3', '-A', '3'], capture_output=True, text=True) lines = result.stdout.split('\n') # Filter to relevant lines only relevant = [line for line in lines if '0' in line and ('inline' in line.lower() or 'view' in line.lower())] print("Lines mentioning buffer_index with 0 and inline/view:") for line in relevant[:30]: if line.strip(): print(line) EOF

Repository: martin-augment/datafusion

Length of output: 3920

🏁 Script executed:

# Let's verify the actual concern - check if there's a scenario where all results could be inline cat -n datafusion/functions/src/string/split_part.rs | sed -n '454,490p'

Repository: martin-augment/datafusion

Length of output: 1826

🏁 Script executed:

# Check if there's any filtering or optimization of buffers anywhere in the code rg -n "filter.*buffer\|drain.*buffer\|empty.*buffer\|Vec::new.*buffer" datafusion/functions/src/ -B 2 -A 2 | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Let's check if StringViewArray can be created with empty buffers or fewer buffers rg -n "Vec::new()\|Vec::with_capacity" datafusion/functions/src/string/split_part.rs -B 1 -A 1

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Search for documentation on inline vs external buffer views rg -n "buffer_index.*==.*0\|inline.*string\|small.*string" datafusion/functions/src/string/ -B 2 -A 2 | head -80

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

# Final check: Are there any related issues or TODOs about buffer retention? rg -n "TODO\|FIXME\|buffer.*retain\|memory.*issue\|optimization" datafusion/functions/src/string/ | grep -i "buffer\|memory\|inline"

Repository: martin-augment/datafusion

Length of output: 51

Don't retain source Utf8View buffers when the result is fully inline.

Line 487 always clones string_view_array.data_buffers(), even when substr_view in lines 496–506 has inlined every emitted part. On workloads with long Utf8View inputs split into short tokens (< 12 bytes), this retains the original large buffers for the output array's entire lifetime even though no rows actually reference them. Consider special-casing the all-inline scenario to avoid unnecessary buffer retention. This pattern also appears in substr.rs (lines 320–325) and common.rs (lines 236–240).

martin-augment · 2026-04-07T13:12:33Z

Bug Risk: Missing test coverage for the non-inline (long string) path

The SQL logic tests in string_view.slt only use short strings: "Andrew" (6B), "Xiangpeng" (9B), "Raphael" (7B) — all ≤ 12 bytes. For strings ≤ 12 bytes, StringViewArray stores them inline in the view itself, so substr_view always takes the make_view(substr.as_bytes(), 0, 0) path. The new buffer-pointer path (view.buffer_index + start_offset) is only exercised for strings > 12 bytes, but there are no correctness tests for that path.

The benchmark does use long strings (gen_string_array(N_ROWS, 5, 256, ".", true)), but benchmarks don't catch incorrect results. Please add a logic test with strings longer than 12 bytes to validate the zero-copy buffer-reference path.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The StringView type handles short strings (less than 13 bytes) differently than long ones (longer than 12 bytes). It would be good to use test data for both lengths to cover all paths.

martin-augment · 2026-04-07T13:15:21Z

957-1020: Cover the non-inline Utf8View branch in SQL tests.

All rows in test are at most 9 bytes, so these new assertions only hit the inline branch of substr_view in datafusion/functions/src/string/split_part.rs:496-506. Adding one Utf8View column case whose selected part is > 12 bytes would give correctness coverage for the new buffer-backed path too.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The StringView type handles short strings (less than 13 bytes) differently than long ones (longer than 12 bytes). It would be good to use test data for both lengths to cover all paths.

neilconway and others added 2 commits April 6, 2026 17:52

.

5dd9597

Add CLAUDE.md & AGENTS.md

014d100

augmentcode bot reviewed Apr 7, 2026

View reviewed changes

gemini-code-assist bot reviewed Apr 7, 2026

View reviewed changes

cursor bot reviewed Apr 7, 2026

View reviewed changes

coderabbitai bot reviewed Apr 7, 2026

View reviewed changes

		in any GitHub repository in the code reviews!

		The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Conversation

martin-augment commented Apr 7, 2026

Uh oh!

coderabbitai bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

claude bot commented Apr 7, 2026

Code Review: Zero-copy split_part optimization for Utf8View

Bug Risk: Missing test coverage for the non-inline (long string) path

Correctness: start_offset as u32 silent truncation

Minor: position > i64::MAX / 2 edge case in try_into

Minor: Unrelated files in the PR

What looks good

Uh oh!

augmentcode bot commented Apr 7, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Apr 7, 2026

Choose a reason for hiding this comment

Unrelated duplicate config rules in performance optimization PR

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Apr 7, 2026

Bug Risk: Missing test coverage for the non-inline (long string) path

Uh oh!

martin-augment commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai bot commented Apr 7, 2026 •

edited

Loading

Code Review: Zero-copy `split_part` optimization for `Utf8View`

Correctness: `start_offset as u32` silent truncation

Minor: `position > i64::MAX / 2` edge case in try_into