Skip to content

21420: perf: Optimize split_part for Utf8View#304

Open
martin-augment wants to merge 2 commits intomainfrom
pr-21420-2026-04-07-12-05-33
Open

21420: perf: Optimize split_part for Utf8View#304
martin-augment wants to merge 2 commits intomainfrom
pr-21420-2026-04-07-12-05-33

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

21420: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 7, 2026

Walkthrough

This pull request makes changes across documentation and functionality. It adds AI agent constraints to .cursor/rules.md and AGENTS.md regarding code review restrictions. Additionally, it extends the split_part function implementation in datafusion to support Utf8View data types. This includes refactoring the scalar fast path with zero-copy view references, introducing new helper functions substr_view and split_view_loop, adding a new benchmark case for Utf8View with very long parts, and expanding test coverage with SQL logic tests for split_part operations on Utf8View columns.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-21420-2026-04-07-12-05-33

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Apr 7, 2026

Code Review: Zero-copy split_part optimization for Utf8View

The core idea is sound — constructing StringView entries that reference the original array's data buffers avoids copying bytes and is a legitimate performance win for long strings. The implementation is generally correct. A few things worth addressing:


Bug Risk: Missing test coverage for the non-inline (long string) path

The SQL logic tests in string_view.slt only use short strings: "Andrew" (6B), "Xiangpeng" (9B), "Raphael" (7B) — all ≤ 12 bytes. For strings ≤ 12 bytes, StringViewArray stores them inline in the view itself, so substr_view always takes the make_view(substr.as_bytes(), 0, 0) path. The new buffer-pointer path (view.buffer_index + start_offset) is only exercised for strings > 12 bytes, but there are no correctness tests for that path.

The benchmark does use long strings (gen_string_array(N_ROWS, 5, 256, ".", true)), but benchmarks don't catch incorrect results. Please add a logic test with strings longer than 12 bytes to validate the zero-copy buffer-reference path.


Correctness: start_offset as u32 silent truncation

In split_view_loop:

let start_offset = substr.as_ptr() as usize - string.as_ptr() as usize;
views_buf.push(substr_view(raw_view, substr, start_offset as u32));

The as u32 cast silently truncates on overflow. While this can't overflow in practice (Arrow buffer offsets are u32-bounded, so start_offset < string.len() <= u32::MAX), it would be cleaner to use a checked cast or at minimum a debug_assert!(start_offset <= u32::MAX as usize) to document the invariant and catch any future regression.


Minor: position > i64::MAX / 2 edge case in try_into

The try_into::<usize>() on position - 1 for the positive-index branch can only fail if position == 0 (but the branch is position > 0) or if position - 1 > usize::MAX (impossible on 64-bit). The error message "exceeds maximum supported value" is fine but the path is effectively dead code. Not a bug, just FYI.


Minor: Unrelated files in the PR

.cursor/rules.md (new file) and the AGENTS.md additions are unrelated to the split_part optimization. These should be in a separate PR or removed, as they add noise to the diff.


What looks good

  • The unsafe block in split_part_scalar_view has a clear safety comment. The invariant holds: null entries use 0u128 (valid since they're masked by the null bitmap), return_input views are copied verbatim from valid input views, and substr_view constructs views pointing to valid subranges of the original buffers.
  • The substr_view function correctly handles the inline vs. non-inline cases: substrings of inline strings (≤ 12B originals) can never exceed 12B themselves, so ByteView::from(*original_view) is never called on an inline view.
  • Reusing split_nth_finder/rsplit_nth_finder via closures is clean — the lifetime ensures returned subslices genuinely point into the input string, making the pointer arithmetic in split_view_loop sound.
  • Empty-delimiter and out-of-bounds-position behaviors are correctly handled and tested.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Apr 7, 2026

🤖 Augment PR Summary

Summary: This PR optimizes split_part for Utf8View inputs by adding a zero-copy scalar fast path.

Changes:

  • Adds a dedicated split_part_scalar_view implementation that builds StringViewArray results by reusing the input’s underlying buffers
  • Introduces helpers (substr_view, split_view_loop) to construct substring views and apply pre-built memmem finders
  • Updates dispatch so Utf8View + scalar delimiter/position uses the new view-based implementation
  • Adds a benchmark case for very long Utf8View parts (256 bytes) to validate/track the optimization
  • Adds sqllogictest coverage for split_part on Utf8View columns (positive/negative positions, empty delimiter, delimiter-not-found) and verifies Utf8View output type

Technical Notes: The new path uses StringViewArray::new_unchecked to avoid copying substring bytes, relying on view/offset invariants for safety.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

let string = string_view_array.value(i);
match split_fn(string) {
Some(substr) => {
let start_offset = substr.as_ptr() as usize - string.as_ptr() as usize;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_view_loop computes start_offset via raw pointer subtraction and then truncates it with as u32; this relies on split_fn always returning a slice of string and the offset always fitting in u32, otherwise the derived view could become invalid (especially since the result is built with StringViewArray::new_unchecked). Consider adding a defensive assertion and/or documenting this invariant more explicitly to prevent accidental unsoundness if this helper is reused/modified later.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is not correct! Arrow types use u32 for offsets, so there is no point to use bigger types for offset related functionality.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a zero-copy fast path for the split_part function when processing StringViewArray inputs, leveraging Utf8View to avoid redundant memory allocations. The update includes new performance benchmarks and SQL logic tests to ensure correctness. Feedback was provided to address duplicated instruction lines added to the AGENTS.md documentation.

Comment on lines +42 to +51
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The added instructions for AI agents are duplicated. Please remove the redundant lines to keep the documentation clean.

Suggested change
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!
The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!
The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!
The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 014d100. Configure here.

in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated duplicate config rules in performance optimization PR

Low Severity

This PR includes AI agent configuration rules in AGENTS.md and a new .cursor/rules.md. These changes appear unrelated to the PR's performance optimization goal, and the rules in AGENTS.md are duplicated.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 014d100. Configure here.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
datafusion/sqllogictest/test_files/string/string_view.slt (1)

957-1020: Cover the non-inline Utf8View branch in SQL tests.

All rows in test are at most 9 bytes, so these new assertions only hit the inline branch of substr_view in datafusion/functions/src/string/split_part.rs:496-506. Adding one Utf8View column case whose selected part is > 12 bytes would give correctness coverage for the new buffer-backed path too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/sqllogictest/test_files/string/string_view.slt` around lines 957 -
1020, Add a SQL test case that exercises the non-inline (buffer-backed) Utf8View
branch by ensuring at least one row in the test table produces a selected
split_part result longer than 12 bytes; update
datafusion/sqllogictest/test_files/string/string_view.slt to include a row with
column1_utf8view containing a piece >12 bytes (e.g., a long segment separated by
your chosen delimiter) and add assertions that call split_part(column1_utf8view,
'<delimiter>', <index>) so the returned substring is >12 bytes, thereby covering
the substr_view buffer-backed path referenced in split_part.rs (substr_view).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@datafusion/sqllogictest/test_files/string/string_view.slt`:
- Around line 957-1020: Add a SQL test case that exercises the non-inline
(buffer-backed) Utf8View branch by ensuring at least one row in the test table
produces a selected split_part result longer than 12 bytes; update
datafusion/sqllogictest/test_files/string/string_view.slt to include a row with
column1_utf8view containing a piece >12 bytes (e.g., a long segment separated by
your chosen delimiter) and add assertions that call split_part(column1_utf8view,
'<delimiter>', <index>) so the returned substring is >12 bytes, thereby covering
the substr_view buffer-backed path referenced in split_part.rs (substr_view).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2919c08-8f3b-4cfb-b11d-17eeaf426ac8

📥 Commits

Reviewing files that changed from the base of the PR and between 2e13a2a and 014d100.

📒 Files selected for processing (5)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/functions/benches/split_part.rs
  • datafusion/functions/src/string/split_part.rs
  • datafusion/sqllogictest/test_files/string/string_view.slt

Comment on lines +481 to +489
// Safety: each view is either copied unchanged from the input, or built
// by `substr_view` from a substring that is a contiguous sub-range of the
// original string value stored in the input's data buffers.
unsafe {
Ok(Arc::new(StringViewArray::new_unchecked(
views_buf,
string_view_array.data_buffers().to_vec(),
nulls,
)) as ArrayRef)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd datafusion && find . -name "split_part.rs" -type f

Repository: martin-augment/datafusion

Length of output: 139


🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '470,510p'

Repository: martin-augment/datafusion

Length of output: 1787


🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '280,290p'

Repository: martin-augment/datafusion

Length of output: 636


🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '420,450p'

Repository: martin-augment/datafusion

Length of output: 1334


🏁 Script executed:

cat -n datafusion/functions/src/string/split_part.rs | sed -n '432,520p'

Repository: martin-augment/datafusion

Length of output: 3782


🏁 Script executed:

# Let's search for the make_view function to understand how views are created
rg -n "fn make_view" datafusion/functions/src/string/split_part.rs -A 10

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's check if there are any tests for split_part with Utf8View
rg -n "split_part.*View\|Utf8View.*split_part" datafusion/functions/src/string/split_part.rs -B 2 -A 5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's look at the StringViewArray implementation to understand buffer management
find . -path "*/arrow*" -name "*.rs" | head -20

Repository: martin-augment/datafusion

Length of output: 511


🏁 Script executed:

# Search for StringViewArray in the codebase to understand how buffers are managed
fd StringViewArray datafusion -t f 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Look at how StringViewArray::new_unchecked is used elsewhere
rg "StringViewArray::new_unchecked" --type rs | head -20

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

rg "StringViewArray::new_unchecked" --type-list

Repository: martin-augment/datafusion

Length of output: 458


🏁 Script executed:

rg "StringViewArray::new_unchecked" datafusion 2>/dev/null | head -20

Repository: martin-augment/datafusion

Length of output: 492


🏁 Script executed:

# Look at make_view function
rg -n "fn make_view" datafusion/functions/src/string/split_part.rs -A 8

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check the full split_view_loop and how it uses substr_view
cat -n datafusion/functions/src/string/split_part.rs | sed -n '509,550p'

Repository: martin-augment/datafusion

Length of output: 1644


🏁 Script executed:

# Let's look at the entire file to understand structure better
wc -l datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 118


🏁 Script executed:

# Check if there are any optimizations or checks for inline-only results elsewhere
rg -n "data_buffers\|buffer_index" datafusion/functions/src/string/split_part.rs -B 1 -A 1

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Find the make_view function definition
rg -n "fn make_view" datafusion/functions/src/string/ -A 15

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check ByteView struct and buffer_index behavior
rg -n "struct ByteView\|buffer_index" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Look at how other StringViewArray::new_unchecked calls handle buffers (from earlier search results)
cat -n datafusion/functions/src/string/common.rs | sed -n '1,50p' | grep -A 20 "StringViewArray::new_unchecked"

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Get more context around the StringViewArray::new_unchecked call in common.rs
rg -n "StringViewArray::new_unchecked" datafusion/functions/src/string/common.rs -B 10 -A 5

Repository: martin-augment/datafusion

Length of output: 649


🏁 Script executed:

# Check the unicode/substr.rs usage
rg -n "StringViewArray::new_unchecked" datafusion/functions/src/unicode/substr.rs -B 10 -A 5

Repository: martin-augment/datafusion

Length of output: 649


🏁 Script executed:

# Look for any buffer optimization or filtering in the codebase
rg -n "is_empty\|filter.*buffer\|retain" datafusion/functions/src/string/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Search for make_view function more broadly
rg -n "fn make_view" datafusion --type rs -A 10 2>/dev/null | head -40

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Search in arrow crate or related files
find . -name "*.rs" -exec grep -l "fn make_view" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 115


🏁 Script executed:

# Let's look for ByteView definition and buffer_index meaning
rg -n "buffer_index.*0\|inline" datafusion/functions/src/string/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check what buffer_index = 0 means in StringView/ByteView context
rg -n "buffer_index" datafusion -A 2 -B 2 --type rs 2>/dev/null | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's check if there's any optimization in Arrow's StringViewArray that handles unused buffers
find . -path "*/arrow*" -name "*.rs" -exec grep -l "StringViewArray\|data_buffers" {} \; 2>/dev/null | head -10

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check the comments and safety documentation around buffer_index
rg -n "buffer_index = 0\|inline.*buffer\|12 bytes" datafusion/functions/src/string/ -B 3 -A 3 | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Search for make_view definition or import at the top of split_part.rs
cat -n datafusion/functions/src/string/split_part.rs | head -100 | grep -i "make_view\|use\|import"

Repository: martin-augment/datafusion

Length of output: 791


🏁 Script executed:

# Look at the top of the file for imports and module structure
head -50 datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 2279


🏁 Script executed:

# Check if make_view is defined inline within split_part.rs itself
rg "make_view" datafusion/functions/src/string/split_part.rs -n

Repository: martin-augment/datafusion

Length of output: 270


🏁 Script executed:

# Let's check arrow crate directly
find . -path "*arrow-array*" -name "*.rs" -exec grep -l "StringView" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if this is using arrow_string crate
rg -n "arrow.string\|arrow_string" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's understand what make_view does by looking at arrow crate
# First find arrow crate location
find . -name "arrow" -type d | grep -E "^.*arrow$|/arrow$" | head -5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Look for tests or examples using StringViewArray and inlined data
rg -n "inline\|buffer_index.*0\|data_buffers" datafusion/functions/src/string/common.rs -B 3 -A 3 | head -80

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if there's a way to construct StringViewArray with empty buffers
rg -n "StringViewArray.*new\|data_buffers" datafusion -A 2 -B 2 --type rs 2>/dev/null | grep -A 5 -B 5 "to_vec\|Vec::new" | head -100

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's look for the definition of StringViewArray and understand if it requires buffers even when unused
rg -n "pub.*fn data_buffers\|pub.*fn new_unchecked" datafusion --type rs 2>/dev/null | head -20

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Test case: check if there are tests for split_part that might show the expected behavior
rg -n "split_part" datafusion/functions/src/string/split_part.rs | grep -i "test\|#\[test\]" | head -10

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Find test module
tail -200 datafusion/functions/src/string/split_part.rs | head -100

Repository: martin-augment/datafusion

Length of output: 3420


🏁 Script executed:

# Let's look at the arrow crate to understand make_view signature and buffer_index semantics
find . -name "*.rs" -path "*/arrow*" -exec grep -l "pub.*fn make_view" {} \; 2>/dev/null | head -5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if there's documentation about StringView inline behavior
rg -n "inline\|12 bytes\|small string" datafusion/functions/src/unicode/substr.rs -B 2 -A 2 | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Look at substr.rs to see how it handles the same pattern
cat -n datafusion/functions/src/unicode/substr.rs | sed -n '200,350p'

Repository: martin-augment/datafusion

Length of output: 6304


🏁 Script executed:

# Check if there's any existing code that filters empty buffers
rg -n "filter.*buffer\|empty.*buffer\|retain" datafusion/functions/src/string/ --type rs | head -20

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Search for any memory optimization patterns in similar code
rg -n "data_buffers()" datafusion/functions/src/ -B 5 -A 5 --type rs | head -100

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Let's check the make_and_append_view function in substr.rs to see the full pattern
rg -n "fn make_and_append_view" datafusion/functions/src/unicode/substr.rs -A 20

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's understand ByteView and buffer_index semantics better
rg -n "struct ByteView\|buffer_index" datafusion -A 3 --type rs 2>/dev/null | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if inline views (buffer_index=0) need the data_buffers
rg -n "buffer_index.*=.*0\|inline.*buffer" datafusion/functions/src/ -B 2 -A 2 --type rs | head -80

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Let's look for any comments or documentation about view inlining
rg -n "12 bytes\|inline\|small string optimization" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if there's any test showing the concern - large input split into small parts
rg -n "long\|large.*small\|bytes.*split" datafusion/functions/benches/split_part.rs | head -20

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's examine the benchmark file to see if there are tests for the scenario mentioned
cat -n datafusion/functions/benches/split_part.rs | head -100

Repository: martin-augment/datafusion

Length of output: 4499


🏁 Script executed:

# Check if there's documentation or issues about StringView memory management
rg -n "StringView\|buffer.*retain\|memory.*issue" datafusion/functions/src/string/ --type rs -B 1 -A 1 | head -100

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Look at the actual implementation to confirm the inline behavior (buffer_index = 0)
rg -n "make_view.*0.*0\|inline.*view" datafusion/functions/src/string/split_part.rs -B 2 -A 2

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check if there are any TODOs or FIXMEs related to buffer management
rg -n "TODO\|FIXME\|XXX\|HACK" datafusion/functions/src/string/split_part.rs

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Verify the make_view function signature from arrow
python3 << 'EOF'
# Check if we can find documentation about make_view and buffer_index semantics
import subprocess
result = subprocess.run(['rg', '-n', 'pub.*fn make_view', 'datafusion', '--type', 'rs'], 
                       capture_output=True, text=True)
print("make_view definitions:")
print(result.stdout if result.stdout else "Not found in datafusion")
print(result.stderr if result.stderr else "")

# Also check in arrow if available
result2 = subprocess.run(['rg', '-n', 'fn make_view', '.'], 
                        capture_output=True, text=True)
print("\nAll make_view occurrences:")
lines = result2.stdout.split('\n')[:20]
for line in lines:
    print(line)
EOF

Repository: martin-augment/datafusion

Length of output: 248


🏁 Script executed:

# Let's look at the complete bench file to see if there are scenarios matching the concern
cat -n datafusion/functions/benches/split_part.rs | tail -100

Repository: martin-augment/datafusion

Length of output: 3905


🏁 Script executed:

# Let's check what happens in other places where similar patterns occur
cat -n datafusion/functions/src/string/common.rs | sed -n '200,250p'

Repository: martin-augment/datafusion

Length of output: 2178


🏁 Script executed:

# Let's trace through the logic more carefully - check split_view_loop to see all paths
cat -n datafusion/functions/src/string/split_part.rs | sed -n '509,536p'

Repository: martin-augment/datafusion

Length of output: 1120


🏁 Script executed:

# Critical check: does buffer_index = 0 mean inline in StringView?
python3 << 'EOF'
# Let's search for documentation or comments about buffer_index in views
import subprocess

# Search for buffer_index documentation/comments
result = subprocess.run(['rg', '-n', 'buffer_index', '.', '-B', '3', '-A', '3'], 
                       capture_output=True, text=True)
lines = result.stdout.split('\n')
# Filter to relevant lines only
relevant = [line for line in lines if '0' in line and ('inline' in line.lower() or 'view' in line.lower())]
print("Lines mentioning buffer_index with 0 and inline/view:")
for line in relevant[:30]:
    if line.strip():
        print(line)
EOF

Repository: martin-augment/datafusion

Length of output: 3920


🏁 Script executed:

# Let's verify the actual concern - check if there's a scenario where all results could be inline
cat -n datafusion/functions/src/string/split_part.rs | sed -n '454,490p'

Repository: martin-augment/datafusion

Length of output: 1826


🏁 Script executed:

# Check if there's any filtering or optimization of buffers anywhere in the code
rg -n "filter.*buffer\|drain.*buffer\|empty.*buffer\|Vec::new.*buffer" datafusion/functions/src/ -B 2 -A 2 | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Let's check if StringViewArray can be created with empty buffers or fewer buffers
rg -n "Vec::new()\|Vec::with_capacity" datafusion/functions/src/string/split_part.rs -B 1 -A 1

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Search for documentation on inline vs external buffer views
rg -n "buffer_index.*==.*0\|inline.*string\|small.*string" datafusion/functions/src/string/ -B 2 -A 2 | head -80

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Final check: Are there any related issues or TODOs about buffer retention?
rg -n "TODO\|FIXME\|buffer.*retain\|memory.*issue\|optimization" datafusion/functions/src/string/ | grep -i "buffer\|memory\|inline"

Repository: martin-augment/datafusion

Length of output: 51


Don't retain source Utf8View buffers when the result is fully inline.

Line 487 always clones string_view_array.data_buffers(), even when substr_view in lines 496–506 has inlined every emitted part. On workloads with long Utf8View inputs split into short tokens (< 12 bytes), this retains the original large buffers for the output array's entire lifetime even though no rows actually reference them. Consider special-casing the all-inline scenario to avoid unnecessary buffer retention. This pattern also appears in substr.rs (lines 320–325) and common.rs (lines 236–240).

@martin-augment
Copy link
Copy Markdown
Owner Author

Bug Risk: Missing test coverage for the non-inline (long string) path

The SQL logic tests in string_view.slt only use short strings: "Andrew" (6B), "Xiangpeng" (9B), "Raphael" (7B) — all ≤ 12 bytes. For strings ≤ 12 bytes, StringViewArray stores them inline in the view itself, so substr_view always takes the make_view(substr.as_bytes(), 0, 0) path. The new buffer-pointer path (view.buffer_index + start_offset) is only exercised for strings > 12 bytes, but there are no correctness tests for that path.

The benchmark does use long strings (gen_string_array(N_ROWS, 5, 256, ".", true)), but benchmarks don't catch incorrect results. Please add a logic test with strings longer than 12 bytes to validate the zero-copy buffer-reference path.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The StringView type handles short strings (less than 13 bytes) differently than long ones (longer than 12 bytes). It would be good to use test data for both lengths to cover all paths.

@martin-augment
Copy link
Copy Markdown
Owner Author

957-1020: Cover the non-inline Utf8View branch in SQL tests.

All rows in test are at most 9 bytes, so these new assertions only hit the inline branch of substr_view in datafusion/functions/src/string/split_part.rs:496-506. Adding one Utf8View column case whose selected part is > 12 bytes would give correctness coverage for the new buffer-backed path too.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The StringView type handles short strings (less than 13 bytes) differently than long ones (longer than 12 bytes). It would be good to use test data for both lengths to cover all paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants