
8716: refactor: remove arrow-ord dependency in arrow-cast #11

Open
martin-augment wants to merge 2 commits into main from pr-8716-2025-10-28-06-56-47

Conversation

@martin-augment
Owner

8716: To review by AI

@coderabbitai

coderabbitai bot commented Oct 28, 2025

Walkthrough

The PR removes the arrow-ord dependency from arrow-cast/Cargo.toml and refactors run boundary detection in run_array.rs from partition-based iteration to a new compute_run_boundaries function with specialized type handlers supporting boolean, numeric, temporal, string, and binary data types.

Changes

Cohort / File(s) Summary
Dependency Management
arrow-cast/Cargo.toml
Removed arrow-ord from [dependencies] section
Run Boundary Detection
arrow-cast/src/cast/run_array.rs
Replaced partition-based run boundary detection with new compute_run_boundaries function and specialized run-generation helpers for multiple data types (boolean, primitive numeric, temporal, interval, decimal, string, binary, fixed-size binary, dictionary). Introduced runs_generic and type-specific variants. Added null-handling and capacity management infrastructure while preserving public API signatures

@github-actions github-actions bot added the arrow label Oct 28, 2025
@claude

claude bot commented Oct 28, 2025

PR Review: Refactor to Remove arrow-ord Dependency in arrow-cast

Summary

This PR successfully removes the arrow-ord dependency from arrow-cast by implementing a custom run-length encoding (RLE) boundary detection algorithm to replace the partition function. The implementation adds ~476 lines of specialized code to identify consecutive equal values in arrays.

Code Quality and Best Practices

Strengths

  1. Comprehensive type coverage: The implementation handles all Arrow data types with specialized optimizations for each category (primitives, strings, binary, fixed-size binary, dictionaries, etc.)
  2. Performance optimizations:
    • SIMD-like chunked comparison in scan_run_end_chunk (lines 590-628)
    • Unsafe code with bounds checking elided for hot paths
    • Specialized paths for null-free arrays
  3. Clear code structure: Each data type has its own handler function with consistent naming conventions
  4. Memory efficiency: Preallocates capacity based on heuristics (len / 64 + 2)

Areas for Improvement

1. Unsafe Code Safety (High Priority)

The code contains multiple unsafe blocks that need careful review:

Lines 309, 318, 328, 334, 341 - In runs_for_primitive:

let mut current = unsafe { *values.get_unchecked(0) };

While these appear safe given the checks for empty arrays and bounds, consider adding safety comments explaining the invariants.
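To make the suggestion concrete, here is a hypothetical, standalone sketch of the SAFETY-comment style being asked for; `first_value` merely stands in for the access pattern in `runs_for_primitive` and is not the PR's code:

```rust
// Hypothetical sketch: document the invariant that makes the unchecked
// access sound, right next to the unsafe block.
fn first_value(values: &[i32]) -> Option<i32> {
    if values.is_empty() {
        return None;
    }
    // SAFETY: the is_empty() check above guarantees index 0 is in bounds.
    Some(unsafe { *values.get_unchecked(0) })
}

fn main() {
    assert_eq!(first_value(&[7, 8]), Some(7));
    assert_eq!(first_value(&[]), None);
}
```

A one-line SAFETY comment per unsafe block is also what clippy's `undocumented_unsafe_blocks` lint enforces, which could be enabled for this module.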

Lines 604-611 - In scan_run_end_chunk:

let value_bytes = std::slice::from_raw_parts(&current as *const T::Native as *const u8, element_size);

This pointer cast assumes the memory layout is compatible. This should be safe for primitive types but deserves a safety comment about alignment and validity assumptions.

Lines 614, 617 - Unaligned reads:

let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };

The unaligned read is necessary but could fail on certain architectures or cause performance issues. Consider documenting why this approach was chosen over aligned reads.
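For reference, the technique under discussion can be sketched in a self-contained form over a plain `u32` slice (this is a simplified illustration, not the PR's `scan_run_end_chunk`; pattern and chunk use the same native byte order, so endianness cancels out):

```rust
// Simplified sketch of 16-byte chunked run scanning with read_unaligned.
fn scan_run_end(values: &[u32], current: u32, mut idx: usize) -> usize {
    const CHUNK: usize = 16 / std::mem::size_of::<u32>(); // 4 elements per chunk
    let pattern_arr = [current; CHUNK];
    // SAFETY: pattern_arr is exactly 16 bytes; read_unaligned has no
    // alignment requirement.
    let pattern = unsafe { (pattern_arr.as_ptr() as *const u128).read_unaligned() };
    while idx + CHUNK <= values.len() {
        // SAFETY: idx + CHUNK <= values.len(), so this 16-byte read stays
        // within the slice's allocation.
        let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };
        if chunk != pattern {
            break; // a differing element lies inside this chunk
        }
        idx += CHUNK;
    }
    // Per-element scan for the tail and for locating the mismatch.
    while idx < values.len() && values[idx] == current {
        idx += 1;
    }
    idx
}

fn main() {
    assert_eq!(scan_run_end(&[5, 5, 5, 5, 5, 5, 3, 5], 5, 0), 6);
    assert_eq!(scan_run_end(&[2u32; 10], 2, 0), 10);
}
```

Note the per-element tail scan also resolves the in-chunk mismatch, so a structure like this never reaches an `unreachable!()` arm.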

Line 622 - unreachable!():

unreachable!("chunk mismatch without locating differing element");

Reaching this indicates a logic error. Since unreachable!() already panics, consider whether a debug_assert! (with a safe fallback in release builds) would be more appropriate, or restructure the scan so the mismatch is always located without a catch-all arm.

2. Potential Bugs

Float NaN Handling (Medium Priority)
Lines 189-191 handle Float16/Float32/Float64 using direct equality comparison:

Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),

Issue: NaN values have the property that NaN != NaN, which means consecutive NaN values will be treated as different runs. This may not be the desired behavior for RLE encoding. The original arrow-ord::partition function likely handled this differently.

Recommendation: Add special handling for float types to treat all NaNs as equivalent, or document this behavior explicitly if it's intentional.
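One possible direction, sketched here over a plain `f32` slice (a hypothetical helper, not the PR's code): compare bit patterns so bitwise-identical NaNs stay in one run.

```rust
// Hedged sketch: detect run starts by comparing f32 bit patterns, so
// consecutive NaNs with identical payloads form a single run.
fn run_starts_f32(values: &[f32]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev_bits: Option<u32> = None;
    for (i, v) in values.iter().enumerate() {
        let bits = v.to_bits();
        if prev_bits != Some(bits) {
            starts.push(i); // a new run begins here
        }
        prev_bits = Some(bits);
    }
    starts
}

fn main() {
    // With naive `!=` on f32 values, each NaN would start its own run.
    assert_eq!(run_starts_f32(&[f32::NAN, f32::NAN, 1.0, 1.0, 2.0]), vec![0, 2, 4]);
}
```

Note this groups only bitwise-identical NaNs; NaNs with differing payloads would still split, which matches a bit-level notion of equality rather than IEEE `==`.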

3. Performance Considerations

Good Optimizations:

  • Chunked scanning for primitives with 16-byte SIMD-like comparisons
  • Early exit for trivial cases (empty, single element)
  • Capacity management to avoid repeated allocations

Concerns:

  • Generic fallback path (lines 531-548): Uses array.slice() and to_data() for each element comparison, which is very expensive. Consider documenting that this is a fallback only for unsupported types.
  • String/Binary view types (lines 463-468): Delegates to the expensive generic path. These are important types that might benefit from specialized implementations.
  • Chunk size hardcoded: The 16-byte chunks might not be optimal for all CPUs. Consider whether this should be configurable or adaptive.

4. Memory Safety Patterns

Capacity management (lines 560-565):

fn ensure_capacity(vec: &mut Vec<usize>, total_len: usize) {
    if vec.len() == vec.capacity() {
        let remaining = total_len.saturating_sub(vec.len());
        vec.reserve(remaining.max(1));
    }
}

This is clever but might cause a large allocation on the first reserve. Consider whether exponential growth would be better for arrays with many runs.
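A geometric alternative could look like the following (a hypothetical variant, not the PR's `ensure_capacity`), growing by a fraction of the current length instead of reserving all remaining capacity at once:

```rust
// Sketch: grow by half the current length (at least 1), capped by what the
// input array could still produce, to avoid a single large up-front reserve.
fn ensure_capacity_geometric(vec: &mut Vec<usize>, total_len: usize) {
    if vec.len() == vec.capacity() {
        let remaining = total_len.saturating_sub(vec.len());
        let grow = (vec.len() / 2).max(1);
        vec.reserve(grow.min(remaining).max(1));
    }
}

fn main() {
    let mut runs = Vec::new();
    for i in 0..100 {
        ensure_capacity_geometric(&mut runs, 1_000_000);
        runs.push(i);
    }
    assert_eq!(runs.len(), 100);
    // For sparse runs, capacity stays far below total_len.
    assert!(runs.capacity() < 1_000_000);
}
```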

Test Coverage

Critical Gap: The PR diff does not show any new tests for the compute_run_boundaries function or its specialized implementations.

Recommendations:

  1. Add unit tests for each runs_for_* function
  2. Add tests for edge cases:
    • Arrays with all equal values
    • Arrays with all unique values
    • Arrays with nulls in various positions
    • NaN handling for floats
    • Empty arrays and single-element arrays
  3. Add property-based tests comparing output with the original arrow-ord::partition
  4. Add benchmarks to ensure performance is maintained or improved

Security Concerns

  1. Unsafe memory access: While the unsafe code appears sound, it needs thorough review and documentation
  2. Out-of-bounds risk: The chunked reading could theoretically access beyond array bounds if the pointer arithmetic is incorrect
  3. Unaligned access: Could cause crashes on some architectures (though read_unaligned should handle this)

Recommendation: Add comprehensive fuzzing tests for the unsafe code paths.

Architecture and Design

Positive:

  • Reduces dependency graph complexity by removing arrow-ord
  • Self-contained implementation improves arrow-cast independence

Concerns:

  • Code duplication risk: If similar functionality is needed elsewhere, this code might need to be extracted to a shared utility
  • Maintenance burden: The custom implementation is now ~25x larger than the original 3-line call to partition()
  • Testing burden: Each specialized path needs its own tests

Specific Line-by-Line Issues

  1. Line 269: false is used as a placeholder for invalid boolean values. Document why this choice vs. using an Option type.

  2. Line 582: The condition if element_size <= 8 && 16 % element_size == 0 should be documented. It's checking whether the type fits evenly into 16-byte chunks.

  3. Line 398, 417: Slice comparison values[start..end] != values[current_start..current_end] - This is correct but potentially expensive for large binary values. Consider optimizing with length check first (which is done).

Recommendations Summary

Must Fix (Before Merge):

  1. ✅ Add comprehensive tests for the new code
  2. ⚠️ Fix or document NaN handling behavior for floats
  3. ✅ Add safety documentation for all unsafe blocks
  4. ✅ Verify the PR passes CI and existing tests

Should Fix:

  1. Consider optimizing StringView and BinaryView with specialized implementations
  2. Add benchmarks to ensure performance parity with arrow-ord::partition
  3. Review capacity growth strategy in ensure_capacity

Nice to Have:

  1. Add property-based testing with the original implementation
  2. Consider extracting to a utility module if this pattern is needed elsewhere
  3. Add CPU-specific optimizations for the chunked comparison

Conclusion

This is a substantial refactoring that achieves its goal of removing the arrow-ord dependency. The code quality is generally good with thoughtful optimizations, but it needs:

  1. Comprehensive test coverage
  2. Safety documentation for unsafe blocks
  3. Clarification on float NaN handling

The complexity increase (from 3 lines to 476 lines) is significant, so this change should be justified by concrete benefits (reduced binary size, compilation time, or avoiding circular dependencies).

Recommendation: Request additional tests and safety documentation before merging. Consider running benchmarks against the original implementation to ensure no performance regression.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
arrow-cast/src/cast/run_array.rs (3)

64-84: Fix double-Arc return (type mismatch).

new_run_array is already an ArrayRef (Arc). Returning Ok(Arc::new(new_run_array)) yields Arc<Arc> and won’t type‑check. Return the ArrayRef directly.

-                    Ok(Arc::new(new_run_array))
+                    Ok(new_run_array)

161-164: Use signed indices for take() (Int32/Int64), not UInt32.

The compute::take kernel expects signed integer index types (Int32 or Int64). Using UInt32 is unsupported and may error at runtime.

-    let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
-        values_indexes.iter().map(|&idx| idx as u32),
-    );
+    // Prefer Int32 where possible; switch to Int64 if len > i32::MAX.
+    let indices = PrimitiveArray::<Int32Type>::from_iter_values(
+        values_indexes.iter().map(|&idx| idx as i32),
+    );

Optionally gate on length to choose Int64 for very large arrays.
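The gating could be sketched like this (hypothetical standalone code; the enum stands in for the two `PrimitiveArray` index types the take kernel accepts):

```rust
// Sketch: choose i32 take-indices when the array length fits, else i64.
enum TakeIndices {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

fn build_take_indices(run_starts: &[usize], array_len: usize) -> TakeIndices {
    if array_len <= i32::MAX as usize {
        TakeIndices::Int32(run_starts.iter().map(|&i| i as i32).collect())
    } else {
        TakeIndices::Int64(run_starts.iter().map(|&i| i as i64).collect())
    }
}

fn main() {
    match build_take_indices(&[0, 2, 5], 10) {
        TakeIndices::Int32(v) => assert_eq!(v, vec![0, 2, 5]),
        TakeIndices::Int64(_) => panic!("expected Int32 for a small array"),
    }
}
```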


86-106: REE → logical expansion builds wrong-length indices and mishandles 1‑based run_ends.

  • Iterates over run_ends.len() (number of runs) instead of logical length.
  • run_ends are cumulative, 1‑based; equality check against 0‑based logical_idx is wrong.
  • Risks OOB on run_ends[physical_idx] and produces too few elements.
  • Example: with run_ends=[2, 5, 6] (6 logical elements), current code produces only 3 indices, causing incorrect take() operation.

Apply this rewrite to construct a per‑logical index mapping by expanding each run length:

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
+                    let run_ends = run_array.run_ends().values();
+                    // logical_len = last run end (1-based)
+                    let logical_len = {
+                        let &last = run_ends.last().expect("non-empty REE");
+                        last.as_usize()
+                    };
+                    let mut indices: Vec<i32> = Vec::with_capacity(logical_len);
+                    let mut prev = 0usize;
+                    for (run_idx, &end_native) in run_ends.iter().enumerate() {
+                        let end = end_native.as_usize();
+                        let run_len = end.checked_sub(prev).expect("non-decreasing run_ends");
+                        // Repeat run_idx for this run's logical length
+                        indices.extend(std::iter::repeat(run_idx as i32).take(run_len));
+                        prev = end;
+                    }
🧹 Nitpick comments (4)
arrow-cast/src/cast/run_array.rs (4)

531-549: runs_generic is allocation-heavy, performing a slice and ArrayData clone per element; consider a leaner comparator.

Repeated slice(idx, 1).to_data() allocates/clones ArrayData each iteration. Prefer a type-directed fast path (like existing runs_for_* helpers) or compare physical buffers/offsets directly for the element at idx. This will reduce allocations and improve hot‑path performance for view/complex types.

If keeping generic fallback, add a micro-benchmark to quantify impact vs old partition-based approach.


150-169: Index type width selection for values_indexes.

When building values_array via take of run-starts, choose Int64 indices when cast_array.len() > i32::MAX to avoid overflow on .map(|&idx| idx as i32).

Consider:

if cast_array.len() <= i32::MAX as usize {
    // Int32 indices
} else {
    // Int64 indices
}

217-257: Dictionary runs should consider logical values, not just key equality (edge case).

Using runs_for_primitive(array.keys()) splits runs when distinct keys map to identical dictionary values (permitted by Arrow). If observable equality should be by values, consider dispatching on cast(array, value_type) first (you already do that in cast_to_run_end_encoded) or normalize keys to canonicalized dictionary.


559-565: ensure_capacity reserves up to total_len every time capacity is hit.

This is safe but can over-reserve for arrays with few runs. A geometric growth (Vec default) or reserve_exact with a modest chunk (e.g., +len/8) can reduce memory spikes for very long arrays with sparse runs.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 477bfe4 and fe208be.

📒 Files selected for processing (2)
  • arrow-cast/Cargo.toml (0 hunks)
  • arrow-cast/src/cast/run_array.rs (3 hunks)
💤 Files with no reviewable changes (1)
  • arrow-cast/Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: MIRI
  • GitHub Check: Test
  • GitHub Check: Audit
  • GitHub Check: Lint (cargo fmt)
  • GitHub Check: Rustdocs are clean
  • GitHub Check: Test on Windows
  • GitHub Check: Release Audit Tool (RAT)
  • GitHub Check: Test on Mac
  • GitHub Check: codex
  • GitHub Check: Archery test With other arrows
  • GitHub Check: Test
  • GitHub Check: claude-review

Comment on lines +189 to +191
Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),

⚠️ Potential issue | 🟠 Major

Float runs: NaN handling breaks every element; use bitwise equality for floats.

Using != on floats treats NaN != NaN, fragmenting runs of NaNs. Implement float‑specific run detection using to_bits (including f16) so all equal bit patterns, including NaN, stay in a single run.

Minimal dispatch change:

-        Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
-        Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
-        Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),
+        Float16 => runs_for_float16(array.as_primitive::<Float16Type>()),
+        Float32 => runs_for_float32(array.as_primitive::<Float32Type>()),
+        Float64 => runs_for_float64(array.as_primitive::<Float64Type>()),

Add specialized helpers (outline):

fn runs_for_float32(array: &PrimitiveArray<Float32Type>) -> (Vec<usize>, Vec<usize>) {
    // mirror runs_for_primitive but compare value.to_bits() instead of value
    // and use null semantics identical to runs_for_primitive
}
fn runs_for_float64(...) -> (...) { /* same with u64 */ }
fn runs_for_float16(...) -> (...) { /* use half::f16::to_bits() (u16) */ }
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 189-191, the float arms
currently call runs_for_primitive and use `!=` which splits NaNs because IEEE
floats compare NaN != NaN; replace those arms to call new float-specific run
detectors and implement three helpers:
runs_for_float16(&PrimitiveArray<Float16Type>),
runs_for_float32(&PrimitiveArray<Float32Type>), and
runs_for_float64(&PrimitiveArray<Float64Type>). Each helper should mirror
runs_for_primitive but compare raw bit representations (use half::f16::to_bits()
for f16, f32::to_bits() for f32, f64::to_bits() for f64) so equal bit patterns
(including NaN payloads) form a single run, and preserve the same null-handling
and return types as runs_for_primitive; update the match dispatch to call these
helpers for Float16/Float32/Float64.

@github-actions

Findings

  • arrow-cast/src/cast/run_array.rs:333 and arrow-cast/src/cast/run_array.rs:637 – The new run detection for primitive arrays relies on Rust’s != / == for floats. Unlike the old arrow_ord::partition code, this no longer uses total-order/bitwise equality, so consecutive NaN values now get split into separate runs (e.g. Float32Array::from(vec![f32::NAN, f32::NAN]) produces two runs, diverging from pre‑PR behaviour). Please switch these comparisons to something like T::Native::is_eq / is_ne (or equivalent bitwise checks) so floating-point runs, including mixed-validity cases, retain the former semantics.
  • arrow-cast/src/cast/run_array.rs:525 – runs_for_dictionary only looks at the encoded keys. If the dictionary contains duplicate values at different key indices (which Arrow allows), two equal logical values with different keys will be treated as distinct runs, whereas the previous implementation compared the decoded values and kept them together. You’ll want to validate equality against the dictionary values (similar to what arrow_ord::partition did) before recording a boundary.
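The second finding can be illustrated with a simplified, hypothetical helper (not the PR's runs_for_dictionary; keys index directly into a values slice and nulls are omitted): comparing decoded values keeps duplicate dictionary entries in one run.

```rust
// Sketch: compare the decoded dictionary value at each key, not the key
// itself, so duplicate dictionary entries do not split a run.
fn dict_run_starts(keys: &[usize], dict_values: &[&str]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev: Option<&str> = None;
    for (i, &k) in keys.iter().enumerate() {
        let v = dict_values[k];
        if prev != Some(v) {
            starts.push(i);
        }
        prev = Some(v);
    }
    starts
}

fn main() {
    // Keys 0 and 2 both decode to "a": key-based comparison would report run
    // starts [0, 1, 2]; value-based comparison keeps them in a single run.
    assert_eq!(dict_run_starts(&[0, 2, 1], &["a", "b", "a"]), vec![0, 2]);
}
```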

