
8716: refactor: remove arrow-ord dependency in arrow-cast #11

Open
martin-augment wants to merge 2 commits into main from pr-8716-2025-10-28-06-56-47

Conversation

@martin-augment
Owner

8716: To review by AI

@coderabbitai

coderabbitai bot commented Oct 28, 2025

Walkthrough

The PR removes the arrow-ord dependency from arrow-cast/Cargo.toml and refactors run boundary detection in run_array.rs from partition-based iteration to a new compute_run_boundaries function with specialized type handlers supporting boolean, numeric, temporal, string, and binary data types.

Changes

Cohort / File(s) Summary
Dependency Management
arrow-cast/Cargo.toml
Removed arrow-ord from [dependencies] section
Run Boundary Detection
arrow-cast/src/cast/run_array.rs
Replaced partition-based run boundary detection with new compute_run_boundaries function and specialized run-generation helpers for multiple data types (boolean, primitive numeric, temporal, interval, decimal, string, binary, fixed-size binary, dictionary). Introduced runs_generic and type-specific variants. Added null-handling and capacity management infrastructure while preserving public API signatures

@github-actions github-actions bot added the arrow label Oct 28, 2025
@claude

claude bot commented Oct 28, 2025

PR Review: Refactor to Remove arrow-ord Dependency in arrow-cast

Summary

This PR successfully removes the arrow-ord dependency from arrow-cast by implementing a custom run-length encoding (RLE) boundary detection algorithm to replace the partition function. The implementation adds ~476 lines of specialized code to identify consecutive equal values in arrays.

Code Quality and Best Practices

Strengths

  1. Comprehensive type coverage: The implementation handles all Arrow data types with specialized optimizations for each category (primitives, strings, binary, fixed-size binary, dictionaries, etc.)
  2. Performance optimizations:
    • SIMD-like chunked comparison in scan_run_end_chunk (lines 590-628)
    • Unsafe code with bounds checking elided for hot paths
    • Specialized paths for null-free arrays
  3. Clear code structure: Each data type has its own handler function with consistent naming conventions
  4. Memory efficiency: Preallocates capacity based on heuristics (len / 64 + 2)

Areas for Improvement

1. Unsafe Code Safety (High Priority)

The code contains multiple unsafe blocks that need careful review:

Lines 309, 318, 328, 334, 341 - In runs_for_primitive:

let mut current = unsafe { *values.get_unchecked(0) };

While these appear safe given the checks for empty arrays and bounds, consider adding safety comments explaining the invariants.
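To make the suggestion concrete, here is a hypothetical, standalone sketch of the SAFETY-comment style being asked for; `first_value` merely stands in for the access pattern in `runs_for_primitive` and is not the PR's code:

```rust
// Hypothetical sketch: document the invariant that makes the unchecked
// access sound, right next to the unsafe block.
fn first_value(values: &[i32]) -> Option<i32> {
    if values.is_empty() {
        return None;
    }
    // SAFETY: the is_empty() check above guarantees index 0 is in bounds.
    Some(unsafe { *values.get_unchecked(0) })
}

fn main() {
    assert_eq!(first_value(&[7, 8]), Some(7));
    assert_eq!(first_value(&[]), None);
}
```

A one-line SAFETY comment per unsafe block is also what clippy's `undocumented_unsafe_blocks` lint enforces, which could be enabled for this module.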

Lines 604-611 - In scan_run_end_chunk:

let value_bytes = std::slice::from_raw_parts(&current as *const T::Native as *const u8, element_size);

This pointer cast assumes the memory layout is compatible. This should be safe for primitive types but deserves a safety comment about alignment and validity assumptions.

Lines 614, 617 - Unaligned reads:

let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };

The unaligned read is necessary but could fail on certain architectures or cause performance issues. Consider documenting why this approach was chosen over aligned reads.
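For reference, the technique under discussion can be sketched in a self-contained form over a plain `u32` slice (this is a simplified illustration, not the PR's `scan_run_end_chunk`; pattern and chunk use the same native byte order, so endianness cancels out):

```rust
// Simplified sketch of 16-byte chunked run scanning with read_unaligned.
fn scan_run_end(values: &[u32], current: u32, mut idx: usize) -> usize {
    const CHUNK: usize = 16 / std::mem::size_of::<u32>(); // 4 elements per chunk
    let pattern_arr = [current; CHUNK];
    // SAFETY: pattern_arr is exactly 16 bytes; read_unaligned has no
    // alignment requirement.
    let pattern = unsafe { (pattern_arr.as_ptr() as *const u128).read_unaligned() };
    while idx + CHUNK <= values.len() {
        // SAFETY: idx + CHUNK <= values.len(), so this 16-byte read stays
        // within the slice's allocation.
        let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };
        if chunk != pattern {
            break; // a differing element lies inside this chunk
        }
        idx += CHUNK;
    }
    // Per-element scan for the tail and for locating the mismatch.
    while idx < values.len() && values[idx] == current {
        idx += 1;
    }
    idx
}

fn main() {
    assert_eq!(scan_run_end(&[5, 5, 5, 5, 5, 5, 3, 5], 5, 0), 6);
    assert_eq!(scan_run_end(&[2u32; 10], 2, 0), 10);
}
```

Note the per-element tail scan also resolves the in-chunk mismatch, so a structure like this never reaches an `unreachable!()` arm.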

Line 622 - unreachable!():

unreachable!("chunk mismatch without locating differing element");

Reaching this indicates a logic error. Since unreachable!() already panics, consider whether a debug_assert! (with a safe fallback in release builds) would be more appropriate, or restructure the scan so the mismatch is always located without a catch-all arm.

2. Potential Bugs

Float NaN Handling (Medium Priority)
Lines 189-191 handle Float16/Float32/Float64 using direct equality comparison:

Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),

Issue: NaN values have the property that NaN != NaN, which means consecutive NaN values will be treated as different runs. This may not be the desired behavior for RLE encoding. The original arrow-ord::partition function likely handled this differently.

Recommendation: Add special handling for float types to treat all NaNs as equivalent, or document this behavior explicitly if it's intentional.
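One possible direction, sketched here over a plain `f32` slice (a hypothetical helper, not the PR's code): compare bit patterns so bitwise-identical NaNs stay in one run.

```rust
// Hedged sketch: detect run starts by comparing f32 bit patterns, so
// consecutive NaNs with identical payloads form a single run.
fn run_starts_f32(values: &[f32]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev_bits: Option<u32> = None;
    for (i, v) in values.iter().enumerate() {
        let bits = v.to_bits();
        if prev_bits != Some(bits) {
            starts.push(i); // a new run begins here
        }
        prev_bits = Some(bits);
    }
    starts
}

fn main() {
    // With naive `!=` on f32 values, each NaN would start its own run.
    assert_eq!(run_starts_f32(&[f32::NAN, f32::NAN, 1.0, 1.0, 2.0]), vec![0, 2, 4]);
}
```

Note this groups only bitwise-identical NaNs; NaNs with differing payloads would still split, which matches a bit-level notion of equality rather than IEEE `==`.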

3. Performance Considerations

Good Optimizations:

  • Chunked scanning for primitives with 16-byte SIMD-like comparisons
  • Early exit for trivial cases (empty, single element)
  • Capacity management to avoid repeated allocations

Concerns:

  • Generic fallback path (lines 531-548): Uses array.slice() and to_data() for each element comparison, which is very expensive. Consider documenting that this is a fallback only for unsupported types.
  • String/Binary view types (lines 463-468): Delegates to the expensive generic path. These are important types that might benefit from specialized implementations.
  • Chunk size hardcoded: The 16-byte chunks might not be optimal for all CPUs. Consider whether this should be configurable or adaptive.

4. Memory Safety Patterns

Capacity management (lines 560-565):

fn ensure_capacity(vec: &mut Vec<usize>, total_len: usize) {
    if vec.len() == vec.capacity() {
        let remaining = total_len.saturating_sub(vec.len());
        vec.reserve(remaining.max(1));
    }
}

This is clever but might cause a large allocation on the first reserve. Consider whether exponential growth would be better for arrays with many runs.
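A geometric alternative could look like the following (a hypothetical variant, not the PR's `ensure_capacity`), growing by a fraction of the current length instead of reserving all remaining capacity at once:

```rust
// Sketch: grow by half the current length (at least 1), capped by what the
// input array could still produce, to avoid a single large up-front reserve.
fn ensure_capacity_geometric(vec: &mut Vec<usize>, total_len: usize) {
    if vec.len() == vec.capacity() {
        let remaining = total_len.saturating_sub(vec.len());
        let grow = (vec.len() / 2).max(1);
        vec.reserve(grow.min(remaining).max(1));
    }
}

fn main() {
    let mut runs = Vec::new();
    for i in 0..100 {
        ensure_capacity_geometric(&mut runs, 1_000_000);
        runs.push(i);
    }
    assert_eq!(runs.len(), 100);
    // For sparse runs, capacity stays far below total_len.
    assert!(runs.capacity() < 1_000_000);
}
```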

Test Coverage

Critical Gap: The PR diff does not show any new tests for the compute_run_boundaries function or its specialized implementations.

Recommendations:

  1. Add unit tests for each runs_for_* function
  2. Add tests for edge cases:
    • Arrays with all equal values
    • Arrays with all unique values
    • Arrays with nulls in various positions
    • NaN handling for floats
    • Empty arrays and single-element arrays
  3. Add property-based tests comparing output with the original arrow-ord::partition
  4. Add benchmarks to ensure performance is maintained or improved

Security Concerns

  1. Unsafe memory access: While the unsafe code appears sound, it needs thorough review and documentation
  2. Out-of-bounds risk: The chunked reading could theoretically access beyond array bounds if the pointer arithmetic is incorrect
  3. Unaligned access: Could cause crashes on some architectures (though read_unaligned should handle this)

Recommendation: Add comprehensive fuzzing tests for the unsafe code paths.

Architecture and Design

Positive:

  • Reduces dependency graph complexity by removing arrow-ord
  • Self-contained implementation improves arrow-cast independence

Concerns:

  • Code duplication risk: If similar functionality is needed elsewhere, this code might need to be extracted to a shared utility
  • Maintenance burden: The custom implementation is now ~25x larger than the original 3-line call to partition()
  • Testing burden: Each specialized path needs its own tests

Specific Line-by-Line Issues

  1. Line 269: false is used as a placeholder for invalid boolean values. Document why this choice vs. using an Option type.

  2. Line 582: The condition if element_size <= 8 && 16 % element_size == 0 should be documented. It's checking whether the type fits evenly into 16-byte chunks.

  3. Line 398, 417: Slice comparison values[start..end] != values[current_start..current_end] - This is correct but potentially expensive for large binary values. Consider optimizing with length check first (which is done).

Recommendations Summary

Must Fix (Before Merge):

  1. ✅ Add comprehensive tests for the new code
  2. ⚠️ Fix or document NaN handling behavior for floats
  3. ✅ Add safety documentation for all unsafe blocks
  4. ✅ Verify the PR passes CI and existing tests

Should Fix:

  1. Consider optimizing StringView and BinaryView with specialized implementations
  2. Add benchmarks to ensure performance parity with arrow-ord::partition
  3. Review capacity growth strategy in ensure_capacity

Nice to Have:

  1. Add property-based testing with the original implementation
  2. Consider extracting to a utility module if this pattern is needed elsewhere
  3. Add CPU-specific optimizations for the chunked comparison

Conclusion

This is a substantial refactoring that achieves its goal of removing the arrow-ord dependency. The code quality is generally good with thoughtful optimizations, but it needs:

  1. Comprehensive test coverage
  2. Safety documentation for unsafe blocks
  3. Clarification on float NaN handling

The complexity increase (from 3 lines to 476 lines) is significant, so this change should be justified by concrete benefits (reduced binary size, compilation time, or avoiding circular dependencies).

Recommendation: Request additional tests and safety documentation before merging. Consider running benchmarks against the original implementation to ensure no performance regression.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
arrow-cast/src/cast/run_array.rs (3)

64-84: Fix double-Arc return (type mismatch).

new_run_array is already an ArrayRef (Arc). Returning Ok(Arc::new(new_run_array)) yields Arc<Arc> and won’t type‑check. Return the ArrayRef directly.

-                    Ok(Arc::new(new_run_array))
+                    Ok(new_run_array)

161-164: Use signed indices for take() (Int32/Int64), not UInt32.

The compute::take kernel expects signed integer index types (Int32 or Int64). Using UInt32 is unsupported and may error at runtime.

-    let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
-        values_indexes.iter().map(|&idx| idx as u32),
-    );
+    // Prefer Int32 where possible; switch to Int64 if len > i32::MAX.
+    let indices = PrimitiveArray::<Int32Type>::from_iter_values(
+        values_indexes.iter().map(|&idx| idx as i32),
+    );

Optionally gate on length to choose Int64 for very large arrays.
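The gating could be sketched like this (hypothetical standalone code; the enum stands in for the two `PrimitiveArray` index types the take kernel accepts):

```rust
// Sketch: choose i32 take-indices when the array length fits, else i64.
enum TakeIndices {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

fn build_take_indices(run_starts: &[usize], array_len: usize) -> TakeIndices {
    if array_len <= i32::MAX as usize {
        TakeIndices::Int32(run_starts.iter().map(|&i| i as i32).collect())
    } else {
        TakeIndices::Int64(run_starts.iter().map(|&i| i as i64).collect())
    }
}

fn main() {
    match build_take_indices(&[0, 2, 5], 10) {
        TakeIndices::Int32(v) => assert_eq!(v, vec![0, 2, 5]),
        TakeIndices::Int64(_) => panic!("expected Int32 for a small array"),
    }
}
```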


86-106: REE → logical expansion builds wrong-length indices and mishandles 1‑based run_ends.

  • Iterates over run_ends.len() (number of runs) instead of logical length.
  • run_ends are cumulative, 1‑based; equality check against 0‑based logical_idx is wrong.
  • Risks OOB on run_ends[physical_idx] and produces too few elements.
  • Example: with run_ends=[2, 5, 6] (6 logical elements), current code produces only 3 indices, causing incorrect take() operation.

Apply this rewrite to construct a per‑logical index mapping by expanding each run length:

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
+                    let run_ends = run_array.run_ends().values();
+                    // logical_len = last run end (1-based)
+                    let logical_len = {
+                        let &last = run_ends.last().expect("non-empty REE");
+                        last.as_usize()
+                    };
+                    let mut indices: Vec<i32> = Vec::with_capacity(logical_len);
+                    let mut prev = 0usize;
+                    for (run_idx, &end_native) in run_ends.iter().enumerate() {
+                        let end = end_native.as_usize();
+                        let run_len = end.checked_sub(prev).expect("non-decreasing run_ends");
+                        // Repeat run_idx for this run's logical length
+                        indices.extend(std::iter::repeat(run_idx as i32).take(run_len));
+                        prev = end;
+                    }
🧹 Nitpick comments (4)
arrow-cast/src/cast/run_array.rs (4)

531-549: runs_generic is allocation-heavy, performing a slice and ArrayData clone per element; consider a leaner comparator.

Repeated slice(idx, 1).to_data() allocates/clones ArrayData each iteration. Prefer a type-directed fast path (like existing runs_for_* helpers) or compare physical buffers/offsets directly for the element at idx. This will reduce allocations and improve hot‑path performance for view/complex types.

If keeping generic fallback, add a micro-benchmark to quantify impact vs old partition-based approach.


150-169: Index type width selection for values_indexes.

When building values_array via take of run-starts, choose Int64 indices when cast_array.len() > i32::MAX to avoid overflow on .map(|&idx| idx as i32).

Consider:

if cast_array.len() <= i32::MAX as usize {
    // Int32 indices
} else {
    // Int64 indices
}

217-257: Dictionary runs should consider logical values, not just key equality (edge case).

Using runs_for_primitive(array.keys()) splits runs when distinct keys map to identical dictionary values (permitted by Arrow). If observable equality should be by values, consider dispatching on cast(array, value_type) first (you already do that in cast_to_run_end_encoded) or normalize keys to canonicalized dictionary.


559-565: ensure_capacity reserves up to total_len every time capacity is hit.

This is safe but can over-reserve for arrays with few runs. A geometric growth (Vec default) or reserve_exact with a modest chunk (e.g., +len/8) can reduce memory spikes for very long arrays with sparse runs.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 477bfe4 and fe208be.

📒 Files selected for processing (2)
  • arrow-cast/Cargo.toml (0 hunks)
  • arrow-cast/src/cast/run_array.rs (3 hunks)
💤 Files with no reviewable changes (1)
  • arrow-cast/Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: MIRI
  • GitHub Check: Test
  • GitHub Check: Audit
  • GitHub Check: Lint (cargo fmt)
  • GitHub Check: Rustdocs are clean
  • GitHub Check: Test on Windows
  • GitHub Check: Release Audit Tool (RAT)
  • GitHub Check: Test on Mac
  • GitHub Check: codex
  • GitHub Check: Archery test With other arrows
  • GitHub Check: Test
  • GitHub Check: claude-review

Comment on lines +189 to +191
Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),

⚠️ Potential issue | 🟠 Major

Float runs: NaN handling breaks every element; use bitwise equality for floats.

Using != on floats treats NaN != NaN, fragmenting runs of NaNs. Implement float‑specific run detection using to_bits (including f16) so all equal bit patterns, including NaN, stay in a single run.

Minimal dispatch change:

-        Float16 => runs_for_primitive(array.as_primitive::<Float16Type>()),
-        Float32 => runs_for_primitive(array.as_primitive::<Float32Type>()),
-        Float64 => runs_for_primitive(array.as_primitive::<Float64Type>()),
+        Float16 => runs_for_float16(array.as_primitive::<Float16Type>()),
+        Float32 => runs_for_float32(array.as_primitive::<Float32Type>()),
+        Float64 => runs_for_float64(array.as_primitive::<Float64Type>()),

Add specialized helpers (outline):

fn runs_for_float32(array: &PrimitiveArray<Float32Type>) -> (Vec<usize>, Vec<usize>) {
    // mirror runs_for_primitive but compare value.to_bits() instead of value
    // and use null semantics identical to runs_for_primitive
}
fn runs_for_float64(...) -> (...) { /* same with u64 */ }
fn runs_for_float16(...) -> (...) { /* use half::f16::to_bits() (u16) */ }
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 189-191, the float arms
currently call runs_for_primitive and use `!=` which splits NaNs because IEEE
floats compare NaN != NaN; replace those arms to call new float-specific run
detectors and implement three helpers:
runs_for_float16(&PrimitiveArray<Float16Type>),
runs_for_float32(&PrimitiveArray<Float32Type>), and
runs_for_float64(&PrimitiveArray<Float64Type>). Each helper should mirror
runs_for_primitive but compare raw bit representations (use half::f16::to_bits()
for f16, f32::to_bits() for f32, f64::to_bits() for f64) so equal bit patterns
(including NaN payloads) form a single run, and preserve the same null-handling
and return types as runs_for_primitive; update the match dispatch to call these
helpers for Float16/Float32/Float64.

@github-actions

Findings

  • arrow-cast/src/cast/run_array.rs:333 and arrow-cast/src/cast/run_array.rs:637 – The new run detection for primitive arrays relies on Rust’s != / == for floats. Unlike the old arrow_ord::partition code, this no longer uses total-order/bitwise equality, so consecutive NaN values now get split into separate runs (e.g. Float32Array::from(vec![f32::NAN, f32::NAN]) produces two runs, diverging from pre‑PR behaviour). Please switch these comparisons to something like T::Native::is_eq / is_ne (or equivalent bitwise checks) so floating-point runs, including mixed-validity cases, retain the former semantics.
  • arrow-cast/src/cast/run_array.rs:525 – runs_for_dictionary only looks at the encoded keys. If the dictionary contains duplicate values at different key indices (which Arrow allows), two equal logical values with different keys will be treated as distinct runs, whereas the previous implementation compared the decoded values and kept them together. You’ll want to validate equality against the dictionary values (similar to what arrow_ord::partition did) before recording a boundary.
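The second finding can be illustrated with a simplified, hypothetical helper (not the PR's runs_for_dictionary; keys index directly into a values slice and nulls are omitted): comparing decoded values keeps duplicate dictionary entries in one run.

```rust
// Sketch: compare the decoded dictionary value at each key, not the key
// itself, so duplicate dictionary entries do not split a run.
fn dict_run_starts(keys: &[usize], dict_values: &[&str]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev: Option<&str> = None;
    for (i, &k) in keys.iter().enumerate() {
        let v = dict_values[k];
        if prev != Some(v) {
            starts.push(i);
        }
        prev = Some(v);
    }
    starts
}

fn main() {
    // Keys 0 and 2 both decode to "a": key-based comparison would report run
    // starts [0, 1, 2]; value-based comparison keeps them in a single run.
    assert_eq!(dict_run_starts(&[0, 2, 1], &["a", "b", "a"]), vec![0, 2]);
}
```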

