8838: Add comparison support for Union arrays by martin-augment · Pull Request #21 · martin-augment/arrow-rs

martin-augment · 2025-11-14T11:05:29Z

8838: To review by AI

coderabbitai · 2025-11-14T11:05:50Z

Walkthrough

The PR adds support for comparing UnionArray types in the core comparator. A new compare_union function was introduced that verifies matching union fields and modes between values, builds per-field comparators for each union member, and performs comparisons by matching type_ids then comparing values. The make_comparator function now routes Union type pairs to this handler. The implementation supports both dense and sparse union layouts using type_ids and offsets. Corresponding test cases were added for dense and sparse union comparisons, and the imports were updated to include necessary DataType and Union-related types. No public API changes were made.
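The difference between the two layouts mentioned in the walkthrough comes down to how a logical row is mapped to a row of the selected child array. The following is a minimal sketch using only the standard library; `child_index` is a hypothetical helper for illustration, not a function from arrow-rs, and plain slices stand in for Arrow buffers.

```rust
/// Hypothetical helper (not part of arrow-rs): maps a logical row of a union
/// to the row of the child array selected by that row's type_id.
/// - Dense layout: an offsets buffer exists and `offsets[row]` is the child row.
/// - Sparse layout: there is no offsets buffer; every child has the union's
///   full length, so the logical row is used directly.
fn child_index(offsets: Option<&[i32]>, row: usize) -> usize {
    match offsets {
        Some(offsets) => offsets[row] as usize,
        None => row,
    }
}
```

With dense offsets `[0, 0, 1]`, row 2 resolves to child row 1; without offsets (sparse), row 2 resolves to child row 2.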


claude · 2025-11-14T11:07:45Z

Code Review: Add comparison support for Union arrays

Thank you for implementing comparison support for Union arrays! This is a valuable addition to the arrow-ord crate. Below is my detailed review:

✅ Positive Aspects

Code Quality

Clean implementation: The compare_union function follows the established pattern used by other comparison functions like compare_struct and compare_map
Proper use of child_opts: Correctly applies child sort options to handle nested sorting semantics
Consistent error handling: Properly validates that both union arrays have the same fields and mode before comparison

Test Coverage

Comprehensive test cases: Both test_dense_union and test_sparse_union test various comparison scenarios
Descending sort testing: Validates that sort order inversion works correctly
Well-documented tests: Clear comments explaining the structure and expected outcomes

🔍 Issues & Concerns

1. Potential Index Out of Bounds (Critical)

Location: arrow-ord/src/ord.rs:353

let cmp = field_comparators[left_type_id as usize]
    .as_ref()
    .expect("type_id not found in field_comparators");

Issue: If the array's type_ids buffer contains a type_id that has no entry in union_fields, the lookup fails: an index past the end of the vector panics on the indexing itself, and an index that lands on a None slot panics in expect().

Scenario: If type_ids in the array are not consecutive (e.g., type_ids [0, 5, 7]), but the comparator vector is sized based on max_type_id (7), indices [1,2,3,4,6] will be None. If the actual data references these missing type_ids, the expect() will panic.

Recommendation: Add validation or use a more robust error message:

let cmp = field_comparators
    .get(left_type_id as usize)
    .and_then(|c| c.as_ref())
    .unwrap_or_else(|| panic!("Invalid type_id {} not found in union fields", left_type_id));

Better yet, consider whether this should return an error rather than panic, since this is a library function.
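Because the comparator closure itself returns an Ordering and has no way to report an error, one possible reading of this suggestion is to validate the type_ids once, when the comparator is built. The sketch below illustrates that idea with the standard library only; `validate_type_ids` is a hypothetical name, the error is a plain String rather than an Arrow error type, and whether the PR needs such a check at all depends on what the union constructors already guarantee.

```rust
/// Hypothetical up-front check (not part of arrow-rs): every type_id that
/// appears in the data must be one of the ids declared in the union fields.
/// Returns a descriptive error instead of panicking later inside the closure.
fn validate_type_ids(type_ids: &[i8], declared: &[i8]) -> Result<(), String> {
    match type_ids.iter().find(|&&id| !declared.contains(&id)) {
        Some(bad) => Err(format!("type_id {} not found in union fields", bad)),
        None => Ok(()),
    }
}
```

Running the check once at construction time keeps the per-row comparison path free of error handling.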

2. Semantic Comparison by type_id (Design Question)

Location: arrow-ord/src/ord.rs:347-359

// first, compare by type_id
match left_type_id.cmp(&right_type_id) {
    Ordering::Equal => { /* compare values */ }
    other => other,
}

Question: Is comparing by type_id first semantically correct for Union arrays?

In the test example:

(type_id=0, value=100) is considered LESS than (type_id=1, value=1) solely based on type_id
This means an integer value of 100 compares as "less than" a string value of "a"

Considerations:

This makes sorting stable and deterministic ✅
But it may be unintuitive for users who expect value-based comparison
Other type systems (e.g., SQL) may have different semantics

Recommendation: This should be documented in the function's docstring, explaining that union values are first ordered by type_id, then by value within the same type.
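The ordering described here can be illustrated without Arrow at all. The sketch below models a union slot as a Rust enum, with the assumption (for illustration only) that integers carry type_id 0 and strings carry type_id 1, as in the test example; `Slot` and `compare_slots` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Hypothetical model of one union slot: Int stands in for type_id 0,
/// Str for type_id 1.
#[derive(Debug, Clone, PartialEq)]
enum Slot {
    Int(i32),
    Str(String),
}

impl Slot {
    fn type_id(&self) -> i8 {
        match self {
            Slot::Int(_) => 0,
            Slot::Str(_) => 1,
        }
    }
}

/// Orders by type_id first; values are compared only when the type_ids match.
fn compare_slots(left: &Slot, right: &Slot) -> Ordering {
    match left.type_id().cmp(&right.type_id()) {
        Ordering::Equal => match (left, right) {
            (Slot::Int(a), Slot::Int(b)) => a.cmp(b),
            (Slot::Str(a), Slot::Str(b)) => a.cmp(b),
            _ => unreachable!("equal type_ids imply the same variant"),
        },
        other => other,
    }
}
```

Under this model, `Slot::Int(100)` sorts before `Slot::Str("a")` regardless of the values, which is the behaviour the docstring would need to spell out.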

3. Missing Edge Case Tests

The tests are good but could be more comprehensive:

Missing scenarios:

Comparing unions with different fields (should error - already handled in code ✅)
Unions with null values in the child arrays
Unions with empty arrays
Unions with non-consecutive type_ids (e.g., [0, 5, 7])
Cross-comparison between dense and sparse unions (should error - already handled ✅)

Recommendation: Add at least one test for null handling:

#[test]
fn test_union_with_nulls() {
    // Create union arrays with nullable children and verify comparison behavior
}

4. Performance Consideration (Minor)

Location: arrow-ord/src/ord.rs:324-327

let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
let mut field_comparators: Vec<Option<DynComparator>> =
    Vec::with_capacity((max_type_id + 1) as usize);
field_comparators.resize_with((max_type_id + 1) as usize, || None);

Issue: If type_ids are sparse (e.g., [0, 50, 100]), this allocates a 101-element vector in which most entries are None. For non-negative i8 type_ids the vector never exceeds 128 slots, so the waste is bounded, but it is still unnecessary.

Alternative approach: Use a HashMap for O(1) lookup with no wasted space:

use std::collections::HashMap;

// Key comparators by type_id so no slots are allocated for unused ids.
let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new();
for (type_id, _field) in left_fields.iter() {
    let left_child = left.child(type_id);
    let right_child = right.child(type_id);
    // Build one child comparator per union member.
    let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
    field_comparators.insert(type_id, cmp);
}

Then in the closure:

let cmp = field_comparators.get(&left_type_id)
    .expect("type_id not found in field_comparators");

Trade-off: HashMap has slightly higher overhead for small arrays but scales better for sparse type_ids.
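To make the trade-off concrete, the sketch below compares how many slots each structure needs for the same set of type_ids. It uses only the standard library; `dense_table_len` and `keyed_table` are hypothetical helpers, and strings stand in for the real comparators.

```rust
use std::collections::HashMap;

/// Number of slots a Vec indexed by type_id would need (max id + 1).
/// Assumes non-negative ids; negative ids are clamped to 0 here only to keep
/// the arithmetic well defined in this sketch.
fn dense_table_len(type_ids: &[i8]) -> usize {
    type_ids
        .iter()
        .copied()
        .max()
        .map_or(0, |m| m.max(0) as usize + 1)
}

/// A table keyed by type_id holds exactly one entry per declared id.
fn keyed_table(type_ids: &[i8]) -> HashMap<i8, String> {
    type_ids
        .iter()
        .map(|&id| (id, format!("comparator for type_id {}", id)))
        .collect()
}
```

For ids `[1, 100]` the indexed vector needs 101 slots while the keyed table holds 2 entries; for contiguous ids such as `[0, 1, 2]` the two are the same size and the vector avoids hashing.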

5. Clone Operations (Performance)

Location: arrow-ord/src/ord.rs:336-340

let left_type_ids = left.type_ids().clone();
let right_type_ids = right.type_ids().clone();

let left_offsets = left.offsets().cloned();
let right_offsets = right.offsets().cloned();

Note: These clones capture the buffers for the closure. Since these are buffer references, the clone is relatively cheap (Arc increments), but worth noting.

Observation: This is consistent with other comparison functions (e.g., compare_map also clones offsets), so no change needed. ✅
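The "Arc increments" remark can be illustrated with a standard-library stand-in. In the sketch below an `Arc<[i8]>` plays the role of the reference-counted type_ids buffer (an assumption for illustration; the actual Arrow buffer types are not used here), and `capture_type_ids` is a hypothetical function showing a closure that takes ownership of a cloned handle.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a comparator closure that owns a handle to the
/// type_ids buffer. Cloning the Arc before the call bumps a reference count;
/// the underlying bytes are not copied.
fn capture_type_ids(type_ids: Arc<[i8]>) -> impl Fn(usize) -> i8 {
    move |row| type_ids[row]
}
```

After `capture_type_ids(Arc::clone(&ids))`, the strong count of `ids` is 2 and both handles point at the same allocation.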

📋 Summary

Must Fix

Issue #1: Add better error handling/validation for type_id lookup to prevent unexpected panics

Should Consider

Issue #2: Document the type_id-first comparison semantics
Issue #3: Add tests for null handling in union children
Issue #4: Consider HashMap for sparse type_ids (optional optimization)

Overall Assessment

This is a solid implementation that correctly handles the core comparison logic for Union arrays. The code follows project conventions and includes good test coverage for the basic cases. With the suggested improvements, particularly around error handling and documentation, this would be production-ready.

Recommendation: Approve with minor revisions for issue #1 and documentation improvements.

cc: @friendlymatthew for the implementation

augmentcode

Review completed. 1 suggestion posted.


augmentcode · 2025-11-14T11:07:48Z

arrow-ord/src/ord.rs

+        let left_child = left.child(type_id);
+        let right_child = right.child(type_id);
+        let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
+        field_comparators[type_id as usize] = Some(cmp);


Indexing field_comparators with type_id as usize assumes non-negative, dense type IDs and can panic or cause huge allocations if a union uses negative or sparse type IDs (Arrow allows arbitrary i8 type IDs). Consider avoiding direct index-based lookup by type ID here (also applies to the later access in the comparator closure).
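The concern about negative IDs follows from how Rust's `as` cast treats signed integers: the value is sign-extended and reinterpreted, not rejected. The sketch below shows the cast and one possible guard; `checked_index` is a hypothetical helper, not something present in the PR.

```rust
/// Hypothetical guard: converts a type_id to a Vec index only when it is
/// non-negative. A plain `type_id as usize` would turn -1 into usize::MAX.
fn checked_index(type_id: i8) -> Option<usize> {
    if type_id >= 0 {
        Some(type_id as usize)
    } else {
        None
    }
}
```

`(-1i8) as usize` evaluates to `usize::MAX`, so indexing a vector with it panics, while `checked_index(-1)` returns `None` and lets the caller decide how to react.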


value:useful; category:bug; feedback:The Augment AI reviewer is correct! If the field IDs are something like [1, 100], the code takes the maximum (100) and creates a Vec of 101 slots, although only two are needed. Using a HashMap would be better.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

arrow-ord/src/ord.rs (2)

299-363: Solid implementation with one optional optimization.

The logic correctly handles both dense and sparse union layouts by comparing type_ids first, then values. The implementation properly validates matching fields/modes and uses child comparators with correct offset handling.

Consider using a HashMap<i8, DynComparator> instead of Vec<Option<DynComparator>> for sparse type_id ranges (e.g., [1, 50, 100]). The current approach allocates a vector of size max_type_id + 1, which wastes memory when type_ids are not contiguous.

Apply this optimization if sparse type_ids are a common use case:

-    let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
-    let mut field_comparators: Vec<Option<DynComparator>> =
-        Vec::with_capacity((max_type_id + 1) as usize);
-    field_comparators.resize_with((max_type_id + 1) as usize, || None);
+    use std::collections::HashMap;
+    let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new();

     for (type_id, _field) in left_fields.iter() {
         let left_child = left.child(type_id);
         let right_child = right.child(type_id);
         let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
-        field_comparators[type_id as usize] = Some(cmp);
+        field_comparators.insert(type_id, cmp);
     }

Then update the lookup:

-                let cmp = field_comparators[left_type_id as usize]
-                    .as_ref()
+                let cmp = field_comparators
+                    .get(&left_type_id)
                     .expect("type_id not found in field_comparators");

1342-1369: Expand test coverage for sparse unions.

The sparse union test only validates self-comparison. Consider adding tests with two different sparse union arrays to match the comprehensive coverage of the dense union test.

Add a second array and test cross-array comparisons:

// After line 1360, add a second sparse union array:
let int_array2 = Int32Array::from(vec![Some(2), None, Some(1)]);
let str_array2 = StringArray::from(vec![None, Some("a"), None]);
let type_ids2 = [0, 1, 0].into_iter().collect::<ScalarBuffer<i8>>();

let children2 = vec![Arc::new(int_array2) as ArrayRef, Arc::new(str_array2)];

let array2 = UnionArray::try_new(union_fields.clone(), type_ids2, None, children2).unwrap();

let cmp_cross = make_comparator(&array, &array2, opts).unwrap();

// Test cross-array comparisons
assert_eq!(cmp_cross(0, 0), Ordering::Less); // 1 < 2
assert_eq!(cmp_cross(2, 2), Ordering::Greater); // 3 > 1
assert_eq!(cmp_cross(1, 1), Ordering::Greater); // "b" > "a"
assert_eq!(cmp_cross(0, 1), Ordering::Less); // type_id 0 < 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f421d1 and 0ceac84.

📒 Files selected for processing (1)

arrow-ord/src/ord.rs (5 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

arrow-ord/src/ord.rs (1)

arrow-array/src/array/union_array.rs (10): fields (306-309), fields (459-459), fields (916-919), type_id (261-264), type_ids (267-269), type_ids (462-465), offsets (272-274), union_fields (2202-2210), Arc (1293-1299), Arc (2239-2244)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Cursor Bugbot
GitHub Check: claude-review

🔇 Additional comments (4)

arrow-ord/src/ord.rs (4)

24-24: LGTM!

The addition of DataType to imports is necessary for extracting union fields and mode in the new comparison logic.

481-481: LGTM!

The routing for Union types follows the established pattern for other nested types like Struct and Map.

493-494: LGTM!

The test imports are necessary for constructing union test cases with ScalarBuffer, UnionFields, and i256.

1260-1340: LGTM!

The dense union test comprehensively covers both type_id-based ordering (cross-type comparisons) and value-based ordering (same-type comparisons), including descending sort behavior.

martin-augment · 2025-11-14T13:39:57Z

1. Potential Index Out of Bounds (Critical)

Location: arrow-ord/src/ord.rs:353
let cmp = field_comparators[left_type_id as usize]
    .as_ref()
    .expect("type_id not found in field_comparators");
Issue: If the array's type_ids buffer contains a type_id that has no entry in union_fields, the lookup fails: an index past the end of the vector panics on the indexing itself, and an index that lands on a None slot panics in expect().

Scenario: If type_ids in the array are not consecutive (e.g., type_ids [0, 5, 7]), but the comparator vector is sized based on max_type_id (7), indices [1,2,3,4,6] will be None. If the actual data references these missing type_ids, the expect() will panic.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! The field_comparators Vector is created earlier with the same length as left_fields and right_fields, so they all have the same length during the scope of the function.

martin-augment · 2025-11-14T13:47:13Z

4. Performance Consideration (Minor)

Location: arrow-ord/src/ord.rs:324-327
let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
let mut field_comparators: Vec<Option<DynComparator>> =
    Vec::with_capacity((max_type_id + 1) as usize);
field_comparators.resize_with((max_type_id + 1) as usize, || None);
Issue: If type_ids are sparse (e.g., [0, 50, 100]), this allocates a 101-element vector in which most entries are None. For non-negative i8 type_ids the vector never exceeds 128 slots, so the waste is bounded, but it is still unnecessary.

Alternative approach: Use a HashMap for O(1) lookup with no wasted space:

value:useful; category:bug; feedback:The Claude AI reviewer is correct! If the field IDs are something like [1, 100], the code takes the maximum (100) and creates a Vec of 101 slots, although only two are needed. Using a HashMap would be better.

martin-augment · 2025-11-14T13:48:17Z

299-363: Solid implementation with one optional optimization.

The logic correctly handles both dense and sparse union layouts by comparing type_ids first, then values. The implementation properly validates matching fields/modes and uses child comparators with correct offset handling.

Consider using a HashMap<i8, DynComparator> instead of Vec<Option<DynComparator>> for sparse type_id ranges (e.g., [1, 50, 100]). The current approach allocates a vector of size max_type_id + 1, which wastes memory when type_ids are not contiguous.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! If the field IDs are something like [1, 100], the code takes the maximum (100) and creates a Vec of 101 slots, although only two are needed. Using a HashMap would be better.

martin-augment wants to merge 1 commit into main from pr-8838-2025-11-14-11-05-19
