
8838: Add comparison support for Union arrays #21

Open
martin-augment wants to merge 1 commit into main from pr-8838-2025-11-14-11-05-19

Conversation

@martin-augment
Owner

8838: To review by AI

@coderabbitai

coderabbitai bot commented Nov 14, 2025

Walkthrough

The PR adds support for comparing UnionArray types in the core comparator. A new compare_union function was introduced that verifies matching union fields and modes between values, builds per-field comparators for each union member, and performs comparisons by matching type_ids then comparing values. The make_comparator function now routes Union type pairs to this handler. The implementation supports both dense and sparse union layouts using type_ids and offsets. Corresponding test cases were added for dense and sparse union comparisons, and the imports were updated to include necessary DataType and Union-related types. No public API changes were made.
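The flow described in the walkthrough can be sketched without any Arrow dependency. The snippet below is an illustration only, not the PR's code: `ToyUnion`, `child_index`, `value`, and `compare_slots` are hypothetical names, and every child column is simplified to `i64` values so that one comparison rule covers all fields.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical stand-in for a union array: one `i8` type id per slot,
/// optional offsets (present = dense layout, absent = sparse layout),
/// and one child column of `i64` values per type id.
pub struct ToyUnion {
    pub type_ids: Vec<i8>,
    pub offsets: Option<Vec<i32>>,
    pub children: HashMap<i8, Vec<i64>>,
}

impl ToyUnion {
    /// Position of slot `i` inside its child column.
    fn child_index(&self, i: usize) -> usize {
        match &self.offsets {
            // Dense: the child column only stores the slots of its own type,
            // so the offset buffer says where to look.
            Some(offsets) => offsets[i] as usize,
            // Sparse: every child column is as long as the union itself.
            None => i,
        }
    }

    /// Value stored in slot `i`, read from the child selected by its type id.
    fn value(&self, i: usize) -> i64 {
        self.children[&self.type_ids[i]][self.child_index(i)]
    }
}

/// Compare slot `i` of `left` with slot `j` of `right`:
/// type ids first, child values only when the type ids are equal.
pub fn compare_slots(left: &ToyUnion, i: usize, right: &ToyUnion, j: usize) -> Ordering {
    left.type_ids[i]
        .cmp(&right.type_ids[j])
        .then_with(|| left.value(i).cmp(&right.value(j)))
}
```

A dense and a sparse `ToyUnion` holding the same logical values compare as equal slot by slot, which is the property the dense and sparse tests in the PR exercise on real `UnionArray` values.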


@claude

claude bot commented Nov 14, 2025

Code Review: Add comparison support for Union arrays

Thank you for implementing comparison support for Union arrays! This is a valuable addition to the arrow-ord crate. Below is my detailed review:


✅ Positive Aspects

Code Quality

  • Clean implementation: The compare_union function follows the established pattern used by other comparison functions like compare_struct and compare_map
  • Proper use of child_opts: Correctly applies child sort options to handle nested sorting semantics
  • Consistent error handling: Properly validates that both union arrays have the same fields and mode before comparison

Test Coverage

  • Comprehensive test cases: Both test_dense_union and test_sparse_union test various comparison scenarios
  • Descending sort testing: Validates that sort order inversion works correctly
  • Well-documented tests: Clear comments explaining the structure and expected outcomes

🔍 Issues & Concerns

1. Potential Index Out of Bounds (Critical)

Location: arrow-ord/src/ord.rs:353

let cmp = field_comparators[left_type_id as usize]
    .as_ref()
    .expect("type_id not found in field_comparators");

Issue: If the data contains a type_id that has no entry in field_comparators, the lookup either indexes out of bounds or hits a None slot. Both cases panic.

Scenario: If type_ids in the array are not consecutive (e.g., type_ids [0, 5, 7]), but the comparator vector is sized based on max_type_id (7), indices [1,2,3,4,6] will be None. If the actual data references these missing type_ids, the expect() will panic.

Recommendation: Add validation or use a more robust error message:

let cmp = field_comparators
    .get(left_type_id as usize)
    .and_then(|c| c.as_ref())
    .unwrap_or_else(|| panic!("Invalid type_id {} not found in union fields", left_type_id));

Better yet, consider whether this should return an error rather than panic, since this is a library function.
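Since the comparator itself yields an Ordering, one way to surface an error instead of a panic would be to validate the type ids once, while the comparator is being built. The sketch below assumes a hypothetical helper `check_type_ids` that takes the declared ids and the type-id buffer as plain slices; whether arrow-rs already enforces this invariant when a `UnionArray` is constructed is not established in this thread.

```rust
use std::collections::HashSet;

/// Hypothetical up-front check: every type id that appears in the data
/// must be one of the ids declared in the union fields.
/// Returns an error naming the first offending slot instead of panicking later.
pub fn check_type_ids(declared: &[i8], type_ids: &[i8]) -> Result<(), String> {
    let known: HashSet<i8> = declared.iter().copied().collect();
    match type_ids.iter().position(|id| !known.contains(id)) {
        Some(pos) => Err(format!(
            "slot {} has type_id {} which is not declared in the union fields",
            pos, type_ids[pos]
        )),
        None => Ok(()),
    }
}
```

With a check like this at construction time, the `expect` inside the closure would document an invariant rather than guard against bad input.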


2. Semantic Comparison by type_id (Design Question)

Location: arrow-ord/src/ord.rs:347-359

// first, compare by type_id
match left_type_id.cmp(&right_type_id) {
    Ordering::Equal => { /* compare values */ }
    other => other,
}

Question: Is comparing by type_id first semantically correct for Union arrays?

In the test example:

  • (type_id=0, value=100) is considered LESS than (type_id=1, value=1) solely based on type_id
  • This means an integer value of 100 compares as "less than" a string value of "a"

Considerations:

  • This makes sorting stable and deterministic ✅
  • But it may be unintuitive for users who expect value-based comparison
  • Other type systems (e.g., SQL) may have different semantics

Recommendation: This should be documented in the function's docstring, explaining that union values are first ordered by type_id, then by value within the same type.
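To make the ordering concrete for such a docstring, here is a small standalone sketch. The `Slot` enum and `union_order` function are hypothetical and only mimic a two-field union (type_id 0 = integer, type_id 1 = string); they are not part of arrow-ord.

```rust
use std::cmp::Ordering;

/// Hypothetical two-variant value standing in for a union slot.
#[derive(Debug, Clone, PartialEq)]
pub enum Slot {
    Int(i32),    // plays the role of type_id 0
    Str(String), // plays the role of type_id 1
}

impl Slot {
    fn type_id(&self) -> i8 {
        match self {
            Slot::Int(_) => 0,
            Slot::Str(_) => 1,
        }
    }
}

/// Order by value when both slots have the same type,
/// otherwise by type id alone.
pub fn union_order(a: &Slot, b: &Slot) -> Ordering {
    match (a, b) {
        (Slot::Int(x), Slot::Int(y)) => x.cmp(y),
        (Slot::Str(x), Slot::Str(y)) => x.cmp(y),
        _ => a.type_id().cmp(&b.type_id()),
    }
}
```

Sorting a mixed vector with `sort_by(union_order)` groups all integers before all strings, so `Int(100)` lands before `Str("a")` regardless of the values involved.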


3. Missing Edge Case Tests

The tests are good but could be more comprehensive:

Missing scenarios:

  • Comparing unions with different fields (should error - already handled in code ✅)
  • Unions with null values in the child arrays
  • Unions with empty arrays
  • Unions with non-consecutive type_ids (e.g., [0, 5, 7])
  • Cross-comparison between dense and sparse unions (should error - already handled ✅)

Recommendation: Add at least one test for null handling:

#[test]
fn test_union_with_nulls() {
    // Create union arrays with nullable children and verify comparison behavior
}

4. Performance Consideration (Minor)

Location: arrow-ord/src/ord.rs:324-327

let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
let mut field_comparators: Vec<Option<DynComparator>> =
    Vec::with_capacity((max_type_id + 1) as usize);
field_comparators.resize_with((max_type_id + 1) as usize, || None);

Issue: If type_ids are sparse (e.g., [0, 50, 100]), this allocates a 101-element vector with most entries being None. Because type_ids are i8, the largest possible id is 127, so the waste is bounded but still avoidable.

Alternative approach: Use a HashMap for O(1) lookup with no wasted space:

use std::collections::HashMap;

// One comparator per declared type_id; no placeholder entries for unused ids.
let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new();
for (type_id, _field) in left_fields.iter() {
    let left_child = left.child(type_id);
    let right_child = right.child(type_id);
    let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
    field_comparators.insert(type_id, cmp);
}

Then in the closure:

let cmp = field_comparators.get(&left_type_id)
    .expect("type_id not found in field_comparators");

Trade-off: HashMap has slightly higher overhead for small arrays but scales better for sparse type_ids.
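The difference in container size can be seen in a self-contained sketch. `Cmp`, `dummy_cmp`, `build_vec`, and `build_map` are hypothetical names; `Cmp` merely stands in for `DynComparator`, and `build_vec` assumes non-negative ids, as the Vec-based approach in the PR does.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical comparator type standing in for `DynComparator`.
pub type Cmp = Box<dyn Fn(usize, usize) -> Ordering>;

/// Placeholder comparator so the containers have something to hold.
fn dummy_cmp() -> Cmp {
    Box::new(|i: usize, j: usize| i.cmp(&j))
}

/// Vec-based table: indexed by type id, sized to `max id + 1`.
/// Assumes every id is non-negative.
pub fn build_vec(ids: &[i8]) -> Vec<Option<Cmp>> {
    let len = ids.iter().copied().max().map_or(0, |m| m as usize + 1);
    let mut table: Vec<Option<Cmp>> = Vec::new();
    table.resize_with(len, || None);
    for &id in ids {
        table[id as usize] = Some(dummy_cmp());
    }
    table
}

/// Map-based table: exactly one entry per declared type id.
pub fn build_map(ids: &[i8]) -> HashMap<i8, Cmp> {
    ids.iter().map(|&id| (id, dummy_cmp())).collect()
}
```

For the ids `[1, 100]`, `build_vec` yields 101 slots of which 2 are populated, while `build_map` holds 2 entries. The Vec pays in empty slots and the map pays a hash per lookup inside the comparator, so which one wins depends on how sparse the ids are in practice.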


5. Clone Operations (Performance)

Location: arrow-ord/src/ord.rs:336-340

let left_type_ids = left.type_ids().clone();
let right_type_ids = right.type_ids().clone();

let left_offsets = left.offsets().cloned();
let right_offsets = right.offsets().cloned();

Note: These clones capture the buffers for the closure. Since these are buffer references, the clone is relatively cheap (Arc increments), but worth noting.
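The cost model behind that note can be illustrated with the standard library alone. In this sketch an `Arc<[i8]>` plays the role of the type-id buffer and `capture` is a hypothetical helper; the real buffer types in arrow-rs are not used here.

```rust
use std::sync::Arc;

/// Build a closure that owns its own handle to the shared type-id buffer.
/// Cloning the `Arc` bumps a reference count; the bytes are not copied.
pub fn capture(type_ids: &Arc<[i8]>) -> impl Fn(usize) -> i8 {
    let ids = Arc::clone(type_ids);
    move |i| ids[i]
}
```

While the returned closure is alive, `Arc::strong_count` on the original handle reports 2; once the closure is dropped it falls back to 1, and the underlying allocation is shared throughout.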

Observation: This is consistent with other comparison functions (e.g., compare_map also clones offsets), so no change needed. ✅


📋 Summary

Must Fix

Should Consider

Overall Assessment

This is a solid implementation that correctly handles the core comparison logic for Union arrays. The code follows project conventions and includes good test coverage for the basic cases. With the suggested improvements, particularly around error handling and documentation, this would be production-ready.

Recommendation: Approve with minor revisions for issue #1 and documentation improvements.


cc: @friendlymatthew for the implementation


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


let left_child = left.child(type_id);
let right_child = right.child(type_id);
let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
field_comparators[type_id as usize] = Some(cmp);

Indexing field_comparators with type_id as usize assumes non-negative, dense type IDs and can panic or cause huge allocations if a union uses negative or sparse type IDs (Arrow allows arbitrary i8 type IDs). Consider avoiding direct index-based lookup by type ID here (also applies to the later access in the comparator closure).
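The cast behaviour behind this concern is easy to check in isolation. In the sketch below, `index_for` and `table_len` are hypothetical helpers showing a checked conversion; whether negative ids can actually reach this code path depends on validation that is not shown in this thread.

```rust
/// Convert a type id into a vector index, rejecting negative ids
/// instead of letting `as usize` wrap them to a huge value.
pub fn index_for(type_id: i8) -> Option<usize> {
    usize::try_from(type_id).ok()
}

/// Length of a table indexed by type id, computed in `usize`
/// so that a maximum id of 127 does not overflow `i8` arithmetic.
pub fn table_len(max_type_id: i8) -> Option<usize> {
    index_for(max_type_id).map(|m| m + 1)
}
```

In Rust, `-1i8 as usize` sign-extends to `usize::MAX`, so an unchecked index built from a negative id is far out of bounds, and `127i8 + 1` overflows `i8`. Doing the conversion first and the arithmetic in `usize` avoids both.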

Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! If the field IDs are something like "[1, 100]", the code sizes the Vec from the max (100) and allocates 101 entries when it really needs just two. Using a HashMap would be better.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
arrow-ord/src/ord.rs (2)

299-363: Solid implementation with one optional optimization.

The logic correctly handles both dense and sparse union layouts by comparing type_ids first, then values. The implementation properly validates matching fields/modes and uses child comparators with correct offset handling.

Consider using a HashMap<i8, DynComparator> instead of Vec<Option<DynComparator>> for sparse type_id ranges (e.g., [1, 50, 100]). The current approach allocates a vector of size max_type_id + 1, which wastes memory when type_ids are not contiguous.

Apply this optimization if sparse type_ids are a common use case:

-    let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
-    let mut field_comparators: Vec<Option<DynComparator>> =
-        Vec::with_capacity((max_type_id + 1) as usize);
-    field_comparators.resize_with((max_type_id + 1) as usize, || None);
+    use std::collections::HashMap;
+    let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new();

     for (type_id, _field) in left_fields.iter() {
         let left_child = left.child(type_id);
         let right_child = right.child(type_id);
         let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
-        field_comparators[type_id as usize] = Some(cmp);
+        field_comparators.insert(type_id, cmp);
     }

Then update the lookup:

-                let cmp = field_comparators[left_type_id as usize]
-                    .as_ref()
+                let cmp = field_comparators
+                    .get(&left_type_id)
                     .expect("type_id not found in field_comparators");

1342-1369: Expand test coverage for sparse unions.

The sparse union test only validates self-comparison. Consider adding tests with two different sparse union arrays to match the comprehensive coverage of the dense union test.

Add a second array and test cross-array comparisons:

// After line 1360, add a second sparse union array:
let int_array2 = Int32Array::from(vec![Some(2), None, Some(1)]);
let str_array2 = StringArray::from(vec![None, Some("a"), None]);
let type_ids2 = [0, 1, 0].into_iter().collect::<ScalarBuffer<i8>>();

let children2 = vec![Arc::new(int_array2) as ArrayRef, Arc::new(str_array2)];

let array2 = UnionArray::try_new(union_fields.clone(), type_ids2, None, children2).unwrap();

let cmp_cross = make_comparator(&array, &array2, opts).unwrap();

// Test cross-array comparisons
assert_eq!(cmp_cross(0, 0), Ordering::Less); // 1 < 2
assert_eq!(cmp_cross(2, 2), Ordering::Greater); // 3 > 1
assert_eq!(cmp_cross(1, 1), Ordering::Greater); // "b" > "a"
assert_eq!(cmp_cross(0, 1), Ordering::Less); // type_id 0 < 1
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4f421d1 and 0ceac84.

📒 Files selected for processing (1)
  • arrow-ord/src/ord.rs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
arrow-ord/src/ord.rs (1)
arrow-array/src/array/union_array.rs (10)
  • fields (306-309)
  • fields (459-459)
  • fields (916-919)
  • type_id (261-264)
  • type_ids (267-269)
  • type_ids (462-465)
  • offsets (272-274)
  • union_fields (2202-2210)
  • Arc (1293-1299)
  • Arc (2239-2244)
🔇 Additional comments (4)
arrow-ord/src/ord.rs (4)

24-24: LGTM!

The addition of DataType to imports is necessary for extracting union fields and mode in the new comparison logic.


481-481: LGTM!

The routing for Union types follows the established pattern for other nested types like Struct and Map.


493-494: LGTM!

The test imports are necessary for constructing union test cases with ScalarBuffer, UnionFields, and i256.


1260-1340: LGTM!

The dense union test comprehensively covers both type_id-based ordering (cross-type comparisons) and value-based ordering (same-type comparisons), including descending sort behavior.

@martin-augment
Owner Author

1. Potential Index Out of Bounds (Critical)

Location: arrow-ord/src/ord.rs:353

let cmp = field_comparators[left_type_id as usize]
    .as_ref()
    .expect("type_id not found in field_comparators");

Issue: If the data contains a type_id that has no entry in field_comparators, the lookup either indexes out of bounds or hits a None slot. Both cases panic.

Scenario: If type_ids in the array are not consecutive (e.g., type_ids [0, 5, 7]), but the comparator vector is sized based on max_type_id (7), indices [1,2,3,4,6] will be None. If the actual data references these missing type_ids, the expect() will panic.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! The field_comparators Vec is built earlier from left_fields, after the function has checked that left_fields and right_fields match, so every type_id declared in the fields has a comparator for the whole scope of the function.

@martin-augment
Owner Author

4. Performance Consideration (Minor)

Location: arrow-ord/src/ord.rs:324-327

let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
let mut field_comparators: Vec<Option<DynComparator>> =
    Vec::with_capacity((max_type_id + 1) as usize);
field_comparators.resize_with((max_type_id + 1) as usize, || None);

Issue: If type_ids are sparse (e.g., [0, 50, 100]), this allocates a 101-element vector with most entries being None. Because type_ids are i8, the largest possible id is 127, so the waste is bounded but still avoidable.

Alternative approach: Use a HashMap for O(1) lookup with no wasted space:

value:useful; category:bug; feedback:The Claude AI reviewer is correct! If the field IDs are something like "[1, 100]", the code sizes the Vec from the max (100) and allocates 101 entries when it really needs just two. Using a HashMap would be better.

@martin-augment
Owner Author

299-363: Solid implementation with one optional optimization.

The logic correctly handles both dense and sparse union layouts by comparing type_ids first, then values. The implementation properly validates matching fields/modes and uses child comparators with correct offset handling.

Consider using a HashMap<i8, DynComparator> instead of Vec<Option<DynComparator>> for sparse type_id ranges (e.g., [1, 50, 100]). The current approach allocates a vector of size max_type_id + 1, which wastes memory when type_ids are not contiguous.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! If the field IDs are something like "[1, 100]", the code sizes the Vec from the max (100) and allocates 101 entries when it really needs just two. Using a HashMap would be better.
