8838: Add comparison support for Union arrays#21
Conversation
WalkthroughThe PR adds support for comparing UnionArray types in the core comparator. A new ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: Add comparison support for Union arraysThank you for implementing comparison support for Union arrays! This is a valuable addition to the arrow-ord crate. Below is my detailed review: ✅ Positive AspectsCode Quality
Test Coverage
🔍 Issues & Concerns1. Potential Index Out of Bounds (Critical)Location: let cmp = field_comparators[left_type_id as usize]
.as_ref()
.expect("type_id not found in field_comparators");Issue: If a union array contains a type_id that exists in the schema but was not included in the union_fields iteration, this will panic or return None. Scenario: If type_ids in the array are not consecutive (e.g., type_ids [0, 5, 7]), but the comparator vector is sized based on max_type_id (7), indices [1,2,3,4,6] will be None. If the actual data references these missing type_ids, the expect() will panic. Recommendation: Add validation or use a more robust error message: let cmp = field_comparators
.get(left_type_id as usize)
.and_then(|c| c.as_ref())
.unwrap_or_else(|| panic\!("Invalid type_id {} not found in union fields", left_type_id));Better yet, consider whether this should return an error rather than panic, since this is a library function. 2. Semantic Comparison by type_id (Design Question)Location: // first, compare by type_id
match left_type_id.cmp(&right_type_id) {
Ordering::Equal => { /* compare values */ }
other => other,
}Question: Is comparing by type_id first semantically correct for Union arrays? In the test example:
Considerations:
Recommendation: This should be documented in the function's docstring, explaining that union values are first ordered by type_id, then by value within the same type. 3. Missing Edge Case TestsThe tests are good but could be more comprehensive: Missing scenarios:
Recommendation: Add at least one test for null handling: #[test]
fn test_union_with_nulls() {
// Create union arrays with nullable children and verify comparison behavior
}4. Performance Consideration (Minor)Location: let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0);
let mut field_comparators: Vec<Option<DynComparator>> =
Vec::with_capacity((max_type_id + 1) as usize);
field_comparators.resize_with((max_type_id + 1) as usize, || None);Issue: If type_ids are sparse (e.g., [0, 100, 200]), this allocates a 201-element vector with most entries being None. This wastes memory for large type_ids. Alternative approach: Use a HashMap for O(1) lookup with no wasted space: let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new();
for (type_id, _field) in left_fields.iter() {
let left_child = left.child(type_id);
let right_child = right.child(type_id);
let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?;
field_comparators.insert(type_id, cmp);
}Then in the closure: let cmp = field_comparators.get(&left_type_id)
.expect("type_id not found in field_comparators");Trade-off: HashMap has slightly higher overhead for small arrays but scales better for sparse type_ids. 5. Clone Operations (Performance)Location: let left_type_ids = left.type_ids().clone();
let right_type_ids = right.type_ids().clone();
let left_offsets = left.offsets().cloned();
let right_offsets = right.offsets().cloned();Note: These clones capture the buffers for the closure. Since these are buffer references, the clone is relatively cheap (Arc increments), but worth noting. Observation: This is consistent with other comparison functions (e.g., 📋 SummaryMust Fix
Should Consider
Overall AssessmentThis is a solid implementation that correctly handles the core comparison logic for Union arrays. The code follows project conventions and includes good test coverage for the basic cases. With the suggested improvements, particularly around error handling and documentation, this would be production-ready. Recommendation: Approve with minor revisions for issue #1 and documentation improvements. cc: @friendlymatthew for the implementation |
| let left_child = left.child(type_id); | ||
| let right_child = right.child(type_id); | ||
| let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?; | ||
| field_comparators[type_id as usize] = Some(cmp); |
There was a problem hiding this comment.
Indexing field_comparators with type_id as usize assumes non-negative, dense type IDs and can panic or cause huge allocations if a union uses negative or sparse type IDs (Arrow allows arbitrary i8 type IDs). Consider avoiding direct index-based lookup by type ID here (also applies to the later access in the comparator closure).
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! If the fields ids are something like "[1, 100]" then it will use 100 (the max) and create a Vector of 100 items, while it really needs just two. Using a HashMap would be better
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
arrow-ord/src/ord.rs (2)
299-363: Solid implementation with one optional optimization.The logic correctly handles both dense and sparse union layouts by comparing type_ids first, then values. The implementation properly validates matching fields/modes and uses child comparators with correct offset handling.
Consider using a
HashMap<i8, DynComparator>instead ofVec<Option<DynComparator>>for sparse type_id ranges (e.g., [1, 50, 100]). The current approach allocates a vector of sizemax_type_id + 1, which wastes memory when type_ids are not contiguous.Apply this optimization if sparse type_ids are a common use case:
- let max_type_id = left_fields.iter().map(|(id, _)| id).max().unwrap_or(0); - let mut field_comparators: Vec<Option<DynComparator>> = - Vec::with_capacity((max_type_id + 1) as usize); - field_comparators.resize_with((max_type_id + 1) as usize, || None); + use std::collections::HashMap; + let mut field_comparators: HashMap<i8, DynComparator> = HashMap::new(); for (type_id, _field) in left_fields.iter() { let left_child = left.child(type_id); let right_child = right.child(type_id); let cmp = make_comparator(left_child.as_ref(), right_child.as_ref(), c_opts)?; - field_comparators[type_id as usize] = Some(cmp); + field_comparators.insert(type_id, cmp); }Then update the lookup:
- let cmp = field_comparators[left_type_id as usize] - .as_ref() + let cmp = field_comparators + .get(&left_type_id) .expect("type_id not found in field_comparators");
1342-1369: Expand test coverage for sparse unions.The sparse union test only validates self-comparison. Consider adding tests with two different sparse union arrays to match the comprehensive coverage of the dense union test.
Add a second array and test cross-array comparisons:
// After line 1360, add a second sparse union array: let int_array2 = Int32Array::from(vec![Some(2), None, Some(1)]); let str_array2 = StringArray::from(vec![None, Some("a"), None]); let type_ids2 = [0, 1, 0].into_iter().collect::<ScalarBuffer<i8>>(); let children2 = vec![Arc::new(int_array2) as ArrayRef, Arc::new(str_array2)]; let array2 = UnionArray::try_new(union_fields.clone(), type_ids2, None, children2).unwrap(); let cmp_cross = make_comparator(&array, &array2, opts).unwrap(); // Test cross-array comparisons assert_eq!(cmp_cross(0, 0), Ordering::Less); // 1 < 2 assert_eq!(cmp_cross(2, 2), Ordering::Greater); // 3 > 1 assert_eq!(cmp_cross(1, 1), Ordering::Greater); // "b" > "a" assert_eq!(cmp_cross(0, 1), Ordering::Less); // type_id 0 < 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
arrow-ord/src/ord.rs(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
arrow-ord/src/ord.rs (1)
arrow-array/src/array/union_array.rs (10)
fields(306-309)fields(459-459)fields(916-919)type_id(261-264)type_ids(267-269)type_ids(462-465)offsets(272-274)union_fields(2202-2210)Arc(1293-1299)Arc(2239-2244)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (4)
arrow-ord/src/ord.rs (4)
24-24: LGTM!The addition of
DataTypeto imports is necessary for extracting union fields and mode in the new comparison logic.
481-481: LGTM!The routing for Union types follows the established pattern for other nested types like Struct and Map.
493-494: LGTM!The test imports are necessary for constructing union test cases with
ScalarBuffer,UnionFields, andi256.
1260-1340: LGTM!The dense union test comprehensively covers both type_id-based ordering (cross-type comparisons) and value-based ordering (same-type comparisons), including descending sort behavior.
value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! The |
value:useful; category:bug; feedback:The Claude AI reviewer is correct! If the fields ids are something like "[1, 100]" then it will use 100 (the max) and create a Vector of 100 items, while it really needs just two. Using a HashMap would be better |
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! If the fields ids are something like "[1, 100]" then it will use 100 (the max) and create a Vector of 100 items, while it really needs just two. Using a HashMap would be better |
8838: To review by AI