8839: Add support for Union types in RowConverter (#20)

Open
martin-augment wants to merge 1 commit into main from pr-8839-2025-11-14-10-44-25

Conversation

@martin-augment
Owner

8839: To review by AI

@coderabbitai

coderabbitai bot commented Nov 14, 2025

Walkthrough

The pull request adds comprehensive support for Union data types to the row-based encoding/decoding system. Changes include: a new Union variant in the Codec enum storing per-type-branch converters and mode information; extension of the Encoder enum with a Union variant containing child rows, type identifiers, offsets, and union mode; Union handling throughout the codec lifecycle including construction, encoding/decoding, and length calculation; and validation that Union types are supported in sort operations. The implementation handles both Dense and Sparse union modes and includes tests for round-trip encoding, type preservation, and ordering semantics.
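The row layout the walkthrough describes can be illustrated with a toy sketch. This is a hypothetical simplification of the general idea (a validity byte, then a type identifier byte, then the child's row encoding), not the actual arrow-row byte format; the function name `encode_union_value` is invented for illustration. Lexicographic byte comparison then orders rows first by type_id and then by the encoded child value, which is the ordering semantic the tests below exercise.

```rust
// Toy sketch of a union row encoding (hypothetical, not the arrow-row format):
// [validity byte, type_id byte, child row bytes]. Plain byte-wise comparison
// then sorts by branch first, by value second.
fn encode_union_value(type_id: u8, child_row: &[u8]) -> Vec<u8> {
    let mut out = vec![0x01]; // 0x01 marks a valid (non-null) value
    out.push(type_id);
    out.extend_from_slice(child_row);
    out
}

fn main() {
    // Same branch: ordering falls through to the child bytes.
    let a = encode_union_value(0, &[0x05]);
    let b = encode_union_value(0, &[0x14]);
    assert!(a < b);

    // Different branches: the type_id byte dominates, regardless of child bytes.
    let c = encode_union_value(1, &[0x00]);
    assert!(b < c);
}
```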



@augmentcode augmentcode bot left a comment


Review completed. 4 suggestions posted.



let mut child_rows = Vec::with_capacity(converters.len());
for (type_id, converter) in converters.iter().enumerate() {
let child_array = union_array.child(type_id as i8);

Using the enumerate index as type_id when calling union_array.child(type_id as i8) assumes union type IDs are dense and aligned with field order. Consider deriving the mapping from the actual union field type IDs; otherwise this may select the wrong child (or panic) when type IDs are non-contiguous.


Owner Author

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.
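The sparse-vs-dense distinction the owner invokes can be sketched with a toy model. These are plain vectors, not arrow-rs types, and the toy assumes contiguous zero-based type_ids (the very assumption under debate); it only illustrates why sparse children can be addressed by the logical row index while dense children need an offsets buffer.

```rust
// Toy model of the two Arrow union layouts (simplified, not arrow-rs types).
// Sparse: every child array spans the full logical length, so the logical row
// index addresses each child directly.
fn sparse_value(children: &[Vec<i32>], type_ids: &[i8], row: usize) -> i32 {
    children[type_ids[row] as usize][row]
}

// Dense: children are compact, and an offsets buffer maps each logical row to
// a slot within the selected child.
fn dense_value(children: &[Vec<i32>], type_ids: &[i8], offsets: &[i32], row: usize) -> i32 {
    children[type_ids[row] as usize][offsets[row] as usize]
}

fn main() {
    // Logical column [1, 20, 3], where branch 0 holds 1 and 3, branch 1 holds 20.
    let type_ids = [0i8, 1, 0];

    // Sparse layout: unused slots are padding.
    let sparse_children = vec![vec![1, 0, 3], vec![0, 20, 0]];
    assert_eq!(sparse_value(&sparse_children, &type_ids, 1), 20);
    assert_eq!(sparse_value(&sparse_children, &type_ids, 2), 3);

    // Dense layout: compact children plus offsets.
    let dense_children = vec![vec![1, 3], vec![20]];
    let offsets = [0i32, 0, 1];
    assert_eq!(dense_value(&dense_children, &type_ids, &offsets, 1), 20);
    assert_eq!(dense_value(&dense_children, &type_ids, &offsets, 2), 3);
}
```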

}
};

let child_row = child_rows[type_id as usize].row(child_row_i);

Indexing child_rows by type_id as usize assumes type IDs are 0..N-1 and match the vector order. This can read the wrong child rows for unions with non-dense or non-zero-based type IDs (also applies to similar indexing in encode_column).


Owner Author

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.

mode,
} => {
let _union_array = column.as_any().downcast_ref::<UnionArray>().unwrap();
let null_sentinel = if opts.descending { 0x00 } else { 0x01 };

The first byte (null sentinel) is set solely based on opts.descending and not on validity, unlike other encoders that write 0x01 for valid and null_sentinel(opts) for null. This can produce inconsistent ordering/round-tripping for descending sorts and arrays with nulls.


Owner Author

value:annoying; category:bug; feedback:The Augment AI reviewer is not correct. The other encoders use the null_first option to calculate the sentinel value. See https://github.com/pydantic/arrow-rs/blob/55590118f1492addfbb40a736bdf7d67123d303f/arrow-row/src/lib.rs

if options.descending { !id } else { id }
};

let type_id = type_id_byte as i8;

Using type_id as usize to index rows_by_field/null_rows assumes dense, zero-based type IDs aligned with converters; this will misplace rows and build incorrect arrays for unions with arbitrary type IDs. (Related to the mapping issue noted above.)


Owner Author

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.

@claude

claude bot commented Nov 14, 2025

Code Review: Add support for Union types in RowConverter

Thank you for implementing Union type support in the RowConverter! This is a valuable addition. Here's my detailed review:

Code Quality and Best Practices

Positive Aspects

Consistent with existing patterns: The implementation follows the same approach used for Struct and Dictionary types, maintaining code consistency
Handles both Union modes: Properly supports both Dense and Sparse union modes
Good test coverage: Three comprehensive tests cover sparse, dense, and ordering scenarios

Issues Found

1. Typo in comment (arrow-row/src/lib.rs:532)

// since the encodedc ontents will be inverted if descending is set

Should be: "encoded contents" (missing space)

2. Critical Bug: Type ID mapping issue (arrow-row/src/lib.rs:634)

for (type_id, converter) in converters.iter().enumerate() {
    let child_array = union_array.child(type_id as i8);

Problem: This assumes type_ids are sequential starting from 0, but Arrow Union types can have arbitrary type_ids (e.g., type_ids could be [5, 10, 15]). The current code would:

  • Enumerate as 0, 1, 2
  • Try to access children with these indices
  • Miss the actual type_id values

Fix needed: Use the actual type_ids from the union fields:

for (type_id, _field) in union_fields.iter() {
    let child_array = union_array.child(type_id);
    let converter = &converters[/* need mapping */];

You'll need a mapping from type_id to converter index.

3. Same type ID mapping issue in decode (arrow-row/src/lib.rs:1912-1931)

The rows_by_field vector is indexed by field_idx = type_id as usize, which fails for non-sequential type IDs.

4. Same issue in child_arrays iteration (arrow-row/src/lib.rs:1939)

for (field_idx, converter) in converters.iter().enumerate() {

This assumes field indices match array positions, but with arbitrary type_ids this breaks.

Potential Bugs

Critical

  1. Non-sequential type IDs: The fundamental assumption that type_ids map to array indices is incorrect. Union type_ids can be arbitrary i8 values. This will cause panics or incorrect data for unions with non-sequential type IDs.
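The fix both reviewers propose amounts to a lookup from the actual i8 type_id to the converter's slot. A minimal sketch of that remapping (the helper name `build_type_id_index` is hypothetical, not code from the PR):

```rust
use std::collections::HashMap;

// Sketch of the proposed remapping: given the union fields' actual type_ids
// (possibly non-contiguous, e.g. [5, 10, 15]), build a map from type_id to the
// position of the corresponding converter in the Vec.
fn build_type_id_index(type_ids: &[i8]) -> HashMap<i8, usize> {
    type_ids
        .iter()
        .enumerate()
        .map(|(slot, &tid)| (tid, slot))
        .collect()
}

fn main() {
    let index = build_type_id_index(&[5, 10, 15]);
    // type_id 10 resolves to converter slot 1, even though 10 != 1.
    assert_eq!(index[&5], 0);
    assert_eq!(index[&10], 1);
    assert_eq!(index[&15], 2);
}
```

With such a map, `converters[index[&type_id]]` stays correct for arbitrary type_ids, at the cost of one hash lookup per value.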

Minor

  1. Unreachable pattern (arrow-row/src/lib.rs:1750): The foreign pattern match is good defensive programming, but consider if there are edge cases during array construction that could trigger this.

Performance Considerations

Concerns

  1. Multiple Vec allocations in decode: Lines 1911-1912 allocate vectors that could be pre-sized if we know the maximum type_id
  2. Sparse mode null row insertion (line 1963-1975): The current approach iterates through all rows and does peek operations. Consider if a more efficient approach exists.

Positive

  • The encoding approach is efficient and follows existing patterns
  • Proper pre-allocation with Vec::with_capacity in most places

Security Concerns

⚠️ Unsafe code usage: The decode function uses unsafe blocks (line 1845, 1956, 1978). While this follows the existing pattern in the codebase:

  1. Ensure convert_raw handles malformed row data gracefully
  2. The type_id byte is read directly from user data - validate it's within bounds before using as array index
  3. Add bounds checking before indexing rows_by_field[field_idx]

Test Coverage

Good Coverage ✅

  • Sparse unions: Basic functionality
  • Dense unions: Offset handling
  • Ordering: Comparison behavior across types

Missing Coverage ⚠️

  1. Non-sequential type IDs: Critical test case missing
    // Test with type_ids = [5, 10] instead of [0, 1]
  2. Null handling: What happens when union fields themselves are nullable?
  3. Empty unions: Edge case with no elements
  4. Descending sort order: Tests only use default ascending order
  5. Nested unions: Union within union scenarios
  6. Round-trip with nulls_first option: Only default options tested

Recommendations

Must Fix

  1. Fix the type ID mapping issue - this is a blocker
  2. Add test for non-sequential type IDs

Should Fix

  1. Fix the typo in the comment
  2. Add bounds checking in decode to prevent panics on malformed data
  3. Add tests for descending order and nulls_first options

Nice to Have

  1. Consider performance optimization for sparse null insertion
  2. Add documentation explaining the encoding format (type_id byte + child row bytes)

Overall Assessment

This is a solid implementation that follows the existing codebase patterns well. However, the type ID mapping bug is critical and must be fixed before merging. The fix will require:

  • Creating a HashMap or similar mapping from type_id -> converter index
  • Updating all three locations where this mapping is needed
  • Adding tests to catch this issue

Great work overall! Please address the critical bug and consider the other suggestions.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f421d1 and 5559011.

📒 Files selected for processing (1)
  • arrow-row/src/lib.rs (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
arrow-row/src/lib.rs (2)
arrow-array/src/array/mod.rs (3)
  • new_null_array (874-876)
  • offset (191-191)
  • new_empty_array (856-859)
arrow-array/src/array/union_array.rs (11)
  • nulls (460-460)
  • nulls (569-572)
  • nulls (777-779)
  • type_ids (267-269)
  • type_ids (462-465)
  • offsets (272-274)
  • type_id (261-264)
  • offset (773-775)
  • union_fields (2202-2210)
  • Arc (1293-1299)
  • Arc (2239-2244)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (1)
arrow-row/src/lib.rs (1)

1743-1743: Verify null sentinel logic for Union encoding.

Line 1743 uses if opts.descending { 0x00 } else { 0x01 } for the null_sentinel, which differs from the pattern used in other codecs that call the null_sentinel(opts) function (lines 1379-1384).

The null_sentinel() function considers both descending and nulls_first options. However, this appears to be encoding the "valid" sentinel (0x01) rather than the null sentinel (0x00), which might be intentionally different from struct encoding at line 1695.

Please verify this is correct and consider adding a comment explaining why Union uses a different pattern than other nested types like Struct.
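For context on the sentinel convention being debated, here is a minimal sketch assuming the scheme the PR owner describes (sentinel driven by `nulls_first`, with `descending` handled separately by inverting the encoded bytes). This is an illustration of that reading, not the crate's actual code; the struct and function are redefined locally.

```rust
// Minimal sketch of a null-sentinel convention (assumed, not arrow-row's code):
// the sentinel depends only on nulls_first, while descending order is applied
// elsewhere by inverting the value bytes.
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

fn null_sentinel(options: SortOptions) -> u8 {
    // 0x00 sorts nulls before every valid value (marked 0x01);
    // 0xFF sorts them after.
    if options.nulls_first { 0x00 } else { 0xFF }
}

fn main() {
    let nulls_first = SortOptions { descending: false, nulls_first: true };
    let nulls_last = SortOptions { descending: false, nulls_first: false };
    assert_eq!(null_sentinel(nulls_first), 0x00);
    assert_eq!(null_sentinel(nulls_last), 0xFF);
    // A valid value's marker byte 0x01 sits strictly between the two sentinels.
    assert!(0x00 < 0x01 && 0x01 < 0xFF);
}
```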

Comment on lines +530 to +558
DataType::Union(fields, mode) => {
    // similar to dictionaries and lists, we set descending to false and negate nulls_first
    // since the encodedc ontents will be inverted if descending is set
    let options = SortOptions {
        descending: false,
        nulls_first: sort_field.options.nulls_first != sort_field.options.descending,
    };

    let mut converters = Vec::with_capacity(fields.len());
    let mut null_rows = Vec::with_capacity(fields.len());

    for (_type_id, field) in fields.iter() {
        let sort_field =
            SortField::new_with_options(field.data_type().clone(), options);
        let converter = RowConverter::new(vec![sort_field])?;

        let null_array = new_null_array(field.data_type(), 1);
        let nulls = converter.convert_columns(&[null_array])?;
        let owned = OwnedRow {
            data: nulls.buffer.into(),
            config: nulls.config,
        };

        converters.push(converter);
        null_rows.push(owned);
    }

    Ok(Self::Union(converters, null_rows, *mode))
}

⚠️ Potential issue | 🔴 Critical

Critical: Incorrect indexing for Union converters by type_id.

The code builds converters and null_rows using Vec indexed by iteration order (0, 1, 2, ...), but Union type_ids can be arbitrary i8 values (e.g., [1, 3, 4] as shown in the relevant code snippets).

Later in the encoder (line 637), the code iterates with converters.iter().enumerate(), treating the enumeration index as if it were the type_id. Then at line 638, it calls union_array.child(type_id as i8), which expects the actual union type_id, not the enumeration index.

This causes a mismatch when type_ids are non-sequential. For example, if union_fields has type_ids [1, 3, 4], then:

  • converters[0] corresponds to type_id 1
  • converters[1] corresponds to type_id 3
  • converters[2] corresponds to type_id 4

But at line 637-638, enumerate() gives (0, 1, 2) and the code treats these as type_ids, incorrectly calling union_array.child(0), union_array.child(1), union_array.child(2) instead of child(1), child(3), child(4).

To fix this, either:

  1. Use a HashMap<i8, (RowConverter, OwnedRow)> indexed by actual type_id, or
  2. Store the type_ids alongside the converters and use them for lookup

Example fix using approach 1:

-            DataType::Union(fields, mode) => {
-                // similar to dictionaries and lists, we set descending to false and negate nulls_first
-                // since the encodedc ontents will be inverted if descending is set
-                let options = SortOptions {
-                    descending: false,
-                    nulls_first: sort_field.options.nulls_first != sort_field.options.descending,
-                };
-
-                let mut converters = Vec::with_capacity(fields.len());
-                let mut null_rows = Vec::with_capacity(fields.len());
-
-                for (_type_id, field) in fields.iter() {
-                    let sort_field =
-                        SortField::new_with_options(field.data_type().clone(), options);
-                    let converter = RowConverter::new(vec![sort_field])?;
-
-                    let null_array = new_null_array(field.data_type(), 1);
-                    let nulls = converter.convert_columns(&[null_array])?;
-                    let owned = OwnedRow {
-                        data: nulls.buffer.into(),
-                        config: nulls.config,
-                    };
-
-                    converters.push(converter);
-                    null_rows.push(owned);
-                }
-
-                Ok(Self::Union(converters, null_rows, *mode))
-            }
+            DataType::Union(fields, mode) => {
+                let options = SortOptions {
+                    descending: false,
+                    nulls_first: sort_field.options.nulls_first != sort_field.options.descending,
+                };
+
+                // Store converters indexed by type_id to handle non-sequential type_ids
+                let mut converters = Vec::with_capacity(fields.len());
+                let mut null_rows = Vec::with_capacity(fields.len());
+                let mut type_ids_list = Vec::with_capacity(fields.len());
+
+                for (type_id, field) in fields.iter() {
+                    let sort_field =
+                        SortField::new_with_options(field.data_type().clone(), options);
+                    let converter = RowConverter::new(vec![sort_field])?;
+
+                    let null_array = new_null_array(field.data_type(), 1);
+                    let nulls = converter.convert_columns(&[null_array])?;
+                    let owned = OwnedRow {
+                        data: nulls.buffer.into(),
+                        config: nulls.config,
+                    };
+
+                    type_ids_list.push(type_id);
+                    converters.push(converter);
+                    null_rows.push(owned);
+                }
+
+                Ok(Self::Union(converters, null_rows, type_ids_list, *mode))
+            }

Then update the Codec enum to store the type_ids and adjust all access patterns accordingly throughout the encoder, decoder, etc.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In arrow-row/src/lib.rs around lines 530 to 558, the Union branch builds
converters and null_rows indexed by iteration order but later code treats that
index as the union type_id; this breaks when union field type_ids are
non-sequential. Replace the Vecs with a mapping keyed by the actual i8 type_id
(e.g., HashMap<i8, (RowConverter, OwnedRow)>) or store a Vec of (type_id,
RowConverter, OwnedRow) and change Codec::Union to hold the type_ids mapping;
then update the encoder/decoder code that currently uses enumerate() to look up
by the real type_id (iterate over the map entries or use the stored type_id when
calling union_array.child(type_id) and when accessing the converter/null_row).

Owner Author

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.

}
DataType::Union(fields, mode) => {
// similar to dictionaries and lists, we set descending to false and negate nulls_first
// since the encodedc ontents will be inverted if descending is set

⚠️ Potential issue | 🟡 Minor

Fix typo in comment.

Line 532 contains a typo: "encodedc ontents" should be "encoded contents".

-                // similar to dictionaries and lists, we set descending to false and negate nulls_first
-                // since the encodedc ontents will be inverted if descending is set
+                // similar to dictionaries and lists, we set descending to false and negate nulls_first
+                // since the encoded contents will be inverted if descending is set

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In arrow-row/src/lib.rs around line 532, fix the typo in the inline comment by
replacing "encodedc ontents" with "encoded contents" so the comment reads
correctly (e.g., "since the encoded contents will be inverted if descending is
set").

Owner Author

value:good-to-have; category:typo; feedback:The CodeRabbit AI reviewer is correct! There is a typo in the comment

Comment on lines +3845 to +3983
#[test]
fn test_sparse_union() {
    // create a sparse union with Int32 (type_id = 0) and Utf8 (type_id = 1)
    let int_array = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
    let str_array = StringArray::from(vec![None, Some("b"), None, Some("d"), None]);

    // [1, "b", 3, "d", 5]
    let type_ids = vec![0, 1, 0, 1, 0].into();

    let union_fields = [
        (0, Arc::new(Field::new("int", DataType::Int32, false))),
        (1, Arc::new(Field::new("str", DataType::Utf8, false))),
    ]
    .into_iter()
    .collect();

    let union_array = UnionArray::try_new(
        union_fields,
        type_ids,
        None,
        vec![Arc::new(int_array) as ArrayRef, Arc::new(str_array)],
    )
    .unwrap();

    let union_type = union_array.data_type().clone();
    let converter = RowConverter::new(vec![SortField::new(union_type)]).unwrap();

    let rows = converter
        .convert_columns(&[Arc::new(union_array.clone())])
        .unwrap();

    // round trip
    let back = converter.convert_rows(&rows).unwrap();
    let back_union = back[0].as_any().downcast_ref::<UnionArray>().unwrap();

    assert_eq!(union_array.len(), back_union.len());
    for i in 0..union_array.len() {
        assert_eq!(union_array.type_id(i), back_union.type_id(i));
    }
}

#[test]
fn test_dense_union() {
    // create a dense union with Int32 (type_id = 0) and Utf8 (type_id = 1)
    let int_array = Int32Array::from(vec![1, 3, 5]);
    let str_array = StringArray::from(vec!["a", "b"]);

    let type_ids = vec![0, 1, 0, 1, 0].into();

    // [1, "a", 3, "b", 5]
    let offsets = vec![0, 0, 1, 1, 2].into();

    let union_fields = [
        (0, Arc::new(Field::new("int", DataType::Int32, false))),
        (1, Arc::new(Field::new("str", DataType::Utf8, false))),
    ]
    .into_iter()
    .collect();

    let union_array = UnionArray::try_new(
        union_fields,
        type_ids,
        Some(offsets), // Dense mode
        vec![Arc::new(int_array) as ArrayRef, Arc::new(str_array)],
    )
    .unwrap();

    let union_type = union_array.data_type().clone();
    let converter = RowConverter::new(vec![SortField::new(union_type)]).unwrap();

    let rows = converter
        .convert_columns(&[Arc::new(union_array.clone())])
        .unwrap();

    // round trip
    let back = converter.convert_rows(&rows).unwrap();
    let back_union = back[0].as_any().downcast_ref::<UnionArray>().unwrap();

    assert_eq!(union_array.len(), back_union.len());
    for i in 0..union_array.len() {
        assert_eq!(union_array.type_id(i), back_union.type_id(i));
    }
}

#[test]
fn test_union_ordering() {
    let int_array = Int32Array::from(vec![100, 5, 20]);
    let str_array = StringArray::from(vec!["z", "a"]);

    // [100, "z", 5, "a", 20]
    let type_ids = vec![0, 1, 0, 1, 0].into();
    let offsets = vec![0, 0, 1, 1, 2].into();

    let union_fields = [
        (0, Arc::new(Field::new("int", DataType::Int32, false))),
        (1, Arc::new(Field::new("str", DataType::Utf8, false))),
    ]
    .into_iter()
    .collect();

    let union_array = UnionArray::try_new(
        union_fields,
        type_ids,
        Some(offsets),
        vec![Arc::new(int_array) as ArrayRef, Arc::new(str_array)],
    )
    .unwrap();

    let union_type = union_array.data_type().clone();
    let converter = RowConverter::new(vec![SortField::new(union_type)]).unwrap();

    let rows = converter.convert_columns(&[Arc::new(union_array)]).unwrap();

    /*
    expected ordering

    row 2: 5   - type_id 0
    row 4: 20  - type_id 0
    row 0: 100 - type_id 0
    row 3: "a" - type_id 1
    row 1: "z" - type_id 1
    */

    // 5 < "z"
    assert!(rows.row(2) < rows.row(1));

    // 100 < "a"
    assert!(rows.row(0) < rows.row(3));

    // among ints
    // 5 < 20
    assert!(rows.row(2) < rows.row(4));
    // 20 < 100
    assert!(rows.row(4) < rows.row(0));

    // among strings
    // "a" < "z"
    assert!(rows.row(3) < rows.row(1));
}

🛠️ Refactor suggestion | 🟠 Major

Add test coverage for non-sequential union type_ids.

The three union tests use sequential type_ids [0, 1], but Arrow Union types support arbitrary i8 type_ids such as [1, 3, 4] (as shown in the relevant code snippets from union_array.rs).

The current tests won't catch indexing bugs related to non-sequential type_ids. Consider adding a test case like:

#[test]
fn test_union_non_sequential_type_ids() {
    let int_array = Int32Array::from(vec![1, 2, 3]);
    let float_array = Float64Array::from(vec![4.0, 5.0]);
    let str_array = StringArray::from(vec!["a"]);
    
    // Use non-sequential type_ids: 1, 3, 4
    let type_ids = vec![1, 3, 1, 4, 3].into();
    let offsets = vec![0, 0, 1, 0, 1].into();
    
    let union_fields = [
        (1, Arc::new(Field::new("A", DataType::Int32, true))),
        (3, Arc::new(Field::new("B", DataType::Float64, true))),
        (4, Arc::new(Field::new("C", DataType::Utf8, true))),
    ]
    .into_iter()
    .collect();
    
    let union_array = UnionArray::try_new(
        union_fields,
        type_ids,
        Some(offsets),
        vec![Arc::new(int_array), Arc::new(float_array), Arc::new(str_array)],
    )
    .unwrap();
    
    let converter = RowConverter::new(vec![SortField::new(union_array.data_type().clone())]).unwrap();
    let rows = converter.convert_columns(&[Arc::new(union_array.clone())]).unwrap();
    let back = converter.convert_rows(&rows).unwrap();
    
    // Verify round-trip correctness
    assert_eq!(back[0].as_ref(), &union_array as &dyn Array);
}

@martin-augment
Owner Author

1. Typo in comment (arrow-row/src/lib.rs:532)

// since the encodedc ontents will be inverted if descending is set

Should be: "encoded contents" (missing space)

value:good-to-have; category:typo; feedback:The Claude AI reviewer is correct! There is a typo in the comment

@martin-augment
Owner Author

2. Critical Bug: Type ID mapping issue (arrow-row/src/lib.rs:634)

for (type_id, converter) in converters.iter().enumerate() {
    let child_array = union_array.child(type_id as i8);

Problem: This assumes type_ids are sequential starting from 0, but Arrow Union types can have arbitrary type_ids (e.g., type_ids could be [5, 10, 15]). The current code would:

  • Enumerate as 0, 1, 2
  • Try to access children with these indices
  • Miss the actual type_id values

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.

@martin-augment
Owner Author

3. Same type ID mapping issue in decode (arrow-row/src/lib.rs:1912-1931)

The rows_by_field vector is indexed by field_idx = type_id as usize, which fails for non-sequential type IDs.

4. Same issue in child_arrays iteration (arrow-row/src/lib.rs:1939)

for (field_idx, converter) in converters.iter().enumerate() {

This assumes field indices match array positions, but with arbitrary type_ids this breaks.

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. For sparse unions the index could be used to get the variant/field of the union since sparse unions contain all its variants. Dense unions contain only an offset (the variant index) and the value for that variant. https://github.com/apache/arrow-rs/pull/8839/files#diff-8235802124a5a88240195fc4b8792cf39bb9a8703eb69308d10358966eeff765R1607 has the logic responsible for this.
