8069: Support writing RunEndEncoded as Parquet#23
8069: Support writing RunEndEncoded as Parquet#23martin-augment wants to merge 1 commit intomainfrom
Conversation
WalkthroughThis pull request adds comprehensive support for Arrow's RunEndEncoded (REE) data type across the Parquet ecosystem. Changes span Parquet array readers and writers, schema conversion logic, and level information builders to handle REE arrays alongside their inner value types. The implementation routes REE content to appropriate handlers based on the inner value type (string/binary/numeric) and includes casting operations during read/write cycles. Minor documentation updates correct grammar and wording in comments. New tests validate REE string and integer scenarios with round-trip verification. Unsupported REE paths are explicitly marked as not-yet-implemented. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: Support writing RunEndEncoded as ParquetThank you for this PR! This adds support for writing Run-End Encoded (REE) arrays to Parquet format. Here's my comprehensive review: ✅ Strengths1. Good Test Coverage
2. Consistent Pattern Following
3. Documentation
|
| } | ||
| // TODO eventually add a dedicated [`ArrayReader`] for REE | ||
| ArrowType::RunEndEncoded(_, ref val_field) => { | ||
| let array = buffer.into_array(null_buffer, val_field.data_type().clone()); |
There was a problem hiding this comment.
For REE with decimal values, building the intermediate array with val_field.data_type() will not correctly decode BYTE_ARRAY-backed decimals (e.g., Decimal128/256). Consider mirroring the Decimal128/256 handling above (decode as Binary and convert with sign extension) before casting to the REE data type.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct that the current implementation will fail when decoding Decimal128/256. The Pull Request author acknowledges this with the TODO saying that it needs a specialized ArrayReader for RunEndEncoded arrays.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (5)
parquet/src/arrow/schema/primitive.rs (1)
106-116: Preserve-REE logic looks correct; small doc nitThe hint application to the inner value is consistent with Dictionary handling. Tiny wording: “Potentially preserve run end encoded encoding” reads redundant — consider “Potentially preserve run-end encoding”.
parquet/src/arrow/arrow_writer/levels.rs (1)
556-595: Make types_compatible REE‑aware (avoid false incompatibility on field names)
types_compatibleignores Dictionary wrappers but not REE. If aField’s REE child field names differ from the array’s (e.g., schema vs. builder defaults), this can error even when inner types match. Mirror the Dictionary logic for REE:
- Treat
RunEndEncoded(_, v)as compatible with the same REE shape ignoring child field names.- Also accept REE vs. plain inner value when appropriate.
Suggested patch:
fn types_compatible(a: &DataType, b: &DataType) -> bool { - // if the Arrow data types are the same, the types are clearly compatible + // if the Arrow data types are the same, the types are clearly compatible if a == b { return true; } - // get the values out of the dictionaries - let (a, b) = match (a, b) { + // unwrap Dictionary and REE wrappers for comparison + let (a, b) = match (a, b) { (DataType::Dictionary(_, va), DataType::Dictionary(_, vb)) => { (va.as_ref(), vb.as_ref()) } (DataType::Dictionary(_, v), b) => (v.as_ref(), b), (a, DataType::Dictionary(_, v)) => (a, v.as_ref()), + (DataType::RunEndEncoded(_, va), DataType::RunEndEncoded(_, vb)) => { + (va.data_type(), vb.data_type()) + } + (DataType::RunEndEncoded(_, v), b) => (v.data_type(), b), + (a, DataType::RunEndEncoded(_, v)) => (a, v.data_type()), _ => (a, b), }; // now that we've got the values from one/both dictionaries, if the values // have the same Arrow data type, they're compatible if a == b { return true; }This reduces spurious incompatibility without changing semantics.
parquet/src/arrow/schema/mod.rs (1)
2258-2274: Test is good; minor readability nitConsider aliasing
use crate::basic::Type as ParquetPhysicalType;to avoid confusion withschema::types::Typeelsewhere in this module.parquet/src/arrow/arrow_writer/mod.rs (2)
4543-4690: Add test coverage for Decimal REE types.The new tests cover string and integer REE arrays, which is good. However, the implementation includes handling for Decimal32/64/128/256 REE types (lines 1229-1263) that aren't covered by tests. Consider adding tests for these cases to ensure the decimal conversion logic works correctly.
Example test structure:
#[test] fn arrow_writer_run_end_encoded_decimal128() { let mut builder = PrimitiveRunBuilder::<Int32Type, Decimal128Type>::new(); // ... populate with decimal values let run_array: RunArray<Int32Type> = builder.finish(); // ... write and verify round-trip }
1229-1263: Refactor opportunity: extract helper for repeated Decimal conversion logic.The code in both Dictionary (lines 1195–1227) and RunEndEncoded (lines 1229–1263) match arms repeats the same pattern for Decimal32/64/128/256: cast, unary convert to Int32, write. This could be extracted into a shared helper function.
Note: The casting differs correctly between cases—Dictionary uses
value_typedirectly while REE callsvalue_type.data_type()(sincevalue_typeis aFieldRef)—so extract accordingly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
arrow-array/src/array/run_array.rs(1 hunks)arrow-schema/src/datatype.rs(1 hunks)parquet/src/arrow/array_reader/byte_array.rs(2 hunks)parquet/src/arrow/arrow_writer/byte_array.rs(3 hunks)parquet/src/arrow/arrow_writer/levels.rs(1 hunks)parquet/src/arrow/arrow_writer/mod.rs(5 hunks)parquet/src/arrow/schema/mod.rs(4 hunks)parquet/src/arrow/schema/primitive.rs(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
parquet/src/arrow/schema/mod.rs (2)
arrow-schema/src/field.rs (2)
new(192-202)name(385-387)parquet/src/schema/types.rs (7)
get_physical_type(119-128)get_basic_info(95-100)repetition(697-700)name(103-105)name(685-687)name(897-899)name(1114-1116)
parquet/src/arrow/array_reader/byte_array.rs (3)
parquet/src/arrow/array_reader/byte_array_dictionary.rs (2)
new(135-147)new(235-252)parquet/src/arrow/array_reader/fixed_len_byte_array.rs (2)
new(142-156)new(334-340)arrow-cast/src/cast/mod.rs (1)
cast(345-347)
parquet/src/arrow/arrow_writer/mod.rs (3)
arrow-cast/src/cast/mod.rs (16)
cast(345-347)b(4154-4154)b(4166-4166)b(4196-4196)b(4207-4207)b(4224-4224)b(4249-4249)b(4277-4277)b(4343-4343)b(4355-4355)b(4367-4367)b(4452-4452)b(4462-4462)b(4472-4472)b(4482-4482)b(4492-4492)parquet/src/arrow/arrow_writer/levels.rs (4)
array(662-664)levels(1968-1971)new(642-660)try_new(140-231)arrow-array/src/array/run_array.rs (13)
run_array(560-562)run_array(973-973)run_array(999-999)run_array(1033-1033)data_type(319-321)data_type(578-580)clone(70-76)clone(535-537)is_nullable(381-383)is_nullable(610-612)try_new(93-118)len(327-329)len(586-588)
parquet/src/arrow/arrow_writer/byte_array.rs (1)
arrow-array/src/array/run_array.rs (6)
as_any(307-309)as_any(566-568)downcast(156-162)data_type(319-321)data_type(578-580)value(634-642)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (8)
arrow-schema/src/datatype.rs (1)
356-356: Doc tweak LGTMGrammar fix improves clarity.
arrow-array/src/array/run_array.rs (1)
35-41: Doc wording LGTMThe phrasing/readability improvements are fine.
parquet/src/arrow/arrow_writer/levels.rs (1)
225-228: REE leaf treated as primitive: OKThis integrates REE into level building consistently with Dictionary semantics.
parquet/src/arrow/schema/mod.rs (1)
742-752: REE → underlying type mapping is consistentMapping REE to its value type at schema level mirrors dictionary handling and is correct.
parquet/src/arrow/array_reader/byte_array.rs (2)
172-176: LGTM - REE consume_batch implementation.The approach of converting to the inner value type and then casting to the REE type is clean and consistent with similar conversions elsewhere in the codebase.
70-91: Verify REE byte array round-trip behavior with integration tests.The code implements REE support by routing to inner value type readers, then casting back using
arrow_cast::cast()at line 175. While this mirrors the Decimal pattern, the implementation lacks dedicated integration tests. The TODO comment at line 173 acknowledges REE support is incomplete. No existing tests verify that REE arrays are read back correctly with their semantics preserved.Add integration tests in
parquet/tests/arrow_reader/that verify byte arrays wrapped inRunEndEncodedare correctly read and maintain REE structure through the round-trip.parquet/src/arrow/arrow_writer/mod.rs (2)
1125-1135: LGTM - REE column writer routing.The implementation correctly routes REE types to appropriate writers based on their inner value types, mirroring the established pattern for Dictionary types.
1426-1431: LGTM - Clear NYI error for FixedLenByteArray REE.The explicit NYI error clearly communicates that REE support is not yet available for fixed-length byte arrays.
| DataType::RunEndEncoded(run_end, value) => match value.data_type() { | ||
| DataType::Utf8 => downcast_ree_op!(run_end, StringArray, $array, $op$(, $arg)*), | ||
| DataType::LargeUtf8 => { | ||
| downcast_ree_op!(run_end, LargeStringArray, $array, $op$(, $arg)*) | ||
| } | ||
| DataType::Binary => downcast_ree_op!(run_end, BinaryArray, $array, $op$(, $arg)*), | ||
| DataType::LargeBinary => { | ||
| downcast_ree_op!(run_end, LargeBinaryArray, $array, $op$(, $arg)*) | ||
| } | ||
| DataType::FixedSizeBinary(_) => { | ||
| downcast_ree_op!(run_end, FixedSizeBinaryArray, $array, $op$(, $arg)*) | ||
| } | ||
| d => unreachable!("cannot downcast {} run end encoded value to byte array", d), | ||
| }, | ||
| d => unreachable!("cannot downcast {} to byte array", d), |
There was a problem hiding this comment.
Add REE support for Utf8View/BinaryView to avoid panics
Currently REE dispatch doesn’t handle Utf8View or BinaryView; matching them will hit unreachable!() and panic at runtime. Add these two cases (mirroring top-level and Dictionary handling).
Apply this diff:
DataType::RunEndEncoded(run_end, value) => match value.data_type() {
DataType::Utf8 => downcast_ree_op!(run_end, StringArray, $array, $op$(, $arg)*),
DataType::LargeUtf8 => {
downcast_ree_op!(run_end, LargeStringArray, $array, $op$(, $arg)*)
}
+ DataType::Utf8View => {
+ downcast_ree_op!(run_end, StringViewArray, $array, $op$(, $arg)*)
+ }
DataType::Binary => downcast_ree_op!(run_end, BinaryArray, $array, $op$(, $arg)*),
DataType::LargeBinary => {
downcast_ree_op!(run_end, LargeBinaryArray, $array, $op$(, $arg)*)
}
+ DataType::BinaryView => {
+ downcast_ree_op!(run_end, BinaryViewArray, $array, $op$(, $arg)*)
+ }
DataType::FixedSizeBinary(_) => {
downcast_ree_op!(run_end, FixedSizeBinaryArray, $array, $op$(, $arg)*)
}
d => unreachable!("cannot downcast {} run end encoded value to byte array", d),
},Optional: replace unreachable!() branches with returning a ParquetError to avoid process aborts during writes.
🤖 Prompt for AI Agents
parquet/src/arrow/arrow_writer/byte_array.rs around lines 117 to 131: the
RunEndEncoded inner match currently panics on Utf8View and BinaryView because
those arms fall through to unreachable!(); add explicit match arms for
DataType::Utf8View and DataType::BinaryView (and their Large* /
FixedSizeBinary_view equivalents if present) that mirror the handling used for
the top-level and Dictionary cases (i.e., call the same downcast_ree_op! macro
with the corresponding view array types), and remove or replace the
unreachable!() cases so they return a ParquetError (or propagate a Result::Err)
instead of aborting the process.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct adding support for Utf8View and BinaryView is easy and most probably omitted by mistake. They should be added to prevent runtime failures.
| ArrowDataType::RunEndEncoded(_run_ends, _values) => { | ||
| Err(ParquetError::NYI( | ||
| "Int64ColumnWriter: Attempting to write an Arrow REE type that is not yet implemented" | ||
| .to_string(), | ||
| )) | ||
| } |
There was a problem hiding this comment.
Document why Int64 REE is not yet implemented.
The Int32ColumnWriter has REE support for Decimal types (lines 1229-1263), but Int64ColumnWriter explicitly returns NYI. This inconsistency might confuse users. Consider adding a TODO comment explaining the roadmap or technical reasons for this difference.
🤖 Prompt for AI Agents
In parquet/src/arrow/arrow_writer/mod.rs around lines 1346 to 1351, the
Int64ColumnWriter returns an NYI for RunEndEncoded types while Int32ColumnWriter
supports REE for Decimal types; add a concise TODO/comment above this match arm
explaining why Int64 REE is not implemented (e.g., missing support for 128-bit
decimal scaling, overflow handling, or needed changes to parquet encoding API),
indicate planned next steps or link to a tracking issue/PR number, and mark
expected owner/priority so readers understand the roadmap and why the asymmetry
exists.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that currently neither the Pull Request description, nor the source code mentions that Int64ColumnWriter does not support RunEndEncoded arrays. Adding such kind of documentation would prevent surprises in the user which would happen only at runtime.
value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct the comment sounds weird and it could be improved. |
value:useful; category:bug; feedback:The Claude AI reviewer is correct that currently neither the Pull Request description, nor the source code mentions that Int64ColumnWriter does not support RunEndEncoded arrays. Adding such kind of documentation would prevent surprises in the user which would happen only at runtime. |
8069: To review by AI