8728: fix: ArrayIter does not report size hint correctly after advancing from the iterator back #10

martin-augment wants to merge 19 commits into main

Conversation
…che#8706) # Which issue does this PR close? N/A # Rationale for this change It is not obvious that the thrift macros produce public enums only (e.g. see apache#8680 (comment)). This should be made clear in the documentation. # What changes are included in this PR? Add said clarification. # Are these changes tested? Documentation only, so no tests required. # Are there any user-facing changes? No, only changes to private documentation
# Which issue does this PR close? - Closes apache#8691 # Rationale for this change The `README.md` file in `arrow-avro` instructs users to install version 56. This is invalid and should be changed to version 57. # What changes are included in this PR? Updated the `README.md` file to reference version 57. # Are these changes tested? N/A since this is a small `README.md` file change. # Are there any user-facing changes? The `README.md` file in `arrow-avro` now instructs users to install version 57. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close? chore: add test case of `RowSelection::trim`
…DING_SLOTS (apache#8663) # Which issue does this PR close? - Closes [apache#8662] # Rationale for this change Related to apache#8607. We need to know how many encodings are supported in order to create the decoder slots. # What changes are included in this PR? Update the `thrift_enum` macro to expose the field count of the `Encoding` enum; the value is passed to `EncodingMask` and `ENCODING_SLOTS`. # Are these changes tested? Originally I thought a UT could prevent failures after new encodings are introduced, but since the count is already derived automatically, the UT is not required; the existing tests already cover the code. # Are there any user-facing changes? No
- A follow up from apache#8625 # Rationale for this change While working on a separate task, I noticed `create_test_variant_array` was redundant. Since `VariantArray` can already be constructed directly from an iterator of Variants, this PR removes the now-unnecessary test helper.
# Which issue does this PR close? - Closes apache#8685. # What changes are included in this PR? In the implementation of `RowConverter::from_binary`, the `BinaryArray` is broken into parts and an attempt is made to convert the data buffer into `Vec` at no copying cost with `Buffer::into_vec`. Only if this fails, the data is copied out for a newly allocated `Vec`. # Are these changes tested? Passes existing tests using `RowConverter::from_binary`, which all convert a non-shared buffer taking advantage of the optimization. Another test is added to cover the copying path. # Are there any user-facing changes? No
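The "try zero-copy, otherwise allocate" pattern described above can be sketched with `Rc` standing in for a possibly shared buffer; `into_owned` is an illustrative helper, not the arrow-rs `Buffer::into_vec` API itself:

```rust
use std::rc::Rc;

// Take ownership of a possibly shared buffer: zero-copy when we are the sole
// owner, copy out otherwise. Mirrors the spirit of the from_binary change.
fn into_owned(buf: Rc<Vec<u8>>) -> Vec<u8> {
    match Rc::try_unwrap(buf) {
        // Sole owner: reuse the allocation as-is, no copy
        Ok(v) => v,
        // Still shared elsewhere: fall back to copying the bytes out
        Err(shared) => shared.as_ref().clone(),
    }
}

fn main() {
    // Unique buffer: zero-copy path
    let unique = Rc::new(vec![1u8, 2, 3]);
    assert_eq!(into_owned(unique), vec![1, 2, 3]);

    // Shared buffer: copying path
    let shared = Rc::new(vec![4u8, 5]);
    let _hold = Rc::clone(&shared);
    assert_eq!(into_owned(shared), vec![4, 5]);
}
```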
# Which issue does this PR close? - Closes apache#8692. # Rationale for this change Explained in issue. # What changes are included in this PR? - Adds `FilterPredicate::filter_record_batch` - Adapts the free function `filter_record_batch` to use the new function - Uses `new_unchecked` to create the filtered result. The rationale for this is identical to apache#8583 # Are these changes tested? Covered by existing tests for `filter_record_batch` # Are there any user-facing changes? No --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
# Which issue does this PR close? - Part of apache#7806 # Rationale for this change The `DeltaBitPackDecoder` can panic if it encounters a bit width in the encoded data that is larger than the bit width of the data type being decoded. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…8680) # Which issue does this PR close? - Part of apache#5853. # Rationale for this change While converting to the new Thrift model, the `ConvertedType` enum was done manually due to the `NONE` variant, which used the discriminant of `0`. This PR changes that to `-1` which allows the `thrift_enum` macro to be used instead. This improves code maintainability. # What changes are included in this PR? See above. # Are these changes tested? Covered by existing tests # Are there any user-facing changes? No, this only changes the discriminant value for a unit variant enum.
# Which issue does this PR close? - Contributes towards the RunEndEncoded (REE) epic apache#3520, but there is no specific issue for casting. - Replaces PRs apache#7713 and apache#8384. # Rationale for this change This PR implements casting support for RunEndEncoded arrays in Apache Arrow. # What changes are included in this PR? - `run_end_encoded_cast` in `arrow-cast/src/cast/run_array.rs` - `cast_to_run_end_encoded` in `arrow-cast/src/cast/run_array.rs` - Tests in `arrow-cast/src/cast/mod.rs` # Are these changes tested? Yes! # Are there any user-facing changes? No breaking changes, just new functionality --------- Co-authored-by: Richard Baah <richard.baah@datadoghq.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
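The core of casting a run-end encoded array to a flat array is expanding each run to its length. A minimal sketch (`expand_ree` is an illustrative helper, not the arrow-cast implementation, and it ignores nulls and index-type generics):

```rust
// Expand a run-end encoded column into its flat values.
// run_ends[i] is the exclusive end index of run i; values[i] is its value.
fn expand_ree<T: Copy>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut prev = 0;
    for (&end, &v) in run_ends.iter().zip(values) {
        for _ in prev..end {
            out.push(v);
        }
        prev = end;
    }
    out
}

fn main() {
    // runs: [a, a, b, c, c, c]
    let out = expand_ree(&[2, 3, 6], &['a', 'b', 'c']);
    assert_eq!(out, vec!['a', 'a', 'b', 'c', 'c', 'c']);
}
```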
…erator (apache#8696) # Which issue does this PR close? N/A # Rationale for this change Overriding these functions improves performance over the fallback implementations. # What changes are included in this PR? Override implementation of: - `count`, which is not optimized away even when `ExactSizeIterator` is implemented - `nth`, to avoid calling `next` `n + 1` times (also used when doing `.skip`) - `nth_back` - `last` - `max` # Are these changes tested? Yes, I've added a lot of tests # Are there any user-facing changes? Nope
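Why a dedicated `nth` override pays off can be shown with a toy iterator that counts its `next` calls; the type and field names below are illustrative, not the actual `ArrayIter` code:

```rust
// Hypothetical iterator over a slice that records how many times `next` runs,
// illustrating that an overridden `nth` jumps directly instead of stepping.
struct CountingIter<'a> {
    data: &'a [i32],
    pos: usize,
    next_calls: usize,
}

impl<'a> Iterator for CountingIter<'a> {
    type Item = i32;

    fn next(&mut self) -> Option<i32> {
        self.next_calls += 1;
        let v = self.data.get(self.pos).copied();
        self.pos += 1;
        v
    }

    // Jump ahead by n, then take one element — instead of n + 1 `next` calls.
    fn nth(&mut self, n: usize) -> Option<i32> {
        self.pos += n;
        self.next()
    }
}

fn main() {
    let data = [1, 2, 3, 4, 5];
    let mut it = CountingIter { data: &data, pos: 0, next_calls: 0 };
    assert_eq!(it.nth(3), Some(4)); // lands on index 3 with a single `next`
    assert_eq!(it.next_calls, 1);   // the default impl would have made 4 calls
}
```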
# Rationale for this change We've caused some unexpected panics from our internal testing. We've put in error checks for all of these so that they don't affect other users. # What changes are included in this PR? Various error checks to ensure panics don't occur. # Are these changes tested? Existing tests should cover these changes and should continue to pass. # Are there any user-facing changes? None. --------- Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close? Part of apache#5375 Vortex was encountering some issues after we switched our preferred List type to `ListView`; the first thing we noticed was that `arrow_select::filter_array` would fail on ListView (and LargeListView, though we don't use that). This PR addresses some missing select kernel implementations for ListView and LargeListView. It also fixes an existing bug in the ArrayData validation for ListView arrays that would trigger an out-of-bounds index panic. # Are these changes tested? - [x] filter_array - [x] concat - [x] take # Are there any user-facing changes? ListView/LargeListView can now be used with the `take`, `concat` and `filter_array` kernels. You can now use `PartialEq` to compare ListView arrays. --------- Signed-off-by: Andrew Duffy <andrew@a10y.dev>
…ulation (apache#8671) # Which issue does this PR close? - Closes apache#8472. # Rationale for this change Makes the metadata heap size calculation more accurate when reading encrypted Parquet files, which helps to better manage caches of Parquet metadata. # What changes are included in this PR? * Accounts for heap allocations related to the `FileDecryptor` in `ParquetMetaData` * Fixes the `HeapSize` implementation for `Arc<T>` so the size of `T` is included, as well as the reference counts that are stored on the heap * Fixes the heap size of type pointers within `ColumnDescriptor` being included twice ## Not included * Accounting for any heap allocations in a user-provided `KeyRetriever` # Are these changes tested? Yes, there's a new unit test added that computes the heap size with a decryptor. I also did a manual test that created a test Parquet file with 100 columns using per-column encryption keys, and loaded 10,000 copies of the `ParquetMetaData` into a vector. `heaptrack` reported 1.136 GB memory heap allocated in this test program. Prior to this change, the sum of the metadata was reported as 879.2 MB, and afterwards it was 1.136 GB. # Are there any user-facing changes? No This was co-authored by @etseidl --------- Co-authored-by: Ed Seidl <etseidl@live.com>
…n Parquet variant casts (apache#8689) # Which issue does this PR close? - Closes apache#8670. # Rationale for this change We currently have two separate code paths that both handle decimal casting between different (precision, scale) pairs. Without unifying the logic, a fix in one place often needs to be duplicated in the other (e.g., apache#8579 fixed `arrow-cast` and apache#8552 fixed `parquet-variant-compute`), which can easily lead to divergence when contributors lack full context. This PR consolidates the decimal rescale logic for both `arrow-cast` and `parquet-variant-compute`. # What changes are included in this PR? 1. Extract the shared array-unary logic from `convert_to_smaller_scale_decimal` and `convert_to_bigger_or_equal_scale_decimal` into `apply_decimal_cast` 2. Move the rescale-closure creation into `make_upscaler` and `make_downscaler` so that they can be used in `parquet-variant-compute` 3. Rework `rescale_decimal` in `parquet-variant-compute` to use the new `make_upscaler` and `make_downscaler` utilities. One challenge is incorporating the large-scale reduction path (aka the `delta_scale` cannot fit into `I::MAX_PRECISION`) into `make_downscaler` without hurting performance. Returning 0 directly is usually cheaper than applying a unary operation to return zero. Therefore, `make_downscaler` may return None, and it is the caller’s responsibility to handle this case appropriately based on the documented behavior. # Are these changes tested? Covered by existing tests # Are there any user-facing changes? No
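The upscaler/downscaler split can be sketched with plain `i128` arithmetic. The function names below are hypothetical, and the real helpers also handle precision checks and overflow reporting; rounding here is half-away-from-zero as an assumption:

```rust
// Rescale a decimal's integer representation to a larger scale:
// multiply by 10^delta_scale, failing on overflow.
fn upscale(v: i128, delta_scale: u32) -> Option<i128> {
    v.checked_mul(10i128.checked_pow(delta_scale)?)
}

// Rescale to a smaller scale: divide by 10^delta_scale,
// rounding half away from zero.
fn downscale(v: i128, delta_scale: u32) -> Option<i128> {
    let div = 10i128.checked_pow(delta_scale)?;
    let half = div / 2;
    Some(if v >= 0 { (v + half) / div } else { (v - half) / div })
}

fn main() {
    // 1.23 at scale 2 -> scale 4
    assert_eq!(upscale(123, 2), Some(12300));
    // 1.2345 at scale 4 -> scale 2, rounded
    assert_eq!(downscale(12345, 2), Some(123));
    assert_eq!(downscale(12355, 2), Some(124));
}
```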
Closes apache#8709. Adds bench `cast_ree` which can be run with `cargo bench --bench cast_ree`. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he underlying bit value (apache#8711) # Which issue does this PR close? - closes apache#8721 # Rationale for this change mask `nulls` should be treated as `false` (even if the underlying values are not 0) as described in the docs for zip # What changes are included in this PR? used `prep_null_mask_filter` before iterating over the mask, added tests for both scalar and non scalar (to prepare for apache#8653) # Are these changes tested? Yes # Are there any user-facing changes? Kinda
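The rule the fix enforces can be sketched with `Option<bool>` standing in for a nullable boolean mask: a NULL slot selects the fallback side, i.e. it behaves as `false` regardless of the underlying bit. This is illustrative only, not the arrow-select kernel:

```rust
// Zip two value slices by a nullable mask, treating NULL as false.
fn zip<T: Copy>(mask: &[Option<bool>], truthy: &[T], falsy: &[T]) -> Vec<T> {
    mask.iter()
        .enumerate()
        // unwrap_or(false): a NULL mask slot picks the falsy side
        .map(|(i, m)| if m.unwrap_or(false) { truthy[i] } else { falsy[i] })
        .collect()
}

fn main() {
    let mask = [Some(true), None, Some(false)];
    let out = zip(&mask, &[1, 2, 3], &[10, 20, 30]);
    assert_eq!(out, vec![1, 20, 30]); // NULL behaves like false
}
```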
…from the iterator back. This also adds a LOT of tests, extracted from apache#8697 (which is how I found that bug).
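A minimal sketch of the two-cursor bookkeeping involved (illustrative, not the actual `ArrayIter` code): `size_hint` must be derived from both cursors so that consumption from the back is reflected too.

```rust
// A double-ended iterator over indices 0..current_end, consuming from both ends.
struct TwoCursor {
    current: usize,
    current_end: usize,
}

impl Iterator for TwoCursor {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.current < self.current_end {
            let v = self.current;
            self.current += 1;
            Some(v)
        } else {
            None
        }
    }

    // Derived from BOTH cursors, so next_back progress is accounted for.
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.current_end - self.current;
        (remaining, Some(remaining))
    }
}

impl DoubleEndedIterator for TwoCursor {
    fn next_back(&mut self) -> Option<usize> {
        if self.current < self.current_end {
            self.current_end -= 1;
            Some(self.current_end)
        } else {
            None
        }
    }
}

fn main() {
    let mut it = TwoCursor { current: 0, current_end: 4 };
    assert_eq!(it.size_hint(), (4, Some(4)));
    it.next_back();
    assert_eq!(it.size_hint(), (3, Some(3))); // back progress is reflected
    it.next();
    assert_eq!(it.size_hint(), (2, Some(2)));
}
```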
Walkthrough

This pull request introduces ListView/ListViewArray support across data operations, adds RunEndEncoded casting capabilities, refactors decimal casting with builder patterns, enhances iterator trait implementations, improves parquet error handling and validation, adds HeapSize implementations for encryption types, and updates macro-generated thrift enum constants.
Pull Request Review

Summary

This PR fixes a bug where `ArrayIter` reported an incorrect size hint after advancing from the back of the iterator.

Code Quality ✅

Strengths:
Minor observations:
Test Coverage ⭐ EXCELLENT

This PR demonstrates exceptional test coverage with 930 new lines of tests.

Comprehensive test scenarios:
Testing approach highlights:
Potential Issues 🔍

1. Test complexity trade-off:
2. Missing documentation:
3. Potential test data generation issue:

```rust
let no_nulls_random_values = (0..10)
    .map(|_| rng.random::<i32>())
    .map(Some)
    .collect::<Vec<Option<i32>>>();
```

While using a seeded RNG (line 274) ensures determinism, consider whether fixed test cases would be clearer for debugging failures.

Performance Considerations ✅

Positive:
No concerns:
Security Concerns ✅

No security issues identified:
Recommendations

High Priority

None - the core fix is solid

Medium Priority
Low Priority
Verdict: ✅ APPROVE with minor suggestions

This is a high-quality PR that:
The comprehensive testing significantly strengthens confidence in the fix. The suggestions above are minor improvements that would not block merging.

Review generated by Claude Code
Review
Residual risks / follow-up
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
parquet/src/encodings/decoding.rs (1)
827-837: Error message uses the wrong expected count variable.

Use `mini_block_to_skip` (expected) rather than `mini_block_batch_size` to avoid misleading diagnostics.
```diff
- return Err(general_err!(
-     "Expected to skip {} values from mini block got {}.",
-     mini_block_batch_size,
-     skip_count
- ));
+ return Err(general_err!(
+     "Expected to skip {} values from mini block got {}.",
+     mini_block_to_skip,
+     skip_count
+ ));
```

arrow-array/src/iterator.rs (2)
90-95: Safety comment is incorrect (initial cursor invariant)

Comment says “initialized with current = array.len()”, but new() sets current = 0 and current_end = len. Please fix to avoid misleading future maintainers.
```diff
- // this is safe on the premise that this struct is initialized with
- // current = array.len()
- // and that current_end is ever only decremented
+ // This is safe because the iterator is initialized with:
+ // current = 0 and current_end = array.len()
+ // We only increment `current` and only decrement `current_end`,
+ // and we check bounds (current < current_end) before access.
```
116-121: Repeat the safety comment fix for next_back

Same incorrect invariant appears here; please update consistently.
```diff
- // this is safe on the premise that this struct is initialized with
- // current = array.len()
- // and that current_end is ever only decremented
+ // This is safe because the iterator is initialized with:
+ // current = 0 and current_end = array.len()
+ // We only increment `current` and only decrement `current_end`,
+ // and we check bounds (current < current_end) before access.
```
🧹 Nitpick comments (32)
arrow-data/src/data.rs (1)
983-989: Defensive length check, but consider error type consistency.

Since both `typed_buffer` calls request `self.len` elements and return slices of exactly that length, this check should always pass. However, it serves as useful defensive programming.

Consider using `InvalidArgumentError` instead of `ComputeError` for consistency with other data validation errors in this file (see lines 984, 993, 1009).

Apply this diff if you prefer consistency:
```diff
- return Err(ArrowError::ComputeError(format!(
+ return Err(ArrowError::InvalidArgumentError(format!(
      "ListView offsets len {} does not match sizes len {}",
```

parquet-variant-compute/src/shred_variant.rs (2)
440-443: Consider consistent wrapping for clarity.

While functionally correct, this test uses `Variant::from(...)` directly, whereas line 338 uses `Some(Variant::from(...))`. Both patterns work with `from_iter`, but consistency would improve readability.

Apply this diff for consistency with line 338:
```diff
- let input = VariantArray::from_iter(vec![
-     Variant::from(42i32),
-     Variant::from(3.15f64),
-     Variant::from("not_a_number"),
- ]);
+ let input = VariantArray::from_iter(vec![
+     Some(Variant::from(42i32)),
+     Some(Variant::from(3.15f64)),
+     Some(Variant::from("not_a_number")),
+ ]);
```
874-874: Consider consistent wrapping for clarity.

Same observation as lines 440-443: this uses `Variant::from(...)` directly while line 338 uses `Some(Variant::from(...))`. Consider using consistent wrapping throughout the test file.

Apply this diff for consistency:
```diff
- let input = VariantArray::from_iter(vec![Variant::from(42i64), Variant::from("hello")]);
+ let input = VariantArray::from_iter(vec![Some(Variant::from(42i64)), Some(Variant::from("hello"))]);
```

parquet/src/file/reader.rs (2)
292-321: Consider adding edge-case tests.

The current tests verify basic out-of-bounds scenarios effectively. However, additional test cases would improve coverage:

- Boundary case: `start == data.len()` (should succeed for `get_read`)
- Valid range: `get_bytes(0, 4)` or `get_bytes(2, 2)` where `start + length == data.len()`
- Overflow scenario: very large `length` values
- Zero-length reads: `get_bytes(0, 0)` or `get_bytes(4, 0)`

Example additional tests:

```rust
#[test]
fn test_bytes_chunk_reader_edge_cases() {
    let data = Bytes::from(vec![0, 1, 2, 3]);

    // Reading from the end position should succeed
    let result = data.get_read(4);
    assert!(result.is_ok());

    // Valid ranges should succeed
    assert!(data.get_bytes(0, 4).is_ok());
    assert!(data.get_bytes(2, 2).is_ok());

    // Zero-length reads
    assert!(data.get_bytes(0, 0).is_ok());
    assert!(data.get_bytes(4, 0).is_ok());
}
```
122-148: Add bounds checking to test utility `get_bytes` methods for consistency.

The production `Bytes` implementation now includes defensive bounds checking (reader.rs 138-148), but the test utilities still lack validation:

- `parquet/tests/arrow_reader/io/sync_reader.rs:413-418` — slices directly without validating `start + length`
- `parquet/tests/arrow_reader/io/async_reader.rs:373-383` — slices directly without validating range bounds

Both will panic on invalid ranges instead of returning errors like the production code. Adding the same defensive pattern would improve consistency and provide clearer error messages in tests.
parquet/src/schema/types.rs (1)
1353-1370: Non-root group must have repetition + propagate build errors
- Enforcing repetition for non-root types matches the parquet spec; good guard.
- Switching to builder.build()? avoids panics and surfaces conversion errors.
Recommend also changing the zero-children group branch to use ? for consistency (currently unwrap):
- Ok((index + 1, Arc::new(builder.build().unwrap()))) + Ok((index + 1, Arc::new(builder.build()?)))Consider adding a small test that a non-root group lacking repetition returns the expected error.
parquet/src/encodings/rle.rs (1)
500-520: Avoid assert! in library decode path; propagate an error instead.

The new assert panics in release builds. Prefer returning a descriptive error from reload and plumbing it to callers (get/get_batch/skip) so bad inputs don’t abort the process.
Example (sketch):
```diff
- fn reload(&mut self) -> bool {
+ fn reload(&mut self) -> Result<bool> {
      let bit_reader = self.bit_reader.as_mut().expect("bit_reader should be set");
      if let Some(indicator_value) = bit_reader.get_vlq_int() {
          ...
-         self.current_value = bit_reader.get_aligned::<u64>(value_width);
-         assert!(self.current_value.is_some(), "parquet_data_error: not enough data for RLE decoding");
+         self.current_value = bit_reader.get_aligned::<u64>(value_width);
+         if self.current_value.is_none() {
+             return Err(eof_err!("Not enough data for RLE decoding"));
+         }
          ...
-         true
+         Ok(true)
      } else {
-         false
+         Ok(false)
      }
  }
```
And update call sites: `if !self.reload()? { ... }`, and in `set_data`: `let _ = self.reload()?;`.

parquet/THRIFT.md (1)
56-71: Docs clarity: note about pub visibility is helpful.

Consider linking directly to the macro definitions (parquet/src/parquet_macros.rs) for quick reference.
Also applies to: 88-89
parquet/tests/arrow_reader/bad_data.rs (1)
88-93: Make test resilient to error wording.

Exact string equality is brittle. Prefer contains/starts_with or matching the error type.
```diff
- assert_eq!(
-     err.to_string(),
-     "External: Parquet argument error: Parquet error: Invalid or corrupted RLE bit width 254. Max allowed is 32"
- );
+ let s = err.to_string();
+ assert!(
+     s.contains("Invalid or corrupted RLE bit width 254"),
+     "unexpected error: {s}"
+ );
```

parquet/src/encodings/decoding.rs (2)
645-657: Bit-width validator is appropriate; minor polish.

Great addition. Nit: remove the trailing space in the error message for consistency.
```diff
- return Err(general_err!(
-     "Invalid delta bit width {} which is larger than expected {} ",
-     bit_width,
-     std::mem::size_of::<T::T>() * 8
- ));
+ return Err(general_err!(
+     "Invalid delta bit width {} which is larger than expected {}",
+     bit_width,
+     std::mem::size_of::<T::T>() * 8
+ ));
```
751-753: Validate miniblock bit-widths at use sites — good.

Calling check_bit_width in both get and skip catches invalid pages early. Consider also validating all widths once in next_block() to fail fast.
Also applies to: 822-824
parquet/src/basic.rs (1)
65-69: ConvertedType via thrift_enum looks good; doc nit.

The enum definition with explicit discriminants (including NONE = -1) matches usage across conversions and tests. Nit: “This struct was renamed…” → “This enum was renamed…”.
```diff
-/// This struct was renamed from `LogicalType` in version 4.0.0.
+/// This enum was renamed from `LogicalType` in version 4.0.0.
```

Also applies to: 75-169
arrow-array/src/iterator.rs (3)
1186-1190: Seed RNG for deterministic tests

Use a local StdRng to avoid nondeterminism across test runs.
```diff
- let random_values = (0..100).map(|_| rand::random_bool(0.5)).collect::<Vec<_>>();
+ let mut rng = StdRng::seed_from_u64(1);
+ let random_values = (0..100)
+     .map(|_| rng.random_bool(0.5))
+     .collect::<Vec<_>>();
```
986-988: Nit: message string for AllOp

The label says “false” but this tests all(); suggest “true”.
```diff
- fn name(&self) -> String {
-     format!("all with {} false returned", self.true_count)
- }
+ fn name(&self) -> String {
+     format!("all with {} true returned", self.true_count)
+ }
```
149-149: Add targeted size_hint regression test for ArrayIter

Locks the fix against regressions by asserting size_hint after mixed next/next_back.
```rust
#[test]
fn assert_array_iter_size_hint_after_next_back_and_next() {
    let a = Int32Array::from(vec![Some(1), None, Some(3), Some(4)]);
    let mut it = ArrayIter::new(&a);
    assert_eq!(it.size_hint(), (4, Some(4)));
    assert_eq!(it.next_back(), Some(Some(4)));
    assert_eq!(it.size_hint(), (3, Some(3)));
    assert_eq!(it.next(), Some(Some(1)));
    assert_eq!(it.size_hint(), (2, Some(2)));
    // consume remaining (by_ref so the iterator can still be inspected)
    let _: Vec<_> = it.by_ref().collect();
    assert_eq!(it.size_hint(), (0, Some(0)));
}
```
156-175: nth_back(): correct; tweak comment wording

Logic matches std semantics, including exhausting on out-of-range. Minor doc nit: subtract from end_offset, not “current offset”.
```diff
- // so subtracting n to the current offset and not n - 1
+ // so subtracting n from end_offset (and not n - 1)
```
131-133: Treating NULL as false is correct; use borrowed buffer when possible

Current flow clones the values buffer even when mask has no nulls. Borrowing avoids an allocation.
```diff
- let mask = maybe_prep_null_mask_filter(mask);
- SlicesIterator::from(&mask).for_each(|(start, end)| {
+ // Borrow when no nulls, allocate only if needed
+ if mask.null_count() == 0 {
+     SlicesIterator::from(mask.values()).for_each(|(start, end)| {
+         // body unchanged
+     });
+ } else {
+     let cleaned = maybe_prep_null_mask_filter(mask);
+     SlicesIterator::from(&cleaned).for_each(|(start, end)| {
+         // body unchanged
+     });
+ }
```

Note: requires changing maybe_prep_null_mask_filter to return BooleanBuffer and keeping both paths.
171-180: Co-locate mask-prep helper with filtering utilities

maybe_prep_null_mask_filter duplicates filter semantics; consider moving it into crate::filter next to prep_null_mask_filter and re-exporting for reuse.
arrow-select/src/concat.rs (2)
244-256: Allocate offsets buffer by bytes to reduce reallocations

MutableBuffer::with_capacity expects bytes; current capacity is in elements. Precompute the element count and multiply by size_of::<OffsetSize>().
```diff
- let mut offsets = MutableBuffer::with_capacity(lists.iter().map(|l| l.offsets().len()).sum());
+ let offset_len: usize = lists.iter().map(|l| l.offsets().len()).sum();
+ let mut offsets =
+     MutableBuffer::with_capacity(offset_len * std::mem::size_of::<OffsetSize>());
```

This avoids repeated growth on large inputs.
238-253: Optional: avoid carrying unused child values for sliced ListView inputs

For sliced ListView arrays, values() returns the full child; concatenating full children increases memory. Consider computing the minimal [min_offset, max_offset) used per input and concatenating sliced children instead, adjusting offsets by (offset - min_offset) + global_offset.
arrow/tests/array_equal.rs (1)
865-872: Add a guard test for size mismatch on non-null ListView elements

To catch the null-path size check bug (now fixed), add a case where sizes differ for a non-null slot.
```rust
#[test]
fn test_list_view_size_mismatch_is_unequal() {
    // a: [[1,2,3]], b: [[1,2,3,4]] -> unequal
    let a = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3)])]);
    let b = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3), Some(4)])]);
    test_equal(&a, &b, false);
}
```

This ensures per-element size equality is enforced.
arrow/benches/cast_kernels.rs (1)
363-401: Optional: factor target type construction.
DRY the RunEndEncoded<Int32, T> construction with a small helper to avoid duplication across benches.arrow-cast/src/cast/mod.rs (5)
144-146: Guard can_cast_types for valid RunEndEncoded index typesToday this can return true even if run_ends uses an unsupported index type (leading to can_cast_types=true but cast failing later). Add an index-type filter to keep capability checks aligned with cast behavior.
- (RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type), - (_, RunEndEncoded(_, value_type)) => can_cast_types(from_type, value_type.data_type()), + (RunEndEncoded(index_type, value_type), _) + if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) => + { + can_cast_types(value_type.data_type(), to_type) + } + (_, RunEndEncoded(index_type, value_type)) + if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) => + { + can_cast_types(from_type, value_type.data_type()) + }
11452-11467: Fix misleading test comment and variable nameThe test casts REE -> Primitive Int64 and downcasts to StringArray/Int64Array, not a RunArray. Tweak the comments/variable name for clarity.
- // Cast to Int64 + // Cast to Int64 (flatten REE to a primitive array) @@ - // Verify the result is a RunArray with Int64 values - let result_run_array = cast_result.as_any().downcast_ref::<Int64Array>().unwrap(); + // Verify the result is a flat Int64Array with expanded values + let result_values = cast_result.as_any().downcast_ref::<Int64Array>().unwrap(); @@ - result_run_array.values(), + result_values.values(),
11469-11485: Tighten test comment wording

Result is a StringArray, not a RunArray; update comment to avoid confusion.

```diff
- // Verify the result is a RunArray with String values
+ // Verify the result is a StringArray with expanded values
```
11717-11734: Align test comment with intent

This test checks Struct -> FixedSizeBinary cast failure; remove the unrelated RunEndEncoded note in the comment.

```diff
- // This should fail because:
- // 1. The target type is not RunEndEncoded
- // 2. The target type is not supported for casting from StructArray
+ // This should fail because casting from StructArray to FixedSizeBinary is not supported
```
11548-11556: Strengthen null-run assertions

You already assert null_count on values(). Also assert which runs are null to make intent explicit.

```diff
  assert_eq!(result_run_array.values().null_count(), 3);
+ // Optionally assert positions of null runs in the values array
+ let vals = result_run_array.values().as_primitive::<Int32Type>();
+ assert!(vals.is_null(1)); // run representing consecutive nulls
+ assert!(vals.is_null(4));
+ assert!(vals.is_null(7));
```

parquet/src/encryption/ciphers.rs (1)
54-59: HeapSize for RingGcmBlockDecryptor returns 0 — add a note or guard for ring changes.
If ring ever allocates for key schedules, this will under‑report. Consider a brief comment referencing the specific ring version tested.parquet/src/file/metadata/memory.rs (1)
54-94: HashMap heap sizing: make group_size arch‑aware to reduce test brittleness.
Current 16-byte group size can be smaller on 32‑bit, causing platform‑dependent results. Consider:- // Overhead for the control tags group, which may be smaller depending on architecture - let group_size = 16; + // Overhead for the control tags group; 16 on 64‑bit, typically 8 on 32‑bit + let group_size = if cfg!(target_pointer_width = "64") { 16 } else { 8 };Also keep the “approximation” disclaimer; update if hashbrown behavior changes.
parquet/src/file/metadata/mod.rs (3)
1884-1887: Avoid brittle exact byte constants in tests.
Given HashMap/Arc sizing is approximate and arch‑dependent, strict equality to 2766/2934 may fail cross‑platform or with std/hashbrown changes. Prefer arch‑aware guards or relative assertions (e.g., >= previous baseline), or compute expected from the same sizing helpers.

I can propose a patch to make these assertions resilient if you prefer.
1915-1918: Same concern: platform‑sensitive constants.
Use arch‑aware cfg or relative checks to prevent flakes.
1924-1997: New decryptor memory test: prefer resilient assertions.
Exact 2058/3072 deltas are likely to drift across platforms and versions. Suggest asserting that adding a decryptor increases memory_size by a positive amount and by at least the decryptor’s reported HeapSize, instead of hard‑coding totals.

I can draft a version that derives expectations from `heap_size()` calls to avoid magic numbers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (38)
- arrow-array/src/iterator.rs (4 hunks)
- arrow-avro/README.md (1 hunks)
- arrow-buffer/src/util/bit_iterator.rs (5 hunks)
- arrow-cast/Cargo.toml (2 hunks)
- arrow-cast/src/cast/decimal.rs (4 hunks)
- arrow-cast/src/cast/mod.rs (6 hunks)
- arrow-cast/src/cast/run_array.rs (1 hunks)
- arrow-data/src/data.rs (1 hunks)
- arrow-data/src/equal/list_view.rs (1 hunks)
- arrow-data/src/equal/mod.rs (3 hunks)
- arrow-data/src/transform/list_view.rs (1 hunks)
- arrow-data/src/transform/mod.rs (6 hunks)
- arrow-row/src/lib.rs (2 hunks)
- arrow-select/src/concat.rs (11 hunks)
- arrow-select/src/filter.rs (7 hunks)
- arrow-select/src/take.rs (5 hunks)
- arrow-select/src/zip.rs (4 hunks)
- arrow/benches/cast_kernels.rs (1 hunks)
- arrow/tests/array_equal.rs (2 hunks)
- parquet-variant-compute/src/shred_variant.rs (4 hunks)
- parquet-variant-compute/src/type_conversion.rs (1 hunks)
- parquet/THRIFT.md (3 hunks)
- parquet/src/arrow/arrow_reader/selection.rs (1 hunks)
- parquet/src/basic.rs (2 hunks)
- parquet/src/column/page.rs (1 hunks)
- parquet/src/column/reader.rs (2 hunks)
- parquet/src/column/reader/decoder.rs (1 hunks)
- parquet/src/encodings/decoding.rs (6 hunks)
- parquet/src/encodings/rle.rs (1 hunks)
- parquet/src/encryption/ciphers.rs (3 hunks)
- parquet/src/encryption/decrypt.rs (5 hunks)
- parquet/src/file/metadata/memory.rs (2 hunks)
- parquet/src/file/metadata/mod.rs (3 hunks)
- parquet/src/file/reader.rs (2 hunks)
- parquet/src/file/serialized_reader.rs (4 hunks)
- parquet/src/parquet_macros.rs (5 hunks)
- parquet/src/schema/types.rs (2 hunks)
- parquet/tests/arrow_reader/bad_data.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (26)
- arrow-array/src/iterator.rs (1)
  - arrow-array/src/array/mod.rs (19): iter (601-601), iter (609-611), iter (618-620), iter (632-632), iter (636-638), iter (641-643), array (944-944), array (953-953), array (962-962), array (970-970), array (980-980), array (994-994), array (1008-1008), array (1020-1020), array (1042-1042), array (1132-1135), len (155-155), len (366-368), len (445-447)
- parquet/src/encryption/ciphers.rs (3)
  - parquet/src/encryption/decrypt.rs (4): heap_size (276-278), heap_size (301-310), heap_size (358-360), heap_size (583-587)
  - parquet/src/file/metadata/memory.rs (6): heap_size (41-41), heap_size (45-51), heap_size (55-93), heap_size (97-100), heap_size (104-108), heap_size (112-114)
  - parquet/src/file/metadata/mod.rs (1): heap_size (931-933)
- parquet/src/file/metadata/mod.rs (2)
  - parquet/src/encryption/decrypt.rs (4): builder (365-367), aad_prefix (378-380), footer_key (389-397), column_key (400-418)
  - parquet/src/encryption/encrypt.rs (5): builder (103-105), aad_file_unique (330-332), aad_prefix (138-140), footer_key_metadata (113-115), footer_key (118-120)
- arrow-cast/src/cast/run_array.rs (2)
  - arrow-cast/src/cast/mod.rs (25): cast (345-347), run_end_encoded_cast (799-799), run_end_encoded_cast (800-800), run_end_encoded_cast (801-801), array (375-375), array (381-381), array (389-389), array (395-395), array (411-411), array (413-413), array (424-424), array (427-427), array (438-438), array (455-467), array (479-481), array (508-524), array (535-535), array (545-546), array (550-551), array (555-556), cast_with_options (751-2202), None (9986-9986), cast_to_run_end_encoded (809-809), cast_to_run_end_encoded (814-814), cast_to_run_end_encoded (819-819)
  - arrow-ord/src/partition.rs (1): partition (127-153)
- arrow-data/src/equal/list_view.rs (2)
  - arrow-data/src/data.rs (3): count_nulls (51-62), len (402-404), len (1942-1945)
  - arrow-data/src/equal/mod.rs (4): equal (161-166), equal_values (62-129), list_view_equal (109-109), list_view_equal (110-110)
- arrow/tests/array_equal.rs (2)
  - arrow-select/src/take.rs (3): take (86-102), data (1026-1028), indices (702-713)
  - arrow-array/src/array/list_view_array.rs (2): Buffer (1097-1097), test_list_view_array (586-619)
- arrow-select/src/zip.rs (1)
  - arrow-select/src/filter.rs (10): filter (155-167), filter (314-316), filter (1459-1461), filter (1489-1491), prep_null_mask_filter (122-126), from (66-68), new (60-62), new (89-93), new (217-231), new (636-653)
- parquet/src/schema/types.rs (2)
  - parquet/src/record/reader.rs (1): repetition (534-545)
  - parquet_derive/src/parquet_field.rs (1): repetition (751-757)
- arrow/benches/cast_kernels.rs (1)
  - arrow-cast/src/cast/mod.rs (18): c (7821-7821), b (4154-4154), b (4166-4166), b (4196-4196), b (4207-4207), b (4224-4224), b (4249-4249), b (4277-4277), b (4343-4343), b (4355-4355), b (4367-4367), b (4452-4452), b (4462-4462), b (4472-4472), b (4482-4482), b (4492-4492), b (4751-4751), cast (345-347)
- arrow-row/src/lib.rs (2)
  - arrow-array/src/array/list_array.rs (1): offsets (311-313)
  - arrow-array/src/array/primitive_array.rs (2): i (2070-2070), from_iter_values (753-761)
- parquet/src/file/metadata/memory.rs (2)
  - parquet/src/encryption/decrypt.rs (4): heap_size (276-278), heap_size (301-310), heap_size (358-360), heap_size (583-587)
  - parquet/src/schema/types.rs (5): heap_size (72-77), heap_size (677-680), heap_size (752-754), heap_size (847-851), heap_size (1026-1028)
- parquet/src/arrow/arrow_reader/selection.rs (1)
  - parquet/src/arrow/arrow_reader/read_plan.rs (1): selection (55-57)
- parquet/src/file/reader.rs (2)
  - parquet/tests/arrow_reader/io/sync_reader.rs (2): len (396-398), get_bytes (413-418)
  - parquet/tests/arrow_reader/io/async_reader.rs (1): get_bytes (373-390)
- arrow-data/src/transform/list_view.rs (2)
  - arrow-data/src/transform/mod.rs (5): build_extend (238-283), array (220-220), new (401-403), len (753-755), extend_nulls (742-749)
  - arrow-array/src/array/list_view_array.rs (1): sizes (277-279)
- arrow-data/src/transform/mod.rs (2)
  - arrow-data/src/transform/list_view.rs (2): build_extend (24-47), extend_nulls (49-56)
  - arrow-data/src/transform/list.rs (2): build_extend (27-48), extend_nulls (50-57)
- arrow-data/src/equal/mod.rs (2)
  - arrow-data/src/equal/run.rs (1): run_equal (26-86)
  - arrow-data/src/equal/list_view.rs (5): list_view_equal (24-129), lhs (31-31), lhs (32-32), rhs (34-34), rhs (35-35)
- parquet-variant-compute/src/shred_variant.rs (1)
  - parquet-variant-compute/src/variant_array.rs (11): from_iter (445-451), from_iter (455-457), from (433-435), from (439-441), from (701-703), from (707-709), from (865-870), new (493-499), new (766-768), new (820-822), new (884-886)
- arrow-data/src/data.rs (1)
  - arrow-array/src/array/list_view_array.rs (2): offsets (262-264), sizes (277-279)
- arrow-select/src/take.rs (3)
  - arrow-data/src/data.rs (5): new (1857-1859), new (1919-1932), nulls (390-392), nulls (1948-1953), slice (550-578)
  - arrow-array/src/cast.rs (1): as_list_view (927-929)
  - arrow-array/src/array/list_view_array.rs (6): nulls (416-418), to_data (379-381), x (1010-1010), x (1045-1045), x (1080-1080), x (1119-1119)
- arrow-select/src/concat.rs (5)
  - arrow-array/src/array/list_array.rs (12): x (1269-1269), nulls (569-571), new (266-273), values (322-324), offsets (311-313), offset (565-567), try_new (209-259), builder (1278-1278), builder (1279-1279), builder (1280-1280), builder (1281-1281), builder (1282-1282)
  - arrow-array/src/array/list_view_array.rs (8): x (1010-1010), x (1045-1045), x (1080-1080), x (1119-1119), nulls (416-418), sizes (277-279), offsets (262-264), offset (412-414)
  - arrow-buffer/src/buffer/offset.rs (4): new (68-79), n (169-171), offsets (282-290), as_ref (248-250)
  - arrow-array/src/builder/generic_list_view_builder.rs (3): values (119-121), with_capacity (85-96), array (443-443)
  - arrow-buffer/src/buffer/scalar.rs (1): as_ref (144-146)
- arrow-cast/src/cast/mod.rs (2)
  - arrow-array/src/array/run_array.rs (6): run_array (560-562), run_array (973-973), run_array (999-999), run_array (1033-1033), run_ends (121-123), run_ends (550-552)
  - arrow-array/src/array/mod.rs (5): make_array (744-842), value (581-581), data_type (127-127), data_type (358-360), data_type (437-439)
- parquet-variant-compute/src/type_conversion.rs (1)
  - arrow-cast/src/cast/decimal.rs (8): rescale_decimal (301-328), rescale_decimal (928-928), rescale_decimal (940-940), rescale_decimal (946-946), rescale_decimal (958-958), rescale_decimal (964-964), rescale_decimal (970-970), rescale_decimal (976-976)
- arrow-cast/src/cast/decimal.rs (3)
  - arrow-array/src/array/primitive_array.rs (5): value (742-750), value (1227-1229), validate_decimal_precision (1603-1618), len (684-686), len (1184-1186)
  - arrow-array/src/types.rs (10): is_valid_decimal_precision (1347-1347), is_valid_decimal_precision (1413-1415), is_valid_decimal_precision (1449-1451), is_valid_decimal_precision (1485-1487), is_valid_decimal_precision (1521-1523), validate_decimal_precision (1340-1344), validate_decimal_precision (1409-1411), validate_decimal_precision (1445-1447), validate_decimal_precision (1481-1483), validate_decimal_precision (1517-1519)
  - arrow-data/src/decimal.rs (1): validate_decimal_precision (1134-1167)
- arrow-buffer/src/util/bit_iterator.rs (1)
  - arrow-buffer/src/buffer/boolean.rs (6): new (61-74), iter (202-204), slice (162-172), len (119-121), from (283-287), from (291-293)
- parquet/src/encryption/decrypt.rs (2)
  - parquet/src/encryption/ciphers.rs (1): heap_size (55-58)
  - parquet/src/file/metadata/memory.rs (10): heap_size (41-41), heap_size (45-51), heap_size (55-93), heap_size (97-100), heap_size (104-108), heap_size (112-114), heap_size (118-120), heap_size (124-126), heap_size (130-142), keys (91-91)
- arrow-select/src/filter.rs (4)
  - arrow-data/src/data.rs (10): new (1857-1859), new (1919-1932), new_unchecked (273-300), nulls (390-392), nulls (1948-1953), null_count (420-425), null_count (1956-1959), buffer (586-588), buffer (922-922), buffer (1496-1496)
  - arrow-array/src/record_batch.rs (1): new_unchecked (261-271)
  - arrow-array/src/array/list_view_array.rs (2): nulls (416-418), to_data (379-381)
  - arrow-array/src/array/mod.rs (9): nulls (205-205), nulls (387-389), nulls (457-459), null_count (286-288), null_count (403-405), null_count (473-475), to_data (108-108), to_data (350-352), to_data (429-431)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: MIRI
- GitHub Check: Test
- GitHub Check: Clippy
- GitHub Check: Check Compilation
- GitHub Check: Build wasm32
- GitHub Check: Archery test With other arrows
- GitHub Check: Check Compilation
- GitHub Check: Test
- GitHub Check: Test
- GitHub Check: Check Compilation
- GitHub Check: claude-review
- GitHub Check: PySpark Integration Test (stable)
- GitHub Check: Test
- GitHub Check: Verify MSRV (Minimum Supported Rust Version)
- GitHub Check: Rustdocs are clean
- GitHub Check: Audit
- GitHub Check: Test on Mac
- GitHub Check: codex
- GitHub Check: Test on Windows
🔇 Additional comments (74)
parquet/src/arrow/arrow_reader/selection.rs (1)
1436-1463: Well-structured tests for `trim()` behavior. The two test scenarios comprehensively verify the expected behavior:
- When trailing selectors are non-skips, they are preserved.
- When trailing skips exist, they are correctly removed.
The assertions clearly validate that the selector order and content remain intact after trimming.
parquet/src/column/page.rs (1)
34-34: LGTM! Useful debugging improvement. Adding `Debug` to the `Page` enum is a good practice that enables debug formatting without any behavioral changes or performance impact. All field types already implement `Debug`, so this is a safe addition.
Note: The PR title mentions `ArrayIter` size hint fixes, but this file concerns Parquet page definitions. If this is part of a larger multi-file PR, consider ensuring the PR title reflects the full scope of changes.
parquet/src/column/reader.rs (2)
572-581: Excellent defensive bounds checking. The added validation properly guards against:
- Insufficient buffer to read the 4-byte size header
- Integer overflow when computing the end offset
- Insufficient buffer to contain the declared data
The use of `checked_add` and explicit bounds validation prevents potential panics and provides clear error messages.
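The pattern described above (a 4-byte size header validated with `checked_add` before slicing) can be sketched in plain std Rust; the function name and little-endian header layout here are illustrative, not the actual parquet reader code:

```rust
// Illustrative sketch of defensive bounds checking for a length-prefixed
// record: read a 4-byte little-endian size header, then validate the
// payload range with checked_add so a corrupt length can neither overflow
// usize nor slice out of bounds.
fn read_prefixed(buf: &[u8], start: usize) -> Result<&[u8], String> {
    let header_end = start
        .checked_add(4)
        .ok_or_else(|| "offset overflow computing header end".to_string())?;
    if buf.len() < header_end {
        return Err(format!("not enough bytes for size header at offset {start}"));
    }
    let len = u32::from_le_bytes(buf[start..header_end].try_into().unwrap()) as usize;
    let data_end = header_end
        .checked_add(len)
        .ok_or_else(|| "length overflow computing data end".to_string())?;
    if buf.len() < data_end {
        return Err(format!("declared size {len} exceeds remaining buffer"));
    }
    Ok(&buf[header_end..data_end])
}

fn main() {
    // 4-byte header declaring 2 bytes of payload, followed by the payload.
    let ok = [2, 0, 0, 0, 0xAA, 0xBB];
    assert_eq!(read_prefixed(&ok, 0).unwrap(), &[0xAA, 0xBB]);

    // Header declares 10 bytes but only 2 are present: rejected, no panic.
    let bad = [10, 0, 0, 0, 0xAA, 0xBB];
    assert!(read_prefixed(&bad, 0).is_err());
}
```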
605-622: Good test coverage for error paths. The test properly exercises the new bounds checking logic with two scenarios where the declared data size exceeds the available buffer. Error message assertions confirm the expected behavior.
arrow-row/src/lib.rs (2)
916-927: Good zero-copy optimization with proper fallback handling. The implementation correctly attempts a zero-copy path by consuming the `BinaryArray` with `into_parts()` and trying `into_vec()`. When the buffer is shared (reference count > 1), `into_vec()` will fail, and the fallback to `to_vec()` ensures correctness by copying the data. This is a solid optimization that maintains correctness for all cases.
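The same unique-vs-shared ownership pattern can be shown with std's `Arc` alone (a sketch of the idea, not the actual `Buffer::into_vec` implementation):

```rust
use std::sync::Arc;

// Zero-copy when the allocation is uniquely owned; copy when it is shared.
fn into_owned(buf: Arc<Vec<u8>>) -> Vec<u8> {
    match Arc::try_unwrap(buf) {
        Ok(v) => v,                     // refcount == 1: take the allocation
        Err(shared) => shared.to_vec(), // refcount > 1: fall back to a copy
    }
}

fn main() {
    // Uniquely owned: the zero-copy path succeeds.
    let unique = Arc::new(vec![1u8, 2, 3]);
    assert_eq!(into_owned(unique), vec![1, 2, 3]);

    // A live clone keeps the refcount above 1, forcing the copy path,
    // mirroring the `_binary_rows_shared_buffer` trick in the test.
    let shared = Arc::new(vec![4u8, 5]);
    let _keep_alive = Arc::clone(&shared);
    assert_eq!(into_owned(shared), vec![4, 5]);
}
```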
2481-2492: Excellent test coverage for the shared buffer scenario. This test specifically validates the fallback path in `from_binary` when the buffer is shared. By cloning `binary_rows` on line 2487, the test ensures the buffer's reference count is greater than 1, which forces `into_vec()` to fail and triggers the fallback to `to_vec()`. The unused variable `_binary_rows_shared_buffer` is intentionally kept in scope to maintain the shared state.
arrow-data/src/data.rs (1)
991-991: Correct loop bound change. Iterating over `sizes.len()` (which equals the number of ListView elements) is the correct approach for validating each list element's offset and size. The previous iteration bound was likely incorrect.
parquet-variant-compute/src/shred_variant.rs (3)
338-338: LGTM! The refactored test construction using `from_iter` is correct and simplifies the test setup.
367-367: LGTM! The simplified construction using direct `Variant` values is correct and concise.
375-382: LGTM! The test correctly uses the `Some`/`None` pattern to test various null scenarios (array-level null vs variant null), which is essential for comprehensive null handling validation.
parquet/src/file/serialized_reader.rs (3)
1139-1139: LGTM! The import is appropriately added for test usage.
464-467: LGTM! Replacing `unimplemented!()` with a proper error return is the correct approach. The error message helpfully includes the page type for debugging.
1179-1213: LGTM! The test appropriately validates error handling for unsupported page types, covering the new error return at lines 464-467.
parquet/src/file/reader.rs (1)
127-132: Bounds checking prevents out-of-bounds reads. The defensive check correctly prevents reading beyond the buffer length while allowing valid edge cases (e.g., `start == self.len()` for empty reads).
parquet/src/schema/types.rs (1)
848-851: HeapSize: avoid double-counting primitive_type — good change. Excluding primitive_type here prevents double-counting since SchemaDescriptor::schema accounts for the Type tree. Keeping only path.heap_size() is correct.
parquet/src/column/reader/decoder.rs (1)
141-153: Good: dispatch table sized by MAX_DISCRIMINANT. This future-proofs the decoder cache against added encodings and stays consistent with EncodingMask. LGTM.
parquet/src/encodings/decoding.rs (2)
382-395: Good guards for empty data and oversized bit width. Early EOF and bit-width validation make RLE dictionary decoding safer. LGTM.
1408-1413: Tests cover new error paths well. Empty dict data and invalid miniblock width cases are covered. Nice.
Also applies to: 2127-2172
parquet/src/parquet_macros.rs (1)
39-42: Nice: expose VARIANTS and MAX_DISCRIMINANT on thrift enums. This enables static sizing (e.g., dispatch tables) without manual updates. LGTM.
Also applies to: 85-112
parquet/src/basic.rs (1)
698-705: Good: EncodingMask derives MAX_ENCODING from enum discriminants. Keeps masks in sync automatically with new encodings. LGTM.
arrow-array/src/iterator.rs (2)
47-53: ArrayIter now Clone — good additive API. Cloning preserves iterator cursors; helpful for tests and user code. No behavior risks spotted.
101-103: Fix: exact size_hint after next_back/next. Using `current_end - current` makes `len()`/`ExactSizeIterator` correct under mixed forward/backward consumption. Matches the PR goal.
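The cursor arithmetic can be demonstrated with a minimal std-only double-ended iterator (a sketch of the technique, not the actual `ArrayIter` code):

```rust
// Keep a front cursor and a back cursor, and derive size_hint/len from
// their difference so mixed next()/next_back() consumption stays exact.
struct TwoEndedIter {
    current: usize,
    current_end: usize,
}

impl Iterator for TwoEndedIter {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        if self.current == self.current_end {
            return None;
        }
        let v = self.current;
        self.current += 1;
        Some(v)
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.current_end - self.current;
        (remaining, Some(remaining))
    }
}

impl DoubleEndedIterator for TwoEndedIter {
    fn next_back(&mut self) -> Option<usize> {
        if self.current == self.current_end {
            return None;
        }
        self.current_end -= 1;
        Some(self.current_end)
    }
}

impl ExactSizeIterator for TwoEndedIter {}

fn main() {
    let mut it = TwoEndedIter { current: 0, current_end: 5 };
    it.next();      // consume one from the front
    it.next_back(); // consume one from the back
    // Exact even after mixed consumption — the bug this PR fixes.
    assert_eq!(it.len(), 3);
    assert_eq!(it.size_hint(), (3, Some(3)));
}
```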
arrow-buffer/src/util/bit_iterator.rs (4)
26-31: BitIterator now Clone — good. Cheap, safe clone of cursor state; enables robust test patterns.
76-82: count() optimized via len(). Avoids iteration; correct for ExactSizeIterator.
83-102: nth(): correct semantics and overflow handling. Advances to current + n, returns next(); handles overflow/out-of-range by exhausting. Looks good.
117-139: max(): efficient boolean specialization. Short-circuits by checking any set bit via BitIndexIterator; returns Some(false) for non-empty all-false. Matches Iterator::max semantics for bool.
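The specialization's semantics are easy to pin down in plain Rust (a sketch of the equivalent logic, not the `BitIterator` implementation itself):

```rust
// max() over an iterator of bool: return Some(true) as soon as any true is
// seen (short-circuit), Some(false) for a non-empty all-false iterator, and
// None for an empty one — exactly Iterator::max's contract for bool.
fn bool_max(it: impl Iterator<Item = bool>) -> Option<bool> {
    let mut seen_any = false;
    for b in it {
        if b {
            return Some(true); // short-circuit on the first set bit
        }
        seen_any = true;
    }
    if seen_any { Some(false) } else { None }
}

fn main() {
    assert_eq!(bool_max([false, true, false].into_iter()), Some(true));
    assert_eq!(bool_max([false, false].into_iter()), Some(false));
    assert_eq!(bool_max(std::iter::empty::<bool>()), None);
    // Agrees with the standard library's max() on the same inputs.
    assert_eq!([false, true].iter().copied().max(), Some(true));
}
```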
arrow-data/src/equal/mod.rs (1)
33-34: ListView equality wiring looks correct. New module import and match arms correctly route ListView/LargeListView to list_view_equal with appropriate offset types. No issues from this change.
Please ensure the equality tests cover sliced ListView arrays and null-handling paths, and that they run in CI.
Also applies to: 45-47, 109-111
arrow-select/src/zip.rs (1)
298-352: Good coverage for NULL-in-mask behavior. New tests validate primitive/string/large-string cases with nulls treated as false. LGTM.
Also applies to: 353-402
arrow-select/src/concat.rs (2)
484-486: Dispatch for ListView/LargeListView looks good. concat() routes ListView variants to concat_list_view; tests cover sliced and non-sliced paths.
951-1048: ListView concat tests are solid. Good coverage for regular and sliced ListView cases. LGTM.
arrow/tests/array_equal.rs (1)
825-863: Nested List<ListView<…>> tests are valuable. Sliced outer List with inner ListView equality is well covered. LGTM.
arrow-data/src/transform/mod.rs (6)
36-36: Wiring module looks good. Module registration for list_view is correct and consistent with the added match arms below.
270-272: Dispatch to list_view::build_extend is correct, but depends on offset rebasing. Routing is fine. Note: correctness hinges on list_view::build_extend rebasing offsets and extending child_data when concatenating inputs. See comment in list_view.rs.
317-319: Nulls dispatch LGTM. ListView null extension is wired consistently with List/LargeList variants.
452-456: Capacities match updated appropriately. Including ListView/LargeListView in List-like capacities is correct.
497-520: Child data grouping for ListView is OK, but extension must be driven by the extend closure. Collecting the child arrays mirrors List behavior and is fine. However, unless list_view::build_extend extends child_data (and re-bases offsets), the frozen output will have inconsistent offsets vs values. See fix proposal in list_view.rs.
792-799: Buffer selection includes ListView correctly. Emitting two buffers (offsets, sizes) for ListView/LargeListView is correct.
arrow-data/src/transform/list_view.rs (1)
49-56: Nulls handling is fine. Pushing zeroed offsets/sizes for nulls is acceptable for ListView.
arrow-select/src/take.rs (4)
221-226: ListView dispatch in take_impl looks correct. Routes ListView/LargeListView to the new take_list_view with proper offset types.
630-655: take_list_view implementation LGTM. Offsets/sizes are taken; nulls propagated; child values reused from input. Matches ListView semantics.
1016-1016: Import ToPrimitive for tests is fine. Required for index builder mapping; OK.
1858-2004: Good ListView take tests. Covers reversal, null indices, null values, and slicing across i32/i64 offsets.
arrow-select/src/filter.rs (6)
61-68: Iterator ergonomics improved. Using From<&BooleanBuffer> for SlicesIterator is clean and avoids extra calls.
184-205: filter_record_batch helper is sensible. Optimization gating mirrors filter(), and the call-through is straightforward.
318-341: FilterPredicate::filter_record_batch LGTM. Builds arrays then constructs the batch with correct row count. Safe given consistent lengths.
389-394: ListView dispatch added correctly. Selects i32/i64 variants and routes to filter_list_view.
909-936: filter_list_view implementation looks correct. Filters offsets/sizes; computes filtered nulls; preserves values child array. Matches expected ListView behavior.
1447-1509: Solid ListView filter tests. Covers basic and sliced scenarios for both offset sizes.
parquet-variant-compute/src/type_conversion.rs (1)
20-20: Switch to shared rescale_decimal looks good. Consolidates logic with arrow-cast; no functional concerns.
arrow-cast/Cargo.toml (1)
46-46: Dependency addition verified; no workspace cycles detected. arrow-ord is confirmed as a workspace member, and its dependencies do not include arrow-cast, confirming no circular dependency exists. The addition is safe and appropriate.
arrow-cast/src/cast/mod.rs (4)
44-51: Module wiring LGTM. New module and import are clean and scoped correctly. No further action.
72-72: Public re-export is fine. Exposing rescale_decimal alongside DecimalCast makes sense for external consumers.
798-805: REE source dispatch looks correct. Index-type driven dispatch to run_end_encoded_cast:: is clear; unsupported types error out fast with a clear message.
806-828: No changes needed. The function signature confirms `cast_to_run_end_encoded` requires `&ArrayRef`, which matches the review comment's own caveat: "If its signature requires &ArrayRef, ignore." The code is correct as written.
arrow-cast/src/cast/decimal.rs (9)
148-194: Well-designed upscaling builder with efficient precomputed tables. The approach of retrieving `10^k - 1` from `MAX_FOR_EACH_PRECISION` and adding 1 avoids runtime power calculations while staying within type bounds. The distinction between fallible and infallible paths based on precision analysis is sound.
Minor note: The infallible closure uses `.unwrap()` on line 192, which relies on the precision analysis guaranteeing that `from_decimal` succeeds. This assumption holds given that:
- Input values fit within declared precision (validated elsewhere)
- Precision limits ensure valid values fit in native types
- The condition `input_precision + delta_scale <= output_precision` correctly predicts success
196-264: LGTM: Downscaling logic correctly implements round-half-away-from-zero. The rounding implementation properly handles both positive and negative values. The infallible condition using `<` rather than `<=` (line 261) correctly accounts for the potential digit added by rounding, as explained in the inline comments. The design choice to return `None` when scale reduction exceeds precision (causing mathematical zero) is appropriate and well-documented.
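Round-half-away-from-zero on scaled integers can be sketched in a few lines of std Rust (illustrative only; the arrow-cast implementation also handles precision checks and overflow):

```rust
// Downscale a decimal's unscaled integer value by 10^delta_scale,
// rounding half away from zero: 2.5 -> 3 and -2.5 -> -3.
fn downscale_half_away_from_zero(value: i128, delta_scale: u32) -> i128 {
    let divisor = 10i128.pow(delta_scale);
    let half = divisor / 2;
    if value >= 0 {
        (value + half) / divisor
    } else {
        (value - half) / divisor
    }
}

fn main() {
    // 2.5 (unscaled 25 at scale 1) downscaled by one digit rounds to 3.
    assert_eq!(downscale_half_away_from_zero(25, 1), 3);
    assert_eq!(downscale_half_away_from_zero(-25, 1), -3);
    // 2.4 rounds down to 2; 2.6 rounds up to 3.
    assert_eq!(downscale_half_away_from_zero(24, 1), 2);
    assert_eq!(downscale_half_away_from_zero(26, 1), 3);
}
```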
266-284: Appropriate helper for applying rescaling functions. The function correctly delegates to either the infallible or fallible path, with validation only on the fallible path (for performance).
286-328: New public API for row-level decimal rescaling is well-designed. This function provides a clean interface for rescaling individual decimal values, complementing the existing array-level operations. The documentation clearly explains the behavior, including:
- Validation of precision and scale for both input and output
- Upscale overflow returns `None`
- Excessive downscale returns `Some(ZERO)` (mathematically correct)
330-349: LGTM: Straightforward error message helper.
351-378: Correct array-level casting with appropriate paths for each scenario. The three-way branching correctly handles:
- Infallible casts (direct `unary`)
- Safe fallible casts (filtering with `unary_opt`)
- Unsafe fallible casts (validation with `try_unary`)
380-410: Good refactoring using the new builder pattern. The use of `make_downscaler` and `apply_decimal_cast` improves code modularity. The special case handling (returning zeros when scale reduction exceeds precision) correctly preserves null values while replacing non-nulls with zero.
412-446: Refactoring mirrors downscaler pattern with appropriate error handling. The asymmetric behavior compared to `convert_to_smaller_scale_decimal` is intentional:
- Upscale overflow → error (line 439-444)
- Downscale excessive → zeros (in the other function)
This reflects the mathematical reality: excessive upscaling causes overflow (error condition), while excessive downscaling mathematically rounds to zero (valid result).
926-978: Comprehensive test coverage for the new rescale_decimal function. The tests cover the essential scenarios:
- Normal upscaling and downscaling
- Rounding behavior (half-away-from-zero for both positive and negative values)
- Edge cases (excessive scale changes, overflow)
- Error cases (invalid precision/scale parameters)
parquet/src/encryption/ciphers.rs (2)
21-21: Import looks correct.
Brings HeapSize into scope for the new supertrait requirement.
31-35: Extending BlockDecryptor with HeapSize — Verified as correct. All implementations of `BlockDecryptor` properly implement the `HeapSize` trait. The only implementor, `RingGcmBlockDecryptor` (line 61), correctly implements both `BlockDecryptor` (line 61) and `HeapSize` (line 54). The trait constraint is satisfied.
parquet/src/file/metadata/memory.rs (3)
31-31: Import for HashMap — OK.
Needed for the new HeapSize impl.
96-101: Arc overhead calculation — reasonable approximation.
Counts two counters + T + inner heap. Keep in mind std layout is not strictly guaranteed.
103-109: Arc impl — LGTM.
Correctly uses size_of_val for the underlying type plus inner heap usage.
parquet/src/encryption/decrypt.rs (5)
25-25: Import looks correct.
Brings HeapSize into scope for the added impls.
275-279: HeapSize for ExplicitDecryptionKeys — OK.
Accurately sums footer key and column keys.
300-311: HeapSize for DecryptionKeys — OK.
Returns 0 for retriever variant; sensible since it’s user‑defined.
357-361: HeapSize for FileDecryptionProperties — OK.
Sums keys and optional AAD prefix.
582-588: Verify that calling `heap_size()` directly on `Arc<dyn BlockDecryptor>` compiles without casting. The review comment is factually correct: `BlockDecryptor` extends `HeapSize` as a supertrait (defined in parquet/src/encryption/ciphers.rs:31). In theory, trait object method dispatch should allow calling supertrait methods directly without casting. However, the current code explicitly casts `Arc<dyn BlockDecryptor>` to `Arc<dyn HeapSize>` before calling `heap_size()`, suggesting there may be a compilation or type resolution issue. The suggested fix removes the cast and calls `heap_size()` directly on the trait object. Verify locally that this compiles and functions correctly:

```rust
self.decryption_properties.heap_size()
    + self.footer_decryptor.heap_size() // Direct call without cast
    + self.file_aad.heap_size()
```

If the direct call fails to compile, the cast may be necessary despite the supertrait relationship.
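The supertrait-dispatch question can be checked in isolation with a std-only sketch (hypothetical `Gcm` type and sizes; not the parquet code):

```rust
use std::sync::Arc;

trait HeapSize {
    fn heap_size(&self) -> usize;
}

// BlockDecryptor extends HeapSize as a supertrait, so a trait object can
// call heap_size() directly — no cast to Arc<dyn HeapSize> is required.
trait BlockDecryptor: HeapSize {}

struct Gcm {
    key: Vec<u8>,
}

impl HeapSize for Gcm {
    fn heap_size(&self) -> usize {
        self.key.capacity()
    }
}

impl BlockDecryptor for Gcm {}

fn main() {
    let d: Arc<dyn BlockDecryptor> = Arc::new(Gcm { key: vec![0u8; 16] });
    // Supertrait method via dynamic dispatch, no cast needed.
    assert_eq!(d.heap_size(), 16);
}
```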
parquet/src/file/metadata/mod.rs (1)
290-301: Including decryptor memory in memory_size — OK.
Feature‑gated, uses HeapSize on Option<Box<...>>; consistent with existing accounting.
```diff
 [dependencies]
-arrow-avro = "56"
+arrow-avro = "57.0.0"
```
Update version references in feature snippets section for consistency.
Lines 47 and 54 have been updated to 57.0.0, but the feature snippet examples at lines 152, 157, and 162 still reference version = "56". This creates inconsistency in the documentation.
To maintain clarity, update all version references consistently:
* Minimal, fast build (common pipelines):
```toml
- arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy"] }
+ arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy"] }
```
* Include Zstandard too (modern data lakes):
```toml
- arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy", "zstd"] }
+ arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy", "zstd"] }
```
* Fingerprint helpers:
```toml
- arrow-avro = { version = "56", features = ["md5", "sha256"] }
+ arrow-avro = { version = "57.0.0", features = ["md5", "sha256"] }
```

Also applies to: 54-54
🤖 Prompt for AI Agents
In arrow-avro/README.md around lines 47, 54 and especially the feature snippet
lines 152, 157, and 162, update the inconsistent crate version references that
currently show version = "56" to the new canonical version = "57.0.0"; edit each
toml example to replace the old version string with "57.0.0" (including examples
with default-features and feature lists) so all documented snippets match the
top-of-file version references.
```rust
    let new_run_array: ArrayRef = match target_index_field.data_type() {
        DataType::Int16 => {
            let re = cast_run_ends.as_primitive::<Int16Type>();
            Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
        }
        DataType::Int32 => {
            let re = cast_run_ends.as_primitive::<Int32Type>();
            Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
        }
        DataType::Int64 => {
            let re = cast_run_ends.as_primitive::<Int64Type>();
            Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
        }
        _ => {
            return Err(ArrowError::CastError(
                "Run-end type must be i16, i32, or i64".to_string(),
            ));
        }
    };
    Ok(Arc::new(new_run_array))
}
```
Double-wrapping Arc causes a type error.
`new_run_array` is already an `ArrayRef` (an `Arc`), so returning `Arc::new(new_run_array)` wraps an `Arc` in an `Arc` and does not match the `ArrayRef` return type.
- Ok(Arc::new(new_run_array))
+ Ok(new_run_array)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
    let new_run_array: ArrayRef = match target_index_field.data_type() {
        DataType::Int16 => {
            let re = cast_run_ends.as_primitive::<Int16Type>();
            Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
        }
        DataType::Int32 => {
            let re = cast_run_ends.as_primitive::<Int32Type>();
            Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
        }
        DataType::Int64 => {
            let re = cast_run_ends.as_primitive::<Int64Type>();
            Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
        }
        _ => {
            return Err(ArrowError::CastError(
                "Run-end type must be i16, i32, or i64".to_string(),
            ));
        }
    };
    Ok(new_run_array)
}
```
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 51 to 71, the code constructs
new_run_array as an ArrayRef (Arc<dyn Array>) and then returns
Ok(Arc::new(new_run_array)), which double-wraps an Arc and causes a type error;
change the return to Ok(new_run_array) (or adjust the binding so it isn't
already an Arc) so you return the existing ArrayRef directly instead of wrapping
it again.
```rust
    let run_ends = run_array.run_ends().values().to_vec();
    let mut indices = Vec::with_capacity(run_array.run_ends().len());
    let mut physical_idx: usize = 0;
    for logical_idx in 0..run_array.run_ends().len() {
        // If the logical index is equal to the (next) run end, increment the physical index,
        // since we are at the end of a run.
        if logical_idx == run_ends[physical_idx].as_usize() {
            physical_idx += 1;
        }
        indices.push(physical_idx as i32);
    }

    let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
    if taken.data_type() != to_type {
        cast_with_options(taken.as_ref(), to_type, cast_options)
    } else {
        Ok(taken)
    }
}
```
🧩 Analysis chain
Incorrect logical expansion: builds indices per run, not per row.
This returns only one value per run instead of expanding to array length. Build indices of total_len = last run-end, repeating each run index by its run length.
- let run_ends = run_array.run_ends().values().to_vec();
- let mut indices = Vec::with_capacity(run_array.run_ends().len());
- let mut physical_idx: usize = 0;
- for logical_idx in 0..run_array.run_ends().len() {
- // If the logical index is equal to the (next) run end, increment the physical index,
- // since we are at the end of a run.
- if logical_idx == run_ends[physical_idx].as_usize() {
- physical_idx += 1;
- }
- indices.push(physical_idx as i32);
- }
-
- let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
+ let run_ends = run_array.run_ends().values().to_vec();
+ let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
+ let mut indices: Vec<u32> = Vec::with_capacity(total_len);
+ let mut prev_end = 0usize;
+ for (run_idx, end) in run_ends.iter().enumerate() {
+ let end = end.as_usize();
+ let run_len = end - prev_end;
+ indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
+ prev_end = end;
+ }
+ let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;Add a unit test that casts a small REE array (e.g., run_ends=[2,5], values=["a","b"]) to Utf8 and checks length=5 and correct expansion ["a","a","b","b","b"].
🏁 Script executed:
```shell
#!/bin/bash
# Find and examine the run_array.rs file to verify the current code
find . -name "run_array.rs" -type f | head -20
```
Length of output: 71

🏁 Script executed:
```shell
#!/bin/bash
# Search for the specific code pattern around the suspected issue
rg -n "let run_ends = run_array.run_ends" --type rust -A 20
```
Length of output: 6237

🏁 Script executed:
```shell
#!/bin/bash
# Look for existing tests related to run_array casting
find . -name "*.rs" -type f | xargs grep -l "run_array" | grep -i test
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Check for RunArray/REE related documentation or implementation
rg -n "RunArray|run_encoded" --type rust -C 2 | head -100
```
Length of output: 6749

🏁 Script executed:
```shell
#!/bin/bash
# Find the take function signature to verify it accepts UInt32Array
rg -n "pub fn take" --type rust -A 5 | grep -A 5 "indices"
```
Length of output: 942

🏁 Script executed:
```shell
#!/bin/bash
# Look for tests in run_array.rs file itself
cat arrow-cast/src/cast/run_array.rs | tail -200
```
Length of output: 6928

🏁 Script executed:
```shell
#!/bin/bash
# Search for tests that use cast_with_options on RunArray
rg -n "RunArray|cast_with_options" arrow-cast/src/cast/run_array.rs --type rust -C 3
```
Length of output: 3588

🏁 Script executed:
```shell
#!/bin/bash
# Check the full context of the problematic code to understand the function signature
sed -n '50,100p' arrow-cast/src/cast/run_array.rs
```
Length of output: 2449

🏁 Script executed:
```shell
#!/bin/bash
# Search for UInt32Array usage with take function to validate the proposed change
rg -n "UInt32Array.*take\|take.*UInt32Array" --type rust -C 2
```
Length of output: 0
Implement the expansion fix for RunArray logical form casting.
The current code has a critical logic error: it iterates 0..run_array.run_ends().len() (number of runs) instead of the logical array length, producing incorrectly sized output. For example, with run_ends=[2,5] and values=["a","b"], the current code returns indices=[0,0] (length 2), but should return [0,0,1,1,1] (length 5) to expand the compressed representation.
The proposed fix correctly expands each run by its run_length. The data type change to UInt32Array is valid: take() accepts any Array type, and UInt32Array is already used successfully elsewhere in this same file (line 157).
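The corrected run expansion is easy to verify in isolation. Below is a minimal, self-contained sketch of the same index-building loop over plain `usize` run ends; `expand_run_indices` is a hypothetical helper name, and the real code then feeds the resulting indices to `take`:

```rust
/// Expand run-end-encoded runs into one take-index per logical row.
/// `run_ends[i]` is the exclusive logical end of run `i`, so the total
/// logical length is the last run end.
fn expand_run_indices(run_ends: &[usize]) -> Vec<u32> {
    let total_len = run_ends.last().copied().unwrap_or(0);
    let mut indices = Vec::with_capacity(total_len);
    let mut prev_end = 0;
    for (run_idx, &end) in run_ends.iter().enumerate() {
        // Each run contributes `end - prev_end` copies of its physical index.
        indices.extend(std::iter::repeat(run_idx as u32).take(end - prev_end));
        prev_end = end;
    }
    indices
}
```

With run_ends=[2,5] this yields [0,0,1,1,1], matching the expected expansion of values ["a","b"] to ["a","a","b","b","b"].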
// Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
    array
} else {
    &cast_with_options(array, value_type, cast_options)?
};
Borrow to temporary (lifetime) and minor API nit in take.
cast_array borrows a temporary ArrayRef; also prefer .as_ref() for take.
- let cast_array = if array.data_type() == value_type {
- array
- } else {
- &cast_with_options(array, value_type, cast_options)?
- };
+ let cast_array: ArrayRef = if array.data_type() == value_type {
+ Arc::clone(array)
+ } else {
+ cast_with_options(array, value_type, cast_options)?
+ };
@@
- let partitions = partition(&[Arc::clone(cast_array)])?;
+ let partitions = partition(&[Arc::clone(&cast_array)])?;
@@
- let values_array = take(&cast_array, &indices, None)?;
+ let values_array = take(cast_array.as_ref(), &indices, None)?;

Compile check this file to ensure lifetimes and trait bounds (as_usize) resolve cleanly.
Also applies to: 156-160
Fix critical lifetime issue: borrow of temporary ArrayRef in else branch (line 118).
The current code creates a dangling reference in the else branch:
&cast_with_options(array, value_type, cast_options)?

This borrows a temporary that is immediately dropped. The suggested fix resolves this by making cast_array own the ArrayRef directly, and improves API clarity with explicit .as_ref() calls:
- let cast_array = if array.data_type() == value_type {
- array
- } else {
- &cast_with_options(array, value_type, cast_options)?
- };
+ let cast_array: ArrayRef = if array.data_type() == value_type {
+ Arc::clone(array)
+ } else {
+ cast_with_options(array, value_type, cast_options)?
+ };

Then update usages at line 138 and line 159:
- let partitions = partition(&[Arc::clone(cast_array)])?;
+ let partitions = partition(&[Arc::clone(&cast_array)])?;

- let values_array = take(&cast_array, &indices, None)?;
+ let values_array = take(cast_array.as_ref(), &indices, None)?;
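The ownership pattern behind this fix can be shown with a self-contained sketch; `FakeArrayRef`, `cast_stub`, and `maybe_cast` below are hypothetical stand-ins for `ArrayRef`, `cast_with_options`, and the casting helper, illustrating why both branches of the `if` must yield an owned value:

```rust
use std::sync::Arc;

// Stand-in for ArrayRef (which is Arc<dyn Array> in arrow).
type FakeArrayRef = Arc<Vec<i32>>;

// Hypothetical stand-in for cast_with_options: returns a new owned value.
fn cast_stub(a: &FakeArrayRef) -> FakeArrayRef {
    Arc::new(a.iter().map(|x| x * 2).collect())
}

// Returning `&cast_stub(array)` from the else branch would borrow a
// temporary that is dropped at the end of the statement. Making both
// branches produce an owned value sidesteps the lifetime problem:
fn maybe_cast(array: &FakeArrayRef, needs_cast: bool) -> FakeArrayRef {
    if !needs_cast {
        Arc::clone(array) // cheap refcount bump, no data copy
    } else {
        cast_stub(array) // already owned
    }
}
```

The `Arc::clone` branch mirrors the suggested `Arc::clone(array)` in the fix: cloning the smart pointer is cheap and keeps the two branches the same owned type.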
// Check values for equality, with null checking
let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];

if lhs_range_offsets.len() != rhs_range_offsets.len() {
    return false;
}

for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
    .iter()
    .zip(rhs_range_offsets)
    .zip(lhs_range_sizes)
    .enumerate()
{
    let lhs_is_null = lhs_nulls.is_null(index);
    let rhs_is_null = rhs_nulls.is_null(index);

    if lhs_is_null != rhs_is_null {
        return false;
    }

    let lhs_offset = lhs_offset.to_usize().unwrap();
    let rhs_offset = rhs_offset.to_usize().unwrap();
    let size = size.to_usize().unwrap();

    // Check if values match in the range
    if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
        return false;
    }
}
Bug: per-element size mismatch not checked when nulls are present
In the null-path, equality must assert that sizes are equal for non-null positions. Currently, only lhs size is used for comparison, which can yield false positives if rhs has extra elements.
Apply this fix to compare both sizes and use the agreed size:
- // Check values for equality, with null checking
- let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
- let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];
+ // Check values for equality, with null checking
+ let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
+ let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];
- for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
- .iter()
- .zip(rhs_range_offsets)
- .zip(lhs_range_sizes)
- .enumerate()
- {
+ for (index, ((&lhs_offset, &rhs_offset), (&lhs_size, &rhs_size))) in lhs_range_offsets
+ .iter()
+ .zip(rhs_range_offsets)
+ .zip(lhs_range_sizes.iter().zip(rhs_range_sizes))
+ .enumerate()
+ {
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);
if lhs_is_null != rhs_is_null {
return false;
}
- let lhs_offset = lhs_offset.to_usize().unwrap();
- let rhs_offset = rhs_offset.to_usize().unwrap();
- let size = size.to_usize().unwrap();
+ if !lhs_is_null {
+ if lhs_size != rhs_size {
+ return false;
+ }
+ let lhs_offset = lhs_offset.to_usize().unwrap();
+ let rhs_offset = rhs_offset.to_usize().unwrap();
+ let size = lhs_size.to_usize().unwrap();
- // Check if values match in the range
- if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
- return false;
- }
+ // Check if values match in the range
+ if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+ return false;
+ }
+ }
}

Add a test where one side has an extra element for a non-null sublist to prevent regressions.
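Abstracted away from Arrow's generic offset types and null buffers, the strengthened comparison looks like the following sketch: plain `usize` slices replace the offset/size buffers, a `bool` slice replaces the null bitmap, and direct slice equality replaces `equal_values`. All names are illustrative:

```rust
/// Compare two list-view ranges element by element, requiring that
/// non-null positions agree on both size and contents.
fn list_views_equal(
    lhs_offsets: &[usize], lhs_sizes: &[usize], lhs_valid: &[bool], lhs_data: &[i32],
    rhs_offsets: &[usize], rhs_sizes: &[usize], rhs_valid: &[bool], rhs_data: &[i32],
) -> bool {
    if lhs_offsets.len() != rhs_offsets.len() {
        return false;
    }
    for i in 0..lhs_offsets.len() {
        if lhs_valid[i] != rhs_valid[i] {
            return false; // null vs non-null mismatch
        }
        if lhs_valid[i] {
            // The fix: sizes must match before comparing values.
            if lhs_sizes[i] != rhs_sizes[i] {
                return false;
            }
            let (lo, ro, sz) = (lhs_offsets[i], rhs_offsets[i], lhs_sizes[i]);
            if lhs_data[lo..lo + sz] != rhs_data[ro..ro + sz] {
                return false;
            }
        }
    }
    true
}
```

Note that sizes of null positions are deliberately ignored, matching list-view semantics where a null slot's offset/size are unspecified.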
pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
    array: &ArrayData,
) -> crate::transform::Extend<'_> {
    let offsets = array.buffer::<T>(0);
    let sizes = array.buffer::<T>(1);
    Box::new(
        move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
            let offset_buffer = &mut mutable.buffer1;
            let sizes_buffer = &mut mutable.buffer2;

            for &offset in &offsets[start..start + len] {
                offset_buffer.push(offset);
            }

            // sizes
            for &size in &sizes[start..start + len] {
                sizes_buffer.push(size);
            }

            // the beauty of views is that we don't need to copy child_data, we just splat
            // the offsets and sizes.
        },
    )
}
Critical: Offsets not rebased and child_data not extended when concatenating; output can be invalid.
When MutableArrayData combines multiple ListView arrays, you must:
- Extend mutable.child_data[0] with the referenced value range(s).
- Rebase pushed offsets by the base child length so they still point to the concatenated values.
Without this, offsets from later input arrays can point before appended regions, and child_data remains empty, yielding corrupted arrays.
Apply this fix to rebase offsets and extend child values in one contiguous span covering the selected rows (skipping null/empty rows):
@@
-pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
+pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
array: &ArrayData,
) -> crate::transform::Extend<'_> {
let offsets = array.buffer::<T>(0);
let sizes = array.buffer::<T>(1);
+ let nulls = array.nulls().cloned();
Box::new(
- move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
- let offset_buffer = &mut mutable.buffer1;
- let sizes_buffer = &mut mutable.buffer2;
-
- for &offset in &offsets[start..start + len] {
- offset_buffer.push(offset);
- }
-
- // sizes
- for &size in &sizes[start..start + len] {
- sizes_buffer.push(size);
- }
-
- // the beauty of views is that we don't need to copy child_data, we just splat
- // the offsets and sizes.
- },
+ move |mutable: &mut _MutableArrayData, index: usize, start: usize, len: usize| {
+ let end = start + len;
+ let offset_buffer = &mut mutable.buffer1;
+ let sizes_buffer = &mut mutable.buffer2;
+
+ // Compute a contiguous value range covering all selected non-empty, non-null lists
+ let mut min_off: Option<usize> = None;
+ let mut max_end: usize = 0;
+ for i in start..end {
+ // Skip null rows
+ if let Some(ref n) = nulls {
+ if !n.is_valid(i) {
+ continue;
+ }
+ }
+ let sz = sizes[i];
+ if sz == T::zero() {
+ continue;
+ }
+ let off = offsets[i].as_usize();
+ let list_end = off
+ .checked_add(sz.as_usize())
+ .expect("ListView size overflow");
+ min_off = Some(min_off.map_or(off, |m| m.min(off)));
+ if list_end > max_end {
+ max_end = list_end;
+ }
+ }
+
+ // Base shift where newly appended child values will start
+ let base_child_len = mutable.child_data[0].len();
+
+ // If there are any values to copy, extend child_data once with the covering span
+ let delta_usize = if let Some(min) = min_off {
+ mutable.child_data[0].extend(index, min, max_end);
+ base_child_len
+ .checked_sub(min)
+ .expect("underflow computing ListView base shift")
+ } else {
+ 0
+ };
+
+ let delta: T = T::from_usize(delta_usize)
+ .expect("ListView base shift does not fit target offset type");
+
+ // Push rebased offsets and original sizes
+ for i in start..end {
+ let off = offsets[i];
+ let rebased = off.checked_add(&delta).expect("ListView offset overflow");
+ offset_buffer.push(rebased);
+ }
+ for i in start..end {
+ sizes_buffer.push(sizes[i]);
+ }
+ },
)
}

Add a unit test that concatenates two ListView arrays via MutableArrayData and asserts offsets point into the concatenated values.
I can add the test and wire it into array_transform tests if you want.
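The rebasing invariant can be sketched without `MutableArrayData`: when the values of a second list-view array are appended after the first's, every offset from the second array must be shifted by the number of values already present. Below is a simplified model that copies each input's values wholesale (the suggested fix copies only the covering span, but the offset arithmetic is the same); all names are illustrative:

```rust
/// Concatenate two list-view arrays represented as (offsets, sizes, values),
/// rebasing the second array's offsets past the first array's values.
fn concat_list_views(
    a_offsets: &[usize], a_sizes: &[usize], a_values: &[i32],
    b_offsets: &[usize], b_sizes: &[usize], b_values: &[i32],
) -> (Vec<usize>, Vec<usize>, Vec<i32>) {
    let mut values = a_values.to_vec();
    let base = values.len(); // rebase amount for b's offsets
    values.extend_from_slice(b_values);

    let mut offsets = a_offsets.to_vec();
    // Without this shift, b's offsets would point into a's values.
    offsets.extend(b_offsets.iter().map(|&o| o + base));

    let mut sizes = a_sizes.to_vec();
    sizes.extend_from_slice(b_sizes);
    (offsets, sizes, values)
}
```

Any correct extend implementation must preserve the invariant that `offset + size <= values.len()` for every view after concatenation, which is exactly what the unasjusted offsets violate.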
c.bench_function("cast runs of 10 string to ree<int32>", |b| {
    let source_array: Int32Array = (0..8192).map(|i| i / 10).collect();
    let array_ref = Arc::new(source_array) as ArrayRef;
    let target_type = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::Int32, true)),
    );
    b.iter(|| cast(&array_ref, &target_type).unwrap());
});
Bench name/type mismatch.
Label says “string” but the source array is Int32. Update for clarity.
- c.bench_function("cast runs of 10 string to ree<int32>", |b| {
+ c.bench_function("cast runs of 10 int32 to ree<int32>", |b| {
if start > self.len() || start + length > self.len() {
    return Err(eof_err!(
        "Expected to read {} bytes at offset {}, while file has length {}",
        length,
        start,
        self.len()
    ));
}
Potential integer overflow in bounds check.
The expression start + length on line 138 can overflow since both operands are usize. If overflow occurs, the wrapped value may incorrectly pass the bounds check, allowing out-of-bounds access.
Consider using checked arithmetic or restructuring the condition:
- if start > self.len() || start + length > self.len() {
+ if start > self.len() || length > self.len() - start {

Or with explicit overflow handling:
- if start > self.len() || start + length > self.len() {
+ let end = start.checked_add(length).ok_or_else(|| {
+ ParquetError::EOF(format!(
+ "Expected to read {} bytes at offset {}, while file has length {}",
+ length, start, self.len()
+ ))
+ })?;
+ if end > self.len() {
return Err(eof_err!(
"Expected to read {} bytes at offset {}, while file has length {}",
length,
start,
self.len()
));
}
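Both proposed forms of the bounds check can be demonstrated in isolation. A sketch, where `file_len` stands in for `self.len()` and an out-of-range or overflowing request simply returns `false` instead of an error:

```rust
/// Rearranged form: validate `start` first, then compare against the
/// remaining capacity. No addition is performed, so nothing can overflow
/// (the naive `start + length` wraps in release builds and panics in debug).
fn in_bounds_subtract(file_len: usize, start: usize, length: usize) -> bool {
    start <= file_len && length <= file_len - start
}

/// Equivalent form using checked arithmetic, treating overflow as
/// out of bounds.
fn in_bounds_checked(file_len: usize, start: usize, length: usize) -> bool {
    match start.checked_add(length) {
        Some(end) => end <= file_len,
        None => false,
    }
}
```

With `start = 1` and `length = usize::MAX`, the naive sum wraps to 0 and would incorrectly pass the original check; both forms above correctly reject it.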
if offset > buffer.len() || offset > uncompressed_page_size {
    return Err(general_err!("Invalid page header"));
}
Redundant condition in offset validation.

The check `offset > uncompressed_page_size` at line 395 is redundant. The earlier validation at lines 368-382 already ensures that `definition_levels_byte_length + repetition_levels_byte_length <= uncompressed_page_size`, and since `offset` is calculated from these same values (lines 383-385), the condition `offset > uncompressed_page_size` can never be true at this point.

The meaningful part of this validation is `offset > buffer.len()`, which correctly checks whether the received buffer is too small.

Consider simplifying to:
```diff
-if offset > buffer.len() || offset > uncompressed_page_size {
+if offset > buffer.len() {
     return Err(general_err!("Invalid page header"));
 }
```

📝 Committable suggestion
```diff
-if offset > buffer.len() || offset > uncompressed_page_size {
-    return Err(general_err!("Invalid page header"));
-}
+if offset > buffer.len() {
+    return Err(general_err!("Invalid page header"));
+}
```
🤖 Prompt for AI Agents

In `parquet/src/file/serialized_reader.rs` around lines 395 to 397, the condition `offset > uncompressed_page_size` is redundant because earlier checks guarantee that the sum used to compute `offset` cannot exceed `uncompressed_page_size`. Simplify the validation by removing the redundant comparison and checking only `offset > buffer.len()` to detect a too-small buffer, preserving the existing error return path and message.
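The redundancy claim can be checked mechanically: once the earlier validation has established that the levels sum does not exceed the page size, the `offset > uncompressed_page_size` branch is dead. A small standalone sketch (variable names are illustrative, not the crate's) that exercises all small inputs:

```rust
// Standalone check of the redundancy argument: for every pair of level
// lengths that passes the earlier `sum <= page_size` validation, the
// computed offset can never exceed page_size.
fn main() {
    for page_size in 0usize..=50 {
        for def_len in 0..=page_size {
            for rep_len in 0..=(page_size - def_len) {
                // By construction the first validation passed:
                // def_len + rep_len <= page_size.
                let offset = def_len + rep_len;
                assert!(
                    offset <= page_size,
                    "redundant `offset > page_size` branch would have fired"
                );
            }
        }
    }
    println!("offset never exceeded page_size after the earlier check");
}
```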
```rust
#[test]
fn test_decode_page_invalid_offset() {
    let page_header = PageHeader {
        r#type: PageType::DATA_PAGE_V2,
        uncompressed_page_size: 10,
        compressed_page_size: 10,
        data_page_header: None,
        index_page_header: None,
        dictionary_page_header: None,
        crc: None,
        data_page_header_v2: Some(DataPageHeaderV2 {
            num_nulls: 0,
            num_rows: 0,
            num_values: 0,
            encoding: Encoding::PLAIN,
            definition_levels_byte_length: 11,
            repetition_levels_byte_length: 0,
            is_compressed: None,
            statistics: None,
        }),
    };

    let buffer = Bytes::new();
    let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err();
    assert!(
        err.to_string()
            .contains("DataPage v2 header contains implausible values")
    );
}
```
Test doesn't cover the new validation logic.

The test `test_decode_page_invalid_offset` validates the existing check at lines 368-382 (where `definition_levels_byte_length + repetition_levels_byte_length > uncompressed_page_size`), but doesn't reach the new validation added at lines 395-397.

To test the new code path at lines 395-397, the test would need to:
- Pass the first validation (set the levels so their sum ≤ `uncompressed_page_size`)
- Provide an empty or small buffer so `offset > buffer.len()` triggers
- Include a decompressor

Consider renaming this test to reflect what it actually tests, or add a separate test that specifically validates the `offset > buffer.len()` check at line 395.
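To make the missing coverage concrete, here is a standalone model of the two validations. The `validate_v2` function and its parameters are hypothetical stand-ins for the logic in `decode_page`, not the parquet crate's API; it shows inputs that pass the first check but trip the new buffer-length check:

```rust
// Hypothetical standalone model of the DataPage v2 validations discussed
// above; the error strings mirror the review, not real crate output.
fn validate_v2(
    def_len: usize,
    rep_len: usize,
    page_size: usize,
    buf_len: usize,
) -> Result<(), &'static str> {
    // First validation (lines 368-382 in the review): implausible header values.
    let offset = def_len
        .checked_add(rep_len)
        .filter(|&o| o <= page_size)
        .ok_or("DataPage v2 header contains implausible values")?;
    // New validation (lines 395-397): the buffer must hold the level data.
    if offset > buf_len {
        return Err("Invalid page header");
    }
    Ok(())
}

fn main() {
    // The existing test's inputs (def_len = 11 > page_size = 10) stop at the
    // first check and never reach the new one.
    assert_eq!(
        validate_v2(11, 0, 10, 0),
        Err("DataPage v2 header contains implausible values")
    );
    // A levels sum of 8 <= page_size 10 passes the first check, but an empty
    // buffer then triggers the new `offset > buffer.len()` path.
    assert_eq!(validate_v2(8, 0, 10, 0), Err("Invalid page header"));
    // With a large enough buffer, both checks pass.
    assert_eq!(validate_v2(8, 0, 10, 8), Ok(()));
}
```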
The fork's main branch was outdated and the PR shows a bigger diff than the actual one.