8728: fix: ArrayIter does not report size hint correctly after advancing from the iterator back#10

Closed
martin-augment wants to merge 19 commits into main from pr-8728-2025-10-28-06-36-21

Conversation

@martin-augment
Owner

8728: To review by AI

etseidl and others added 19 commits October 24, 2025 10:23
…che#8706)

# Which issue does this PR close?

N/A

# Rationale for this change

It is not obvious that the thrift macros produce public enums only (e.g.
see
apache#8680 (comment)).
This should be made clear in the documentation.

# What changes are included in this PR?

Add said clarification.

# Are these changes tested?

Documentation only, so no tests required.

# Are there any user-facing changes?

No, only changes to private documentation
# Which issue does this PR close?

- Closes apache#8691

# Rationale for this change

The `README.md` file in `arrow-avro` instructs users to install version
56. This is invalid and should be changed to version 57.

# What changes are included in this PR?

Updated the `README.md` file to reference version 57.

# Are these changes tested?

N/A since this is a small `README.md` file change.

# Are there any user-facing changes?

The `README.md` file in `arrow-avro` now instructs users to install
version 57.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

chore: add test case of `RowSelection::trim`

…DING_SLOTS (apache#8663)

# Which issue does this PR close?

- Closes [apache#8662]

# Rationale for this change

Related to apache#8607
We need to know how many encodings are supported in order to size the decoder slots.

# What changes are included in this PR?
Updated the `thrift_enum` macro to expose the variant count of the `Encoding` enum; the value is passed to `EncodingMask` and `ENCODING_SLOTS`.


# Are these changes tested?
1. Originally I think add a UT can prevent failure after the new
encoding are introduced, then I realized the counts are already
transferred, the UT is not required, the original tests can already
cover the code.

# Are there any user-facing changes?
No
- A follow up from apache#8625

# Rationale for this change

While working on a separate task, I noticed `create_test_variant_array`
was redundant. Since `VariantArray` can already be constructed directly
from an iterator of Variants, this PR removes the now-unnecessary test
helper.
# Which issue does this PR close?

- Closes apache#8685.

# What changes are included in this PR?

In the implementation of `RowConverter::from_binary`, the `BinaryArray`
is broken into parts and an attempt is made to convert the data buffer
into `Vec` at no copying cost with `Buffer::into_vec`. Only if this
fails, the data is copied out for a newly allocated `Vec`.

# Are these changes tested?

Passes existing tests using `RowConverter::from_binary`, which all
convert a non-shared buffer taking advantage of the optimization.
Another test is added to cover the copying path.

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes apache#8692.

# Rationale for this change

Explained in issue.

# What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for
this is identical to apache#8583

# Are these changes tested?

Covered by existing tests for `filter_record_batch`

# Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
# Which issue does this PR close?

- Part of apache#7806 
# Rationale for this change

The `DeltaBitPackDecoder` can panic if it encounters a bit width in the
encoded data that is larger than the bit width of the data type being
decoded.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…8680)

# Which issue does this PR close?

- Part of apache#5853.

# Rationale for this change

While converting to the new Thrift model, the `ConvertedType` enum was
done manually due to the `NONE` variant, which used the discriminant of
`0`. This PR changes that to `-1` which allows the `thrift_enum` macro
to be used instead. This improves code maintainability.

# What changes are included in this PR?

See above.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No, this only changes the discriminant value for a unit variant enum.
# Which issue does this PR close?

- Contributes towards the RunEndEncoded (REE) epic apache#3520, but there is no specific issue for casting.
- Replaces PRs apache#7713 and
apache#8384.

# Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache
Arrow.

# What changes are included in this PR?
- `run_end_encoded_cast` in `arrow-cast/src/cast/run_array.rs`
- `cast_to_run_end_encoded` in `arrow-cast/src/cast/run_array.rs`
- Tests in `arrow-cast/src/cast/mod.rs`

# Are these changes tested?
Yes!

# Are there any user-facing changes?

No breaking changes, just new functionality

---------

Co-authored-by: Richard Baah <richard.baah@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…erator (apache#8696)

# Which issue does this PR close?

N/A

# Rationale for this change

Overriding these functions improves performance over the fallback implementations.

# What changes are included in this PR?

Override implementation of:
- `count` which is not optimized away even when `ExactSizeIterator` is
implemented
- `nth` to avoid calling `next` `n + 1` times (which is also used when
doing `.skip`)
- `nth_back`
- `last`
- `max`

# Are these changes tested?

Yes, I've added a lot of tests

# Are there any user-facing changes?

Nope
# Rationale for this change

We've caused some unexpected panics from our internal testing. We've put
in error checks for all of these so that they don't affect other users.

# What changes are included in this PR?

Various error checks to ensure panics don't occur.

# Are these changes tested?
Tests should continue to pass.

Existing tests should cover these changes.

# Are there any user-facing changes?
None.

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close?

Part of apache#5375

Vortex was encountering some issues after we switched our preferred List
type to `ListView`, the first thing we noticed was that
`arrow_select::filter_array` would fail on ListView (and LargeListView,
though we don't use that).

This PR addresses some missing select kernel implementations for
ListView and LargeListView.

This also fixes an existing bug in the ArrayData validation for ListView
arrays that would trigger an out of bounds index panic.

# Are these changes tested?

- [x] filter_array
- [x] concat
- [x] take


# Are there any user-facing changes?

ListView/LargeListView can now be used with the `take`, `concat` and
`filter_array` kernels

You can now use `PartialEq` to compare ListView arrays.

---------

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
…ulation (apache#8671)

# Which issue does this PR close?

- Closes apache#8472.

# Rationale for this change

Makes the metadata heap size calculation more accurate when reading
encrypted Parquet files, which helps to better manage caches of Parquet
metadata.

# What changes are included in this PR?

* Accounts for heap allocations related to the `FileDecryptor` in
`ParquetMetaData`
* Fixes the `HeapSize` implementation for `Arc<T>` so the size of `T` is
included, as well as the reference counts that are stored on the heap
* Fixes the heap size of type pointers within `ColumnDescriptor` being
included twice

## Not included
* Accounting for any heap allocations in a user-provided `KeyRetriever`

# Are these changes tested?

Yes, there's a new unit test added that computes the heap size with a
decryptor.

I also did a manual test that created a test Parquet file with 100
columns using per-column encryption keys, and loaded 10,000 copies of
the `ParquetMetaData` into a vector. `heaptrack` reported 1.136 GB
memory heap allocated in this test program. Prior to this change, the
sum of the metadata was reported as 879.2 MB, and afterwards it was
1.136 GB.

# Are there any user-facing changes?

No

This was co-authored by @etseidl

---------

Co-authored-by: Ed Seidl <etseidl@live.com>
…n Parquet variant casts (apache#8689)

# Which issue does this PR close?

- Closes apache#8670.

# Rationale for this change

We currently have two separate code paths that both handle decimal
casting between different (precision, scale) pairs. Without unifying the
logic, a fix in one place often needs to be duplicated in the other
(e.g., apache#8579 fixed the
`arrow-cast`
and apache#8552 fixed the `parquet-variant-compute`), which can easily lead to
divergence when contributors lack full context. This PR consolidates the
decimal rescale logic for both `arrow-cast` and
`parquet-variant-compute`.

# What changes are included in this PR?

1. Extract the shared array-unary logic from
`convert_to_smaller_scale_decimal` and
`convert_to_bigger_or_equal_scale_decimal` into `apply_decimal_cast`
2. Move the rescale-closure creation into `make_upscaler` and
`make_downscaler` so that they can be used in `parquet-variant-compute`
3. Rework `rescale_decimal` in `parquet-variant-compute` to use the new
`make_upscaler` and `make_downscaler` utilities.

One challenge is incorporating the large-scale reduction path (aka the
`delta_scale` cannot fit into `I::MAX_PRECISION`) into `make_downscaler`
without hurting performance. Returning 0 directly is usually cheaper
than applying a unary operation to return zero. Therefore,
`make_downscaler` may return None, and it is the caller’s responsibility
to handle this case appropriately based on the documented behavior.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No
Closes apache#8709. 

Adds bench `cast_ree` which can be run with `cargo bench --bench
cast_ree`.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he underlying bit value (apache#8711)

# Which issue does this PR close?

- closes apache#8721

# Rationale for this change

Mask `nulls` should be treated as `false` (even if the underlying values are not 0), as described in the docs for `zip`.

# What changes are included in this PR?

Used `prep_null_mask_filter` before iterating over the mask, and added tests for both scalar and non-scalar cases (to prepare for apache#8653).

# Are these changes tested?
Yes

# Are there any user-facing changes?

Kinda
…from the iterator back

this also adds a LOT of tests, extracted from apache#8697 (which is how I found that bug)
@coderabbitai

coderabbitai bot commented Oct 28, 2025

Walkthrough

This pull request introduces ListView/ListViewArray support across data operations, adds RunEndEncoded casting capabilities, refactors decimal casting with builder patterns, enhances iterator trait implementations, improves parquet error handling and validation, adds HeapSize implementations for encryption types, and updates macro-generated thrift enum constants.

Changes

Cohort / File(s) Summary
Iterator Clone and Trait Enhancements
arrow-array/src/iterator.rs, arrow-buffer/src/util/bit_iterator.rs
ArrayIter and BitIterator now derive Clone. BitIterator implements additional Iterator methods (count, nth, last, max) and DoubleEndedIterator method (nth_back) with corresponding test coverage.
ListView Equality and Transformation
arrow-data/src/equal/list_view.rs, arrow-data/src/equal/mod.rs, arrow-data/src/transform/list_view.rs, arrow-data/src/transform/mod.rs
New modules implementing ListView equality comparison and array transformation support. Routes ListView and LargeListView through new list_view_equal and extend functions in transform pipeline.
ListView Selection Operations
arrow-select/src/concat.rs, arrow-select/src/take.rs, arrow-select/src/filter.rs
Adds concat_list_view, take_list_view, and filter_list_view functions to support ListView/LargeListView concatenation, selection, and filtering with offset and size handling.
ListView Testing and Selection Utilities
arrow/tests/array_equal.rs, arrow-select/src/zip.rs
Introduces ListView test helpers and array equality validation. Updates zip mask preprocessing to treat nulls as false via BooleanBuffer preparation.
Decimal Casting Refactoring
arrow-cast/src/cast/decimal.rs, arrow-cast/Cargo.toml
Introduces make_upscaler and make_downscaler builders returning fallible/infallible closure pairs. Adds public rescale_decimal function and apply_rescaler/apply_decimal_cast helpers. Updates convert_to_smaller_scale_decimal and convert_to_bigger_or_equal_scale_decimal to use new builder patterns. Adds arrow-ord dependency.
RunEndEncoded Casting Support
arrow-cast/src/cast/mod.rs, arrow-cast/src/cast/run_array.rs
Introduces RunEndEncoded support in can_cast_types and cast_with_options. New run_array module implements run_end_encoded_cast and cast_to_run_end_encoded for bidirectional casting between RunEndEncoded and other types. Publicly exports rescale_decimal.
RunEndEncoded Benchmarks
arrow/benches/cast_kernels.rs
Adds benchmarks for casting to RunEndEncoded from StringArray and Int32Array with various run patterns.
Data Buffer Validation
arrow-data/src/data.rs, arrow-row/src/lib.rs
Adds length matching check in validate_offsets_and_sizes for offsets and sizes buffers. Updates from_binary to use buffer deconstructing via into_parts with fallback copy strategy.
Parquet Enum Refactoring
parquet/src/basic.rs, parquet/src/parquet_macros.rs
ConvertedType enum rewritten with explicit integer discriminants. thrift_enum macro extended with VARIANTS and MAX_DISCRIMINANT public constants. EncodingMask updated to use MAX_DISCRIMINANT.
Parquet Error Handling and Validation
parquet/src/column/reader.rs, parquet/src/encodings/decoding.rs, parquet/src/encodings/rle.rs, parquet/src/file/serialized_reader.rs, parquet/tests/arrow_reader/bad_data.rs
Adds bounds checking in parse_v1_level, DictDecoder.set_data, and DeltaBitPackDecoder. Introduces check_bit_width validation. Updates decode_page for offset validation and replaces unimplemented page type with error. Adds diagnostic message to RLE decoding. Updates test assertions for error paths.
Parquet HeapSize Implementations
parquet/src/encryption/ciphers.rs, parquet/src/encryption/decrypt.rs, parquet/src/file/metadata/memory.rs, parquet/src/file/metadata/mod.rs, parquet/src/schema/types.rs
Adds HeapSize trait implementations for BlockDecryptor, RingGcmBlockDecryptor, encryption-related types (ExplicitDecryptionKeys, DecryptionKeys, FileDecryptionProperties, FileDecryptor), HashMap, Arc, and Arc<dyn HeapSize>. Updates memory_size accounting for encryption and test expectations.
Parquet Page and Column Improvements
parquet/src/column/page.rs, parquet/src/column/reader/decoder.rs, parquet/src/file/reader.rs
Page enum now derives Debug. ENCODING_SLOTS updated to use MAX_DISCRIMINANT. BytesChunkReader.get_read and get_bytes add EOF error returns for out-of-bounds access.
Parquet Reader and Selection Tests
parquet/src/arrow/arrow_reader/selection.rs, parquet/src/file/serialized_reader.rs
Adds unit tests for RowSelection.trim() and decode_page validation. Introduces DataPageHeaderV2 import for test construction.
Documentation and Version Updates
parquet/THRIFT.md, arrow-avro/README.md
Adds visibility notes to thrift_enum and thrift_union macro documentation. Bumps arrow-avro version references from 56 to 57 in README code blocks.
Variant Compute Updates
parquet-variant-compute/src/type_conversion.rs, parquet-variant-compute/src/shred_variant.rs
Removes local rescale_decimal implementation and switches to external import. Refactors test helpers in shred_variant to use VariantArray::from_iter directly.

@claude

claude bot commented Oct 28, 2025

Pull Request Review

Summary

This PR fixes a bug in ArrayIter where the size_hint() method was not correctly reporting the remaining iterator length after consuming elements from either end. The fix changes the implementation from using array.len() to using current_end, and includes comprehensive test coverage.

Code Quality ✅

Strengths:

  • Clean implementation: The fix is minimal and focused - changing only 2 lines in the core implementation (lines 101-102 in arrow-array/src/iterator.rs)
  • Good abstraction: Added Clone derive to ArrayIter which is essential for the testing approach
  • Well-documented: The existing documentation remains accurate and the code is self-explanatory

Minor observations:

  • The core fix is simple and correct: self.current_end - self.current properly tracks remaining elements
  • Safety comments in the implementation are appropriate and helpful

Test Coverage ⭐ EXCELLENT

This PR demonstrates exceptional test coverage with 930 new lines of tests:

Comprehensive test scenarios:

  • Multiple data patterns (no nulls, all nulls, partial nulls, random patterns)
  • Various iterator consumption patterns (from start, from end, both ends, until exhausted)
  • All major iterator trait methods tested:
    • next() / next_back()
    • nth() / nth_back()
    • position() / rposition()
    • find() / rfind()
    • count(), last(), any(), all()
    • for_each(), fold() / rfold()

Testing approach highlights:

  • Uses a trait-based abstraction (SharedBetweenArrayIterAndSliceIter) to test ArrayIter behavior against Rust slice iterators
  • Tests verify both return values AND remaining iterator state after operations
  • Edge cases thoroughly covered (empty iterators, single-element, all consumed, etc.)

Potential Issues 🔍

1. Test complexity trade-off:
The test code is quite complex with multiple trait abstractions (ConsumingArrayIteratorOp, MutatingArrayIteratorOp, SetupIter). While this provides excellent coverage, consider:

  • Add more inline comments explaining the test framework structure
  • The test complexity might make future maintenance harder - future contributors may find it difficult to understand or modify
  • Consider extracting a test helper module if this pattern is reused

2. Missing documentation:

  • No doc comment added to the Clone derive explaining why cloning is safe/desired
  • Consider adding a brief comment about the current_end field's role in the struct

3. Potential test data generation issue:

let no_nulls_random_values = (0..10)
    .map(|_| rng.random::<i32>())
    .map(Some)
    .collect::<Vec<Option<i32>>>();

While using a seeded RNG (line 274) ensures determinism, consider whether fixed test cases would be clearer for debugging failures.

Performance Considerations ✅

Positive:

  • The fix has zero performance impact - it changes which field is read, not the logic flow
  • size_hint() is now more accurate, which may improve performance of downstream operations that rely on accurate size hints (like collect() pre-allocation)

No concerns:

  • No additional allocations introduced
  • No algorithmic complexity changes
  • The Clone implementation is trivial (just copies a few integers and a reference)

Security Concerns ✅

No security issues identified:

  • No unsafe code changes
  • Existing safety comments remain valid
  • No potential for buffer overflows or memory safety issues
  • Input validation remains unchanged

Recommendations

High Priority

None - the core fix is solid

Medium Priority

  1. Add documentation: Consider adding a doc comment to explain the Clone trait:

    /// ArrayIter can be safely cloned as it only contains references
    /// and simple copy types (usize indices)
    #[derive(Debug, Clone)]
    pub struct ArrayIter<T: ArrayAccessor> {
  2. Test documentation: Add a module-level doc comment in the test section explaining the testing strategy:

    /// Tests validate ArrayIter behavior matches std::slice::Iter by comparing
    /// results across various consumption patterns and iterator operations.
    

Low Priority

  1. Test organization: Consider breaking the 930-line test module into sub-modules:

    • test_helpers - trait definitions and utilities
    • test_basic_iteration - next/next_back tests
    • test_consuming_operations - count/last/fold tests
    • test_mutating_operations - position/find/any tests
  2. Test readability: Some test names could be more descriptive:

  • get_int32_iterator_cases() → create_test_cases_with_various_null_patterns()

Verdict: ✅ APPROVE with minor suggestions

This is a high-quality PR that:

  • Fixes a legitimate bug correctly
  • Includes exceptional test coverage
  • Has no performance or security concerns
  • Follows project conventions

The comprehensive testing significantly strengthens confidence in the fix. The suggestions above are minor improvements that would not block merging.


Review generated by Claude Code

@github-actions

Review

  • I didn’t spot any functional bugs or regressions in the changes. The new iterator tests are thorough, and the refactors to the selection and transform kernels look consistent with the previous semantics.

Residual risks / follow-up

  • This PR touches a wide set of kernels (filter/take/zip, list view handling, etc.). It would be worth running the full Arrow test matrix, especially any integration tests that exercise list views, dictionary filtering, and parquet readers, just to be safe.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
parquet/src/encodings/decoding.rs (1)

827-837: Error message uses the wrong expected count variable.

Use mini_block_to_skip (expected) rather than mini_block_batch_size to avoid misleading diagnostics.

-                return Err(general_err!(
-                    "Expected to skip {} values from mini block got {}.",
-                    mini_block_batch_size,
-                    skip_count
-                ));
+                return Err(general_err!(
+                    "Expected to skip {} values from mini block got {}.",
+                    mini_block_to_skip,
+                    skip_count
+                ));
arrow-array/src/iterator.rs (2)

90-95: Safety comment is incorrect (initial cursor invariant)

Comment says “initialized with current = array.len()”, but new() sets current = 0 and current_end = len. Please fix to avoid misleading future maintainers.

-            // this is safe on the premise that this struct is initialized with
-            // current = array.len()
-            // and that current_end is ever only decremented
+            // This is safe because the iterator is initialized with:
+            //   current = 0 and current_end = array.len()
+            // We only increment `current` and only decrement `current_end`,
+            // and we check bounds (current < current_end) before access.

116-121: Repeat the safety comment fix for next_back

Same incorrect invariant appears here; please update consistently.

-                // this is safe on the premise that this struct is initialized with
-                // current = array.len()
-                // and that current_end is ever only decremented
+                // This is safe because the iterator is initialized with:
+                //   current = 0 and current_end = array.len()
+                // We only increment `current` and only decrement `current_end`,
+                // and we check bounds (current < current_end) before access.
🧹 Nitpick comments (32)
arrow-data/src/data.rs (1)

983-989: Defensive length check, but consider error type consistency.

Since both typed_buffer calls request self.len elements and return slices of exactly that length, this check should always pass. However, it serves as useful defensive programming.

Consider using InvalidArgumentError instead of ComputeError for consistency with other data validation errors in this file (see lines 984, 993, 1009).

Apply this diff if you prefer consistency:

-            return Err(ArrowError::ComputeError(format!(
+            return Err(ArrowError::InvalidArgumentError(format!(
                 "ListView offsets len {} does not match sizes len {}",
parquet-variant-compute/src/shred_variant.rs (2)

440-443: Consider consistent wrapping for clarity.

While functionally correct, this test uses Variant::from(...) directly, whereas line 338 uses Some(Variant::from(...)). Both patterns work with from_iter, but consistency would improve readability.

Apply this diff for consistency with line 338:

-        let input = VariantArray::from_iter(vec![
-            Variant::from(42i32),
-            Variant::from(3.15f64),
-            Variant::from("not_a_number"),
-        ]);
+        let input = VariantArray::from_iter(vec![
+            Some(Variant::from(42i32)),
+            Some(Variant::from(3.15f64)),
+            Some(Variant::from("not_a_number")),
+        ]);

874-874: Consider consistent wrapping for clarity.

Same observation as lines 440-443: this uses Variant::from(...) directly while line 338 uses Some(Variant::from(...)). Consider using consistent wrapping throughout the test file.

Apply this diff for consistency:

-        let input = VariantArray::from_iter(vec![Variant::from(42i64), Variant::from("hello")]);
+        let input = VariantArray::from_iter(vec![Some(Variant::from(42i64)), Some(Variant::from("hello"))]);
parquet/src/file/reader.rs (2)

292-321: Consider adding edge-case tests.

The current tests verify basic out-of-bounds scenarios effectively. However, additional test cases would improve coverage:

  • Boundary case: start == data.len() (should succeed for get_read)
  • Valid range: get_bytes(0, 4) or get_bytes(2, 2) where start + length == data.len()
  • Overflow scenario: very large length values
  • Zero-length reads: get_bytes(0, 0) or get_bytes(4, 0)

Example additional tests:

#[test]
fn test_bytes_chunk_reader_edge_cases() {
    let data = Bytes::from(vec![0, 1, 2, 3]);
    
    // Reading from end position should succeed
    let result = data.get_read(4);
    assert!(result.is_ok());
    
    // Valid ranges should succeed
    assert!(data.get_bytes(0, 4).is_ok());
    assert!(data.get_bytes(2, 2).is_ok());
    
    // Zero-length reads
    assert!(data.get_bytes(0, 0).is_ok());
    assert!(data.get_bytes(4, 0).is_ok());
}

122-148: Add bounds checking to test utility get_bytes methods for consistency.

The production Bytes implementation now includes defensive bounds checking (reader.rs 138-148), but the test utilities still lack validation:

  • parquet/tests/arrow_reader/io/sync_reader.rs:413-418 — slices directly without validating start + length
  • parquet/tests/arrow_reader/io/async_reader.rs:373-383 — slices directly without validating range bounds

Both will panic on invalid ranges instead of returning errors like the production code. Adding the same defensive pattern would improve consistency and provide clearer error messages in tests.

parquet/src/schema/types.rs (1)

1353-1370: Non-root group must have repetition + propagate build errors

  • Enforcing repetition for non-root types matches the parquet spec; good guard.
  • Switching to builder.build()? avoids panics and surfaces conversion errors.

Recommend also changing the zero-children group branch to use ? for consistency (currently unwrap):

-                Ok((index + 1, Arc::new(builder.build().unwrap())))
+                Ok((index + 1, Arc::new(builder.build()?)))

Consider adding a small test that a non-root group lacking repetition returns the expected error.

parquet/src/encodings/rle.rs (1)

500-520: Avoid assert! in library decode path; propagate an error instead.

The new assert panics in release builds. Prefer returning a descriptive error from reload and plumbing it to callers (get/get_batch/skip) so bad inputs don’t abort the process.

Example (sketch):

-    fn reload(&mut self) -> bool {
+    fn reload(&mut self) -> Result<bool> {
       let bit_reader = self.bit_reader.as_mut().expect("bit_reader should be set");
-      if let Some(indicator_value) = bit_reader.get_vlq_int() {
+      if let Some(indicator_value) = bit_reader.get_vlq_int() {
         ...
-        self.current_value = bit_reader.get_aligned::<u64>(value_width);
-        assert!(self.current_value.is_some(), "parquet_data_error: not enough data for RLE decoding");
+        self.current_value = bit_reader.get_aligned::<u64>(value_width);
+        if self.current_value.is_none() {
+            return Err(eof_err!("Not enough data for RLE decoding"));
+        }
         ...
-        true
+        Ok(true)
       } else {
-        false
+        Ok(false)
       }
     }

And update call sites: if !self.reload()? { ... } and in set_data: let _ = self.reload()?;.

parquet/THRIFT.md (1)

56-71: Docs clarity: note about pub visibility is helpful.

Consider linking directly to the macro definitions (parquet/src/parquet_macros.rs) for quick reference.

Also applies to: 88-89

parquet/tests/arrow_reader/bad_data.rs (1)

88-93: Make test resilient to error wording.

Exact string equality is brittle. Prefer contains/starts_with or matching the error type.

-    assert_eq!(
-        err.to_string(),
-        "External: Parquet argument error: Parquet error: Invalid or corrupted RLE bit width 254. Max allowed is 32"
-    );
+    let s = err.to_string();
+    assert!(
+        s.contains("Invalid or corrupted RLE bit width 254"),
+        "unexpected error: {s}"
+    );
parquet/src/encodings/decoding.rs (2)

645-657: Bit-width validator is appropriate; minor polish.

Great addition. Nit: remove the trailing space in the error message for consistency.

-            return Err(general_err!(
-                "Invalid delta bit width {} which is larger than expected {} ",
-                bit_width,
-                std::mem::size_of::<T::T>() * 8
-            ));
+            return Err(general_err!(
+                "Invalid delta bit width {} which is larger than expected {}",
+                bit_width,
+                std::mem::size_of::<T::T>() * 8
+            ));

751-753: Validate miniblock bit-widths at use sites — good.

Calling check_bit_width in both get and skip catches invalid pages early. Consider also validating all widths once in next_block() to fail fast.

Also applies to: 822-824
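A minimal sketch of that fail-fast idea, with illustrative names (`check_bit_width` mirrors the existing validator; `value_bits` stands in for `size_of::<T::T>() * 8`):

```rust
// Hedged sketch: validate every miniblock bit width when the block header is
// read, instead of deferring to the first get/skip that touches a miniblock.
// `value_bits` stands in for size_of::<T::T>() * 8; names are illustrative.
fn check_bit_width(bit_width: u8, value_bits: usize) -> Result<(), String> {
    if bit_width as usize > value_bits {
        Err(format!(
            "Invalid delta bit width {bit_width} which is larger than expected {value_bits}"
        ))
    } else {
        Ok(())
    }
}

// Called once per block, e.g. from a next_block()-style method.
fn validate_block_widths(widths: &[u8], value_bits: usize) -> Result<(), String> {
    widths.iter().try_for_each(|&w| check_bit_width(w, value_bits))
}
```

With up-front validation, a corrupt page fails at header-read time and the per-call checks in `get`/`skip` become a second line of defense.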

parquet/src/basic.rs (1)

65-69: ConvertedType via thrift_enum looks good; doc nit.

The enum definition with explicit discriminants (including NONE = -1) matches usage across conversions and tests. Nit: “This struct was renamed…” → “This enum was renamed…”.

-/// This struct was renamed from `LogicalType` in version 4.0.0.
+/// This enum was renamed from `LogicalType` in version 4.0.0.

Also applies to: 75-169

arrow-array/src/iterator.rs (3)

1186-1190: Seed RNG for deterministic tests

Use a locally seeded StdRng to avoid nondeterminism across test runs (requires `rand::Rng`, `rand::SeedableRng`, and `rand::rngs::StdRng` in scope).

-        let random_values = (0..100).map(|_| rand::random_bool(0.5)).collect::<Vec<_>>();
+        let mut rng = StdRng::seed_from_u64(1);
+        let random_values = (0..100)
+            .map(|_| rng.random_bool(0.5))
+            .collect::<Vec<_>>();

986-988: Nit: message string for AllOp

The label says “false” but this tests all(); suggest “true”.

-            fn name(&self) -> String {
-                format!("all with {} false returned", self.true_count)
-            }
+            fn name(&self) -> String {
+                format!("all with {} true returned", self.true_count)
+            }

149-149: Add targeted size_hint regression test for ArrayIter

Locks the fix against regressions by asserting size_hint after mixed next/next_back.

+    #[test]
+    fn assert_array_iter_size_hint_after_next_back_and_next() {
+        let a = Int32Array::from(vec![Some(1), None, Some(3), Some(4)]);
+        let mut it = ArrayIter::new(&a);
+        assert_eq!(it.size_hint(), (4, Some(4)));
+        assert_eq!(it.next_back(), Some(Some(4)));
+        assert_eq!(it.size_hint(), (3, Some(3)));
+        assert_eq!(it.next(), Some(Some(1)));
+        assert_eq!(it.size_hint(), (2, Some(2)));
+        // consume remaining
+        let _: Vec<_> = it.collect();
+        assert_eq!(it.size_hint(), (0, Some(0)));
+    }
arrow-buffer/src/util/bit_iterator.rs (1)

156-175: nth_back(): correct; tweak comment wording

Logic matches std semantics, including exhausting on out-of-range. Minor doc nit: subtract from end_offset, not “current offset”.

-        // so subtracting n to the current offset and not n - 1
+        // so subtracting n from end_offset (and not n - 1)
arrow-select/src/zip.rs (2)

131-133: Treating NULL as false is correct; use borrowed buffer when possible

Current flow clones the values buffer even when mask has no nulls. Borrowing avoids an allocation.

-    let mask = maybe_prep_null_mask_filter(mask);
-    SlicesIterator::from(&mask).for_each(|(start, end)| {
+    // Borrow when no nulls, allocate only if needed
+    if mask.null_count() == 0 {
+        SlicesIterator::from(mask.values()).for_each(|(start, end)| {
+            // body unchanged
+        });
+    } else {
+        let cleaned = maybe_prep_null_mask_filter(mask);
+        SlicesIterator::from(&cleaned).for_each(|(start, end)| {
+            // body unchanged
+        });
+    }

Note: requires changing maybe_prep_null_mask_filter to return BooleanBuffer and keeping both paths.


171-180: Co-locate mask-prep helper with filtering utilities

maybe_prep_null_mask_filter duplicates filter semantics; consider moving it into crate::filter next to prep_null_mask_filter and re-exporting for reuse.

arrow-select/src/concat.rs (2)

244-256: Allocate offsets buffer by bytes to reduce reallocations

MutableBuffer::with_capacity expects bytes; the current capacity is in elements. Precompute the element count and multiply by size_of::<OffsetSize>().

-    let mut offsets = MutableBuffer::with_capacity(lists.iter().map(|l| l.offsets().len()).sum());
+    let offset_len: usize = lists.iter().map(|l| l.offsets().len()).sum();
+    let mut offsets =
+        MutableBuffer::with_capacity(offset_len * std::mem::size_of::<OffsetSize>());

This avoids repeated growth on large inputs.


238-253: Optional: avoid carrying unused child values for sliced ListView inputs

For sliced ListView arrays, values() returns the full child; concatenating full children increases memory. Consider computing the minimal [min_offset, max_offset) used per input and concatenating sliced children instead, adjusting offsets by (offset - min_offset) + global_offset.
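The suggested trimming can be sketched as follows, under the assumption that offsets and sizes are plain `i32` slices (the real code would work with the generic `OffsetSize` and arrow buffer types):

```rust
// Hedged sketch: compute the minimal child range [min_offset, max_offset)
// actually referenced by a ListView, and rebase offsets against it so only
// that slice of the child needs to be concatenated.
fn trim_child_range(offsets: &[i32], sizes: &[i32]) -> (i32, i32, Vec<i32>) {
    let mut min_off = i32::MAX;
    let mut max_off = 0;
    for (&o, &s) in offsets.iter().zip(sizes) {
        if s > 0 {
            min_off = min_off.min(o);
            max_off = max_off.max(o + s);
        }
    }
    if min_off == i32::MAX {
        min_off = 0; // every list empty: nothing referenced
    }
    // the per-input global_offset for concatenation would be added on top
    let rebased = offsets.iter().map(|&o| o - min_off).collect();
    (min_off, max_off, rebased)
}
```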

arrow/tests/array_equal.rs (1)

865-872: Add a guard test for size mismatch on non-null ListView elements

To catch the null-path size check bug (now fixed), add a case where sizes differ for a non-null slot.

+#[test]
+fn test_list_view_size_mismatch_is_unequal() {
+    // a: [[1,2,3]], b: [[1,2,3,4]] -> unequal
+    let a = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3)])]);
+    let b = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3), Some(4)])]);
+    test_equal(&a, &b, false);
+}

This ensures per-element size equality is enforced.

arrow/benches/cast_kernels.rs (1)

363-401: Optional: factor target type construction.
DRY the RunEndEncoded<Int32, T> construction with a small helper to avoid duplication across benches.

arrow-cast/src/cast/mod.rs (5)

144-146: Guard can_cast_types for valid RunEndEncoded index types

Today this can return true even if run_ends uses an unsupported index type (leading to can_cast_types=true but cast failing later). Add an index-type filter to keep capability checks aligned with cast behavior.

-        (RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type),
-        (_, RunEndEncoded(_, value_type)) => can_cast_types(from_type, value_type.data_type()),
+        (RunEndEncoded(index_type, value_type), _)
+            if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) =>
+        {
+            can_cast_types(value_type.data_type(), to_type)
+        }
+        (_, RunEndEncoded(index_type, value_type))
+            if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) =>
+        {
+            can_cast_types(from_type, value_type.data_type())
+        }

11452-11467: Fix misleading test comment and variable name

The test casts REE -> Primitive Int64 and downcasts to StringArray/Int64Array, not a RunArray. Tweak the comments/variable name for clarity.

-    // Cast to Int64
+    // Cast to Int64 (flatten REE to a primitive array)
@@
-    // Verify the result is a RunArray with Int64 values
-    let result_run_array = cast_result.as_any().downcast_ref::<Int64Array>().unwrap();
+    // Verify the result is a flat Int64Array with expanded values
+    let result_values = cast_result.as_any().downcast_ref::<Int64Array>().unwrap();
@@
-            result_run_array.values(),
+            result_values.values(),

11469-11485: Tighten test comment wording

Result is a StringArray, not a RunArray; update comment to avoid confusion.

-    // Verify the result is a RunArray with String values
+    // Verify the result is a StringArray with expanded values

11717-11734: Align test comment with intent

This test checks Struct -> FixedSizeBinary cast failure; remove the unrelated RunEndEncoded note in the comment.

-        // This should fail because:
-        // 1. The target type is not RunEndEncoded
-        // 2. The target type is not supported for casting from StructArray
+        // This should fail because casting from StructArray to FixedSizeBinary is not supported

11548-11556: Strengthen null-run assertions

You already assert null_count on values(). Also assert which runs are null to make intent explicit.

         assert_eq!(result_run_array.values().null_count(), 3);
+        // Optionally assert positions of null runs in the values array
+        let vals = result_run_array.values().as_primitive::<Int32Type>();
+        assert!(vals.is_null(1)); // run representing consecutive nulls
+        assert!(vals.is_null(4));
+        assert!(vals.is_null(7));
parquet/src/encryption/ciphers.rs (1)

54-59: HeapSize for RingGcmBlockDecryptor returns 0 — add a note or guard for ring changes.
If ring ever allocates for key schedules, this will under‑report. Consider a brief comment referencing the specific ring version tested.

parquet/src/file/metadata/memory.rs (1)

54-94: HashMap heap sizing: make group_size arch‑aware to reduce test brittleness.
Current 16-byte group size can be smaller on 32‑bit, causing platform‑dependent results. Consider:

-        // Overhead for the control tags group, which may be smaller depending on architecture
-        let group_size = 16;
+        // Overhead for the control tags group; 16 on 64‑bit, typically 8 on 32‑bit
+        let group_size = if cfg!(target_pointer_width = "64") { 16 } else { 8 };

Also keep the “approximation” disclaimer; update if hashbrown behavior changes.

parquet/src/file/metadata/mod.rs (3)

1884-1887: Avoid brittle exact byte constants in tests.
Given HashMap/Arc sizing is approximate and arch‑dependent, strict equality to 2766/2934 may fail cross‑platform or with std/hashbrown changes. Prefer arch‑aware guards or relative assertions (e.g., >= previous baseline), or compute expected from the same sizing helpers.

I can propose a patch to make these assertions resilient if you prefer.


1915-1918: Same concern: platform‑sensitive constants.
Use arch‑aware cfg or relative checks to prevent flakes.


1924-1997: New decryptor memory test: prefer resilient assertions.
Exact 2058/3072 deltas are likely to drift across platforms and versions. Suggest asserting that adding a decryptor increases memory_size by a positive amount and by at least the decryptor’s reported HeapSize, instead of hard‑coding totals.

I can draft a version that derives expectations from heap_size() calls to avoid magic numbers.
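One way to phrase such a resilient assertion, sketched with hypothetical stand-ins (the `FileDecryptor` struct and its `heap_size` here are illustrative, not the crate's actual definitions):

```rust
// Hedged sketch: derive the expectation from heap_size() instead of
// hard-coding platform-sensitive byte totals.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

// Hypothetical stand-in for the real decryptor type.
struct FileDecryptor {
    footer_key: Vec<u8>,
}

impl HeapSize for FileDecryptor {
    // len() used as a deterministic stand-in for capacity-based accounting
    fn heap_size(&self) -> usize {
        self.footer_key.len()
    }
}

// Adding a component should grow the total by at least its own heap size.
fn grew_by_component(before: usize, after: usize, component: &impl HeapSize) -> bool {
    after > before && after - before >= component.heap_size()
}
```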

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3b262 and ae5e961.

📒 Files selected for processing (38)
  • arrow-array/src/iterator.rs (4 hunks)
  • arrow-avro/README.md (1 hunks)
  • arrow-buffer/src/util/bit_iterator.rs (5 hunks)
  • arrow-cast/Cargo.toml (2 hunks)
  • arrow-cast/src/cast/decimal.rs (4 hunks)
  • arrow-cast/src/cast/mod.rs (6 hunks)
  • arrow-cast/src/cast/run_array.rs (1 hunks)
  • arrow-data/src/data.rs (1 hunks)
  • arrow-data/src/equal/list_view.rs (1 hunks)
  • arrow-data/src/equal/mod.rs (3 hunks)
  • arrow-data/src/transform/list_view.rs (1 hunks)
  • arrow-data/src/transform/mod.rs (6 hunks)
  • arrow-row/src/lib.rs (2 hunks)
  • arrow-select/src/concat.rs (11 hunks)
  • arrow-select/src/filter.rs (7 hunks)
  • arrow-select/src/take.rs (5 hunks)
  • arrow-select/src/zip.rs (4 hunks)
  • arrow/benches/cast_kernels.rs (1 hunks)
  • arrow/tests/array_equal.rs (2 hunks)
  • parquet-variant-compute/src/shred_variant.rs (4 hunks)
  • parquet-variant-compute/src/type_conversion.rs (1 hunks)
  • parquet/THRIFT.md (3 hunks)
  • parquet/src/arrow/arrow_reader/selection.rs (1 hunks)
  • parquet/src/basic.rs (2 hunks)
  • parquet/src/column/page.rs (1 hunks)
  • parquet/src/column/reader.rs (2 hunks)
  • parquet/src/column/reader/decoder.rs (1 hunks)
  • parquet/src/encodings/decoding.rs (6 hunks)
  • parquet/src/encodings/rle.rs (1 hunks)
  • parquet/src/encryption/ciphers.rs (3 hunks)
  • parquet/src/encryption/decrypt.rs (5 hunks)
  • parquet/src/file/metadata/memory.rs (2 hunks)
  • parquet/src/file/metadata/mod.rs (3 hunks)
  • parquet/src/file/reader.rs (2 hunks)
  • parquet/src/file/serialized_reader.rs (4 hunks)
  • parquet/src/parquet_macros.rs (5 hunks)
  • parquet/src/schema/types.rs (2 hunks)
  • parquet/tests/arrow_reader/bad_data.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (26)
arrow-array/src/iterator.rs (1)
arrow-array/src/array/mod.rs (19)
  • iter (601-601)
  • iter (609-611)
  • iter (618-620)
  • iter (632-632)
  • iter (636-638)
  • iter (641-643)
  • array (944-944)
  • array (953-953)
  • array (962-962)
  • array (970-970)
  • array (980-980)
  • array (994-994)
  • array (1008-1008)
  • array (1020-1020)
  • array (1042-1042)
  • array (1132-1135)
  • len (155-155)
  • len (366-368)
  • len (445-447)
parquet/src/encryption/ciphers.rs (3)
parquet/src/encryption/decrypt.rs (4)
  • heap_size (276-278)
  • heap_size (301-310)
  • heap_size (358-360)
  • heap_size (583-587)
parquet/src/file/metadata/memory.rs (6)
  • heap_size (41-41)
  • heap_size (45-51)
  • heap_size (55-93)
  • heap_size (97-100)
  • heap_size (104-108)
  • heap_size (112-114)
parquet/src/file/metadata/mod.rs (1)
  • heap_size (931-933)
parquet/src/file/metadata/mod.rs (2)
parquet/src/encryption/decrypt.rs (4)
  • builder (365-367)
  • aad_prefix (378-380)
  • footer_key (389-397)
  • column_key (400-418)
parquet/src/encryption/encrypt.rs (5)
  • builder (103-105)
  • aad_file_unique (330-332)
  • aad_prefix (138-140)
  • footer_key_metadata (113-115)
  • footer_key (118-120)
arrow-cast/src/cast/run_array.rs (2)
arrow-cast/src/cast/mod.rs (25)
  • cast (345-347)
  • run_end_encoded_cast (799-799)
  • run_end_encoded_cast (800-800)
  • run_end_encoded_cast (801-801)
  • array (375-375)
  • array (381-381)
  • array (389-389)
  • array (395-395)
  • array (411-411)
  • array (413-413)
  • array (424-424)
  • array (427-427)
  • array (438-438)
  • array (455-467)
  • array (479-481)
  • array (508-524)
  • array (535-535)
  • array (545-546)
  • array (550-551)
  • array (555-556)
  • cast_with_options (751-2202)
  • None (9986-9986)
  • cast_to_run_end_encoded (809-809)
  • cast_to_run_end_encoded (814-814)
  • cast_to_run_end_encoded (819-819)
arrow-ord/src/partition.rs (1)
  • partition (127-153)
arrow-data/src/equal/list_view.rs (2)
arrow-data/src/data.rs (3)
  • count_nulls (51-62)
  • len (402-404)
  • len (1942-1945)
arrow-data/src/equal/mod.rs (4)
  • equal (161-166)
  • equal_values (62-129)
  • list_view_equal (109-109)
  • list_view_equal (110-110)
arrow/tests/array_equal.rs (2)
arrow-select/src/take.rs (3)
  • take (86-102)
  • data (1026-1028)
  • indices (702-713)
arrow-array/src/array/list_view_array.rs (2)
  • Buffer (1097-1097)
  • test_list_view_array (586-619)
arrow-select/src/zip.rs (1)
arrow-select/src/filter.rs (10)
  • filter (155-167)
  • filter (314-316)
  • filter (1459-1461)
  • filter (1489-1491)
  • prep_null_mask_filter (122-126)
  • from (66-68)
  • new (60-62)
  • new (89-93)
  • new (217-231)
  • new (636-653)
parquet/src/schema/types.rs (2)
parquet/src/record/reader.rs (1)
  • repetition (534-545)
parquet_derive/src/parquet_field.rs (1)
  • repetition (751-757)
arrow/benches/cast_kernels.rs (1)
arrow-cast/src/cast/mod.rs (18)
  • c (7821-7821)
  • b (4154-4154)
  • b (4166-4166)
  • b (4196-4196)
  • b (4207-4207)
  • b (4224-4224)
  • b (4249-4249)
  • b (4277-4277)
  • b (4343-4343)
  • b (4355-4355)
  • b (4367-4367)
  • b (4452-4452)
  • b (4462-4462)
  • b (4472-4472)
  • b (4482-4482)
  • b (4492-4492)
  • b (4751-4751)
  • cast (345-347)
arrow-row/src/lib.rs (2)
arrow-array/src/array/list_array.rs (1)
  • offsets (311-313)
arrow-array/src/array/primitive_array.rs (2)
  • i (2070-2070)
  • from_iter_values (753-761)
parquet/src/file/metadata/memory.rs (2)
parquet/src/encryption/decrypt.rs (4)
  • heap_size (276-278)
  • heap_size (301-310)
  • heap_size (358-360)
  • heap_size (583-587)
parquet/src/schema/types.rs (5)
  • heap_size (72-77)
  • heap_size (677-680)
  • heap_size (752-754)
  • heap_size (847-851)
  • heap_size (1026-1028)
parquet/src/arrow/arrow_reader/selection.rs (1)
parquet/src/arrow/arrow_reader/read_plan.rs (1)
  • selection (55-57)
parquet/src/file/reader.rs (2)
parquet/tests/arrow_reader/io/sync_reader.rs (2)
  • len (396-398)
  • get_bytes (413-418)
parquet/tests/arrow_reader/io/async_reader.rs (1)
  • get_bytes (373-390)
arrow-data/src/transform/list_view.rs (2)
arrow-data/src/transform/mod.rs (5)
  • build_extend (238-283)
  • array (220-220)
  • new (401-403)
  • len (753-755)
  • extend_nulls (742-749)
arrow-array/src/array/list_view_array.rs (1)
  • sizes (277-279)
arrow-data/src/transform/mod.rs (2)
arrow-data/src/transform/list_view.rs (2)
  • build_extend (24-47)
  • extend_nulls (49-56)
arrow-data/src/transform/list.rs (2)
  • build_extend (27-48)
  • extend_nulls (50-57)
arrow-data/src/equal/mod.rs (2)
arrow-data/src/equal/run.rs (1)
  • run_equal (26-86)
arrow-data/src/equal/list_view.rs (5)
  • list_view_equal (24-129)
  • lhs (31-31)
  • lhs (32-32)
  • rhs (34-34)
  • rhs (35-35)
parquet-variant-compute/src/shred_variant.rs (1)
parquet-variant-compute/src/variant_array.rs (11)
  • from_iter (445-451)
  • from_iter (455-457)
  • from (433-435)
  • from (439-441)
  • from (701-703)
  • from (707-709)
  • from (865-870)
  • new (493-499)
  • new (766-768)
  • new (820-822)
  • new (884-886)
arrow-data/src/data.rs (1)
arrow-array/src/array/list_view_array.rs (2)
  • offsets (262-264)
  • sizes (277-279)
arrow-select/src/take.rs (3)
arrow-data/src/data.rs (5)
  • new (1857-1859)
  • new (1919-1932)
  • nulls (390-392)
  • nulls (1948-1953)
  • slice (550-578)
arrow-array/src/cast.rs (1)
  • as_list_view (927-929)
arrow-array/src/array/list_view_array.rs (6)
  • nulls (416-418)
  • to_data (379-381)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
arrow-select/src/concat.rs (5)
arrow-array/src/array/list_array.rs (12)
  • x (1269-1269)
  • nulls (569-571)
  • new (266-273)
  • values (322-324)
  • offsets (311-313)
  • offset (565-567)
  • try_new (209-259)
  • builder (1278-1278)
  • builder (1279-1279)
  • builder (1280-1280)
  • builder (1281-1281)
  • builder (1282-1282)
arrow-array/src/array/list_view_array.rs (8)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
  • nulls (416-418)
  • sizes (277-279)
  • offsets (262-264)
  • offset (412-414)
arrow-buffer/src/buffer/offset.rs (4)
  • new (68-79)
  • n (169-171)
  • offsets (282-290)
  • as_ref (248-250)
arrow-array/src/builder/generic_list_view_builder.rs (3)
  • values (119-121)
  • with_capacity (85-96)
  • array (443-443)
arrow-buffer/src/buffer/scalar.rs (1)
  • as_ref (144-146)
arrow-cast/src/cast/mod.rs (2)
arrow-array/src/array/run_array.rs (6)
  • run_array (560-562)
  • run_array (973-973)
  • run_array (999-999)
  • run_array (1033-1033)
  • run_ends (121-123)
  • run_ends (550-552)
arrow-array/src/array/mod.rs (5)
  • make_array (744-842)
  • value (581-581)
  • data_type (127-127)
  • data_type (358-360)
  • data_type (437-439)
parquet-variant-compute/src/type_conversion.rs (1)
arrow-cast/src/cast/decimal.rs (8)
  • rescale_decimal (301-328)
  • rescale_decimal (928-928)
  • rescale_decimal (940-940)
  • rescale_decimal (946-946)
  • rescale_decimal (958-958)
  • rescale_decimal (964-964)
  • rescale_decimal (970-970)
  • rescale_decimal (976-976)
arrow-cast/src/cast/decimal.rs (3)
arrow-array/src/array/primitive_array.rs (5)
  • value (742-750)
  • value (1227-1229)
  • validate_decimal_precision (1603-1618)
  • len (684-686)
  • len (1184-1186)
arrow-array/src/types.rs (10)
  • is_valid_decimal_precision (1347-1347)
  • is_valid_decimal_precision (1413-1415)
  • is_valid_decimal_precision (1449-1451)
  • is_valid_decimal_precision (1485-1487)
  • is_valid_decimal_precision (1521-1523)
  • validate_decimal_precision (1340-1344)
  • validate_decimal_precision (1409-1411)
  • validate_decimal_precision (1445-1447)
  • validate_decimal_precision (1481-1483)
  • validate_decimal_precision (1517-1519)
arrow-data/src/decimal.rs (1)
  • validate_decimal_precision (1134-1167)
arrow-buffer/src/util/bit_iterator.rs (1)
arrow-buffer/src/buffer/boolean.rs (6)
  • new (61-74)
  • iter (202-204)
  • slice (162-172)
  • len (119-121)
  • from (283-287)
  • from (291-293)
parquet/src/encryption/decrypt.rs (2)
parquet/src/encryption/ciphers.rs (1)
  • heap_size (55-58)
parquet/src/file/metadata/memory.rs (10)
  • heap_size (41-41)
  • heap_size (45-51)
  • heap_size (55-93)
  • heap_size (97-100)
  • heap_size (104-108)
  • heap_size (112-114)
  • heap_size (118-120)
  • heap_size (124-126)
  • heap_size (130-142)
  • keys (91-91)
arrow-select/src/filter.rs (4)
arrow-data/src/data.rs (10)
  • new (1857-1859)
  • new (1919-1932)
  • new_unchecked (273-300)
  • nulls (390-392)
  • nulls (1948-1953)
  • null_count (420-425)
  • null_count (1956-1959)
  • buffer (586-588)
  • buffer (922-922)
  • buffer (1496-1496)
arrow-array/src/record_batch.rs (1)
  • new_unchecked (261-271)
arrow-array/src/array/list_view_array.rs (2)
  • nulls (416-418)
  • to_data (379-381)
arrow-array/src/array/mod.rs (9)
  • nulls (205-205)
  • nulls (387-389)
  • nulls (457-459)
  • null_count (286-288)
  • null_count (403-405)
  • null_count (473-475)
  • to_data (108-108)
  • to_data (350-352)
  • to_data (429-431)
⏰ Context from checks skipped due to timeout of 90000ms. (19)
  • GitHub Check: MIRI
  • GitHub Check: Test
  • GitHub Check: Clippy
  • GitHub Check: Check Compilation
  • GitHub Check: Build wasm32
  • GitHub Check: Archery test With other arrows
  • GitHub Check: Check Compilation
  • GitHub Check: Test
  • GitHub Check: Test
  • GitHub Check: Check Compilation
  • GitHub Check: claude-review
  • GitHub Check: PySpark Integration Test (stable)
  • GitHub Check: Test
  • GitHub Check: Verify MSRV (Minimum Supported Rust Version)
  • GitHub Check: Rustdocs are clean
  • GitHub Check: Audit
  • GitHub Check: Test on Mac
  • GitHub Check: codex
  • GitHub Check: Test on Windows
🔇 Additional comments (74)
parquet/src/arrow/arrow_reader/selection.rs (1)

1436-1463: Well-structured tests for trim() behavior.

The two test scenarios comprehensively verify the expected behavior:

  1. When trailing selectors are non-skips, they are preserved.
  2. When trailing skips exist, they are correctly removed.

The assertions clearly validate that the selector order and content remain intact after trimming.

parquet/src/column/page.rs (1)

34-34: LGTM! Useful debugging improvement.

Adding Debug to the Page enum is a good practice that enables debug formatting without any behavioral changes or performance impact. All field types already implement Debug, so this is a safe addition.

Note: The PR title mentions ArrayIter size hint fixes, but this file concerns Parquet page definitions. If this is part of a larger multi-file PR, consider ensuring the PR title reflects the full scope of changes.

parquet/src/column/reader.rs (2)

572-581: Excellent defensive bounds checking.

The added validation properly guards against:

  • Insufficient buffer to read the 4-byte size header
  • Integer overflow when computing the end offset
  • Insufficient buffer to contain the declared data

The use of checked_add and explicit bounds validation prevents potential panics and provides clear error messages.
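The pattern being praised can be shown in isolation; a hedged sketch assuming a little-endian 4-byte length prefix (names and the `String` error type are illustrative):

```rust
// Hedged sketch of defensive bounds checking on a length-prefixed buffer:
// verify the 4-byte header fits, compute the end offset with checked_add,
// and verify the declared payload fits before slicing.
fn read_length_prefixed(buf: &[u8], start: usize) -> Result<&[u8], String> {
    let header_end = start.checked_add(4).ok_or("offset overflow")?;
    if header_end > buf.len() {
        return Err("not enough bytes for size header".into());
    }
    let len = u32::from_le_bytes(buf[start..header_end].try_into().unwrap()) as usize;
    let end = header_end.checked_add(len).ok_or("length overflow")?;
    if end > buf.len() {
        return Err("declared data size exceeds remaining buffer".into());
    }
    Ok(&buf[header_end..end])
}
```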


605-622: Good test coverage for error paths.

The test properly exercises the new bounds checking logic with two scenarios where the declared data size exceeds the available buffer. Error message assertions confirm the expected behavior.

arrow-row/src/lib.rs (2)

916-927: Good zero-copy optimization with proper fallback handling.

The implementation correctly attempts a zero-copy path by consuming the BinaryArray with into_parts() and trying into_vec(). When the buffer is shared (reference count > 1), into_vec() will fail, and the fallback to to_vec() ensures correctness by copying the data. This is a solid optimization that maintains correctness for all cases.


2481-2492: Excellent test coverage for the shared buffer scenario.

This test specifically validates the fallback path in from_binary when the buffer is shared. By cloning binary_rows on line 2487, the test ensures the buffer's reference count is greater than 1, which forces into_vec() to fail and triggers the fallback to to_vec(). The unused variable _binary_rows_shared_buffer is intentionally kept in scope to maintain the shared state.
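The zero-copy-with-fallback pattern distills to a few lines using `Arc::try_unwrap`; this is a simplified analogue of `into_vec()` falling back to `to_vec()`, not the actual `Buffer` internals:

```rust
use std::sync::Arc;

// Simplified analogue of the zero-copy path: take ownership of the
// allocation when it is uniquely owned, fall back to copying when shared.
fn into_vec_or_copy(buf: Arc<Vec<u8>>) -> Vec<u8> {
    Arc::try_unwrap(buf).unwrap_or_else(|shared| shared.as_ref().clone())
}
```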

arrow-data/src/data.rs (1)

991-991: Correct loop bound change.

Iterating over sizes.len() (which equals the number of ListView elements) is the correct approach for validating each list element's offset and size. The previous iteration bound was likely incorrect.

parquet-variant-compute/src/shred_variant.rs (3)

338-338: LGTM!

The refactored test construction using from_iter is correct and simplifies the test setup.


367-367: LGTM!

The simplified construction using direct Variant values is correct and concise.


375-382: LGTM!

The test correctly uses Some/None pattern to test various null scenarios (array-level null vs variant null), which is essential for comprehensive null handling validation.

parquet/src/file/serialized_reader.rs (3)

1139-1139: LGTM!

The import is appropriately added for test usage.


464-467: LGTM!

Replacing unimplemented!() with a proper error return is the correct approach. The error message helpfully includes the page type for debugging.


1179-1213: LGTM!

The test appropriately validates error handling for unsupported page types, covering the new error return at lines 464-467.

parquet/src/file/reader.rs (1)

127-132: Bounds checking prevents out-of-bounds reads.

The defensive check correctly prevents reading beyond the buffer length while allowing valid edge cases (e.g., start == self.len() for empty reads).

parquet/src/schema/types.rs (1)

848-851: HeapSize: avoid double-counting primitive_type — good change

Excluding primitive_type here prevents double-counting since SchemaDescriptor::schema accounts for the Type tree. Keeping only path.heap_size() is correct.

parquet/src/column/reader/decoder.rs (1)

141-153: Good: dispatch table sized by MAX_DISCRIMINANT.

This future-proofs the decoder cache against added encodings and stays consistent with EncodingMask. LGTM.

parquet/src/encodings/decoding.rs (2)

382-395: Good guards for empty data and oversized bit width.

Early EOF and bit-width validation make RLE dictionary decoding safer. LGTM.


1408-1413: Tests cover new error paths well.

Empty dict data and invalid miniblock width cases are covered. Nice.

Also applies to: 2127-2172

parquet/src/parquet_macros.rs (1)

39-42: Nice: expose VARIANTS and MAX_DISCRIMINANT on thrift enums.

This enables static sizing (e.g., dispatch tables) without manual updates. LGTM.

Also applies to: 85-112
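A hand-written analogue of what the macro generates (the enum, discriminants, and names here are illustrative, not the actual parquet encodings):

```rust
// Hedged sketch: explicit discriminants plus VARIANTS/MAX_DISCRIMINANT
// consts, as the thrift_enum macro now emits; names are illustrative.
#[derive(Clone, Copy)]
enum Encoding {
    Plain = 0,
    Rle = 3,
    DeltaBinaryPacked = 5,
}

impl Encoding {
    const VARIANTS: [Encoding; 3] =
        [Encoding::Plain, Encoding::Rle, Encoding::DeltaBinaryPacked];
    const MAX_DISCRIMINANT: usize = 5;
}

// A dispatch table indexed by discriminant never needs manual resizing when
// variants are added, as long as MAX_DISCRIMINANT is regenerated with them.
fn build_table() -> [Option<&'static str>; Encoding::MAX_DISCRIMINANT + 1] {
    let mut table = [None; Encoding::MAX_DISCRIMINANT + 1];
    for e in Encoding::VARIANTS {
        table[e as usize] = Some(match e {
            Encoding::Plain => "plain",
            Encoding::Rle => "rle",
            Encoding::DeltaBinaryPacked => "delta",
        });
    }
    table
}
```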

parquet/src/basic.rs (1)

698-705: Good: EncodingMask derives MAX_ENCODING from enum discriminants.

Keeps masks in sync automatically with new encodings. LGTM.

arrow-array/src/iterator.rs (2)

47-53: ArrayIter now Clone — good additive API

Cloning preserves iterator cursors; helpful for tests and user code. No behavior risks spotted.


101-103: Fix: exact size_hint after next_back/next

Using current_end - current makes len()/ExactSizeIterator correct under mixed forward/backward consumption. Matches the PR goal.
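The fix boils down to deriving the hint from both cursors. A minimal double-ended iterator with the same shape (field names mirror the description; this is not the actual ArrayIter):

```rust
// Minimal double-ended iterator mirroring the fixed hint computation.
struct Cursor {
    current: usize,
    current_end: usize,
}

impl Iterator for Cursor {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.current < self.current_end {
            let i = self.current;
            self.current += 1;
            Some(i)
        } else {
            None
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // correct under mixed forward/backward consumption
        let remaining = self.current_end - self.current;
        (remaining, Some(remaining))
    }
}

impl DoubleEndedIterator for Cursor {
    fn next_back(&mut self) -> Option<usize> {
        if self.current_end > self.current {
            self.current_end -= 1;
            Some(self.current_end)
        } else {
            None
        }
    }
}

impl ExactSizeIterator for Cursor {}
```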

arrow-buffer/src/util/bit_iterator.rs (4)

26-31: BitIterator now Clone — good

Cheap, safe clone of cursor state; enables robust test patterns.


76-82: count() optimized via len()

Avoids iteration; correct for ExactSizeIterator.


83-102: nth(): correct semantics and overflow handling

Advances to current + n, returns next(); handles overflow/out-of-range by exhausting. Looks good.
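The overflow handling can be illustrated on a stripped-down cursor (hypothetical struct; the real iterator also materializes bit values):

```rust
// Hypothetical cursor showing nth()'s overflow/out-of-range handling:
// on failure the iterator is exhausted rather than left mid-range.
struct BitCursor {
    current: usize,
    end: usize,
}

impl BitCursor {
    fn next_index(&mut self) -> Option<usize> {
        if self.current < self.end {
            let i = self.current;
            self.current += 1;
            Some(i)
        } else {
            None
        }
    }

    fn nth_index(&mut self, n: usize) -> Option<usize> {
        match self.current.checked_add(n) {
            Some(idx) if idx < self.end => {
                self.current = idx;
                self.next_index()
            }
            // overflow or past the end: exhaust, matching std semantics
            _ => {
                self.current = self.end;
                None
            }
        }
    }
}
```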


117-139: max(): efficient boolean specialization

Short-circuits by checking any set bit via BitIndexIterator; returns Some(false) for non-empty all-false. Matches Iterator::max semantics for bool.
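The short-circuit idea, sketched over raw LSB-ordered bytes (a simplified stand-in for the BitIndexIterator-based implementation; assumes the caller supplies enough bytes for `len_bits`):

```rust
// Simplified stand-in for the boolean max() specialization over LSB-ordered
// packed bits: short-circuit on the first set bit, Some(false) otherwise.
fn packed_max(bytes: &[u8], len_bits: usize) -> Option<bool> {
    if len_bits == 0 {
        return None; // empty iterator has no max
    }
    let full = len_bits / 8;
    if bytes[..full].iter().any(|&b| b != 0) {
        return Some(true);
    }
    let rem = len_bits % 8;
    if rem > 0 && bytes[full] & ((1u8 << rem) - 1) != 0 {
        return Some(true);
    }
    Some(false)
}
```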

arrow-data/src/equal/mod.rs (1)

33-34: ListView equality wiring looks correct

New module import and match arms correctly route ListView/LargeListView to list_view_equal with appropriate offset types. No issues from this change.

Please ensure the equality tests cover sliced ListView inputs and the null-handling paths, and that they run in CI.

Also applies to: 45-47, 109-111

arrow-select/src/zip.rs (1)

298-352: Good coverage for NULL-in-mask behavior

New tests validate primitive/string/large-string cases with nulls treated as false. LGTM.

Also applies to: 353-402

arrow-select/src/concat.rs (2)

484-486: Dispatch for ListView/LargeListView looks good

concat() routes ListView variants to concat_list_view; tests cover sliced and non-sliced paths.


951-1048: ListView concat tests are solid

Good coverage for regular and sliced ListView cases. LGTM.

arrow/tests/array_equal.rs (1)

825-863: Nested List<ListView<…>> tests are valuable

Sliced outer List with inner ListView equality is well covered. LGTM.

arrow-data/src/transform/mod.rs (6)

36-36: Wiring module looks good.

Module registration for list_view is correct and consistent with the added match arms below.


270-272: Dispatch to list_view::build_extend is correct, but depends on offset rebasing.

Routing is fine. Note: correctness hinges on list_view::build_extend rebasing offsets and extending child_data when concatenating inputs. See comment in list_view.rs.


317-319: Nulls dispatch LGTM.

ListView null extension is wired consistently with List/LargeList variants.


452-456: Capacities match updated appropriately.

Including ListView/LargeListView in List-like capacities is correct.


497-520: Child data grouping for ListView is OK, but extension must be driven by the extend closure.

Collecting the child arrays mirrors List behavior and is fine. However, unless list_view::build_extend extends child_data (and re-bases offsets), the frozen output will have inconsistent offsets vs values. See fix proposal in list_view.rs.


792-799: Buffer selection includes ListView correctly.

Emitting two buffers (offsets, sizes) for ListView/LargeListView is correct.

arrow-data/src/transform/list_view.rs (1)

49-56: Nulls handling is fine.

Pushing zeroed offsets/sizes for nulls is acceptable for ListView.

arrow-select/src/take.rs (4)

221-226: ListView dispatch in take_impl looks correct.

Routes ListView/LargeListView to the new take_list_view with proper offset types.


630-655: take_list_view implementation LGTM.

Offsets/sizes are taken; nulls propagated; child values reused from input. Matches ListView semantics.


1016-1016: Import ToPrimitive for tests is fine.

Required for index builder mapping; OK.


1858-2004: Good ListView take tests.

Covers reversal, null indices, null values, and slicing across i32/i64 offsets.

arrow-select/src/filter.rs (6)

61-68: Iterator ergonomics improved.

Using From<&BooleanBuffer> for SlicesIterator is clean and avoids extra calls.


184-205: filter_record_batch helper is sensible.

Optimization gating mirrors filter(), and the call-through is straightforward.


318-341: FilterPredicate::filter_record_batch LGTM.

Builds arrays then constructs the batch with correct row count. Safe given consistent lengths.


389-394: ListView dispatch added correctly.

Selects i32/i64 variants and routes to filter_list_view.


909-936: filter_list_view implementation looks correct.

Filters offsets/sizes; computes filtered nulls; preserves values child array. Matches expected ListView behavior.


1447-1509: Solid ListView filter tests.

Covers basic and sliced scenarios for both offset sizes.

parquet-variant-compute/src/type_conversion.rs (1)

20-20: Switch to shared rescale_decimal looks good.
Consolidates logic with arrow-cast; no functional concerns.

arrow-cast/Cargo.toml (1)

46-46: Dependency addition verified; no workspace cycles detected.

arrow-ord is confirmed as a workspace member, and its dependencies do not include arrow-cast, confirming no circular dependency exists. The addition is safe and appropriate.

arrow-cast/src/cast/mod.rs (4)

44-51: Module wiring LGTM

New module and import are clean and scoped correctly. No further action.


72-72: Public re-export is fine

Exposing rescale_decimal alongside DecimalCast makes sense for external consumers.


798-805: REE source dispatch looks correct

Index-type-driven dispatch to run_end_encoded_cast is clear; unsupported types error out fast with a clear message.


806-828: No changes needed

The function signature confirms cast_to_run_end_encoded requires &ArrayRef, which matches the review comment's own caveat: "If its signature requires &ArrayRef, ignore." The code is correct as written.

arrow-cast/src/cast/decimal.rs (9)

148-194: Well-designed upscaling builder with efficient precomputed tables.

The approach of retrieving 10^k - 1 from MAX_FOR_EACH_PRECISION and adding 1 avoids runtime power calculations while staying within type bounds. The distinction between fallible and infallible paths based on precision analysis is sound.

Minor note: The infallible closure uses .unwrap() on line 192, which relies on the precision analysis guaranteeing that from_decimal succeeds. This assumption holds given that:

  1. Input values fit within declared precision (validated elsewhere)
  2. Precision limits ensure valid values fit in native types
  3. The condition input_precision + delta_scale <= output_precision correctly predicts success

You may want to verify that all callers ensure input arrays satisfy precision constraints, though this should already be enforced by Arrow's validation.
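As an illustration of the table trick, here is a minimal stand-alone sketch; `MAXES` and `upscale_multiplier` are local stand-ins for the crate's precomputed table, not its actual API:

```rust
// Local stand-in for arrow's MAX_FOR_EACH_PRECISION table (hypothetical values):
// MAXES[p] is the largest value representable with precision p, i.e. 10^p - 1.
const MAXES: [i128; 5] = [0, 9, 99, 999, 9999];

// Multiplier for upscaling by `delta_scale` digits: 10^delta_scale,
// recovered as (10^delta_scale - 1) + 1 with no runtime pow call.
fn upscale_multiplier(delta_scale: usize) -> i128 {
    MAXES[delta_scale] + 1
}

fn main() {
    assert_eq!(upscale_multiplier(2), 100);
    // Upscaling 1.23 (scale 2) to scale 4: 123 * 10^2 = 12300.
    assert_eq!(123_i128 * upscale_multiplier(2), 12300);
    println!("ok");
}
```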


196-264: LGTM: Downscaling logic correctly implements round-half-away-from-zero.

The rounding implementation properly handles both positive and negative values. The infallible condition using < rather than <= (line 261) correctly accounts for the potential digit added by rounding, as explained in the inline comments.

The design choice to return None when scale reduction exceeds precision (causing mathematical zero) is appropriate and well-documented.
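The round-half-away-from-zero behavior can be sketched with plain i128 arithmetic; `downscale_round` below is a hand-rolled illustration, not the crate's implementation:

```rust
// Hand-rolled round-half-away-from-zero downscale by 10^delta. Adding (or
// subtracting) half the divisor before truncating division makes any
// |remainder| >= half round away from zero.
fn downscale_round(v: i128, delta: u32) -> i128 {
    let div = 10_i128.pow(delta);
    let half = div / 2;
    if v >= 0 { (v + half) / div } else { (v - half) / div }
}

fn main() {
    assert_eq!(downscale_round(125, 1), 13); // 12.5 rounds up to 13
    assert_eq!(downscale_round(-125, 1), -13); // -12.5 rounds down to -13
    assert_eq!(downscale_round(124, 1), 12); // 12.4 truncates to 12
    println!("ok");
}
```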


266-284: Appropriate helper for applying rescaling functions.

The function correctly delegates to either the infallible or fallible path, with validation only on the fallible path (for performance).


286-328: New public API for row-level decimal rescaling is well-designed.

This function provides a clean interface for rescaling individual decimal values, complementing the existing array-level operations. The documentation clearly explains the behavior, including:

  • Validation of precision and scale for both input and output
  • Upscale overflow returns None
  • Excessive downscale returns Some(ZERO) (mathematically correct)

The asymmetric error handling is appropriate given the mathematical semantics of each operation.


330-349: LGTM: Straightforward error message helper.


351-378: Correct array-level casting with appropriate paths for each scenario.

The three-way branching correctly handles:

  1. Infallible casts (direct unary)
  2. Safe fallible casts (filtering with unary_opt)
  3. Unsafe fallible casts (validation with try_unary)

Each path appropriately validates or omits validation based on guarantees.


380-410: Good refactoring using the new builder pattern.

The use of make_downscaler and apply_decimal_cast improves code modularity. The special case handling (returning zeros when scale reduction exceeds precision) correctly preserves null values while replacing non-nulls with zero.


412-446: Refactoring mirrors downscaler pattern with appropriate error handling.

The asymmetric behavior compared to convert_to_smaller_scale_decimal is intentional:

  • Upscale overflow → error (line 439-444)
  • Downscale excessive → zeros (in the other function)

This reflects the mathematical reality: excessive upscaling causes overflow (error condition), while excessive downscaling mathematically rounds to zero (valid result).
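A minimal sketch of why upscaling is the fallible direction, using a hypothetical `upscale` helper over i128:

```rust
// Hypothetical upscale helper: multiplying by 10^delta can overflow i128,
// so this direction must report None rather than produce a wrong value.
fn upscale(v: i128, delta: u32) -> Option<i128> {
    10_i128.checked_pow(delta).and_then(|m| v.checked_mul(m))
}

fn main() {
    assert_eq!(upscale(2, 3), Some(2000));
    assert_eq!(upscale(i128::MAX, 1), None); // overflow: error condition
    // An excessive downscale, by contrast, simply rounds every value to 0,
    // which is a valid decimal result rather than an error.
    println!("ok");
}
```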


926-978: Comprehensive test coverage for the new rescale_decimal function.

The tests cover the essential scenarios:

  • Normal upscaling and downscaling
  • Rounding behavior (half-away-from-zero for both positive and negative values)
  • Edge cases (excessive scale changes, overflow)
  • Error cases (invalid precision/scale parameters)

parquet/src/encryption/ciphers.rs (2)

21-21: Import looks correct.
Brings HeapSize into scope for the new supertrait requirement.


31-35: Extending BlockDecryptor with HeapSize — Verified as correct.

The only implementor, RingGcmBlockDecryptor, implements both BlockDecryptor (line 61) and HeapSize (line 54), so the trait constraint is satisfied.

parquet/src/file/metadata/memory.rs (3)

31-31: Import for HashMap — OK.
Needed for the new HeapSize impl.


96-101: Arc overhead calculation — reasonable approximation.
Counts two counters + T + inner heap. Keep in mind std layout is not strictly guaranteed.
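A std-only sketch of this approximation; `arc_heap_estimate` is an illustrative helper, and as noted, the ArcInner layout is not guaranteed by std:

```rust
use std::mem::size_of;

// Rough estimate of the heap allocation behind Arc<T>: ArcInner holds two
// atomic reference counters (strong + weak) plus the value itself. std does
// not guarantee this layout, so treat the number as an approximation.
fn arc_heap_estimate<T>() -> usize {
    2 * size_of::<usize>() + size_of::<T>()
}

fn main() {
    // AtomicUsize has the same size as usize, so on a 64-bit target an
    // Arc<u64> allocation is roughly 16 (counters) + 8 (value) bytes.
    assert_eq!(arc_heap_estimate::<u64>(), 2 * size_of::<usize>() + 8);
    println!("ok");
}
```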


103-109: Arc impl — LGTM.
Correctly uses size_of_val for the underlying type plus inner heap usage.

parquet/src/encryption/decrypt.rs (5)

25-25: Import looks correct.
Brings HeapSize into scope for the added impls.


275-279: HeapSize for ExplicitDecryptionKeys — OK.
Accurately sums footer key and column keys.


300-311: HeapSize for DecryptionKeys — OK.
Returns 0 for retriever variant; sensible since it’s user‑defined.


357-361: HeapSize for FileDecryptionProperties — OK.
Sums keys and optional AAD prefix.


582-588: Verify that calling heap_size() directly on Arc<dyn BlockDecryptor> compiles without casting.

The review comment is factually correct: BlockDecryptor extends HeapSize as a supertrait (defined in parquet/src/encryption/ciphers.rs:31). In theory, trait object method dispatch should allow calling supertrait methods directly without casting. However, the current code explicitly casts Arc<dyn BlockDecryptor> to Arc<dyn HeapSize> before calling heap_size(), suggesting there may be a compilation or type resolution issue.

The suggested fix removes the cast and calls heap_size() directly on the trait object. Verify locally that this compiles and functions correctly:

self.decryption_properties.heap_size()
    + self.footer_decryptor.heap_size()  // Direct call without cast
    + self.file_aad.heap_size()

If the direct call fails to compile, the cast may be necessary despite the supertrait relationship.

parquet/src/file/metadata/mod.rs (1)

290-301: Including decryptor memory in memory_size — OK.
Feature‑gated, uses HeapSize on Option<Box<...>>; consistent with existing accounting.

```toml
[dependencies]
-arrow-avro = "56"
+arrow-avro = "57.0.0"
```

⚠️ Potential issue | 🟡 Minor

Update version references in feature snippets section for consistency.

Lines 47 and 54 have been updated to 57.0.0, but the feature snippet examples at lines 152, 157, and 162 still reference version = "56". This creates inconsistency in the documentation.

To maintain clarity, update all version references consistently:

  * Minimal, fast build (common pipelines):
  
    ```toml
-   arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy"] }
+   arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy"] }
    ```
  * Include Zstandard too (modern data lakes):
  
    ```toml
-   arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy", "zstd"] }
+   arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy", "zstd"] }
    ```
  * Fingerprint helpers:
  
    ```toml
-   arrow-avro = { version = "56", features = ["md5", "sha256"] }
+   arrow-avro = { version = "57.0.0", features = ["md5", "sha256"] }
    ```

Also applies to: 54-54

🤖 Prompt for AI Agents
In arrow-avro/README.md around lines 47, 54 and especially the feature snippet
lines 152, 157, and 162, update the inconsistent crate version references that
currently show version = "56" to the new canonical version = "57.0.0"; edit each
toml example to replace the old version string with "57.0.0" (including examples
with default-features and feature lists) so all documented snippets match the
top-of-file version references.

Comment on lines +51 to +71
let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(Arc::new(new_run_array))
}

⚠️ Potential issue | 🔴 Critical

Double-wrapping Arc causes a type error.
new_run_array is already an ArrayRef (an Arc<dyn Array>). Returning Arc::new(new_run_array) wraps that Arc in another Arc, producing Arc<ArrayRef> instead of ArrayRef.

-                    Ok(Arc::new(new_run_array))
+                    Ok(new_run_array)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(Arc::new(new_run_array))
}

let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(new_run_array)
}
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 51 to 71, the code constructs
new_run_array as an ArrayRef (Arc<dyn Array>) and then returns
Ok(Arc::new(new_run_array)), which double-wraps an Arc and causes a type error;
change the return to Ok(new_run_array) (or adjust the binding so it isn't
already an Arc) so you return the existing ArrayRef directly instead of wrapping
it again.

Comment on lines +75 to +93
let run_ends = run_array.run_ends().values().to_vec();
let mut indices = Vec::with_capacity(run_array.run_ends().len());
let mut physical_idx: usize = 0;
for logical_idx in 0..run_array.run_ends().len() {
    // If the logical index is equal to the (next) run end, increment the physical index,
    // since we are at the end of a run.
    if logical_idx == run_ends[physical_idx].as_usize() {
        physical_idx += 1;
    }
    indices.push(physical_idx as i32);
}

let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Incorrect logical expansion: builds indices per run, not per row.
This returns only one value per run instead of expanding to array length. Build indices of total_len = last run-end, repeating each run index by its run length.

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
-
-                    let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
+                    let run_ends = run_array.run_ends().values().to_vec();
+                    let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
+                    let mut indices: Vec<u32> = Vec::with_capacity(total_len);
+                    let mut prev_end = 0usize;
+                    for (run_idx, end) in run_ends.iter().enumerate() {
+                        let end = end.as_usize();
+                        let run_len = end - prev_end;
+                        indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
+                        prev_end = end;
+                    }
+                    let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;

Add a unit test that casts a small REE array (e.g., run_ends=[2,5], values=["a","b"]) to Utf8 and checks length=5 and correct expansion ["a","a","b","b","b"].


🏁 Script executed:

#!/bin/bash
# Find and examine the run_array.rs file to verify the current code
find . -name "run_array.rs" -type f | head -20

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Search for the specific code pattern around the suspected issue
rg -n "let run_ends = run_array.run_ends" --type rust -A 20

Length of output: 6237


🏁 Script executed:

#!/bin/bash
# Look for existing tests related to run_array casting
find . -name "*.rs" -type f | xargs grep -l "run_array" | grep -i test

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check for RunArray/REE related documentation or implementation
rg -n "RunArray|run_encoded" --type rust -C 2 | head -100

Length of output: 6749


🏁 Script executed:

#!/bin/bash
# Find the take function signature to verify it accepts UInt32Array
rg -n "pub fn take" --type rust -A 5 | grep -A 5 "indices"

Length of output: 942


🏁 Script executed:

#!/bin/bash
# Look for tests in run_array.rs file itself
cat arrow-cast/src/cast/run_array.rs | tail -200

Length of output: 6928


🏁 Script executed:

#!/bin/bash
# Search for tests that use cast_with_options on RunArray
rg -n "RunArray|cast_with_options" arrow-cast/src/cast/run_array.rs --type rust -C 3

Length of output: 3588


🏁 Script executed:

#!/bin/bash
# Check the full context of the problematic code to understand the function signature
sed -n '50,100p' arrow-cast/src/cast/run_array.rs

Length of output: 2449


🏁 Script executed:

#!/bin/bash
# Search for UInt32Array usage with take function to validate the proposed change
rg -n "UInt32Array.*take\|take.*UInt32Array" --type rust -C 2

Length of output: 0


Implement the expansion fix for RunArray logical form casting.

The current code has a critical logic error: it iterates 0..run_array.run_ends().len() (number of runs) instead of the logical array length, producing incorrectly sized output. For example, with run_ends=[2,5] and values=["a","b"], the current code returns indices=[0,0] (length 2), but should return [0,0,1,1,1] (length 5) to expand the compressed representation.

The proposed fix correctly expands each run by its run_length. The data type change to UInt32Array is valid—take() accepts any Array type, and UInt32Array is already used successfully elsewhere in this same file (line 157).

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
-
-                    let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
+                    let run_ends = run_array.run_ends().values().to_vec();
+                    let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
+                    let mut indices: Vec<u32> = Vec::with_capacity(total_len);
+                    let mut prev_end = 0usize;
+                    for (run_idx, end) in run_ends.iter().enumerate() {
+                        let end = end.as_usize();
+                        let run_len = end - prev_end;
+                        indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
+                        prev_end = end;
+                    }
+                    let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;

Add a unit test that casts a small REE array (e.g., run_ends=[2,5], values=["a","b"]) to Utf8 and verifies length=5 with correct expansion ["a","a","b","b","b"].
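The corrected expansion can be exercised without Arrow at all; `expand_run_ends` below is an illustrative std-only version of the proposed loop:

```rust
// Illustrative std-only version of the proposed expansion loop: each run
// index is repeated run-length times, so the output length equals the last
// run end (the array's logical length).
fn expand_run_ends(run_ends: &[i64]) -> Vec<u32> {
    let total = run_ends.last().copied().unwrap_or(0) as usize;
    let mut indices = Vec::with_capacity(total);
    let mut prev = 0usize;
    for (run_idx, &end) in run_ends.iter().enumerate() {
        let end = end as usize;
        indices.extend(std::iter::repeat(run_idx as u32).take(end - prev));
        prev = end;
    }
    indices
}

fn main() {
    // run_ends=[2,5], values=["a","b"] expands to ["a","a","b","b","b"].
    assert_eq!(expand_run_ends(&[2, 5]), vec![0, 0, 1, 1, 1]);
    assert!(expand_run_ends(&[]).is_empty());
    println!("ok");
}
```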

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let run_ends = run_array.run_ends().values().to_vec();
let mut indices = Vec::with_capacity(run_array.run_ends().len());
let mut physical_idx: usize = 0;
for logical_idx in 0..run_array.run_ends().len() {
    // If the logical index is equal to the (next) run end, increment the physical index,
    // since we are at the end of a run.
    if logical_idx == run_ends[physical_idx].as_usize() {
        physical_idx += 1;
    }
    indices.push(physical_idx as i32);
}
let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}

let run_ends = run_array.run_ends().values().to_vec();
let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
let mut indices: Vec<u32> = Vec::with_capacity(total_len);
let mut prev_end = 0usize;
for (run_idx, end) in run_ends.iter().enumerate() {
    let end = end.as_usize();
    let run_len = end - prev_end;
    indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
    prev_end = end;
}
let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 75-93, the code builds indices
by iterating 0..run_array.run_ends().len() (number of runs) instead of the
logical array length, producing a too-short indices vector; change the loop to
iterate over the logical length (e.g., 0..run_array.len() or
run_array.logical_len()) and expand each run by comparing the current logical
index to run_ends[physical_idx].as_usize(), incrementing physical_idx only when
you pass a run end, and push the current physical_idx for each logical position;
construct the indices as a UInt32Array (consistent with other uses in this file)
and call take(&values, &UInt32Array::from_iter_values(indices), None) as before,
then cast if needed. Also add a unit test that builds a RunEndEncoded array with
run_ends=[2,5] and values=["a","b"], casts it to Utf8, and asserts length == 5
and contents == ["a","a","b","b","b"] to verify correct expansion.

Comment on lines +113 to +119
// Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
    array
} else {
    &cast_with_options(array, value_type, cast_options)?
};


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Borrow to temporary (lifetime) and minor API nit in take.
cast_array borrows a temporary ArrayRef; also prefer .as_ref() for take.

-    let cast_array = if array.data_type() == value_type {
-        array
-    } else {
-        &cast_with_options(array, value_type, cast_options)?
-    };
+    let cast_array: ArrayRef = if array.data_type() == value_type {
+        Arc::clone(array)
+    } else {
+        cast_with_options(array, value_type, cast_options)?
+    };
@@
-    let partitions = partition(&[Arc::clone(cast_array)])?;
+    let partitions = partition(&[Arc::clone(&cast_array)])?;
@@
-    let values_array = take(&cast_array, &indices, None)?;
+    let values_array = take(cast_array.as_ref(), &indices, None)?;

Compile check this file to ensure lifetimes and trait bounds (as_usize) resolve cleanly.

Also applies to: 156-160


I'll verify this review comment by examining the actual code and checking the validity of the lifetime and API concerns.


🏁 Script executed:

#!/bin/bash
# Examine the file context around the flagged lines
cat arrow-cast/src/cast/run_array.rs | head -150 | tail -50

Length of output: 1784


🏁 Script executed:

#!/bin/bash
# Also check lines 156-160 context
cat arrow-cast/src/cast/run_array.rs | sed -n '150,165p'

Length of output: 658


🏁 Script executed:

#!/bin/bash
# Find function signatures for cast_with_options and take
rg -n -A5 'fn (cast_with_options|take)\s*\(' arrow-cast/src --type rust

Length of output: 347


🏁 Script executed:

#!/bin/bash
# Check imports and type definitions at top of file
head -40 arrow-cast/src/cast/run_array.rs

Length of output: 1652


🏁 Script executed:

#!/bin/bash
# Look for the full function containing lines 113-119
rg -nB20 'Cast the input array to the target' arrow-cast/src/cast/run_array.rs

Length of output: 691


🏁 Script executed:

#!/bin/bash
# Check for ArrayRef type definition and usage
rg -n 'type ArrayRef' arrow-cast/src --type rust

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Examine partition function signature
rg -n -A3 'fn partition' arrow-cast/src --type rust

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the take function more carefully - might be from arrow library
rg -rn 'use.*take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get broader context - find the function definition containing line 113
awk 'NR>=100 && NR<=170' arrow-cast/src/cast/run_array.rs

Length of output: 2371


🏁 Script executed:

#!/bin/bash
# Check what type array is and its return from cast_with_options
rg -B5 -A10 'cast_with_options.*array.*value_type' arrow-cast/src/cast/run_array.rs

Length of output: 559


🏁 Script executed:

#!/bin/bash
# Look at the actual variable usage after line 119 to understand lifetime requirements
awk 'NR>=113 && NR<=145' arrow-cast/src/cast/run_array.rs

Length of output: 1283


🏁 Script executed:

#!/bin/bash
# Check if ArrayRef is Arc-based
grep -r 'pub type ArrayRef' arrow-cast/src --include='*.rs'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look for arrow crate dependencies
grep -E 'arrow|ArrayRef' arrow-cast/Cargo.toml

Length of output: 249


🏁 Script executed:

#!/bin/bash
# Check the take function import source
grep -n 'use.*take\|from.*take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Full file for reference
wc -l arrow-cast/src/cast/run_array.rs

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Get complete function context around lines 113-119 and 156-160
sed -n '90,175p' arrow-cast/src/cast/run_array.rs

Length of output: 2645


🏁 Script executed:

#!/bin/bash
# Check the second instance (156-160)
sed -n '145,170p' arrow-cast/src/cast/run_array.rs

Length of output: 771


🏁 Script executed:

#!/bin/bash
# Look for return type of cast_with_options
ast-grep --pattern 'fn cast_with_options($$$) -> $$$'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check partition and take function signatures in this file
grep -n -A4 'fn partition\|pub fn take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Inspect imports more carefully
sed -n '1,50p' arrow-cast/src/cast/run_array.rs | grep -E '^use|^mod|^pub'

Length of output: 112


🏁 Script executed:

#!/bin/bash
# Find all instances where cast_array is used
grep -n 'cast_array' arrow-cast/src/cast/run_array.rs

Length of output: 223


🏁 Script executed:

#!/bin/bash
# Check arrow dependency version and ArrayRef
grep -A10 'dependencies' arrow-cast/Cargo.toml

Length of output: 760


🏁 Script executed:

#!/bin/bash
# More direct approach - look at the actual code lines
sed -n '100,165p' arrow-cast/src/cast/run_array.rs

Length of output: 2371


🏁 Script executed:

#!/bin/bash
# Check file structure
ls -la arrow-cast/src/cast/

Length of output: 543


🏁 Script executed:

#!/bin/bash
# Ensure proper path
find . -name "run_array.rs" -type f

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Get the full content if file is small
cat arrow-cast/src/cast/run_array.rs 2>/dev/null || cat arrow-cast/src/run_array.rs 2>/dev/null || find . -name "run_array.rs" -exec cat {} \;

Length of output: 6928


🏁 Script executed:

#!/bin/bash
# Try to find the file with fd
fd run_array.rs arrow-cast/

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Check what's in arrow-cast
ls -la arrow-cast/src/

Length of output: 466


🏁 Script executed:

#!/bin/bash
# Find all Rust files in arrow-cast
find arrow-cast -name "*.rs" -type f

Length of output: 471


🏁 Script executed:

#!/bin/bash
# Look for cast directory
find . -type d -name cast

Length of output: 21


🏁 Script executed:

#!/bin/bash
# Verify project structure
ls -la */src/ 2>/dev/null | head -20

Length of output: 932


🏁 Script executed:

#!/bin/bash
# List files matching the pattern
find . -path "*/cast/*array*.rs"

Length of output: 34


🏁 Script executed:

#!/bin/bash
# Broader search
find . -name "*run_array*"

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Check current directory structure
ls -R

Length of output: 15946


🏁 Script executed:

#!/bin/bash
# Look for any file with "run_array" in name
locate run_array.rs 2>/dev/null || find . -type f -name "*array*" 2>/dev/null | grep -i run

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Simple current dir check
pwd && ls -la

Length of output: 3716



🏁 Script executed:

#!/bin/bash
# Find partition function signature
rg -n 'fn partition' arrow-ord/src --type rust -A3

Length of output: 789


🏁 Script executed:

#!/bin/bash
# Find take function signature  
rg -n 'pub fn take' arrow-select/src --type rust -A4

Length of output: 867


🏁 Script executed:

#!/bin/bash
# Check cast module imports at top to understand take source
head -50 arrow-cast/src/cast/mod.rs | grep -E '^use|^pub'

Length of output: 142


🏁 Script executed:

#!/bin/bash
# Look for take import in cast/mod.rs more broadly
grep -n 'take' arrow-cast/src/cast/mod.rs | head -5

Length of output: 188


Fix critical lifetime issue: borrow of temporary ArrayRef in else branch (line 118).

The current code creates a dangling reference in the else branch:

&cast_with_options(array, value_type, cast_options)?

This borrows a temporary that is immediately dropped. The suggested fix resolves this by making cast_array own the ArrayRef directly, and improves API clarity with explicit .as_ref() calls:

-    let cast_array = if array.data_type() == value_type {
-        array
-    } else {
-        &cast_with_options(array, value_type, cast_options)?
-    };
+    let cast_array: ArrayRef = if array.data_type() == value_type {
+        Arc::clone(array)
+    } else {
+        cast_with_options(array, value_type, cast_options)?
+    };

Then update usages at line 138 and line 159:

-    let partitions = partition(&[Arc::clone(cast_array)])?;
+    let partitions = partition(&[Arc::clone(&cast_array)])?;
-    let values_array = take(&cast_array, &indices, None)?;
+    let values_array = take(cast_array.as_ref(), &indices, None)?;
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 113 to 119, the else branch
currently returns a reference to a temporary (&cast_with_options(...)?),
creating a dangling borrow; change cast_array to own an ArrayRef (not &ArrayRef)
by assigning either a cloned/owned array for the equality case or the returned
ArrayRef from cast_with_options in the else branch so the value lives long
enough, and then update the subsequent uses at lines 138 and 159 to call
.as_ref() (or otherwise dereference) on the owned ArrayRef where a &Array is
required.

Comment on lines +96 to +125
// Check values for equality, with null checking
let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];

if lhs_range_offsets.len() != rhs_range_offsets.len() {
return false;
}

for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_range_sizes)
.enumerate()
{
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);

if lhs_is_null != rhs_is_null {
return false;
}

let lhs_offset = lhs_offset.to_usize().unwrap();
let rhs_offset = rhs_offset.to_usize().unwrap();
let size = size.to_usize().unwrap();

// Check if values match in the range
if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}

⚠️ Potential issue | 🔴 Critical

Bug: per-element size mismatch not checked when nulls are present

In the null-path, equality must assert that sizes are equal for non-null positions. Currently, only lhs size is used for comparison, which can yield false positives if rhs has extra elements.

Apply this fix to compare both sizes and use the agreed size:

-        // Check values for equality, with null checking
-        let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
-        let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];
+        // Check values for equality, with null checking
+        let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
+        let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];

-        for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
-            .iter()
-            .zip(rhs_range_offsets)
-            .zip(lhs_range_sizes)
-            .enumerate()
-        {
+        for (index, ((&lhs_offset, &rhs_offset), (&lhs_size, &rhs_size))) in lhs_range_offsets
+            .iter()
+            .zip(rhs_range_offsets)
+            .zip(lhs_range_sizes.iter().zip(rhs_range_sizes))
+            .enumerate()
+        {
             let lhs_is_null = lhs_nulls.is_null(index);
             let rhs_is_null = rhs_nulls.is_null(index);

             if lhs_is_null != rhs_is_null {
                 return false;
             }

-            let lhs_offset = lhs_offset.to_usize().unwrap();
-            let rhs_offset = rhs_offset.to_usize().unwrap();
-            let size = size.to_usize().unwrap();
+            if !lhs_is_null {
+                if lhs_size != rhs_size {
+                    return false;
+                }
+                let lhs_offset = lhs_offset.to_usize().unwrap();
+                let rhs_offset = rhs_offset.to_usize().unwrap();
+                let size = lhs_size.to_usize().unwrap();

-            // Check if values match in the range
-            if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
-                return false;
-            }
+                // Check if values match in the range
+                if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+                    return false;
+                }
+            }
         }

Add a test where one side has an extra element for a non-null sublist to prevent regressions.
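To make the failure mode concrete, here is a minimal self-contained sketch, using plain Rust tuples in place of the arrow-rs offset/size/null buffers (the `lists_equal` helper and its types are hypothetical, not the crate's API). It implements the corrected rule: for non-null entries, sizes must match before values are compared.

```rust
// Each list view is (offset, size, is_null) into a shared values buffer.
// Comparing only the lhs size would let a longer rhs sublist with a matching
// prefix compare equal; checking both sizes first prevents that.
fn lists_equal(
    values_l: &[i32],
    values_r: &[i32],
    lists_l: &[(usize, usize, bool)],
    lists_r: &[(usize, usize, bool)],
) -> bool {
    if lists_l.len() != lists_r.len() {
        return false;
    }
    lists_l.iter().zip(lists_r).all(|(&(lo, ls, ln), &(ro, rs, rn))| {
        if ln != rn {
            return false; // null on exactly one side
        }
        if ln {
            return true; // both null: values are irrelevant
        }
        // sizes must agree before comparing the value ranges
        ls == rs && values_l[lo..lo + ls] == values_r[ro..ro + rs]
    })
}

fn main() {
    let vals_l = [1, 2, 3];
    let vals_r = [1, 2, 3, 4];
    // lhs sublist [1, 2] vs rhs sublist [1, 2, 3]: prefix matches, sizes differ
    assert!(!lists_equal(&vals_l, &vals_r, &[(0, 2, false)], &[(0, 3, false)]));
    // identical non-null sublists compare equal
    assert!(lists_equal(&vals_l, &vals_r, &[(0, 2, false)], &[(0, 2, false)]));
    println!("ok");
}
```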

🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 96-125, the loop only reads
lhs sizes and doesn't verify that rhs per-element sizes match when values are
non-null; update the code to also slice rhs_range_sizes (similar to
lhs_range_sizes), in the for loop read both lhs_size and rhs_size (convert both
to usize) and immediately return false if they differ, then use the agreed size
(either value after the equality check) when calling equal_values; also add a
unit test that constructs two lists where a non-null sublist on one side has an
extra element (size mismatch) to prevent regressions.

Comment on lines +24 to +47
pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
array: &ArrayData,
) -> crate::transform::Extend<'_> {
let offsets = array.buffer::<T>(0);
let sizes = array.buffer::<T>(1);
Box::new(
move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
let offset_buffer = &mut mutable.buffer1;
let sizes_buffer = &mut mutable.buffer2;

for &offset in &offsets[start..start + len] {
offset_buffer.push(offset);
}

// sizes
for &size in &sizes[start..start + len] {
sizes_buffer.push(size);
}

// the beauty of views is that we don't need to copy child_data, we just splat
// the offsets and sizes.
},
)
}

⚠️ Potential issue | 🔴 Critical

Critical: Offsets not rebased and child_data not extended when concatenating; output can be invalid.

When MutableArrayData combines multiple ListView arrays, you must:

  • Extend mutable.child_data[0] with the referenced value range(s).
  • Rebase pushed offsets by the base child length so they still point to the concatenated values.

Without this, offsets from later input arrays can point before appended regions, and child_data remains empty, yielding corrupted arrays.
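The rebasing idea can be sketched with plain `Vec`s (the `append_list_view` helper below is hypothetical, not the MutableArrayData API): when views from a second array are appended, each copied offset must be shifted by the length the child values buffer had before that array's values landed.

```rust
// Append one ListView-style array (values + offsets + sizes) onto another,
// rebasing the copied offsets so each view still points at its own values
// inside the concatenated buffer.
fn append_list_view(
    dst_values: &mut Vec<i32>,
    dst_offsets: &mut Vec<usize>,
    dst_sizes: &mut Vec<usize>,
    src_values: &[i32],
    src_offsets: &[usize],
    src_sizes: &[usize],
) {
    let base = dst_values.len(); // where src's values will start in dst
    dst_values.extend_from_slice(src_values);
    for (&off, &size) in src_offsets.iter().zip(src_sizes) {
        dst_offsets.push(off + base); // rebase: without this, the view points
        dst_sizes.push(size); // into the *first* array's value region
    }
}

fn main() {
    let (mut values, mut offsets, mut sizes) = (vec![10, 20], vec![0], vec![2]);
    // second array: values [30, 40, 50], one list starting at offset 1, size 2
    append_list_view(&mut values, &mut offsets, &mut sizes, &[30, 40, 50], &[1], &[2]);
    assert_eq!(values, [10, 20, 30, 40, 50]);
    assert_eq!(offsets, [0, 3]); // 1 + base(2) = 3, so the view is [40, 50]
    assert_eq!(sizes, [2, 2]);
}
```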

Apply this fix to rebase offsets and extend child values in one contiguous span covering the selected rows (skipping null/empty rows):

@@
-pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
+pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
     array: &ArrayData,
 ) -> crate::transform::Extend<'_> {
     let offsets = array.buffer::<T>(0);
     let sizes = array.buffer::<T>(1);
+    let nulls = array.nulls().cloned();
     Box::new(
-        move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
-            let offset_buffer = &mut mutable.buffer1;
-            let sizes_buffer = &mut mutable.buffer2;
-
-            for &offset in &offsets[start..start + len] {
-                offset_buffer.push(offset);
-            }
-
-            // sizes
-            for &size in &sizes[start..start + len] {
-                sizes_buffer.push(size);
-            }
-
-            // the beauty of views is that we don't need to copy child_data, we just splat
-            // the offsets and sizes.
-        },
+        move |mutable: &mut _MutableArrayData, index: usize, start: usize, len: usize| {
+            let end = start + len;
+            let offset_buffer = &mut mutable.buffer1;
+            let sizes_buffer = &mut mutable.buffer2;
+
+            // Compute a contiguous value range covering all selected non-empty, non-null lists
+            let mut min_off: Option<usize> = None;
+            let mut max_end: usize = 0;
+            for i in start..end {
+                // Skip null rows
+                if let Some(ref n) = nulls {
+                    if !n.is_valid(i) {
+                        continue;
+                    }
+                }
+                let sz = sizes[i];
+                if sz == T::zero() {
+                    continue;
+                }
+                let off = offsets[i].as_usize();
+                let list_end = off
+                    .checked_add(sz.as_usize())
+                    .expect("ListView size overflow");
+                min_off = Some(min_off.map_or(off, |m| m.min(off)));
+                if list_end > max_end {
+                    max_end = list_end;
+                }
+            }
+
+            // Base shift where newly appended child values will start
+            let base_child_len = mutable.child_data[0].len();
+
+            // If there are any values to copy, extend child_data once with the covering span
+            let delta_usize = if let Some(min) = min_off {
+                mutable.child_data[0].extend(index, min, max_end);
+                base_child_len
+                    .checked_sub(min)
+                    .expect("underflow computing ListView base shift")
+            } else {
+                0
+            };
+
+            let delta: T = T::from_usize(delta_usize)
+                .expect("ListView base shift does not fit target offset type");
+
+            // Push rebased offsets and original sizes
+            for i in start..end {
+                let off = offsets[i];
+                let rebased = off.checked_add(&delta).expect("ListView offset overflow");
+                offset_buffer.push(rebased);
+            }
+            for i in start..end {
+                sizes_buffer.push(sizes[i]);
+            }
+        },
     )
 }

Add a unit test that concatenates two ListView arrays via MutableArrayData and asserts offsets point into the concatenated values.

I can add the test and wire it into array_transform tests if you want.


Comment on lines +373 to +381
c.bench_function("cast runs of 10 string to ree<int32>", |b| {
let source_array: Int32Array = (0..8192).map(|i| i / 10).collect();
let array_ref = Arc::new(source_array) as ArrayRef;
let target_type = DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int32, false)),
Arc::new(Field::new("values", DataType::Int32, true)),
);
b.iter(|| cast(&array_ref, &target_type).unwrap());
});

⚠️ Potential issue | 🟡 Minor

Bench name/type mismatch.
Label says “string” but the source array is Int32. Update for clarity.

-    c.bench_function("cast runs of 10 string to ree<int32>", |b| {
+    c.bench_function("cast runs of 10 int32 to ree<int32>", |b| {
🤖 Prompt for AI Agents
In arrow/benches/cast_kernels.rs around lines 373 to 381, the benchmark label
incorrectly says "string" while the source array is Int32; update the bench name
to accurately reflect the types (e.g., "cast runs of 10 int32 to ree<int32>") so
the description matches the source_array and target_type.

Comment on lines +138 to +145
if start > self.len() || start + length > self.len() {
return Err(eof_err!(
"Expected to read {} bytes at offset {}, while file has length {}",
length,
start,
self.len()
));
}

⚠️ Potential issue | 🟠 Major

Potential integer overflow in bounds check.

The expression start + length on line 138 can overflow since both operands are usize. If overflow occurs, the wrapped value may incorrectly pass the bounds check, allowing out-of-bounds access.

Consider using checked arithmetic or restructuring the condition:

-        if start > self.len() || start + length > self.len() {
+        if start > self.len() || length > self.len() - start {

Or with explicit overflow handling:

-        if start > self.len() || start + length > self.len() {
+        let end = start.checked_add(length).ok_or_else(|| {
+            ParquetError::EOF(format!(
+                "Expected to read {} bytes at offset {}, while file has length {}",
+                length, start, self.len()
+            ))
+        })?;
+        if end > self.len() {
             return Err(eof_err!(
                 "Expected to read {} bytes at offset {}, while file has length {}",
                 length,
                 start,
                 self.len()
             ));
         }
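A small standalone sketch of the wraparound (using `wrapping_add` to stand in for release-mode overflow behavior; the helper names are illustrative, not from the crate):

```rust
// The naive check `start + length <= len` wraps for usize in release builds,
// so a huge `length` can slip past it. Rewriting as a subtraction, guarded by
// `start <= len`, cannot overflow.
fn in_bounds_naive(start: usize, length: usize, len: usize) -> bool {
    // wrapping_add models what `start + length` does when overflow checks are off
    start <= len && start.wrapping_add(length) <= len
}

fn in_bounds_checked(start: usize, length: usize, len: usize) -> bool {
    start <= len && length <= len - start
}

fn main() {
    let len = 100;
    // start = 1, length = usize::MAX wraps to 0 and passes the naive check
    assert!(in_bounds_naive(1, usize::MAX, len));
    assert!(!in_bounds_checked(1, usize::MAX, len));
    // ordinary cases agree
    assert!(in_bounds_checked(10, 90, len));
    assert!(!in_bounds_checked(10, 91, len));
}
```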
🤖 Prompt for AI Agents
In parquet/src/file/reader.rs around lines 138 to 145, the bounds check uses
`start + length` which can overflow for usize and let an out-of-bounds read slip
through; replace that arithmetic with a checked operation (e.g., use
start.checked_add(length) and handle None as overflow) or rewrite the condition
as `length > self.len().saturating_sub(start)` / `start > self.len() || length >
self.len().wrapping_sub(start)` (prefer checked_add) so overflows are detected
and return the same EOF error when adding would overflow or the sum exceeds file
length.

Comment on lines +395 to +397
if offset > buffer.len() || offset > uncompressed_page_size {
return Err(general_err!("Invalid page header"));
}

⚠️ Potential issue | 🟡 Minor

Redundant condition in offset validation.

The check offset > uncompressed_page_size at line 395 is redundant. The earlier validation at lines 368-382 already ensures that definition_levels_byte_length + repetition_levels_byte_length <= uncompressed_page_size, and since offset is calculated from these same values (lines 383-385), the condition offset > uncompressed_page_size can never be true at this point.

The meaningful part of this validation is offset > buffer.len(), which correctly checks if the received buffer is too small.

Consider simplifying to:

-            if offset > buffer.len() || offset > uncompressed_page_size {
+            if offset > buffer.len() {
                 return Err(general_err!("Invalid page header"));
             }
🤖 Prompt for AI Agents
In parquet/src/file/serialized_reader.rs around lines 395 to 397, the condition
`offset > uncompressed_page_size` is redundant because earlier checks guarantee
the sum used to compute offset cannot exceed uncompressed_page_size; simplify
the validation by removing that redundant comparison and only check `offset >
buffer.len()` to detect a too-small buffer, preserving the existing error return
path and message.

Comment on lines +1149 to +1177
#[test]
fn test_decode_page_invalid_offset() {
let page_header = PageHeader {
r#type: PageType::DATA_PAGE_V2,
uncompressed_page_size: 10,
compressed_page_size: 10,
data_page_header: None,
index_page_header: None,
dictionary_page_header: None,
crc: None,
data_page_header_v2: Some(DataPageHeaderV2 {
num_nulls: 0,
num_rows: 0,
num_values: 0,
encoding: Encoding::PLAIN,
definition_levels_byte_length: 11,
repetition_levels_byte_length: 0,
is_compressed: None,
statistics: None,
}),
};

let buffer = Bytes::new();
let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err();
assert!(
err.to_string()
.contains("DataPage v2 header contains implausible values")
);
}

⚠️ Potential issue | 🟡 Minor

Test doesn't cover the new validation logic.

The test test_decode_page_invalid_offset validates the existing check at lines 368-382 (where definition_levels_byte_length + repetition_levels_byte_length > uncompressed_page_size), but doesn't reach the new validation added at lines 395-397.

To test the new code path at lines 395-397, the test would need to:

  1. Pass the first validation (set levels so their sum ≤ uncompressed_page_size)
  2. Provide an empty or small buffer so offset > buffer.len() triggers
  3. Include a decompressor

Consider renaming this test to reflect what it actually tests, or add a separate test that specifically validates line 395's offset > buffer.len() check.

@martin-augment
Owner Author

The fork's main branch was outdated, so this PR shows a larger diff than the actual change.
Closing to redo it.
