8728: fix: ArrayIter does not report size hint correctly after advancing from the iterator back#10

Closed
martin-augment wants to merge 19 commits into main from pr-8728-2025-10-28-06-36-21

Conversation

@martin-augment
Owner

8728: To review by AI

etseidl and others added 19 commits October 24, 2025 10:23
…che#8706)

# Which issue does this PR close?

N/A

# Rationale for this change

It is not obvious that the thrift macros produce public enums only (e.g.
see
apache#8680 (comment)).
This should be made clear in the documentation.

# What changes are included in this PR?

Add said clarification.

# Are these changes tested?

Documentation only, so no tests required.

# Are there any user-facing changes?

No, only changes to private documentation
# Which issue does this PR close?

- Closes apache#8691

# Rationale for this change

The `README.md` file in `arrow-avro` instructs users to install version
56. This is invalid and should be changed to version 57.

# What changes are included in this PR?

Updated the `README.md` file to reference version 57.

# Are these changes tested?

N/A since this is a small `README.md` file change.

# Are there any user-facing changes?

The `README.md` file in `arrow-avro` now instructs users to install
version 57.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

chore: add test case of `RowSelection::trim`

…DING_SLOTS (apache#8663)

# Which issue does this PR close?

- Closes [apache#8662]

# Rationale for this change

Related to apache#8607
We need to know how many encodings are supported in order to size the decoder slots.

# What changes are included in this PR?
Updated the `thrift_enum` macro to expose the variant count of the `Encoding` enum; the value is passed to `EncodingMask` and `ENCODING_SLOTS`.


# Are these changes tested?
1. Originally I think add a UT can prevent failure after the new
encoding are introduced, then I realized the counts are already
transferred, the UT is not required, the original tests can already
cover the code.

# Are there any user-facing changes?
No
- A follow up from apache#8625

# Rationale for this change

While working on a separate task, I noticed `create_test_variant_array`
was redundant. Since `VariantArray` can already be constructed directly
from an iterator of Variants, this PR removes the now-unnecessary test
helper.
# Which issue does this PR close?

- Closes apache#8685.

# What changes are included in this PR?

In the implementation of `RowConverter::from_binary`, the `BinaryArray`
is broken into parts and an attempt is made to convert the data buffer
into `Vec` at no copying cost with `Buffer::into_vec`. Only if this
fails, the data is copied out for a newly allocated `Vec`.

# Are these changes tested?

Passes existing tests using `RowConverter::from_binary`, which all
convert a non-shared buffer taking advantage of the optimization.
Another test is added to cover the copying path.

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes apache#8692.

# Rationale for this change

Explained in issue.

# What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for
this is identical to apache#8583

# Are these changes tested?

Covered by existing tests for `filter_record_batch`

# Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
# Which issue does this PR close?

- Part of apache#7806 
# Rationale for this change

The `DeltaBitPackDecoder` can panic if it encounters a bit width in the
encoded data that is larger than the bit width of the data type being
decoded.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…8680)

# Which issue does this PR close?

- Part of apache#5853.

# Rationale for this change

While converting to the new Thrift model, the `ConvertedType` enum was
done manually due to the `NONE` variant, which used the discriminant of
`0`. This PR changes that to `-1` which allows the `thrift_enum` macro
to be used instead. This improves code maintainability.

# What changes are included in this PR?

See above.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No, this only changes the discriminant value for a unit variant enum.
# Which issue does this PR close?

- Contributes towards the RunEndEncoded (REE) epic apache#3520, but there is no specific issue for casting.
- Replaces PRs apache#7713 and
apache#8384.

# Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache
Arrow.

# What changes are included in this PR?
- `run_end_encoded_cast` in `arrow-cast/src/cast/run_array.rs`
- `cast_to_run_end_encoded` in `arrow-cast/src/cast/run_array.rs`
- Tests in `arrow-cast/src/cast/mod.rs`

# Are these changes tested?
Yes!

# Are there any user-facing changes?

No breaking changes, just new functionality

---------

Co-authored-by: Richard Baah <richard.baah@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…erator (apache#8696)

# Which issue does this PR close?

N/A

# Rationale for this change

Overriding these functions improves performance over the fallback implementations.

# What changes are included in this PR?

Override implementation of:
- `count` which is not optimized away even when `ExactSizeIterator` is
implemented
- `nth` to avoid calling `next` `n + 1` times (which is also used when
doing `.skip`)
- `nth_back`
- `last`
- `max`

# Are these changes tested?

Yes, I've added a lot of tests

# Are there any user-facing changes?

Nope
# Rationale for this change

We've caused some unexpected panics from our internal testing. We've put
in error checks for all of these so that they don't affect other users.

# What changes are included in this PR?

Various error checks to ensure panics don't occur.

# Are these changes tested?
Tests should continue to pass.

Existing tests should cover these changes.

# Are there any user-facing changes?
None.

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close?

Part of apache#5375

Vortex was encountering some issues after we switched our preferred List
type to `ListView`, the first thing we noticed was that
`arrow_select::filter_array` would fail on ListView (and LargeListView,
though we don't use that).

This PR addresses some missing select kernel implementations for
ListView and LargeListView.

This also fixes an existing bug in the ArrayData validation for ListView
arrays that would trigger an out of bounds index panic.

# Are these changes tested?

- [x] filter_array
- [x] concat
- [x] take


# Are there any user-facing changes?

ListView/LargeListView can now be used with the `take`, `concat` and
`filter_array` kernels

You can now use `PartialEq` to compare ListView arrays.

---------

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
…ulation (apache#8671)

# Which issue does this PR close?

- Closes apache#8472.

# Rationale for this change

Makes the metadata heap size calculation more accurate when reading
encrypted Parquet files, which helps to better manage caches of Parquet
metadata.

# What changes are included in this PR?

* Accounts for heap allocations related to the `FileDecryptor` in
`ParquetMetaData`
* Fixes the `HeapSize` implementation for `Arc<T>` so the size of `T` is
included, as well as the reference counts that are stored on the heap
* Fixes the heap size of type pointers within `ColumnDescriptor` being
included twice

## Not included
* Accounting for any heap allocations in a user-provided `KeyRetriever`

# Are these changes tested?

Yes, there's a new unit test added that computes the heap size with a
decryptor.

I also did a manual test that created a test Parquet file with 100
columns using per-column encryption keys, and loaded 10,000 copies of
the `ParquetMetaData` into a vector. `heaptrack` reported 1.136 GB
memory heap allocated in this test program. Prior to this change, the
sum of the metadata was reported as 879.2 MB, and afterwards it was
1.136 GB.

# Are there any user-facing changes?

No

This was co-authored by @etseidl

---------

Co-authored-by: Ed Seidl <etseidl@live.com>
…n Parquet variant casts (apache#8689)

# Which issue does this PR close?

- Closes apache#8670.

# Rationale for this change

We currently have two separate code paths that both handle decimal
casting between different (precision, scale) pairs. Without unifying the
logic, a fix in one place often needs to be duplicated in the other
(e.g., apache#8579 fixed the
`arrow-cast`
and apache#8552 fixed the `parquet-variant-compute`), which can easily lead to
divergence when contributors lack full context. This PR consolidates the
decimal rescale logic for both `arrow-cast` and
`parquet-variant-compute`.

# What changes are included in this PR?

1. Extract the shared array-unary logic from
`convert_to_smaller_scale_decimal` and
`convert_to_bigger_or_equal_scale_decimal` into `apply_decimal_cast`
2. Move the rescale-closure creation into `make_upscaler` and
`make_downscaler` so that they can be used in `parquet-variant-compute`
3. Rework `rescale_decimal` in `parquet-variant-compute` to use the new
`make_upscaler` and `make_downscaler` utilities.

One challenge is incorporating the large-scale reduction path (aka the
`delta_scale` cannot fit into `I::MAX_PRECISION`) into `make_downscaler`
without hurting performance. Returning 0 directly is usually cheaper
than applying a unary operation to return zero. Therefore,
`make_downscaler` may return None, and it is the caller’s responsibility
to handle this case appropriately based on the documented behavior.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No
Closes apache#8709. 

Adds bench `cast_ree` which can be run with `cargo bench --bench
cast_ree`.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he underlying bit value (apache#8711)

# Which issue does this PR close?

- closes apache#8721

# Rationale for this change

Mask `nulls` should be treated as `false` (even if the underlying values are not 0), as described in the docs for `zip`.

# What changes are included in this PR?

Used `prep_null_mask_filter` before iterating over the mask, and added tests for both scalar and non-scalar cases (to prepare for apache#8653).

# Are these changes tested?
Yes

# Are there any user-facing changes?

Kinda
…from the iterator back

this also adds a LOT of tests, extracted from apache#8697 (which is how I found that bug)
@coderabbitai

coderabbitai bot commented Oct 28, 2025

Walkthrough

This pull request introduces ListView/ListViewArray support across data operations, adds RunEndEncoded casting capabilities, refactors decimal casting with builder patterns, enhances iterator trait implementations, improves parquet error handling and validation, adds HeapSize implementations for encryption types, and updates macro-generated thrift enum constants.

Changes

Cohort / File(s) Summary
Iterator Clone and Trait Enhancements
arrow-array/src/iterator.rs, arrow-buffer/src/util/bit_iterator.rs
ArrayIter and BitIterator now derive Clone. BitIterator implements additional Iterator methods (count, nth, last, max) and DoubleEndedIterator method (nth_back) with corresponding test coverage.
ListView Equality and Transformation
arrow-data/src/equal/list_view.rs, arrow-data/src/equal/mod.rs, arrow-data/src/transform/list_view.rs, arrow-data/src/transform/mod.rs
New modules implementing ListView equality comparison and array transformation support. Routes ListView and LargeListView through new list_view_equal and extend functions in transform pipeline.
ListView Selection Operations
arrow-select/src/concat.rs, arrow-select/src/take.rs, arrow-select/src/filter.rs
Adds concat_list_view, take_list_view, and filter_list_view functions to support ListView/LargeListView concatenation, selection, and filtering with offset and size handling.
ListView Testing and Selection Utilities
arrow/tests/array_equal.rs, arrow-select/src/zip.rs
Introduces ListView test helpers and array equality validation. Updates zip mask preprocessing to treat nulls as false via BooleanBuffer preparation.
Decimal Casting Refactoring
arrow-cast/src/cast/decimal.rs, arrow-cast/Cargo.toml
Introduces make_upscaler and make_downscaler builders returning fallible/infallible closure pairs. Adds public rescale_decimal function and apply_rescaler/apply_decimal_cast helpers. Updates convert_to_smaller_scale_decimal and convert_to_bigger_or_equal_scale_decimal to use new builder patterns. Adds arrow-ord dependency.
RunEndEncoded Casting Support
arrow-cast/src/cast/mod.rs, arrow-cast/src/cast/run_array.rs
Introduces RunEndEncoded support in can_cast_types and cast_with_options. New run_array module implements run_end_encoded_cast and cast_to_run_end_encoded for bidirectional casting between RunEndEncoded and other types. Publicly exports rescale_decimal.
RunEndEncoded Benchmarks
arrow/benches/cast_kernels.rs
Adds benchmarks for casting to RunEndEncoded from StringArray and Int32Array with various run patterns.
Data Buffer Validation
arrow-data/src/data.rs, arrow-row/src/lib.rs
Adds length matching check in validate_offsets_and_sizes for offsets and sizes buffers. Updates from_binary to use buffer deconstructing via into_parts with fallback copy strategy.
Parquet Enum Refactoring
parquet/src/basic.rs, parquet/src/parquet_macros.rs
ConvertedType enum rewritten with explicit integer discriminants. thrift_enum macro extended with VARIANTS and MAX_DISCRIMINANT public constants. EncodingMask updated to use MAX_DISCRIMINANT.
Parquet Error Handling and Validation
parquet/src/column/reader.rs, parquet/src/encodings/decoding.rs, parquet/src/encodings/rle.rs, parquet/src/file/serialized_reader.rs, parquet/tests/arrow_reader/bad_data.rs
Adds bounds checking in parse_v1_level, DictDecoder.set_data, and DeltaBitPackDecoder. Introduces check_bit_width validation. Updates decode_page for offset validation and replaces unimplemented page type with error. Adds diagnostic message to RLE decoding. Updates test assertions for error paths.
Parquet HeapSize Implementations
parquet/src/encryption/ciphers.rs, parquet/src/encryption/decrypt.rs, parquet/src/file/metadata/memory.rs, parquet/src/file/metadata/mod.rs, parquet/src/schema/types.rs
Adds HeapSize trait implementations for BlockDecryptor, RingGcmBlockDecryptor, encryption-related types (ExplicitDecryptionKeys, DecryptionKeys, FileDecryptionProperties, FileDecryptor), HashMap, Arc, and Arc<dyn HeapSize>. Updates memory_size accounting for encryption and test expectations.
Parquet Page and Column Improvements
parquet/src/column/page.rs, parquet/src/column/reader/decoder.rs, parquet/src/file/reader.rs
Page enum now derives Debug. ENCODING_SLOTS updated to use MAX_DISCRIMINANT. BytesChunkReader.get_read and get_bytes add EOF error returns for out-of-bounds access.
Parquet Reader and Selection Tests
parquet/src/arrow/arrow_reader/selection.rs, parquet/src/file/serialized_reader.rs
Adds unit tests for RowSelection.trim() and decode_page validation. Introduces DataPageHeaderV2 import for test construction.
Documentation and Version Updates
parquet/THRIFT.md, arrow-avro/README.md
Adds visibility notes to thrift_enum and thrift_union macro documentation. Bumps arrow-avro version references from 56 to 57 in README code blocks.
Variant Compute Updates
parquet-variant-compute/src/type_conversion.rs, parquet-variant-compute/src/shred_variant.rs
Removes local rescale_decimal implementation and switches to external import. Refactors test helpers in shred_variant to use VariantArray::from_iter directly.

@claude

claude bot commented Oct 28, 2025

Pull Request Review

Summary

This PR fixes a bug in ArrayIter where the size_hint() method was not correctly reporting the remaining iterator length after consuming elements from either end. The fix changes the implementation from using array.len() to using current_end, and includes comprehensive test coverage.

Code Quality ✅

Strengths:

  • Clean implementation: The fix is minimal and focused - changing only 2 lines in the core implementation (lines 101-102 in arrow-array/src/iterator.rs)
  • Good abstraction: Added Clone derive to ArrayIter which is essential for the testing approach
  • Well-documented: The existing documentation remains accurate and the code is self-explanatory

Minor observations:

  • The core fix is simple and correct: self.current_end - self.current properly tracks remaining elements
  • Safety comments in the implementation are appropriate and helpful

Test Coverage ⭐ EXCELLENT

This PR demonstrates exceptional test coverage with 930 new lines of tests:

Comprehensive test scenarios:

  • Multiple data patterns (no nulls, all nulls, partial nulls, random patterns)
  • Various iterator consumption patterns (from start, from end, both ends, until exhausted)
  • All major iterator trait methods tested:
    • next() / next_back()
    • nth() / nth_back()
    • position() / rposition()
    • find() / rfind()
    • count(), last(), any(), all()
    • for_each(), fold() / rfold()

Testing approach highlights:

  • Uses a trait-based abstraction (SharedBetweenArrayIterAndSliceIter) to test ArrayIter behavior against Rust slice iterators
  • Tests verify both return values AND remaining iterator state after operations
  • Edge cases thoroughly covered (empty iterators, single-element, all consumed, etc.)

Potential Issues 🔍

1. Test complexity trade-off:
The test code is quite complex with multiple trait abstractions (ConsumingArrayIteratorOp, MutatingArrayIteratorOp, SetupIter). While this provides excellent coverage, consider:

  • Add more inline comments explaining the test framework structure
  • The test complexity might make future maintenance harder - future contributors may find it difficult to understand or modify
  • Consider extracting a test helper module if this pattern is reused

2. Missing documentation:

  • No doc comment added to the Clone derive explaining why cloning is safe/desired
  • Consider adding a brief comment about the current_end field's role in the struct

3. Potential test data generation issue:

let no_nulls_random_values = (0..10)
    .map(|_| rng.random::<i32>())
    .map(Some)
    .collect::<Vec<Option<i32>>>();

While using a seeded RNG (line 274) ensures determinism, consider whether fixed test cases would be clearer for debugging failures.

Performance Considerations ✅

Positive:

  • The fix has zero performance impact - it changes which field is read, not the logic flow
  • size_hint() is now more accurate, which may improve performance of downstream operations that rely on accurate size hints (like collect() pre-allocation)

No concerns:

  • No additional allocations introduced
  • No algorithmic complexity changes
  • The Clone implementation is trivial (just copies a few integers and a reference)

Security Concerns ✅

No security issues identified:

  • No unsafe code changes
  • Existing safety comments remain valid
  • No potential for buffer overflows or memory safety issues
  • Input validation remains unchanged

Recommendations

High Priority

None - the core fix is solid

Medium Priority

  1. Add documentation: Consider adding a doc comment to explain the Clone trait:

    /// ArrayIter can be safely cloned as it only contains references
    /// and simple copy types (usize indices)
    #[derive(Debug, Clone)]
    pub struct ArrayIter<T: ArrayAccessor> {
  2. Test documentation: Add a module-level doc comment in the test section explaining the testing strategy:

    /// Tests validate ArrayIter behavior matches std::slice::Iter by comparing
    /// results across various consumption patterns and iterator operations.
    

Low Priority

  1. Test organization: Consider breaking the 930-line test module into sub-modules:

    • test_helpers - trait definitions and utilities
    • test_basic_iteration - next/next_back tests
    • test_consuming_operations - count/last/fold tests
    • test_mutating_operations - position/find/any tests
  2. Test readability: Some test names could be more descriptive:

  • get_int32_iterator_cases() → create_test_cases_with_various_null_patterns()

Verdict: ✅ APPROVE with minor suggestions

This is a high-quality PR that:

  • Fixes a legitimate bug correctly
  • Includes exceptional test coverage
  • Has no performance or security concerns
  • Follows project conventions

The comprehensive testing significantly strengthens confidence in the fix. The suggestions above are minor improvements that would not block merging.


Review generated by Claude Code

@github-actions

Review

  • I didn’t spot any functional bugs or regressions in the changes. The new iterator tests are thorough, and the refactors to the selection and transform kernels look consistent with the previous semantics.

Residual risks / follow-up

  • This PR touches a wide set of kernels (filter/take/zip, list view handling, etc.). It would be worth running the full Arrow test matrix, especially any integration tests that exercise list views, dictionary filtering, and parquet readers, just to be safe.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
parquet/src/encodings/decoding.rs (1)

827-837: Error message uses the wrong expected count variable.

Use mini_block_to_skip (expected) rather than mini_block_batch_size to avoid misleading diagnostics.

-                return Err(general_err!(
-                    "Expected to skip {} values from mini block got {}.",
-                    mini_block_batch_size,
-                    skip_count
-                ));
+                return Err(general_err!(
+                    "Expected to skip {} values from mini block got {}.",
+                    mini_block_to_skip,
+                    skip_count
+                ));
arrow-array/src/iterator.rs (2)

90-95: Safety comment is incorrect (initial cursor invariant)

Comment says “initialized with current = array.len()”, but new() sets current = 0 and current_end = len. Please fix to avoid misleading future maintainers.

-            // this is safe on the premise that this struct is initialized with
-            // current = array.len()
-            // and that current_end is ever only decremented
+            // This is safe because the iterator is initialized with:
+            //   current = 0 and current_end = array.len()
+            // We only increment `current` and only decrement `current_end`,
+            // and we check bounds (current < current_end) before access.

116-121: Repeat the safety comment fix for next_back

Same incorrect invariant appears here; please update consistently.

-                // this is safe on the premise that this struct is initialized with
-                // current = array.len()
-                // and that current_end is ever only decremented
+                // This is safe because the iterator is initialized with:
+                //   current = 0 and current_end = array.len()
+                // We only increment `current` and only decrement `current_end`,
+                // and we check bounds (current < current_end) before access.
🧹 Nitpick comments (32)
arrow-data/src/data.rs (1)

983-989: Defensive length check, but consider error type consistency.

Since both typed_buffer calls request self.len elements and return slices of exactly that length, this check should always pass. However, it serves as useful defensive programming.

Consider using InvalidArgumentError instead of ComputeError for consistency with other data validation errors in this file (see lines 984, 993, 1009).

Apply this diff if you prefer consistency:

-            return Err(ArrowError::ComputeError(format!(
+            return Err(ArrowError::InvalidArgumentError(format!(
                 "ListView offsets len {} does not match sizes len {}",
parquet-variant-compute/src/shred_variant.rs (2)

440-443: Consider consistent wrapping for clarity.

While functionally correct, this test uses Variant::from(...) directly, whereas line 338 uses Some(Variant::from(...)). Both patterns work with from_iter, but consistency would improve readability.

Apply this diff for consistency with line 338:

-        let input = VariantArray::from_iter(vec![
-            Variant::from(42i32),
-            Variant::from(3.15f64),
-            Variant::from("not_a_number"),
-        ]);
+        let input = VariantArray::from_iter(vec![
+            Some(Variant::from(42i32)),
+            Some(Variant::from(3.15f64)),
+            Some(Variant::from("not_a_number")),
+        ]);

874-874: Consider consistent wrapping for clarity.

Same observation as lines 440-443: this uses Variant::from(...) directly while line 338 uses Some(Variant::from(...)). Consider using consistent wrapping throughout the test file.

Apply this diff for consistency:

-        let input = VariantArray::from_iter(vec![Variant::from(42i64), Variant::from("hello")]);
+        let input = VariantArray::from_iter(vec![Some(Variant::from(42i64)), Some(Variant::from("hello"))]);
parquet/src/file/reader.rs (2)

292-321: Consider adding edge-case tests.

The current tests verify basic out-of-bounds scenarios effectively. However, additional test cases would improve coverage:

  • Boundary case: start == data.len() (should succeed for get_read)
  • Valid range: get_bytes(0, 4) or get_bytes(2, 2) where start + length == data.len()
  • Overflow scenario: very large length values
  • Zero-length reads: get_bytes(0, 0) or get_bytes(4, 0)

Example additional tests:

#[test]
fn test_bytes_chunk_reader_edge_cases() {
    let data = Bytes::from(vec![0, 1, 2, 3]);
    
    // Reading from end position should succeed
    let result = data.get_read(4);
    assert!(result.is_ok());
    
    // Valid ranges should succeed
    assert!(data.get_bytes(0, 4).is_ok());
    assert!(data.get_bytes(2, 2).is_ok());
    
    // Zero-length reads
    assert!(data.get_bytes(0, 0).is_ok());
    assert!(data.get_bytes(4, 0).is_ok());
}

122-148: Add bounds checking to test utility get_bytes methods for consistency.

The production Bytes implementation now includes defensive bounds checking (reader.rs 138-148), but the test utilities still lack validation:

  • parquet/tests/arrow_reader/io/sync_reader.rs:413-418 — slices directly without validating start + length
  • parquet/tests/arrow_reader/io/async_reader.rs:373-383 — slices directly without validating range bounds

Both will panic on invalid ranges instead of returning errors like the production code. Adding the same defensive pattern would improve consistency and provide clearer error messages in tests.

parquet/src/schema/types.rs (1)

1353-1370: Non-root group must have repetition + propagate build errors

  • Enforcing repetition for non-root types matches the parquet spec; good guard.
  • Switching to builder.build()? avoids panics and surfaces conversion errors.

Recommend also changing the zero-children group branch to use ? for consistency (currently unwrap):

-                Ok((index + 1, Arc::new(builder.build().unwrap())))
+                Ok((index + 1, Arc::new(builder.build()?)))

Consider adding a small test that a non-root group lacking repetition returns the expected error.

parquet/src/encodings/rle.rs (1)

500-520: Avoid assert! in library decode path; propagate an error instead.

The new assert panics in release builds. Prefer returning a descriptive error from reload and plumbing it to callers (get/get_batch/skip) so bad inputs don’t abort the process.

Example (sketch):

-    fn reload(&mut self) -> bool {
+    fn reload(&mut self) -> Result<bool> {
       let bit_reader = self.bit_reader.as_mut().expect("bit_reader should be set");
-      if let Some(indicator_value) = bit_reader.get_vlq_int() {
+      if let Some(indicator_value) = bit_reader.get_vlq_int() {
         ...
-        self.current_value = bit_reader.get_aligned::<u64>(value_width);
-        assert!(self.current_value.is_some(), "parquet_data_error: not enough data for RLE decoding");
+        self.current_value = bit_reader.get_aligned::<u64>(value_width);
+        if self.current_value.is_none() {
+            return Err(eof_err!("Not enough data for RLE decoding"));
+        }
         ...
-        true
+        Ok(true)
       } else {
-        false
+        Ok(false)
       }
     }

And update call sites: if !self.reload()? { ... } and in set_data: let _ = self.reload()?;.

parquet/THRIFT.md (1)

56-71: Docs clarity: note about pub visibility is helpful.

Consider linking directly to the macro definitions (parquet/src/parquet_macros.rs) for quick reference.

Also applies to: 88-89

parquet/tests/arrow_reader/bad_data.rs (1)

88-93: Make test resilient to error wording.

Exact string equality is brittle. Prefer contains/starts_with or matching the error type.

-    assert_eq!(
-        err.to_string(),
-        "External: Parquet argument error: Parquet error: Invalid or corrupted RLE bit width 254. Max allowed is 32"
-    );
+    let s = err.to_string();
+    assert!(
+        s.contains("Invalid or corrupted RLE bit width 254"),
+        "unexpected error: {s}"
+    );
parquet/src/encodings/decoding.rs (2)

645-657: Bit-width validator is appropriate; minor polish.

Great addition. Nit: remove the trailing space in the error message for consistency.

-            return Err(general_err!(
-                "Invalid delta bit width {} which is larger than expected {} ",
-                bit_width,
-                std::mem::size_of::<T::T>() * 8
-            ));
+            return Err(general_err!(
+                "Invalid delta bit width {} which is larger than expected {}",
+                bit_width,
+                std::mem::size_of::<T::T>() * 8
+            ));

751-753: Validate miniblock bit-widths at use sites — good.

Calling check_bit_width in both get and skip catches invalid pages early. Consider also validating all widths once in next_block() to fail fast.

Also applies to: 822-824
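A minimal sketch of that fail-fast idea, with illustrative names (`check_bit_width` mirrors the existing validator; `value_bits` stands in for `size_of::<T::T>() * 8`):

```rust
// Hedged sketch: validate every miniblock bit width when the block header is
// read, instead of deferring to the first get/skip that touches a miniblock.
// `value_bits` stands in for size_of::<T::T>() * 8; names are illustrative.
fn check_bit_width(bit_width: u8, value_bits: usize) -> Result<(), String> {
    if bit_width as usize > value_bits {
        Err(format!(
            "Invalid delta bit width {bit_width} which is larger than expected {value_bits}"
        ))
    } else {
        Ok(())
    }
}

// Called once per block, e.g. from a next_block()-style method.
fn validate_block_widths(widths: &[u8], value_bits: usize) -> Result<(), String> {
    widths.iter().try_for_each(|&w| check_bit_width(w, value_bits))
}
```

With up-front validation, a corrupt page fails at header-read time and the per-call checks in `get`/`skip` become a second line of defense.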

parquet/src/basic.rs (1)

65-69: ConvertedType via thrift_enum looks good; doc nit.

The enum definition with explicit discriminants (including NONE = -1) matches usage across conversions and tests. Nit: “This struct was renamed…” → “This enum was renamed…”.

-/// This struct was renamed from `LogicalType` in version 4.0.0.
+/// This enum was renamed from `LogicalType` in version 4.0.0.

Also applies to: 75-169

arrow-array/src/iterator.rs (3)

1186-1190: Seed RNG for deterministic tests

Use a locally seeded StdRng to avoid nondeterminism across test runs (requires `rand::Rng`, `rand::SeedableRng`, and `rand::rngs::StdRng` in scope).

-        let random_values = (0..100).map(|_| rand::random_bool(0.5)).collect::<Vec<_>>();
+        let mut rng = StdRng::seed_from_u64(1);
+        let random_values = (0..100)
+            .map(|_| rng.random_bool(0.5))
+            .collect::<Vec<_>>();

986-988: Nit: message string for AllOp

The label says “false” but this tests all(); suggest “true”.

-            fn name(&self) -> String {
-                format!("all with {} false returned", self.true_count)
-            }
+            fn name(&self) -> String {
+                format!("all with {} true returned", self.true_count)
+            }

149-149: Add targeted size_hint regression test for ArrayIter

Locks the fix against regressions by asserting size_hint after mixed next/next_back.

+    #[test]
+    fn assert_array_iter_size_hint_after_next_back_and_next() {
+        let a = Int32Array::from(vec![Some(1), None, Some(3), Some(4)]);
+        let mut it = ArrayIter::new(&a);
+        assert_eq!(it.size_hint(), (4, Some(4)));
+        assert_eq!(it.next_back(), Some(Some(4)));
+        assert_eq!(it.size_hint(), (3, Some(3)));
+        assert_eq!(it.next(), Some(Some(1)));
+        assert_eq!(it.size_hint(), (2, Some(2)));
+        // consume remaining
+        let _: Vec<_> = it.collect();
+        assert_eq!(it.size_hint(), (0, Some(0)));
+    }
arrow-buffer/src/util/bit_iterator.rs (1)

156-175: nth_back(): correct; tweak comment wording

Logic matches std semantics, including exhausting on out-of-range. Minor doc nit: subtract from end_offset, not “current offset”.

-        // so subtracting n to the current offset and not n - 1
+        // so subtracting n from end_offset (and not n - 1)
arrow-select/src/zip.rs (2)

131-133: Treating NULL as false is correct; use borrowed buffer when possible

Current flow clones the values buffer even when mask has no nulls. Borrowing avoids an allocation.

-    let mask = maybe_prep_null_mask_filter(mask);
-    SlicesIterator::from(&mask).for_each(|(start, end)| {
+    // Borrow when no nulls, allocate only if needed
+    if mask.null_count() == 0 {
+        SlicesIterator::from(mask.values()).for_each(|(start, end)| {
+            // body unchanged
+        });
+    } else {
+        let cleaned = maybe_prep_null_mask_filter(mask);
+        SlicesIterator::from(&cleaned).for_each(|(start, end)| {
+            // body unchanged
+        });
+    }

Note: requires changing maybe_prep_null_mask_filter to return BooleanBuffer and keeping both paths.


171-180: Co-locate mask-prep helper with filtering utilities

maybe_prep_null_mask_filter duplicates filter semantics; consider moving it into crate::filter next to prep_null_mask_filter and re-exporting for reuse.

arrow-select/src/concat.rs (2)

244-256: Allocate offsets buffer by bytes to reduce reallocations

MutableBuffer::with_capacity expects bytes; the current capacity is in elements. Precompute the element count and multiply by size_of::<OffsetSize>().

-    let mut offsets = MutableBuffer::with_capacity(lists.iter().map(|l| l.offsets().len()).sum());
+    let offset_len: usize = lists.iter().map(|l| l.offsets().len()).sum();
+    let mut offsets =
+        MutableBuffer::with_capacity(offset_len * std::mem::size_of::<OffsetSize>());

This avoids repeated growth on large inputs.


238-253: Optional: avoid carrying unused child values for sliced ListView inputs

For sliced ListView arrays, values() returns the full child; concatenating full children increases memory. Consider computing the minimal [min_offset, max_offset) used per input and concatenating sliced children instead, adjusting offsets by (offset - min_offset) + global_offset.
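The suggested trimming can be sketched as follows, under the assumption that offsets and sizes are plain `i32` slices (the real code would work with the generic `OffsetSize` and arrow buffer types):

```rust
// Hedged sketch: compute the minimal child range [min_offset, max_offset)
// actually referenced by a ListView, and rebase offsets against it so only
// that slice of the child needs to be concatenated.
fn trim_child_range(offsets: &[i32], sizes: &[i32]) -> (i32, i32, Vec<i32>) {
    let mut min_off = i32::MAX;
    let mut max_off = 0;
    for (&o, &s) in offsets.iter().zip(sizes) {
        if s > 0 {
            min_off = min_off.min(o);
            max_off = max_off.max(o + s);
        }
    }
    if min_off == i32::MAX {
        min_off = 0; // every list empty: nothing referenced
    }
    // the per-input global_offset for concatenation would be added on top
    let rebased = offsets.iter().map(|&o| o - min_off).collect();
    (min_off, max_off, rebased)
}
```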

arrow/tests/array_equal.rs (1)

865-872: Add a guard test for size mismatch on non-null ListView elements

To catch the null-path size check bug (now fixed), add a case where sizes differ for a non-null slot.

+#[test]
+fn test_list_view_size_mismatch_is_unequal() {
+    // a: [[1,2,3]], b: [[1,2,3,4]] -> unequal
+    let a = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3)])]);
+    let b = create_list_view_array::<i32, _, _>([Some(vec![Some(1), Some(2), Some(3), Some(4)])]);
+    test_equal(&a, &b, false);
+}

This ensures per-element size equality is enforced.

arrow/benches/cast_kernels.rs (1)

363-401: Optional: factor target type construction.
DRY the RunEndEncoded<Int32, T> construction with a small helper to avoid duplication across benches.

arrow-cast/src/cast/mod.rs (5)

144-146: Guard can_cast_types for valid RunEndEncoded index types

Today this can return true even if run_ends uses an unsupported index type (leading to can_cast_types=true but cast failing later). Add an index-type filter to keep capability checks aligned with cast behavior.

-        (RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type),
-        (_, RunEndEncoded(_, value_type)) => can_cast_types(from_type, value_type.data_type()),
+        (RunEndEncoded(index_type, value_type), _)
+            if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) =>
+        {
+            can_cast_types(value_type.data_type(), to_type)
+        }
+        (_, RunEndEncoded(index_type, value_type))
+            if matches!(index_type.data_type(), DataType::Int16 | DataType::Int32 | DataType::Int64) =>
+        {
+            can_cast_types(from_type, value_type.data_type())
+        }

11452-11467: Fix misleading test comment and variable name

The test casts REE -> Primitive Int64 and downcasts to StringArray/Int64Array, not a RunArray. Tweak the comments/variable name for clarity.

-    // Cast to Int64
+    // Cast to Int64 (flatten REE to a primitive array)
@@
-    // Verify the result is a RunArray with Int64 values
-    let result_run_array = cast_result.as_any().downcast_ref::<Int64Array>().unwrap();
+    // Verify the result is a flat Int64Array with expanded values
+    let result_values = cast_result.as_any().downcast_ref::<Int64Array>().unwrap();
@@
-            result_run_array.values(),
+            result_values.values(),

11469-11485: Tighten test comment wording

Result is a StringArray, not a RunArray; update comment to avoid confusion.

-    // Verify the result is a RunArray with String values
+    // Verify the result is a StringArray with expanded values

11717-11734: Align test comment with intent

This test checks Struct -> FixedSizeBinary cast failure; remove the unrelated RunEndEncoded note in the comment.

-        // This should fail because:
-        // 1. The target type is not RunEndEncoded
-        // 2. The target type is not supported for casting from StructArray
+        // This should fail because casting from StructArray to FixedSizeBinary is not supported

11548-11556: Strengthen null-run assertions

You already assert null_count on values(). Also assert which runs are null to make intent explicit.

         assert_eq!(result_run_array.values().null_count(), 3);
+        // Optionally assert positions of null runs in the values array
+        let vals = result_run_array.values().as_primitive::<Int32Type>();
+        assert!(vals.is_null(1)); // run representing consecutive nulls
+        assert!(vals.is_null(4));
+        assert!(vals.is_null(7));
parquet/src/encryption/ciphers.rs (1)

54-59: HeapSize for RingGcmBlockDecryptor returns 0 — add a note or guard for ring changes.
If ring ever allocates for key schedules, this will under‑report. Consider a brief comment referencing the specific ring version tested.

parquet/src/file/metadata/memory.rs (1)

54-94: HashMap heap sizing: make group_size arch‑aware to reduce test brittleness.
Current 16-byte group size can be smaller on 32‑bit, causing platform‑dependent results. Consider:

-        // Overhead for the control tags group, which may be smaller depending on architecture
-        let group_size = 16;
+        // Overhead for the control tags group; 16 on 64‑bit, typically 8 on 32‑bit
+        let group_size = if cfg!(target_pointer_width = "64") { 16 } else { 8 };

Also keep the “approximation” disclaimer; update if hashbrown behavior changes.

parquet/src/file/metadata/mod.rs (3)

1884-1887: Avoid brittle exact byte constants in tests.
Given HashMap/Arc sizing is approximate and arch‑dependent, strict equality to 2766/2934 may fail cross‑platform or with std/hashbrown changes. Prefer arch‑aware guards or relative assertions (e.g., >= previous baseline), or compute expected from the same sizing helpers.

I can propose a patch to make these assertions resilient if you prefer.


1915-1918: Same concern: platform‑sensitive constants.
Use arch‑aware cfg or relative checks to prevent flakes.


1924-1997: New decryptor memory test: prefer resilient assertions.
Exact 2058/3072 deltas are likely to drift across platforms and versions. Suggest asserting that adding a decryptor increases memory_size by a positive amount and by at least the decryptor’s reported HeapSize, instead of hard‑coding totals.

I can draft a version that derives expectations from heap_size() calls to avoid magic numbers.
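One way to phrase such a resilient assertion, sketched with hypothetical stand-ins (the `FileDecryptor` struct and its `heap_size` here are illustrative, not the crate's actual definitions):

```rust
// Hedged sketch: derive the expectation from heap_size() instead of
// hard-coding platform-sensitive byte totals.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

// Hypothetical stand-in for the real decryptor type.
struct FileDecryptor {
    footer_key: Vec<u8>,
}

impl HeapSize for FileDecryptor {
    // len() used as a deterministic stand-in for capacity-based accounting
    fn heap_size(&self) -> usize {
        self.footer_key.len()
    }
}

// Adding a component should grow the total by at least its own heap size.
fn grew_by_component(before: usize, after: usize, component: &impl HeapSize) -> bool {
    after > before && after - before >= component.heap_size()
}
```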

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3b262 and ae5e961.

📒 Files selected for processing (38)
  • arrow-array/src/iterator.rs (4 hunks)
  • arrow-avro/README.md (1 hunks)
  • arrow-buffer/src/util/bit_iterator.rs (5 hunks)
  • arrow-cast/Cargo.toml (2 hunks)
  • arrow-cast/src/cast/decimal.rs (4 hunks)
  • arrow-cast/src/cast/mod.rs (6 hunks)
  • arrow-cast/src/cast/run_array.rs (1 hunks)
  • arrow-data/src/data.rs (1 hunks)
  • arrow-data/src/equal/list_view.rs (1 hunks)
  • arrow-data/src/equal/mod.rs (3 hunks)
  • arrow-data/src/transform/list_view.rs (1 hunks)
  • arrow-data/src/transform/mod.rs (6 hunks)
  • arrow-row/src/lib.rs (2 hunks)
  • arrow-select/src/concat.rs (11 hunks)
  • arrow-select/src/filter.rs (7 hunks)
  • arrow-select/src/take.rs (5 hunks)
  • arrow-select/src/zip.rs (4 hunks)
  • arrow/benches/cast_kernels.rs (1 hunks)
  • arrow/tests/array_equal.rs (2 hunks)
  • parquet-variant-compute/src/shred_variant.rs (4 hunks)
  • parquet-variant-compute/src/type_conversion.rs (1 hunks)
  • parquet/THRIFT.md (3 hunks)
  • parquet/src/arrow/arrow_reader/selection.rs (1 hunks)
  • parquet/src/basic.rs (2 hunks)
  • parquet/src/column/page.rs (1 hunks)
  • parquet/src/column/reader.rs (2 hunks)
  • parquet/src/column/reader/decoder.rs (1 hunks)
  • parquet/src/encodings/decoding.rs (6 hunks)
  • parquet/src/encodings/rle.rs (1 hunks)
  • parquet/src/encryption/ciphers.rs (3 hunks)
  • parquet/src/encryption/decrypt.rs (5 hunks)
  • parquet/src/file/metadata/memory.rs (2 hunks)
  • parquet/src/file/metadata/mod.rs (3 hunks)
  • parquet/src/file/reader.rs (2 hunks)
  • parquet/src/file/serialized_reader.rs (4 hunks)
  • parquet/src/parquet_macros.rs (5 hunks)
  • parquet/src/schema/types.rs (2 hunks)
  • parquet/tests/arrow_reader/bad_data.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (26)
arrow-array/src/iterator.rs (1)
arrow-array/src/array/mod.rs (19)
  • iter (601-601)
  • iter (609-611)
  • iter (618-620)
  • iter (632-632)
  • iter (636-638)
  • iter (641-643)
  • array (944-944)
  • array (953-953)
  • array (962-962)
  • array (970-970)
  • array (980-980)
  • array (994-994)
  • array (1008-1008)
  • array (1020-1020)
  • array (1042-1042)
  • array (1132-1135)
  • len (155-155)
  • len (366-368)
  • len (445-447)
parquet/src/encryption/ciphers.rs (3)
parquet/src/encryption/decrypt.rs (4)
  • heap_size (276-278)
  • heap_size (301-310)
  • heap_size (358-360)
  • heap_size (583-587)
parquet/src/file/metadata/memory.rs (6)
  • heap_size (41-41)
  • heap_size (45-51)
  • heap_size (55-93)
  • heap_size (97-100)
  • heap_size (104-108)
  • heap_size (112-114)
parquet/src/file/metadata/mod.rs (1)
  • heap_size (931-933)
parquet/src/file/metadata/mod.rs (2)
parquet/src/encryption/decrypt.rs (4)
  • builder (365-367)
  • aad_prefix (378-380)
  • footer_key (389-397)
  • column_key (400-418)
parquet/src/encryption/encrypt.rs (5)
  • builder (103-105)
  • aad_file_unique (330-332)
  • aad_prefix (138-140)
  • footer_key_metadata (113-115)
  • footer_key (118-120)
arrow-cast/src/cast/run_array.rs (2)
arrow-cast/src/cast/mod.rs (25)
  • cast (345-347)
  • run_end_encoded_cast (799-799)
  • run_end_encoded_cast (800-800)
  • run_end_encoded_cast (801-801)
  • array (375-375)
  • array (381-381)
  • array (389-389)
  • array (395-395)
  • array (411-411)
  • array (413-413)
  • array (424-424)
  • array (427-427)
  • array (438-438)
  • array (455-467)
  • array (479-481)
  • array (508-524)
  • array (535-535)
  • array (545-546)
  • array (550-551)
  • array (555-556)
  • cast_with_options (751-2202)
  • None (9986-9986)
  • cast_to_run_end_encoded (809-809)
  • cast_to_run_end_encoded (814-814)
  • cast_to_run_end_encoded (819-819)
arrow-ord/src/partition.rs (1)
  • partition (127-153)
arrow-data/src/equal/list_view.rs (2)
arrow-data/src/data.rs (3)
  • count_nulls (51-62)
  • len (402-404)
  • len (1942-1945)
arrow-data/src/equal/mod.rs (4)
  • equal (161-166)
  • equal_values (62-129)
  • list_view_equal (109-109)
  • list_view_equal (110-110)
arrow/tests/array_equal.rs (2)
arrow-select/src/take.rs (3)
  • take (86-102)
  • data (1026-1028)
  • indices (702-713)
arrow-array/src/array/list_view_array.rs (2)
  • Buffer (1097-1097)
  • test_list_view_array (586-619)
arrow-select/src/zip.rs (1)
arrow-select/src/filter.rs (10)
  • filter (155-167)
  • filter (314-316)
  • filter (1459-1461)
  • filter (1489-1491)
  • prep_null_mask_filter (122-126)
  • from (66-68)
  • new (60-62)
  • new (89-93)
  • new (217-231)
  • new (636-653)
parquet/src/schema/types.rs (2)
parquet/src/record/reader.rs (1)
  • repetition (534-545)
parquet_derive/src/parquet_field.rs (1)
  • repetition (751-757)
arrow/benches/cast_kernels.rs (1)
arrow-cast/src/cast/mod.rs (18)
  • c (7821-7821)
  • b (4154-4154)
  • b (4166-4166)
  • b (4196-4196)
  • b (4207-4207)
  • b (4224-4224)
  • b (4249-4249)
  • b (4277-4277)
  • b (4343-4343)
  • b (4355-4355)
  • b (4367-4367)
  • b (4452-4452)
  • b (4462-4462)
  • b (4472-4472)
  • b (4482-4482)
  • b (4492-4492)
  • b (4751-4751)
  • cast (345-347)
arrow-row/src/lib.rs (2)
arrow-array/src/array/list_array.rs (1)
  • offsets (311-313)
arrow-array/src/array/primitive_array.rs (2)
  • i (2070-2070)
  • from_iter_values (753-761)
parquet/src/file/metadata/memory.rs (2)
parquet/src/encryption/decrypt.rs (4)
  • heap_size (276-278)
  • heap_size (301-310)
  • heap_size (358-360)
  • heap_size (583-587)
parquet/src/schema/types.rs (5)
  • heap_size (72-77)
  • heap_size (677-680)
  • heap_size (752-754)
  • heap_size (847-851)
  • heap_size (1026-1028)
parquet/src/arrow/arrow_reader/selection.rs (1)
parquet/src/arrow/arrow_reader/read_plan.rs (1)
  • selection (55-57)
parquet/src/file/reader.rs (2)
parquet/tests/arrow_reader/io/sync_reader.rs (2)
  • len (396-398)
  • get_bytes (413-418)
parquet/tests/arrow_reader/io/async_reader.rs (1)
  • get_bytes (373-390)
arrow-data/src/transform/list_view.rs (2)
arrow-data/src/transform/mod.rs (5)
  • build_extend (238-283)
  • array (220-220)
  • new (401-403)
  • len (753-755)
  • extend_nulls (742-749)
arrow-array/src/array/list_view_array.rs (1)
  • sizes (277-279)
arrow-data/src/transform/mod.rs (2)
arrow-data/src/transform/list_view.rs (2)
  • build_extend (24-47)
  • extend_nulls (49-56)
arrow-data/src/transform/list.rs (2)
  • build_extend (27-48)
  • extend_nulls (50-57)
arrow-data/src/equal/mod.rs (2)
arrow-data/src/equal/run.rs (1)
  • run_equal (26-86)
arrow-data/src/equal/list_view.rs (5)
  • list_view_equal (24-129)
  • lhs (31-31)
  • lhs (32-32)
  • rhs (34-34)
  • rhs (35-35)
parquet-variant-compute/src/shred_variant.rs (1)
parquet-variant-compute/src/variant_array.rs (11)
  • from_iter (445-451)
  • from_iter (455-457)
  • from (433-435)
  • from (439-441)
  • from (701-703)
  • from (707-709)
  • from (865-870)
  • new (493-499)
  • new (766-768)
  • new (820-822)
  • new (884-886)
arrow-data/src/data.rs (1)
arrow-array/src/array/list_view_array.rs (2)
  • offsets (262-264)
  • sizes (277-279)
arrow-select/src/take.rs (3)
arrow-data/src/data.rs (5)
  • new (1857-1859)
  • new (1919-1932)
  • nulls (390-392)
  • nulls (1948-1953)
  • slice (550-578)
arrow-array/src/cast.rs (1)
  • as_list_view (927-929)
arrow-array/src/array/list_view_array.rs (6)
  • nulls (416-418)
  • to_data (379-381)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
arrow-select/src/concat.rs (5)
arrow-array/src/array/list_array.rs (12)
  • x (1269-1269)
  • nulls (569-571)
  • new (266-273)
  • values (322-324)
  • offsets (311-313)
  • offset (565-567)
  • try_new (209-259)
  • builder (1278-1278)
  • builder (1279-1279)
  • builder (1280-1280)
  • builder (1281-1281)
  • builder (1282-1282)
arrow-array/src/array/list_view_array.rs (8)
  • x (1010-1010)
  • x (1045-1045)
  • x (1080-1080)
  • x (1119-1119)
  • nulls (416-418)
  • sizes (277-279)
  • offsets (262-264)
  • offset (412-414)
arrow-buffer/src/buffer/offset.rs (4)
  • new (68-79)
  • n (169-171)
  • offsets (282-290)
  • as_ref (248-250)
arrow-array/src/builder/generic_list_view_builder.rs (3)
  • values (119-121)
  • with_capacity (85-96)
  • array (443-443)
arrow-buffer/src/buffer/scalar.rs (1)
  • as_ref (144-146)
arrow-cast/src/cast/mod.rs (2)
arrow-array/src/array/run_array.rs (6)
  • run_array (560-562)
  • run_array (973-973)
  • run_array (999-999)
  • run_array (1033-1033)
  • run_ends (121-123)
  • run_ends (550-552)
arrow-array/src/array/mod.rs (5)
  • make_array (744-842)
  • value (581-581)
  • data_type (127-127)
  • data_type (358-360)
  • data_type (437-439)
parquet-variant-compute/src/type_conversion.rs (1)
arrow-cast/src/cast/decimal.rs (8)
  • rescale_decimal (301-328)
  • rescale_decimal (928-928)
  • rescale_decimal (940-940)
  • rescale_decimal (946-946)
  • rescale_decimal (958-958)
  • rescale_decimal (964-964)
  • rescale_decimal (970-970)
  • rescale_decimal (976-976)
arrow-cast/src/cast/decimal.rs (3)
arrow-array/src/array/primitive_array.rs (5)
  • value (742-750)
  • value (1227-1229)
  • validate_decimal_precision (1603-1618)
  • len (684-686)
  • len (1184-1186)
arrow-array/src/types.rs (10)
  • is_valid_decimal_precision (1347-1347)
  • is_valid_decimal_precision (1413-1415)
  • is_valid_decimal_precision (1449-1451)
  • is_valid_decimal_precision (1485-1487)
  • is_valid_decimal_precision (1521-1523)
  • validate_decimal_precision (1340-1344)
  • validate_decimal_precision (1409-1411)
  • validate_decimal_precision (1445-1447)
  • validate_decimal_precision (1481-1483)
  • validate_decimal_precision (1517-1519)
arrow-data/src/decimal.rs (1)
  • validate_decimal_precision (1134-1167)
arrow-buffer/src/util/bit_iterator.rs (1)
arrow-buffer/src/buffer/boolean.rs (6)
  • new (61-74)
  • iter (202-204)
  • slice (162-172)
  • len (119-121)
  • from (283-287)
  • from (291-293)
parquet/src/encryption/decrypt.rs (2)
parquet/src/encryption/ciphers.rs (1)
  • heap_size (55-58)
parquet/src/file/metadata/memory.rs (10)
  • heap_size (41-41)
  • heap_size (45-51)
  • heap_size (55-93)
  • heap_size (97-100)
  • heap_size (104-108)
  • heap_size (112-114)
  • heap_size (118-120)
  • heap_size (124-126)
  • heap_size (130-142)
  • keys (91-91)
arrow-select/src/filter.rs (4)
arrow-data/src/data.rs (10)
  • new (1857-1859)
  • new (1919-1932)
  • new_unchecked (273-300)
  • nulls (390-392)
  • nulls (1948-1953)
  • null_count (420-425)
  • null_count (1956-1959)
  • buffer (586-588)
  • buffer (922-922)
  • buffer (1496-1496)
arrow-array/src/record_batch.rs (1)
  • new_unchecked (261-271)
arrow-array/src/array/list_view_array.rs (2)
  • nulls (416-418)
  • to_data (379-381)
arrow-array/src/array/mod.rs (9)
  • nulls (205-205)
  • nulls (387-389)
  • nulls (457-459)
  • null_count (286-288)
  • null_count (403-405)
  • null_count (473-475)
  • to_data (108-108)
  • to_data (350-352)
  • to_data (429-431)
⏰ Context from checks skipped due to timeout of 90000ms. (19)
  • GitHub Check: MIRI
  • GitHub Check: Test
  • GitHub Check: Clippy
  • GitHub Check: Check Compilation
  • GitHub Check: Build wasm32
  • GitHub Check: Archery test With other arrows
  • GitHub Check: Check Compilation
  • GitHub Check: Test
  • GitHub Check: Test
  • GitHub Check: Check Compilation
  • GitHub Check: claude-review
  • GitHub Check: PySpark Integration Test (stable)
  • GitHub Check: Test
  • GitHub Check: Verify MSRV (Minimum Supported Rust Version)
  • GitHub Check: Rustdocs are clean
  • GitHub Check: Audit
  • GitHub Check: Test on Mac
  • GitHub Check: codex
  • GitHub Check: Test on Windows
🔇 Additional comments (74)
parquet/src/arrow/arrow_reader/selection.rs (1)

1436-1463: Well-structured tests for trim() behavior.

The two test scenarios comprehensively verify the expected behavior:

  1. When trailing selectors are non-skips, they are preserved.
  2. When trailing skips exist, they are correctly removed.

The assertions clearly validate that the selector order and content remain intact after trimming.

parquet/src/column/page.rs (1)

34-34: LGTM! Useful debugging improvement.

Adding Debug to the Page enum is a good practice that enables debug formatting without any behavioral changes or performance impact. All field types already implement Debug, so this is a safe addition.

Note: The PR title mentions ArrayIter size hint fixes, but this file concerns Parquet page definitions. If this is part of a larger multi-file PR, consider ensuring the PR title reflects the full scope of changes.

parquet/src/column/reader.rs (2)

572-581: Excellent defensive bounds checking.

The added validation properly guards against:

  • Insufficient buffer to read the 4-byte size header
  • Integer overflow when computing the end offset
  • Insufficient buffer to contain the declared data

The use of checked_add and explicit bounds validation prevents potential panics and provides clear error messages.
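The pattern being praised can be shown in isolation; a hedged sketch assuming a little-endian 4-byte length prefix (names and the `String` error type are illustrative):

```rust
// Hedged sketch of defensive bounds checking on a length-prefixed buffer:
// verify the 4-byte header fits, compute the end offset with checked_add,
// and verify the declared payload fits before slicing.
fn read_length_prefixed(buf: &[u8], start: usize) -> Result<&[u8], String> {
    let header_end = start.checked_add(4).ok_or("offset overflow")?;
    if header_end > buf.len() {
        return Err("not enough bytes for size header".into());
    }
    let len = u32::from_le_bytes(buf[start..header_end].try_into().unwrap()) as usize;
    let end = header_end.checked_add(len).ok_or("length overflow")?;
    if end > buf.len() {
        return Err("declared data size exceeds remaining buffer".into());
    }
    Ok(&buf[header_end..end])
}
```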


605-622: Good test coverage for error paths.

The test properly exercises the new bounds checking logic with two scenarios where the declared data size exceeds the available buffer. Error message assertions confirm the expected behavior.

arrow-row/src/lib.rs (2)

916-927: Good zero-copy optimization with proper fallback handling.

The implementation correctly attempts a zero-copy path by consuming the BinaryArray with into_parts() and trying into_vec(). When the buffer is shared (reference count > 1), into_vec() will fail, and the fallback to to_vec() ensures correctness by copying the data. This is a solid optimization that maintains correctness for all cases.


2481-2492: Excellent test coverage for the shared buffer scenario.

This test specifically validates the fallback path in from_binary when the buffer is shared. By cloning binary_rows on line 2487, the test ensures the buffer's reference count is greater than 1, which forces into_vec() to fail and triggers the fallback to to_vec(). The unused variable _binary_rows_shared_buffer is intentionally kept in scope to maintain the shared state.
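The zero-copy-with-fallback pattern distills to a few lines using `Arc::try_unwrap`; this is a simplified analogue of `into_vec()` falling back to `to_vec()`, not the actual `Buffer` internals:

```rust
use std::sync::Arc;

// Simplified analogue of the zero-copy path: take ownership of the
// allocation when it is uniquely owned, fall back to copying when shared.
fn into_vec_or_copy(buf: Arc<Vec<u8>>) -> Vec<u8> {
    Arc::try_unwrap(buf).unwrap_or_else(|shared| shared.as_ref().clone())
}
```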

arrow-data/src/data.rs (1)

991-991: Correct loop bound change.

Iterating over sizes.len() (which equals the number of ListView elements) is the correct approach for validating each list element's offset and size. The previous iteration bound was likely incorrect.

parquet-variant-compute/src/shred_variant.rs (3)

338-338: LGTM!

The refactored test construction using from_iter is correct and simplifies the test setup.


367-367: LGTM!

The simplified construction using direct Variant values is correct and concise.


375-382: LGTM!

The test correctly uses Some/None pattern to test various null scenarios (array-level null vs variant null), which is essential for comprehensive null handling validation.

parquet/src/file/serialized_reader.rs (3)

1139-1139: LGTM!

The import is appropriately added for test usage.


464-467: LGTM!

Replacing unimplemented!() with a proper error return is the correct approach. The error message helpfully includes the page type for debugging.


1179-1213: LGTM!

The test appropriately validates error handling for unsupported page types, covering the new error return at lines 464-467.

parquet/src/file/reader.rs (1)

127-132: Bounds checking prevents out-of-bounds reads.

The defensive check correctly prevents reading beyond the buffer length while allowing valid edge cases (e.g., start == self.len() for empty reads).

parquet/src/schema/types.rs (1)

848-851: HeapSize: avoid double-counting primitive_type — good change

Excluding primitive_type here prevents double-counting since SchemaDescriptor::schema accounts for the Type tree. Keeping only path.heap_size() is correct.

parquet/src/column/reader/decoder.rs (1)

141-153: Good: dispatch table sized by MAX_DISCRIMINANT.

This future-proofs the decoder cache against added encodings and stays consistent with EncodingMask. LGTM.

parquet/src/encodings/decoding.rs (2)

382-395: Good guards for empty data and oversized bit width.

Early EOF and bit-width validation make RLE dictionary decoding safer. LGTM.


1408-1413: Tests cover new error paths well.

Empty dict data and invalid miniblock width cases are covered. Nice.

Also applies to: 2127-2172

parquet/src/parquet_macros.rs (1)

39-42: Nice: expose VARIANTS and MAX_DISCRIMINANT on thrift enums.

This enables static sizing (e.g., dispatch tables) without manual updates. LGTM.

Also applies to: 85-112
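A hand-written analogue of what the macro generates (the enum, discriminants, and names here are illustrative, not the actual parquet encodings):

```rust
// Hedged sketch: explicit discriminants plus VARIANTS/MAX_DISCRIMINANT
// consts, as the thrift_enum macro now emits; names are illustrative.
#[derive(Clone, Copy)]
enum Encoding {
    Plain = 0,
    Rle = 3,
    DeltaBinaryPacked = 5,
}

impl Encoding {
    const VARIANTS: [Encoding; 3] =
        [Encoding::Plain, Encoding::Rle, Encoding::DeltaBinaryPacked];
    const MAX_DISCRIMINANT: usize = 5;
}

// A dispatch table indexed by discriminant never needs manual resizing when
// variants are added, as long as MAX_DISCRIMINANT is regenerated with them.
fn build_table() -> [Option<&'static str>; Encoding::MAX_DISCRIMINANT + 1] {
    let mut table = [None; Encoding::MAX_DISCRIMINANT + 1];
    for e in Encoding::VARIANTS {
        table[e as usize] = Some(match e {
            Encoding::Plain => "plain",
            Encoding::Rle => "rle",
            Encoding::DeltaBinaryPacked => "delta",
        });
    }
    table
}
```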

parquet/src/basic.rs (1)

698-705: Good: EncodingMask derives MAX_ENCODING from enum discriminants.

Keeps masks in sync automatically with new encodings. LGTM.

arrow-array/src/iterator.rs (2)

47-53: ArrayIter now Clone — good additive API

Cloning preserves iterator cursors; helpful for tests and user code. No behavior risks spotted.


101-103: Fix: exact size_hint after next_back/next

Using current_end - current makes len()/ExactSizeIterator correct under mixed forward/backward consumption. Matches the PR goal.
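The fix boils down to deriving the hint from both cursors. A minimal double-ended iterator with the same shape (field names mirror the description; this is not the actual ArrayIter):

```rust
// Minimal double-ended iterator mirroring the fixed hint computation.
struct Cursor {
    current: usize,
    current_end: usize,
}

impl Iterator for Cursor {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.current < self.current_end {
            let i = self.current;
            self.current += 1;
            Some(i)
        } else {
            None
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // correct under mixed forward/backward consumption
        let remaining = self.current_end - self.current;
        (remaining, Some(remaining))
    }
}

impl DoubleEndedIterator for Cursor {
    fn next_back(&mut self) -> Option<usize> {
        if self.current_end > self.current {
            self.current_end -= 1;
            Some(self.current_end)
        } else {
            None
        }
    }
}

impl ExactSizeIterator for Cursor {}
```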

arrow-buffer/src/util/bit_iterator.rs (4)

26-31: BitIterator now Clone — good

Cheap, safe clone of cursor state; enables robust test patterns.


76-82: count() optimized via len()

Avoids iteration; correct for ExactSizeIterator.


83-102: nth(): correct semantics and overflow handling

Advances to current + n, returns next(); handles overflow/out-of-range by exhausting. Looks good.
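The overflow handling can be illustrated on a stripped-down cursor (hypothetical struct; the real iterator also materializes bit values):

```rust
// Hypothetical cursor showing nth()'s overflow/out-of-range handling:
// on failure the iterator is exhausted rather than left mid-range.
struct BitCursor {
    current: usize,
    end: usize,
}

impl BitCursor {
    fn next_index(&mut self) -> Option<usize> {
        if self.current < self.end {
            let i = self.current;
            self.current += 1;
            Some(i)
        } else {
            None
        }
    }

    fn nth_index(&mut self, n: usize) -> Option<usize> {
        match self.current.checked_add(n) {
            Some(idx) if idx < self.end => {
                self.current = idx;
                self.next_index()
            }
            // overflow or past the end: exhaust, matching std semantics
            _ => {
                self.current = self.end;
                None
            }
        }
    }
}
```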


117-139: max(): efficient boolean specialization

Short-circuits by checking any set bit via BitIndexIterator; returns Some(false) for non-empty all-false. Matches Iterator::max semantics for bool.
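The short-circuit idea, sketched over raw LSB-ordered bytes (a simplified stand-in for the BitIndexIterator-based implementation; assumes the caller supplies enough bytes for `len_bits`):

```rust
// Simplified stand-in for the boolean max() specialization over LSB-ordered
// packed bits: short-circuit on the first set bit, Some(false) otherwise.
fn packed_max(bytes: &[u8], len_bits: usize) -> Option<bool> {
    if len_bits == 0 {
        return None; // empty iterator has no max
    }
    let full = len_bits / 8;
    if bytes[..full].iter().any(|&b| b != 0) {
        return Some(true);
    }
    let rem = len_bits % 8;
    if rem > 0 && bytes[full] & ((1u8 << rem) - 1) != 0 {
        return Some(true);
    }
    Some(false)
}
```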

arrow-data/src/equal/mod.rs (1)

33-34: ListView equality wiring looks correct

New module import and match arms correctly route ListView/LargeListView to list_view_equal with appropriate offset types. No issues from this change.

Please ensure the equality tests cover sliced ListView inputs and the null-handling paths, and that they run in CI.

Also applies to: 45-47, 109-111

arrow-select/src/zip.rs (1)

298-352: Good coverage for NULL-in-mask behavior

New tests validate primitive/string/large-string cases with nulls treated as false. LGTM.

Also applies to: 353-402

arrow-select/src/concat.rs (2)

484-486: Dispatch for ListView/LargeListView looks good

concat() routes ListView variants to concat_list_view; tests cover sliced and non-sliced paths.


951-1048: ListView concat tests are solid

Good coverage for regular and sliced ListView cases. LGTM.

arrow/tests/array_equal.rs (1)

825-863: Nested List<ListView<…>> tests are valuable

Sliced outer List with inner ListView equality is well covered. LGTM.

arrow-data/src/transform/mod.rs (6)

36-36: Wiring module looks good.

Module registration for list_view is correct and consistent with the added match arms below.


270-272: Dispatch to list_view::build_extend is correct, but depends on offset rebasing.

Routing is fine. Note: correctness hinges on list_view::build_extend rebasing offsets and extending child_data when concatenating inputs. See comment in list_view.rs.


317-319: Nulls dispatch LGTM.

ListView null extension is wired consistently with List/LargeList variants.


452-456: Capacities match updated appropriately.

Including ListView/LargeListView in List-like capacities is correct.


497-520: Child data grouping for ListView is OK, but extension must be driven by the extend closure.

Collecting the child arrays mirrors List behavior and is fine. However, unless list_view::build_extend extends child_data (and re-bases offsets), the frozen output will have inconsistent offsets vs values. See fix proposal in list_view.rs.


792-799: Buffer selection includes ListView correctly.

Emitting two buffers (offsets, sizes) for ListView/LargeListView is correct.

arrow-data/src/transform/list_view.rs (1)

49-56: Nulls handling is fine.

Pushing zeroed offsets/sizes for nulls is acceptable for ListView.

arrow-select/src/take.rs (4)

221-226: ListView dispatch in take_impl looks correct.

Routes ListView/LargeListView to the new take_list_view with proper offset types.


630-655: take_list_view implementation LGTM.

Offsets/sizes are taken; nulls propagated; child values reused from input. Matches ListView semantics.


1016-1016: Import ToPrimitive for tests is fine.

Required for index builder mapping; OK.


1858-2004: Good ListView take tests.

Covers reversal, null indices, null values, and slicing across i32/i64 offsets.

arrow-select/src/filter.rs (6)

61-68: Iterator ergonomics improved.

Using From<&BooleanBuffer> for SlicesIterator is clean and avoids extra calls.


184-205: filter_record_batch helper is sensible.

Optimization gating mirrors filter(), and the call-through is straightforward.


318-341: FilterPredicate::filter_record_batch LGTM.

Builds arrays then constructs the batch with correct row count. Safe given consistent lengths.


389-394: ListView dispatch added correctly.

Selects i32/i64 variants and routes to filter_list_view.


909-936: filter_list_view implementation looks correct.

Filters offsets/sizes; computes filtered nulls; preserves values child array. Matches expected ListView behavior.


1447-1509: Solid ListView filter tests.

Covers basic and sliced scenarios for both offset sizes.

parquet-variant-compute/src/type_conversion.rs (1)

20-20: Switch to shared rescale_decimal looks good.
Consolidates logic with arrow-cast; no functional concerns.

arrow-cast/Cargo.toml (1)

46-46: Dependency addition verified; no workspace cycles detected.

arrow-ord is confirmed as a workspace member, and its dependencies do not include arrow-cast, confirming no circular dependency exists. The addition is safe and appropriate.

arrow-cast/src/cast/mod.rs (4)

44-51: Module wiring LGTM

New module and import are clean and scoped correctly. No further action.


72-72: Public re-export is fine

Exposing rescale_decimal alongside DecimalCast makes sense for external consumers.


798-805: REE source dispatch looks correct

Index-type-driven dispatch to run_end_encoded_cast is clear; unsupported types error out fast with a clear message.


806-828: No changes needed

The function signature confirms cast_to_run_end_encoded requires &ArrayRef, which matches the review comment's own caveat: "If its signature requires &ArrayRef, ignore." The code is correct as written.

arrow-cast/src/cast/decimal.rs (9)

148-194: Well-designed upscaling builder with efficient precomputed tables.

The approach of retrieving 10^k - 1 from MAX_FOR_EACH_PRECISION and adding 1 avoids runtime power calculations while staying within type bounds. The distinction between fallible and infallible paths based on precision analysis is sound.

Minor note: The infallible closure uses .unwrap() on line 192, which relies on the precision analysis guaranteeing that from_decimal succeeds. This assumption holds given that:

  1. Input values fit within declared precision (validated elsewhere)
  2. Precision limits ensure valid values fit in native types
  3. The condition input_precision + delta_scale <= output_precision correctly predicts success

You may want to verify that all callers ensure input arrays satisfy precision constraints, though this should already be enforced by Arrow's validation.
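As an illustration of the table trick, here is a minimal stand-alone sketch; `MAXES` and `upscale_multiplier` are local stand-ins for the crate's precomputed table, not its actual API:

```rust
// Local stand-in for arrow's MAX_FOR_EACH_PRECISION table (hypothetical values):
// MAXES[p] is the largest value representable with precision p, i.e. 10^p - 1.
const MAXES: [i128; 5] = [0, 9, 99, 999, 9999];

// Multiplier for upscaling by `delta_scale` digits: 10^delta_scale,
// recovered as (10^delta_scale - 1) + 1 with no runtime pow call.
fn upscale_multiplier(delta_scale: usize) -> i128 {
    MAXES[delta_scale] + 1
}

fn main() {
    assert_eq!(upscale_multiplier(2), 100);
    // Upscaling 1.23 (scale 2) to scale 4: 123 * 10^2 = 12300.
    assert_eq!(123_i128 * upscale_multiplier(2), 12300);
    println!("ok");
}
```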


196-264: LGTM: Downscaling logic correctly implements round-half-away-from-zero.

The rounding implementation properly handles both positive and negative values. The infallible condition using < rather than <= (line 261) correctly accounts for the potential digit added by rounding, as explained in the inline comments.

The design choice to return None when scale reduction exceeds precision (causing mathematical zero) is appropriate and well-documented.
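The round-half-away-from-zero behavior can be sketched with plain i128 arithmetic; `downscale_round` below is a hand-rolled illustration, not the crate's implementation:

```rust
// Hand-rolled round-half-away-from-zero downscale by 10^delta. Adding (or
// subtracting) half the divisor before truncating division makes any
// |remainder| >= half round away from zero.
fn downscale_round(v: i128, delta: u32) -> i128 {
    let div = 10_i128.pow(delta);
    let half = div / 2;
    if v >= 0 { (v + half) / div } else { (v - half) / div }
}

fn main() {
    assert_eq!(downscale_round(125, 1), 13); // 12.5 rounds up to 13
    assert_eq!(downscale_round(-125, 1), -13); // -12.5 rounds down to -13
    assert_eq!(downscale_round(124, 1), 12); // 12.4 truncates to 12
    println!("ok");
}
```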


266-284: Appropriate helper for applying rescaling functions.

The function correctly delegates to either the infallible or fallible path, with validation only on the fallible path (for performance).


286-328: New public API for row-level decimal rescaling is well-designed.

This function provides a clean interface for rescaling individual decimal values, complementing the existing array-level operations. The documentation clearly explains the behavior, including:

  • Validation of precision and scale for both input and output
  • Upscale overflow returns None
  • Excessive downscale returns Some(ZERO) (mathematically correct)

The asymmetric error handling is appropriate given the mathematical semantics of each operation.


330-349: LGTM: Straightforward error message helper.


351-378: Correct array-level casting with appropriate paths for each scenario.

The three-way branching correctly handles:

  1. Infallible casts (direct unary)
  2. Safe fallible casts (filtering with unary_opt)
  3. Unsafe fallible casts (validation with try_unary)

Each path appropriately validates or omits validation based on guarantees.


380-410: Good refactoring using the new builder pattern.

The use of make_downscaler and apply_decimal_cast improves code modularity. The special case handling (returning zeros when scale reduction exceeds precision) correctly preserves null values while replacing non-nulls with zero.


412-446: Refactoring mirrors downscaler pattern with appropriate error handling.

The asymmetric behavior compared to convert_to_smaller_scale_decimal is intentional:

  • Upscale overflow → error (line 439-444)
  • Downscale excessive → zeros (in the other function)

This reflects the mathematical reality: excessive upscaling causes overflow (error condition), while excessive downscaling mathematically rounds to zero (valid result).
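A minimal sketch of why upscaling is the fallible direction, using a hypothetical `upscale` helper over i128:

```rust
// Hypothetical upscale helper: multiplying by 10^delta can overflow i128,
// so this direction must report None rather than produce a wrong value.
fn upscale(v: i128, delta: u32) -> Option<i128> {
    10_i128.checked_pow(delta).and_then(|m| v.checked_mul(m))
}

fn main() {
    assert_eq!(upscale(2, 3), Some(2000));
    assert_eq!(upscale(i128::MAX, 1), None); // overflow: error condition
    // An excessive downscale, by contrast, simply rounds every value to 0,
    // which is a valid decimal result rather than an error.
    println!("ok");
}
```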


926-978: Comprehensive test coverage for the new rescale_decimal function.

The tests cover the essential scenarios:

  • Normal upscaling and downscaling
  • Rounding behavior (half-away-from-zero for both positive and negative values)
  • Edge cases (excessive scale changes, overflow)
  • Error cases (invalid precision/scale parameters)

parquet/src/encryption/ciphers.rs (2)

21-21: Import looks correct.
Brings HeapSize into scope for the new supertrait requirement.


31-35: Extending BlockDecryptor with HeapSize — Verified as correct.

The only implementor, RingGcmBlockDecryptor, implements both BlockDecryptor (line 61) and HeapSize (line 54), so the trait constraint is satisfied.

parquet/src/file/metadata/memory.rs (3)

31-31: Import for HashMap — OK.
Needed for the new HeapSize impl.


96-101: Arc overhead calculation — reasonable approximation.
Counts two counters + T + inner heap. Keep in mind std layout is not strictly guaranteed.
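A std-only sketch of this approximation; `arc_heap_estimate` is an illustrative helper, and as noted, the ArcInner layout is not guaranteed by std:

```rust
use std::mem::size_of;

// Rough estimate of the heap allocation behind Arc<T>: ArcInner holds two
// atomic reference counters (strong + weak) plus the value itself. std does
// not guarantee this layout, so treat the number as an approximation.
fn arc_heap_estimate<T>() -> usize {
    2 * size_of::<usize>() + size_of::<T>()
}

fn main() {
    // AtomicUsize has the same size as usize, so on a 64-bit target an
    // Arc<u64> allocation is roughly 16 (counters) + 8 (value) bytes.
    assert_eq!(arc_heap_estimate::<u64>(), 2 * size_of::<usize>() + 8);
    println!("ok");
}
```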


103-109: Arc impl — LGTM.
Correctly uses size_of_val for the underlying type plus inner heap usage.

parquet/src/encryption/decrypt.rs (5)

25-25: Import looks correct.
Brings HeapSize into scope for the added impls.


275-279: HeapSize for ExplicitDecryptionKeys — OK.
Accurately sums footer key and column keys.


300-311: HeapSize for DecryptionKeys — OK.
Returns 0 for retriever variant; sensible since it’s user‑defined.


357-361: HeapSize for FileDecryptionProperties — OK.
Sums keys and optional AAD prefix.


582-588: Verify that calling heap_size() directly on Arc<dyn BlockDecryptor> compiles without casting.

The review comment is factually correct: BlockDecryptor extends HeapSize as a supertrait (defined in parquet/src/encryption/ciphers.rs:31). In theory, trait object method dispatch should allow calling supertrait methods directly without casting. However, the current code explicitly casts Arc<dyn BlockDecryptor> to Arc<dyn HeapSize> before calling heap_size(), suggesting there may be a compilation or type resolution issue.

The suggested fix removes the cast and calls heap_size() directly on the trait object. Verify locally that this compiles and functions correctly:

self.decryption_properties.heap_size()
    + self.footer_decryptor.heap_size()  // Direct call without cast
    + self.file_aad.heap_size()

If the direct call fails to compile, the cast may be necessary despite the supertrait relationship.

parquet/src/file/metadata/mod.rs (1)

290-301: Including decryptor memory in memory_size — OK.
Feature‑gated, uses HeapSize on Option<Box<...>>; consistent with existing accounting.

```toml
[dependencies]
-arrow-avro = "56"
+arrow-avro = "57.0.0"
```

⚠️ Potential issue | 🟡 Minor

Update version references in feature snippets section for consistency.

Lines 47 and 54 have been updated to 57.0.0, but the feature snippet examples at lines 152, 157, and 162 still reference version = "56". This creates inconsistency in the documentation.

To maintain clarity, update all version references consistently:

  * Minimal, fast build (common pipelines):
  
    ```toml
-   arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy"] }
+   arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy"] }
    ```
  * Include Zstandard too (modern data lakes):
  
    ```toml
-   arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy", "zstd"] }
+   arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy", "zstd"] }
    ```
  * Fingerprint helpers:
  
    ```toml
-   arrow-avro = { version = "56", features = ["md5", "sha256"] }
+   arrow-avro = { version = "57.0.0", features = ["md5", "sha256"] }
    ```

Also applies to: 54-54

🤖 Prompt for AI Agents
In arrow-avro/README.md around lines 47, 54 and especially the feature snippet
lines 152, 157, and 162, update the inconsistent crate version references that
currently show version = "56" to the new canonical version = "57.0.0"; edit each
toml example to replace the old version string with "57.0.0" (including examples
with default-features and feature lists) so all documented snippets match the
top-of-file version references.

Comment on lines +51 to +71
let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(Arc::new(new_run_array))
}

⚠️ Potential issue | 🔴 Critical

Double-wrapping Arc causes a type error.
new_run_array is already an ArrayRef (an Arc<dyn Array>). Returning Arc::new(new_run_array) wraps that Arc in another Arc, producing Arc<ArrayRef> instead of ArrayRef.

-                    Ok(Arc::new(new_run_array))
+                    Ok(new_run_array)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(Arc::new(new_run_array))
}

let new_run_array: ArrayRef = match target_index_field.data_type() {
    DataType::Int16 => {
        let re = cast_run_ends.as_primitive::<Int16Type>();
        Arc::new(RunArray::<Int16Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int32 => {
        let re = cast_run_ends.as_primitive::<Int32Type>();
        Arc::new(RunArray::<Int32Type>::try_new(re, cast_values.as_ref())?)
    }
    DataType::Int64 => {
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => {
        return Err(ArrowError::CastError(
            "Run-end type must be i16, i32, or i64".to_string(),
        ));
    }
};
Ok(new_run_array)
}
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 51 to 71, the code constructs
new_run_array as an ArrayRef (Arc<dyn Array>) and then returns
Ok(Arc::new(new_run_array)), which double-wraps an Arc and causes a type error;
change the return to Ok(new_run_array) (or adjust the binding so it isn't
already an Arc) so you return the existing ArrayRef directly instead of wrapping
it again.

Comment on lines +75 to +93
let run_ends = run_array.run_ends().values().to_vec();
let mut indices = Vec::with_capacity(run_array.run_ends().len());
let mut physical_idx: usize = 0;
for logical_idx in 0..run_array.run_ends().len() {
    // If the logical index is equal to the (next) run end, increment the physical index,
    // since we are at the end of a run.
    if logical_idx == run_ends[physical_idx].as_usize() {
        physical_idx += 1;
    }
    indices.push(physical_idx as i32);
}

let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Incorrect logical expansion: builds indices per run, not per row.
This returns only one value per run instead of expanding to array length. Build indices of total_len = last run-end, repeating each run index by its run length.

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
-
-                    let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
+                    let run_ends = run_array.run_ends().values().to_vec();
+                    let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
+                    let mut indices: Vec<u32> = Vec::with_capacity(total_len);
+                    let mut prev_end = 0usize;
+                    for (run_idx, end) in run_ends.iter().enumerate() {
+                        let end = end.as_usize();
+                        let run_len = end - prev_end;
+                        indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
+                        prev_end = end;
+                    }
+                    let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;

Add a unit test that casts a small REE array (e.g., run_ends=[2,5], values=["a","b"]) to Utf8 and checks length=5 and correct expansion ["a","a","b","b","b"].


🏁 Script executed:

#!/bin/bash
# Find and examine the run_array.rs file to verify the current code
find . -name "run_array.rs" -type f | head -20

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Search for the specific code pattern around the suspected issue
rg -n "let run_ends = run_array.run_ends" --type rust -A 20

Length of output: 6237


🏁 Script executed:

#!/bin/bash
# Look for existing tests related to run_array casting
find . -name "*.rs" -type f | xargs grep -l "run_array" | grep -i test

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check for RunArray/REE related documentation or implementation
rg -n "RunArray|run_encoded" --type rust -C 2 | head -100

Length of output: 6749


🏁 Script executed:

#!/bin/bash
# Find the take function signature to verify it accepts UInt32Array
rg -n "pub fn take" --type rust -A 5 | grep -A 5 "indices"

Length of output: 942


🏁 Script executed:

#!/bin/bash
# Look for tests in run_array.rs file itself
cat arrow-cast/src/cast/run_array.rs | tail -200

Length of output: 6928


🏁 Script executed:

#!/bin/bash
# Search for tests that use cast_with_options on RunArray
rg -n "RunArray|cast_with_options" arrow-cast/src/cast/run_array.rs --type rust -C 3

Length of output: 3588


🏁 Script executed:

#!/bin/bash
# Check the full context of the problematic code to understand the function signature
sed -n '50,100p' arrow-cast/src/cast/run_array.rs

Length of output: 2449


🏁 Script executed:

#!/bin/bash
# Search for UInt32Array usage with take function to validate the proposed change
rg -n "UInt32Array.*take\|take.*UInt32Array" --type rust -C 2

Length of output: 0


Implement the expansion fix for RunArray logical form casting.

The current code has a critical logic error: it iterates 0..run_array.run_ends().len() (number of runs) instead of the logical array length, producing incorrectly sized output. For example, with run_ends=[2,5] and values=["a","b"], the current code returns indices=[0,0] (length 2), but should return [0,0,1,1,1] (length 5) to expand the compressed representation.

The proposed fix correctly expands each run by its run_length. The data type change to UInt32Array is valid—take() accepts any Array type, and UInt32Array is already used successfully elsewhere in this same file (line 157).

-                    let run_ends = run_array.run_ends().values().to_vec();
-                    let mut indices = Vec::with_capacity(run_array.run_ends().len());
-                    let mut physical_idx: usize = 0;
-                    for logical_idx in 0..run_array.run_ends().len() {
-                        // If the logical index is equal to the (next) run end, increment the physical index,
-                        // since we are at the end of a run.
-                        if logical_idx == run_ends[physical_idx].as_usize() {
-                            physical_idx += 1;
-                        }
-                        indices.push(physical_idx as i32);
-                    }
-
-                    let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
+                    let run_ends = run_array.run_ends().values().to_vec();
+                    let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
+                    let mut indices: Vec<u32> = Vec::with_capacity(total_len);
+                    let mut prev_end = 0usize;
+                    for (run_idx, end) in run_ends.iter().enumerate() {
+                        let end = end.as_usize();
+                        let run_len = end - prev_end;
+                        indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
+                        prev_end = end;
+                    }
+                    let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;

Add a unit test that casts a small REE array (e.g., run_ends=[2,5], values=["a","b"]) to Utf8 and verifies length=5 with correct expansion ["a","a","b","b","b"].
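The corrected expansion can be exercised without Arrow at all; `expand_run_ends` below is an illustrative std-only version of the proposed loop:

```rust
// Illustrative std-only version of the proposed expansion loop: each run
// index is repeated run-length times, so the output length equals the last
// run end (the array's logical length).
fn expand_run_ends(run_ends: &[i64]) -> Vec<u32> {
    let total = run_ends.last().copied().unwrap_or(0) as usize;
    let mut indices = Vec::with_capacity(total);
    let mut prev = 0usize;
    for (run_idx, &end) in run_ends.iter().enumerate() {
        let end = end as usize;
        indices.extend(std::iter::repeat(run_idx as u32).take(end - prev));
        prev = end;
    }
    indices
}

fn main() {
    // run_ends=[2,5], values=["a","b"] expands to ["a","a","b","b","b"].
    assert_eq!(expand_run_ends(&[2, 5]), vec![0, 0, 1, 1, 1]);
    assert!(expand_run_ends(&[]).is_empty());
    println!("ok");
}
```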

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let run_ends = run_array.run_ends().values().to_vec();
let mut indices = Vec::with_capacity(run_array.run_ends().len());
let mut physical_idx: usize = 0;
for logical_idx in 0..run_array.run_ends().len() {
    // If the logical index is equal to the (next) run end, increment the physical index,
    // since we are at the end of a run.
    if logical_idx == run_ends[physical_idx].as_usize() {
        physical_idx += 1;
    }
    indices.push(physical_idx as i32);
}
let taken = take(&values, &Int32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}

let run_ends = run_array.run_ends().values().to_vec();
let total_len = run_ends.last().map(|v| v.as_usize()).unwrap_or(0);
let mut indices: Vec<u32> = Vec::with_capacity(total_len);
let mut prev_end = 0usize;
for (run_idx, end) in run_ends.iter().enumerate() {
    let end = end.as_usize();
    let run_len = end - prev_end;
    indices.extend(std::iter::repeat(run_idx as u32).take(run_len));
    prev_end = end;
}
let taken = take(values.as_ref(), &UInt32Array::from_iter_values(indices), None)?;
if taken.data_type() != to_type {
    cast_with_options(taken.as_ref(), to_type, cast_options)
} else {
    Ok(taken)
}
}
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 75-93, the code builds indices
by iterating 0..run_array.run_ends().len() (number of runs) instead of the
logical array length, producing a too-short indices vector; change the loop to
iterate over the logical length (e.g., 0..run_array.len() or
run_array.logical_len()) and expand each run by comparing the current logical
index to run_ends[physical_idx].as_usize(), incrementing physical_idx only when
you pass a run end, and push the current physical_idx for each logical position;
construct the indices as a UInt32Array (consistent with other uses in this file)
and call take(&values, &UInt32Array::from_iter_values(indices), None) as before,
then cast if needed. Also add a unit test that builds a RunEndEncoded array with
run_ends=[2,5] and values=["a","b"], casts it to Utf8, and asserts length == 5
and contents == ["a","a","b","b","b"] to verify correct expansion.

Comment on lines +113 to +119
// Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
    array
} else {
    &cast_with_options(array, value_type, cast_options)?
};


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Borrow to temporary (lifetime) and minor API nit in take.
cast_array borrows a temporary ArrayRef; also prefer .as_ref() for take.

-    let cast_array = if array.data_type() == value_type {
-        array
-    } else {
-        &cast_with_options(array, value_type, cast_options)?
-    };
+    let cast_array: ArrayRef = if array.data_type() == value_type {
+        Arc::clone(array)
+    } else {
+        cast_with_options(array, value_type, cast_options)?
+    };
@@
-    let partitions = partition(&[Arc::clone(cast_array)])?;
+    let partitions = partition(&[Arc::clone(&cast_array)])?;
@@
-    let values_array = take(&cast_array, &indices, None)?;
+    let values_array = take(cast_array.as_ref(), &indices, None)?;

Compile check this file to ensure lifetimes and trait bounds (as_usize) resolve cleanly.

Also applies to: 156-160


I'll verify this review comment by examining the actual code and checking the validity of the lifetime and API concerns.


🏁 Script executed:

#!/bin/bash
# Examine the file context around the flagged lines
cat arrow-cast/src/cast/run_array.rs | head -150 | tail -50

Length of output: 1784


🏁 Script executed:

#!/bin/bash
# Also check lines 156-160 context
cat arrow-cast/src/cast/run_array.rs | sed -n '150,165p'

Length of output: 658


🏁 Script executed:

#!/bin/bash
# Find function signatures for cast_with_options and take
rg -n -A5 'fn (cast_with_options|take)\s*\(' arrow-cast/src --type rust

Length of output: 347


🏁 Script executed:

#!/bin/bash
# Check imports and type definitions at top of file
head -40 arrow-cast/src/cast/run_array.rs

Length of output: 1652


🏁 Script executed:

#!/bin/bash
# Look for the full function containing lines 113-119
rg -nB20 'Cast the input array to the target' arrow-cast/src/cast/run_array.rs

Length of output: 691


🏁 Script executed:

#!/bin/bash
# Check for ArrayRef type definition and usage
rg -n 'type ArrayRef' arrow-cast/src --type rust

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Examine partition function signature
rg -n -A3 'fn partition' arrow-cast/src --type rust

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the take function more carefully - might be from arrow library
rg -rn 'use.*take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get broader context - find the function definition containing line 113
awk 'NR>=100 && NR<=170' arrow-cast/src/cast/run_array.rs

Length of output: 2371


🏁 Script executed:

#!/bin/bash
# Check what type array is and its return from cast_with_options
rg -B5 -A10 'cast_with_options.*array.*value_type' arrow-cast/src/cast/run_array.rs

Length of output: 559


🏁 Script executed:

#!/bin/bash
# Look at the actual variable usage after line 119 to understand lifetime requirements
awk 'NR>=113 && NR<=145' arrow-cast/src/cast/run_array.rs

Length of output: 1283


🏁 Script executed:

#!/bin/bash
# Check if ArrayRef is Arc-based
grep -r 'pub type ArrayRef' arrow-cast/src --include='*.rs'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look for arrow crate dependencies
grep -E 'arrow|ArrayRef' arrow-cast/Cargo.toml

Length of output: 249


🏁 Script executed:

#!/bin/bash
# Check the take function import source
grep -n 'use.*take\|from.*take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Full file for reference
wc -l arrow-cast/src/cast/run_array.rs

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Get complete function context around lines 113-119 and 156-160
sed -n '90,175p' arrow-cast/src/cast/run_array.rs

Length of output: 2645


🏁 Script executed:

#!/bin/bash
# Check the second instance (156-160)
sed -n '145,170p' arrow-cast/src/cast/run_array.rs

Length of output: 771


🏁 Script executed:

#!/bin/bash
# Look for return type of cast_with_options
ast-grep --pattern 'fn cast_with_options($$$) -> $$$'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check partition and take function signatures in this file
grep -n -A4 'fn partition\|pub fn take' arrow-cast/src/cast/run_array.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Inspect imports more carefully
sed -n '1,50p' arrow-cast/src/cast/run_array.rs | grep -E '^use|^mod|^pub'

Length of output: 112


🏁 Script executed:

#!/bin/bash
# Find all instances where cast_array is used
grep -n 'cast_array' arrow-cast/src/cast/run_array.rs

Length of output: 223


🏁 Script executed:

#!/bin/bash
# Check arrow dependency version and ArrayRef
grep -A10 'dependencies' arrow-cast/Cargo.toml

Length of output: 760


🏁 Script executed:

#!/bin/bash
# More direct approach - look at the actual code lines
sed -n '100,165p' arrow-cast/src/cast/run_array.rs

Length of output: 2371


🏁 Script executed:

#!/bin/bash
# Check file structure
ls -la arrow-cast/src/cast/

Length of output: 543


🏁 Script executed:

#!/bin/bash
# Ensure proper path
find . -name "run_array.rs" -type f

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Get the full content if file is small
cat arrow-cast/src/cast/run_array.rs 2>/dev/null || cat arrow-cast/src/run_array.rs 2>/dev/null || find . -name "run_array.rs" -exec cat {} \;

Length of output: 6928


🏁 Script executed:

#!/bin/bash
# Try to find the file with fd
fd run_array.rs arrow-cast/

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Check what's in arrow-cast
ls -la arrow-cast/src/

Length of output: 466


🏁 Script executed:

#!/bin/bash
# Find all Rust files in arrow-cast
find arrow-cast -name "*.rs" -type f

Length of output: 471


🏁 Script executed:

#!/bin/bash
# Look for cast directory
find . -type d -name cast

Length of output: 21


🏁 Script executed:

#!/bin/bash
# Verify project structure
ls -la */src/ 2>/dev/null | head -20

Length of output: 932


🏁 Script executed:

#!/bin/bash
# List files matching the pattern
find . -path "*/cast/*array*.rs"

Length of output: 34


🏁 Script executed:

#!/bin/bash
# Broader search
find . -name "*run_array*"

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Check current directory structure
ls -R

Length of output: 15946


🏁 Script executed:

#!/bin/bash
# Look for any file with "run_array" in name
locate run_array.rs 2>/dev/null || find . -type f -name "*array*" 2>/dev/null | grep -i run

Length of output: 71


🏁 Script executed:

#!/bin/bash
# Simple current dir check
pwd && ls -la

Length of output: 3716



🏁 Script executed:

#!/bin/bash
# Find partition function signature
rg -n 'fn partition' arrow-ord/src --type rust -A3

Length of output: 789


🏁 Script executed:

#!/bin/bash
# Find take function signature  
rg -n 'pub fn take' arrow-select/src --type rust -A4

Length of output: 867


🏁 Script executed:

#!/bin/bash
# Check cast module imports at top to understand take source
head -50 arrow-cast/src/cast/mod.rs | grep -E '^use|^pub'

Length of output: 142


🏁 Script executed:

#!/bin/bash
# Look for take import in cast/mod.rs more broadly
grep -n 'take' arrow-cast/src/cast/mod.rs | head -5

Length of output: 188


Fix critical lifetime issue: borrow of temporary ArrayRef in else branch (line 118).

The current code creates a dangling reference in the else branch:

&cast_with_options(array, value_type, cast_options)?

This borrows a temporary that is immediately dropped. The suggested fix resolves this by making cast_array own the ArrayRef directly, and improves API clarity with explicit .as_ref() calls:

-    let cast_array = if array.data_type() == value_type {
-        array
-    } else {
-        &cast_with_options(array, value_type, cast_options)?
-    };
+    let cast_array: ArrayRef = if array.data_type() == value_type {
+        Arc::clone(array)
+    } else {
+        cast_with_options(array, value_type, cast_options)?
+    };

Then update usages at line 138 and line 159:

-    let partitions = partition(&[Arc::clone(cast_array)])?;
+    let partitions = partition(&[Arc::clone(&cast_array)])?;
-    let values_array = take(&cast_array, &indices, None)?;
+    let values_array = take(cast_array.as_ref(), &indices, None)?;
🤖 Prompt for AI Agents
In arrow-cast/src/cast/run_array.rs around lines 113 to 119, the else branch
currently returns a reference to a temporary (&cast_with_options(...)?),
creating a dangling borrow; change cast_array to own an ArrayRef (not &ArrayRef)
by assigning either a cloned/owned array for the equality case or the returned
ArrayRef from cast_with_options in the else branch so the value lives long
enough, and then update the subsequent uses at lines 138 and 159 to call
.as_ref() (or otherwise dereference) on the owned ArrayRef where a &Array is
required.

Comment on lines +96 to +125
// Check values for equality, with null checking
let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];

if lhs_range_offsets.len() != rhs_range_offsets.len() {
return false;
}

for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
.iter()
.zip(rhs_range_offsets)
.zip(lhs_range_sizes)
.enumerate()
{
let lhs_is_null = lhs_nulls.is_null(index);
let rhs_is_null = rhs_nulls.is_null(index);

if lhs_is_null != rhs_is_null {
return false;
}

let lhs_offset = lhs_offset.to_usize().unwrap();
let rhs_offset = rhs_offset.to_usize().unwrap();
let size = size.to_usize().unwrap();

// Check if values match in the range
if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
return false;
}
}

⚠️ Potential issue | 🔴 Critical

Bug: per-element size mismatch not checked when nulls are present

In the null-path, equality must assert that sizes are equal for non-null positions. Currently, only lhs size is used for comparison, which can yield false positives if rhs has extra elements.

Apply this fix to compare both sizes and use the agreed size:

-        // Check values for equality, with null checking
-        let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
-        let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];
+        // Check values for equality, with null checking
+        let lhs_range_offsets = &lhs_offsets[lhs_start..lhs_start + len];
+        let rhs_range_offsets = &rhs_offsets[rhs_start..rhs_start + len];

-        for (index, ((&lhs_offset, &rhs_offset), &size)) in lhs_range_offsets
-            .iter()
-            .zip(rhs_range_offsets)
-            .zip(lhs_range_sizes)
-            .enumerate()
-        {
+        for (index, ((&lhs_offset, &rhs_offset), (&lhs_size, &rhs_size))) in lhs_range_offsets
+            .iter()
+            .zip(rhs_range_offsets)
+            .zip(lhs_range_sizes.iter().zip(rhs_range_sizes))
+            .enumerate()
+        {
             let lhs_is_null = lhs_nulls.is_null(index);
             let rhs_is_null = rhs_nulls.is_null(index);

             if lhs_is_null != rhs_is_null {
                 return false;
             }

-            let lhs_offset = lhs_offset.to_usize().unwrap();
-            let rhs_offset = rhs_offset.to_usize().unwrap();
-            let size = size.to_usize().unwrap();
+            if !lhs_is_null {
+                if lhs_size != rhs_size {
+                    return false;
+                }
+                let lhs_offset = lhs_offset.to_usize().unwrap();
+                let rhs_offset = rhs_offset.to_usize().unwrap();
+                let size = lhs_size.to_usize().unwrap();

-            // Check if values match in the range
-            if !lhs_is_null && !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
-                return false;
-            }
+                // Check if values match in the range
+                if !equal_values(lhs_data, rhs_data, lhs_offset, rhs_offset, size) {
+                    return false;
+                }
+            }
         }

Add a test where one side has an extra element for a non-null sublist to prevent regressions.
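To make the failure mode concrete, here is a minimal self-contained sketch, using plain Rust tuples in place of the arrow-rs offset/size/null buffers (the `lists_equal` helper and its types are hypothetical, not the crate's API). It implements the corrected rule: for non-null entries, sizes must match before values are compared.

```rust
// Each list view is (offset, size, is_null) into a shared values buffer.
// Comparing only the lhs size would let a longer rhs sublist with a matching
// prefix compare equal; checking both sizes first prevents that.
fn lists_equal(
    values_l: &[i32],
    values_r: &[i32],
    lists_l: &[(usize, usize, bool)],
    lists_r: &[(usize, usize, bool)],
) -> bool {
    if lists_l.len() != lists_r.len() {
        return false;
    }
    lists_l.iter().zip(lists_r).all(|(&(lo, ls, ln), &(ro, rs, rn))| {
        if ln != rn {
            return false; // null on exactly one side
        }
        if ln {
            return true; // both null: values are irrelevant
        }
        // sizes must agree before comparing the value ranges
        ls == rs && values_l[lo..lo + ls] == values_r[ro..ro + rs]
    })
}

fn main() {
    let vals_l = [1, 2, 3];
    let vals_r = [1, 2, 3, 4];
    // lhs sublist [1, 2] vs rhs sublist [1, 2, 3]: prefix matches, sizes differ
    assert!(!lists_equal(&vals_l, &vals_r, &[(0, 2, false)], &[(0, 3, false)]));
    // identical non-null sublists compare equal
    assert!(lists_equal(&vals_l, &vals_r, &[(0, 2, false)], &[(0, 2, false)]));
    println!("ok");
}
```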

🤖 Prompt for AI Agents
In arrow-data/src/equal/list_view.rs around lines 96-125, the loop only reads
lhs sizes and doesn't verify that rhs per-element sizes match when values are
non-null; update the code to also slice rhs_range_sizes (similar to
lhs_range_sizes), in the for loop read both lhs_size and rhs_size (convert both
to usize) and immediately return false if they differ, then use the agreed size
(either value after the equality check) when calling equal_values; also add a
unit test that constructs two lists where a non-null sublist on one side has an
extra element (size mismatch) to prevent regressions.

Comment on lines +24 to +47
pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
array: &ArrayData,
) -> crate::transform::Extend<'_> {
let offsets = array.buffer::<T>(0);
let sizes = array.buffer::<T>(1);
Box::new(
move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
let offset_buffer = &mut mutable.buffer1;
let sizes_buffer = &mut mutable.buffer2;

for &offset in &offsets[start..start + len] {
offset_buffer.push(offset);
}

// sizes
for &size in &sizes[start..start + len] {
sizes_buffer.push(size);
}

// the beauty of views is that we don't need to copy child_data, we just splat
// the offsets and sizes.
},
)
}

⚠️ Potential issue | 🔴 Critical

Critical: Offsets not rebased and child_data not extended when concatenating; output can be invalid.

When MutableArrayData combines multiple ListView arrays, you must:

  • Extend mutable.child_data[0] with the referenced value range(s).
  • Rebase pushed offsets by the base child length so they still point to the concatenated values.

Without this, offsets from later input arrays can point before appended regions, and child_data remains empty, yielding corrupted arrays.
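The rebasing idea can be sketched with plain `Vec`s (the `append_list_view` helper below is hypothetical, not the MutableArrayData API): when views from a second array are appended, each copied offset must be shifted by the length the child values buffer had before that array's values landed.

```rust
// Append one ListView-style array (values + offsets + sizes) onto another,
// rebasing the copied offsets so each view still points at its own values
// inside the concatenated buffer.
fn append_list_view(
    dst_values: &mut Vec<i32>,
    dst_offsets: &mut Vec<usize>,
    dst_sizes: &mut Vec<usize>,
    src_values: &[i32],
    src_offsets: &[usize],
    src_sizes: &[usize],
) {
    let base = dst_values.len(); // where src's values will start in dst
    dst_values.extend_from_slice(src_values);
    for (&off, &size) in src_offsets.iter().zip(src_sizes) {
        dst_offsets.push(off + base); // rebase: without this, the view points
        dst_sizes.push(size); // into the *first* array's value region
    }
}

fn main() {
    let (mut values, mut offsets, mut sizes) = (vec![10, 20], vec![0], vec![2]);
    // second array: values [30, 40, 50], one list starting at offset 1, size 2
    append_list_view(&mut values, &mut offsets, &mut sizes, &[30, 40, 50], &[1], &[2]);
    assert_eq!(values, [10, 20, 30, 40, 50]);
    assert_eq!(offsets, [0, 3]); // 1 + base(2) = 3, so the view is [40, 50]
    assert_eq!(sizes, [2, 2]);
}
```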

Apply this fix to rebase offsets and extend child values in one contiguous span covering the selected rows (skipping null/empty rows):

@@
-pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
+pub(super) fn build_extend<T: ArrowNativeType + Integer + CheckedAdd>(
     array: &ArrayData,
 ) -> crate::transform::Extend<'_> {
     let offsets = array.buffer::<T>(0);
     let sizes = array.buffer::<T>(1);
+    let nulls = array.nulls().cloned();
     Box::new(
-        move |mutable: &mut _MutableArrayData, _index: usize, start: usize, len: usize| {
-            let offset_buffer = &mut mutable.buffer1;
-            let sizes_buffer = &mut mutable.buffer2;
-
-            for &offset in &offsets[start..start + len] {
-                offset_buffer.push(offset);
-            }
-
-            // sizes
-            for &size in &sizes[start..start + len] {
-                sizes_buffer.push(size);
-            }
-
-            // the beauty of views is that we don't need to copy child_data, we just splat
-            // the offsets and sizes.
-        },
+        move |mutable: &mut _MutableArrayData, index: usize, start: usize, len: usize| {
+            let end = start + len;
+            let offset_buffer = &mut mutable.buffer1;
+            let sizes_buffer = &mut mutable.buffer2;
+
+            // Compute a contiguous value range covering all selected non-empty, non-null lists
+            let mut min_off: Option<usize> = None;
+            let mut max_end: usize = 0;
+            for i in start..end {
+                // Skip null rows
+                if let Some(ref n) = nulls {
+                    if !n.is_valid(i) {
+                        continue;
+                    }
+                }
+                let sz = sizes[i];
+                if sz == T::zero() {
+                    continue;
+                }
+                let off = offsets[i].as_usize();
+                let list_end = off
+                    .checked_add(sz.as_usize())
+                    .expect("ListView size overflow");
+                min_off = Some(min_off.map_or(off, |m| m.min(off)));
+                if list_end > max_end {
+                    max_end = list_end;
+                }
+            }
+
+            // Base shift where newly appended child values will start
+            let base_child_len = mutable.child_data[0].len();
+
+            // If there are any values to copy, extend child_data once with the covering span
+            let delta_usize = if let Some(min) = min_off {
+                mutable.child_data[0].extend(index, min, max_end);
+                base_child_len
+                    .checked_sub(min)
+                    .expect("underflow computing ListView base shift")
+            } else {
+                0
+            };
+
+            let delta: T = T::from_usize(delta_usize)
+                .expect("ListView base shift does not fit target offset type");
+
+            // Push rebased offsets and original sizes
+            for i in start..end {
+                let off = offsets[i];
+                let rebased = off.checked_add(&delta).expect("ListView offset overflow");
+                offset_buffer.push(rebased);
+            }
+            for i in start..end {
+                sizes_buffer.push(sizes[i]);
+            }
+        },
     )
 }

Add a unit test that concatenates two ListView arrays via MutableArrayData and asserts offsets point into the concatenated values.

I can add the test and wire it into array_transform tests if you want.


Comment on lines +373 to +381
c.bench_function("cast runs of 10 string to ree<int32>", |b| {
let source_array: Int32Array = (0..8192).map(|i| i / 10).collect();
let array_ref = Arc::new(source_array) as ArrayRef;
let target_type = DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int32, false)),
Arc::new(Field::new("values", DataType::Int32, true)),
);
b.iter(|| cast(&array_ref, &target_type).unwrap());
});

⚠️ Potential issue | 🟡 Minor

Bench name/type mismatch.
Label says “string” but the source array is Int32. Update for clarity.

-    c.bench_function("cast runs of 10 string to ree<int32>", |b| {
+    c.bench_function("cast runs of 10 int32 to ree<int32>", |b| {
🤖 Prompt for AI Agents
In arrow/benches/cast_kernels.rs around lines 373 to 381, the benchmark label
incorrectly says "string" while the source array is Int32; update the bench name
to accurately reflect the types (e.g., "cast runs of 10 int32 to ree<int32>") so
the description matches the source_array and target_type.

Comment on lines +138 to +145
if start > self.len() || start + length > self.len() {
return Err(eof_err!(
"Expected to read {} bytes at offset {}, while file has length {}",
length,
start,
self.len()
));
}

⚠️ Potential issue | 🟠 Major

Potential integer overflow in bounds check.

The expression start + length on line 138 can overflow since both operands are usize. If overflow occurs, the wrapped value may incorrectly pass the bounds check, allowing out-of-bounds access.

Consider using checked arithmetic or restructuring the condition:

-        if start > self.len() || start + length > self.len() {
+        if start > self.len() || length > self.len() - start {

Or with explicit overflow handling:

-        if start > self.len() || start + length > self.len() {
+        let end = start.checked_add(length).ok_or_else(|| {
+            ParquetError::EOF(format!(
+                "Expected to read {} bytes at offset {}, while file has length {}",
+                length, start, self.len()
+            ))
+        })?;
+        if end > self.len() {
             return Err(eof_err!(
                 "Expected to read {} bytes at offset {}, while file has length {}",
                 length,
                 start,
                 self.len()
             ));
         }
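A small standalone sketch of the wraparound (using `wrapping_add` to stand in for release-mode overflow behavior; the helper names are illustrative, not from the crate):

```rust
// The naive check `start + length <= len` wraps for usize in release builds,
// so a huge `length` can slip past it. Rewriting as a subtraction, guarded by
// `start <= len`, cannot overflow.
fn in_bounds_naive(start: usize, length: usize, len: usize) -> bool {
    // wrapping_add models what `start + length` does when overflow checks are off
    start <= len && start.wrapping_add(length) <= len
}

fn in_bounds_checked(start: usize, length: usize, len: usize) -> bool {
    start <= len && length <= len - start
}

fn main() {
    let len = 100;
    // start = 1, length = usize::MAX wraps to 0 and passes the naive check
    assert!(in_bounds_naive(1, usize::MAX, len));
    assert!(!in_bounds_checked(1, usize::MAX, len));
    // ordinary cases agree
    assert!(in_bounds_checked(10, 90, len));
    assert!(!in_bounds_checked(10, 91, len));
}
```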
🤖 Prompt for AI Agents
In parquet/src/file/reader.rs around lines 138 to 145, the bounds check uses
`start + length` which can overflow for usize and let an out-of-bounds read slip
through; replace that arithmetic with a checked operation (e.g., use
start.checked_add(length) and handle None as overflow) or rewrite the condition
as `length > self.len().saturating_sub(start)` / `start > self.len() || length >
self.len().wrapping_sub(start)` (prefer checked_add) so overflows are detected
and return the same EOF error when adding would overflow or the sum exceeds file
length.

Comment on lines +395 to +397
if offset > buffer.len() || offset > uncompressed_page_size {
return Err(general_err!("Invalid page header"));
}

⚠️ Potential issue | 🟡 Minor

Redundant condition in offset validation.

The check offset > uncompressed_page_size at line 395 is redundant. The earlier validation at lines 368-382 already ensures that definition_levels_byte_length + repetition_levels_byte_length <= uncompressed_page_size, and since offset is calculated from these same values (lines 383-385), the condition offset > uncompressed_page_size can never be true at this point.

The meaningful part of this validation is offset > buffer.len(), which correctly checks if the received buffer is too small.

Consider simplifying to:

-            if offset > buffer.len() || offset > uncompressed_page_size {
+            if offset > buffer.len() {
                 return Err(general_err!("Invalid page header"));
             }
🤖 Prompt for AI Agents
In parquet/src/file/serialized_reader.rs around lines 395 to 397, the condition
`offset > uncompressed_page_size` is redundant because earlier checks guarantee
the sum used to compute offset cannot exceed uncompressed_page_size; simplify
the validation by removing that redundant comparison and only check `offset >
buffer.len()` to detect a too-small buffer, preserving the existing error return
path and message.

Comment on lines +1149 to +1177
#[test]
fn test_decode_page_invalid_offset() {
let page_header = PageHeader {
r#type: PageType::DATA_PAGE_V2,
uncompressed_page_size: 10,
compressed_page_size: 10,
data_page_header: None,
index_page_header: None,
dictionary_page_header: None,
crc: None,
data_page_header_v2: Some(DataPageHeaderV2 {
num_nulls: 0,
num_rows: 0,
num_values: 0,
encoding: Encoding::PLAIN,
definition_levels_byte_length: 11,
repetition_levels_byte_length: 0,
is_compressed: None,
statistics: None,
}),
};

let buffer = Bytes::new();
let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err();
assert!(
err.to_string()
.contains("DataPage v2 header contains implausible values")
);
}

⚠️ Potential issue | 🟡 Minor

Test doesn't cover the new validation logic.

The test test_decode_page_invalid_offset validates the existing check at lines 368-382 (where definition_levels_byte_length + repetition_levels_byte_length > uncompressed_page_size), but doesn't reach the new validation added at lines 395-397.

To test the new code path at lines 395-397, the test would need to:

  1. Pass the first validation (set levels so their sum ≤ uncompressed_page_size)
  2. Provide an empty or small buffer so offset > buffer.len() triggers
  3. Include a decompressor

Consider renaming this test to reflect what it actually tests, or add a separate test that specifically validates line 395's offset > buffer.len() check.

@martin-augment
Owner Author

The fork's main branch was outdated, so this PR shows a larger diff than the actual change.
Closing to redo it.
