
fix(parquet): fix CDC panic on nested ListArrays with null entries #2

Closed
kszucs wants to merge 80 commits into main from cdc-nested-null

Conversation


@kszucs kszucs commented Apr 1, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

SYaoJun and others added 30 commits February 23, 2026 11:47
# Which issue does this PR close?

# Rationale for this change

Use .cloned().map(Some) instead of .map(|b| Some(b.clone()))
for better readability and idiomatic Rust style.
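
The two forms are equivalent; a minimal sketch illustrating the idiom on an `Option<&T>` (the `wrap_cloned` helper is hypothetical, for illustration only):

```rust
// Hypothetical helper showing the idiom: .cloned() turns Option<&T>
// into Option<T>, then .map(Some) wraps the value one level deeper.
fn wrap_cloned(opt: Option<&String>) -> Option<Option<String>> {
    opt.cloned().map(Some)
}

fn main() {
    let b = String::from("bytes");
    // Equivalent to the older opt.map(|b| Some(b.clone())) form.
    assert_eq!(wrap_cloned(Some(&b)), Some(Some(b.clone())));
    assert_eq!(wrap_cloned(None), None);
}
```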

# What changes are included in this PR?

# Are these changes tested?

# Are there any user-facing changes?
…apache#9411)

Implements a helper to replace the pattern of creating a `BooleanBuffer`
from an unsliced validity bitmap and filtering by null count. Previously
this was done with `BooleanBuffer::new(...)` plus
`Some(NullBuffer::new(...)).filter(|n| n.null_count() > 0)`; now it is a
single call to `NullBuffer::try_from_unsliced(buffer, len)`, which
returns `Some(NullBuffer)` when there are nulls and `None` when all
values are valid.

- Added `try_from_unsliced` in `arrow-buffer/src/buffer/null.rs` with
tests for nulls, all valid, all null, empty
- Refactor `FixedSizeBinaryArray::try_from_iter_with_size` and
`try_from_sparse_iter_with_size` to use it
- Refactor `take_nulls` in `arrow-select` to use it
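
The described semantics can be sketched without the arrow-buffer types; this toy helper mirrors the behavior (return `Some` only when the validity bitmap actually contains nulls), not the real `NullBuffer::try_from_unsliced` implementation:

```rust
/// Toy stand-in for a validity bitmap: true = valid, false = null.
/// Returns Some(buffer) only when at least one entry is null,
/// and None when all values are valid (mirroring the helper's contract).
fn try_from_unsliced(bits: Vec<bool>, len: usize) -> Option<Vec<bool>> {
    let bits = bits[..len].to_vec();
    let null_count = bits.iter().filter(|b| !**b).count();
    // All-valid data is represented as None, so downstream code can
    // skip null handling entirely.
    (null_count > 0).then_some(bits)
}

fn main() {
    assert!(try_from_unsliced(vec![true, false, true], 3).is_some());
    assert!(try_from_unsliced(vec![true, true], 2).is_none());
    assert!(try_from_unsliced(vec![], 0).is_none());
}
```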


Closes apache#9385
# Which issue does this PR close?

- Closes apache#9455.

# Rationale for this change
See the linked issue.

# What changes are included in this PR?

Added list type support to `VariantArray` data type checking.

# Are these changes tested?

# Are there any user-facing changes?
…taType (apache#9473)

This enables setting a timezone or precision & scale on parameterized
DataType values.

Note: I think the panic is unfortunate, and a `try_with_data_type()` would
be sensible.

# Which issue does this PR close?

- Closes apache#8042.

# Are these changes tested?

Yes

# Are there any user-facing changes?

- Adds `PrimitiveRunBuilder::with_data_type`.
# Which issue does this PR close?

- Closes apache#9336

# Rationale for this change

When an Avro reader schema has a union type that needs to be resolved
against the type in the writer schema, resolution information other than
primitive type promotions is not properly handled when creating the
decoder.
For example, when the reader schema has a nullable record field that has
an added nested field on top of the fields defined in the writer schema,
the record type resolution needs to be applied, using a projection with
the default field value.

# What changes are included in this PR?

Extend the union resolution information in the decoder with variant
data for enum remapping and record projection. The `Projector` data
structure with `Skipper` decoders forms part of this information,
which necessitated some refactoring.

# Are these changes tested?

TODO:

- [x] Debug failing tests including a busy-loop failure mode.
- [ ] Add more unit tests exercising the complex resolutions.

# Are there any user-facing changes?

No.
- part of apache#8466

Update release schedule based on historical reality
…ts (apache#9472)

# Rationale for this change

The motivation for this PR is to improve the testing
infrastructure as a precursor to the following PR:

- apache#9221 

@Jefffrey seemed to be in favor of using `insta` for more tests:
apache#9221 (comment)

# What changes are included in this PR?

This PR does not change any logic; it is a straightforward translation
of the current tests. More test cases, especially around escape
sequences, can be added in follow-up PRs.

# Are these changes tested?

Yes, though to review we still need to manually confirm that no test
cases changed accidentally.

# Are there any user-facing changes?

No.
…ime (apache#9491)

Mark ArrowTimestampType::make_value as deprecated and migrate internal
callers to the newer from_naive_datetime API.

# Which issue does this PR close?

- Closes apache#9490.

# Rationale for this change

Follow-up from PR apache#9345.

# What changes are included in this PR?

Mark ArrowTimestampType::make_value as deprecated and migrate internal
callers to the newer from_naive_datetime API.

# Are these changes tested?

YES.

# Are there any user-facing changes?

Migration Path: Users should replace:

```rust
// Old
TimestampSecondType::make_value(naive)
```
With:

```rust
// New
TimestampSecondType::from_naive_datetime(naive, None)
```
…pache#9498)

# Which issue does this PR close?

- Closes apache#9492.

# What changes are included in this PR?

See title.

# Are these changes tested?

YES

# Are there any user-facing changes?

NO
# Which issue does this PR close?

- Closes apache#9478.

# What changes are included in this PR?

- Fix the typo
- Enhance the bracket access for the variant path

# Are these changes tested?

- Add some tests to cover the logic

# Are there any user-facing changes?

No
… parquet unshredding (apache#9437)

# Which issue does this PR close?

- Part of apache#9340.

# Rationale for this change

JSON writers for list-like types (List/ListView/FixedSizeList) are pretty
similar apart from the element range representation. We already had a
good way to abstract this kind of encoder in parquet variant
unshredding. Given this, it makes sense to move the `ListLikeArray`
trait to arrow-array so it can be shared between json and parquet.

# What changes are included in this PR?

Move `ListLikeArray` trait from parquet-variant-compute to arrow-array

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

New pub trait in arrow-array
# Which issue does this PR close?

- Closes apache#9425.

# Rationale for this change

I noticed that this method is available on PrimitiveTypeBuilder but
missing on GenericByteBuilder, which makes sense since the gain is
smaller, but benchmarking shows a solid 10% improvement, mostly because
of the more efficient allocation of the null mask.

```
┌───────────────────┬────────────────┬───────────────────┬─────────┐
│     Benchmark     │ append_value_n │ append_value loop │ Speedup │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100/len=5       │ 371 ns         │ 408 ns            │ 10%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100/len=30      │ 456 ns         │ 507 ns            │ 10%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100/len=1024    │ 1.81 µs        │ 1.95 µs           │ 8%      │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=1000/len=5      │ 2.39 µs        │ 2.87 µs           │ 17%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=1000/len=30     │ 3.41 µs        │ 3.89 µs           │ 12%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=1000/len=1024   │ 12.3 µs        │ 14.4 µs           │ 15%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=10000/len=5     │ 23.8 µs        │ 29.3 µs           │ 19%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=10000/len=30    │ 33.7 µs        │ 39.0 µs           │ 14%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=10000/len=1024  │ 115.9 µs       │ 135.0 µs          │ 14%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100000/len=5    │ 227.5 µs       │ 278.6 µs          │ 18%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100000/len=30   │ 328.1 µs       │ 377.9 µs          │ 13%     │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100000/len=1024 │ 1.16 ms        │ 1.34 ms           │ 14%     │
└───────────────────┴────────────────┴───────────────────┴─────────┘
```

I think this is still worthwhile to add. Let me know what the
community thinks!
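
The gain comes from amortizing per-call overhead: a bulk append can reserve the value bytes and the null mask once instead of growing them value by value. A toy builder sketch showing the shape of such an API (this is not the GenericByteBuilder internals; names are illustrative):

```rust
/// Toy byte builder: concatenated value bytes plus a bool-per-row
/// validity mask (a stand-in for a packed null bitmap).
struct ToyByteBuilder {
    data: Vec<u8>,
    validity: Vec<bool>,
}

impl ToyByteBuilder {
    fn new() -> Self {
        Self { data: Vec::new(), validity: Vec::new() }
    }

    /// Append the same value `n` times, reserving both buffers up front
    /// so the loop never reallocates.
    fn append_value_n(&mut self, value: &[u8], n: usize) {
        self.data.reserve(value.len() * n);
        self.validity.reserve(n);
        for _ in 0..n {
            self.data.extend_from_slice(value);
        }
        self.validity.extend(std::iter::repeat(true).take(n));
    }
}

fn main() {
    let mut b = ToyByteBuilder::new();
    b.append_value_n(b"abc", 3);
    assert_eq!(b.data, b"abcabcabc");
    assert_eq!(b.validity.len(), 3);
}
```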

# What changes are included in this PR?

A new public API.

# Are these changes tested?

Yes!

# Are there any user-facing changes?

A new public API.
# Which issue does this PR close?

Closes apache#9431

# Rationale for this change

It would be nice to add `append_nulls` to `MapBuilder`, similar to
`append_nulls` on `GenericListBuilder`. Appending the nulls at once
instead of in a loop has some nice performance implications:

```
Benchmark results (1,000,000 nulls):

┌─────────────────────────┬─────────┐
│         Method          │  Time   │
├─────────────────────────┼─────────┤
│ append(false) in a loop │ 2.36 ms │
├─────────────────────────┼─────────┤
│ append_nulls(N)         │ 50 µs   │
└─────────────────────────┴─────────┘
```

# What changes are included in this PR?

A new public API.

# Are these changes tested?

With some fresh unit tests.

# Are there any user-facing changes?

A nice and convenient new public API.
# Which issue does this PR close?

- Closes apache#9429

I'm doing some performance optimization, and noticed that we have a loop
adding one value to the null mask at a time. Instead, I'd suggest adding
`append_non_nulls` to do this at once.

```
append_non_nulls(n) vs append(true) in a loop (with bitmap allocated)

┌───────────┬───────────────────┬─────────────────────┬─────────┐
│     n     │ append(true) loop │ append_non_nulls(n) │ speedup │
├───────────┼───────────────────┼─────────────────────┼─────────┤
│ 100       │ 251 ns            │ 73 ns               │ ~3x     │
├───────────┼───────────────────┼─────────────────────┼─────────┤
│ 1,000     │ 2.0 µs            │ 94 ns               │ ~21x    │
├───────────┼───────────────────┼─────────────────────┼─────────┤
│ 10,000    │ 19.3 µs           │ 119 ns              │ ~162x   │
├───────────┼───────────────────┼─────────────────────┼─────────┤
│ 100,000   │ 191 µs            │ 348 ns              │ ~549x   │
├───────────┼───────────────────┼─────────────────────┼─────────┤
│ 1,000,000 │ 1.90 ms           │ 3.5 µs              │ ~543x   │
└───────────┴───────────────────┴─────────────────────┴─────────┘
```
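
Speedups of this magnitude are consistent with filling the validity bitmap whole bytes at a time instead of bit by bit. A rough sketch of the byte-wise idea (this is an illustration, not the actual arrow-buffer builder code):

```rust
/// Set `n` validity bits to 1 starting at bit `offset` in a packed
/// little-endian bitmap, writing whole 0xFF bytes for the aligned
/// middle instead of looping per bit.
fn append_non_nulls(bitmap: &mut Vec<u8>, offset: usize, n: usize) {
    let end = offset + n;
    let needed = (end + 7) / 8;
    if bitmap.len() < needed {
        bitmap.resize(needed, 0);
    }
    let first_full = (offset + 7) / 8; // first byte we can fill whole
    let last_full = end / 8;           // one past the last whole byte
    // Leading unaligned bits, up to the first whole byte.
    for i in offset..end.min(first_full * 8) {
        bitmap[i / 8] |= 1 << (i % 8);
    }
    // Aligned middle: one store covers 8 bits at a time.
    if first_full < last_full {
        for byte in &mut bitmap[first_full..last_full] {
            *byte = 0xFF;
        }
    }
    // Trailing unaligned bits.
    for i in (last_full * 8).max(offset)..end {
        bitmap[i / 8] |= 1 << (i % 8);
    }
}

fn main() {
    let mut bm = Vec::new();
    append_non_nulls(&mut bm, 3, 10); // set bits 3..13
    let ones: u32 = bm.iter().map(|b| b.count_ones()).sum();
    assert_eq!(ones, 10);
    assert_eq!(bm[0], 0b1111_1000);
    assert_eq!(bm[1], 0b0001_1111);
}
```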


# Rationale for this change

It adds a new public API in favor of performance improvements.

# What changes are included in this PR?

A new public API

# Are these changes tested?

Yes, with new unit-tests.

# Are there any user-facing changes?

Just a new convenient API.
Updates the requirements on
[strum_macros](https://github.com/Peternator7/strum) to permit the
latest version.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum_macros's
changelog</a>.</em></p>
<blockquote>
<h2>0.28.0</h2>
<ul>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>:
Allow any kind of passthrough attributes on
<code>EnumDiscriminants</code>.</p>
<ul>
<li>Previously only list-style attributes (e.g.
<code>#[strum_discriminants(derive(...))]</code>) were supported. Now
path-only
(e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and
name/value (e.g. <code>#[strum_discriminants(doc =
&quot;foo&quot;)]</code>)
attributes are also supported.</li>
</ul>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>:
Add missing <code>#[automatically_derived]</code> to generated impls not
covered by <a
href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>:
Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and
<code>windows-sys</code> dependencies. This is a breaking change if
you're on an old version of rust.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>:
Use absolute paths in generated proc macro code to avoid
potential name conflicts.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>:
Upgrade <code>phf</code> dependency to v0.13.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>:
Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub
Actions CI.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>:
<code>strum::ParseError</code> now implements
<code>core::fmt::Display</code> instead of
<code>std::fmt::Display</code> to make it <code>#[no_std]</code>
compatible. Note the <code>Error</code> trait wasn't available in core
until <code>1.81</code>
so <code>strum::ParseError</code> still only implements that in std.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>:
<strong>Breaking Change</strong> - <code>EnumString</code> now
implements <code>From&lt;&amp;str&gt;</code>
(infallible) instead of <code>TryFrom&lt;&amp;str&gt;</code> when the
enum has a <code>#[strum(default)]</code> variant. This more accurately
reflects that parsing cannot fail in that case. If you need the old
<code>TryFrom</code> behavior, you can opt back in using
<code>parse_error_ty</code> and <code>parse_error_fn</code>:</p>
<pre lang="rust"><code>#[derive(EnumString)]
#[strum(parse_error_ty = strum::ParseError, parse_error_fn =
make_error)]
pub enum Color {
    Red,
    #[strum(default)]
    Other(String),
}
fn make_error(x: &amp;str) -&gt; strum::ParseError {
strum::ParseError::VariantNotFound
}
</code></pre>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>:
Fix bug where <code>EnumString</code> ignored the
<code>parse_err_ty</code>
attribute when the enum had a <code>#[strum(default)]</code>
variant.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>:
EnumDiscriminants will now copy <code>default</code> over from the
original enum to the Discriminant enum.</p>
<pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)]
#[strum_discriminants(derive(Default))] // &lt;- Remove this in 0.28.
enum MyEnum {
    #[default] // &lt;- Will be the #[default] on the MyEnumDiscriminant
    #[strum_discriminants(default)] // &lt;- Remove this in 0.28
    Variant0,
    Variant1 { a: NonDefault },
}
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a>
Peternator7/0.28 (<a
href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a>
Display exists in core (<a
href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a>
Make TryFrom and FromStr infallible if there's a default (<a
href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a>
Honor parse_err_ty attribute when the enum has a default variant (<a
href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a>
Automatically add Default implementation to EnumDiscriminant if it
exists on ...</li>
<li><a
href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a>
Fix existing cargo fmt + clippy issues and add GH actions (<a
href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a>
feat: allow any kind of passthrough attributes on
<code>EnumDiscriminants</code> (<a
href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a>
docs: Fix typo (<a
href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a>
Upgrade <code>phf</code> to v0.13 (<a
href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a>
Use absolute paths in proc macro (<a
href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/Peternator7/strum/compare/v0.27.0...v0.28.0">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

# Rationale for this change

The inner loop in `Projector::project_record` gives the optimizer
somewhat complicated dynamic data to branch through.
The sparse arrays in `Projector` are redundantly coded: `None` in the
index positions of `writer_to_reader` must match `Some` in
`skip_decoders` and vice versa.

# What changes are included in this PR?

Refactor record projection state with a single array of directive-like
enums corresponding to each writer schema field.

# Are these changes tested?

Added a benchmark for record projection (the benchmark code is partially
shared with apache#9397).
Somewhat counterintuitively for me, it does not show improvement on a
more complex case with a mix of projected fields, but does improve the
simpler one-field projection cases.
 
Passes the existing tests.
1. Reduce some quantifiers from `*` to `?` when 2+ occurrences would
generate invalid Rust code, e.g. `$(if $pred:expr)*`.
2. Clean up 4-armed recursive macros:
   * put the base case first
   * explain the fixups
   * fix all at once, going directly to the base case, instead of
     possibly multiple hoops

The initial motivation was getting rust-analyzer to stop choking on such
macro usage where the left-hand side was a tuple and the
right-hand side an expr.
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Part of apache#9476.

# Rationale for this change

Add benchmarks to show benefit of the optimizations in apache#9477

# What changes are included in this PR?

Adds some benches for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and
DELTA_LENGTH_BYTE_ARRAY. The generated data is meant to show the benefit
of special casing for miniblocks with a bitwidth of 0.

# Are these changes tested?

Just benches

# Are there any user-facing changes?

No
# Which issue does this PR close?

 - Closes apache#9506


# Rationale for this change

`Vec::reserve(n)` does not guarantee exact capacity: Rust's
`MIN_NON_ZERO_CAP` optimization means `reserve(2)` gives a capacity of 4
for most numeric types, causing `debug_assert_eq!(capacity, batch_size)`
to panic in debug mode when `batch_size < 4`.

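
The difference is observable with plain `Vec`. Note that the exact rounded-up capacity is a current-`std` implementation detail, not a documented guarantee, so the assertions below only check the documented lower bounds:

```rust
fn main() {
    let mut amortized: Vec<i64> = Vec::new();
    amortized.reserve(2);
    // reserve() may round up for amortized growth; on current std this
    // typically yields capacity 4 for most numeric types
    // (MIN_NON_ZERO_CAP), which is what tripped the debug assertion.
    assert!(amortized.capacity() >= 2);

    let mut exact: Vec<i64> = Vec::new();
    exact.reserve_exact(2);
    // reserve_exact() requests exactly the additional capacity; the
    // allocator may still return more, but in practice this is 2.
    assert!(exact.capacity() >= 2);

    println!(
        "reserve: {}, reserve_exact: {}",
        amortized.capacity(),
        exact.capacity()
    );
}
```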
# What changes are included in this PR?

Replace `reserve` with `reserve_exact` in `ensure_capacity` in both
`InProgressPrimitiveArray` and `InProgressByteViewArray`.
`reserve_exact` skips the amortized growth optimization and allocates
exactly the requested capacity, making the assertion correct.

# Are these changes tested?
No. This only fixes an incorrect debug assertion. 

# Are there any user-facing changes?
No
# Which issue does this PR close?

- Closes apache#9513

# Rationale for this change

`VariantArray::try_new` and `canonicalize_and_verify_data_type` both
accept `LargeUtf8` as a valid shredded variant type. However,
`unshred_variant` currently only handles `Utf8` for string `typed_value`
columns.

This means a `VariantArray` with a `LargeUtf8` `typed_value` column can
be constructed successfully, but calling `unshred_variant` on it fails.
# Which issue does this PR close?

- Closes apache#9512

# Rationale for this change

You can build a Variant with a StringView type shredded out, but calling
`unshred_variant` will fail with a "not yet implemented" error.
…page skip (apache#9374)

# Which issue does this PR close?

- Closes apache#9370 .

# Rationale for this change

The bug occurs when using RowSelection with nested types (like
List<String>) when:
1. A column has multiple pages in a row group
2. The selected rows span across page boundaries
3. The first page is entirely consumed during skip operations

The issue was in `arrow-rs/parquet/src/column/reader.rs:287-382`
(`skip_records` function).

**Root cause:** When `skip_records` completed successfully after
crossing page boundaries, the `has_partial` state in the
`RepetitionLevelDecoder` could incorrectly remain true.

This happened when:

- The skip operation exhausted a page where has_record_delimiter was
false
- The skip found the remaining records on the next page by counting a
delimiter at index 0
- When a subsequent read_records(1) was called, the stale
has_partial=true state caused count_records to incorrectly interpret the
first repetition level (0) at index 0 as ending a "phantom" partial
record, returning (1 record, 0 levels, 0 values) instead of properly
reading the actual record data.

For a more descriptive explanation, look here:
apache#9370 (comment)

# What changes are included in this PR?

Added code at the end of skip_records to reset the partial record state
when all requested records have been successfully skipped.

This ensures that after skip_records completes, we're at a clean record
boundary with no lingering partial record state, fixing the array length
mismatch in StructArrayReader.

# Are these changes tested?

Commit
apache@365bd9a
introduces a test showcasing this issue with v2 data pages only on a
unit-test level. PR apache#9399 could
be used to showcase the issue in an end-to-end way.

A previously wrong assumption was that it had to do with mixing v1 and
v2 data pages:

```
In b52e043 I added a test that I validated to fail whenever I remove my fix.

Bug Mechanism

The bug requires three ingredients:

1. Page 1 (DataPage v1): Contains a nested column (with rep levels). During
   skip_records, all levels on this page are consumed. count_records sees no
   following rep=0 delimiter, so it sets has_partial=true. Since
   has_record_delimiter is false (the default InMemoryPageReader returns
   false when more pages exist), flush_partial is not called.
2. Page 2 (DataPage v2): Has num_rows available in its metadata. When
   num_rows <= remaining_records, the entire page is skipped via
   skip_next_page() — this does not touch the rep level decoder at all, so
   has_partial remains stale true from page 1.
3. Page 3 (DataPage v1): When read_records loads this page, the stale
   has_partial=true causes the rep=0 at position 0 to be misinterpreted as
   completing a "phantom" partial record. This produces (1 record, 0 levels,
   0 values) instead of reading the actual record data.

Test Verification

- With fix (flush_partial at end of skip_records): read_records(1) correctly
  returns (1, 2, 2) with values [70, 80]
- Without fix: read_records(1) returns (1, 0, 0) — a phantom record with no
  data, which is what causes the "Not all children array length are the
  same!" error when different sibling columns in a struct produce different
  record counts
```

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…e#9532)

# Which issue does this PR close?


- Closes #NNN.

# Rationale for this change
I would like to use `ParquetMetaDataPushDecoder` in arrow-datafusion,
but the `with_file_decryption_properties` function is `pub(crate)`, so I
can't fully implement the encryption feature.

# What changes are included in this PR?
Make it pub

# Are these changes tested?

Not needed

# Are there any user-facing changes?

Now pub
…nion (apache#9524)

# Which issue does this PR close?

Field::try_merge correctly handles DataType::Null for primitive types
and when self is Null, but fails when self is a compound type (Struct,
List, LargeList, Union) and from is Null. This causes Schema::try_merge
to error when merging schemas where one has a Null field and another has
a
concrete compound type for the same field.

This is common in JSON inference where some files have null values for
fields that are structs/lists in other files.

- Closes apache#9523

# Rationale for this change


# What changes are included in this PR?

Add `DataType::Null` arms to the Struct, List, LargeList, and Union
branches in `Field::try_merge`, consistent with how primitive types
already handle it.

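
The shape of the fix can be sketched with a toy type enum; this illustrates the match-arm pattern (Null merges with a compound type in either direction), not the actual `Field::try_merge` code:

```rust
#[derive(Clone, Debug, PartialEq)]
enum ToyType {
    Null,
    Int,
    Struct(Vec<ToyType>),
    List(Box<ToyType>),
}

/// Merging with Null keeps the concrete type, in either direction.
fn try_merge(a: &ToyType, b: &ToyType) -> Result<ToyType, String> {
    match (a, b) {
        // Null merged with anything yields the other type...
        (ToyType::Null, other) | (other, ToyType::Null) => Ok(other.clone()),
        // ...equal concrete types merge trivially...
        (x, y) if x == y => Ok(x.clone()),
        // ...and mismatched concrete types are an error.
        (x, y) => Err(format!("cannot merge {:?} with {:?}", x, y)),
    }
}

fn main() {
    let s = ToyType::Struct(vec![ToyType::Int]);
    // Null + compound works in both directions, as in the fix.
    assert_eq!(try_merge(&ToyType::Null, &s), Ok(s.clone()));
    assert_eq!(try_merge(&s, &ToyType::Null), Ok(s.clone()));
    assert!(try_merge(&ToyType::Int, &s).is_err());
}
```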
# Are these changes tested?

- Added test `test_merge_compound_with_null` covering Struct, List,
  LargeList, and Union merging with Null in both directions.
- Existing tests continue to pass.

# Are there any user-facing changes?

No
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Relates to apache#9497.

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Add benchmark for `ListArray` in `json_reader` to support the
performance evaluation of apache#9497

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Benchmarks for decoding and serializing a JSON list to `ListArray`.
- Benchmarks for `ListArray` and `FixedSizeListArray` json writer

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Benchmarks only

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

No
…apache#8918)

# Which issue does this PR close?

Part of apache#8137. Follow up of apache#7303. Replaces apache#8040.

# Rationale for this change

apache#7303 implements the fundamental symbols for tracking memory. This
patch exposes those APIs at a higher level on Array and ArrayData.

# What changes are included in this PR?

New `claim` API for NullBuffer, ArrayData, and Array. New `pool`
feature-flag to arrow, arrow-array, and arrow-data.

# Are these changes tested?

Added a doctest on the `Array::claim` method.

# Are there any user-facing changes?

Added API and a new feature-flag for arrow, arrow-array, and arrow-data.
…dicates (apache#9509)

# Which issue does this PR close?

Raised an issue at apache#9516 for
this one

Same issue as apache#9239 but
extended to another scenario

# Rationale for this change

When there are multiple predicates being evaluated, we need to reset the
row selection policy before overriding the strategy.

Scenario:
- Dense initial RowSelection (alternating select/skip) covers all pages
→ Auto resolves to Mask
- Predicate 1 evaluates on column A, narrows selection to skip middle
pages
- Predicate 2's column B is fetched sparsely with the narrowed selection
(missing middle pages)
- Without the fix, the override for predicate 2 returns early
(policy=Mask, not Auto), so Mask is used and tries to read missing pages
→ "Invalid offset" error

# What changes are included in this PR?

This is a one line change to reset the selection policy in the
`RowGroupDecoderState::WaitingOnFilterData` arm

# Are these changes tested?

Yes, a new test was added that currently fails on `main`, but as you can
see it's a doozy to set up.

# Are there any user-facing changes?

Nope
…e#9481)

# Which issue does this PR close?

- Closes apache#9451
- Closes apache#6256

# Rationale for this change

A reader might be annoyed (performance wart) if a parquet footer lacks
nullcount stats, but inferring nullcount=0 for missing stats makes the
stats untrustworthy and can lead to incorrect behavior.

# What changes are included in this PR?

If a parquet footer nullcount stat is missing, surface it as None,
reserving `Some(0)` for known-no-null cases.

# Are these changes tested?

Fixed one unit test that broke, added a missing unit test that covers
the other change site.

# Are there any user-facing changes?

The stats API doesn't change signature, but there is a behavior change.
The existing doc that called out the incorrect behavior has been removed
to reflect that the incorrect behavior no longer occurs.
# Rationale for this change

Exposes the Arrow schema produced by the async Avro file reader,
similarly to the `schema` method on the synchronous reader.

This allows an application to prepare casting or other schema
transformations with no need to fetch the first record batch to learn
the produced Arrow schema. Since the async reader only parses OCF
content for the moment, the schema does not change from batch to batch.

# What changes are included in this PR?

The `schema` method for `AsyncAvroFileReader` exposes the Arrow schema
of record batches that are produced by the reader.

# Are these changes tested?

Added tests verifying that the returned schema matches the expected one.

# Are there any user-facing changes?

Added a `schema` method to `AsyncAvroFileReader`.
etseidl and others added 24 commits March 19, 2026 13:39
# Which issue does this PR close?

- Closes apache#9543.

# Rationale for this change

See issue, but it is possible to construct arguments to
`arrow_buffer::bit_util::bit_mask::set_bits` that overflow the bounds
checking protecting unsafe code.

# What changes are included in this PR?
Use `checked_add` when doing the bounds checking and panic when an
overflow occurs.

# Are these changes tested?

Yes

# Are there any user-facing changes?

No
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes apache#9529 .

# Rationale for this change

- In a follow-up PR, we can fix the `variant_get` TODO:

https://github.com/apache/arrow-rs/blob/3b6179658203dc1b1610b67c1777d5b8beb137fc/parquet-variant-compute/src/variant_get.rs#L89-L92
- When we know that a `Struct` `VariantArray` is not shredded, we can
reuse `shred_basic_variant`
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

- Added `StructVariantToArrowRowBuilder` builder.
- Moved `make_variant_to_arrow_row_builder` logic to
`make_typed_variant_to_arrow_row_builder` so it can be reused by a
`Struct` array's inner fields.
- Changed a `variant_get` test to show that it now handles unshredded
`Struct` `VariantArray`
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
- Yes, added `test_struct_row_builder_handles_unshredded_nested_structs`
- Everything else still works.
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

# Are there any user-facing changes?
No
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

---------

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
…apache#9439)

# Which issue does this PR close?

- Closes apache#9438.

# Rationale for this change

Speed up conversion by only importing `pyarrow` once.

# What changes are included in this PR?

- Use `PyOnceLock::import` to import the types.
- Remove some unnecessary `.extract::<PyBackedStr>()?` calls (the `Display`
implementation already does something similar)

# Are these changes tested?

Covered by existing tests. It would be nice to add a benchmark, but that
might require either:
- adding a dependency on a Python benchmark runner, or
- writing some hacky code to import `pyarrow` from criterion tests (likely
by running `pip`/`uv` from the Rust benchmark code)

# Are there any user-facing changes?

No
…e#9578)

## Summary

- When a `RowSelection` selects every row in a row group, `fetch_ranges`
now treats it as no selection, producing a single whole-column-chunk I/O
request instead of N individual page requests
- This reduces the number of I/O requests for subsequent filter
predicates when an earlier predicate passes all rows

## Details

In `InMemoryRowGroup::fetch_ranges`, when both a `RowSelection` and an
`OffsetIndex` are present, the code enters a page-level fetch path that
uses `scan_ranges()` to produce individual page ranges. Even when the
selection covers all rows, this produces N separate ranges (one per
page).

The fix: before entering the page-level path, check if the selection's
`row_count()` equals the row group's total row count. If so, drop the
selection and take the simpler whole-column-chunk path.

This commonly happens when a multi-predicate `RowFilter` has an early
predicate that passes all rows in a row group (e.g., `CounterID = 62` on
a row group where all rows have `CounterID = 62`).
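The pre-check described above can be sketched as below; `effective_selection` is a hypothetical helper, while the `RowSelector` fields mirror the shape of the parquet crate's selector (a run length plus a skip flag):

```rust
// A selection that covers every row in the row group degenerates to
// "no selection", so the caller takes the whole-column-chunk fetch path
// instead of issuing one I/O range per page.
#[derive(Clone)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

fn effective_selection(
    selection: Option<Vec<RowSelector>>,
    total_rows: usize,
) -> Option<Vec<RowSelector>> {
    selection.filter(|sel| {
        // selected rows only; skipped runs don't count
        sel.iter()
            .filter(|s| !s.skip)
            .map(|s| s.row_count)
            .sum::<usize>()
            != total_rows
    })
}

fn main() {
    let all = vec![RowSelector { row_count: 100, skip: false }];
    assert!(effective_selection(Some(all), 100).is_none()); // whole-chunk path

    let partial = vec![
        RowSelector { row_count: 40, skip: false },
        RowSelector { row_count: 60, skip: true },
    ];
    assert!(effective_selection(Some(partial), 100).is_some()); // page-level path
}
```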

## Test plan

- [x] Existing tests pass (snapshot updated to reflect fewer I/O
requests)
- [x] `test_read_multiple_row_filter` verifies the coalesced fetch
pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

# Which issue does this PR close?


- Closes apache#9378 

# Rationale for this change

The optimizations listed in the issue description:

- Align to 8 bytes
- Don't try to return a buffer with bit_offset 0 but round it to a
multiple of 64
- Use `chunks_exact` for the fallback path


# What changes are included in this PR?

When both inputs share the same sub-64-bit alignment (left_offset % 64
== right_offset % 64), the optimized path is used. This covers the
common cases (both offset 0, both sliced equally, etc.). The BitChunks
fallback is retained only when the two offsets have different sub-64-bit
alignment.

# Are these changes tested?

Yes, the tests are updated and included in the PR.

# Are there any user-facing changes?
  

Yes, this is a minor breaking change to `from_bitwise_binary_op`:

- The returned `BooleanBuffer` may now have a non-zero offset (previously
always 0)
- The returned `BooleanBuffer` may have padding bits set outside the
logical range in `values()`

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

None

# Rationale for this change

We want to use the SBBF Bloom Filter, but need to construct/serialize it
manually. Currently there is no way to create a new `Sbbf` outside of
this crate. Alongside this, we want to store the `Sbbf` in a
`FixedSizeBinary` column for some fancy indexing.

# What changes are included in this PR?

Some methods become public

# Are these changes tested?

N/A

# Are there any user-facing changes?

Yes, we add a few more public methods to the `Sbbf` struct
…9450)

# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a
single inserted or deleted row causes all subsequent pages to shift,
resulting in nearly every byte being re-uploaded to content-addressable
storage (CAS) systems. CDC determines page boundaries via a rolling
gearhash over column values, so unchanged data produces identical pages
across different writes, enabling storage cost reductions and faster
upload times.

See more details in https://huggingface.co/blog/parquet-cdc
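The rolling-gearhash idea can be sketched as below. This is a toy illustration, not the chunker this PR adds: the gear table here is filled by an arbitrary xorshift generator, and the boundary mask stands in for the real min/max/normalization logic. Because each byte's contribution is shifted left once per step, bytes older than 64 positions fall out of the `u64` hash, so boundaries depend only on a local window of content.

```rust
// Toy deterministic gear table (the real implementation uses a fixed
// table of random u64 values).
fn gear_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    for e in t.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        *e = state;
    }
    t
}

/// A boundary is declared after byte `i` when the rolling hash has all
/// `mask` bits zero; positions depend only on the last ~64 bytes.
fn chunk_boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut hash = 0u64;
    let mut out = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[b as usize]);
        if hash & mask == 0 {
            out.push(i + 1);
        }
    }
    out
}

fn main() {
    // deterministic pseudo-random input
    let mut s = 0x1234_5678_9ABC_DEF0u64;
    let data: Vec<u8> = (0..10_000)
        .map(|_| {
            s ^= s << 13;
            s ^= s >> 7;
            s ^= s << 17;
            (s >> 24) as u8
        })
        .collect();
    let base = chunk_boundaries(&data, 0xFF);

    // Insert one byte at the front: every boundary past the 64-byte
    // rolling window reappears shifted by exactly one position, which is
    // what makes unchanged data dedupe across writes.
    let mut edited = vec![0x42u8];
    edited.extend_from_slice(&data);
    let moved = chunk_boundaries(&edited, 0xFF);
    assert!(base
        .iter()
        .filter(|&&b| b > 64)
        .all(|&b| moved.contains(&(b + 1))));
}
```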

The original C++ implementation
apache/arrow#45360

Evaluation tool https://github.com/huggingface/dataset-dedupe-estimator
where I already integrated this PR to verify that deduplication
effectiveness is on par with parquet-cpp (lower is better):

<img width="984" height="411" alt="image"
src="https://github.com/user-attachments/assets/e6e80931-ac76-4bdd-bf9c-ba7e06559411"
/>


# What changes are included in this PR?

- **Content-defined chunker**  at `parquet/src/column/chunker/`
- **Arrow writer integration** integrated in `ArrowColumnWriter`
- **Writer properties** via `CdcOptions` struct (`min_chunk_size`,
`max_chunk_size`, `norm_level`)
- **ColumnDescriptor**: added `repeated_ancestor_def_level` field to
support iterating over nested field values

# Are these changes tested?

Yes — unit tests are located in `cdc.rs` and ported from the C++
implementation.

# Are there any user-facing changes?

New **experimental** API, disabled by default — no behavior change for
existing code:

```rust
// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions { min_chunk_size: 128 * 1024, max_chunk_size: 512 * 1024, norm_level: 1 })
    .build();
```

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
apache#9576)

# Which issue does this PR close?


- Closes apache#9526 

# Rationale for this change


`shred_variant` already supports Binary and LargeBinary types (apache#9525,
apache#9554), but `unshred_variant` does not handle these types. This means
shredded Binary/LargeBinary columns cannot be converted back to
unshredded `VariantArray`s.

# What changes are included in this PR?

Adds unshred_variant support for DataType::Binary and
DataType::LargeBinary in parquet-variant-compute/src/unshred_variant.rs:
  - New enum variants PrimitiveBinary and PrimitiveLargeBinary
  - Match arms in append_row and try_new_opt
  - AppendToVariantBuilder impls for BinaryArray and LargeBinaryArray



# Are these changes tested?

Yes

# Are there any user-facing changes?

No breaking changes

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
# Which issue does this PR close?

- part of apache#9108

# Rationale for this change

Prepare for next release

# What changes are included in this PR?

1. Update version to `58.1.0`
2. Add changelog. See rendered preview here:
https://github.com/alamb/arrow-rs/blob/alamb/prepare_58.1.0/CHANGELOG.md

# Are these changes tested?

By CI

# Are there any user-facing changes?

Yes
…apache#9590)

## Summary

- Reserve `output.views` capacity in
`ByteViewArrayDecoderDictionary::read` before the decode loop
- Reserve `output.offsets` capacity in
`ByteArrayDecoderDictionary::read` before the decode loop

This avoids per-chunk reallocation during `extend` calls inside the
dictionary decode loop.
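The pre-allocation pattern is the standard one: compute the total output size up front and reserve once, so the `extend` calls inside the loop never reallocate. A generic sketch (not the decoder's actual code):

```rust
fn main() {
    let chunks: Vec<Vec<u32>> = (0..100).map(|i| vec![i; 64]).collect();

    // Reserve the full output size before the decode-style loop.
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out: Vec<u32> = Vec::new();
    out.reserve(total);

    let cap = out.capacity();
    for c in &chunks {
        out.extend_from_slice(c); // no reallocation inside the loop
    }
    assert_eq!(out.len(), 6400);
    assert_eq!(out.capacity(), cap); // capacity untouched by the loop
}
```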

Closes apache#9587

## Test plan

- [ ] Existing tests pass (no functional change, only pre-allocation)
- [ ] Benchmark dictionary-encoded StringView/BinaryView/String reads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Rationale for this change

In some cases, it is desirable to print strings with surrounding
quotation marks. A typical example that we run into in
https://github.com/rerun-io/rerun is a `StructArray` that contains empty
strings:

Current formatting:

```text
{name: }
```

Added option in this PR:

```text
{name: ""}
```

# What changes are included in this PR?

This PR relies on `std::fmt::Debug` to do the actual formatting of
strings, which means that all escaping is handled out of the box.

# Are these changes tested?

This PR contains test for different types of inputs, including escape
sequences. Additionally, it also tests the `StructArray` example
outlined above.

# Are there any user-facing changes?

By default this option is false, making the feature opt-in.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?

Closes apache#9580

## Rationale

The current VLQ decoder calls `get_aligned` for each byte, which
involves repeated offset calculations and bounds checks in the hot loop.

## What changes are included in this PR?

Align to the byte boundary once, then iterate directly over the buffer
slice, avoiding per-byte overhead from `get_aligned`.
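For reference, VLQ/ULEB128 decoding over a plain slice looks like the sketch below; the function name and shape are illustrative, not the crate's actual decoder:

```rust
/// Decode one unsigned VLQ value from `buf`, iterating directly over the
/// slice rather than re-computing offsets and bounds per byte.
/// Returns the value and the number of bytes consumed.
fn decode_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate() {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1)); // continuation bit clear: done
        }
        if 7 * (i + 1) >= 64 {
            return None; // would overflow u64
        }
    }
    None // ran out of bytes mid-value
}

fn main() {
    // 624485 is the classic LEB128 example encoding
    assert_eq!(decode_vlq(&[0xE5, 0x8E, 0x26]), Some((624485, 3)));
    assert_eq!(decode_vlq(&[0x7F]), Some((127, 1)));
    assert_eq!(decode_vlq(&[0x80]), None); // truncated input
}
```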

## Are there any user-facing changes?

No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Rationale for this change

The `object_store` crate release 0.13.2 breaks the build of parquet
because it feature-gates the `buffered` module. I have filed
apache/arrow-rs-object-store#677 about the
breakage; meanwhile this fix is made in expectation that 0.13.2 will not
be yanked and the feature gate will remain.

# What changes are included in this PR?

Bump the version requirement to 0.13.2, requesting the "tokio" feature.

# Are these changes tested?

The build should succeed in CI workflows.

# Are there any user-facing changes?

No

Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@gmail.com>
Updates the requirements on [sha2](https://github.com/RustCrypto/hashes)
to permit the latest version.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/RustCrypto/hashes/commit/ffe093984c004769747e998f77da8ff7c0e7a765"><code>ffe0939</code></a>
Release sha2 0.11.0 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/806">#806</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/8991b65fe400c31c4cc189510f86ae642c470cd9"><code>8991b65</code></a>
Use the standard order of the <code>[package]</code> section fields (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/807">#807</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/3d2bc57db40fd6aeb25d6c6da98d67e2784c2985"><code>3d2bc57</code></a>
sha2: refactor backends (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/802">#802</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/faa55fb83697c8f3113636d88070e5f5edc8c335"><code>faa55fb</code></a>
sha3: bump <code>keccak</code> to v0.2 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/803">#803</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/d3e6489e56f8486d4a93ceb7a8abf4924af1de7b"><code>d3e6489</code></a>
sha3 v0.11.0-rc.9 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/801">#801</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/bbf6f51ff97f81ab15e6e5f6cf878bfbcb1f47c8"><code>bbf6f51</code></a>
sha2: tweak backend docs (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/800">#800</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/155dbbf2959dbec0ec75948a82590ddaede2d3bc"><code>155dbbf</code></a>
sha3: add default value for the <code>DS</code> generic parameter on
<code>TurboShake128/256</code>...</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/ed514f2b34526683b3b7c41670f1887982c3df64"><code>ed514f2</code></a>
Use published version of <code>keccak</code> v0.2 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/799">#799</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/702bcd83735a49c928c0fc24506924f5c0aa22af"><code>702bcd8</code></a>
Migrate to closure-based <code>keccak</code> (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/796">#796</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/827c043f82d57666a0b146d156e91c39535c1305"><code>827c043</code></a>
sha3 v0.11.0-rc.8 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/794">#794</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/RustCrypto/hashes/compare/groestl-v0.10.0...sha2-v0.11.0">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# Which issue does this PR close?


- Closes apache#9340.

# Rationale for this change


# What changes are included in this PR?


Support `ListView` codec in arrow-json. Using `ListLikeArray` trait to
simplify implementation.

# Are these changes tested?


Tests added

# Are there any user-facing changes?


New encoder/decoder
… verification (apache#9604)

# Which issue does this PR close?

- Closes apache#9603 

# Rationale for this change

The release and dev KEYS files could get out of sync.
We should use the release/ version:
- Users use the release/ version not dev/ version when they verify our
artifacts' signature
- https://dist.apache.org/ may reject our request when we request it many
times from CI

# What changes are included in this PR?

Use
`https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/KEYS`
to download the KEYS file and the expected
`https://dist.apache.org/repos/dist/dev/arrow` for the RC artifacts.

# Are these changes tested?

Yes, I've verified 58.1.0 RC 1 both before and after the change.

# Are there any user-facing changes?

No
…uct)` (apache#9597)

# Which issue does this PR close?


- Closes apache#9596.

# Rationale for this change

Check issue

# What changes are included in this PR?

Reuse `shred_basic_variant` as a fast path for unshredded `Struct`
handling in `variant_get(..., Struct)`

# Are these changes tested?

Yes, added two unit tests to establish safe mode behavior.

# Are there any user-facing changes?

## Summary

- Fix `MutableArrayData::extend_nulls` which previously panicked
unconditionally for both sparse and dense Union arrays
- For sparse unions: append the first type_id and extend nulls in all
children
- For dense unions: append the first type_id, compute offsets into the
first child, and extend nulls in that child only

## Background

This bug was discovered via DataFusion. `CaseExpr` uses
`MutableArrayData` via `scatter()` to build result arrays. When a `CASE`
expression returns a Union type (e.g., from `json_get` which returns a
JSON union) and there are rows where no `WHEN` branch matches (implicit
`ELSE NULL`), `scatter` calls `extend_nulls` which panics with "cannot
call extend_nulls on UnionArray as cannot infer type".

Any query like:
```sql
SELECT CASE WHEN condition THEN returns_union(col, 'key') END FROM table
```
would panic if `condition` is false for any row.

## Root Cause

The `extend_nulls` implementation for Union arrays unconditionally
panicked because it claimed it "cannot infer type". However, the Union's
field definitions (child types and type IDs) are available in the
`MutableArrayData`'s data type — there's enough information to produce
valid null entries by picking the first declared type_id.
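The dense-union case can be sketched as below. The builder struct and names are hypothetical simplifications of `MutableArrayData`'s internal state: appending `n` nulls repeats the first declared type_id and points each new offset at a freshly appended null slot in that child.

```rust
// Simplified model of a dense UnionArray under construction: parallel
// type_ids and offsets buffers, plus the running length of child 0.
struct DenseUnionBuilder {
    type_ids: Vec<i8>,
    offsets: Vec<i32>,
    first_child_len: i32,
    first_type_id: i8,
}

impl DenseUnionBuilder {
    fn extend_nulls(&mut self, n: usize) {
        for _ in 0..n {
            self.type_ids.push(self.first_type_id);
            self.offsets.push(self.first_child_len);
            self.first_child_len += 1; // a null value appended to child 0
        }
    }
}

fn main() {
    let mut b = DenseUnionBuilder {
        type_ids: vec![],
        offsets: vec![],
        first_child_len: 0,
        first_type_id: 3, // first declared type_id in the union's fields
    };
    b.extend_nulls(2);
    assert_eq!(b.type_ids, vec![3, 3]);
    assert_eq!(b.offsets, vec![0, 1]);
}
```

For sparse unions the offsets buffer is absent, so only the type_ids grow while every child is extended with nulls in lockstep.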

## Test plan

- [x] Added test for sparse union `extend_nulls`
- [x] Added test for dense union `extend_nulls`
- [x] Existing `test_union_dense` continues to pass
- [x] All `array_transform` tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close?


- Relates to apache#9497.

# Rationale for this change


# What changes are included in this PR?


As part of the effort to move the JSON reader away from `ArrayData`
toward typed `ArrayRef` APIs, it's necessary to change the
`ArrayDecoder::decode` interface to return `ArrayRef` directly and
update all decoder implementations (list, struct, map, run-end encoded)
to construct typed arrays without intermediate `ArrayData` round-trips.
New benchmarks for map and run-end encoded decoding are added to verify
there is no performance regression.

# Are these changes tested?

Yes

# Are there any user-facing changes?

No
# Which issue does this PR close?

- closes apache#9593

# Rationale for this change

In a previous PR (apache#9593), I changed instances of `truncate(0)` to
`clear()`. However, this broke the test `test_truncate_with_pool` at
`arrow-buffer/src/buffer/mutable.rs:1357`, due to an inconsistency
between the implementations of `truncate` and `clear`. This PR fixes that
test.

# What changes are included in this PR?

This PR copies a section of code related to the `pool` feature present
in `truncate` but absent in `clear`, fixing the failing unit test.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

No.
)

# Rationale for this change

CdcOptions only contains primitive fields (usize, usize, i32) so
deriving PartialEq and Eq is straightforward. This is needed by
downstream crates such as DataFusion that embed CdcOptions in their own
configuration structs and need to compare them.

# What changes are included in this PR?

Implemented PartialEq and Eq for CdcOptions.
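Since all three fields are primitives (as noted above), the change amounts to deriving the comparison traits. A standalone sketch with the fields named in the CDC writer PR (the real struct lives in the parquet crate):

```rust
// Deriving PartialEq/Eq lets downstream configuration structs that embed
// CdcOptions derive their own equality.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

fn main() {
    let a = CdcOptions {
        min_chunk_size: 256 * 1024,
        max_chunk_size: 1024 * 1024,
        norm_level: 0,
    };
    assert_eq!(a, a.clone());
    assert_ne!(a, CdcOptions { norm_level: 1, ..a.clone() });
}
```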

# Are these changes tested?

Added an equality test.

# Are there any user-facing changes?

No.
# Which issue does this PR close?


- Closes apache#8400.

# Rationale for this change

Check issue

# What changes are included in this PR?

- Added `AppendNullMode` enum supporting all semantics.
- Replaced the bool logic with the new enum
- Fixed test outputs for `ListArray` cases


# Are these changes tested?
- Added unit tests

# Are there any user-facing changes?

# Rationale for this change

Makes the code simpler and more readable by relying on new PyO3 and Rust
features. No behavior should have changed, apart from an error message if
`__arrow_c_array__` does not return a tuple.

# What changes are included in this PR?

- Use `.call_method0(M)?` instead of `.getattr(M)?.call0()`
- Use `.extract()`, which allows more advanced features like directly
extracting tuple elements
- Remove temporary variables just before returning
- Use `&raw const` and `&raw mut` pointers instead of casting and `addr_of!`
…pache#9637)

The CDC chunker's `value_offset` diverged from actual leaf array positions
when null list entries had non-empty child offset ranges (valid per the
Arrow columnar format spec). This caused `slice_for_chunk` to produce
incorrect `non_null_indices`, leading to an out-of-bounds panic in
`write_mini_batch`.

Track non-null value counts (`nni`) separately from leaf slot counts in
the chunker, and use them in `slice_for_chunk` to correctly index into
`non_null_indices` regardless of gaps in the leaf array.
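The separated bookkeeping can be sketched as follows; the struct and function names are hypothetical, not the chunker's actual API. Leaf slot counts and non-null value counts advance independently, so chunk slicing indexes non_null_indices by the non-null count rather than the slot count, which diverges whenever null list entries leave gaps:

```rust
// Progress through a leaf column: every slot counts toward `slots`, but
// only fully-defined slots (def level == max) carry a leaf value.
struct LeafProgress {
    slots: usize,
    non_null: usize,
}

fn advance(p: &mut LeafProgress, def_levels: &[i16], max_def_level: i16) {
    for &d in def_levels {
        p.slots += 1;
        if d == max_def_level {
            p.non_null += 1;
        }
    }
}

fn main() {
    let mut p = LeafProgress { slots: 0, non_null: 0 };
    // def level 2 = value present; lower levels = null somewhere up the
    // list nesting, so no entry exists in non_null_indices for that slot
    advance(&mut p, &[2, 2, 0, 1, 2], 2);
    assert_eq!(p.slots, 5);
    assert_eq!(p.non_null, 3);
}
```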