fix(parquet): fix CDC panic on nested ListArrays with null entries #2
# Rationale for this change Use `.cloned().map(Some)` instead of `.map(|b| Some(b.clone()))` for better readability and a more idiomatic Rust style.
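As a minimal sketch of the idiom (hypothetical data, not the actual call site), both forms produce the same result on an iterator of references:

```rust
fn main() {
    let values = vec![String::from("a"), String::from("b")];

    // before: clone inside an explicit closure
    let before: Vec<Option<String>> = values.iter().map(|b| Some(b.clone())).collect();

    // after: `.cloned()` clones each item, `.map(Some)` wraps it
    let after: Vec<Option<String>> = values.iter().cloned().map(Some).collect();

    assert_eq!(before, after);
}
```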
Fix missing link in Parquet README
…apache#9411) Implements a helper to replace the pattern of creating a `BooleanBuffer` from an unsliced validity bitmap and filtering by null count. Previously this was done with `BooleanBuffer::new(...)` plus `Some(NullBuffer::new(...)).filter(|n| n.null_count() > 0)`; now it is a single call to `NullBuffer::try_from_unsliced(buffer, len)`, which returns `Some(NullBuffer)` when there are nulls and `None` when all values are valid. - Added `try_from_unsliced` in `arrow-buffer/src/buffer/null.rs` with tests for nulls, all valid, all null, and empty - Refactored `FixedSizeBinaryArray::try_from_iter_with_size` and `try_from_sparse_iter_with_size` to use it - Refactored `take_nulls` in `arrow-select` to use it Closes apache#9385
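The semantics can be modeled in a standalone sketch (a plain `bool` slice standing in for the validity bitmap; this is not the actual arrow-buffer implementation):

```rust
/// Returns `Some` only when the validity mask actually contains nulls,
/// mirroring the `try_from_unsliced` contract described above.
fn try_from_unsliced(validity: &[bool]) -> Option<Vec<bool>> {
    let null_count = validity.iter().filter(|&&valid| !valid).count();
    (null_count > 0).then(|| validity.to_vec())
}

fn main() {
    assert!(try_from_unsliced(&[true, false, true]).is_some()); // has nulls
    assert!(try_from_unsliced(&[true, true]).is_none());        // all valid
    assert!(try_from_unsliced(&[]).is_none());                  // empty
}
```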
# Which issue does this PR close? - Closes apache#9455. # Rationale for this change See the linked issue. # What changes are included in this PR? Added list type support to `VariantArray` data type checking
…taType (apache#9473) This enables setting a timezone or precision & scale on parameterized DataType values. Note: I think the panic is unfortunate, and a try_with_data_type() would be sensible. # Which issue does this PR close? - Closes apache#8042. # Are these changes tested? Yes # Are there any user-facing changes? - Adds `PrimitiveRunBuilder::with_data_type`.
# Which issue does this PR close? - Closes apache#9336 # Rationale for this change When an Avro reader schema has a union type that needs to be resolved against the type in the writer schema, resolution information other than primitive type promotions is not properly handled when creating the decoder. For example, when the reader schema has a nullable record field that has an added nested field on top of the fields defined in the writer schema, the record type resolution needs to be applied, using a projection with the default field value. # What changes are included in this PR? Extend the union resolution information in the decoder with variant data for enum remapping and record projection. The `Projector` data structure with `Skipper` decoders encodes part of this information, which necessitated some refactoring. # Are these changes tested? TODO: - [x] Debug failing tests including a busy-loop failure mode. - [ ] Add more unit tests exercising the complex resolutions. # Are there any user-facing changes? No.
- part of apache#8466 Update release schedule based on historical reality
…ts (apache#9472) # Rationale for this change The motivation for this PR is to improve the testing infrastructure as a precursor to the following PR: - apache#9221 @Jefffrey seemed to be in favor of using `insta` for more tests: apache#9221 (comment) # What changes are included in this PR? This PR does not make logic changes, but is a straightforward translation of the current tests. More test cases, especially around escape sequences, can be added in follow-up PRs. # Are these changes tested? Yes; to review, we still need to manually confirm that no test cases changed accidentally. # Are there any user-facing changes? No.
…ime (apache#9491) Mark ArrowTimestampType::make_value as deprecated and migrate internal callers to the newer from_naive_datetime API. # Which issue does this PR close? - Closes apache#9490 . # Rationale for this change Follow-up from PR apache#9345. # What changes are included in this PR? Mark ArrowTimestampType::make_value as deprecated and migrate internal callers to the newer from_naive_datetime API. # Are these changes tested? YES. # Are there any user-facing changes? Migration Path: Users should replace: ```rust // Old TimestampSecondType::make_value(naive) ``` With: ```rust // New TimestampSecondType::from_naive_datetime(naive, None) ```
…pache#9498) # Which issue does this PR close? - Closes apache#9492 . # What changes are included in this PR? See title. # Are these changes tested? YES # Are there any user-facing changes? NO
# Which issue does this PR close? - Closes apache#9478 . # What changes are included in this PR? - Fix the typo - Enhance the bracket access for the variant path # Are these changes tested? - Add some tests to cover the logic # Are there any user-facing changes? No
… parquet unshredding (apache#9437) # Which issue does this PR close? - Part of apache#9340. # Rationale for this change Json writers for ListLike types (List/ListView/FixedSizeList) are pretty similar apart from the element range representation. We already had a good way to abstract this kind of encoder in parquet variant unshredding. Given this, it would be good to move this `ListLikeArray` trait to arrow-array to be shared with json/parquet # What changes are included in this PR? Move `ListLikeArray` trait from parquet-variant-compute to arrow-array # Are these changes tested? Covered by existing tests # Are there any user-facing changes? New pub trait in arrow-array
# Which issue does this PR close? - Closes apache#9425. # Rationale for this change I noticed that this method is available on `PrimitiveTypeBuilder`, but missing on `GenericByteBuilder`, which makes sense since the gain is smaller; but after benchmarking, it shows a solid 10%, mostly because of the more efficient allocation of the null mask. ``` ┌───────────────────┬────────────────┬───────────────────┬─────────┐ │ Benchmark │ append_value_n │ append_value loop │ Speedup │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100/len=5 │ 371 ns │ 408 ns │ 10% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100/len=30 │ 456 ns │ 507 ns │ 10% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100/len=1024 │ 1.81 µs │ 1.95 µs │ 8% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=1000/len=5 │ 2.39 µs │ 2.87 µs │ 17% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=1000/len=30 │ 3.41 µs │ 3.89 µs │ 12% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=1000/len=1024 │ 12.3 µs │ 14.4 µs │ 15% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=10000/len=5 │ 23.8 µs │ 29.3 µs │ 19% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=10000/len=30 │ 33.7 µs │ 39.0 µs │ 14% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=10000/len=1024 │ 115.9 µs │ 135.0 µs │ 14% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100000/len=5 │ 227.5 µs │ 278.6 µs │ 18% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100000/len=30 │ 328.1 µs │ 377.9 µs │ 13% │ ├───────────────────┼────────────────┼───────────────────┼─────────┤ │ n=100000/len=1024 │ 1.16 ms │ 1.34 ms │ 14% │ └───────────────────┴────────────────┴───────────────────┴─────────┘ ``` I think this is still worthwhile to add. Let me know what the community thinks!
# What changes are included in this PR? A new public API. # Are these changes tested? Yes! # Are there any user-facing changes? A new public API.
# Which issue does this PR close? Closes apache#9431 # Rationale for this change It would be nice to add `append_nulls` to MapBuilder, similar to `append_nulls` on `GenericListBuilder`. Appending the nulls at once, instead of using a loop, has some nice performance implications: ``` Benchmark results (1,000,000 nulls): ┌─────────────────────────┬─────────┐ │ Method │ Time │ ├─────────────────────────┼─────────┤ │ append(false) in a loop │ 2.36 ms │ ├─────────────────────────┼─────────┤ │ append_nulls(N) │ 50 µs │ └─────────────────────────┴─────────┘ ``` # What changes are included in this PR? A new public API. # Are these changes tested? With some fresh unit tests. # Are there any user-facing changes? A nice and convenient new public API
# Which issue does this PR close? - Closes apache#9429 I'm doing some performance optimization, and noticed that we have a loop adding one value to the null mask at a time. Instead, I'd suggest adding `append_non_nulls` to do this at once. ``` append_non_nulls(n) vs append(true) in a loop (with bitmap allocated) ┌───────────┬───────────────────┬─────────────────────┬─────────┐ │ n │ append(true) loop │ append_non_nulls(n) │ speedup │ ├───────────┼───────────────────┼─────────────────────┼─────────┤ │ 100 │ 251 ns │ 73 ns │ ~3x │ ├───────────┼───────────────────┼─────────────────────┼─────────┤ │ 1,000 │ 2.0 µs │ 94 ns │ ~21x │ ├───────────┼───────────────────┼─────────────────────┼─────────┤ │ 10,000 │ 19.3 µs │ 119 ns │ ~162x │ ├───────────┼───────────────────┼─────────────────────┼─────────┤ │ 100,000 │ 191 µs │ 348 ns │ ~549x │ ├───────────┼───────────────────┼─────────────────────┼─────────┤ │ 1,000,000 │ 1.90 ms │ 3.5 µs │ ~543x │ └───────────┴───────────────────┴─────────────────────┴─────────┘ ``` # Rationale for this change It adds a new public API for better performance. # What changes are included in this PR? A new public API # Are these changes tested? Yes, with new unit tests. # Are there any user-facing changes? Just a new convenient API.
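The mechanism behind the speedup can be sketched with a simplified word-based bitmap (hypothetical code, not the actual `BooleanBufferBuilder`): the bulk path writes whole 64-bit words instead of touching one bit per call.

```rust
struct BitmapBuilder {
    words: Vec<u64>,
    len: usize,
}

impl BitmapBuilder {
    fn new() -> Self {
        Self { words: Vec::new(), len: 0 }
    }

    /// Slow path: one bit per call.
    fn append(&mut self, set: bool) {
        if self.len % 64 == 0 {
            self.words.push(0);
        }
        if set {
            *self.words.last_mut().unwrap() |= 1 << (self.len % 64);
        }
        self.len += 1;
    }

    /// Bulk path: finish the partial word, then store whole words at once.
    fn append_n_set(&mut self, n: usize) {
        let mut remaining = n;
        while remaining > 0 && self.len % 64 != 0 {
            self.append(true);
            remaining -= 1;
        }
        while remaining >= 64 {
            self.words.push(u64::MAX); // 64 bits in a single store
            self.len += 64;
            remaining -= 64;
        }
        for _ in 0..remaining {
            self.append(true);
        }
    }

    fn count_set(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}

fn main() {
    let mut bulk = BitmapBuilder::new();
    bulk.append(false); // start off the word-aligned boundary
    bulk.append_n_set(130);
    assert_eq!(bulk.len, 131);
    assert_eq!(bulk.count_set(), 130);
}
```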
Updates the requirements on [strum_macros](https://github.com/Peternator7/strum) to permit the latest version. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum_macros's changelog</a>.</em></p> <blockquote> <h2>0.28.0</h2> <ul> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>: Allow any kind of passthrough attributes on <code>EnumDiscriminants</code>.</p> <ul> <li>Previously only list-style attributes (e.g. <code>#[strum_discriminants(derive(...))]</code>) were supported. Now path-only (e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and name/value (e.g. <code>#[strum_discriminants(doc = "foo")]</code>) attributes are also supported.</li> </ul> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>: Add missing <code>#[automatically_derived]</code> to generated impls not covered by <a href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>: Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and <code>windows-sys</code> dependencies. 
This is a breaking change if you're on an old version of rust.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>: Use absolute paths in generated proc macro code to avoid potential name conflicts.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>: Upgrade <code>phf</code> dependency to v0.13.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>: Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub Actions CI.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>: <code>strum::ParseError</code> now implements <code>core::fmt::Display</code> instead <code>std::fmt::Display</code> to make it <code>#[no_std]</code> compatible. Note the <code>Error</code> trait wasn't available in core until <code>1.81</code> so <code>strum::ParseError</code> still only implements that in std.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>: <strong>Breaking Change</strong> - <code>EnumString</code> now implements <code>From<&str></code> (infallible) instead of <code>TryFrom<&str></code> when the enum has a <code>#[strum(default)]</code> variant. This more accurately reflects that parsing cannot fail in that case. 
If you need the old <code>TryFrom</code> behavior, you can opt back in using <code>parse_error_ty</code> and <code>parse_error_fn</code>:</p> <pre lang="rust"><code>#[derive(EnumString)] #[strum(parse_error_ty = strum::ParseError, parse_error_fn = make_error)] pub enum Color { Red, #[strum(default)] Other(String), } <p>fn make_error(x: &str) -> strum::ParseError { strum::ParseError::VariantNotFound } </code></pre></p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>: Fix bug where <code>EnumString</code> ignored the <code>parse_err_ty</code> attribute when the enum had a <code>#[strum(default)]</code> variant.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>: EnumDiscriminants will now copy <code>default</code> over from the original enum to the Discriminant enum.</p> <pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)] #[strum_discriminants(derive(Default))] // <- Remove this in 0.28. enum MyEnum { #[default] // <- Will be the #[default] on the MyEnumDiscriminant #[strum_discriminants(default)] // <- Remove this in 0.28 Variant0, Variant1 { a: NonDefault }, } </code></pre> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a> Peternator7/0.28 (<a href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a> Display exists in core (<a href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a> Make TryFrom and FromStr infallible if there's a default (<a href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a> Honor parse_err_ty attribute when the enum has a default variant (<a href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a> Automatically add Default implementation to EnumDiscriminant if it exists on ...</li> <li><a href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a> Fix existing cargo fmt + clippy issues and add GH actions (<a href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a> feat: allow any kind of passthrough attributes on <code>EnumDiscriminants</code> (<a href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a> docs: Fix typo (<a 
href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a> Upgrade <code>phf</code> to v0.13 (<a href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a> Use absolute paths in proc macro (<a href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li> <li>Additional commits viewable in <a href="https://github.com/Peternator7/strum/compare/v0.27.0...v0.28.0">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] 
<49699333+dependabot[bot]@users.noreply.github.com>
) # Rationale for this change The inner loop in `Projector::project_record` gives the optimizer somewhat complicated dynamic data to branch through. The sparse arrays in `Projector` are redundantly coded: `None` in the index positions of `writer_to_reader` must match `Some` in `skip_decoders` and vice versa. # What changes are included in this PR? Refactor record projection state with a single array of directive-like enums corresponding to each writer schema field. # Are these changes tested? Added a benchmark for record projection (the benchmark code is partially shared with apache#9397). Somewhat counterintuitively for me, it does not show improvement on a more complex case with a mix of projected fields, but does improve the simpler one-field projection cases. Passes the existing tests.
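A hypothetical sketch of the refactor (names assumed, not the actual Avro decoder code): one directive enum per writer-schema field replaces the two redundantly-coded sparse arrays, so the projection loop matches on a single discriminant instead of checking two `Option`s for consistency.

```rust
/// One directive per writer-schema field.
enum FieldDirective {
    /// Decode the field into this reader-schema position.
    Project { reader_index: usize },
    /// Writer-only field: run its skip decoder and discard the bytes.
    Skip,
}

/// The inner loop of record projection becomes a single match.
fn project_record(directives: &[FieldDirective]) -> Vec<usize> {
    let mut projected = Vec::new();
    for directive in directives {
        match directive {
            FieldDirective::Project { reader_index } => projected.push(*reader_index),
            FieldDirective::Skip => { /* skip decoder would run here */ }
        }
    }
    projected
}

fn main() {
    let directives = [
        FieldDirective::Project { reader_index: 0 },
        FieldDirective::Skip,
        FieldDirective::Project { reader_index: 1 },
    ];
    assert_eq!(project_record(&directives), vec![0, 1]);
}
```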
1. Reduce some quantifiers from `*` to `?` where 2+ occurrences would generate invalid Rust code, e.g. `$(if $pred:expr)*` 2. Clean up 4-armed recursive macros: * put the base case first * explain the fixups * fix all at once, going directly to the base case, instead of possibly through multiple hoops The initial motivation was getting rust-analyzer to stop choking on usage of such macros where the left-hand side was a tuple and the right-hand side an expr.
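The quantifier point can be illustrated with a toy macro (hypothetical, not from this PR): with `$( ... )?` the optional guard matches at most once, whereas `*` would also accept two guards and expand to invalid Rust.

```rust
// At most one optional `if` guard, enforced by the `?` quantifier.
macro_rules! maybe_filter {
    ($v:expr $(, if $pred:expr)?) => {{
        let keep = true $(&& $pred)?;
        if keep { Some($v) } else { None }
    }};
}

fn main() {
    assert_eq!(maybe_filter!(5), Some(5));
    assert_eq!(maybe_filter!(5, if false), None);
    // maybe_filter!(5, if true, if false) would be rejected at compile time
}
```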
# Which issue does this PR close? - Part of apache#9476. # Rationale for this change Add benchmarks to show benefit of the optimizations in apache#9477 # What changes are included in this PR? Adds some benches for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY. The generated data is meant to show the benefit of special casing for miniblocks with a bitwidth of 0. # Are these changes tested? Just benches # Are there any user-facing changes? No
# Which issue does this PR close? - Closes apache#9506 # Rationale for this change `Vec::reserve(n)` does not guarantee exact capacity; Rust's `MIN_NON_ZERO_CAP` optimization means `reserve(2)` gives capacity = 4 for most numeric types, causing `debug_assert_eq!(capacity, batch_size)` to panic in debug mode when `batch_size < 4`. # What changes are included in this PR? Replace `reserve` with `reserve_exact` in `ensure_capacity` in both `InProgressPrimitiveArray` and `InProgressByteViewArray`. `reserve_exact` skips the amortized growth optimization and allocates exactly the requested capacity, making the assertion correct. # Are these changes tested? No. This only fixes an incorrect debug assertion. # Are there any user-facing changes? No
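The difference is easy to observe directly (a sketch; exact over-allocation is an implementation detail of the standard library, not a guarantee):

```rust
fn main() {
    let mut amortized: Vec<u32> = Vec::new();
    amortized.reserve(2);
    // `reserve` may round up for amortized growth (often to 4 here)
    assert!(amortized.capacity() >= 2);

    let mut exact: Vec<u32> = Vec::new();
    exact.reserve_exact(2);
    // `reserve_exact` requests exactly the needed capacity; the allocator
    // may still return more, but no amortized rounding is applied
    assert!(exact.capacity() >= 2);
}
```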
# Which issue does this PR close? - Closes apache#9513 # Rationale for this change `VariantArray::try_new` and `canonicalize_and_verify_data_type` both accept `LargeUtf8` as a valid shredded variant type. However, `unshred_variant` currently only handles `Utf8` for string `typed_value` columns. This means a `VariantArray` with a `LargeUtf8` `typed_value` column can be constructed successfully, but calling `unshred_variant` on it fails.
# Which issue does this PR close? - Closes apache#9512 # Rationale for this change You can build a Variant with a StringView type shredded out, but calling `unshred_variant` will fail with a `not yet implemented` error.
…page skip (apache#9374) # Which issue does this PR close? - Closes apache#9370. # Rationale for this change The bug occurs when using RowSelection with nested types (like List<String>) when: 1. A column has multiple pages in a row group 2. The selected rows span across page boundaries 3. The first page is entirely consumed during skip operations The issue was in `arrow-rs/parquet/src/column/reader.rs:287-382` (`skip_records` function). **Root cause:** When `skip_records` completed successfully after crossing page boundaries, the `has_partial` state in the `RepetitionLevelDecoder` could incorrectly remain true. This happened when: - The skip operation exhausted a page where has_record_delimiter was false - The skip found the remaining records on the next page by counting a delimiter at index 0 - When a subsequent read_records(1) was called, the stale has_partial=true state caused count_records to incorrectly interpret the first repetition level (0) at index 0 as ending a "phantom" partial record, returning (1 record, 0 levels, 0 values) instead of properly reading the actual record data. For a more descriptive explanation, look here: apache#9370 (comment) # What changes are included in this PR? Added code at the end of skip_records to reset the partial record state when all requested records have been successfully skipped. This ensures that after skip_records completes, we're at a clean record boundary with no lingering partial record state, fixing the array length mismatch in StructArrayReader. # Are these changes tested? Commit apache@365bd9a introduces a test showcasing this issue with v2 data pages only on a unit-test level. PR apache#9399 could be used to showcase the issue in an end-to-end way. A previous, incorrect hypothesis was that the bug had to do with mixing v1 and v2 data pages: ``` In b52e043 I added a test that I validated to fail whenever I remove my fix. Bug Mechanism The bug requires three ingredients: 1. 
Page 1 (DataPage v1): Contains a nested column (with rep levels). During skip_records, all levels on this page are consumed. count_records sees no following rep=0 delimiter, so it sets has_partial=true. Since has_record_delimiter is false (the default InMemoryPageReader returns false when more pages exist), flush_partial is not called. 2. Page 2 (DataPage v2): Has num_rows available in its metadata. When num_rows <= remaining_records, the entire page is skipped via skip_next_page() — this does not touch the rep level decoder at all, so has_partial remains stale true from page 1. 3. Page 3 (DataPage v1): When read_records loads this page, the stale has_partial=true causes the rep=0 at position 0 to be misinterpreted as completing a "phantom" partial record. This produces (1 record, 0 levels, 0 values) instead of reading the actual record data. Test Verification - With fix (flush_partial at end of skip_records): read_records(1) correctly returns (1, 2, 2) with values [70, 80] - Without fix: read_records(1) returns (1, 0, 0) — a phantom record with no data, which is what causes the "Not all children array length are the same!" error when different sibling columns in a struct produce different record counts ``` --------- Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…e#9532) # Which issue does this PR close? - Closes #NNN. # Rationale for this change I would like to use `ParquetMetaDataPushDecoder` in arrow-datafusion, but the `with_file_decryption_properties` function is pub(crate), so I can't fully implement the encryption feature. # What changes are included in this PR? Make it pub # Are these changes tested? Not needed # Are there any user-facing changes? Now pub
…nion (apache#9524) # Which issue does this PR close? - Closes apache#9523 # Rationale for this change `Field::try_merge` correctly handles `DataType::Null` for primitive types and when `self` is `Null`, but fails when `self` is a compound type (Struct, List, LargeList, Union) and `from` is `Null`. This causes `Schema::try_merge` to error when merging schemas where one has a `Null` field and another has a concrete compound type for the same field. This is common in JSON inference, where some files have null values for fields that are structs/lists in other files. # What changes are included in this PR? Add `DataType::Null` arms to the Struct, List, LargeList, and Union branches in `Field::try_merge`, consistent with how primitive types already handle it. # Are these changes tested? - Added test `test_merge_compound_with_null` covering Struct, List, LargeList, and Union merging with Null in both directions. - Existing tests continue to pass. # Are there any user-facing changes? No
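A simplified model of the merge rule (hypothetical `DType`, not the actual arrow-schema code): `Null` merged with a concrete compound type yields the concrete type in either direction.

```rust
#[derive(Clone, Debug, PartialEq)]
enum DType {
    Null,
    Int64,
    List(Box<DType>),
}

fn try_merge(a: &DType, b: &DType) -> Result<DType, String> {
    match (a, b) {
        // the new arms: Null is absorbed by any concrete type
        (DType::Null, other) | (other, DType::Null) => Ok(other.clone()),
        (x, y) if x == y => Ok(x.clone()),
        _ => Err(format!("incompatible types: {a:?} vs {b:?}")),
    }
}

fn main() {
    let list = DType::List(Box::new(DType::Int64));
    assert_eq!(try_merge(&list, &DType::Null), Ok(list.clone()));
    assert_eq!(try_merge(&DType::Null, &list), Ok(list.clone()));
    assert!(try_merge(&list, &DType::Int64).is_err());
}
```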
# Which issue does this PR close? - Relates to apache#9497. # Rationale for this change Add benchmark for `ListArray` in `json_reader` to support the performance evaluation of apache#9497 # What changes are included in this PR? - Benchmarks for decoding and serializing JSON lists to `ListArray`. - Benchmarks for the `ListArray` and `FixedSizeListArray` json writer # Are these changes tested? Benchmarks only # Are there any user-facing changes? No
…apache#8918) # Which issue does this PR close? Part of apache#8137. Follow up of apache#7303. Replaces apache#8040. # Rationale for this change apache#7303 implements the fundamental symbols for tracking memory. This patch exposes those APIs at the higher-level `Array` and `ArrayData`. # What changes are included in this PR? New `claim` API for NullBuffer, ArrayData, and Array. New `pool` feature-flag for arrow, arrow-array, and arrow-data. # Are these changes tested? Added a doctest on the `Array::claim` method. # Are there any user-facing changes? Added API and a new feature-flag for arrow, arrow-array, and arrow-data.
…dicates (apache#9509) # Which issue does this PR close? Raised an issue at apache#9516 for this one Same issue as apache#9239 but extended to another scenario # Rationale for this change When there are multiple predicates being evaluated, we need to reset the row selection policy before overriding the strategy. Scenario: - Dense initial RowSelection (alternating select/skip) covers all pages → Auto resolves to Mask - Predicate 1 evaluates on column A, narrows selection to skip middle pages - Predicate 2's column B is fetched sparsely with the narrowed selection (missing middle pages) - Without the fix, the override for predicate 2 returns early (policy=Mask, not Auto), so Mask is used and tries to read missing pages → "Invalid offset" error # What changes are included in this PR? This is a one line change to reset the selection policy in the `RowGroupDecoderState::WaitingOnFilterData` arm # Are these changes tested? Yes a new test added that fails currently on `main`, but as you can see it's a doozy to set up. # Are there any user-facing changes? Nope
…e#9481) # Which issue does this PR close? - Closes apache#9451 - Closes apache#6256 # Rationale for this change A reader might be annoyed (performance wart) if a parquet footer lacks nullcount stats, but inferring nullcount=0 for missing stats makes the stats untrustworthy and can lead to incorrect behavior. # What changes are included in this PR? If a parquet footer nullcount stat is missing, surface it as None, reserving `Some(0)` for known-no-null cases. # Are these changes tested? Fixed one unit test that broke, added a missing unit test that covers the other change site. # Are there any user-facing changes? The stats API doesn't change signature, but there is a behavior change. The existing doc that called out the incorrect behavior has been removed to reflect that the incorrect behavior no longer occurs.
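The behavior change can be summarized with a hedged sketch of how a consumer should treat the statistic (not the actual parquet API; the real change surfaces `None`, and this sketch shows why a consumer must not treat `None` as zero):

```rust
/// Decide whether an `IS NULL` predicate can prune a row group.
fn can_prune_is_null(null_count: Option<u64>) -> bool {
    // Only prune when the footer *proves* there are no nulls;
    // `None` now means "statistic missing", never "zero nulls".
    null_count == Some(0)
}

fn main() {
    assert!(can_prune_is_null(Some(0)));  // known no-null: safe to prune
    assert!(!can_prune_is_null(None));    // unknown: must not prune
    assert!(!can_prune_is_null(Some(3))); // has nulls: must not prune
}
```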
# Rationale for this change Exposes the Arrow schema produced by the async Avro file reader, similarly to the `schema` method on the synchronous reader. This allows an application to prepare casting or other schema transformations without needing to fetch the first record batch to learn the produced Arrow schema. Since the async reader only parses OCF content for the moment, the schema does not change from batch to batch. # What changes are included in this PR? The `schema` method for `AsyncAvroFileReader` exposes the Arrow schema of record batches that are produced by the reader. # Are these changes tested? Added tests verifying that the returned schema matches the expected one. # Are there any user-facing changes? Added a `schema` method to `AsyncAvroFileReader`.
# Which issue does this PR close? - Closes apache#9543. # Rationale for this change See issue, but it is possible to construct arguments to `arrow_buffer::bit_util::bit_mask::set_bits` that overflow the bounds checking protecting unsafe code. # What changes are included in this PR? Use `checked_add` when doing the bounds checking and panic when an overflow occurs. # Are these changes tested? Yes # Are there any user-facing changes? No
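The shape of the fix can be sketched as follows (hypothetical helper, not the actual `set_bits` internals; the PR panics on overflow, while this sketch simply reports out-of-bounds): replacing a plain `+` with `checked_add` turns an overflow that could defeat the bounds check into an explicit failure.

```rust
/// Overflow-safe bounds check: `offset + len` must fit in the buffer.
fn in_bounds(offset: usize, len: usize, total_bits: usize) -> bool {
    match offset.checked_add(len) {
        Some(end) => end <= total_bits,
        // `offset + len` overflowed usize: definitely out of bounds,
        // where a wrapping `+` could have produced a small "valid" value
        None => false,
    }
}

fn main() {
    assert!(in_bounds(2, 3, 8));
    assert!(!in_bounds(4, 5, 8));
    assert!(!in_bounds(usize::MAX, 2, 8)); // would wrap with plain `+`
}
```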
# Which issue does this PR close? - Closes apache#9529. # Rationale for this change - In a follow up PR, can fix the `variant_get` TODO: https://github.com/apache/arrow-rs/blob/3b6179658203dc1b1610b67c1777d5b8beb137fc/parquet-variant-compute/src/variant_get.rs#L89-L92 - When we know that a Struct `VariantArray` is not shredded, we can reuse `shred_basic_variant` # What changes are included in this PR? - Added `StructVariantToArrowRowBuilder` builder. - Moved `make_variant_to_arrow_row_builder` logic to `make_typed_variant_to_arrow_row_builder` to reuse it for `Struct` arrays' inner fields. - Changed a `variant_get` test to show that it now handles unshredded `Struct` `VariantArray` # Are these changes tested? - Yes, added `test_struct_row_builder_handles_unshredded_nested_structs` - Everything else still works. # Are there any user-facing changes? No
--------- Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
…apache#9439) # Which issue does this PR close? - Closes apache#9438. # Rationale for this change Speed up conversion by only importing `pyarrow` once. # What changes are included in this PR? - Use `PyOnceLock::import` to import the types. - Remove some unnecessary `.extract::<PyBackedStr>()?` calls (the `Display` implementation already does something similar) # Are these changes tested? Covered by existing tests. It would be nice to add a benchmark, but that might require either: - adding a dependency on a Python benchmark runner, or - writing some hacky code to import `pyarrow` from criterion tests (likely by running `pip`/`uv` from the Rust benchmark code) # Are there any user-facing changes? No
…e#9578) ## Summary - When a `RowSelection` selects every row in a row group, `fetch_ranges` now treats it as no selection, producing a single whole-column-chunk I/O request instead of N individual page requests - This reduces the number of I/O requests for subsequent filter predicates when an earlier predicate passes all rows ## Details In `InMemoryRowGroup::fetch_ranges`, when both a `RowSelection` and an `OffsetIndex` are present, the code enters a page-level fetch path that uses `scan_ranges()` to produce individual page ranges. Even when the selection covers all rows, this produces N separate ranges (one per page). The fix: before entering the page-level path, check if the selection's `row_count()` equals the row group's total row count. If so, drop the selection and take the simpler whole-column-chunk path. This commonly happens when a multi-predicate `RowFilter` has an early predicate that passes all rows in a row group (e.g., `CounterID = 62` on a row group where all rows have `CounterID = 62`). ## Test plan - [x] Existing tests pass (snapshot updated to reflect fewer I/O requests) - [x] `test_read_multiple_row_filter` verifies the coalesced fetch pattern 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
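The short-circuit described above can be sketched as follows; the `RowSelection` stand-in is a hypothetical reduction to the one method used, not parquet's real type:

```rust
// Stand-in for parquet's RowSelection, reduced to the one method used here.
struct RowSelection {
    selected_rows: usize,
}

impl RowSelection {
    fn row_count(&self) -> usize {
        self.selected_rows
    }
}

/// If the selection covers every row in the row group, drop it so the
/// caller takes the whole-column-chunk fetch path (one I/O request)
/// instead of issuing one request per page.
fn effective_selection(
    selection: Option<RowSelection>,
    row_group_rows: usize,
) -> Option<RowSelection> {
    match selection {
        Some(s) if s.row_count() == row_group_rows => None,
        other => other,
    }
}

fn main() {
    assert!(effective_selection(Some(RowSelection { selected_rows: 100 }), 100).is_none());
    assert!(effective_selection(Some(RowSelection { selected_rows: 60 }), 100).is_some());
    assert!(effective_selection(None, 100).is_none());
}
```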
) # Which issue does this PR close? - Closes apache#9378 # Rationale for this change The optimizations listed in the issue description: - Align to 8 bytes - Don't try to return a buffer with bit_offset 0; instead round it to a multiple of 64 - Use `chunks_exact` for the fallback path # What changes are included in this PR? When both inputs share the same sub-64-bit alignment (`left_offset % 64 == right_offset % 64`), the optimized path is used. This covers the common cases (both offset 0, both sliced equally, etc.). The `BitChunks` fallback is retained only when the two offsets have different sub-64-bit alignment. # Are these changes tested? Yes, the tests are updated and included. # Are there any user-facing changes? Yes, this is a minor breaking change to `from_bitwise_binary_op`: - The returned `BooleanBuffer` may now have a non-zero offset (previously always 0) - The returned `BooleanBuffer` may have padding bits set outside the logical range in `values()` --------- Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
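The alignment decision above can be illustrated with a small sketch; the function names are made up for illustration and do not appear in arrow-buffer:

```rust
/// The optimized path requires both bit offsets to occupy the same
/// position within a 64-bit word.
fn same_sub_word_alignment(left_offset: usize, right_offset: usize) -> bool {
    left_offset % 64 == right_offset % 64
}

/// Round a bit offset down to the containing 64-bit word boundary,
/// returning the word-aligned offset and the residual bits.
fn round_to_word(bit_offset: usize) -> (usize, usize) {
    (bit_offset / 64 * 64, bit_offset % 64)
}

fn main() {
    assert!(same_sub_word_alignment(0, 64)); // both start at a word boundary
    assert!(same_sub_word_alignment(3, 67)); // both 3 bits into a word
    assert!(!same_sub_word_alignment(3, 5)); // different residuals: BitChunks fallback
    assert_eq!(round_to_word(70), (64, 6));
}
```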
# Which issue does this PR close? None # Rationale for this change We want to use the SBBF Bloom filter, but need to construct/serialize it manually. Currently there is no way to create a new `Sbbf` outside of this crate. Alongside this, we want to store the `Sbbf` in a `FixedSizeBinary` column for some fancy indexing. # What changes are included in this PR? Some methods become public # Are these changes tested? N/A # Are there any user-facing changes? Yes, we add a few more public methods to the `Sbbf` struct
…9450) # Which issue does this PR close? - Closes #NNN. # Rationale for this change Rust implementation of apache/arrow#45360 Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift — resulting in nearly every byte being re-uploaded to content-addressable storage (CAS) systems. CDC determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster upload times. See more details in https://huggingface.co/blog/parquet-cdc The original C++ implementation: apache/arrow#45360 Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator where I already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better): <img width="984" height="411" alt="image" src="https://github.com/user-attachments/assets/e6e80931-ac76-4bdd-bf9c-ba7e06559411" /> # What changes are included in this PR? - **Content-defined chunker** at `parquet/src/column/chunker/` - **Arrow writer integration** in `ArrowColumnWriter` - **Writer properties** via `CdcOptions` struct (`min_chunk_size`, `max_chunk_size`, `norm_level`) - **ColumnDescriptor**: added `repeated_ancestor_def_level` field for nested field value iteration # Are these changes tested? Yes — unit tests are located in `cdc.rs` and ported from the C++ implementation. # Are there any user-facing changes? New **experimental** API, disabled by default — no behavior change for existing code: ```rust // Simple toggle (256 KiB min, 1 MiB max, norm_level 0) let props = WriterProperties::builder() .set_content_defined_chunking(true) .build(); // Explicit CDC parameters let props = WriterProperties::builder() .set_cdc_options(CdcOptions { min_chunk_size: 128 * 1024, max_chunk_size: 512 * 1024, norm_level: 1 }) .build(); ``` --------- Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
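A toy gear-style chunker illustrates the rolling-hash boundary rule described above. The table, mask, and function names are illustrative only and do not match the parquet crate's actual CDC parameters:

```rust
/// Build a deterministic pseudo-random gear table via a simple LCG.
/// Real CDC implementations typically ship a fixed precomputed table.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for slot in table.iter_mut() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        *slot = state;
    }
    table
}

/// Gear rolling hash: shift left and add a table entry per byte; a cut
/// is placed wherever the low bits selected by `mask` are all zero.
/// Identical input bytes always yield identical boundaries, which is
/// what makes unchanged data produce identical pages across writes.
fn chunk_boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut hash: u64 = 0;
    let mut boundaries = Vec::new();
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        if hash & mask == 0 {
            boundaries.push(i + 1); // cut after this byte
            hash = 0;
        }
    }
    boundaries
}

fn main() {
    let data = b"the quick brown fox jumps over the lazy dog";
    // Determinism: the same bytes always chunk the same way.
    assert_eq!(chunk_boundaries(data, 0xF), chunk_boundaries(data, 0xF));
    // A mask of 0 matches after every byte.
    assert_eq!(chunk_boundaries(b"abc", 0), vec![1, 2, 3]);
}
```

A real chunker additionally enforces `min_chunk_size`/`max_chunk_size` and a normalization level; this sketch shows only the boundary rule.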
apache#9576) # Which issue does this PR close? - Closes apache#9526 # Rationale for this change `shred_variant` already supports Binary and LargeBinary types (apache#9525, apache#9554), but `unshred_variant` does not handle these types. This means shredded Binary/LargeBinary columns cannot be converted back to unshredded VariantArrays. # What changes are included in this PR? Adds `unshred_variant` support for DataType::Binary and DataType::LargeBinary in parquet-variant-compute/src/unshred_variant.rs: - New enum variants PrimitiveBinary and PrimitiveLargeBinary - Match arms in append_row and try_new_opt - AppendToVariantBuilder impls for BinaryArray and LargeBinaryArray # Are these changes tested? Yes # Are there any user-facing changes? No breaking changes --------- Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
# Which issue does this PR close? - part of apache#9108 # Rationale for this change Prepare for next release # What changes are included in this PR? 1. Update version to `58.1.0` 2. Add changelog. See rendered preview here: https://github.com/alamb/arrow-rs/blob/alamb/prepare_58.1.0/CHANGELOG.md # Are these changes tested? By CI # Are there any user-facing changes? Yes
…apache#9590) ## Summary - Reserve `output.views` capacity in `ByteViewArrayDecoderDictionary::read` before the decode loop - Reserve `output.offsets` capacity in `ByteArrayDecoderDictionary::read` before the decode loop This avoids per-chunk reallocation during `extend` calls inside the dictionary decode loop. Closes apache#9587 ## Test plan - [ ] Existing tests pass (no functional change, only pre-allocation) - [ ] Benchmark dictionary-encoded StringView/BinaryView/String reads 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
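The effect of reserving capacity up front can be shown with plain `Vec`s; this is a generic illustration, not the decoder's actual code:

```rust
/// Concatenate many small chunks. Reserving the total length once up
/// front means the `extend_from_slice` calls below never trigger a
/// reallocation, unlike growing the vector chunk by chunk.
fn concat_chunks(chunks: &[&[u32]]) -> Vec<u32> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total); // single allocation
    for chunk in chunks {
        out.extend_from_slice(chunk); // no growth needed
    }
    debug_assert!(out.capacity() >= total);
    out
}

fn main() {
    let chunks: Vec<&[u32]> = vec![&[1, 2], &[3], &[4, 5, 6]];
    assert_eq!(concat_chunks(&chunks), vec![1, 2, 3, 4, 5, 6]);
}
```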
# Rationale for this change In some cases, it is desirable to print strings with surrounding quotation marks. A typical example that we run into in https://github.com/rerun-io/rerun is a `StructArray` that contains empty strings: Current formatting: ```text {name: } ``` Added option in this PR: ```text {name: ""} ``` # What changes are included in this PR? This PR relies on `std::fmt::Debug` to do the actual formatting of strings, which means that all escaping is handled out of the box. # Are these changes tested? This PR contains tests for different types of inputs, including escape sequences. Additionally, it also tests the `StructArray` example outlined above. # Are there any user-facing changes? By default this option is false, making the feature opt-in. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close? Closes apache#9580 ## Rationale The current VLQ decoder calls `get_aligned` for each byte, which involves repeated offset calculations and bounds checks in the hot loop. ## What changes are included in this PR? Align to the byte boundary once, then iterate directly over the buffer slice, avoiding per-byte overhead from `get_aligned`. ## Are there any user-facing changes? No. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
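The slice-based approach can be sketched with a minimal LEB128/VLQ decoder; the function name and signature below are illustrative, not the parquet crate's API:

```rust
/// Decode one unsigned VLQ (LEB128) integer from the front of `buf`,
/// iterating the slice directly instead of calling a per-byte accessor.
/// Returns the decoded value and the number of bytes consumed, or None
/// if the input is truncated or malformed.
fn decode_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in buf.iter().enumerate() {
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1)); // high bit clear: last byte
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed: too many continuation bytes
        }
    }
    None // ran out of input mid-varint
}

fn main() {
    assert_eq!(decode_vlq(&[0x96, 0x01]), Some((150, 2))); // two-byte varint
    assert_eq!(decode_vlq(&[0x7F]), Some((127, 1))); // single byte
    assert_eq!(decode_vlq(&[0x80]), None); // truncated input
}
```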
# Rationale for this change The `object_store` crate release 0.13.2 breaks the build of parquet because it feature-gates the `buffered` module. I have filed apache/arrow-rs-object-store#677 about the breakage; meanwhile, this fix is made in the expectation that 0.13.2 will not be yanked and the feature gate will remain. # What changes are included in this PR? Bump the version to 0.13.2 and request the "tokio" feature. # Are these changes tested? The build should succeed in CI workflows. # Are there any user-facing changes? No Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@gmail.com>
Updates the requirements on [sha2](https://github.com/RustCrypto/hashes) to permit the latest version. <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/RustCrypto/hashes/commit/ffe093984c004769747e998f77da8ff7c0e7a765"><code>ffe0939</code></a> Release sha2 0.11.0 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/806">#806</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/8991b65fe400c31c4cc189510f86ae642c470cd9"><code>8991b65</code></a> Use the standard order of the <code>[package]</code> section fields (<a href="https://redirect.github.com/RustCrypto/hashes/issues/807">#807</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/3d2bc57db40fd6aeb25d6c6da98d67e2784c2985"><code>3d2bc57</code></a> sha2: refactor backends (<a href="https://redirect.github.com/RustCrypto/hashes/issues/802">#802</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/faa55fb83697c8f3113636d88070e5f5edc8c335"><code>faa55fb</code></a> sha3: bump <code>keccak</code> to v0.2 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/803">#803</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/d3e6489e56f8486d4a93ceb7a8abf4924af1de7b"><code>d3e6489</code></a> sha3 v0.11.0-rc.9 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/801">#801</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/bbf6f51ff97f81ab15e6e5f6cf878bfbcb1f47c8"><code>bbf6f51</code></a> sha2: tweak backend docs (<a href="https://redirect.github.com/RustCrypto/hashes/issues/800">#800</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/155dbbf2959dbec0ec75948a82590ddaede2d3bc"><code>155dbbf</code></a> sha3: add default value for the <code>DS</code> generic parameter on <code>TurboShake128/256</code>...</li> <li><a href="https://github.com/RustCrypto/hashes/commit/ed514f2b34526683b3b7c41670f1887982c3df64"><code>ed514f2</code></a> Use published version of <code>keccak</code> v0.2 (<a 
href="https://redirect.github.com/RustCrypto/hashes/issues/799">#799</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/702bcd83735a49c928c0fc24506924f5c0aa22af"><code>702bcd8</code></a> Migrate to closure-based <code>keccak</code> (<a href="https://redirect.github.com/RustCrypto/hashes/issues/796">#796</a>)</li> <li><a href="https://github.com/RustCrypto/hashes/commit/827c043f82d57666a0b146d156e91c39535c1305"><code>827c043</code></a> sha3 v0.11.0-rc.8 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/794">#794</a>)</li> <li>Additional commits viewable in <a href="https://github.com/RustCrypto/hashes/compare/groestl-v0.10.0...sha2-v0.11.0">compare view</a></li> </ul> </details> <br /> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# Which issue does this PR close? - Closes apache#9340. # Rationale for this change # What changes are included in this PR? Support `ListView` codec in arrow-json. Using `ListLikeArray` trait to simplify implementation. # Are these changes tested? Tests added # Are there any user-facing changes? New encoder/decoder
… verification (apache#9604) # Which issue does this PR close? - Closes apache#9603 # Rationale for this change The release and dev KEYS files could get out of sync. We should use the release/ version: - Users use the release/ version, not the dev/ version, when they verify our artifacts' signatures - https://dist.apache.org/ may reject our requests when CI makes many of them # What changes are included in this PR? Use `https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/KEYS` to download the KEYS file and the expected `https://dist.apache.org/repos/dist/dev/arrow` for the RC artifacts. # Are these changes tested? Yes, I've verified 58.1.0 both before and after the change. # Are there any user-facing changes? No
…uct)` (apache#9597) # Which issue does this PR close? - Closes apache#9596. # Rationale for this change See the issue. # What changes are included in this PR? Reuse `shred_basic_variant` as a fast path for unshredded `Struct` handling in `variant_get(..., Struct)` # Are these changes tested? Yes, added two unit tests to establish safe mode behavior. # Are there any user-facing changes?
## Summary - Fix `MutableArrayData::extend_nulls` which previously panicked unconditionally for both sparse and dense Union arrays - For sparse unions: append the first type_id and extend nulls in all children - For dense unions: append the first type_id, compute offsets into the first child, and extend nulls in that child only ## Background This bug was discovered via DataFusion. `CaseExpr` uses `MutableArrayData` via `scatter()` to build result arrays. When a `CASE` expression returns a Union type (e.g., from `json_get` which returns a JSON union) and there are rows where no `WHEN` branch matches (implicit `ELSE NULL`), `scatter` calls `extend_nulls` which panics with "cannot call extend_nulls on UnionArray as cannot infer type". Any query like: ```sql SELECT CASE WHEN condition THEN returns_union(col, 'key') END FROM table ``` would panic if `condition` is false for any row. ## Root Cause The `extend_nulls` implementation for Union arrays unconditionally panicked because it claimed it "cannot infer type". However, the Union's field definitions (child types and type IDs) are available in the `MutableArrayData`'s data type — there's enough information to produce valid null entries by picking the first declared type_id. ## Test plan - [x] Added test for sparse union `extend_nulls` - [x] Added test for dense union `extend_nulls` - [x] Existing `test_union_dense` continues to pass - [x] All `array_transform` tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close? - Relates to apache#9497. # Rationale for this change # What changes are included in this PR? As part of the effort to move the JSON reader away from `ArrayData` toward typed `ArrayRef` APIs, this changes the `ArrayDecoder::decode` interface to return `ArrayRef` directly and updates all decoder implementations (list, struct, map, run-end encoded) to construct typed arrays without intermediate `ArrayData` round-trips. New benchmarks for map and run-end encoded decoding are added to verify there is no performance regression. # Are these changes tested? Yes # Are there any user-facing changes? No
# Which issue does this PR close? - closes apache#9593 # Rationale for this change In a previous PR (apache#9593), I changed instances of `truncate(0)` to `clear()`. However, this broke the test `test_truncate_with_pool` at `arrow-buffer/src/buffer/mutable.rs:1357`, due to an inconsistency between the implementations of `truncate` and `clear`. This PR fixes that test. # What changes are included in this PR? This PR copies a section of code related to the `pool` feature present in `truncate` but absent in `clear`, fixing the failing unit test. # Are these changes tested? Yes. # Are there any user-facing changes? No.
) # Rationale for this change CdcOptions only contains primitive fields (usize, usize, i32) so deriving PartialEq and Eq is straightforward. This is needed by downstream crates such as DataFusion that embed CdcOptions in their own configuration structs and need to compare them. # What changes are included in this PR? Implemented PartialEq and Eq for CdcOptions. # Are these changes tested? Added an equality test. # Are there any user-facing changes? No.
# Which issue does this PR close? - Closes apache#8400. # Rationale for this change See the issue. # What changes are included in this PR? - Added an `AppendNullMode` enum supporting all semantics. - Replaced the bool logic with the new enum - Fixed test outputs for List Array cases # Are these changes tested? - Added unit tests # Are there any user-facing changes?
# Rationale for this change Makes the code simpler and more readable by relying on new PyO3 and Rust features. No behavior should have changed outside of an error message if `__arrow_c_array__` does not return a tuple # What changes are included in this PR? - use `.call_method0(M)?` instead of `.getattr(M)?.call0()` - use `.extract()` that allows more advanced features like directly extracting tuple elements - remove temporary variables just before returning - use `&raw const` and `&raw mut` pointers instead of casting and `addr_of!`
…pache#9637) The CDC chunker's value_offset diverged from actual leaf array positions when null list entries had non-empty child offset ranges (valid per the Arrow columnar format spec). This caused slice_for_chunk to produce incorrect non_null_indices, leading to an out-of-bounds panic in write_mini_batch. Track non-null value counts (nni) separately from leaf slot counts in the chunker, and use them in slice_for_chunk to correctly index into non_null_indices regardless of gaps in the leaf array.