[pull] main from apache:main#84

Merged
pull[bot] merged 5 commits into buraksenn:main from apache:main
Apr 6, 2026

Conversation


@pull pull bot commented Apr 6, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

xanderbailey and others added 5 commits April 6, 2026 16:03
…#20562)

## Which issue does this PR close?


- Closes #20561

## Rationale for this change

Previously `create_physical_plan` consumed the `DataFrame`, making it
impossible to inspect (e.g. log) the physical plan and then execute the
same `DataFrame` (e.g. via `write_parquet` or `collect`) without first
cloning it.

Since the method only needs `&LogicalPlan` (which it forwards to
`SessionState::create_physical_plan`), there is no reason to take
ownership. Changing the signature to `&self` makes the common pattern of
"get plan for logging, then write/collect" work naturally.


Also removes the now-unnecessary `self.clone()` in `DataFrame::cache`
that was introduced for the same reason.
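
The "inspect, then execute" pattern this enables can be sketched with std-only stand-ins (`DataFrame`, `LogicalPlan`, and `PhysicalPlan` below are hypothetical simplifications, not the real DataFusion types):

```rust
// Hypothetical stand-ins illustrating the receiver change; not the
// real DataFusion types.
#[derive(Clone, Debug)]
struct LogicalPlan(String);

#[derive(Debug)]
struct PhysicalPlan(String);

struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    // After the change: borrows self, so the DataFrame is not
    // consumed and can still be executed afterwards.
    fn create_physical_plan(&self) -> PhysicalPlan {
        PhysicalPlan(format!("physical({})", self.plan.0))
    }

    // Stand-in for a consuming operation like `collect` or
    // `write_parquet`.
    fn collect(self) -> Vec<String> {
        vec![self.plan.0]
    }
}
```

With `&self`, calling `df.create_physical_plan()` for logging and then `df.collect()` on the same value compiles without an intermediate `df.clone()`.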



## What changes are included in this PR?
Change the receiver of `DataFrame::create_physical_plan` from `self` to `&self`, and remove the now-unnecessary `self.clone()` in `DataFrame::cache`.

## Are these changes tested?
Yes

## Are there any user-facing changes?


---------

Co-authored-by: xanderbailey <xanderbailey@users.noreply.github.com>

## Which issue does this PR close?

- Closes #20989.

## Rationale for this change

The planner should be consistent with the expected SQL behavior—swapping
the names of tables that have identical structure in a SQL query should
not affect the schema for that query.

## What changes are included in this PR?

- A fix in the `exclude_using_columns` helper method in
`datafusion/expr/src/utils.rs` that ensures that we don't retain columns
from the projected side when deciding which USING columns to exclude and
which to retain on top of semi- or antijoins.
- Regression tests for the change in
`test_using_join_wildcard_schema_semi_anti`.
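
The intended schema behavior can be illustrated with a small self-contained sketch (an illustration of the expected wildcard expansion, not DataFusion's actual `exclude_using_columns` code; names and signatures are invented):

```rust
#[derive(Clone, Copy)]
enum JoinType {
    Inner,
    LeftSemi,
    LeftAnti,
}

/// Columns produced by `SELECT *` over a USING join, given each
/// side's columns. Illustrative only.
fn wildcard_columns<'a>(
    join_type: JoinType,
    left: &[&'a str],
    right: &[&'a str],
    using: &[&str],
) -> Vec<&'a str> {
    match join_type {
        // Semi/anti joins project only the left side, so the right
        // side's copies of the USING columns must not influence the
        // output schema at all.
        JoinType::LeftSemi | JoinType::LeftAnti => left.to_vec(),
        // Inner joins keep one copy of each USING column plus the
        // remaining right-side columns.
        JoinType::Inner => left
            .iter()
            .copied()
            .chain(right.iter().copied().filter(|c| !using.contains(c)))
            .collect(),
    }
}
```

For identically structured tables, the semi/anti case depends only on the preserved side, so swapping the table names leaves the schema unchanged.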

## Are these changes tested?

- Added a regression test.

## Are there any user-facing changes?

Yes, the change is user facing, but I doubt this behavior was expected or documented anywhere.
If existing docs need to be updated, please point me to the concrete
places and I can take a look.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?

- Closes #21110

## Rationale for this change

Expose the new Content-Defined Chunking feature from parquet-rs
(apache/arrow-rs#9450).

## What changes are included in this PR?

New parquet writer options for enabling CDC.

## Are these changes tested?

In progress.

## Are there any user-facing changes?

New config options.


Depends on the arrow-rs 58.1 release.

## Which issue does this PR close?

- Closes #21204.

## Rationale for this change

In practice, `split_part(string, delimiter, position)` is often invoked
with constant values for `delimiter` and `position`. We can take
advantage of that to hoist some conditional branches out of the per-row
hot loop; more importantly, we can switch from using `str::split` to
building a `memchr::memmem::Finder` and using it for each row. Building
a `Finder` is relatively expensive but it's a clear win when we can
amortize that one-time cost over an entire input batch.
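
The hoisting idea can be sketched std-only (here only the sign of `position` is resolved once per batch; the actual PR additionally builds a `memchr::memmem::Finder` at the same point, for which `str::split`/`str::rsplit` stand in below):

```rust
#[derive(Clone, Copy)]
enum Pos {
    FromStart(usize),
    FromEnd(usize),
}

/// Scalar fast path: `delimiter` and `position` are constant for the
/// whole batch. `position` is 1-based and nonzero, as in SQL's
/// `split_part`; out-of-range positions yield "".
fn split_part_scalar(rows: &[&str], delimiter: &str, position: i64) -> Vec<String> {
    // Hoisted out of the per-row hot loop: branch on the position's
    // sign exactly once per batch.
    let pos = if position > 0 {
        Pos::FromStart((position - 1) as usize)
    } else {
        Pos::FromEnd((-position - 1) as usize)
    };
    rows.iter()
        .map(|s| {
            match pos {
                Pos::FromStart(i) => s.split(delimiter).nth(i).unwrap_or(""),
                Pos::FromEnd(i) => s.rsplit(delimiter).nth(i).unwrap_or(""),
            }
            .to_string()
        })
        .collect()
}
```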

Benchmarks (M4 Max):

  - `scalar_utf8_single_char/pos_first`: 105 µs → 41 µs, -61%
  - `scalar_utf8_single_char/pos_middle`: 358 µs → 97 µs, -73%
  - `scalar_utf8_single_char/pos_negative`: 110 µs → 46 µs, -58%
  - `scalar_utf8_multi_char/pos_middle`: 355 µs → 132 µs, -63%
  - `scalar_utf8_long_strings/pos_middle`: 1.97 ms → 1.11 ms, -43%
  - `scalar_utf8view_long_parts/pos_middle`: 467 µs → 169 µs, -63%
  - `array_utf8_single_char/pos_middle`: 351 µs → 357 µs, no change
  - `array_utf8_multi_char/pos_middle`: 366 µs → 357 µs, -2.6%

## What changes are included in this PR?

* Add benchmarks for `split_part` with scalar delimiter and position
* Add new fast-path for `split_part` with scalar delimiter and position
* Add SLT tests for `split_part` with scalar delimiter and position

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
…unts (#21369)

## Which issue does this PR close?

N/A — standalone API improvement, prerequisite for #21157.

## Rationale for this change

`PruningStatistics::row_counts(&self, column: &Column)` takes a column
parameter, but row counts are container-level (same for all columns). 8
of 11 implementations ignore the parameter with `_column`. The Parquet
impl (`RowGroupPruningStatistics`) unnecessarily constructs a
`StatisticsConverter` from the column just to call
`row_group_row_counts()`, which doesn't use the column at all.

The existing code even has a comment acknowledging this:
> "row counts are the same for all columns in a row group"

And a test comment:
> "This is debatable, personally I think `row_count` should not take a
`Column` as an argument at all since all columns should have the same
number of rows."

## What changes are included in this PR?

**Breaking change**: `fn row_counts(&self, column: &Column) ->
Option<ArrayRef>` becomes `fn row_counts(&self) -> Option<ArrayRef>`.

- Remove `column` parameter from trait definition and all 11
implementations
- `RowGroupPruningStatistics`: read `num_rows()` directly from row group
metadata instead of routing through `StatisticsConverter`
- `PrunableStatistics`: remove column-exists validation (row count is
container-level)
- Update all call sites and tests
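
The new trait shape can be sketched with a stand-in type (`ArrayRef` is really arrow's `Arc<dyn Array>`; the struct below is a hypothetical simplification of `RowGroupPruningStatistics`):

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `ArrayRef` (`Arc<dyn Array>`).
type ArrayRef = Arc<Vec<u64>>;

// Sketch of the trait after the change: no column parameter, since
// row counts are the same for every column in a container.
trait PruningStatistics {
    fn row_counts(&self) -> Option<ArrayRef>;
}

// Stand-in for `RowGroupPruningStatistics`: row counts come straight
// from per-row-group metadata, no per-column converter required.
struct RowGroupStats {
    num_rows: Vec<u64>,
}

impl PruningStatistics for RowGroupStats {
    fn row_counts(&self) -> Option<ArrayRef> {
        Some(Arc::new(self.num_rows.clone()))
    }
}
```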

## Are these changes tested?

Yes — all existing tests updated and passing. The behavior change is:
- `row_counts()` on `PrunableStatistics` now returns data even for
non-existent columns (correct, since row count is container-level)
- `RowGroupPruningStatistics::row_counts()` always returns row counts
(previously could fail if column wasn't in Parquet schema)

## Are there any user-facing changes?

Yes — breaking change to `PruningStatistics` trait. Downstream
implementations need to remove the `column` parameter from their
`row_counts` method.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pull pull bot locked and limited conversation to collaborators Apr 6, 2026
@pull pull bot added the ⤵️ pull label Apr 6, 2026
@pull pull bot merged commit 206ac6b into buraksenn:main Apr 6, 2026
