[pull] main from apache:main#89

Merged
pull[bot] merged 7 commits into buraksenn:main from apache:main
Apr 8, 2026
Conversation


@pull pull bot commented Apr 8, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


buraksenn and others added 4 commits April 8, 2026 00:45
## Which issue does this PR close?
Part of #20766

## Rationale for this change
Grouped aggregations currently estimate output rows as input_rows,
ignoring available NDV statistics. Spark's AggregateEstimation and
Trino's AggregationStatsRule both use NDV products to tighten this
estimate. This PR closely follows both approaches.


- [Spark
reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino
reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)

## What changes are included in this PR?
- Estimate aggregate output rows as min(input_rows, product(NDV_i +
null_adj_i) * grouping_sets)
- Cap the estimate by the Top-K limit when active, since output rows cannot
exceed K
- Propagate distinct_count from child stats to group-by output columns
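The estimation formula above can be sketched as follows. This is an illustrative model only, not DataFusion's actual API; `estimate_group_count` and all of its parameter names are hypothetical:

```rust
/// Hypothetical sketch of NDV-based output-row estimation:
/// min(input_rows, product(NDV_i + null_adj_i) * grouping_sets), capped by Top-K.
fn estimate_group_count(
    input_rows: u64,
    ndv_per_group_col: &[u64], // distinct_count for each GROUP BY column
    null_adjustment: &[u64],   // e.g. +1 for each column that may contain nulls
    grouping_sets: u64,        // number of grouping sets (1 for a plain GROUP BY)
    top_k_limit: Option<u64>,  // active Top-K limit, if any
) -> u64 {
    // product(NDV_i + null_adj_i) * grouping_sets, saturating to avoid overflow
    let product = ndv_per_group_col
        .iter()
        .zip(null_adjustment)
        .map(|(ndv, adj)| ndv.saturating_add(*adj))
        .fold(1u64, |acc, n| acc.saturating_mul(n))
        .saturating_mul(grouping_sets);
    // Output can never exceed the input row count...
    let estimate = product.min(input_rows);
    // ...and a Top-K limit caps it further.
    match top_k_limit {
        Some(k) => estimate.min(k),
        None => estimate,
    }
}

fn main() {
    // 1000 input rows, two group columns with NDV 10 and 5, no nulls, one grouping set
    assert_eq!(estimate_group_count(1000, &[10, 5], &[0, 0], 1, None), 50);
    // Estimate is capped by the input row count
    assert_eq!(estimate_group_count(30, &[10, 5], &[0, 0], 1, None), 30);
    // An active Top-K limit caps the estimate
    assert_eq!(estimate_group_count(1000, &[10, 5], &[0, 0], 1, Some(7)), 7);
    println!("ok");
}
```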

## Are these changes tested?
Yes, via existing and new tests that cover different scenarios and edge cases.


## Are there any user-facing changes?
No
## Which issue does this PR close?


- Closes #21217

## What changes are included in this PR?


- Adds `ScalarUDFImpl::struct_field_mapping`
- Adds logic in `ProjectionMapping` to decompose struct-producing
functions into their field-level mapping entries so that orderings
propagate through struct projections
- Adds unit tests/SLT

## Are these changes tested?

Yes.

## Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?


- Closes #18816.

## Rationale for this change

In `UserDefinedLogicalNodeCore`, the default implementation of
`necessary_children_exprs` returns `None`, which signals to the
optimizer that it cannot determine which columns are required from the
child.

The optimizer takes a conservative approach and skips projection pruning
for that node, leading to complex and redundant plans in the subtree.
However, it would make more sense to assume all columns are required and
let the optimizer proceed, rather than giving up on the subtree
entirely.

## What changes are included in this PR?

```rust
LogicalPlan::Extension(extension) => {
    if let Some(necessary_children_indices) =
        extension.node.necessary_children_exprs(indices.indices())
    {
        ...
    } else {
        // Requirements from parent cannot be routed down to user defined logical plan safely
        // Assume it requires all input exprs here
        plan.inputs()
            .into_iter()
            .map(RequiredIndices::new_for_all_exprs)
            .collect()
    }
}
```

instead of 


https://github.com/apache/datafusion/blob/b6d46a63824f003117297848d8d83b659ac2e759/datafusion/optimizer/src/optimize_projections/mod.rs#L331-L337

## Are these changes tested?

Yes.

In addition to unit tests, I've also added a complete end-to-end
integration test that reproduces the full scenario in the issue. This
might seem redundant, bloated, or even unnecessary. Please let me know
if I should remove these tests.

An existing test is modified, but I think the newer behavior is
expected.

## Are there any user-facing changes?

Yes. But I think the new implementation is the expected behavior.
## Which issue does this PR close?

- Closes #.

## Rationale for this change

`Fix sort merge interleave overflow` (#20922) added a temporary
`catch_unwind` shim around Arrow's `interleave` call because the
upstream implementation still panicked on offset overflow at the time.

Arrow 58.1.0 includes apache/arrow-rs#9549, which returns
`ArrowError::OffsetOverflowError` directly instead of panicking.
DataFusion main now depends on that release, so the panic recovery path
is no longer needed and only broadens the set of failures we might
accidentally treat as recoverable.

## What changes are included in this PR?

- Remove the temporary panic-catching wrapper from
  `BatchBuilder::try_interleave_columns`.
- Keep the existing retry logic, but trigger it only from the returned
  `OffsetOverflowError`.
- Replace the panic-specific unit tests with a direct error-shape
assertion.
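The shift from panic recovery to error-driven retry can be sketched with a toy model. All types and function names below are hypothetical stand-ins, not Arrow's or DataFusion's actual API:

```rust
// Toy stand-in for Arrow's error type (hypothetical, simplified).
#[derive(Debug, PartialEq)]
enum ArrowLikeError {
    OffsetOverflow,
}

/// Stand-in for the interleave call: since arrow 58.1.0 the real kernel
/// returns an explicit error on offset overflow instead of panicking.
fn interleave_columns(batch_sizes: &[usize]) -> Result<usize, ArrowLikeError> {
    let total: usize = batch_sizes.iter().sum();
    if total > i32::MAX as usize {
        Err(ArrowLikeError::OffsetOverflow)
    } else {
        Ok(total)
    }
}

/// Retry only on the explicit overflow error (here: redo half the work);
/// everything else propagates unchanged. No catch_unwind involved, so
/// unrelated failures are never mistaken for recoverable ones.
fn interleave_with_retry(batch_sizes: &[usize]) -> Result<usize, ArrowLikeError> {
    match interleave_columns(batch_sizes) {
        Err(ArrowLikeError::OffsetOverflow) => {
            interleave_columns(&batch_sizes[..batch_sizes.len() / 2])
        }
        other => other,
    }
}

fn main() {
    assert_eq!(interleave_with_retry(&[100, 200]), Ok(300));
    // 3 billion rows overflow i32 offsets; retrying the first half succeeds.
    assert_eq!(
        interleave_with_retry(&[1_500_000_000, 1_500_000_000]),
        Ok(1_500_000_000)
    );
    println!("ok");
}
```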

## Are these changes tested?

Yes.

- `cargo test -p datafusion-physical-plan sorts::builder -- --nocapture`
- `cargo test -p datafusion-physical-plan sorts:: -- --nocapture`
- `./dev/rust_lint.sh`

## Are there any user-facing changes?

No.
(#21099)

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:

```sql
SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m
```

would unparse to:

```sql
SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m
```

dropping the MAX aggregate and the subquery.

Root cause: the `SubqueryAlias` handler in `select_to_sql_recursively` would
call `subquery_alias_inner_query_and_columns` (which only unwraps
`Projection` children) and `unparse_table_scan_pushdown` (which only handles
`TableScan`/`SubqueryAlias`/`Projection`). When both returned nothing useful
for an `Aggregate` child, the code recursed directly into the `Aggregate`,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the `SubqueryAlias`'s direct child is a
plan type that builds its own SELECT clauses (`Aggregate`, `Window`, `Sort`,
`Limit`, `Union`), emit it as a derived subquery via `self.derive()` with the
alias always attached, rather than falling through to the recursive path
that would flatten it.

## Which issue does this PR close?
- Closes #21098 

## Rationale for this change

The SQL unparser silently drops subquery structure when a SubqueryAlias
node directly wraps an Aggregate (or Window, Sort, Limit, Union). For
example, a plan representing
```sql 
SELECT ... FROM j1 JOIN (SELECT max(id) FROM j2) AS b ...
``` 
unparses to 
```sql
SELECT ... FROM j1 JOIN j2 AS b ...
```
losing the aggregate entirely. This produces semantically incorrect SQL.

## What changes are included in this PR?

In the `SubqueryAlias` handler within `select_to_sql_recursively`
(`datafusion/sql/src/unparser/plan.rs`):
- Added an early check: if the `SubqueryAlias`'s direct child is a plan
type that builds its own SELECT clauses (`Aggregate`, `Window`, `Sort`,
`Limit`, `Union`) and cannot be reduced to a table scan, emit it as a derived
subquery `(SELECT ...) AS alias` via `self.derive()` instead of recursing
into the child and flattening it.
- Added a helper `requires_derived_subquery()` that identifies plan types
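A minimal sketch of such a helper, using a simplified plan enum rather than DataFusion's actual `LogicalPlan` (the enum and variant set below are illustrative):

```rust
// Simplified stand-in for DataFusion's LogicalPlan (hypothetical enum).
#[derive(Debug)]
enum Plan {
    TableScan,
    Projection,
    Aggregate,
    Window,
    Sort,
    Limit,
    Union,
}

/// Plan types that build their own SELECT clauses must be emitted as a
/// derived subquery `(SELECT ...) AS alias` rather than flattened into
/// the outer query.
fn requires_derived_subquery(plan: &Plan) -> bool {
    matches!(
        plan,
        Plan::Aggregate | Plan::Window | Plan::Sort | Plan::Limit | Plan::Union
    )
}

fn main() {
    assert!(requires_derived_subquery(&Plan::Aggregate));
    assert!(requires_derived_subquery(&Plan::Union));
    assert!(!requires_derived_subquery(&Plan::TableScan));
    assert!(!requires_derived_subquery(&Plan::Projection));
    println!("ok");
}
```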

## Are these changes tested?

Yes. A new test `test_unparse_manual_join_with_subquery_aggregate` is
added that constructs a `SubqueryAlias` > `Aggregate` plan (without an
intermediate `Projection`) and asserts the unparsed SQL preserves the
`MAX()` aggregate function call. This test fails without the fix. All
current unparser tests succeed without modification.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql label Apr 8, 2026
neilconway and others added 2 commits April 8, 2026 11:31
## Which issue does this PR close?

- Closes #21410.

## Rationale for this change

When `split_part` is invoked with a `StringViewArray`, we can avoid
copying when constructing the return value by instead returning pointers
into the view buffers of the input `StringViewArray`.

This PR only applies this optimization to the code path for scalar
`delimiter` and `position`, because that's the most common usage mode in
practice. We could also optimize the array-args case but it didn't seem
worth the extra code.
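The zero-copy idea can be illustrated with a toy 1-indexed `split_part` that returns a borrowed slice of the input rather than an allocated `String`, analogous to emitting view entries that point into the original `StringViewArray` buffers. This is a sketch of the semantics only (the real code manipulates Arrow view buffers, and this toy ignores negative positions):

```rust
/// Toy 1-indexed split_part: borrows a slice of `input` instead of
/// allocating, mirroring how views can point into the source buffers.
/// Returns None when `position` is 0 or past the last part.
fn split_part<'a>(input: &'a str, delimiter: &str, position: usize) -> Option<&'a str> {
    input.split(delimiter).nth(position.checked_sub(1)?)
}

fn main() {
    assert_eq!(split_part("a,b,c", ",", 2), Some("b")); // borrows "b" from input
    assert_eq!(split_part("a,b,c", ",", 5), None); // out of range
    assert_eq!(split_part("a,b,c", ",", 0), None); // SQL positions are 1-based
    println!("ok");
}
```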

Benchmarks (M4 Max):

  - scalar_utf8view_very_long_parts/pos_first: 102 µs → 68 µs (-33%)
  - scalar_utf8view_long_parts/pos_middle: 164 µs → 137 µs (-15%)
  - scalar_utf8_single_char/pos_first: 42.5 µs → 42.9 µs (no change)
  - scalar_utf8_single_char/pos_middle: 96.5 µs → 99.5 µs (noise)
  - scalar_utf8_single_char/pos_negative: 48.3 µs → 48.6 µs (no change)
  - scalar_utf8_multi_char/pos_middle: 132 µs → 132 µs (no change)
  - scalar_utf8_long_strings/pos_middle: 1.06 ms → 1.08 ms (noise)
  - array_utf8_single_char/pos_middle: 355 µs → 365 µs (noise)
  - array_utf8_multi_char/pos_middle: 357 µs → 360 µs (no change)

## What changes are included in this PR?

* Implement optimization
* Add benchmark that covers this case
* Improve SLT test coverage for this code path

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
## Which issue does this PR close?
This attempts to bridge the missing test coverage mentioned by @alamb in
issue #8791.

- Closes #.

## Rationale for this change


## What changes are included in this PR?


## Are these changes tested?
The changes are tested.

## Are there any user-facing changes?
No.
@pull pull bot merged commit 5ba06ac into buraksenn:main Apr 8, 2026
1 check passed