[pull] main from apache:main#89

Merged
pull[bot] merged 7 commits into buraksenn:main from apache:main
Apr 8, 2026
Conversation


@pull pull bot commented Apr 8, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


buraksenn and others added 4 commits April 8, 2026 00:45
## Which issue does this PR close?
Part of #20766

## Rationale for this change
Grouped aggregations currently estimate output rows as input_rows,
ignoring available NDV statistics. Spark's AggregateEstimation and
Trino's AggregationStatsRule both use NDV products to tighten this
estimate. This PR closely follows both approaches.


- [Spark
reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino
reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)

## What changes are included in this PR?
- Estimate aggregate output rows as min(input_rows, product(NDV_i +
null_adj_i) * grouping_sets)
- Cap the estimate by the Top-K limit when active, since output rows cannot
exceed K
- Propagate distinct_count from child stats to group-by output columns
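The estimation formula above can be sketched as follows. This is an illustrative model only, not DataFusion's actual API; `estimate_group_count` and all of its parameter names are hypothetical:

```rust
/// Hypothetical sketch of NDV-based output-row estimation:
/// min(input_rows, product(NDV_i + null_adj_i) * grouping_sets), capped by Top-K.
fn estimate_group_count(
    input_rows: u64,
    ndv_per_group_col: &[u64], // distinct_count for each GROUP BY column
    null_adjustment: &[u64],   // e.g. +1 for each column that may contain nulls
    grouping_sets: u64,        // number of grouping sets (1 for a plain GROUP BY)
    top_k_limit: Option<u64>,  // active Top-K limit, if any
) -> u64 {
    // product(NDV_i + null_adj_i) * grouping_sets, saturating to avoid overflow
    let product = ndv_per_group_col
        .iter()
        .zip(null_adjustment)
        .map(|(ndv, adj)| ndv.saturating_add(*adj))
        .fold(1u64, |acc, n| acc.saturating_mul(n))
        .saturating_mul(grouping_sets);
    // Output can never exceed the input row count...
    let estimate = product.min(input_rows);
    // ...and a Top-K limit caps it further.
    match top_k_limit {
        Some(k) => estimate.min(k),
        None => estimate,
    }
}

fn main() {
    // 1000 input rows, two group columns with NDV 10 and 5, no nulls, one grouping set
    assert_eq!(estimate_group_count(1000, &[10, 5], &[0, 0], 1, None), 50);
    // Estimate is capped by the input row count
    assert_eq!(estimate_group_count(30, &[10, 5], &[0, 0], 1, None), 30);
    // An active Top-K limit caps the estimate
    assert_eq!(estimate_group_count(1000, &[10, 5], &[0, 0], 1, Some(7)), 7);
    println!("ok");
}
```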

## Are these changes tested?
Yes, via existing and new tests that cover different scenarios and edge cases.


## Are there any user-facing changes?
No
## Which issue does this PR close?


- Closes #21217

## What changes are included in this PR?


- Adds `ScalarUDFImpl::struct_field_mapping`
- Adds logic in `ProjectionMapping` to decompose struct-producing
functions into their field-level mapping entries so that orderings
propagate through struct projections
- Adds unit tests/SLT

## Are these changes tested?

Yes.

## Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?


- Closes #18816.

## Rationale for this change

In `UserDefinedLogicalNodeCore`, the default implementation of
`necessary_children_exprs` returns `None`, which signals to the
optimizer that it cannot determine which columns are required from the
child.

The optimizer takes a conservative approach and skips projection pruning
for that node, leading to complex and redundant plans in the subtree.
However, it would make more sense to assume all columns are required and
let the optimizer proceed, rather than giving up on the subtree
entirely.

## What changes are included in this PR?

```rust
LogicalPlan::Extension(extension) => {
    if let Some(necessary_children_indices) =
        extension.node.necessary_children_exprs(indices.indices())
    {
        ...
    } else {
        // Requirements from parent cannot be routed down to user defined logical plan safely
        // Assume it requires all input exprs here
        plan.inputs()
            .into_iter()
            .map(RequiredIndices::new_for_all_exprs)
            .collect()
    }
}
```

instead of 


https://github.com/apache/datafusion/blob/b6d46a63824f003117297848d8d83b659ac2e759/datafusion/optimizer/src/optimize_projections/mod.rs#L331-L337

## Are these changes tested?

Yes.

In addition to unit tests, I've also added a complete end-to-end
integration test that reproduces the full scenario in the issue. This
might seem redundant, bloated, or even unnecessary. Please let me know
if I should remove these tests.

An existing test is modified, but I think the newer behavior is
expected.

## Are there any user-facing changes?

Yes. But I think the new implementation is the expected behavior.
## Which issue does this PR close?

- Closes #.

## Rationale for this change

`Fix sort merge interleave overflow` (#20922) added a temporary
`catch_unwind` shim around Arrow's `interleave` call because the
upstream implementation still panicked on offset overflow at the time.

Arrow 58.1.0 includes apache/arrow-rs#9549, which returns
`ArrowError::OffsetOverflowError` directly instead of panicking.
DataFusion main now depends on that release, so the panic recovery path
is no longer needed and only broadens the set of failures we might
accidentally treat as recoverable.

## What changes are included in this PR?

- Remove the temporary panic-catching wrapper from
  `BatchBuilder::try_interleave_columns`.
- Keep the existing retry logic, but trigger it only from the returned
  `OffsetOverflowError`.
- Replace the panic-specific unit tests with a direct error-shape
assertion.
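The shift from panic recovery to error-driven retry can be sketched with a toy model. All types and function names below are hypothetical stand-ins, not Arrow's or DataFusion's actual API:

```rust
// Toy stand-in for Arrow's error type (hypothetical, simplified).
#[derive(Debug, PartialEq)]
enum ArrowLikeError {
    OffsetOverflow,
}

/// Stand-in for the interleave call: since arrow 58.1.0 the real kernel
/// returns an explicit error on offset overflow instead of panicking.
fn interleave_columns(batch_sizes: &[usize]) -> Result<usize, ArrowLikeError> {
    let total: usize = batch_sizes.iter().sum();
    if total > i32::MAX as usize {
        Err(ArrowLikeError::OffsetOverflow)
    } else {
        Ok(total)
    }
}

/// Retry only on the explicit overflow error (here: redo half the work);
/// everything else propagates unchanged. No catch_unwind involved, so
/// unrelated failures are never mistaken for recoverable ones.
fn interleave_with_retry(batch_sizes: &[usize]) -> Result<usize, ArrowLikeError> {
    match interleave_columns(batch_sizes) {
        Err(ArrowLikeError::OffsetOverflow) => {
            interleave_columns(&batch_sizes[..batch_sizes.len() / 2])
        }
        other => other,
    }
}

fn main() {
    assert_eq!(interleave_with_retry(&[100, 200]), Ok(300));
    // 3 billion rows overflow i32 offsets; retrying the first half succeeds.
    assert_eq!(
        interleave_with_retry(&[1_500_000_000, 1_500_000_000]),
        Ok(1_500_000_000)
    );
    println!("ok");
}
```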

## Are these changes tested?

Yes.

- `cargo test -p datafusion-physical-plan sorts::builder -- --nocapture`
- `cargo test -p datafusion-physical-plan sorts:: -- --nocapture`
- `./dev/rust_lint.sh`

## Are there any user-facing changes?

No.
(#21099)

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:

```sql
SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m
```

would unparse to:

```sql
SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m
```

dropping the MAX aggregate and the subquery.

Root cause: the `SubqueryAlias` handler in `select_to_sql_recursively` would
call `subquery_alias_inner_query_and_columns` (which only unwraps
`Projection` children) and `unparse_table_scan_pushdown` (which only handles
`TableScan`/`SubqueryAlias`/`Projection`). When both returned nothing useful
for an `Aggregate` child, the code recursed directly into the `Aggregate`,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the `SubqueryAlias`'s direct child is a
plan type that builds its own SELECT clauses (`Aggregate`, `Window`, `Sort`,
`Limit`, `Union`), emit it as a derived subquery via `self.derive()` with the
alias always attached, rather than falling through to the recursive path
that would flatten it.

## Which issue does this PR close?
- Closes #21098 

## Rationale for this change

The SQL unparser silently drops subquery structure when a SubqueryAlias
node directly wraps an Aggregate (or Window, Sort, Limit, Union). For
example, a plan representing
```sql 
SELECT ... FROM j1 JOIN (SELECT max(id) FROM j2) AS b ...
``` 
unparses to 
```sql
SELECT ... FROM j1 JOIN j2 AS b ...
```
losing the aggregate entirely. This produces semantically incorrect SQL.

## What changes are included in this PR?

In the `SubqueryAlias` handler within `select_to_sql_recursively`
(`datafusion/sql/src/unparser/plan.rs`):
- Added an early check: if the `SubqueryAlias`'s direct child is a plan
type that builds its own SELECT clauses (`Aggregate`, `Window`, `Sort`,
`Limit`, `Union`) and cannot be reduced to a table scan, emit it as a derived
subquery `(SELECT ...) AS alias` via `self.derive()` instead of recursing
into the child and flattening it.
- Added a helper `requires_derived_subquery()` that identifies plan types
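A minimal sketch of such a helper, using a simplified plan enum rather than DataFusion's actual `LogicalPlan` (the enum and variant set below are illustrative):

```rust
// Simplified stand-in for DataFusion's LogicalPlan (hypothetical enum).
#[derive(Debug)]
enum Plan {
    TableScan,
    Projection,
    Aggregate,
    Window,
    Sort,
    Limit,
    Union,
}

/// Plan types that build their own SELECT clauses must be emitted as a
/// derived subquery `(SELECT ...) AS alias` rather than flattened into
/// the outer query.
fn requires_derived_subquery(plan: &Plan) -> bool {
    matches!(
        plan,
        Plan::Aggregate | Plan::Window | Plan::Sort | Plan::Limit | Plan::Union
    )
}

fn main() {
    assert!(requires_derived_subquery(&Plan::Aggregate));
    assert!(requires_derived_subquery(&Plan::Union));
    assert!(!requires_derived_subquery(&Plan::TableScan));
    assert!(!requires_derived_subquery(&Plan::Projection));
    println!("ok");
}
```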

## Are these changes tested?

Yes. A new test `test_unparse_manual_join_with_subquery_aggregate` is
added that constructs a `SubqueryAlias` > `Aggregate` plan (without an
intermediate `Projection`) and asserts the unparsed SQL preserves the
`MAX()` aggregate function call. This test fails without the fix. All
current unparser tests succeed without modification.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql label Apr 8, 2026
neilconway and others added 2 commits April 8, 2026 11:31
## Which issue does this PR close?

- Closes #21410.

## Rationale for this change

When `split_part` is invoked with a `StringViewArray`, we can avoid
copying when constructing the return value by instead returning pointers
into the view buffers of the input `StringViewArray`.

This PR only applies this optimization to the code path for scalar
`delimiter` and `position`, because that's the most common usage mode in
practice. We could also optimize the array-args case but it didn't seem
worth the extra code.
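The zero-copy idea can be illustrated with a toy 1-indexed `split_part` that returns a borrowed slice of the input rather than an allocated `String`, analogous to emitting view entries that point into the original `StringViewArray` buffers. This is a sketch of the semantics only (the real code manipulates Arrow view buffers, and this toy ignores negative positions):

```rust
/// Toy 1-indexed split_part: borrows a slice of `input` instead of
/// allocating, mirroring how views can point into the source buffers.
/// Returns None when `position` is 0 or past the last part.
fn split_part<'a>(input: &'a str, delimiter: &str, position: usize) -> Option<&'a str> {
    input.split(delimiter).nth(position.checked_sub(1)?)
}

fn main() {
    assert_eq!(split_part("a,b,c", ",", 2), Some("b")); // borrows "b" from input
    assert_eq!(split_part("a,b,c", ",", 5), None); // out of range
    assert_eq!(split_part("a,b,c", ",", 0), None); // SQL positions are 1-based
    println!("ok");
}
```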

Benchmarks (M4 Max):

  - scalar_utf8view_very_long_parts/pos_first: 102 µs → 68 µs (-33%)
  - scalar_utf8view_long_parts/pos_middle: 164 µs → 137 µs (-15%)
  - scalar_utf8_single_char/pos_first: 42.5 µs → 42.9 µs (no change)
  - scalar_utf8_single_char/pos_middle: 96.5 µs → 99.5 µs (noise)
  - scalar_utf8_single_char/pos_negative: 48.3 µs → 48.6 µs (no change)
  - scalar_utf8_multi_char/pos_middle: 132 µs → 132 µs (no change)
  - scalar_utf8_long_strings/pos_middle: 1.06 ms → 1.08 ms (noise)
  - array_utf8_single_char/pos_middle: 355 µs → 365 µs (noise)
  - array_utf8_multi_char/pos_middle: 357 µs → 360 µs (no change)

## What changes are included in this PR?

* Implement optimization
* Add benchmark that covers this case
* Improve SLT test coverage for this code path

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
## Which issue does this PR close?
This attempts to bridge the missing test coverage mentioned by @alamb in
issue #8791.

- Closes #.

## Rationale for this change


## What changes are included in this PR?


## Are these changes tested?
The changes are tested.

## Are there any user-facing changes?
No.
@pull pull bot merged commit 5ba06ac into buraksenn:main Apr 8, 2026
1 check passed