## Which issue does this PR close?

Part of #20766

## Rationale for this change

Grouped aggregations currently estimate output rows as input_rows, ignoring available NDV statistics. Spark's AggregateEstimation and Trino's AggregationStatsRule both use NDV products to tighten this estimate, and this PR draws heavily on both implementations.

- [Spark reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)

## What changes are included in this PR?

- Estimate aggregate output rows as min(input_rows, product(NDV_i + null_adj_i) * grouping_sets)
- Cap the estimate by the Top-K limit when one is active, since the output row count cannot exceed K
- Propagate distinct_count from child stats to group-by output columns

## Are these changes tested?

Yes. Existing and new tests cover different scenarios and edge cases.

## Are there any user-facing changes?

No
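The estimation rule described above (the minimum of the input row count and the NDV product across group-by columns, optionally capped by a Top-K limit) can be sketched as plain arithmetic. The function below is an illustrative model only; the names and signatures are not DataFusion's actual statistics API:

```rust
/// Sketch of the output-row estimate for a grouped aggregation, assuming a
/// per-column NDV (distinct_count) is available. Each group-by column
/// contributes `ndv + 1` when it is nullable (NULL forms its own group).
fn estimate_agg_output_rows(
    input_rows: u64,
    group_cols: &[(u64, bool)], // (ndv, has_nulls) per group-by column
    grouping_sets: u64,
    top_k_limit: Option<u64>,
) -> u64 {
    // product(NDV_i + null_adj_i), saturating to avoid overflow
    let product = group_cols.iter().fold(1u64, |acc, &(ndv, has_nulls)| {
        acc.saturating_mul(ndv + u64::from(has_nulls))
    });
    // Never more groups than input rows
    let mut estimate = input_rows.min(product.saturating_mul(grouping_sets));
    // Output cannot exceed K when a Top-K limit is active
    if let Some(k) = top_k_limit {
        estimate = estimate.min(k);
    }
    estimate
}
```

For example, grouping 1000 rows by a column with NDV 10 and a nullable column with NDV 5 yields at most 10 * (5 + 1) = 60 groups, so 60 beats the old input_rows estimate.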
## Which issue does this PR close?

- Closes #21217

## What changes are included in this PR?

- Adds `ScalarUDFImpl::struct_field_mapping`
- Adds logic in `ProjectionMapping` to decompose struct-producing functions into their field-level mapping entries so that orderings propagate through struct projections
- Adds unit tests/SLT

## Are these changes tested?

Yes.

## Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?

- Closes #18816

## Rationale for this change

In `UserDefinedLogicalNodeCore`, the default implementation of `necessary_children_exprs` returns `None`, which signals to the optimizer that it cannot determine which columns are required from the child. The optimizer takes a conservative approach and skips projection pruning for that node, leading to complex and redundant plans in the subtree. It would make more sense to assume all columns are required and let the optimizer proceed, rather than giving up on the subtree entirely.

## What changes are included in this PR?

```rust
LogicalPlan::Extension(extension) => {
    if let Some(necessary_children_indices) =
        extension.node.necessary_children_exprs(indices.indices())
    {
        ...
    } else {
        // Requirements from parent cannot be routed down to user defined logical plan safely
        // Assume it requires all input exprs here
        plan.inputs()
            .into_iter()
            .map(RequiredIndices::new_for_all_exprs)
            .collect()
    }
}
```

instead of https://github.com/apache/datafusion/blob/b6d46a63824f003117297848d8d83b659ac2e759/datafusion/optimizer/src/optimize_projections/mod.rs#L331-L337

## Are these changes tested?

Yes.
In addition to unit tests, I've also added a complete end-to-end integration test that reproduces the full scenario in the issue. This might seem redundant or even unnecessary; please let me know if I should remove these tests. An existing test is modified, but I think the new behavior is expected.

## Are there any user-facing changes?

Yes, but I think the new implementation is the expected behavior.
## Which issue does this PR close?

- Closes #.

## Rationale for this change

`Fix sort merge interleave overflow` (#20922) added a temporary `catch_unwind` shim around Arrow's `interleave` call because the upstream implementation still panicked on offset overflow at the time. Arrow 58.1.0 includes apache/arrow-rs#9549, which returns `ArrowError::OffsetOverflowError` directly instead of panicking. DataFusion main now depends on that release, so the panic recovery path is no longer needed and only broadens the set of failures we might accidentally treat as recoverable.

## What changes are included in this PR?

- Remove the temporary panic-catching wrapper from `BatchBuilder::try_interleave_columns`.
- Keep the existing retry logic, but trigger it only from the returned `OffsetOverflowError`.
- Replace the panic-specific unit tests with a direct error-shape assertion.

## Are these changes tested?

Yes.

- `cargo test -p datafusion-physical-plan sorts::builder -- --nocapture`
- `cargo test -p datafusion-physical-plan sorts:: -- --nocapture`
- `./dev/rust_lint.sh`

## Are there any user-facing changes?

No.
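The shape of the change can be sketched with stand-in types: instead of catching a panic (which would also swallow unrelated failures), the retry path matches only the specific overflow error variant and lets everything else propagate. `ArrowishError` and both function names below are hypothetical, not the real arrow-rs/DataFusion API:

```rust
/// Minimal model of error-driven retry: only the overflow variant is treated
/// as recoverable; success and all other errors pass through unchanged.
#[derive(Debug, PartialEq)]
enum ArrowishError {
    OffsetOverflow,
    Other(String),
}

/// Stand-in for the interleave attempt: fails with OffsetOverflow unless the
/// (hypothetical) smaller slice fits.
fn try_interleave(fits: bool) -> Result<Vec<u8>, ArrowishError> {
    if fits {
        Ok(vec![1, 2, 3])
    } else {
        Err(ArrowishError::OffsetOverflow)
    }
}

fn interleave_with_retry(fits_first_try: bool) -> Result<Vec<u8>, ArrowishError> {
    match try_interleave(fits_first_try) {
        // Only offset overflow triggers the fallback (retry on a smaller slice)
        Err(ArrowishError::OffsetOverflow) => try_interleave(true),
        other => other,
    }
}
```

Compared with the old `catch_unwind` shim, this keeps the recoverable set narrow: an `ArrowishError::Other` (or any future error variant) surfaces to the caller instead of silently entering the retry path.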
## Which issue does this PR close?

- Closes #21098

## Rationale for this change

The SQL unparser silently drops subquery structure when a SubqueryAlias node directly wraps an Aggregate (or another clause-building plan such as Window, Sort, Limit, or Union). For example, a plan representing

```sql
SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m
```

unparses to

```sql
SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m
```

dropping the MAX aggregate and the subquery, which produces semantically incorrect SQL.

Root cause: the SubqueryAlias handler in `select_to_sql_recursively` calls `subquery_alias_inner_query_and_columns` (which only unwraps Projection children) and `unparse_table_scan_pushdown` (which only handles TableScan/SubqueryAlias/Projection). When neither returns anything useful for an Aggregate child, the code recurses directly into the Aggregate, merging its GROUP BY into the outer SELECT instead of wrapping it in a derived subquery.

## What changes are included in this PR?
In the SubqueryAlias handler within `select_to_sql_recursively` (`datafusion/sql/src/unparser/plan.rs`):

- Added an early check: if the SubqueryAlias's direct child is a plan type that builds its own SELECT clauses (Aggregate, Window, Sort, Limit, Union) and cannot be reduced to a table scan, emit it as a derived subquery `(SELECT ...) AS alias` via `self.derive()` instead of recursing into the child and flattening it.
- Added a helper `requires_derived_subquery()` that identifies plan types requiring this treatment.

## Are these changes tested?

Yes. A new test `test_unparse_manual_join_with_subquery_aggregate` constructs a SubqueryAlias > Aggregate plan (without an intermediate Projection) and asserts that the unparsed SQL preserves the MAX() aggregate function call. This test fails without the fix. All current unparser tests succeed without modification.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
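The helper's check boils down to a match over plan kinds. The sketch below uses a simplified `PlanKind` enum rather than DataFusion's actual `LogicalPlan`, so it only illustrates the classification, not the real implementation:

```rust
/// Simplified stand-in for DataFusion's LogicalPlan variants.
#[allow(dead_code)]
#[derive(Debug)]
enum PlanKind {
    Aggregate,
    Window,
    Sort,
    Limit,
    Union,
    Projection,
    TableScan,
}

/// Plan kinds that build their own SELECT clauses must be emitted as a
/// derived subquery `(SELECT ...) AS alias` rather than flattened into the
/// outer query.
fn requires_derived_subquery(plan: &PlanKind) -> bool {
    matches!(
        plan,
        PlanKind::Aggregate
            | PlanKind::Window
            | PlanKind::Sort
            | PlanKind::Limit
            | PlanKind::Union
    )
}
```

Projection and TableScan are deliberately excluded: those are the cases the existing unwrapping helpers already handle correctly, so only the clause-building plans take the new derived-subquery path.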
## Which issue does this PR close?

- Closes #21410.

## Rationale for this change

When `split_part` is invoked with a `StringViewArray`, we can avoid copying when constructing the return value by instead returning pointers into the view buffers of the input `StringViewArray`. This PR only applies this optimization to the code path for scalar `delimiter` and `position`, because that's the most common usage mode in practice. We could also optimize the array-args case, but it didn't seem worth the extra code.

Benchmarks (M4 Max):

- scalar_utf8view_very_long_parts/pos_first: 102 µs → 68 µs (-33%)
- scalar_utf8view_long_parts/pos_middle: 164 µs → 137 µs (-15%)
- scalar_utf8_single_char/pos_first: 42.5 µs → 42.9 µs (no change)
- scalar_utf8_single_char/pos_middle: 96.5 µs → 99.5 µs (noise)
- scalar_utf8_single_char/pos_negative: 48.3 µs → 48.6 µs (no change)
- scalar_utf8_multi_char/pos_middle: 132 µs → 132 µs (no change)
- scalar_utf8_long_strings/pos_middle: 1.06 ms → 1.08 ms (noise)
- array_utf8_single_char/pos_middle: 355 µs → 365 µs (noise)
- array_utf8_multi_char/pos_middle: 357 µs → 360 µs (no change)

## What changes are included in this PR?

- Implement the optimization
- Add a benchmark that covers this case
- Improve SLT test coverage for this code path

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
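Why the StringView path can skip copying: each view is essentially a (buffer, offset, length) triple into shared data buffers, so the result of `split_part` can be a new view into the input's own buffer. The sketch below is a deliberate simplification of Arrow's actual view layout (which, among other things, inlines strings of 12 bytes or fewer rather than pointing into a buffer):

```rust
/// Simplified model of a string view: a pointer into a shared data buffer.
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Return the 1-based `part` of `input` split on `delim`, as a new view into
/// the same buffer. No string bytes are copied.
fn split_part_view(input: &View, data: &[Vec<u8>], delim: u8, part: usize) -> Option<View> {
    let bytes = &data[input.buffer][input.offset..input.offset + input.len];
    let mut start = 0;
    for (i, piece) in bytes.split(|&b| b == delim).enumerate() {
        if i + 1 == part {
            // New view into the *input's* buffer, shifted to the match
            return Some(View {
                buffer: input.buffer,
                offset: input.offset + start,
                len: piece.len(),
            });
        }
        start += piece.len() + 1; // skip past the piece and its delimiter
    }
    None
}
```

The benchmark pattern matches this intuition: the gains (-33%, -15%) show up on the long-parts cases, where avoiding the byte copy matters most, while the short-string cases are unchanged or within noise.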
## Which issue does this PR close?

This attempts to bridge the missing test coverage mentioned by @alamb on issue #8791.

- Closes #.

## Are these changes tested?

The changes are tests.

## Are there any user-facing changes?

No.