[pull] main from apache:main (#107)
Merged
pull[bot] merged 6 commits into buraksenn:main from apache:main on Apr 15, 2026
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Create an `array_compact` function which removes NULLs from the input array. There is no direct counterpart in DuckDB; however, the function exists in Spark and Snowflake.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
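As a rough illustration of the intended semantics (a Python sketch mirroring Spark's `array_compact`, not DataFusion's actual implementation):

```python
# Illustrative sketch of array_compact semantics: remove NULL (None)
# elements from an array, preserving the order of the remaining
# elements. A NULL input array stays NULL.

def array_compact(arr):
    """Return arr with NULL (None) elements removed; NULL input stays NULL."""
    if arr is None:
        return None
    return [x for x in arr if x is not None]
```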
## Which issue does this PR close?

- Partially addresses #21543. Also needed to properly evaluate the ExternalSorter refactor in #21629, which improves the merge path.

## Rationale for this change

Current sort benchmarks use 100K rows across 8 partitions (~12.5K rows per partition, ~100KB for integers). This falls below `sort_in_place_threshold_bytes` (1MB), so the "sort partitioned" benchmarks always take the concat-and-sort-in-place path and never exercise the sort-then-merge path that dominates real workloads.

## What changes are included in this PR?

Parameterizes the sort benchmark on input size, running each case at both 100K rows (existing) and 1M rows (new). At 1M rows, each partition holds ~125K rows (~1MB for integers), which exercises the merge path.

- `INPUT_SIZE` constant replaced with `INPUT_SIZES` array: `[(100_000, "100k"), (1_000_000, "1M")]`
- `DataGenerator` takes `input_size` as a constructor parameter
- All stream generator functions accept `input_size`
- Benchmark names include a size label (e.g. `sort partitioned i64 100k`, `sort partitioned i64 1M`)
- Data distribution and cardinality ratios are preserved across sizes

## Are these changes tested?

Benchmark compiles and runs. No functional test changes.

## Are there any user-facing changes?

No.
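The sizing arithmetic above can be sanity-checked with a small sketch (constants mirror the PR description; real Arrow batches carry some overhead beyond raw column bytes, so actual in-memory sizes are somewhat larger):

```python
# Back-of-envelope arithmetic for why 100K rows never leaves the
# in-place path while 1M rows reaches the merge path.

SORT_IN_PLACE_THRESHOLD_BYTES = 1024 * 1024  # 1MB threshold named above
PARTITIONS = 8
BYTES_PER_I64 = 8

def partition_bytes(total_rows: int) -> int:
    """Raw i64 column bytes held by each of the 8 partitions."""
    return total_rows // PARTITIONS * BYTES_PER_I64

# 100K rows -> 12_500 rows (~100KB) per partition: well below the
# threshold, so concat-and-sort-in-place is always taken.
# 1M rows -> 125_000 rows (~1MB) per partition: around the threshold,
# large enough (with batch overhead) to exercise sort-then-merge.
```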
## Which issue does this PR close?

- Closes #21450

## Rationale for this change

The `into_stream()` implementation of `GetResult` (from arrow-rs-objectstore) fetches every 8 KiB chunk using a `spawn_blocking()` task, resulting in a lot of scheduling overhead. Fix this by reading the data directly from the async context, using an 8 KiB buffer. This avoids any context switch.

## What changes are included in this PR?

## Are these changes tested?
Validated that the initially reported overhead is now much smaller:

```
Comparing json-test-on-main and test-json-improvement
-------------------- Benchmark clickbench_2.json --------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query     ┃ json-test-on-main ┃ test-json-improvement ┃ Change       ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0  │        2421.62 ms │            2521.19 ms │    no change │
│ QQuery 1  │        2584.29 ms │            2729.98 ms │ 1.06x slower │
│ QQuery 2  │        2662.11 ms │            2782.29 ms │    no change │
│ QQuery 3  │              FAIL │                  FAIL │ incomparable │
│ QQuery 4  │        2764.78 ms │            2896.46 ms │    no change │
│ QQuery 5  │        2676.46 ms │            2758.01 ms │    no change │
│ QQuery 6  │              FAIL │                  FAIL │ incomparable │
│ QQuery 7  │        2684.50 ms │            2752.37 ms │    no change │
│ QQuery 8  │        2781.21 ms │            2827.46 ms │    no change │
│ QQuery 9  │        3039.17 ms │            3165.29 ms │    no change │
│ QQuery 10 │        2791.32 ms │            2843.44 ms │    no change │
│ QQuery 11 │        2839.05 ms │            3011.84 ms │ 1.06x slower │
│ QQuery 12 │        2691.51 ms │            2839.97 ms │ 1.06x slower │
│ QQuery 13 │        2768.57 ms │            2860.68 ms │    no change │
│ QQuery 14 │        2712.50 ms │            2856.80 ms │ 1.05x slower │
│ QQuery 15 │        2807.64 ms │            2888.94 ms │    no change │
│ QQuery 16 │        2774.87 ms │            2875.44 ms │    no change │
│ QQuery 17 │        2797.28 ms │            2850.17 ms │    no change │
│ QQuery 18 │        3017.75 ms │            3111.64 ms │    no change │
│ QQuery 19 │        2801.30 ms │            2927.25 ms │    no change │
│ QQuery 20 │        2743.43 ms │            2862.10 ms │    no change │
│ QQuery 21 │        2811.41 ms │            2906.42 ms │    no change │
│ QQuery 22 │        2953.66 ms │            3038.23 ms │    no change │
│ QQuery 23 │              FAIL │                  FAIL │ incomparable │
│ QQuery 24 │        2862.27 ms │            2940.31 ms │    no change │
│ QQuery 25 │        2763.40 ms │            2848.55 ms │    no change │
│ QQuery 26 │        2840.39 ms │            2950.47 ms │    no change │
│ QQuery 27 │        2886.70 ms │            2921.28 ms │    no change │
│ QQuery 28 │        3145.39 ms │            3221.27 ms │    no change │
│ QQuery 29 │        2821.87 ms │            2869.85 ms │    no change │
│ QQuery 30 │        2953.55 ms │            2990.15 ms │    no change │
│ QQuery 31 │        2997.81 ms │            3049.28 ms │    no change │
│ QQuery 32 │        2969.14 ms │            3126.79 ms │ 1.05x slower │
│ QQuery 33 │        2764.80 ms │            2866.63 ms │    no change │
│ QQuery 34 │        2828.77 ms │            2848.54 ms │    no change │
│ QQuery 35 │        2812.55 ms │            2793.79 ms │    no change │
│ QQuery 36 │              FAIL │                  FAIL │ incomparable │
│ QQuery 37 │              FAIL │                  FAIL │ incomparable │
│ QQuery 38 │              FAIL │                  FAIL │ incomparable │
│ QQuery 39 │              FAIL │                  FAIL │ incomparable │
│ QQuery 40 │              FAIL │                  FAIL │ incomparable │
│ QQuery 41 │              FAIL │                  FAIL │ incomparable │
│ QQuery 42 │              FAIL │                  FAIL │ incomparable │
└───────────┴───────────────────┴───────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (json-test-on-main)       │ 92771.07ms │
│ Total Time (test-json-improvement)   │ 95732.89ms │
│ Average Time (json-test-on-main)     │  2811.24ms │
│ Average Time (test-json-improvement) │  2901.00ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          5 │
│ Queries with No Change               │         28 │
│ Queries with Failure                 │         10 │
└──────────────────────────────────────┴────────────┘
```

and with SIMULATE_LATENCY:

```
-------------------- Benchmark clickbench_2.json --------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃ json-test-on-main ┃ test-json-improvement ┃ Change        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │        2795.68 ms │            2687.68 ms │     no change │
│ QQuery 1  │        2880.50 ms │            2768.30 ms │     no change │
│ QQuery 2  │        2960.75 ms │            2826.89 ms │     no change │
│ QQuery 3  │              FAIL │                  FAIL │  incomparable │
│ QQuery 4  │        3140.38 ms │            2963.15 ms │ +1.06x faster │
│ QQuery 5  │        2926.66 ms │            2830.43 ms │     no change │
│ QQuery 6  │              FAIL │                  FAIL │  incomparable │
│ QQuery 7  │        3026.29 ms │            2858.30 ms │ +1.06x faster │
│ QQuery 8  │        4302.35 ms │            2954.96 ms │ +1.46x faster │
│ QQuery 9  │        4439.83 ms │            3200.43 ms │ +1.39x faster │
│ QQuery 10 │        3028.32 ms │            2969.32 ms │     no change │
│ QQuery 11 │        3147.81 ms │            3040.74 ms │     no change │
│ QQuery 12 │        4169.45 ms │            2886.59 ms │ +1.44x faster │
│ QQuery 13 │        3839.01 ms │            2997.80 ms │ +1.28x faster │
│ QQuery 14 │        4086.30 ms │            2907.42 ms │ +1.41x faster │
│ QQuery 15 │        4308.07 ms │            3025.22 ms │ +1.42x faster │
│ QQuery 16 │        3084.89 ms │            2984.34 ms │     no change │
│ QQuery 17 │        4287.89 ms │            2984.27 ms │ +1.44x faster │
│ QQuery 18 │        3542.80 ms │            3144.98 ms │ +1.13x faster │
│ QQuery 19 │        4388.70 ms │            3014.37 ms │ +1.46x faster │
│ QQuery 20 │        3149.54 ms │            2986.73 ms │ +1.05x faster │
│ QQuery 21 │        3250.81 ms │            2906.60 ms │ +1.12x faster │
│ QQuery 22 │        3265.98 ms │            3122.25 ms │     no change │
│ QQuery 23 │              FAIL │                  FAIL │  incomparable │
│ QQuery 24 │        3066.52 ms │            2997.55 ms │     no change │
│ QQuery 25 │        4289.31 ms │            2884.22 ms │ +1.49x faster │
│ QQuery 26 │        4223.03 ms │            2933.16 ms │ +1.44x faster │
│ QQuery 27 │        3156.86 ms │            3001.17 ms │     no change │
│ QQuery 28 │        4831.42 ms │            3318.89 ms │ +1.46x faster │
│ QQuery 29 │        3252.45 ms │            4375.90 ms │  1.35x slower │
│ QQuery 30 │        4460.06 ms │            3153.77 ms │ +1.41x faster │
│ QQuery 31 │        4235.85 ms │            3171.58 ms │ +1.34x faster │
│ QQuery 32 │        3435.14 ms │            3202.64 ms │ +1.07x faster │
│ QQuery 33 │        3147.21 ms │            3031.54 ms │     no change │
│ QQuery 34 │        4378.41 ms │            3008.79 ms │ +1.46x faster │
│ QQuery 35 │        4224.36 ms │            2897.53 ms │ +1.46x faster │
│ QQuery 36 │              FAIL │                  FAIL │  incomparable │
│ QQuery 37 │              FAIL │                  FAIL │  incomparable │
│ QQuery 38 │              FAIL │                  FAIL │  incomparable │
│ QQuery 39 │              FAIL │                  FAIL │  incomparable │
│ QQuery 40 │              FAIL │                  FAIL │  incomparable │
│ QQuery 41 │              FAIL │                  FAIL │  incomparable │
│ QQuery 42 │              FAIL │                  FAIL │  incomparable │
└───────────┴───────────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (json-test-on-main)       │ 120722.63ms │
│ Total Time (test-json-improvement)   │ 100037.48ms │
│ Average Time (json-test-on-main)     │   3658.26ms │
│ Average Time (test-json-improvement) │   3031.44ms │
│ Queries Faster                       │          21 │
│ Queries Slower                       │           1 │
│ Queries with No Change               │          11 │
│ Queries with Failure                 │          10 │
└──────────────────────────────────────┴─────────────┘
```

For these tests I used a c7a.16xlarge EC2 instance, with hits.json trimmed down to 51 GiB (the original is 217 GiB) and a warm cache (primed by running `cat hits_50.json > /dev/null`).

## Are there any user-facing changes?

No
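The buffering strategy behind the fix can be sketched in Python (this is an illustration of the chunking pattern, not the actual Rust arrow-rs code): rather than handing every 8 KiB chunk to a blocking-task pool, with one scheduling handoff per chunk, the reader pulls fixed-size chunks directly.

```python
# Sketch of direct fixed-size chunked reading: each 8 KiB chunk is
# read in the calling context, with no per-chunk task handoff.

import io

CHUNK_SIZE = 8 * 1024  # 8 KiB, matching the buffer size described above

def read_chunks(reader):
    """Yield successive CHUNK_SIZE chunks from reader until EOF."""
    while True:
        chunk = reader.read(CHUNK_SIZE)
        if not chunk:
            break
        yield chunk
```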
## Which issue does this PR close?

- Closes #21368.

## Rationale for this change

TPC-H query 11 was using a fixed `0.0001` threshold in the HAVING clause. Per the TPC-H spec, this value must be `0.0001 / SF`. This means the benchmark query is only correct for scale factor 1. For larger scale factors the filter becomes too strict, and for smaller scale factors it becomes too loose.

There are also a few benchmark queries using fixed end dates where the spec uses `date + interval`. Those are equivalent, but using intervals matches the TPC-H query definitions more closely.

## What changes are included in this PR?

- Make TPC-H query 11 use scale-factor substitution.
- Add scale factor support in the benchmark runner.
- Infer scale factor from dataset paths like `tpch_sf10`.
- Pass the scale factor from `bench.sh`.
- Keep the old query-loading entry point working with the default scale factor of 1.
- Update queries 5, 6, 10, 12, and 14 to use interval-based date ranges.
- Add regression tests for scale-factor substitution, scale-factor parsing, and invalid scale factors.

## Are these changes tested?

Yes

## Are there any user-facing changes?

TPC-H benchmark query 11 now returns correct results when the scale factor is not 1.
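A hedged sketch of the scale-factor handling described above (the function names and the `{THRESHOLD}` placeholder are illustrative, not the actual benchmark-runner API; the spec-mandated value is `0.0001 / SF`, and the SF is inferred from dataset paths like `tpch_sf10`):

```python
import re

def infer_scale_factor(dataset_path: str) -> float:
    """Infer SF from a path component like 'tpch_sf10'; default to 1."""
    m = re.search(r"tpch_sf(\d+(?:\.\d+)?)", dataset_path)
    return float(m.group(1)) if m else 1.0

def q11_threshold(sf: float) -> float:
    """TPC-H Q11 HAVING threshold per spec: 0.0001 / SF."""
    return 0.0001 / sf

def render_q11(template: str, sf: float) -> str:
    """Substitute the scale-factor-dependent threshold into a query
    template containing an illustrative {THRESHOLD} placeholder."""
    return template.replace("{THRESHOLD}", str(q11_threshold(sf)))
```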
## Which issue does this PR close?

- Related to #19692

## Rationale for this change

As we see more people helping with releases (thank you @timsaucer and @comphead ❤️), I think documenting the process more thoroughly will help. Also, I spent quite a while backporting PRs and wanted to document that better.

## What changes are included in this PR?

1. Add a release management page
2. Leave links to relevant sections

## Are these changes tested?

By CI

## Are there any user-facing changes?

Docs only
## Which issue does this PR close?

- Closes #.

## Rationale for this change

[#21160](#21160) added `datafusion.explain.analyze_categories`, which lets `EXPLAIN ANALYZE` emit only deterministic metric categories (e.g. `'rows'`). That unlocked a long-standing blocker on porting tests out of `datafusion/core/tests/physical_optimizer/filter_pushdown.rs`: previously these tests had to assert on execution state via `insta` snapshots over hand-wired `ExecutionPlan` trees and mock `TestSource` data, which kept them expensive to read, expensive to update, and impossible to test from the user-facing SQL path.

With `analyze_categories = 'rows'`, the `predicate=DynamicFilter [ ... ]` text on a parquet scan is stable across runs, so the same invariants can now be expressed as plain `EXPLAIN ANALYZE` SQL in sqllogictest, where they are easier to read, easier to update, and exercise the full SQL → logical optimizer → physical optimizer → execution pipeline rather than a single optimizer rule in isolation.

## What changes are included in this PR?

24 end-to-end filter-pushdown tests are ported out of `filter_pushdown.rs` and deleted. The helpers `run_aggregate_dyn_filter_case` and `run_projection_dyn_filter_case` (and their supporting structs) are deleted along with the tests that used them. The 24 synchronous `#[test]` optimizer-rule-in-isolation tests are untouched — they stay in Rust because they specifically exercise `FilterPushdown::new()` / `OptimizationTest` over a hand-built plan.
### `datafusion/sqllogictest/test_files/push_down_filter_parquet.slt`

New tests covering:

- TopK dynamic filter pushdown integration (100k-row parquet, `max_row_group_size = 128`, asserting on `pushdown_rows_matched = 128` / `pushdown_rows_pruned = 99.87 K`)
- TopK single-column and multi-column (compound-sort) dynamic filter shapes
- HashJoin CollectLeft dynamic filter with `struct(a, b) IN (SET) ([...])` content
- Nested hash joins propagating filters to both inner scans
- Parent `WHERE` filter splitting across the two sides of a HashJoin
- TopK above HashJoin, with both dynamic filters ANDed on the probe scan
- Dynamic filter flowing through a `GROUP BY` sitting between a HashJoin and the probe scan
- TopK projection rewrite — reorder, prune, expression, alias shadowing
- NULL-bearing build-side join keys
- `LEFT JOIN` and `LEFT SEMI JOIN` dynamic filter pushdown
- HashTable strategy (`hash_lookup`) via `hash_join_inlist_pushdown_max_size = 1`, on both string and integer multi-column keys

### `datafusion/sqllogictest/test_files/push_down_filter_regression.slt`

New tests covering:

- Aggregate dynamic filter baseline: `MIN(a)`, `MAX(a)`, `MIN(a), MAX(a)`, `MIN(a), MAX(b)`, mixed `MIN/MAX` with an unsupported expression input, all-NULL input (filter stays `true`), `MIN(a+1)` (no filter emitted)
- `WHERE` filter on a grouping column pushes through `AggregateExec`
- `HAVING count(b) > 5` filter stays above the aggregate
- End-to-end aggregate dynamic filter actually pruning a multi-file parquet scan

The aggregate baseline tests run under `analyze_level = summary` + `analyze_categories = 'none'` so that metrics render empty and only the `predicate=DynamicFilter [ ... ]` content remains — the filter text is deterministic even though the pruning counts are subject to parallel-execution scheduling.

### What stayed in Rust

Ten async tests now carry a short `// Not portable to sqllogictest: …` header explaining why. In short, they either:

- Hand-wire `PartitionMode::Partitioned` or a `RepartitionExec` boundary that SQL never constructs for the sizes of data these tests use
- Assert via debug-only APIs (`HashJoinExec::dynamic_filter_for_test().is_used()`, `ExecutionPlan::apply_expressions()` + `downcast_ref::<DynamicFilterPhysicalExpr>`) that are not observable from SQL
- Target the specific stacked-`FilterExec` shape (#20109 regression) that the logical optimizer collapses before physical planning

## Are these changes tested?

Yes — the ported tests _are_ the tests. Each ported slt case was generated with `cargo test -p datafusion-sqllogictest --test sqllogictests -- <file> --complete`, then re-run twice back-to-back without `--complete` to confirm determinism. The remaining Rust `filter_pushdown` tests continue to pass (`cargo test -p datafusion --test core_integration filter_pushdown` → 47 passed, 0 failed). `cargo clippy --tests -D warnings` and `cargo fmt --all` are clean.

## Test plan

- [x] `cargo test -p datafusion-sqllogictest --test sqllogictests -- push_down_filter`
- [x] `cargo test -p datafusion --test core_integration filter_pushdown`
- [x] `cargo clippy -p datafusion --tests -- -D warnings`
- [x] `cargo fmt --all`

## Are there any user-facing changes?

No. This is a test-only refactor.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)