[pull] main from apache:main by pull[bot] · Pull Request #92 · buraksenn/datafusion

pull · 2026-04-09T18:33:29Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


## Which issue does this PR close?

- related to #21379

## Rationale for this change

While reviewing #21379 I noticed there was minimal Utf8View coverage of the related code.

## What changes are included in this PR?

Update the regexp_replace tests to cover utf8, largeutf8, utf8view and dictionary

## Are these changes tested?

Yes, only tests. I verified these tests also pass when run on

- #21379

## Are there any user-facing changes?

No

## Which issue does this PR close?

- Closes #21060.

## Rationale for this change

`lpad`, `rpad`, and `translate` use grapheme segmentation. This is inconsistent with how these functions behave in Postgres and DuckDB, as well as the SQL standard -- segmentation based on Unicode codepoints is used instead. It also happens that grapheme-based segmentation is significantly more expensive than codepoint-based segmentation.

In the case of `lpad` and `rpad`, graphemes and codepoints were used inconsistently: the input string was measured in code points but the fill string was measured in graphemes.

#3054 switched to using codepoints for most string-related functions in DataFusion but these three functions still need to be changed.

Benchmarks (M4 Max):

lpad size=1024:

- lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
- lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
- lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
- lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
- lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
- lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1 µs, -0.1%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs → 10.9 µs, -75.7%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs → 11.7 µs, -92.3%

lpad size=4096:

- lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
- lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
- lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
- lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
- lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
- lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs, +2.4%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4 µs, +0.6%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs → 42.9 µs, -76.0%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs → 46.2 µs, -92.3%

rpad size=1024:

- rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
- rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
- rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
- rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
- rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
- rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs, -3.9%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6 µs, +2.8%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs → 12.4 µs, -73.1%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs → 11.6 µs, -92.4%

rpad size=4096:

- rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
- rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
- rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
- rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
- rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
- rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs, -4.0%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5 µs, +3.6%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs → 48.1 µs, -73.9%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs → 45.6 µs, -92.5%

translate size=1024:

- array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
- scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
- array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
- scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
- array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
- scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
- array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
- scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%

translate size=4096:

- array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
- scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
- array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
- scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
- array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
- scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
- array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
- scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%

## What changes are included in this PR?

* Switch from grapheme segmentation to codepoint segmentation for `lpad`, `rpad`, and `translate`
* Add SLT tests
* Refactor a few helper functions
* Remove dependency on `unicode_segmentation` crate as it is no longer used

## Are these changes tested?

Yes. The new SLT tests were also run against DuckDB and Postgres to confirm the behavior is consistent.

## Are there any user-facing changes?

Yes. This PR changes the behavior of `lpad`, `rpad`, and `translate`, although the new behavior is more consistent with the SQL standard and with other SQL implementations.
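To make the codepoint-based behavior concrete, here is a minimal standalone sketch in Rust using only the standard library. It is not the DataFusion implementation: the function names mirror the SQL functions, but the signatures and the Postgres-style semantics (truncate when the target is shorter than the input, delete characters of `from` that have no counterpart in `to`) are assumptions made for illustration. The key point is that `str::chars()` iterates Unicode scalar values, so both the input and the fill string are measured in the same unit.

```rust
use std::collections::HashMap;

/// Left-pad `s` to `target` code points, cycling through `fill`.
/// Lengths are counted with `chars()`, i.e. in Unicode scalar values.
fn lpad(s: &str, target: usize, fill: &str) -> String {
    let len = s.chars().count();
    if target <= len {
        // Assumed Postgres-style behavior: keep the first `target` code points.
        return s.chars().take(target).collect();
    }
    if fill.is_empty() {
        return s.to_string();
    }
    let pad: String = fill.chars().cycle().take(target - len).collect();
    pad + s
}

/// Right-pad `s` to `target` code points, cycling through `fill`.
fn rpad(s: &str, target: usize, fill: &str) -> String {
    let len = s.chars().count();
    if target <= len {
        return s.chars().take(target).collect();
    }
    if fill.is_empty() {
        return s.to_string();
    }
    let pad: String = fill.chars().cycle().take(target - len).collect();
    s.to_string() + &pad
}

/// Replace each code point of `s` found in `from` with the code point at the
/// same position in `to`; code points of `from` without a counterpart are
/// deleted. The first occurrence in `from` wins if a code point repeats.
fn translate(s: &str, from: &str, to: &str) -> String {
    let to_chars: Vec<char> = to.chars().collect();
    let mut map: HashMap<char, Option<char>> = HashMap::new();
    for (i, c) in from.chars().enumerate() {
        map.entry(c).or_insert(to_chars.get(i).copied());
    }
    s.chars()
        .filter_map(|c| match map.get(&c) {
            Some(replacement) => *replacement, // mapped, or deleted when None
            None => Some(c),                   // not in `from`: keep as-is
        })
        .collect()
}
```

With these definitions, `lpad("hi", 5, "xy")` yields `"xyxhi"` and `translate("12345", "143", "ax")` yields `"a2x5"`. A string such as `"e\u{301}"` (an `e` followed by a combining acute accent) counts as two code points here, so `lpad("e\u{301}", 3, "*")` adds a single `*`; a grapheme-based count would treat it as one unit and add two.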

## Which issue does this PR close?

N/A (new feature)

## Rationale for this change

DuckDB provides a [`cast_to_type(expression, reference)`](https://duckdb.org/docs/current/sql/expressions/cast#cast_to_type-function) function that casts the first argument to the data type of the second argument. This is useful in macros and generic SQL where types need to be preserved or matched dynamically. This PR adds the equivalent function to DataFusion, along with a fallible `try_cast_to_type` variant.

## What changes are included in this PR?

- New `cast_to_type` scalar UDF in `datafusion/functions/src/core/cast_to_type.rs`
  - Takes two arguments: the expression to cast, and a reference expression whose **type** (not value) determines the target cast type
  - Uses `return_field_from_args` to infer return type from the second argument's data type
  - `simplify()` rewrites to `Expr::Cast` (or no-op if types match), so there is zero runtime overhead
- New `try_cast_to_type` scalar UDF in `datafusion/functions/src/core/try_cast_to_type.rs`
  - Same as `cast_to_type` but returns NULL on cast failure instead of erroring
  - `simplify()` rewrites to `Expr::TryCast`
  - Output is always nullable
- Registration of both functions in `datafusion/functions/src/core/mod.rs`

## Are these changes tested?

Yes. New sqllogictest file `cast_to_type.slt` covering both functions:

- Basic casts (string→int, string→double, int→string, int→double)
- NULL handling
- Same-type no-op
- CASE expression as first argument
- Arithmetic expression as first argument
- Nested calls
- Subquery as second argument
- Column references as second argument
- Boolean and date casts
- Error on invalid cast (`cast_to_type`) vs NULL on invalid cast (`try_cast_to_type`)
- Cross-column type matching

## Are there any user-facing changes?

Two new SQL functions:

- `cast_to_type(expression, reference)`: casts expression to the type of reference
- `try_cast_to_type(expression, reference)`: same, but returns NULL on failure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
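The rewrite performed at plan-simplification time can be illustrated with a toy model. The sketch below is not DataFusion code: `DataType`, `Expr`, and both `simplify_*` functions are hypothetical stand-ins defined here only to show the idea that the reference argument contributes its type and nothing else, and that the call collapses to a plain cast (or to the original expression when the types already match). Whether `try_cast_to_type` also skips the rewrite for matching types is not stated in the description, so this sketch always wraps in `TryCast`.

```rust
/// Hypothetical, minimal set of data types for the illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical expression tree; only the shapes needed for the sketch.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal { value: String, data_type: DataType },
    Column { name: String, data_type: DataType },
    Cast { expr: Box<Expr>, to: DataType },
    TryCast { expr: Box<Expr>, to: DataType },
}

impl Expr {
    /// Static type of the expression.
    fn data_type(&self) -> DataType {
        match self {
            Expr::Literal { data_type, .. } | Expr::Column { data_type, .. } => *data_type,
            Expr::Cast { to, .. } | Expr::TryCast { to, .. } => *to,
        }
    }
}

/// Sketch of `cast_to_type(expr, reference)`: only the reference's type is
/// consulted; its value is never evaluated.
fn simplify_cast_to_type(expr: Expr, reference: &Expr) -> Expr {
    let target = reference.data_type();
    if expr.data_type() == target {
        expr // same type: no-op, the function call disappears entirely
    } else {
        Expr::Cast { expr: Box::new(expr), to: target }
    }
}

/// Sketch of `try_cast_to_type(expr, reference)`: assumed here to always
/// rewrite to a fallible cast.
fn simplify_try_cast_to_type(expr: Expr, reference: &Expr) -> Expr {
    Expr::TryCast { expr: Box::new(expr), to: reference.data_type() }
}
```

In this model, passing a `Utf8` literal with an `Int64` column as the reference produces `Expr::Cast { .. to: Int64 }`, while passing an expression that is already `Int64` returns it unchanged. Because the rewrite happens before execution, no function-call node remains in the plan, which is the sense in which the description says there is zero runtime overhead.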

…er` (#21327)

~(Draft until I am sure I can use this API to make FileStream behave better)~

## Which issue does this PR close?

- part of #20529
- Needed for #21351
- Broken out of #20820
- Closes #21427

## Rationale for this change

I can get 10% faster on many ClickBench queries by reordering files at runtime. You can see it all working together here: #21351

To do so, I need to rework the FileStream so that it can reorder operations at runtime. Eventually that will include both CPU and IO.

This PR is a step in that direction by introducing the main Morsel API and implementing it for Parquet. The next PR (#21342) rewrites FileStream in terms of the Morsel API.

## What changes are included in this PR?

1. Add proposed `Morsel` API
2. Rewrite Parquet opener in terms of that API
3. Add an adapter layer (back to FileOpener, so I don't have to rewrite FileStream in the same PR)

My next PR will rewrite the FileStream to use the Morsel API.

## Are these changes tested?

Yes, by existing CI. I will work on adding additional tests for just the Parquet opener in a follow-on PR.

## Are there any user-facing changes?

No

alamb and others added 4 commits April 9, 2026 13:22

pull bot locked and limited conversation to collaborators Apr 9, 2026

pull bot added the ⤵️ pull label Apr 9, 2026

pull bot merged commit 249c23c into buraksenn:main Apr 9, 2026

pull[bot] merged 4 commits into buraksenn:main from apache:main


3 participants
