## Which issue does this PR close?

- Related to #21379

## Rationale for this change

While reviewing #21379 I noticed there was minimal Utf8View coverage of the related code.

## What changes are included in this PR?

Update the `regexp_replace` tests to cover Utf8, LargeUtf8, Utf8View, and dictionary-encoded arrays.

## Are these changes tested?

Yes, this PR only changes tests. I verified these tests also pass when run on #21379.

## Are there any user-facing changes?

No
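The property these tests check can be sketched outside DataFusion with a toy Python analogue (using `re.sub` as a stand-in for `regexp_replace`; the function and variable names here are illustrative, not DataFusion APIs): a dictionary-encoded string column is just indices into a set of distinct values, so it must produce the same logical result as a plain Utf8/LargeUtf8/Utf8View column holding the same strings.

```python
import re

def regexp_replace(values, pattern, replacement):
    # Apply a regex replacement element-wise; None (NULL) passes through.
    return [None if v is None else re.sub(pattern, replacement, v) for v in values]

# A dictionary-encoded string column stores indices into a small set of
# distinct values, but logically represents the same strings as a plain
# string column.
dictionary = ["foobar", "bazfoo"]
indices = [0, 1, 1, None, 0]
decoded = [None if i is None else dictionary[i] for i in indices]

plain = ["foobar", "bazfoo", "bazfoo", None, "foobar"]

# Whatever the physical encoding, the logical result must match.
assert regexp_replace(plain, "foo", "X") == regexp_replace(decoded, "foo", "X")
```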
## Which issue does this PR close?

- Closes #21060.

## Rationale for this change

`lpad`, `rpad`, and `translate` use grapheme segmentation. This is inconsistent with how these functions behave in Postgres and DuckDB, as well as with the SQL standard -- segmentation based on Unicode codepoints is used instead. It also happens that grapheme-based segmentation is significantly more expensive than codepoint-based segmentation. In the case of `lpad` and `rpad`, graphemes and codepoints were used inconsistently: the input string was measured in codepoints but the fill string was measured in graphemes. #3054 switched to using codepoints for most string-related functions in DataFusion, but these three functions still need to be changed.

Benchmarks (M4 Max):

lpad size=1024:
- lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
- lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
- lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
- lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
- lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
- lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1 µs, -0.1%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs → 10.9 µs, -75.7%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs → 11.7 µs, -92.3%

lpad size=4096:
- lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
- lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
- lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
- lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
- lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
- lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs, +2.4%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4 µs, +0.6%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs → 42.9 µs, -76.0%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs → 46.2 µs, -92.3%

rpad size=1024:
- rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
- rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
- rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
- rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
- rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
- rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs, -3.9%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6 µs, +2.8%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs → 12.4 µs, -73.1%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs → 11.6 µs, -92.4%

rpad size=4096:
- rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
- rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
- rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
- rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
- rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
- rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs, -4.0%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5 µs, +3.6%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs → 48.1 µs, -73.9%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs → 45.6 µs, -92.5%

translate size=1024:
- array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
- scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
- array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
- scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
- array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
- scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
- array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
- scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%

translate size=4096:
- array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
- scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
- array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
- scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
- array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
- scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
- array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
- scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%

## What changes are included in this PR?

* Switch from grapheme segmentation to codepoint segmentation for `lpad`, `rpad`, and `translate`
* Add SLT tests
* Refactor a few helper functions
* Remove the dependency on the `unicode_segmentation` crate as it is no longer used

## Are these changes tested?

Yes. The new SLT tests were also run against DuckDB and Postgres to confirm the behavior is consistent.

## Are there any user-facing changes?

Yes. This PR changes the behavior of `lpad`, `rpad`, and `translate`, although the new behavior is more consistent with the SQL standard and with other SQL implementations.
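The grapheme-vs-codepoint distinction above can be made concrete with a toy Python sketch (Python's `len()` and slicing count codepoints, which matches the new semantics; these functions are illustrations of Postgres-style behavior, not DataFusion's implementation):

```python
def lpad(s: str, target: int, fill: str = " ") -> str:
    # Codepoint-based semantics: len() and slicing measure codepoints.
    if len(s) >= target:
        return s[:target]  # truncate to `target` codepoints
    needed = target - len(s)
    return (fill * needed)[:needed] + s

def translate(s: str, from_chars: str, to_chars: str) -> str:
    # Postgres-style semantics: each codepoint in `from_chars` maps to the
    # codepoint at the same position in `to_chars`; positions in
    # `from_chars` past the end of `to_chars` delete the codepoint.
    table = {}
    for i, c in enumerate(from_chars):
        if c not in table:  # first occurrence wins
            table[c] = to_chars[i] if i < len(to_chars) else None
    out = []
    for c in s:
        if c not in table:
            out.append(c)
        elif table[c] is not None:
            out.append(table[c])
    return "".join(out)

# "e" + combining acute accent is one grapheme but two codepoints, so a
# codepoint-based lpad pads 3 (not 4) fill characters to reach target 5.
assert lpad("e\u0301", 5, "x") == "xxxe\u0301"
# '1'->'a', '4'->'x', '3' deleted (no counterpart in "ax"):
assert translate("12345", "143", "ax") == "a2x5"
```

Grapheme-based segmentation would treat `"e\u0301"` as length 1 and pad one extra character, which is where the old and new behaviors diverge.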
## Which issue does this PR close?

N/A -- new feature.

## Rationale for this change

DuckDB provides a [`cast_to_type(expression, reference)`](https://duckdb.org/docs/current/sql/expressions/cast#cast_to_type-function) function that casts the first argument to the data type of the second argument. This is useful in macros and generic SQL where types need to be preserved or matched dynamically. This PR adds the equivalent function to DataFusion, along with a fallible `try_cast_to_type` variant.

## What changes are included in this PR?

- New `cast_to_type` scalar UDF in `datafusion/functions/src/core/cast_to_type.rs`
  - Takes two arguments: the expression to cast, and a reference expression whose **type** (not value) determines the target cast type
  - Uses `return_field_from_args` to infer the return type from the second argument's data type
  - `simplify()` rewrites to `Expr::Cast` (or a no-op if the types match), so there is zero runtime overhead
- New `try_cast_to_type` scalar UDF in `datafusion/functions/src/core/try_cast_to_type.rs`
  - Same as `cast_to_type` but returns NULL on cast failure instead of erroring
  - `simplify()` rewrites to `Expr::TryCast`
  - Output is always nullable
- Registration of both functions in `datafusion/functions/src/core/mod.rs`

## Are these changes tested?

Yes. New sqllogictest file `cast_to_type.slt` covering both functions:

- Basic casts (string→int, string→double, int→string, int→double)
- NULL handling
- Same-type no-op
- CASE expression as first argument
- Arithmetic expression as first argument
- Nested calls
- Subquery as second argument
- Column references as second argument
- Boolean and date casts
- Error on invalid cast (`cast_to_type`) vs NULL on invalid cast (`try_cast_to_type`)
- Cross-column type matching

## Are there any user-facing changes?

Two new SQL functions:

- `cast_to_type(expression, reference)` — casts expression to the type of reference
- `try_cast_to_type(expression, reference)` — same, but returns NULL on failure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
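The contract of the two functions can be illustrated with a toy Python analogue (hypothetical stand-ins, not the actual UDFs, which operate on Arrow types and are rewritten to `Expr::Cast`/`Expr::TryCast` at plan time):

```python
def cast_to_type(value, reference):
    # Cast `value` to the *type* of `reference`; its value is ignored.
    # Raises on an invalid cast, like the SQL `cast_to_type`.
    return type(reference)(value)

def try_cast_to_type(value, reference):
    # Fallible variant: NULL (None) on failure, like `try_cast_to_type`.
    try:
        return cast_to_type(value, reference)
    except (ValueError, TypeError):
        return None

assert cast_to_type("42", 0) == 42           # string -> int
assert cast_to_type(7, "") == "7"            # int -> string
assert try_cast_to_type("abc", 0) is None    # invalid cast -> NULL
```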
…er` (#21327)

~(Draft until I am sure I can use this API to make FileStream behave better)~

## Which issue does this PR close?

- Part of #20529
- Needed for #21351
- Broken out of #20820
- Closes #21427

## Rationale for this change

I can get 10% faster on many ClickBench queries by reordering files at runtime. You can see it all working together here: #21351. To do so, I need to rework the FileStream so that it can reorder operations at runtime. Eventually that will include both CPU and IO. This PR is a step in that direction: it introduces the main Morsel API and implements it for Parquet. The next PR (#21342) rewrites FileStream in terms of the Morsel API.

## What changes are included in this PR?

1. Add the proposed `Morsel` API
2. Rewrite the Parquet opener in terms of that API
3. Add an adapter layer (back to `FileOpener`, so I don't have to rewrite FileStream in the same PR)

My next PR will rewrite the FileStream to use the Morsel API.

## Are these changes tested?

Yes, by existing CI. I will work on adding additional tests for just the Parquet opener in a follow-on PR.

## Are there any user-facing changes?

No
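The PR does not show the Morsel API itself, but the adapter-layer idea (a new morsel-producing interface wrapped back into the old open-and-stream interface so existing consumers keep working) can be sketched roughly in Python. All names below are invented for illustration and are not the actual DataFusion types:

```python
from typing import Iterator, List

class MorselSource:
    """Hypothetical sketch: a source that yields small units of work
    ("morsels") that a scheduler could reorder or interleave at runtime."""

    def __init__(self, rows: List[int], morsel_size: int):
        self.rows = rows
        self.morsel_size = morsel_size

    def morsels(self) -> Iterator[List[int]]:
        # Yield fixed-size chunks; a real design could expose separate IO
        # and CPU steps so they can be scheduled independently.
        for i in range(0, len(self.rows), self.morsel_size):
            yield self.rows[i:i + self.morsel_size]

class FileOpenerAdapter:
    """Adapter back to a plain "open once, stream everything" interface,
    so existing consumers need not change in the same PR."""

    def __init__(self, source: MorselSource):
        self.source = source

    def open(self) -> Iterator[int]:
        for morsel in self.source.morsels():
            yield from morsel

# The adapter preserves the original output order end to end.
assert list(FileOpenerAdapter(MorselSource(list(range(10)), 4)).open()) == list(range(10))
```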