[pull] main from apache:main #92

Merged
pull[bot] merged 4 commits into buraksenn:main from apache:main
Apr 9, 2026

@pull pull bot commented Apr 9, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

alamb and others added 4 commits April 9, 2026 13:22
## Which issue does this PR close?

- related to #21379

## Rationale for this change

While reviewing #21379 I
noticed there was minimal Utf8View coverage of the related code.

## What changes are included in this PR?

Update the `regexp_replace` tests to cover Utf8, LargeUtf8, Utf8View, and
Dictionary inputs.

## Are these changes tested?

Yes, this PR only changes tests.

I verified that these tests also pass when run on #21379.

## Are there any user-facing changes?

No
## Which issue does this PR close?

- Closes #21060.

## Rationale for this change

`lpad`, `rpad`, and `translate` use grapheme segmentation. This is
inconsistent with how these functions behave in Postgres and DuckDB, and
with the SQL standard, all of which segment strings by Unicode
codepoints. Grapheme-based segmentation is also significantly more
expensive than codepoint-based segmentation.

In the case of `lpad` and `rpad`, graphemes and codepoints were used
inconsistently: the input string was measured in code points but the
fill string was measured in graphemes.

#3054 switched to using codepoints for most string-related functions in
DataFusion but these three functions still need to be changed.
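
To make the difference concrete, here is a minimal sketch of codepoint-based left padding in plain Rust. It is not the DataFusion implementation: `lpad_codepoints` is a hypothetical helper, and its truncation and empty-fill behavior are assumptions modeled on Postgres's `lpad`. Both the input and the fill string are measured with `chars()`, which iterates Unicode scalar values (codepoints).

```rust
/// Hypothetical helper (not DataFusion's code): left-pad `input` to `target`
/// codepoints, repeating `fill` as needed.
/// Assumption: like Postgres, an over-long input is truncated on the right,
/// and an empty fill string leaves a short input unchanged.
fn lpad_codepoints(input: &str, target: usize, fill: &str) -> String {
    // Measure the input in codepoints, not bytes or graphemes.
    let input_len = input.chars().count();

    // Input already long enough: keep only the first `target` codepoints.
    if target <= input_len {
        return input.chars().take(target).collect();
    }

    // Nothing to pad with.
    if fill.is_empty() {
        return input.to_string();
    }

    // Cycle through the fill string's codepoints until the gap is filled.
    let pad_len = target - input_len;
    let mut out: String = fill.chars().cycle().take(pad_len).collect();
    out.push_str(input);
    out
}
```

With a decomposed `é` (`e` followed by U+0301), `lpad_codepoints("e\u{301}", 3, "*")` adds a single `*`, because the input counts as two codepoints. Grapheme-based counting would treat it as one unit and add two.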

Benchmarks (M4 Max):

lpad size=1024:
  - lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
  - lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
  - lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
  - lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
  - lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
  - lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
  - lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
  - lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1 µs, -0.1%
  - lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs → 10.9 µs, -75.7%
  - lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs → 11.7 µs, -92.3%

  lpad size=4096:
  - lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
  - lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
  - lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
  - lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
  - lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
  - lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
  - lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs, +2.4%
  - lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4 µs, +0.6%
  - lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs → 42.9 µs, -76.0%
  - lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs → 46.2 µs, -92.3%

  rpad size=1024:
  - rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
  - rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
  - rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
  - rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
  - rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
  - rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
  - rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs, -3.9%
  - rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6 µs, +2.8%
  - rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs → 12.4 µs, -73.1%
  - rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs → 11.6 µs, -92.4%

  rpad size=4096:
  - rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
  - rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
  - rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
  - rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
  - rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
  - rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
  - rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs, -4.0%
  - rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5 µs, +3.6%
  - rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs → 48.1 µs, -73.9%
  - rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs → 45.6 µs, -92.5%

  translate size=1024:
  - array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
  - scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
  - array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
  - scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
  - array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
  - scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
  - array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
  - scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%

  translate size=4096:
  - array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
  - scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
  - array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
  - scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
  - array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
  - scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
  - array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
  - scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%

## What changes are included in this PR?

* Switch from grapheme segmentation to codepoint segmentation for
`lpad`, `rpad`, and `translate`
* Add SLT tests
* Refactor a few helper functions
* Remove dependency on `unicode_segmentation` crate as it is no longer
used
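
As an illustration of codepoint segmentation for `translate`, the sketch below maps each codepoint of `from` to the codepoint at the same position in `to`. It is a simplified stand-in, not the code in this PR: `translate_codepoints` is a hypothetical name, and the rules for extra `from` characters (deleted) and duplicate `from` characters (first mapping wins) are assumptions taken from Postgres's documented behavior.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not DataFusion's code): replace each codepoint of
/// `input` found in `from` with the codepoint at the same index in `to`.
/// Assumptions: codepoints of `from` with no counterpart in `to` are deleted,
/// and the first occurrence of a duplicated `from` codepoint wins.
fn translate_codepoints(input: &str, from: &str, to: &str) -> String {
    let to_chars: Vec<char> = to.chars().collect();

    // Some(c) means "replace with c"; None means "delete".
    let mut mapping: HashMap<char, Option<char>> = HashMap::new();
    for (i, f) in from.chars().enumerate() {
        mapping.entry(f).or_insert_with(|| to_chars.get(i).copied());
    }

    input
        .chars()
        .filter_map(|c| match mapping.get(&c) {
            Some(replacement) => *replacement, // mapped: replace or delete
            None => Some(c),                   // unmapped: keep as-is
        })
        .collect()
}
```

Because matching happens per codepoint, `translate_codepoints("e\u{301}", "\u{301}", "")` strips the combining accent and returns `"e"`. Under grapheme segmentation the accented letter is a single cluster, so the lone accent in `from` would not match it.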

## Are these changes tested?

Yes. The new SLT tests were also run against DuckDB and Postgres to
confirm the behavior is consistent.

## Are there any user-facing changes?

Yes. This PR changes the behavior of `lpad`, `rpad`, and `translate`,
although the new behavior is more consistent with the SQL standard and
with other SQL implementations.
## Which issue does this PR close?

N/A — new feature

## Rationale for this change

DuckDB provides a [`cast_to_type(expression,
reference)`](https://duckdb.org/docs/current/sql/expressions/cast#cast_to_type-function)
function that casts the first argument to the data type of the second
argument. This is useful in macros and generic SQL where types need to
be preserved or matched dynamically. This PR adds the equivalent
function to DataFusion, along with a fallible `try_cast_to_type`
variant.

## What changes are included in this PR?

- New `cast_to_type` scalar UDF in
  `datafusion/functions/src/core/cast_to_type.rs`
  - Takes two arguments: the expression to cast, and a reference
    expression whose **type** (not value) determines the target cast type
  - Uses `return_field_from_args` to infer return type from the second
    argument's data type
  - `simplify()` rewrites to `Expr::Cast` (or no-op if types match), so
    there is zero runtime overhead
- New `try_cast_to_type` scalar UDF in
  `datafusion/functions/src/core/try_cast_to_type.rs`
  - Same as `cast_to_type` but returns NULL on cast failure instead of
    erroring
  - `simplify()` rewrites to `Expr::TryCast`
  - Output is always nullable
- Registration of both functions in
  `datafusion/functions/src/core/mod.rs`
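
The rewrite described above can be pictured with a toy model. Everything in this sketch is a hypothetical stand-in: the `DataType` and `Expr` enums, `type_of`, and both `simplify_*` functions are illustrative only and are not DataFusion's actual types or signatures. It shows the idea that only the reference expression's type is consulted, and that the function call is replaced by a plain cast node at planning time.

```rust
// Toy stand-ins for illustration only; NOT DataFusion's real types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, data_type: DataType },
    Cast { expr: Box<Expr>, to: DataType },
    TryCast { expr: Box<Expr>, to: DataType },
}

/// Hypothetical type lookup for the toy expression tree.
fn type_of(expr: &Expr) -> DataType {
    match expr {
        Expr::Column { data_type, .. } => data_type.clone(),
        Expr::Cast { to, .. } | Expr::TryCast { to, .. } => to.clone(),
    }
}

/// Sketch of `cast_to_type(value, reference)`: only the reference's type is
/// used; the result is a Cast node, or `value` unchanged if types match.
fn simplify_cast_to_type(value: Expr, reference: &Expr) -> Expr {
    let target = type_of(reference);
    if type_of(&value) == target {
        value
    } else {
        Expr::Cast { expr: Box::new(value), to: target }
    }
}

/// Sketch of `try_cast_to_type(value, reference)`. Assumption: this toy
/// version always emits a TryCast node, even when the types already match.
fn simplify_try_cast_to_type(value: Expr, reference: &Expr) -> Expr {
    Expr::TryCast { expr: Box::new(value), to: type_of(reference) }
}
```

In this model, casting a `Utf8` column against an `Int64` reference column yields `Cast { .., to: Int64 }`, while casting an `Int64` column against the same reference returns the column untouched.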

## Are these changes tested?

Yes. New sqllogictest file `cast_to_type.slt` covering both functions:
- Basic casts (string→int, string→double, int→string, int→double)
- NULL handling
- Same-type no-op
- CASE expression as first argument
- Arithmetic expression as first argument
- Nested calls
- Subquery as second argument
- Column references as second argument
- Boolean and date casts
- Error on invalid cast (`cast_to_type`) vs NULL on invalid cast
(`try_cast_to_type`)
- Cross-column type matching

## Are there any user-facing changes?

Two new SQL functions:
- `cast_to_type(expression, reference)` — casts expression to the type
of reference
- `try_cast_to_type(expression, reference)` — same, but returns NULL on
failure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
…er` (#21327)

~(Draft until I am sure I can use this API to make FileStream behave
better)~

## Which issue does this PR close?

- part of #20529
- Needed for #21351
- Broken out of #20820
- Closes #21427

## Rationale for this change

I can make many ClickBench queries 10% faster by reordering files at
runtime. You can see it all working together here:
#21351

To do so, I need to rework the FileStream so that it can reorder
operations at runtime. Eventually that will include both CPU and IO.

This PR is a step in that direction: it introduces the main Morsel API
and implements it for Parquet. The next PR
(#21342) rewrites FileStream in
terms of the Morsel API.

## What changes are included in this PR?

1. Add proposed `Morsel` API
2. Rewrite Parquet opener in terms of that API
3. Add an adapter layer (back to FileOpener, so I don't have to rewrite
FileStream in the same PR)

My next PR will rewrite the FileStream to use the Morsel API.

## Are these changes tested?

Yes, by existing CI.

I will work on adding additional tests for just the Parquet opener in a
follow-on PR.

## Are there any user-facing changes?
No
@pull pull bot locked and limited conversation to collaborators Apr 9, 2026
@pull pull bot added the ⤵️ pull label Apr 9, 2026
@pull pull bot merged commit 249c23c into buraksenn:main Apr 9, 2026