21322: feat: add cast_to_type UDF for type-based casting#300
21322: feat: add cast_to_type UDF for type-based casting#300martin-augment wants to merge 4 commits intomainfrom
Conversation
Add a `cast_to_type(expression, reference)` function that casts the first argument to the data type of the second argument, similar to DuckDB's cast_to_type. The second argument's type (not value) determines the target cast type, which is useful in macros and generic SQL where types need to be preserved dynamically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `try_cast_to_type(expression, reference)` which works like `cast_to_type` but returns NULL on cast failure instead of erroring, similar to the relationship between arrow_cast and arrow_try_cast. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The second argument (reference) is used solely for its data type, so its nullability should not propagate to the result. Previously `cast_to_type(42, NULL::INTEGER)` was incorrectly marked nullable in the schema even though the input literal is non-null. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WalkthroughTwo new scalar user-defined functions are introduced: ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review:
|
| Category | Finding |
|---|---|
| Bug | try_cast_to_type same-type path returns non-nullable expr despite always-nullable declared return type |
| Code quality | ~85% code duplication between the two implementation files |
| Robustness | invoke_with_args error path lacks explanatory comment |
| Tests | Error test does not assert message; one duplicate test case |
The nullability bug is the main blocker. Everything else is secondary.
There was a problem hiding this comment.
Code Review
This pull request introduces two new scalar functions, cast_to_type and try_cast_to_type, which allow casting an expression to the data type of a reference expression. The implementation includes the UDF definitions, registration in the core module, comprehensive SQL logic tests, and updated user documentation. A review comment identifies a logic inconsistency in the simplify method of cast_to_type, where nullability is not correctly inherited from the source argument, potentially leading to incorrect schema inference.
| // Use an actual cast to get the correct type | ||
| Expr::Cast(datafusion_expr::Cast { | ||
| expr: Box::new(arg), | ||
| field: target_type.into_nullable_field_ref(), |
There was a problem hiding this comment.
The simplify implementation uses into_nullable_field_ref(), which always creates a nullable field for the Cast expression. This is inconsistent with the logic in return_field_from_args (lines 111-116), where nullability is inherited from the source argument (unless the target type is Null). This discrepancy can lead to incorrect schema inference or suboptimal query plans after simplification. The simplify method should use the same nullability logic as return_field_from_args by leveraging info.nullable(&arg).
| field: target_type.into_nullable_field_ref(), | |
| field: Field::new("value", target_type.clone(), info.nullable(&arg)? || target_type == DataType::Null).into(), |
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There was an issue in the return_field_from_args() method that the PR author improved in the last commit in the PR. But it seems the author didn't improve the same in the simplify() function. Prevents a wrong behavior and wrong result in the information schema for a view that uses this user defined function
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@datafusion/functions/src/core/cast_to_type.rs`:
- Around line 140-145: simplify() currently builds the simplified Expr::Cast
using into_nullable_field_ref(), forcing the cast to always be nullable and
causing a mismatch with return_field_from_args which computes nullability from
source_field.is_nullable() and whether target_type == DataType::Null; update
simplify() to compute nullable the same way (nullable =
source_field.is_nullable() || target_type == DataType::Null) and construct the
target field via Field::new(..., target_type, nullable).into() so the simplified
Expr::Cast and return_field_from_args (and CastExpr::nullable()) produce
consistent nullability.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fe31b126-c6b6-4d36-84dc-c813925e11d7
📒 Files selected for processing (5)
datafusion/functions/src/core/cast_to_type.rsdatafusion/functions/src/core/mod.rsdatafusion/functions/src/core/try_cast_to_type.rsdatafusion/sqllogictest/test_files/cast_to_type.sltdocs/source/user-guide/sql/scalar_functions.md
| // Use an actual cast to get the correct type | ||
| Expr::Cast(datafusion_expr::Cast { | ||
| expr: Box::new(arg), | ||
| field: target_type.into_nullable_field_ref(), | ||
| }) | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how DataFusion handles Cast expression nullability and whether
# into_nullable_field_ref vs computed nullability causes issues
# Search for how Cast's field nullability affects schema derivation
rg -n -A5 'impl.*Cast' --type rust
echo "---"
# Check if there are any tests that verify runtime nullability matches declared schema
rg -n 'into_nullable_field_ref' --type rust -C3Repository: martin-augment/datafusion
Length of output: 36845
🏁 Script executed:
cat -n datafusion/functions/src/core/cast_to_type.rsRepository: martin-augment/datafusion
Length of output: 6289
🏁 Script executed:
rg -n "impl.*Expr.*Cast" datafusion/expr/src/expr.rs -A 20 | head -100Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "pub struct Cast" datafusion/expr/src/expr.rs -A 10Repository: martin-augment/datafusion
Length of output: 393
🏁 Script executed:
rg -n "fn.*data_type.*Cast\|Cast.*data_type" datafusion/expr/src/expr.rs -B 2 -A 8 | head -80Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "Cast =>" datafusion/expr/src/expr.rs -A 5 | head -60Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "expr.data_type\|field.data_type\|field.is_nullable" datafusion/expr/src/expr.rs -A 2 | head -100Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
ast-grep --pattern 'impl Expr {
$$$
pub fn data_type($_) -> $_ {
$$$
}
$$$
}'Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "fn data_type" datafusion/expr/src/expr.rs | head -20Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "pub fn data_type\|pub fn field" datafusion/expr/src/ -A 3 | grep -A 3 "data_type\|field" | head -80Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "Expr::Cast" datafusion/expr/src/expr.rs -B 2 -A 5 | head -100Repository: martin-augment/datafusion
Length of output: 3983
🏁 Script executed:
rg -n "get_data_type\|get_field\|output_ordering" datafusion/expr/src/ -A 3 | grep -B 2 -A 5 "Cast" | head -80Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "DynScalarUDF\|get_data_type" datafusion/expr/src/expr.rs -B 2 -A 8 | head -150Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "Cast.*field\|field.*Cast" datafusion/core/src/ --type rust -A 3 | head -100Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
rg -n "fn data_type\|fn field" datafusion/physical-expr/src/expressions/cast.rs -B 2 -A 10 | head -150Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
cat -n datafusion/physical-expr/src/expressions/cast.rs | head -250Repository: martin-augment/datafusion
Length of output: 10920
🏁 Script executed:
cat -n datafusion/physical-expr/src/expressions/cast.rs | sed -n '241,260p'Repository: martin-augment/datafusion
Length of output: 1133
Fix nullability inconsistency between type inference and simplification.
return_field_from_args (lines 107-117) correctly computes nullability based on the source field (source_field.is_nullable()) and whether the target type is Null. However, simplify() always uses into_nullable_field_ref() which unconditionally creates a nullable field. This causes the simplified Expr::Cast to always be nullable, even when a non-nullable input and non-null target type should produce non-nullable output.
The CastExpr::nullable() method (in physical-expr) computes nullability as child_nullable || target_nullable, so the target field's nullability directly affects the result. To maintain consistency, simplify() should compute the field's nullability the same way as return_field_from_args:
let nullable = source_field.is_nullable() || target_type == DataType::Null;
let field = Field::new("", target_type, nullable).into();This ensures the declared schema and the simplified Cast expression produce the same nullability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/functions/src/core/cast_to_type.rs` around lines 140 - 145,
simplify() currently builds the simplified Expr::Cast using
into_nullable_field_ref(), forcing the cast to always be nullable and causing a
mismatch with return_field_from_args which computes nullability from
source_field.is_nullable() and whether target_type == DataType::Null; update
simplify() to compute nullable the same way (nullable =
source_field.is_nullable() || target_type == DataType::Null) and construct the
target field via Field::new(..., target_type, nullable).into() so the simplified
Expr::Cast and return_field_from_args (and CastExpr::nullable()) produce
consistent nullability.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! There was an issue in the return_field_from_args() method that the PR author improved in the last commit in the PR. But it seems the author didn't improve the same in the simplify() function. Prevents a wrong behavior and wrong result in the information schema for a view that uses this user defined function
|
augment review |
value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is not correct! The |
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There is a lot of duplicated code that could be extracted to a helper struct or a trait to prevent regressions by fixing an issue only at one place and missing it at another. |
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The |
🤖 Augment PR SummarySummary: This PR adds two new scalar UDFs, Changes:
Technical Notes: The “reference” argument’s value is ignored; only its inferred type is used to determine the cast target type. The functions are marked immutable and rely on planner/optimizer simplification rather than runtime evaluation. 🤖 Was this summary useful? React with 👍 or 👎 |
| arg | ||
| } else { | ||
| // Use an actual cast to get the correct type | ||
| Expr::Cast(datafusion_expr::Cast { |
There was a problem hiding this comment.
datafusion/functions/src/core/cast_to_type.rs:141: return_field_from_args special-cases DataType::Null to force nullable=true, but simplify rewrites to Expr::Cast, whose field nullability is derived from the input expression. If the reference arg is a bare NULL (type Null) and the source is non-nullable, this rewrite can yield a non-nullable Null output schema (and diverge from the UDF’s declared schema).
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| let arg = args.pop().unwrap(); | ||
|
|
||
| let source_type = info.get_data_type(&arg)?; | ||
| let new_expr = if source_type == target_type { |
There was a problem hiding this comment.
datafusion/functions/src/core/try_cast_to_type.rs:121: When source_type == target_type, simplify returns arg directly, but return_field_from_args always marks the result as nullable. This can make the simplified expression’s nullability differ from the original UDF’s declared nullability, which DataFusion’s ScalarUDFImpl::simplify docs warn can cause schema verification issues.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| let arg = args.pop().unwrap(); | ||
|
|
||
| let source_type = info.get_data_type(&arg)?; | ||
| let new_expr = if source_type == target_type { |
There was a problem hiding this comment.
In datafusion/functions/src/core/try_cast_to_type.rs:121, simplify returns the original arg when source_type == target_type, which means the optimized expression is no longer an Expr::TryCast and can lose the “always nullable / can return NULL on failure” behavior implied by return_field_from_args (and Expr::TryCast nullability).
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
21322: To review by AI