152 changes: 152 additions & 0 deletions datafusion/functions/src/core/cast_to_type.rs
@@ -0,0 +1,152 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! [`CastToTypeFunc`]: Implementation of the `cast_to_type` function

use arrow::datatypes::{DataType, Field, FieldRef};
use datafusion_common::{
    Result, datatype::DataTypeExt, internal_err, utils::take_function_args,
};
use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyContext};
use datafusion_expr::{
    Coercion, ColumnarValue, Documentation, Expr, ReturnFieldArgs, ScalarFunctionArgs,
    ScalarUDFImpl, Signature, TypeSignatureClass, Volatility,
};
use datafusion_macros::user_doc;

/// Casts the first argument to the data type of the second argument.
///
/// Only the type of the second argument is used; its value is ignored.
/// This is useful in macros or generic SQL where you need to preserve
/// or match types dynamically.
///
/// For example:
/// ```sql
/// select cast_to_type('42', NULL::INTEGER);
/// ```
#[user_doc(
    doc_section(label = "Other Functions"),
    description = "Casts the first argument to the data type of the second argument. Only the type of the second argument is used; its value is ignored.",
    syntax_example = "cast_to_type(expression, reference)",
    sql_example = r#"```sql
> select cast_to_type('42', NULL::INTEGER) as a;
+----+
| a  |
+----+
| 42 |
+----+

> select cast_to_type(1 + 2, NULL::DOUBLE) as b;
+-----+
| b   |
+-----+
| 3.0 |
+-----+
```"#,
    argument(
        name = "expression",
        description = "Expression to cast. The expression can be a constant, column, or function, and any combination of operators."
    ),
    argument(
        name = "reference",
        description = "Reference expression whose data type determines the target cast type. The value is ignored."
    )
)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct CastToTypeFunc {
    signature: Signature,
}

impl Default for CastToTypeFunc {
    fn default() -> Self {
        Self::new()
    }
}

impl CastToTypeFunc {
    pub fn new() -> Self {
        Self {
            signature: Signature::coercible(
                vec![
                    Coercion::new_exact(TypeSignatureClass::Any),
                    Coercion::new_exact(TypeSignatureClass::Any),
                ],
                Volatility::Immutable,
            ),
        }
    }
}

impl ScalarUDFImpl for CastToTypeFunc {
    fn name(&self) -> &str {
        "cast_to_type"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        internal_err!("return_field_from_args should be called instead")
    }

    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
        let [source_field, reference_field] =
            take_function_args(self.name(), args.arg_fields)?;
        let target_type = reference_field.data_type().clone();
        // Nullability is inherited only from the first argument (the value
        // being cast). The second argument is used solely for its type, so
        // its own nullability is irrelevant. The one exception is when the
        // target type is Null – that type is inherently nullable.
        let nullable = source_field.is_nullable() || target_type == DataType::Null;
        Ok(Field::new(self.name(), target_type, nullable).into())
    }

    fn invoke_with_args(&self, _args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        internal_err!("cast_to_type should have been simplified to cast")
    }

    fn simplify(
        &self,
        mut args: Vec<Expr>,
        info: &SimplifyContext,
    ) -> Result<ExprSimplifyResult> {
        let [_, type_arg] = take_function_args(self.name(), &args)?;
        let target_type = info.get_data_type(type_arg)?;

        // remove second (reference) argument
        args.pop().unwrap();
        let arg = args.pop().unwrap();

        let source_type = info.get_data_type(&arg)?;
        let new_expr = if source_type == target_type {
            // the argument's data type is already the correct type
            arg
        } else {
            // Use an actual cast to get the correct type
            Expr::Cast(datafusion_expr::Cast {

datafusion/functions/src/core/cast_to_type.rs:141: return_field_from_args special-cases DataType::Null to force nullable=true, but simplify rewrites to Expr::Cast, whose field nullability is derived from the input expression. If the reference arg is a bare NULL (type Null) and the source is non-nullable, this rewrite can yield a non-nullable Null output schema (and diverge from the UDF’s declared schema).

Severity: medium


                expr: Box::new(arg),
                field: target_type.into_nullable_field_ref(),

medium

The simplify implementation uses into_nullable_field_ref(), which always creates a nullable field for the Cast expression. This is inconsistent with the logic in return_field_from_args (lines 111-116), where nullability is inherited from the source argument (unless the target type is Null). This discrepancy can lead to incorrect schema inference or suboptimal query plans after simplification. The simplify method should use the same nullability logic as return_field_from_args by leveraging info.nullable(&arg).

Suggested change
- field: target_type.into_nullable_field_ref(),
+ field: Field::new("value", target_type.clone(), info.nullable(&arg)? || target_type == DataType::Null).into(),

Owner Author


value:useful; category:bug; feedback: The Gemini AI reviewer is correct. The PR author fixed an issue in return_field_from_args() in the last commit of the PR, but did not apply the same fix to simplify(). Addressing this prevents incorrect behavior and a wrong result in the information schema for a view that uses this user-defined function.

            })
        };
Comment on lines +140 to +145

⚠️ Potential issue | 🟠 Major


Fix nullability inconsistency between type inference and simplification.

return_field_from_args (lines 107-117) correctly computes nullability based on the source field (source_field.is_nullable()) and whether the target type is Null. However, simplify() always uses into_nullable_field_ref() which unconditionally creates a nullable field. This causes the simplified Expr::Cast to always be nullable, even when a non-nullable input and non-null target type should produce non-nullable output.

The CastExpr::nullable() method (in physical-expr) computes nullability as child_nullable || target_nullable, so the target field's nullability directly affects the result. To maintain consistency, simplify() should compute the field's nullability the same way as return_field_from_args:

let nullable = source_field.is_nullable() || target_type == DataType::Null;
let field = Field::new("", target_type, nullable).into();

This ensures the declared schema and the simplified Cast expression produce the same nullability.
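
To make the mismatch concrete, here is a minimal sketch in plain Rust that models the two nullability rules described above. `ToyType`, `declared_nullable`, `cast_nullable`, and `rewrite_is_consistent` are hypothetical stand-ins for illustration, not DataFusion or Arrow APIs; the `child || target` rule is taken from the description of `CastExpr::nullable()` in this comment.

```rust
/// Hypothetical stand-in for Arrow's `DataType`; only the variants needed here.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ToyType {
    Null,
    Int32,
}

/// Nullability the UDF declares in `return_field_from_args`:
/// inherited from the source, except that a `Null` target is always nullable.
fn declared_nullable(source_nullable: bool, target: &ToyType) -> bool {
    source_nullable || *target == ToyType::Null
}

/// Nullability of the rewritten cast, modelled as `child || target field`,
/// as this review describes for `CastExpr::nullable()`.
fn cast_nullable(child_nullable: bool, target_field_nullable: bool) -> bool {
    child_nullable || target_field_nullable
}

/// Returns true when the rewrite keeps the nullability the UDF declared.
fn rewrite_is_consistent(
    source_nullable: bool,
    target: &ToyType,
    target_field_nullable: bool,
) -> bool {
    declared_nullable(source_nullable, target)
        == cast_nullable(source_nullable, target_field_nullable)
}
```

Under this model, a non-nullable source cast to `Int32` is declared non-nullable, while a cast built with an always-nullable target field reports nullable. Passing the result of `declared_nullable` as the target field's nullability removes the difference.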

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/core/cast_to_type.rs` around lines 140 - 145,
simplify() currently builds the simplified Expr::Cast using
into_nullable_field_ref(), forcing the cast to always be nullable and causing a
mismatch with return_field_from_args which computes nullability from
source_field.is_nullable() and whether target_type == DataType::Null; update
simplify() to compute nullable the same way (nullable =
source_field.is_nullable() || target_type == DataType::Null) and construct the
target field via Field::new(..., target_type, nullable).into() so the simplified
Expr::Cast and return_field_from_args (and CastExpr::nullable()) produce
consistent nullability.

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct. The PR author fixed an issue in return_field_from_args() in the last commit of the PR, but did not apply the same fix to simplify(). Addressing this prevents incorrect behavior and a wrong result in the information schema for a view that uses this user-defined function.

        Ok(ExprSimplifyResult::Simplified(new_expr))
    }

    fn documentation(&self) -> Option<&Documentation> {
        self.doc()
    }
}
14 changes: 14 additions & 0 deletions datafusion/functions/src/core/mod.rs
@@ -24,6 +24,7 @@ pub mod arrow_cast;
pub mod arrow_metadata;
pub mod arrow_try_cast;
pub mod arrowtypeof;
pub mod cast_to_type;
pub mod coalesce;
pub mod expr_ext;
pub mod getfield;
@@ -37,13 +38,16 @@ pub mod nvl2;
pub mod overlay;
pub mod planner;
pub mod r#struct;
pub mod try_cast_to_type;
pub mod union_extract;
pub mod union_tag;
pub mod version;

// create UDFs
make_udf_function!(arrow_cast::ArrowCastFunc, arrow_cast);
make_udf_function!(arrow_try_cast::ArrowTryCastFunc, arrow_try_cast);
make_udf_function!(cast_to_type::CastToTypeFunc, cast_to_type);
make_udf_function!(try_cast_to_type::TryCastToTypeFunc, try_cast_to_type);
make_udf_function!(nullif::NullIfFunc, nullif);
make_udf_function!(nvl::NVLFunc, nvl);
make_udf_function!(nvl2::NVL2Func, nvl2);
@@ -75,6 +79,14 @@ pub mod expr_fn {
        arrow_try_cast,
        "Casts a value to a specific Arrow data type, returning NULL if the cast fails",
        arg1 arg2
    ),(
        cast_to_type,
        "Casts the first argument to the data type of the second argument",
        arg1 arg2
    ),(
        try_cast_to_type,
        "Casts the first argument to the data type of the second argument, returning NULL on failure",
        arg1 arg2
    ),(
        nvl,
        "Returns value2 if value1 is NULL; otherwise it returns value1",
@@ -147,6 +159,8 @@ pub fn functions() -> Vec<Arc<ScalarUDF>> {
        nullif(),
        arrow_cast(),
        arrow_try_cast(),
        cast_to_type(),
        try_cast_to_type(),
        arrow_metadata(),
        nvl(),
        nvl2(),
135 changes: 135 additions & 0 deletions datafusion/functions/src/core/try_cast_to_type.rs
@@ -0,0 +1,135 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! [`TryCastToTypeFunc`]: Implementation of the `try_cast_to_type` function

use arrow::datatypes::{DataType, Field, FieldRef};
use datafusion_common::{
    Result, datatype::DataTypeExt, internal_err, utils::take_function_args,
};
use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyContext};
use datafusion_expr::{
    Coercion, ColumnarValue, Documentation, Expr, ReturnFieldArgs, ScalarFunctionArgs,
    ScalarUDFImpl, Signature, TypeSignatureClass, Volatility,
};
use datafusion_macros::user_doc;

/// Like [`cast_to_type`](super::cast_to_type::CastToTypeFunc) but returns NULL
/// on cast failure instead of erroring.
///
/// This is implemented by simplifying `try_cast_to_type(expr, ref)` into
/// `Expr::TryCast` during optimization.
#[user_doc(
    doc_section(label = "Other Functions"),
    description = "Casts the first argument to the data type of the second argument, returning NULL if the cast fails. Only the type of the second argument is used; its value is ignored.",
    syntax_example = "try_cast_to_type(expression, reference)",
    sql_example = r#"```sql
> select try_cast_to_type('123', NULL::INTEGER) as a,
         try_cast_to_type('not_a_number', NULL::INTEGER) as b;

+-----+------+
| a   | b    |
+-----+------+
| 123 | NULL |
+-----+------+
```"#,
    argument(
        name = "expression",
        description = "Expression to cast. The expression can be a constant, column, or function, and any combination of operators."
    ),
    argument(
        name = "reference",
        description = "Reference expression whose data type determines the target cast type. The value is ignored."
    )
)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct TryCastToTypeFunc {
    signature: Signature,
}

impl Default for TryCastToTypeFunc {
    fn default() -> Self {
        Self::new()
    }
}

impl TryCastToTypeFunc {
    pub fn new() -> Self {
        Self {
            signature: Signature::coercible(
                vec![
                    Coercion::new_exact(TypeSignatureClass::Any),
                    Coercion::new_exact(TypeSignatureClass::Any),
                ],
                Volatility::Immutable,
            ),
        }
    }
}

impl ScalarUDFImpl for TryCastToTypeFunc {
    fn name(&self) -> &str {
        "try_cast_to_type"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        internal_err!("return_field_from_args should be called instead")
    }

    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
        // TryCast can always return NULL (on cast failure), so always nullable
        let [_, reference_field] = take_function_args(self.name(), args.arg_fields)?;
        let target_type = reference_field.data_type().clone();
        Ok(Field::new(self.name(), target_type, true).into())
    }

    fn invoke_with_args(&self, _args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        internal_err!("try_cast_to_type should have been simplified to try_cast")
    }

    fn simplify(
        &self,
        mut args: Vec<Expr>,
        info: &SimplifyContext,
    ) -> Result<ExprSimplifyResult> {
        let [_, type_arg] = take_function_args(self.name(), &args)?;
        let target_type = info.get_data_type(type_arg)?;

        // remove second (reference) argument
        args.pop().unwrap();
        let arg = args.pop().unwrap();

        let source_type = info.get_data_type(&arg)?;
        let new_expr = if source_type == target_type {

datafusion/functions/src/core/try_cast_to_type.rs:121: When source_type == target_type, simplify returns arg directly, but return_field_from_args always marks the result as nullable. This can make the simplified expression’s nullability differ from the original UDF’s declared nullability, which DataFusion’s ScalarUDFImpl::simplify docs warn can cause schema verification issues.

Severity: medium



In datafusion/functions/src/core/try_cast_to_type.rs:121, simplify returns the original arg when source_type == target_type, which means the optimized expression is no longer an Expr::TryCast and can lose the “always nullable / can return NULL on failure” behavior implied by return_field_from_args (and Expr::TryCast nullability).

Severity: medium
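
A small model of the divergence described in this comment, written in plain Rust. `Simplified`, `simplify`, `simplified_nullable`, and `DECLARED_NULLABLE` are hypothetical names for illustration, not DataFusion types. The sketch assumes, as the comment states, that `return_field_from_args` always declares the result nullable and that a `TryCast` is always nullable.

```rust
/// Hypothetical model of what `simplify` produces.
#[derive(Debug, PartialEq, Eq)]
enum Simplified {
    /// The argument returned unchanged (source and target types match).
    Passthrough { arg_nullable: bool },
    /// A `TryCast` wrapper, modelled here as always nullable.
    TryCast,
}

/// Mirrors the branch in `simplify`: skip the cast when the types already match.
/// Types are plain strings in this toy model.
fn simplify(source_type: &str, target_type: &str, arg_nullable: bool) -> Simplified {
    if source_type == target_type {
        Simplified::Passthrough { arg_nullable }
    } else {
        Simplified::TryCast
    }
}

/// Nullability of the simplified expression under this model.
fn simplified_nullable(expr: &Simplified) -> bool {
    match expr {
        Simplified::Passthrough { arg_nullable } => *arg_nullable,
        Simplified::TryCast => true,
    }
}

/// `return_field_from_args` declares the UDF result as always nullable.
const DECLARED_NULLABLE: bool = true;
```

In this model the two branches only disagree with the declared schema when the types already match and the argument is non-nullable, which is the case the comment flags.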


            arg
        } else {
            Expr::TryCast(datafusion_expr::TryCast {
                expr: Box::new(arg),
                field: target_type.into_nullable_field_ref(),
            })
        };
        Ok(ExprSimplifyResult::Simplified(new_expr))
    }

    fn documentation(&self) -> Option<&Documentation> {
        self.doc()
    }
}