
3559: feat: Cast numeric (non int) to timestamp #44

Open
martin-augment wants to merge 10 commits into main from pr-3559-2026-03-06-12-45-07

Conversation

@martin-augment
Owner

3559: To review by AI

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Comet's data type casting capabilities by introducing native support for converting floating-point numbers (Float32, Float64), decimal values, and boolean types into timestamps. This feature ensures closer alignment with Spark's casting semantics, particularly for non-integer numeric types, and improves the overall compatibility and robustness of Comet's expression evaluation engine.

Highlights

  • Expanded Timestamp Casting: Added support for casting Float32, Float64, Decimal128, and Boolean types to Timestamp in the native Rust expression engine.
  • New Rust Conversion Functions: Implemented new Rust functions: cast_float_to_timestamp, cast_decimal_to_timestamp, and cast_boolean_to_timestamp to handle these conversions.
  • Scala Compatibility Updates: Updated the Scala CometCast expression to recognize and support these new casting rules, ensuring compatibility with Spark's casting behavior.
  • New Benchmark Added: Introduced a new benchmark to measure the performance of casting non-integer numeric types to timestamps.
  • Enhanced Test Coverage: Enabled and enhanced Scala test cases in CometCastSuite to validate the correctness of these new timestamp casts across various numeric and boolean inputs, including a new DataFrame comparison utility.
Changelog
  • native/spark-expr/Cargo.toml
    • Added a new benchmark target for casting non-integer numeric types to timestamp.
  • native/spark-expr/benches/cast_non_int_numeric_timestamp.rs
    • Introduced a new benchmark file to evaluate the performance of casting Float32, Float64, Boolean, and Decimal128 to Timestamp.
  • native/spark-expr/src/conversion_funcs/boolean.rs
    • Implemented the cast_boolean_to_timestamp function.
    • Added unit tests for boolean to timestamp casting.
  • native/spark-expr/src/conversion_funcs/cast.rs
    • Integrated the new cast_float_to_timestamp, cast_boolean_to_timestamp, and cast_decimal_to_timestamp functions into the main cast_array dispatch logic.
  • native/spark-expr/src/conversion_funcs/numeric.rs
    • Added cast_decimal_to_timestamp and cast_float_to_timestamp functions.
    • Included a macro for float-to-timestamp conversion logic with overflow handling.
    • Added comprehensive unit tests for float and decimal to timestamp casting.
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
    • Updated the canCastFromBoolean, canCastFromFloat, canCastFromDouble, and canCastFromDecimal methods to correctly identify TimestampType as a compatible target for casting.
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala
    • Un-ignored and updated test cases for casting Boolean, Float, Double, and Decimal types to Timestamp.
    • Introduced useDataFrameDiff for more robust comparison in some test scenarios.
  • spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala
    • Added a new helper function assertDataFrameEqualsWithExceptions to compare Spark and Comet DataFrames, especially useful for cases where collect() might fail due to extreme values.
Activity
  • The pull request was created by martin-augment.
  • The author has requested AI review.
  • No human review comments or activity have been recorded yet.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


        builder.append_null();
    } else {
        let micros = if bool_array.value(i) { 1 } else { 0 };
        builder.append_value(micros);

Boolean-to-timestamp maps true to 1µs instead of 1s

High Severity

cast_boolean_to_timestamp maps true to 1 (1 microsecond) instead of MICROS_PER_SECOND (1,000,000 = 1 second). In Spark, boolean-to-timestamp works by converting true → 1 (second) → 1_000_000 microseconds. The existing cast_int_to_timestamp_impl macro confirms integers are treated as seconds and multiplied by MICROS_PER_SECOND. The result is that true produces 1970-01-01 00:00:00.000001 instead of the correct 1970-01-01 00:00:01. The unit test also encodes this wrong expectation.

Additional Locations (1)
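A minimal model of the suggested fix, with plain `Option<bool>` slices standing in for arrow arrays (the function name is illustrative, not from the PR):

```rust
// true should map to one second (1_000_000 microseconds), not 1
// microsecond; MICROS_PER_SECOND mirrors the constant the integer
// cast path already multiplies by.
const MICROS_PER_SECOND: i64 = 1_000_000;

// Nulls are modeled as None; arrow's builder loop is elided.
fn bool_to_timestamp_micros(values: &[Option<bool>]) -> Vec<Option<i64>> {
    values
        .iter()
        .map(|v| v.map(|b| if b { MICROS_PER_SECOND } else { 0 }))
        .collect()
}
```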


val df2 = spark.sql(s"select a, try_cast(a as ${toType.sql}) from t order by a")
if (useDataFrameDiff) {
  assertDataFrameEqualsWithExceptions(df2, assertCometNative = !hasIncompatibleType)
} else {

ANSI try_cast test runs outside required config scope

Medium Severity

The if (testTry) block for try_cast in the ANSI section was moved outside the withSQLConf block that sets ANSI_ENABLED -> "true" and getExprAllowIncompatConfigKey -> "true". Without the incompat config key, the Comet native path may not be exercised, reducing test coverage for the ANSI try_cast path.


@claude

claude bot commented Mar 6, 2026

PR Review: Cast numeric (non-int) to Timestamp

Good work implementing these casting conversions! The feature is well-structured and the benchmarks are a nice addition. Here are my findings:

BUGS / CORRECTNESS

  1. Test name vs data mismatch in Decimal test (CometCastSuite.scala)
    The new test is named 'cast DecimalType(38,10) to TimestampType' but calls generateDecimalsPrecision38Scale18(), which produces Decimal(38,18) data. Either the test name or the generator call is wrong. If scale-18 data was intended, rename the test; if scale-10 was intended, a different generator is needed.

  2. Dead code in overflow ANSI error branch of cast_float_to_timestamp_impl! (numeric.rs)
    The overflow ANSI error branch contains a micros.is_nan() check that is unreachable. The macro only reaches the overflow path when val is already finite and non-NaN (the first guard val.is_nan() || val.is_infinite() handles those). Multiplying a finite f64 by a constant cannot produce NaN. This dead branch could mislead future maintainers.

  3. Typo in comment
    In cast_decimal_to_timestamp: 'in all leval modes' should be 'in all eval modes'.
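The dead-code claim in item 2 can be sanity-checked in isolation: multiplying a finite f64 by a finite constant can overflow to infinity, but it can never produce NaN. A tiny sketch (function name illustrative):

```rust
// NaN only arises from indeterminate operations such as 0.0 * INFINITY
// or INFINITY - INFINITY; finite * finite-constant overflows to
// +/-Infinity instead, so a NaN check after the multiply is unreachable
// once NaN inputs have already been filtered out.
fn scaled_micros(val: f64) -> f64 {
    val * 1_000_000f64
}
```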

POTENTIAL ISSUES

  1. canCastFromDecimal enables Timestamp for all eval modes without ANSI handling in Rust
    cast_decimal_to_timestamp always silently truncates (ts.as_i128() as i64) regardless of eval_mode. If Spark raises an error in ANSI mode for certain Decimal->Timestamp values, Comet would silently produce a different result. Please verify Spark's Decimal->Timestamp behavior in ANSI mode matches silent truncation, or add eval_mode handling.

  2. cast_boolean_to_timestamp ignores eval_mode at the Rust level
    The Scala side correctly restricts Boolean->Timestamp to LEGACY mode, but the Rust dispatch (Boolean, Timestamp(_, tz)) => cast_boolean_to_timestamp(&array, tz) does not pass eval_mode. If the Scala guard ever changes, this would silently succeed in ANSI mode. Consider adding a defensive eval_mode parameter or a comment explaining the invariant.

  3. Overflow boundary subtlety for float->i64 cast
    The check micros.floor() <= i64::MAX as f64 is subtle: i64::MAX (9223372036854775807) rounds up to 9223372036854775808.0 as f64, so values in [i64::MAX, i64::MAX+1) pass the check. Rust saturates on float-to-int cast since 1.45 so the result is safe, but this should be verified against Spark's behavior at this boundary and documented with a comment.

  4. scale_factor overflow potential in cast_decimal_to_timestamp
    10_i128.pow(scale as u32) - a negative scale would be cast to a huge u32, causing a panicking exponentiation in debug mode. A debug_assert!(scale >= 0 && scale <= 38) would make the invariant explicit.
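The saturation behavior described in item 3 is easy to confirm with a standalone sketch (function name illustrative; this is the built-in behavior of the `as` cast since Rust 1.45):

```rust
// Float-to-int `as` casts saturate: values above i64::MAX clamp to
// i64::MAX, values below i64::MIN clamp to i64::MIN, and NaN maps to 0.
// Note i64::MAX as f64 rounds up to 9223372036854775808.0, so inputs in
// [i64::MAX, i64::MAX + 1) pass a `<= i64::MAX as f64` check and then
// clamp to i64::MAX here.
fn float_micros_to_i64(micros: f64) -> i64 {
    micros as i64
}
```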

CODE QUALITY / STYLE

  1. Boolean included in 'non-int numeric' benchmark
    cast_non_int_numeric_timestamp.rs benchmarks Boolean->Timestamp alongside Float and Decimal. Boolean is not a numeric type. Consider renaming the file/group to cast_non_int_to_timestamp or moving the boolean benchmark separately.

  2. assertDataFrameEqualsWithExceptions evaluates df up to 6 times
    The new test helper in CometTestBase.scala materializes the DataFrame multiple times (2 tries + up to 4 more for exceptAll comparisons). This is inefficient and could be flaky for non-deterministic plans. Consider materializing to a temp view before comparisons.

  3. Missing compatibleTimezones loop for Boolean->Timestamp
    Float->Timestamp and Double->Timestamp tests are wrapped in compatibleTimezones.foreach for thoroughness. The Boolean->Timestamp test is not. Wrapping it would be consistent with the other timestamp cast tests.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for casting non-integer numeric types (Float32, Float64, Decimal, Boolean) to Timestamp, along with corresponding benchmarks and tests. The implementation correctly handles special float values (NaN, Infinity) and potential overflows, aligning with Spark's behavior. The test suite has been enhanced to accommodate tests that cannot use collect() due to extreme values. I've provided a few suggestions to improve code clarity and fix a minor issue in an error message.

Comment on lines +45 to +62
pub(crate) fn cast_boolean_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
) -> SparkResult<ArrayRef> {
    let bool_array = array_ref.as_boolean();
    let mut builder = TimestampMicrosecondBuilder::with_capacity(bool_array.len());

    for i in 0..bool_array.len() {
        if bool_array.is_null(i) {
            builder.append_null();
        } else {
            let micros = if bool_array.value(i) { 1 } else { 0 };
            builder.append_value(micros);
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

medium

The implementation of cast_boolean_to_timestamp can be simplified by using iterators and collect instead of a manual loop with a builder. This makes the code more concise and idiomatic.

pub(crate) fn cast_boolean_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
) -> SparkResult<ArrayRef> {
    let bool_array = array_ref.as_boolean();
    let result: arrow::array::TimestampMicrosecondArray = bool_array
        .iter()
        .map(|v| v.map(|b| if b { 1 } else { 0 }))
        .collect();

    Ok(Arc::new(result.with_timezone_opt(target_tz.clone())) as ArrayRef)
}

Comment on lines +78 to +126
macro_rules! cast_float_to_timestamp_impl {
    ($array:expr, $builder:expr, $primitive_type:ty, $eval_mode:expr) => {{
        let arr = $array.as_primitive::<$primitive_type>();
        for i in 0..arr.len() {
            if arr.is_null(i) {
                $builder.append_null();
            } else {
                let val = arr.value(i) as f64;
                // Path 1: NaN/Infinity check - error says TIMESTAMP
                if val.is_nan() || val.is_infinite() {
                    if $eval_mode == EvalMode::Ansi {
                        return Err(SparkError::CastInvalidValue {
                            value: val.to_string(),
                            from_type: "DOUBLE".to_string(),
                            to_type: "TIMESTAMP".to_string(),
                        });
                    }
                    $builder.append_null();
                } else {
                    // Path 2: Multiply then check overflow - error says BIGINT
                    let micros = val * MICROS_PER_SECOND as f64;
                    if micros.floor() <= i64::MAX as f64 && micros.ceil() >= i64::MIN as f64 {
                        $builder.append_value(micros as i64);
                    } else {
                        if $eval_mode == EvalMode::Ansi {
                            let value_str = if micros.is_infinite() {
                                if micros.is_sign_positive() {
                                    "Infinity".to_string()
                                } else {
                                    "-Infinity".to_string()
                                }
                            } else if micros.is_nan() {
                                "NaN".to_string()
                            } else {
                                format!("{:e}", micros).to_uppercase() + "D"
                            };
                            return Err(SparkError::CastOverFlow {
                                value: value_str,
                                from_type: "DOUBLE".to_string(),
                                to_type: "BIGINT".to_string(),
                            });
                        }
                        $builder.append_null();
                    }
                }
            }
        }
    }};
}

medium

The from_type in the CastInvalidValue error is hardcoded to "DOUBLE". This is incorrect when casting from Float32, as Spark would report "FLOAT".

To fix this, the macro should accept the from_type as a parameter. Additionally, using the original float value (f32 or f64) in the error message is more accurate than the value after casting to f64.

macro_rules! cast_float_to_timestamp_impl {
    ($array:expr, $builder:expr, $primitive_type:ty, $eval_mode:expr, $from_type_str:expr) => {{
        let arr = $array.as_primitive::<$primitive_type>();
        for i in 0..arr.len() {
            if arr.is_null(i) {
                $builder.append_null();
            } else {
                let val_native = arr.value(i);
                let val = val_native as f64;
                // Path 1: NaN/Infinity check - error says TIMESTAMP
                if val.is_nan() || val.is_infinite() {
                    if $eval_mode == EvalMode::Ansi {
                        return Err(SparkError::CastInvalidValue {
                            value: val_native.to_string(),
                            from_type: $from_type_str.to_string(),
                            to_type: "TIMESTAMP".to_string(),
                        });
                    }
                    $builder.append_null();
                } else {
                    // Path 2: Multiply then check overflow - error says BIGINT
                    let micros = val * MICROS_PER_SECOND as f64;
                    if micros.floor() <= i64::MAX as f64 && micros.ceil() >= i64::MIN as f64 {
                        $builder.append_value(micros as i64);
                    } else {
                        if $eval_mode == EvalMode::Ansi {
                            let value_str = if micros.is_infinite() {
                                if micros.is_sign_positive() {
                                    "Infinity".to_string()
                                } else {
                                    "-Infinity".to_string()
                                }
                            } else if micros.is_nan() {
                                "NaN".to_string()
                            } else {
                                format!("{:e}", micros).to_uppercase() + "D"
                            };
                            return Err(SparkError::CastOverFlow {
                                value: value_str,
                                from_type: "DOUBLE".to_string(),
                                to_type: "BIGINT".to_string(),
                            });
                        }
                        $builder.append_null();
                    }
                }
            }
        }
    }};
}

Comment on lines +992 to +1015
pub(crate) fn cast_float_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    let mut builder = TimestampMicrosecondBuilder::with_capacity(array_ref.len());

    match array_ref.data_type() {
        DataType::Float32 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float32Type, eval_mode)
        }
        DataType::Float64 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float64Type, eval_mode)
        }
        dt => {
            return Err(SparkError::Internal(format!(
                "Unsupported type for cast_float_to_timestamp: {:?}",
                dt
            )))
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

medium

Following the change to cast_float_to_timestamp_impl!, please update the calls to pass the correct from_type string ("FLOAT" or "DOUBLE") based on the input data type.

pub(crate) fn cast_float_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    let mut builder = TimestampMicrosecondBuilder::with_capacity(array_ref.len());

    match array_ref.data_type() {
        DataType::Float32 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float32Type, eval_mode, "FLOAT")
        }
        DataType::Float64 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float64Type, eval_mode, "DOUBLE")
        }
        dt => {
            return Err(SparkError::Internal(format!(
                "Unsupported type for cast_float_to_timestamp: {:?}",
                dt
            )))
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

@augmentcode

augmentcode bot commented Mar 6, 2026

🤖 Augment PR Summary

Summary: This PR extends Comet’s Spark-compatible CAST implementation to support casting non-integer numeric inputs to TIMESTAMP.

Changes:

  • Added native (Rust) cast implementations for Float32/Float64 → Timestamp, Decimal128 → Timestamp, and Boolean → Timestamp.
  • Wired the new casts into the central cast dispatcher so they’re used instead of DataFusion’s generic cast where Spark semantics differ.
  • Updated Spark-side support detection (CometCast) to allow these casts (with Boolean→Timestamp gated to LEGACY mode).
  • Enabled/added Spark integration tests for Float/Double/Decimal/Boolean → Timestamp casts and expanded decimal coverage.
  • Introduced a new DataFrame comparison helper that uses exceptAll to avoid collect() in cases where extreme timestamp values can break materialization.
  • Added a Criterion benchmark for these cast paths to measure performance across input types.

Technical Notes: Float/double casting handles NaN/Infinity and ANSI-mode overflow behavior, and all timestamp outputs are produced in microseconds with optional timezone metadata preserved.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


    scale: i8,
) -> SparkResult<ArrayRef> {
    let arr = array_ref.as_primitive::<Decimal128Type>();
    let scale_factor = 10_i128.pow(scale as u32);

scale can be negative for Spark decimals (e.g. DECIMAL(10,-4)); 10_i128.pow(scale as u32) will behave incorrectly (huge exponent) and can panic/overflow. This cast should explicitly handle negative scales to avoid crashing on valid Spark inputs.

Severity: high
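One way the negative-scale case could be handled defensively — a hedged sketch, not the PR's code. `decimal_seconds_to_micros` is a hypothetical name, and truncation toward zero is assumed to match Spark's cast semantics:

```rust
// For decimal scale s, an unscaled value u represents u * 10^(-s)
// seconds: positive scales divide, negative scales multiply. Using
// checked_pow/checked_mul avoids the panic-on-overflow that
// `10_i128.pow(scale as u32)` risks when scale is negative (the i8
// would be sign-extended into a huge u32 exponent).
fn decimal_seconds_to_micros(unscaled: i128, scale: i8) -> Option<i128> {
    const MICROS_PER_SECOND: i128 = 1_000_000;
    if scale >= 0 {
        let factor = 10_i128.checked_pow(scale as u32)?;
        // Multiply first, then divide, so sub-second digits survive;
        // integer division truncates toward zero.
        unscaled.checked_mul(MICROS_PER_SECOND)?.checked_div(factor)
    } else {
        let factor = 10_i128.checked_pow((-(scale as i32)) as u32)?;
        unscaled.checked_mul(factor)?.checked_mul(MICROS_PER_SECOND)
    }
}
```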


                to_type: "BIGINT".to_string(),
            });
        }
        $builder.append_null();

In LEGACY/TRY modes this path appends NULL when micros is out of the i64 range, but Spark’s legacy doubleToTimestamp uses (d * MICROS_PER_SECOND).toLong (i.e., clamps to Long.{MIN,MAX} rather than returning NULL). If the goal is strict Spark compatibility, this overflow behavior likely diverges for extreme numeric inputs.

Severity: medium


if (hasIncompatibleType) {
  checkSparkAnswer(df)
  if (useDataFrameDiff) {
    assertDataFrameEqualsWithExceptions(df, assertCometNative = !hasIncompatibleType)

When useDataFrameDiff=true, the (sparkErr, cometErr) result from assertDataFrameEqualsWithExceptions is ignored in the non-ANSI / try_cast paths, so the test can silently pass even if Spark and/or Comet throws. It seems safer to assert both sides succeeded (e.g., (None, None)) for these branches where exceptions are not expected.

Severity: medium

Other Locations
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala:1508
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala:1580

