
3559: feat: Cast numeric (non int) to timestamp #44

Open
martin-augment wants to merge 10 commits into main from pr-3559-2026-03-06-12-45-07

Conversation

@martin-augment
Owner

3559: To review by AI

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Comet's data type casting capabilities by introducing native support for converting floating-point numbers (Float32, Float64), decimal values, and boolean types into timestamps. This feature ensures closer alignment with Spark's casting semantics, particularly for non-integer numeric types, and improves the overall compatibility and robustness of Comet's expression evaluation engine.

Highlights

  • Expanded Timestamp Casting: Added support for casting Float32, Float64, Decimal128, and Boolean types to Timestamp in the native Rust expression engine.
  • New Rust Conversion Functions: Implemented new Rust functions: cast_float_to_timestamp, cast_decimal_to_timestamp, and cast_boolean_to_timestamp to handle these conversions.
  • Scala Compatibility Updates: Updated the Scala CometCast expression to recognize and support these new casting rules, ensuring compatibility with Spark's casting behavior.
  • New Benchmark Added: Introduced a new benchmark to measure the performance of casting non-integer numeric types to timestamps.
  • Enhanced Test Coverage: Enabled and enhanced Scala test cases in CometCastSuite to validate the correctness of these new timestamp casts across various numeric and boolean inputs, including a new DataFrame comparison utility.
Changelog
  • native/spark-expr/Cargo.toml
    • Added a new benchmark target for casting non-integer numeric types to timestamp.
  • native/spark-expr/benches/cast_non_int_numeric_timestamp.rs
    • Introduced a new benchmark file to evaluate the performance of casting Float32, Float64, Boolean, and Decimal128 to Timestamp.
  • native/spark-expr/src/conversion_funcs/boolean.rs
    • Implemented the cast_boolean_to_timestamp function.
    • Added unit tests for boolean to timestamp casting.
  • native/spark-expr/src/conversion_funcs/cast.rs
    • Integrated the new cast_float_to_timestamp, cast_boolean_to_timestamp, and cast_decimal_to_timestamp functions into the main cast_array dispatch logic.
  • native/spark-expr/src/conversion_funcs/numeric.rs
    • Added cast_decimal_to_timestamp and cast_float_to_timestamp functions.
    • Included a macro for float-to-timestamp conversion logic with overflow handling.
    • Added comprehensive unit tests for float and decimal to timestamp casting.
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
    • Updated the canCastFromBoolean, canCastFromFloat, canCastFromDouble, and canCastFromDecimal methods to correctly identify TimestampType as a compatible target for casting.
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala
    • Un-ignored and updated test cases for casting Boolean, Float, Double, and Decimal types to Timestamp.
    • Introduced useDataFrameDiff for more robust comparison in some test scenarios.
  • spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala
    • Added a new helper function assertDataFrameEqualsWithExceptions to compare Spark and Comet DataFrames, especially useful for cases where collect() might fail due to extreme values.
Activity
  • The pull request was created by martin-augment.
  • The author has requested AI review.
  • No human review comments or activity have been recorded yet.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


        builder.append_null();
    } else {
        let micros = if bool_array.value(i) { 1 } else { 0 };
        builder.append_value(micros);

Boolean-to-timestamp maps true to 1µs instead of 1s

High Severity

cast_boolean_to_timestamp maps true to 1 (1 microsecond) instead of MICROS_PER_SECOND (1,000,000 = 1 second). In Spark, boolean-to-timestamp works by converting true → 1 (second) → 1_000_000 microseconds. The existing cast_int_to_timestamp_impl macro confirms integers are treated as seconds and multiplied by MICROS_PER_SECOND. The result is that true produces 1970-01-01 00:00:00.000001 instead of the correct 1970-01-01 00:00:01. The unit test also encodes this wrong expectation.

Additional Locations (1)
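A minimal model of the suggested fix, with plain `Option<bool>` slices standing in for arrow arrays (the function name is illustrative, not from the PR):

```rust
// true should map to one second (1_000_000 microseconds), not 1
// microsecond; MICROS_PER_SECOND mirrors the constant the integer
// cast path already multiplies by.
const MICROS_PER_SECOND: i64 = 1_000_000;

// Nulls are modeled as None; arrow's builder loop is elided.
fn bool_to_timestamp_micros(values: &[Option<bool>]) -> Vec<Option<i64>> {
    values
        .iter()
        .map(|v| v.map(|b| if b { MICROS_PER_SECOND } else { 0 }))
        .collect()
}
```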


val df2 = spark.sql(s"select a, try_cast(a as ${toType.sql}) from t order by a")
if (useDataFrameDiff) {
  assertDataFrameEqualsWithExceptions(df2, assertCometNative = !hasIncompatibleType)
} else {

ANSI try_cast test runs outside required config scope

Medium Severity

The if (testTry) block for try_cast in the ANSI section was moved outside the withSQLConf block that sets ANSI_ENABLED -> "true" and getExprAllowIncompatConfigKey -> "true". Without the incompat config key, the Comet native path may not be exercised, reducing test coverage for the ANSI try_cast path.


@claude

claude bot commented Mar 6, 2026

PR Review: Cast numeric (non-int) to Timestamp

Good work implementing these casting conversions! The feature is well-structured and the benchmarks are a nice addition. Here are my findings:

BUGS / CORRECTNESS

  1. Test name vs data mismatch in Decimal test (CometCastSuite.scala)
    The new test is named 'cast DecimalType(38,10) to TimestampType' but calls generateDecimalsPrecision38Scale18(), which produces Decimal(38,18) data. Either the test name or the generator call is wrong. If scale-18 data was intended, rename the test; if scale-10 was intended, a different generator is needed.

  2. Dead code in overflow ANSI error branch of cast_float_to_timestamp_impl! (numeric.rs)
    The overflow ANSI error branch contains a micros.is_nan() check that is unreachable. The macro only reaches the overflow path when val is already finite and non-NaN (the first guard val.is_nan() || val.is_infinite() handles those). Multiplying a finite f64 by a constant cannot produce NaN. This dead branch could mislead future maintainers.

  3. Typo in comment
    In cast_decimal_to_timestamp: 'in all leval modes' should be 'in all eval modes'.
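The dead-code claim in item 2 can be sanity-checked in isolation: multiplying a finite f64 by a finite constant can overflow to infinity, but it can never produce NaN. A tiny sketch (function name illustrative):

```rust
// NaN only arises from indeterminate operations such as 0.0 * INFINITY
// or INFINITY - INFINITY; finite * finite-constant overflows to
// +/-Infinity instead, so a NaN check after the multiply is unreachable
// once NaN inputs have already been filtered out.
fn scaled_micros(val: f64) -> f64 {
    val * 1_000_000f64
}
```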

POTENTIAL ISSUES

  1. canCastFromDecimal enables Timestamp for all eval modes without ANSI handling in Rust
    cast_decimal_to_timestamp always silently truncates (ts.as_i128() as i64) regardless of eval_mode. If Spark raises an error in ANSI mode for certain Decimal->Timestamp values, Comet would silently produce a different result. Please verify Spark's Decimal->Timestamp behavior in ANSI mode matches silent truncation, or add eval_mode handling.

  2. cast_boolean_to_timestamp ignores eval_mode at the Rust level
    The Scala side correctly restricts Boolean->Timestamp to LEGACY mode, but the Rust dispatch (Boolean, Timestamp(_, tz)) => cast_boolean_to_timestamp(&array, tz) does not pass eval_mode. If the Scala guard ever changes, this would silently succeed in ANSI mode. Consider adding a defensive eval_mode parameter or a comment explaining the invariant.

  3. Overflow boundary subtlety for float->i64 cast
    The check micros.floor() <= i64::MAX as f64 is subtle: i64::MAX (9223372036854775807) rounds up to 9223372036854775808.0 as f64, so values in [i64::MAX, i64::MAX+1) pass the check. Rust saturates on float-to-int cast since 1.45 so the result is safe, but this should be verified against Spark's behavior at this boundary and documented with a comment.

  4. scale_factor overflow potential in cast_decimal_to_timestamp
    10_i128.pow(scale as u32) - a negative scale would be cast to a huge u32, causing a panicking exponentiation in debug mode. A debug_assert!(scale >= 0 && scale <= 38) would make the invariant explicit.
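The saturation behavior described in item 3 is easy to confirm with a standalone sketch (function name illustrative; this is the built-in behavior of the `as` cast since Rust 1.45):

```rust
// Float-to-int `as` casts saturate: values above i64::MAX clamp to
// i64::MAX, values below i64::MIN clamp to i64::MIN, and NaN maps to 0.
// Note i64::MAX as f64 rounds up to 9223372036854775808.0, so inputs in
// [i64::MAX, i64::MAX + 1) pass a `<= i64::MAX as f64` check and then
// clamp to i64::MAX here.
fn float_micros_to_i64(micros: f64) -> i64 {
    micros as i64
}
```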

CODE QUALITY / STYLE

  1. Boolean included in 'non-int numeric' benchmark
    cast_non_int_numeric_timestamp.rs benchmarks Boolean->Timestamp alongside Float and Decimal. Boolean is not a numeric type. Consider renaming the file/group to cast_non_int_to_timestamp or moving the boolean benchmark separately.

  2. assertDataFrameEqualsWithExceptions evaluates df up to 6 times
    The new test helper in CometTestBase.scala materializes the DataFrame multiple times (2 tries + up to 4 more for exceptAll comparisons). This is inefficient and could be flaky for non-deterministic plans. Consider materializing to a temp view before comparisons.

  3. Missing compatibleTimezones loop for Boolean->Timestamp
    Float->Timestamp and Double->Timestamp tests are wrapped in compatibleTimezones.foreach for thoroughness. The Boolean->Timestamp test is not. Wrapping it would be consistent with the other timestamp cast tests.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for casting non-integer numeric types (Float32, Float64, Decimal, Boolean) to Timestamp, along with corresponding benchmarks and tests. The implementation correctly handles special float values (NaN, Infinity) and potential overflows, aligning with Spark's behavior. The test suite has been enhanced to accommodate tests that cannot use collect() due to extreme values. I've provided a few suggestions to improve code clarity and fix a minor issue in an error message.

Comment on lines +45 to +62
pub(crate) fn cast_boolean_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
) -> SparkResult<ArrayRef> {
    let bool_array = array_ref.as_boolean();
    let mut builder = TimestampMicrosecondBuilder::with_capacity(bool_array.len());

    for i in 0..bool_array.len() {
        if bool_array.is_null(i) {
            builder.append_null();
        } else {
            let micros = if bool_array.value(i) { 1 } else { 0 };
            builder.append_value(micros);
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

medium

The implementation of cast_boolean_to_timestamp can be simplified by using iterators and collect instead of a manual loop with a builder. This makes the code more concise and idiomatic.

pub(crate) fn cast_boolean_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
) -> SparkResult<ArrayRef> {
    let bool_array = array_ref.as_boolean();
    let result: arrow::array::TimestampMicrosecondArray = bool_array
        .iter()
        .map(|v| v.map(|b| if b { 1 } else { 0 }))
        .collect();

    Ok(Arc::new(result.with_timezone_opt(target_tz.clone())) as ArrayRef)
}

Comment on lines +78 to +126
macro_rules! cast_float_to_timestamp_impl {
    ($array:expr, $builder:expr, $primitive_type:ty, $eval_mode:expr) => {{
        let arr = $array.as_primitive::<$primitive_type>();
        for i in 0..arr.len() {
            if arr.is_null(i) {
                $builder.append_null();
            } else {
                let val = arr.value(i) as f64;
                // Path 1: NaN/Infinity check - error says TIMESTAMP
                if val.is_nan() || val.is_infinite() {
                    if $eval_mode == EvalMode::Ansi {
                        return Err(SparkError::CastInvalidValue {
                            value: val.to_string(),
                            from_type: "DOUBLE".to_string(),
                            to_type: "TIMESTAMP".to_string(),
                        });
                    }
                    $builder.append_null();
                } else {
                    // Path 2: Multiply then check overflow - error says BIGINT
                    let micros = val * MICROS_PER_SECOND as f64;
                    if micros.floor() <= i64::MAX as f64 && micros.ceil() >= i64::MIN as f64 {
                        $builder.append_value(micros as i64);
                    } else {
                        if $eval_mode == EvalMode::Ansi {
                            let value_str = if micros.is_infinite() {
                                if micros.is_sign_positive() {
                                    "Infinity".to_string()
                                } else {
                                    "-Infinity".to_string()
                                }
                            } else if micros.is_nan() {
                                "NaN".to_string()
                            } else {
                                format!("{:e}", micros).to_uppercase() + "D"
                            };
                            return Err(SparkError::CastOverFlow {
                                value: value_str,
                                from_type: "DOUBLE".to_string(),
                                to_type: "BIGINT".to_string(),
                            });
                        }
                        $builder.append_null();
                    }
                }
            }
        }
    }};
}

medium

The from_type in the CastInvalidValue error is hardcoded to "DOUBLE". This is incorrect when casting from Float32, as Spark would report "FLOAT".

To fix this, the macro should accept the from_type as a parameter. Additionally, using the original float value (f32 or f64) in the error message is more accurate than the value after casting to f64.

macro_rules! cast_float_to_timestamp_impl {
    ($array:expr, $builder:expr, $primitive_type:ty, $eval_mode:expr, $from_type_str:expr) => {{
        let arr = $array.as_primitive::<$primitive_type>();
        for i in 0..arr.len() {
            if arr.is_null(i) {
                $builder.append_null();
            } else {
                let val_native = arr.value(i);
                let val = val_native as f64;
                // Path 1: NaN/Infinity check - error says TIMESTAMP
                if val.is_nan() || val.is_infinite() {
                    if $eval_mode == EvalMode::Ansi {
                        return Err(SparkError::CastInvalidValue {
                            value: val_native.to_string(),
                            from_type: $from_type_str.to_string(),
                            to_type: "TIMESTAMP".to_string(),
                        });
                    }
                    $builder.append_null();
                } else {
                    // Path 2: Multiply then check overflow - error says BIGINT
                    let micros = val * MICROS_PER_SECOND as f64;
                    if micros.floor() <= i64::MAX as f64 && micros.ceil() >= i64::MIN as f64 {
                        $builder.append_value(micros as i64);
                    } else {
                        if $eval_mode == EvalMode::Ansi {
                            let value_str = if micros.is_infinite() {
                                if micros.is_sign_positive() {
                                    "Infinity".to_string()
                                } else {
                                    "-Infinity".to_string()
                                }
                            } else if micros.is_nan() {
                                "NaN".to_string()
                            } else {
                                format!("{:e}", micros).to_uppercase() + "D"
                            };
                            return Err(SparkError::CastOverFlow {
                                value: value_str,
                                from_type: "DOUBLE".to_string(),
                                to_type: "BIGINT".to_string(),
                            });
                        }
                        $builder.append_null();
                    }
                }
            }
        }
    }};
}

Comment on lines +992 to +1015
pub(crate) fn cast_float_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    let mut builder = TimestampMicrosecondBuilder::with_capacity(array_ref.len());

    match array_ref.data_type() {
        DataType::Float32 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float32Type, eval_mode)
        }
        DataType::Float64 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float64Type, eval_mode)
        }
        dt => {
            return Err(SparkError::Internal(format!(
                "Unsupported type for cast_float_to_timestamp: {:?}",
                dt
            )))
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

medium

Following the change to cast_float_to_timestamp_impl!, please update the calls to pass the correct from_type string ("FLOAT" or "DOUBLE") based on the input data type.

pub(crate) fn cast_float_to_timestamp(
    array_ref: &ArrayRef,
    target_tz: &Option<Arc<str>>,
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    let mut builder = TimestampMicrosecondBuilder::with_capacity(array_ref.len());

    match array_ref.data_type() {
        DataType::Float32 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float32Type, eval_mode, "FLOAT")
        }
        DataType::Float64 => {
            cast_float_to_timestamp_impl!(array_ref, builder, Float64Type, eval_mode, "DOUBLE")
        }
        dt => {
            return Err(SparkError::Internal(format!(
                "Unsupported type for cast_float_to_timestamp: {:?}",
                dt
            )))
        }
    }

    Ok(Arc::new(builder.finish().with_timezone_opt(target_tz.clone())) as ArrayRef)
}

@augmentcode

augmentcode bot commented Mar 6, 2026

🤖 Augment PR Summary

Summary: This PR extends Comet’s Spark-compatible CAST implementation to support casting non-integer numeric inputs to TIMESTAMP.

Changes:

  • Added native (Rust) cast implementations for Float32/Float64 → Timestamp, Decimal128 → Timestamp, and Boolean → Timestamp.
  • Wired the new casts into the central cast dispatcher so they’re used instead of DataFusion’s generic cast where Spark semantics differ.
  • Updated Spark-side support detection (CometCast) to allow these casts (with Boolean→Timestamp gated to LEGACY mode).
  • Enabled/added Spark integration tests for Float/Double/Decimal/Boolean → Timestamp casts and expanded decimal coverage.
  • Introduced a new DataFrame comparison helper that uses exceptAll to avoid collect() in cases where extreme timestamp values can break materialization.
  • Added a Criterion benchmark for these cast paths to measure performance across input types.

Technical Notes: Float/double casting handles NaN/Infinity and ANSI-mode overflow behavior, and all timestamp outputs are produced in microseconds with optional timezone metadata preserved.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


    scale: i8,
) -> SparkResult<ArrayRef> {
    let arr = array_ref.as_primitive::<Decimal128Type>();
    let scale_factor = 10_i128.pow(scale as u32);

scale can be negative for Spark decimals (e.g. DECIMAL(10,-4)); 10_i128.pow(scale as u32) will behave incorrectly (huge exponent) and can panic/overflow. This cast should explicitly handle negative scales to avoid crashing on valid Spark inputs.

Severity: high
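One way the negative-scale case could be handled defensively — a hedged sketch, not the PR's code. `decimal_seconds_to_micros` is a hypothetical name, and truncation toward zero is assumed to match Spark's cast semantics:

```rust
// For decimal scale s, an unscaled value u represents u * 10^(-s)
// seconds: positive scales divide, negative scales multiply. Using
// checked_pow/checked_mul avoids the panic-on-overflow that
// `10_i128.pow(scale as u32)` risks when scale is negative (the i8
// would be sign-extended into a huge u32 exponent).
fn decimal_seconds_to_micros(unscaled: i128, scale: i8) -> Option<i128> {
    const MICROS_PER_SECOND: i128 = 1_000_000;
    if scale >= 0 {
        let factor = 10_i128.checked_pow(scale as u32)?;
        // Multiply first, then divide, so sub-second digits survive;
        // integer division truncates toward zero.
        unscaled.checked_mul(MICROS_PER_SECOND)?.checked_div(factor)
    } else {
        let factor = 10_i128.checked_pow((-(scale as i32)) as u32)?;
        unscaled.checked_mul(factor)?.checked_mul(MICROS_PER_SECOND)
    }
}
```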


                to_type: "BIGINT".to_string(),
            });
        }
        $builder.append_null();

In LEGACY/TRY modes this path appends NULL when micros is out of the i64 range, but Spark’s legacy doubleToTimestamp uses (d * MICROS_PER_SECOND).toLong (i.e., clamps to Long.{MIN,MAX} rather than returning NULL). If the goal is strict Spark compatibility, this overflow behavior likely diverges for extreme numeric inputs.

Severity: medium


if (hasIncompatibleType) {
  checkSparkAnswer(df)
  if (useDataFrameDiff) {
    assertDataFrameEqualsWithExceptions(df, assertCometNative = !hasIncompatibleType)

When useDataFrameDiff=true, the (sparkErr, cometErr) result from assertDataFrameEqualsWithExceptions is ignored in the non-ANSI / try_cast paths, so the test can silently pass even if Spark and/or Comet throws. It seems safer to assert both sides succeeded (e.g., (None, None)) for these branches where exceptions are not expected.

Severity: medium

Other Locations
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala:1508
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala:1580

