
21212: feat: spark compatible float to timestamp cast with ANSI support#297

Open
martin-augment wants to merge 2 commits into main from pr-21212-2026-03-29-20-23-02

Conversation

@martin-augment
Owner

21212: To review by AI

@coderabbitai

coderabbitai bot commented Mar 29, 2026

Walkthrough

This change extends the Spark CAST function to support casting float types (Float32 and Float64) to timestamp values. The implementation adds float-to-microseconds conversion with special handling for IEEE floating-point values (NaN and Infinity) and overflow conditions, where these cases produce NULL in non-ANSI mode and raise errors in ANSI mode. The enable_ansi_mode configuration is threaded through the casting path to control this behavior. Scalar and array inputs are supported. Comprehensive test coverage is added through a new SQLLogicTest file and extended unit tests, validating correct conversions for various input values and both ANSI and non-ANSI execution modes.


@claude

claude bot commented Mar 29, 2026

PR Review: Spark-compatible float to timestamp cast with ANSI support

Overall this is a well-structured addition that correctly implements float→timestamp semantics with ANSI mode support. The test coverage is broad. A few issues need attention before merging.


🔴 Will break CI: clippy::collapsible_else_if

In float_secs_to_micros (cast.rs ~line 60), the else { if ... } pattern will fail the cargo clippy -- -D warnings check required by CLAUDE.md:

// Current — triggers collapsible_else_if
} else {
    if enable_ansi_mode {
        return exec_err!("Overflow casting {} to TIMESTAMP", val);
    }
    Ok(None)
}

Should be collapsed:

} else if enable_ansi_mode {
    exec_err!("Overflow casting {} to TIMESTAMP", val)
} else {
    Ok(None)
}

🔴 Subtle overflow check bug (ANSI mode)

The bounds check in float_secs_to_micros:

if micros >= i64::MIN as f64 && micros <= i64::MAX as f64 {
    Ok(Some(micros as i64))

i64::MAX (9223372036854775807) is not exactly representable as f64 — it rounds up to 9223372036854775808.0. So values in the range (i64::MAX, 9223372036854775808.0] will pass the check, and micros as i64 will saturate to i64::MAX. In non-ANSI mode saturation is acceptable, but in ANSI mode this silently swallows overflows that should be errors. Use a strict-less-than for the upper bound:

if micros >= i64::MIN as f64 && micros < i64::MAX as f64 {

(i64::MIN = -9223372036854775808 is exactly representable, so >= is fine for the lower bound.)
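The rounding behaviour described above can be verified directly. A minimal standalone check (illustrative, not the PR's code):

```rust
fn main() {
    // i64::MAX (2^63 - 1) is not representable as f64; the cast rounds up to 2^63.
    let max_as_f64 = i64::MAX as f64;
    assert_eq!(max_as_f64, 9223372036854775808.0); // exactly 2^63

    // A float equal to 2^63 passes the `<=` bound even though it exceeds i64::MAX...
    let micros = 9223372036854775808.0_f64;
    assert!(micros <= max_as_f64);
    // ...and the `as` cast then saturates silently (defined behaviour since Rust 1.45).
    assert_eq!(micros as i64, i64::MAX);

    // i64::MIN (-2^63) is a power of two and exactly representable, so `>=` is safe.
    assert_eq!(i64::MIN as f64 as i64, i64::MIN);
    println!("boundary behaviour confirmed");
}
```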


🟡 Error message doesn't distinguish -Infinity

In float_secs_to_micros, both Infinity and -Infinity produce:

Cannot cast Infinity to TIMESTAMP

The test test_cast_float_infinity_ansi_mode only covers positive infinity. Consider:

if val.is_nan() { "NaN" } else if val.is_sign_positive() { "Infinity" } else { "-Infinity" }

And add a test for negative infinity in ANSI mode.
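The suggested expression can be factored into a small helper; the function name special_float_name is hypothetical, chosen for illustration:

```rust
/// Human-readable name for a non-finite f64 (hypothetical helper name).
/// Only meaningful for NaN and the two infinities.
fn special_float_name(val: f64) -> &'static str {
    if val.is_nan() {
        "NaN"
    } else if val.is_sign_positive() {
        "Infinity"
    } else {
        "-Infinity"
    }
}

fn main() {
    assert_eq!(special_float_name(f64::NAN), "NaN");
    assert_eq!(special_float_name(f64::INFINITY), "Infinity");
    assert_eq!(special_float_name(f64::NEG_INFINITY), "-Infinity");
    println!("ok");
}
```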


🟡 Doc comment on SparkCast not updated

The struct's doc comment still reads:

/// # Currently supported conversions
/// - Int8/Int16/Int32/Int64 -> Timestamp (target_type = 'timestamp')

Float32/Float64 should be added there, and the overflow behaviour section should mention float semantics (NaN/Infinity/saturation vs. ANSI errors) rather than only integer saturation.


🟡 Potentially brittle test assertion

In test_cast_float64_array_to_timestamp:

Some(1704067200.123456),
// ...
assert_eq!(ts_array.value(3), 1_704_067_200_123_456);

1704067200.123456 has 16 significant digits, which is at the limit of f64 precision (~15–17 digits). The product 1704067200.123456 * 1_000_000.0 may not equal exactly 1704067200123456. Verify this is stable on all platforms, or use a value with fewer significant digits that is provably exact in f64.
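The precision concern can be probed directly. The sketch below only asserts that the product lands within one microsecond of the expected value (which is guaranteed by f64 error bounds at this magnitude); whether it is *exactly* equal is the question the unit test silently depends on:

```rust
fn main() {
    let secs: f64 = 1704067200.123456;
    let micros = secs * 1_000_000.0;
    // Rounding of the decimal literal plus rounding of the product contribute
    // at most a fraction of a microsecond of error at this magnitude...
    let diff = (micros - 1_704_067_200_123_456.0_f64).abs();
    assert!(diff <= 1.0);
    // ...but the unit test's assert_eq! requires diff == 0.0 exactly.
    println!("difference from expected micros: {diff}");
}
```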


✅ Positives

  • The ANSI vs. non-ANSI mode split (NULL vs. error) matches Spark's specification.
  • cast_float_to_timestamp correctly uses the T::Native: Into<f64> bound so Float32→f64 widening is handled generically.
  • ScalarValue path handles Float32(None) / Float64(None) as None correctly.
  • The sqllogictest file covers scalar, array, timezone, and ANSI-mode paths comprehensively.
  • One gap worth noting: the unit tests lack Float32 NaN/Infinity coverage (the SLT only tests Float64 for NaN/Inf) — worth adding a test_cast_float32_nan_non_ansi_mode unit test for completeness.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements casting from float types (Float32 and Float64) to timestamps in the Spark-compatible cast function, including handling for fractional seconds, NaN, infinity, and overflow with support for ANSI mode. It also adds corresponding unit and logic tests. Review feedback suggests improving error message clarity by distinguishing between positive and negative infinity and refactoring array processing to use idiomatic Rust iterators.

if enable_ansi_mode {
return exec_err!(
"Cannot cast {} to TIMESTAMP",
if val.is_nan() { "NaN" } else { "Infinity" }

medium

The error message for infinity does not distinguish between positive and negative infinity. This could be confusing for users. Consider providing a more specific message to improve clarity.

It would also be beneficial to add a test case for casting f64::NEG_INFINITY in ANSI mode to ensure it's handled correctly and provides the improved error message.

Suggested change
if val.is_nan() { "NaN" } else { "Infinity" }
if val.is_nan() { "NaN" } else if val.is_sign_positive() { "Infinity" } else { "-Infinity" }

Comment on lines +209 to +219
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let val: f64 = arr.value(i).into();
match float_secs_to_micros(val, enable_ansi_mode)? {
Some(micros) => builder.append_value(micros),
None => builder.append_null(),
}
}
}

medium

The loop can be written more idiomatically using an iterator over the array. This can make the code more concise and easier to read by leveraging standard Rust patterns.

    for val in arr.iter() {
        let micros = if let Some(val) = val {
            float_secs_to_micros(val.into(), enable_ansi_mode)?
        } else {
            None
        };
        builder.append_option(micros);
    }

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Mar 29, 2026

🤖 Augment PR Summary

Summary: Adds Spark-compatible casting from floating-point seconds to microsecond timestamps, with behavior gated by ANSI mode.

Changes:

  • Extends spark_cast input signature to accept Float32 and Float64 in addition to signed integers
  • Introduces float-seconds → microseconds conversion with explicit handling for NaN/Infinity and overflow
  • Threads execution.enable_ansi_mode into timestamp casting to switch between NULL-on-invalid vs error-on-invalid behavior
  • Adds array casting implementation for float inputs producing Timestamp(Microsecond) arrays
  • Updates scalar casting path to support ScalarValue::Float32/Float64
  • Adds Rust unit tests covering float arrays/scalars, negative values, NaN/Infinity, and overflow in ANSI vs non-ANSI
  • Adds a new SQLLogicTest file validating SQL-level behavior, timezone effects, and ANSI-mode error cases

Technical Notes: Float inputs are interpreted as seconds since Unix epoch (fractional seconds supported) and converted to microsecond timestamps; special values (NaN/Infinity) are NULL in non-ANSI mode and errors in ANSI mode.
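The conversion semantics summarized above can be sketched as a standalone function. This is a simplified illustration, not the PR's actual code: it uses plain String errors instead of DataFusion's exec_err!, and it applies the strict upper-bound fix discussed in the reviews:

```rust
const MICROS_PER_SECOND: i64 = 1_000_000;

/// Convert seconds-since-epoch (f64, fractional seconds allowed) to
/// microseconds. Non-finite or overflowing inputs yield None in non-ANSI
/// mode and an error in ANSI mode. Sketch only; not the PR's implementation.
fn float_secs_to_micros(val: f64, enable_ansi_mode: bool) -> Result<Option<i64>, String> {
    if val.is_nan() || val.is_infinite() {
        if enable_ansi_mode {
            return Err(format!("Cannot cast {val} to TIMESTAMP"));
        }
        return Ok(None);
    }
    let micros = val * MICROS_PER_SECOND as f64;
    // Strict `<` on the upper bound: `i64::MAX as f64` rounds up to 2^63.
    if micros >= i64::MIN as f64 && micros < i64::MAX as f64 {
        Ok(Some(micros as i64))
    } else if enable_ansi_mode {
        Err(format!("Overflow casting {val} to TIMESTAMP"))
    } else {
        Ok(None)
    }
}

fn main() {
    assert_eq!(float_secs_to_micros(1.5, false), Ok(Some(1_500_000)));
    assert_eq!(float_secs_to_micros(-0.5, false), Ok(Some(-500_000)));
    assert_eq!(float_secs_to_micros(f64::NAN, false), Ok(None));
    assert!(float_secs_to_micros(f64::INFINITY, true).is_err());
    assert_eq!(float_secs_to_micros(1e19, false), Ok(None)); // overflow, non-ANSI
    println!("all conversions behaved as described");
}
```

Note that returning None on overflow in non-ANSI mode is the PR's behavior as described; one review comment below argues Spark instead saturates in that case.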



@augmentcode augmentcode bot left a comment


Review completed. 4 suggestions posted.



if enable_ansi_mode {
return exec_err!(
"Cannot cast {} to TIMESTAMP",
if val.is_nan() { "NaN" } else { "Infinity" }

datafusion/spark/src/function/conversion/cast.rs:52: In ANSI mode, -Infinity will be reported as Infinity here because the error text doesn’t preserve the sign, which can make failures harder to diagnose.

Severity: low


return Ok(None);
}
let micros = val * MICROS_PER_SECOND as f64;
if micros >= i64::MIN as f64 && micros <= i64::MAX as f64 {

datafusion/spark/src/function/conversion/cast.rs:58: The bounds check uses i64::MAX as f64 (rounded to 2^63), so a value that rounds to exactly 2^63 microseconds can pass this check and then micros as i64 will saturate to i64::MAX, bypassing the intended overflow error/NULL behavior (including in ANSI mode).

Severity: medium


if enable_ansi_mode {
return exec_err!("Overflow casting {} to TIMESTAMP", val);
}
Ok(None)

datafusion/spark/src/function/conversion/cast.rs:64: Spark’s non-ANSI float/double→timestamp cast appears to saturate on overflow (via (d * MICROS_PER_SECOND).toLong), so returning NULL on overflow here may diverge from Spark compatibility for very large inputs.

Severity: medium



// Spark only supports signed integers, so we explicitly list them
let signed_int_signatures = [
// Supported input types: signed integers and floats

datafusion/spark/src/function/conversion/cast.rs:112: The spark_cast doc comment still lists only integer→timestamp and describes saturating overflow, but this PR adds float→timestamp with distinct NaN/Infinity/overflow behavior, so the documentation is now inconsistent with the implementation.

Severity: low



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
let enable_ansi_mode = args.config_options.execution.enable_ansi_mode;

Float-to-timestamp can produce NULLs despite non-nullable return field

Medium Severity

return_field_from_args computes nullable solely from whether input fields are nullable. With the new float-to-timestamp conversion, float_secs_to_micros returns None (producing NULLs) for NaN, Infinity, and overflow values in non-ANSI mode — even when the input float column is declared non-nullable. This creates a mismatch: the output field claims to be non-nullable, but the actual data may contain NULLs. Downstream optimizations that skip null checks based on this metadata could produce incorrect results.


return exec_err!(
"Cannot cast {} to TIMESTAMP",
if val.is_nan() { "NaN" } else { "Infinity" }
);

Error message misidentifies negative infinity as positive

Low Severity

When val is f64::NEG_INFINITY, val.is_nan() is false, so the error message falls through to the else branch producing "Cannot cast Infinity to TIMESTAMP". The actual value is -Infinity, so the error message provides incorrect diagnostic information. The ternary doesn't distinguish positive from negative infinity.



@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/spark/src/function/conversion/cast.rs (1)

345-356: ⚠️ Potential issue | 🟡 Minor

Test helper panics if timezone is None.

The function signature accepts Option<&str> but line 354 calls timezone.unwrap() unconditionally. This will panic if None is passed. Consider either:

  • Changing the signature to timezone: &str since None is not supported, or
  • Handling None gracefully with .map(Arc::from)

This is test-only code and all current callers pass Some(...), so impact is low.

Proposed fix (option 1 - change signature)
 fn make_args_with_timezone(
     input: ColumnarValue,
     target_type: &str,
-    timezone: Option<&str>,
+    timezone: &str,
 ) -> ScalarFunctionArgs {
     let return_field = Arc::new(Field::new(
         "result",
         DataType::Timestamp(
             TimeUnit::Microsecond,
-            Some(Arc::from(timezone.unwrap())),
+            Some(Arc::from(timezone)),
         ),
         true,
     ));
     let mut config = ConfigOptions::default();
-    if let Some(tz) = timezone {
-        config.execution.time_zone = Some(tz.to_string());
-    }
+    config.execution.time_zone = Some(timezone.to_string());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/spark/src/function/conversion/cast.rs` around lines 345 - 356, The
helper make_args_with_timezone currently unwraps timezone unconditionally
causing a panic when timezone is None; either change the signature to require
timezone: &str and update all callers to pass a str, or keep timezone:
Option<&str> and construct the Timestamp field with timezone.map(Arc::from)
(i.e. pass Some(Arc::from(...)) when present and None otherwise) so the
Arc::from(timezone.unwrap()) call in the creation of return_field (and any
places using timezone) is replaced with a safe Option<Arc<str>> conversion.
🧹 Nitpick comments (1)
datafusion/spark/src/function/conversion/cast.rs (1)

57-59: Minor precision edge case in overflow boundary check.

The comparison micros <= i64::MAX as f64 has a subtle precision issue: i64::MAX as f64 rounds up to 9223372036854775808.0 (one more than i64::MAX) due to f64's limited precision. Values equal to this threshold could pass the check but would saturate when cast to i64.

This is benign in practice—Rust 1.45+ defines saturating behavior for such casts, and this edge case represents timestamps billions of years in the future. The current behavior is also consistent with the saturating approach used for integer overflow in secs_to_micros.

No action required unless stricter boundary enforcement is desired.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/spark/src/function/conversion/cast.rs` around lines 57 - 59, The
check against the i64 upper bound can mispass due to f64 rounding of i64::MAX;
update the condition in the cast logic that uses micros (and mirror the same
change in secs_to_micros if present) to use a strict less-than or subtract one
unit before comparison — e.g. replace `micros <= i64::MAX as f64` with `micros <
i64::MAX as f64` or `micros <= (i64::MAX as f64) - 1.0` so values that round up
to 9223372036854775808.0 don't erroneously pass and then saturate when cast to
i64. Ensure you reference the `micros` variable and `MICROS_PER_SECOND` in the
same block when making the change.


📥 Commits

Reviewing files that changed from the base of the PR and between 49d9c89 and a75281c.

📒 Files selected for processing (2)
  • datafusion/spark/src/function/conversion/cast.rs
  • datafusion/sqllogictest/test_files/spark/conversion/cast_float_to_timestamp.slt
