feat: spark compatible float to timestamp cast with ANSI support #297
martin-augment wants to merge 2 commits into main from
Conversation
Walkthrough: This change extends the Spark CAST function to support casting float types (Float32 and Float64) to timestamps.
PR Review: Spark-compatible float to timestamp cast with ANSI support

Overall this is a well-structured addition that correctly implements float→timestamp semantics with ANSI mode support. The test coverage is broad. A few issues need attention before merging.

🔴 Will break CI:
Code Review
This pull request implements casting from float types (Float32 and Float64) to timestamps in the Spark-compatible cast function, including handling for fractional seconds, NaN, infinity, and overflow with support for ANSI mode. It also adds corresponding unit and logic tests. Review feedback suggests improving error message clarity by distinguishing between positive and negative infinity and refactoring array processing to use idiomatic Rust iterators.
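The semantics described above can be sketched as a standalone helper. This is a simplified sketch, not the PR's exact code: the real `float_secs_to_micros` lives inside DataFusion and uses `exec_err!`; the plain `String` error type here is an assumption for self-containment.

```rust
// Simplified sketch of the float-seconds -> microsecond-timestamp conversion.
// Mirrors the PR's `float_secs_to_micros` semantics; the String error type
// is an assumption (DataFusion uses its own error type via exec_err!).
const MICROS_PER_SECOND: i64 = 1_000_000;

fn float_secs_to_micros(val: f64, enable_ansi_mode: bool) -> Result<Option<i64>, String> {
    if val.is_nan() || val.is_infinite() {
        if enable_ansi_mode {
            return Err(format!("Cannot cast {val} to TIMESTAMP"));
        }
        return Ok(None); // non-ANSI: NaN/Infinity become NULL
    }
    let micros = val * MICROS_PER_SECOND as f64;
    if micros >= i64::MIN as f64 && micros <= i64::MAX as f64 {
        Ok(Some(micros as i64)) // fractional seconds preserved as microseconds
    } else if enable_ansi_mode {
        Err(format!("Overflow casting {val} to TIMESTAMP"))
    } else {
        Ok(None) // non-ANSI: overflow becomes NULL
    }
}

fn main() {
    assert_eq!(float_secs_to_micros(1.5, false), Ok(Some(1_500_000)));
    assert_eq!(float_secs_to_micros(f64::NAN, false), Ok(None));
    assert!(float_secs_to_micros(f64::INFINITY, true).is_err());
    assert_eq!(float_secs_to_micros(1e30, false), Ok(None));
}
```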
```rust
if enable_ansi_mode {
    return exec_err!(
        "Cannot cast {} to TIMESTAMP",
        if val.is_nan() { "NaN" } else { "Infinity" }
```
The error message for infinity does not distinguish between positive and negative infinity. This could be confusing for users. Consider providing a more specific message to improve clarity.
It would also be beneficial to add a test case for casting f64::NEG_INFINITY in ANSI mode to ensure it's handled correctly and provides the improved error message.
```diff
- if val.is_nan() { "NaN" } else { "Infinity" }
+ if val.is_nan() { "NaN" } else if val.is_sign_positive() { "Infinity" } else { "-Infinity" }
```
```rust
for i in 0..arr.len() {
    if arr.is_null(i) {
        builder.append_null();
    } else {
        let val: f64 = arr.value(i).into();
        match float_secs_to_micros(val, enable_ansi_mode)? {
            Some(micros) => builder.append_value(micros),
            None => builder.append_null(),
        }
    }
}
```
The loop can be written more idiomatically using an iterator over the array. This can make the code more concise and easier to read by leveraging standard Rust patterns.
```rust
for val in arr.iter() {
    let micros = if let Some(val) = val {
        float_secs_to_micros(val.into(), enable_ansi_mode)?
    } else {
        None
    };
    builder.append_option(micros);
}
```
🤖 Augment PR Summary

Summary: Adds Spark-compatible casting from floating-point seconds to microsecond timestamps, with behavior gated by ANSI mode.

Changes:

Technical Notes: Float inputs are interpreted as seconds since Unix epoch (fractional seconds supported) and converted to microsecond timestamps; special values (NaN/Infinity) are NULL in non-ANSI mode and errors in ANSI mode.

🤖 Was this summary useful? React with 👍 or 👎
```rust
    return Ok(None);
}
let micros = val * MICROS_PER_SECOND as f64;
if micros >= i64::MIN as f64 && micros <= i64::MAX as f64 {
```
datafusion/spark/src/function/conversion/cast.rs:58: The bounds check uses i64::MAX as f64 (rounded to 2^63), so a value that rounds to exactly 2^63 microseconds can pass this check and then micros as i64 will saturate to i64::MAX, bypassing the intended overflow error/NULL behavior (including in ANSI mode).
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
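The boundary issue flagged above can be reproduced in isolation. This is a standalone sketch, independent of the PR's code:

```rust
fn main() {
    // i64::MAX = 9_223_372_036_854_775_807 is not exactly representable in
    // f64, so the cast rounds up to 2^63 = 9_223_372_036_854_775_808.0.
    let bound = i64::MAX as f64;
    assert_eq!(bound, 9_223_372_036_854_775_808.0);

    // A value equal to 2^63 microseconds therefore passes a
    // `micros <= i64::MAX as f64` check...
    let micros = 9_223_372_036_854_775_808.0_f64;
    assert!(micros <= bound);

    // ...and the subsequent `as i64` cast saturates to i64::MAX
    // (float-to-int `as` casts saturate as of Rust 1.45), silently
    // bypassing the intended overflow error/NULL path.
    assert_eq!(micros as i64, i64::MAX);
}
```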
```rust
if enable_ansi_mode {
    return exec_err!("Overflow casting {} to TIMESTAMP", val);
}
Ok(None)
```
datafusion/spark/src/function/conversion/cast.rs:64: Spark’s non-ANSI float/double→timestamp cast appears to saturate on overflow (via (d * MICROS_PER_SECOND).toLong), so returning NULL on overflow here may diverge from Spark compatibility for very large inputs.
Severity: medium
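If matching Spark's saturating behavior is the goal, the non-ANSI path could saturate on finite overflow instead of returning NULL. A sketch under the assumption (stated in the comment above) that Spark computes `(d * MICROS_PER_SECOND).toLong`; the helper name is hypothetical:

```rust
const MICROS_PER_SECOND: i64 = 1_000_000;

// Hypothetical saturating variant for non-ANSI mode: finite overflow
// saturates to i64::MIN/i64::MAX via Rust's float-to-int `as` cast
// (defined as saturating since Rust 1.45); NaN/Infinity remain NULL,
// as in the PR's current behavior.
fn float_secs_to_micros_saturating(val: f64) -> Option<i64> {
    if val.is_nan() || val.is_infinite() {
        return None;
    }
    Some((val * MICROS_PER_SECOND as f64) as i64)
}

fn main() {
    assert_eq!(float_secs_to_micros_saturating(1e30), Some(i64::MAX));
    assert_eq!(float_secs_to_micros_saturating(-1e30), Some(i64::MIN));
    assert_eq!(float_secs_to_micros_saturating(f64::NAN), None);
}
```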
```rust
// Spark only supports signed integers, so we explicitly list them
let signed_int_signatures = [
// Supported input types: signed integers and floats
```
datafusion/spark/src/function/conversion/cast.rs:112: The spark_cast doc comment still lists only integer→timestamp and describes saturating overflow, but this PR adds float→timestamp with distinct NaN/Infinity/overflow behavior, so the documentation is now inconsistent with the implementation.
Severity: low
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```rust
}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let enable_ansi_mode = args.config_options.execution.enable_ansi_mode;
```
Float-to-timestamp can produce NULLs despite non-nullable return field
Medium Severity
return_field_from_args computes nullable solely from whether input fields are nullable. With the new float-to-timestamp conversion, float_secs_to_micros returns None (producing NULLs) for NaN, Infinity, and overflow values in non-ANSI mode — even when the input float column is declared non-nullable. This creates a mismatch: the output field claims to be non-nullable, but the actual data may contain NULLs. Downstream optimizations that skip null checks based on this metadata could produce incorrect results.
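One way to close this gap, sketched here as a hypothetical pure function rather than DataFusion's actual `return_field_from_args` API, is to make the computed nullability account for the cast itself introducing NULLs in non-ANSI mode:

```rust
// Hypothetical, simplified nullability computation. In non-ANSI mode a
// float -> timestamp cast can itself produce NULLs (NaN/Infinity/overflow),
// so the result must be nullable even for a non-nullable float input.
enum Ty {
    Float64,
    Int64,
}

fn result_nullable(input_ty: &Ty, input_nullable: bool, enable_ansi_mode: bool) -> bool {
    let cast_may_introduce_null = !enable_ansi_mode && matches!(input_ty, Ty::Float64);
    input_nullable || cast_may_introduce_null
}

fn main() {
    // Non-nullable float input still yields a nullable timestamp in non-ANSI mode.
    assert!(result_nullable(&Ty::Float64, false, false));
    // In ANSI mode special values raise errors instead of producing NULLs.
    assert!(!result_nullable(&Ty::Float64, false, true));
    // Integer inputs keep the input's nullability.
    assert!(!result_nullable(&Ty::Int64, false, false));
}
```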
```rust
return exec_err!(
    "Cannot cast {} to TIMESTAMP",
    if val.is_nan() { "NaN" } else { "Infinity" }
);
```
Error message misidentifies negative infinity as positive
Low Severity
When val is f64::NEG_INFINITY, val.is_nan() is false, so the error message falls through to the else branch producing "Cannot cast Infinity to TIMESTAMP". The actual value is -Infinity, so the error message provides incorrect diagnostic information. The if/else expression doesn't distinguish positive from negative infinity.
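The misclassification and the suggested fix can be checked directly in a standalone snippet:

```rust
fn main() {
    let val = f64::NEG_INFINITY;

    // Original logic: is_nan() is false for -Infinity, so it falls through
    // to the "Infinity" branch, mislabeling the value.
    let label = if val.is_nan() { "NaN" } else { "Infinity" };
    assert_eq!(label, "Infinity"); // misleading for -Infinity

    // Suggested fix: is_sign_positive() distinguishes the two infinities.
    let label = if val.is_nan() {
        "NaN"
    } else if val.is_sign_positive() {
        "Infinity"
    } else {
        "-Infinity"
    };
    assert_eq!(label, "-Infinity");
}
```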
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/spark/src/function/conversion/cast.rs (1)
345-356: ⚠️ Potential issue | 🟡 Minor

Test helper panics if `timezone` is `None`.

The function signature accepts `Option<&str>` but line 354 calls `timezone.unwrap()` unconditionally. This will panic if `None` is passed. Consider either:

- Changing the signature to `timezone: &str`, since `None` is not supported, or
- Handling `None` gracefully with `.map(Arc::from)`

This is test-only code and all current callers pass `Some(...)`, so impact is low.

Proposed fix (option 1, change signature):
```diff
 fn make_args_with_timezone(
     input: ColumnarValue,
     target_type: &str,
-    timezone: Option<&str>,
+    timezone: &str,
 ) -> ScalarFunctionArgs {
     let return_field = Arc::new(Field::new(
         "result",
         DataType::Timestamp(
             TimeUnit::Microsecond,
-            Some(Arc::from(timezone.unwrap())),
+            Some(Arc::from(timezone)),
         ),
         true,
     ));
     let mut config = ConfigOptions::default();
-    if let Some(tz) = timezone {
-        config.execution.time_zone = Some(tz.to_string());
-    }
+    config.execution.time_zone = Some(timezone.to_string());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/spark/src/function/conversion/cast.rs` around lines 345 - 356, The helper make_args_with_timezone currently unwraps timezone unconditionally causing a panic when timezone is None; either change the signature to require timezone: &str and update all callers to pass a str, or keep timezone: Option<&str> and construct the Timestamp field with timezone.map(Arc::from) (i.e. pass Some(Arc::from(...)) when present and None otherwise) so the Arc::from(timezone.unwrap()) call in the creation of return_field (and any places using timezone) is replaced with a safe Option<Arc<str>> conversion.
🧹 Nitpick comments (1)
datafusion/spark/src/function/conversion/cast.rs (1)
57-59: Minor precision edge case in overflow boundary check.

The comparison `micros <= i64::MAX as f64` has a subtle precision issue: `i64::MAX as f64` rounds up to `9223372036854775808.0` (one more than `i64::MAX`) due to f64's limited precision. Values equal to this threshold could pass the check but would saturate when cast to i64.

This is benign in practice: Rust 1.45+ defines saturating behavior for such casts, and this edge case represents timestamps billions of years in the future. The current behavior is also consistent with the saturating approach used for integer overflow in `secs_to_micros`.

No action required unless stricter boundary enforcement is desired.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/spark/src/function/conversion/cast.rs` around lines 57 - 59, The check against the i64 upper bound can mispass due to f64 rounding of i64::MAX; update the condition in the cast logic that uses micros (and mirror the same change in secs_to_micros if present) to use a strict less-than or subtract one unit before comparison — e.g. replace `micros <= i64::MAX as f64` with `micros < i64::MAX as f64` or `micros <= (i64::MAX as f64) - 1.0` so values that round up to 9223372036854775808.0 don't erroneously pass and then saturate when cast to i64. Ensure you reference the `micros` variable and `MICROS_PER_SECOND` in the same block when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@datafusion/spark/src/function/conversion/cast.rs`:
- Around line 345-356: The helper make_args_with_timezone currently unwraps
timezone unconditionally causing a panic when timezone is None; either change
the signature to require timezone: &str and update all callers to pass a str, or
keep timezone: Option<&str> and construct the Timestamp field with
timezone.map(Arc::from) (i.e. pass Some(Arc::from(...)) when present and None
otherwise) so the Arc::from(timezone.unwrap()) call in the creation of
return_field (and any places using timezone) is replaced with a safe
Option<Arc<str>> conversion.
---
Nitpick comments:
In `@datafusion/spark/src/function/conversion/cast.rs`:
- Around line 57-59: The check against the i64 upper bound can mispass due to
f64 rounding of i64::MAX; update the condition in the cast logic that uses
micros (and mirror the same change in secs_to_micros if present) to use a strict
less-than or subtract one unit before comparison — e.g. replace `micros <=
i64::MAX as f64` with `micros < i64::MAX as f64` or `micros <= (i64::MAX as f64)
- 1.0` so values that round up to 9223372036854775808.0 don't erroneously pass
and then saturate when cast to i64. Ensure you reference the `micros` variable
and `MICROS_PER_SECOND` in the same block when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 20bef460-ebba-43fb-81ed-bfdd7983a901
📒 Files selected for processing (2)
- datafusion/spark/src/function/conversion/cast.rs
- datafusion/sqllogictest/test_files/spark/conversion/cast_float_to_timestamp.slt