feat: Support casting string non int numeric types #39

martin-augment wants to merge 9 commits into main
Conversation
Walkthrough

This change extends Comet's casting capabilities to support string-to-numeric conversions for floating-point and decimal types. The documentation, native Rust implementation, and Scala compatibility logic are updated to mark string-to-float, string-to-double, and string-to-decimal conversions as compatible instead of incompatible. The Rust implementation adds parsing utilities for special float values and decimal formats, including exponent notation support. Test coverage is expanded with multiple ANSI-aware test scenarios covering edge cases and special values.
Pull Request Review: Support casting string to non-int numeric types

Summary

This PR adds comprehensive support for casting strings to Float, Double, and Decimal types, removing several compatibility restrictions. The implementation adds ~430 lines of Rust code and extensive test coverage.

Positive Aspects

1. Excellent Test Coverage
2. Complete Feature Implementation
3. Documentation Updates
Issues and Concerns

1. CRITICAL: Potential Division-by-Zero Logic Error

Location:

```rust
let quotient_opt = mantissa.checked_div(divisor);
// Check if divisor is 0
if quotient_opt.is_none() {
    return Ok(None);
}
```

Issue: The comment says "Check if divisor is 0", but `i128::checked_div` returns `None` only for division by zero or the `i128::MIN / -1` overflow case; the divisor here is a positive power of ten, so neither can occur and the branch is effectively unreachable.

Recommendation: Either remove this check entirely or clarify the comment to say "Check for overflow in division".

2. Bug: Incorrect Rounding Behavior

Location:

```rust
// Round half up: if abs(remainder) >= divisor/2, round away from zero
let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};
```

Issue: The rounding logic is incorrect for negative numbers. "Half-up" rounding means rounding towards positive infinity when exactly at 0.5, not "away from zero". Example: for -2.5, half-up yields -2, while the code above yields -3.

Recommendation: The correct implementation should be:

```rust
let rounded = if remainder.abs() > half_divisor {
    // Clear case: round away from zero
    if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
} else if remainder.abs() == half_divisor {
    // Exactly at half: round towards positive infinity (half-up)
    if mantissa >= 0 { quotient + 1 } else { quotient }
} else {
    quotient
};
```

Or verify that Spark uses "round half away from zero" instead of "round half up". If so, update the comment.

3. Code Duplication

Location: cast_string_to_decimal128_impl and cast_string_to_decimal256_impl

These two functions have nearly identical logic (96% identical). The only difference is the i128 vs i256 value handling for the output array.

Recommendation: Consider extracting common logic or using a trait-based approach to reduce duplication and improve maintainability.

4. Missing Edge Case Tests

While test coverage is good, consider adding tests for additional edge cases.
5. Performance Consideration

Location:

```rust
let s_lower = s.to_lowercase();
```

Issue: This creates a new allocation for every string, even when the string doesn't contain special values.

Recommendation: Consider checking for special values using case-insensitive comparison before allocating:

```rust
// Fast path: if no special chars, try parse directly
if !s.contains(|c: char| matches!(c, 'i' | 'I' | 'n' | 'N' | 'd' | 'D' | 'f' | 'F')) {
    return s.parse::<F>().ok();
}
// Slow path: handle special cases
let s_lower = s.to_lowercase();
// ... rest of logic
```

6. Minor: Inconsistent Error Handling

Location: The function uses …

7. Documentation: Comment Typo

Location:

```rust
// Exponent part must have optional sign followed by atleast a digit
```

Typo: "atleast" should be "at least" (two words).

8. Missing Unit Tests for Helper Functions

The helper functions lack dedicated unit tests.

Security Considerations

✅ No obvious security issues detected:
Performance Considerations
Code Quality

Strengths:

Areas for improvement:

Recommendations Summary

Must Fix (Critical):

Should Fix (Important):

Nice to Have:
Verdict

This is a solid implementation with excellent test coverage. However, there is a critical bug in the rounding logic that needs to be fixed before merging. Once the rounding issue is addressed and verified against Spark's behavior, this PR will significantly improve Comet's compatibility with Spark.

Recommendation: Request Changes - Fix the rounding bug and verify against Spark's exact behavior, then approve.
```rust
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {
```
For to_type == Float16, this arm builds a Float32 array via Float32Type; that returns an array with the wrong Arrow type for a Float16 cast and could lead to downstream type mismatches. Consider ensuring the returned ArrayRef matches Float16 semantics (also applies to other locations in the PR).
value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not, and DataFusion Comet should behave as Spark does, i.e. map Float16 to Spark's FloatType (32-bit floating point).
```rust
    cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
}
(Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
```
Only Utf8 is handled for string→float casts; LargeUtf8 is not included here, which makes behavior inconsistent with decimal handling (Utf8 | LargeUtf8). Consider supporting LargeUtf8 here as well.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The LargeUtf8 and Utf8View types are missed here. They could be added for broader support of the TEXT types. Prevents bug reports when users try to cast LargeUtf8/Utf8View to Decimal values.
```rust
    | DataType::Utf8 // note that there can be formatting differences
),
DataType::Utf8 if allow_incompat => matches!(
DataType::Utf8 if allow_incompat => {
```
With allow_incompat = true, Utf8 no longer reports compatibility for casts to Float32/Float64, while the default branch does—enabling the flag unexpectedly reduces compatibility. This looks inverted relative to intent and may incorrectly block valid casts when the flag is on.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! It seems some functionality is lost with this PR/improvement. It would be nice to restore the support for casting a String to Float32/64. Prevents regression.
```rust
/// Parse a decimal string into mantissa and scale
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
```
parse_decimal_str returns an i128 mantissa, so the string→Decimal256 path is effectively limited to values that fit in i128 magnitude. This may reject valid Decimal256 inputs with larger precision even though Decimal256 supports wider ranges.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The support for i128 comes from is_validate_decimal_precision(). It does not support (yet) i256. Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (1)
681-715: Potential duplicate test coverage.

The tests at lines 681-685 ("cast StringType to FloatType special values") and lines 711-715 ("ANSI support - cast StringType to FloatType special values") appear to test the same scenario with the same inputs (specialValues.toDF("a"), DataTypes.FloatType, and iterating over testAnsi = true/false). Consider consolidating these into a single test or clarifying if there's an intentional difference.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/source/user-guide/latest/compatibility.md (1 hunks)
- native/spark-expr/src/conversion_funcs/cast.rs (6 hunks)
- spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
- spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (2)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
- spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1): Compatible(31-31)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (10)
- spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala (1): test(48-55)
- spark/src/test/scala/org/apache/comet/exec/CometExecSuite.scala (1): test(59-68)
- spark/src/test/scala/org/apache/comet/exec/CometWindowExecSuite.scala (1): test(39-49)
- spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala (1): test(37-44)
- spark/src/test/scala/org/apache/comet/exec/CometColumnarShuffleSuite.scala (1): test(56-68)
- spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala (1): test(38-48)
- spark/src/test/scala/org/apache/spark/sql/comet/ParquetDatetimeRebaseSuite.scala (2): test(142-147), test(151-161)
- spark/src/test/scala/org/apache/comet/exec/CometExec3_4PlusSuite.scala (1): test(59-66)
- native/spark-expr/src/conversion_funcs/cast.rs (2): v(2699-2699), values(2609-2609)
- spark/src/test/scala/org/apache/comet/DataGenerator.scala (2): generateStrings(51-53), generateStrings(56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (10)
docs/source/user-guide/latest/compatibility.md (1)

162-164: LGTM! The documentation correctly updates the compatibility table to reflect the new string-to-float, string-to-double, and string-to-decimal cast support. The changes are consistent with the implementation updates in the Scala and Rust code.
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: LGTM! The compatibility logic correctly marks string-to-float, string-to-double, and string-to-decimal casts as Compatible(), enabling native Comet execution for these cast operations. This aligns with the new Rust implementation that handles these conversions.

spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)
112-136: LGTM! The specialValues sequence provides comprehensive coverage of edge cases for string-to-numeric casting, including:

- Float/double suffixes (1.5f, 2.0d)
- Infinity variants (inf, Inf, INF, +inf, -inf, +Infinity, -Infinity)
- NaN variants (NaN, nan, NAN)
- Scientific notation (1.23e4, 1.23E4, -1.23e-4)
- Whitespace handling (" 123.456789 ")
- Edge cases ("", "xyz", null)

This is a well-designed test input set.
734-791: LGTM! Good coverage for string-to-decimal casting with:

- Various precision/scale combinations: DecimalType(22,2), DecimalType(2,2), DecimalType(38,10)
- Basic values including edge cases like ".5", "5.", "+123.45", "inf", empty string
- Scientific notation support
- ANSI mode testing for both true/false

The test cases align well with Spark's decimal parsing behavior.
native/spark-expr/src/conversion_funcs/cast.rs (6)

219-221: LGTM! The can_cast_from_string function correctly enables casting to Float32, Float64, Decimal128, and Decimal256 types, which aligns with the new casting capabilities being added.
1341-1413: LGTM! The parse_decimal_str function correctly handles:

- Scientific notation with positive/negative exponents
- Optional sign prefix
- Decimal point handling (including ".5" format)
- Overflow protection via checked_mul and checked_add
- Proper scale calculation: fractional_scale - exponent

The implementation aligns with Spark's decimal parsing semantics.
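The mantissa/scale convention in the quoted doc comment ("123.45" -> (12345, 2)) can be illustrated with a simplified, dependency-free sketch. This helper is illustrative only, not the PR's implementation: it omits exponent handling and format validation, but uses the same checked-arithmetic idea.

```rust
// Illustrative re-implementation of the mantissa/scale parsing described
// above. The real code in the PR also handles exponents and validation.
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
    let (neg, body) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s.strip_prefix('+').unwrap_or(s)),
    };
    let mut mantissa: i128 = 0;
    let mut scale: i32 = 0;
    let mut seen_point = false;
    for c in body.chars() {
        match c {
            '.' if !seen_point => seen_point = true,
            d @ '0'..='9' => {
                // checked_mul/checked_add guard against i128 overflow
                mantissa = mantissa
                    .checked_mul(10)
                    .and_then(|m| m.checked_add((d as u8 - b'0') as i128))
                    .ok_or_else(|| format!("overflow parsing '{s}'"))?;
                if seen_point {
                    scale += 1;
                }
            }
            _ => return Err(format!("invalid decimal '{s}'")),
        }
    }
    Ok((if neg { -mantissa } else { mantissa }, scale))
}

fn main() {
    assert_eq!(parse_decimal_str("123.45"), Ok((12345, 2)));
    assert_eq!(parse_decimal_str("-0.001"), Ok((-1, 3)));
    assert_eq!(parse_decimal_str(".5"), Ok((5, 1)));
    assert!(parse_decimal_str("1.2.3").is_err());
}
```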
1259-1339: LGTM! The parse_string_to_decimal function correctly implements Spark's decimal parsing behavior:

- Returns None for special float values (inf, nan) which aren't valid decimals
- Uses format validation before parsing
- Implements HALF_UP rounding (round half away from zero)
- Handles overflow protection for extreme scale adjustments
- Validates result fits target precision

The rounding logic at lines 1308-1318 correctly rounds away from zero for both positive and negative values when the remainder is >= half the divisor.
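The two tie-breaking rules at issue (the earlier review called the away-from-zero behavior a bug; this one deems it correct, matching Java's RoundingMode.HALF_UP, where "up" means away from zero) can be contrasted with a small dependency-free sketch. round_div here is an illustrative helper, not code from the PR:

```rust
// Contrast "ties away from zero" with "ties toward +infinity" for
// integer mantissa / divisor rounding, as discussed in the reviews.
fn round_div(mantissa: i128, divisor: i128, ties_away_from_zero: bool) -> i128 {
    let quotient = mantissa / divisor; // truncates toward zero
    let remainder = mantissa % divisor; // carries the sign of mantissa
    let half = divisor / 2;
    if remainder.abs() > half {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else if remainder.abs() == half {
        match (ties_away_from_zero, mantissa >= 0) {
            (_, true) => quotient + 1,     // 2.5 -> 3 under both rules
            (true, false) => quotient - 1, // -2.5 -> -3 (away from zero)
            (false, false) => quotient,    // -2.5 -> -2 (toward +infinity)
        }
    } else {
        quotient
    }
}

fn main() {
    assert_eq!(round_div(-25, 10, true), -3);  // half away from zero
    assert_eq!(round_div(-25, 10, false), -2); // half up (toward +infinity)
    assert_eq!(round_div(25, 10, true), 3);    // positive ties agree
    assert_eq!(round_div(24, 10, true), 2);    // below half: truncate
    assert_eq!(round_div(-26, 10, false), -3); // above half: away from zero
}
```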
1415-1491: LGTM! The float parsing implementation correctly handles:

- Null values checked before accessing (per retrieved learnings)
- Special float strings: inf, -inf, nan (case-insensitive)
- Spark's D/F suffix convention (e.g., "1.5f", "2.0D")
- Scientific notation via Rust's standard parse
- ANSI mode error propagation with invalid_value error

The implementation aligns with Spark's string-to-float casting behavior.
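The D/F suffix convention mentioned above can be sketched without the Arrow machinery. parse_spark_float is a hypothetical helper, not the PR's function; it assumes the rule of accepting one trailing d/D/f/F on numeric float strings while leaving "inf"-style words intact:

```rust
// Hypothetical helper: strip a single trailing D/F suffix before parsing,
// leaving words like "inf" untouched (Rust's f64 parser accepts "inf").
fn parse_spark_float(s: &str) -> Option<f64> {
    let t = s.trim();
    let core = match t.bytes().last() {
        Some(b'd' | b'D' | b'f' | b'F')
            if t.len() > 1
                && !t.eq_ignore_ascii_case("inf")
                && !t.eq_ignore_ascii_case("+inf")
                && !t.eq_ignore_ascii_case("-inf") =>
        {
            &t[..t.len() - 1]
        }
        _ => t,
    };
    core.parse::<f64>().ok()
}

fn main() {
    assert_eq!(parse_spark_float("1.5f"), Some(1.5));
    assert_eq!(parse_spark_float("2.0D"), Some(2.0));
    assert_eq!(parse_spark_float(" 123.456789 "), Some(123.456789));
    assert_eq!(parse_spark_float("inf"), Some(f64::INFINITY));
    assert_eq!(parse_spark_float("xyz"), None);
}
```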
1179-1257: LGTM! The is_valid_decimal_format function provides robust validation:

- Handles optional leading sign (+/-)
- Rejects invalid sign combinations (++, +-)
- Validates decimal point (only one allowed, not after exponent)
- Validates exponent format (optional sign, at least one digit required)
- Returns false for empty strings or strings without digits

The byte-based parsing is efficient and correctly handles all documented edge cases.
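The validation rules enumerated above can be captured in a short standalone sketch. This is an illustrative approximation of the grammar ([+-]? digits [. digits]? ([eE] [+-]? digits)?, with at least one digit), not the PR's exact byte-level implementation:

```rust
// Illustrative byte-based format check for decimal strings,
// approximating the rules listed in the review comment above.
fn is_valid_decimal_format(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;
    // optional leading sign
    if i < b.len() && (b[i] == b'+' || b[i] == b'-') {
        i += 1;
    }
    let mut digits = 0;
    let mut seen_point = false;
    while i < b.len() {
        match b[i] {
            b'0'..=b'9' => digits += 1,
            b'.' if !seen_point => seen_point = true, // one point, pre-exponent only
            b'e' | b'E' => break,
            _ => return false, // rejects second sign, second point, junk
        }
        i += 1;
    }
    if digits == 0 {
        return false; // no digits at all (e.g. "", "+", ".")
    }
    // optional exponent: optional sign then at least one digit
    if i < b.len() {
        i += 1; // skip 'e'/'E'
        if i < b.len() && (b[i] == b'+' || b[i] == b'-') {
            i += 1;
        }
        if i >= b.len() || !b[i..].iter().all(|c| c.is_ascii_digit()) {
            return false;
        }
    }
    true
}

fn main() {
    assert!(is_valid_decimal_format("-1.23e-4"));
    assert!(is_valid_decimal_format(".5"));
    assert!(!is_valid_decimal_format("1.2.3"));
    assert!(!is_valid_decimal_format("1e"));
    assert!(!is_valid_decimal_format("++1"));
}
```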
1620-1626: Remove unreachable DataFusion fallback logic for Utf8 → Float32/Float64 conversions.

The cast_array function explicitly handles Utf8 → Float32/Float64 conversions via cast_string_to_float at line 969, so the fallback match arm at line 1041 that calls is_datafusion_spark_compatible will never be reached for these types. The code at lines 1620-1626 that allows Float32/Float64 for Utf8 without allow_incompat is therefore unreachable and should either be removed, or the custom handler should be removed for consistency.
```rust
fn cast_string_to_decimal256_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = PrimitiveBuilder::<Decimal256Type>::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    // Convert i128 to i256
                    let i256_value = i256::from_i128(decimal_value);
                    decimal_builder.append_value(i256_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}
```
🧩 Analysis chain
🌐 Web query:
Spark Decimal256 maximum precision digits Apache
💡 Result:
- Apache Spark DecimalType: maximum precision = 38 digits. [1][2]
- "Decimal256" (Apache Arrow / Decimal256 types) supports precision up to 76 digits. [3]
- Spark does not currently support Decimal256 natively; there is an open ASF JIRA requesting Decimal256 / higher-precision decimal support (SPARK-50293). [4]
Sources:
- Spark DecimalType docs (MAX_PRECISION = 38).
- PySpark / Spark SQL docs (DecimalType precision up to 38).
- Apache Arrow pyarrow.decimal256 docs (precision must be between 1 and 76).
- ASF JIRA SPARK-50293 (request to support DECIMAL256 / higher precision).
🏁 Script executed:

```shell
# Search for precision validation and limits in the codebase
rg -B2 -A2 "precision" native/spark-expr/src/conversion_funcs/cast.rs | head -50
```

Repository: martin-augment/datafusion-comet
Length of output: 2126
🏁 Script executed:

```shell
# Check if there are other decimal-related functions or comments about Decimal256
rg "Decimal256|precision.*76|precision.*38" native/spark-expr/src/ -B2 -A2
```

Length of output: 13445
🏁 Script executed:

```shell
# Look at the parse_string_to_decimal function to understand its constraints
rg -B5 -A15 "fn parse_string_to_decimal" native/spark-expr/src/
```

Length of output: 1673
🏁 Script executed:

```shell
# Search for any constraints or validation on Decimal256 precision
rg "Decimal256|MAX_PRECISION|precision.*38" native/spark-expr/src/ -B3 -A3 | head -80
```

Length of output: 6383
🏁 Script executed:

```shell
# Check if there are TODOs or FIXMEs related to Decimal256 or precision
rg "TODO|FIXME|XXX" native/spark-expr/src/conversion_funcs/cast.rs
```

Length of output: 144
🏁 Script executed:

```shell
# Look for any documentation or comments about supported decimal ranges
rg -A10 "fn cast_string_to_decimal\(" native/spark-expr/src/conversion_funcs/cast.rs | head -30
```

Length of output: 388
The cast_string_to_decimal256_impl function has a precision limitation due to using i128 intermediate values.
The function relies on parse_string_to_decimal, which returns i128, then converts to i256 via i256::from_i128. This approach limits decimal values to i128's range, supporting only precision up to 38 digits. Decimal256 should support up to 76 digits of precision.
If higher precisions are needed in the future (or are already being passed in some code paths), the parsing logic must be extended to handle i256 directly, rather than converting from a limited i128 intermediate.
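The i128 ceiling the analysis refers to is easy to demonstrate in plain Rust: i128 holds at most 39 decimal digits, well short of the 76-digit precision Decimal256 allows.

```rust
// i128 tops out at 39 decimal digits, so an i128 mantissa cannot carry
// the full 76-digit precision that Decimal256 supports.
fn main() {
    assert_eq!(i128::MAX.to_string().len(), 39);
    // A 40-digit number already overflows i128:
    assert!("1".repeat(40).parse::<i128>().is_err());
}
```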
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The support for i128 comes from is_validate_decimal_precision(). It does not support (yet) i256. Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.
```rust
    cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
}
(Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
```
Bug: Missing LargeUtf8 handling for float cast
The match pattern at line 969 only handles Utf8 for string-to-float casts, but can_cast_from_string (line 215-221) claims both Utf8 and LargeUtf8 are supported for Float32 and Float64. When a LargeUtf8 array is passed for float conversion, it will not match the pattern, fall through to the catch-all case, and return an error since is_datafusion_spark_compatible doesn't handle LargeUtf8 either.
Additional Locations (1)
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Support is missing for both the LargeUtf8 and Utf8View types. They could easily be supported by using the GenericStringArray arrow data type in cast_string_to_float(). Prevents shipping an incomplete casting improvement.
```rust
}
(Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
    cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
}
```
Bug: LargeUtf8 decimal cast fails due to wrong downcast
The match patterns at lines 970-975 claim to handle Utf8 | LargeUtf8 for decimal casts, but the implementation functions cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray (which represents Utf8). When a LargeUtf8 array is passed, the downcast will fail and return an error because LargeUtf8 uses LargeStringArray, not StringArray.
Additional Locations (2)
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! To be able to support the LargeUtf8 type it needs to cast to LargeStringArray or GenericStringArray. Prevents a runtime error when the caller passes an array of LargeUtf8 instances.
```rust
DataType::Utf8 if allow_incompat => matches!(
DataType::Utf8 if allow_incompat => {
    matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
}
```
Bug: Decimal128/256 inconsistent with cast_supported for incompat mode
The is_datafusion_spark_compatible function requires allow_incompat to be true for Utf8 to Decimal128 delegation, but can_cast_from_string unconditionally returns true for Decimal128 casts. While the custom implementation at lines 970-972 now handles this case (making this code path less relevant), the documentation in compatibility.md was updated to show these casts as compatible, creating an inconsistency in the fallback logic. If the custom implementation were to fail or be removed, the fallback would unexpectedly require allow_incompat=true despite the cast being documented as compatible.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! It seems some functionality is lost with this PR/improvement. It would be nice to restore the support for casting a String to Float32/64. Prevents regression.
value:annoying; category:bug; feedback:The Claude AI reviewer is not correct!

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The same could be achieved by using String::eq_ignore_ascii_case() without allocating a new temporary string.
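A sketch of the allocation-free check this feedback suggests; the helper name and the exact set of special strings are illustrative assumptions, not the PR's code:

```rust
// Match Spark-style special float strings case-insensitively without
// allocating a lowercased copy (str::eq_ignore_ascii_case is allocation-free).
fn parse_special_float(s: &str) -> Option<f64> {
    let t = s.trim();
    if t.eq_ignore_ascii_case("nan") {
        Some(f64::NAN)
    } else if ["inf", "+inf", "infinity", "+infinity"]
        .iter()
        .any(|p| t.eq_ignore_ascii_case(p))
    {
        Some(f64::INFINITY)
    } else if ["-inf", "-infinity"].iter().any(|p| t.eq_ignore_ascii_case(p)) {
        Some(f64::NEG_INFINITY)
    } else {
        None // not special; fall through to the normal numeric parser
    }
}

fn main() {
    assert_eq!(parse_special_float("-Infinity"), Some(f64::NEG_INFINITY));
    assert_eq!(parse_special_float("INF"), Some(f64::INFINITY));
    assert!(parse_special_float("NaN").unwrap().is_nan());
    assert_eq!(parse_special_float("123.4"), None);
}
```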
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The error message is very generic, and if the error is handled/logged somewhere up in the hierarchy it will be hard for the developer to understand which function returned it and what the context was, i.e. what input was not accepted.

value:good-to-have; category:documentation; feedback:The Claude AI reviewer is correct! The comment sounds wrong - "must have" + "optional". Also, "atleast" should be two words: "at" and "least". A better comment could be: "The exponent part should have an optional sign followed by at least one digit".

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The tests are very similar and could be combined into one test. Prevents code duplication and higher maintenance cost.

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.