2835: feat: Support string non int numeric types #32
martin-augment wants to merge 2 commits into main from
Conversation
Walkthrough
This pull request extends casting support from strings to numeric types in the Comet query engine. The documentation was updated to reflect string-to-float, string-to-double, and string-to-decimal casts moving from incompatible to compatible. Native Rust code implements parsing logic for converting strings to decimal and floating-point types, including special value handling. The Scala compatibility layer marks these casts as compatible rather than incompatible. Test coverage was expanded with special value test cases for float, double, and decimal casting operations across ANSI and non-ANSI modes.
✨ Finishing touches
```rust
        cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
    }
    (Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
    (Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
```
This match arm includes Float16, but cast_string_to_float only handles Float32/Float64; attempting STRING→FLOAT16 will hit the catch-all and return an internal error.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! cast_string_to_float() (defined at line 1301) currently supports only Float32 and Float64, but not Float16. Trying to cast a string to Float16 will return an internal error.
```diff
             ),
-            DataType::Utf8 if allow_incompat => matches!(
+            DataType::Utf8 if allow_incompat => {
+                matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
```
In is_datafusion_spark_compatible, the Utf8 if allow_incompat branch excludes Float32/Float64 and Decimal256, so enabling allow_incompat actually removes float compatibility and omits decimal256—likely unintended given the new support.
value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct! Neither the old nor the new version of the code supports Float32/64 or Decimal256 in any of the match arms, with or without allow_incompat. Fixing this prevents error results at runtime when any of these types is used.
```scala
      testAnsi = false)
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
```
In the "cast StringType to DoubleType" test, the cast uses DataTypes.FloatType instead of DataTypes.DoubleType, so it doesn't exercise double casting.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/source/user-guide/latest/compatibility.md (1 hunks)
- native/spark-expr/src/conversion_funcs/cast.rs (7 hunks)
- spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
- spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (1)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
Compatible(31-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (8)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
187-190: LGTM - String to float/double and decimal casts are now marked as compatible.
The change from `Incompatible` to `Compatible()` aligns with the native Rust implementation that now provides Spark-compatible string parsing for these numeric types.
docs/source/user-guide/latest/compatibility.md (1)
162-164: Documentation update correctly reflects the new compatible casts.
The addition of string → float, string → double, and string → decimal to the Compatible Casts table is consistent with the implementation changes.
native/spark-expr/src/conversion_funcs/cast.rs (4)
214-216: LGTM - Extended `can_cast_from_string` to include new types.
The function now correctly reports Float32, Float64, Decimal128, and Decimal256 as valid target types for string casts.
1358-1408: LGTM - `FloatParse` trait implementations handle special values correctly.
The implementations properly handle infinity variants, NaN, and suffix stripping for Spark compatibility.
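The special-value and suffix handling described in this comment can be sketched in std-only Rust. `parse_spark_float` below is a hypothetical stand-in for the `FloatParse` logic, not Comet's actual implementation; it assumes the inf/NaN spellings and 'd'/'f' suffix rules quoted in this review.

```rust
/// Hypothetical stand-in for Comet's FloatParse logic: trims whitespace,
/// maps Spark's infinity/NaN spellings onto f64 specials, then strips a
/// single trailing 'd'/'D'/'f'/'F' type suffix before the numeric parse.
fn parse_spark_float(input: &str) -> Option<f64> {
    let trimmed = input.trim();
    // Special spellings are matched case-insensitively, before suffix stripping,
    // so that the final 'f' of "inf" is not mistaken for a suffix.
    match trimmed.to_ascii_lowercase().as_str() {
        "inf" | "+inf" | "infinity" | "+infinity" => return Some(f64::INFINITY),
        "-inf" | "-infinity" => return Some(f64::NEG_INFINITY),
        "nan" => return Some(f64::NAN),
        _ => {}
    }
    // Strip one trailing type suffix, e.g. "1.5d" or "2f" (ASCII, so byte
    // slicing is safe here).
    let stripped = match trimmed.chars().last() {
        Some('d' | 'D' | 'f' | 'F') => &trimmed[..trimmed.len() - 1],
        _ => trimmed,
    };
    stripped.parse::<f64>().ok()
}
```

A None result corresponds to the invalid-input path, where ANSI mode would raise an error and non-ANSI mode would produce NULL.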
1537-1543: Clarify the conditional logic for Utf8 compatibility with Decimal128.
The `can_cast_from_string` function (line 215) returns `true` for Decimal128 unconditionally, while `is_datafusion_spark_compatible` (lines 1537-1543) requires `allow_incompat` to be true for Utf8→Decimal128 casts. This difference is inconsistent with how other compatibility-dependent casts (like Date32/Date64) are handled, where `can_cast_from_string` itself checks `options.allow_incompat` (lines 217-221). Although Utf8→Decimal128 casts have explicit custom handling in `spark_cast` (lines 965-970) that bypasses the compatibility check, the divergent conditional logic between the two functions should be reconciled or documented to prevent confusion during maintenance.
1194-1197: Inconsistent suffix handling between decimal and float parsing is correct and intentional.
The code correctly implements Spark's behavior: float parsing (`FloatParse`) explicitly strips 'd'/'f' suffixes before parsing (matching Java's `Double.parseDouble()`), while decimal parsing does NOT strip suffixes and lets them fail naturally (matching Spark's `JavaBigDecimal(string)`, which rejects suffixes per Java's `BigDecimal` spec). This inconsistency mirrors the underlying Java APIs and is the correct behavior. No changes needed.
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)
655-679: Good test coverage for special string values.
The `specialValues` sequence provides comprehensive coverage for edge cases including suffixed values, infinity/NaN variants, scientific notation, whitespace, and invalid inputs.
700-733: LGTM - Decimal casting tests cover basic and high precision scenarios.
The tests appropriately cover:
- Basic decimal values with positive/negative numbers
- Precision edge cases like "0.001" and high precision values
- Invalid inputs (inf, empty string, non-numeric)
- The `testAnsi = false` parameter is appropriate since ANSI mode has different error handling
```rust
let string_array = array
    .as_any()
    .downcast_ref::<StringArray>()
    .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
```
LargeUtf8 input will cause a runtime error for decimal casts.
The match pattern at lines 965-970 accepts both Utf8 and LargeUtf8, but downcast_ref::<StringArray>() only works for Utf8. When the input is LargeUtf8, this will return None and cause an error.
```diff
- let string_array = array
-     .as_any()
-     .downcast_ref::<StringArray>()
-     .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
+ // Handle both Utf8 (StringArray) and LargeUtf8 (LargeStringArray)
+ fn get_string_value<O: OffsetSizeTrait>(array: &ArrayRef, i: usize) -> Option<&str> {
+     array.as_any()
+         .downcast_ref::<GenericStringArray<O>>()
+         .map(|arr| if arr.is_null(i) { None } else { Some(arr.value(i)) })
+         .flatten()
+ }
```

Consider using `GenericStringArray<O>` with a generic offset type, similar to the pattern used in `cast_string_to_int`.
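The dispatch-on-both-variants idea can be illustrated without Arrow at all, using `std::any` downcasts. `Utf8Array` and `LargeUtf8Array` here are hypothetical stand-ins for Arrow's `StringArray` (i32 offsets) and `LargeStringArray` (i64 offsets):

```rust
use std::any::Any;

// Hypothetical stand-ins for Arrow's StringArray and LargeStringArray;
// both expose nullable string slots.
struct Utf8Array(Vec<Option<String>>);
struct LargeUtf8Array(Vec<Option<String>>);

// Mirrors the review's suggestion: try both concrete variants instead of
// assuming the small-offset one, so LargeUtf8 input no longer errors out.
fn string_value_at(array: &dyn Any, i: usize) -> Option<&str> {
    if let Some(arr) = array.downcast_ref::<Utf8Array>() {
        return arr.0.get(i)?.as_deref();
    }
    if let Some(arr) = array.downcast_ref::<LargeUtf8Array>() {
        return arr.0.get(i)?.as_deref();
    }
    None // neither variant: the caller would surface an internal error
}
```

In the real code the same effect is cleaner with a function generic over `OffsetSizeTrait`, since one body then covers both offset widths.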
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
native/spark-expr/src/conversion_funcs/cast.rs around lines 1080-1083: the code
unconditionally downcasts to StringArray which fails for LargeUtf8 inputs;
update the logic to handle both Utf8 and LargeUtf8 by using GenericStringArray
with a generic offset type (or by matching the DataType and downcasting to the
appropriate StringArray/LargeStringArray variant as done in cast_string_to_int).
Replace the single downcast_ref::<StringArray>() with a branch that constructs a
GenericStringArray<O> (or performs the correct downcast for each offset type),
then use that generic array for the decimal cast logic so LargeUtf8 inputs are
handled without runtime errors.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! A LargeUtf8 array cannot be downcast to StringArray, so the new code will need to use the generic GenericStringArray<O> instead. Prevents an internal error at runtime when LargeUtf8 is passed.
```rust
match parse_string_to_decimal(str_value, precision, scale) {
    Ok(Some(decimal_value)) => {
        // Convert i128 to i256
        let i256_value = i256::from_i128(decimal_value);
        decimal_builder.append_value(i256_value);
```
Decimal256 precision is limited to i128 range.
The implementation parses the string to i128 first, then converts to i256. This limits Decimal256 values to the i128 range (~38 decimal digits), which may not cover the full Decimal256 precision (up to 76 decimal digits).
For full Decimal256 support, consider parsing directly to i256 or using a big integer library. If this limitation is intentional, add a comment documenting it.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! This is caused by a limitation of parse_string_to_decimal() function. Currently it supports parsing up to the biggest integer type in the standard library - i128. i256 is a custom type provided by the Arrow-rs library and support for it would be nice to have.
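The i128 ceiling is easy to demonstrate with the standard library alone: `i128::MAX` has 39 digits (about 1.7 × 10^38), while a Decimal256 mantissa may need up to 76. The helper below is a hypothetical illustration, not part of the PR:

```rust
/// Returns whether a run of `n_nines` consecutive '9' digits still fits in
/// i128. 38 nines (~9.99e37) fit; 39 nines exceed i128::MAX
/// (170141183460469231731687303715884105727), so parsing a full-precision
/// Decimal256 mantissa through i128 must fail long before 76 digits.
fn digits_fit_in_i128(n_nines: usize) -> bool {
    "9".repeat(n_nines).parse::<i128>().is_ok()
}
```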
```rust
/// Parse a decimal string into (mantissa, scale)
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    if s.is_empty() {
        return Err("Empty string".to_string());
    }

    let negative = s.starts_with('-');
    let s = if negative || s.starts_with('+') {
        &s[1..]
    } else {
        s
    };

    // Split by decimal point
    let parts: Vec<&str> = s.split('.').collect();

    if parts.len() > 2 {
        return Err("Multiple decimal points".to_string());
    }

    let integral_part = parts[0];
    let fractional_part = if parts.len() == 2 { parts[1] } else { "" };

    // Parse integral part
    let integral_value: i128 = if integral_part.is_empty() {
        0
    } else {
        integral_part
            .parse()
            .map_err(|_| "Invalid integral part".to_string())?
    };

    // Parse fractional part
    let scale = fractional_part.len() as i32;
    let fractional_value: i128 = if fractional_part.is_empty() {
        0
    } else {
        fractional_part
            .parse()
            .map_err(|_| "Invalid fractional part".to_string())?
    };

    // Combine: value = integral * 10^scale + fractional
    let mantissa = integral_value
        .checked_mul(10_i128.pow(scale as u32))
        .and_then(|v| v.checked_add(fractional_value))
        .ok_or("Overflow in mantissa calculation")?;

    let final_mantissa = if negative { -mantissa } else { mantissa };

    Ok((final_mantissa, scale))
}
```
Missing scientific notation support in decimal string parsing.
The parse_decimal_str function does not handle scientific notation (e.g., "1.23e4", "1E-5"). The float parsing (FloatParse) handles such values, but they will fail when casting strings like "1.23e4" to decimal.
Spark's BigDecimal constructor supports scientific notation. Consider adding support or documenting this as a known limitation:
```diff
+    // Check for scientific notation - not yet supported
+    if s.contains('e') || s.contains('E') {
+        return Err("Scientific notation not supported".to_string());
+    }
+
     // Split by decimal point
     let parts: Vec<&str> = s.split('.').collect();
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
/// Parse a decimal string into (mantissa, scale)
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    if s.is_empty() {
        return Err("Empty string".to_string());
    }
    let negative = s.starts_with('-');
    let s = if negative || s.starts_with('+') {
        &s[1..]
    } else {
        s
    };
    // Check for scientific notation - not yet supported
    if s.contains('e') || s.contains('E') {
        return Err("Scientific notation not supported".to_string());
    }
    // Split by decimal point
    let parts: Vec<&str> = s.split('.').collect();
    if parts.len() > 2 {
        return Err("Multiple decimal points".to_string());
    }
    let integral_part = parts[0];
    let fractional_part = if parts.len() == 2 { parts[1] } else { "" };
    // Parse integral part
    let integral_value: i128 = if integral_part.is_empty() {
        0
    } else {
        integral_part
            .parse()
            .map_err(|_| "Invalid integral part".to_string())?
    };
    // Parse fractional part
    let scale = fractional_part.len() as i32;
    let fractional_value: i128 = if fractional_part.is_empty() {
        0
    } else {
        fractional_part
            .parse()
            .map_err(|_| "Invalid fractional part".to_string())?
    };
    // Combine: value = integral * 10^scale + fractional
    let mantissa = integral_value
        .checked_mul(10_i128.pow(scale as u32))
        .and_then(|v| v.checked_add(fractional_value))
        .ok_or("Overflow in mantissa calculation")?;
    let final_mantissa = if negative { -mantissa } else { mantissa };
    Ok((final_mantissa, scale))
}
```
🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1246 to 1299,
parse_decimal_str currently only handles fixed-point strings and rejects
scientific notation (e.g. "1.23e4", "1E-5"); update the parser to detect an
optional exponent part (split on 'e' or 'E'), validate and parse the exponent as
a signed integer, then parse the mantissa portion with the existing logic to
produce (mantissa, scale) and adjust the scale by subtracting the exponent
(scale = scale - exponent) so the numeric value remains mantissa * 10^-scale;
ensure you apply checked arithmetic when shifting the mantissa (multiplying or
dividing by powers of ten) and surface clear errors for invalid exponent syntax
or overflow; keep handling of leading sign as before and preserve tests for
corner cases (multiple dots, empty parts).
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! parse_decimal_str() does not support scientific notation. This should either be improved or documented as not supported.
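Following the AI-agent prompt above, here is a std-only sketch of the alternative (supporting the exponent rather than rejecting it). `parse_decimal_sci` is hypothetical, not Comet's code: it splits off the exponent, parses the fixed-point part, and folds the exponent into the scale with checked arithmetic.

```rust
/// Hypothetical sketch: parse "mantissa[eE]exponent" into (mantissa, scale),
/// where the value is mantissa * 10^(-scale). Negative resulting scales are
/// normalized by shifting the mantissa with checked multiplication.
fn parse_decimal_sci(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    // Split off an optional exponent part ("1.23e4" -> "1.23" and 4)
    let (num, exp) = match s.find(|c: char| c == 'e' || c == 'E') {
        Some(pos) => {
            let exp: i32 = s[pos + 1..]
                .parse()
                .map_err(|_| "Invalid exponent".to_string())?;
            (&s[..pos], exp)
        }
        None => (s, 0i32),
    };
    // Optional leading sign
    let (negative, num) = match num.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, num.strip_prefix('+').unwrap_or(num)),
    };
    // Fixed-point part, split on '.'
    let (int_part, frac_part) = match num.split_once('.') {
        Some((i, f)) => (i, f),
        None => (num, ""),
    };
    let digits = [int_part, frac_part].concat();
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return Err("Invalid number".to_string());
    }
    let mut mantissa: i128 = digits.parse().map_err(|_| "Overflow".to_string())?;
    // Fractional digits raise the scale; the exponent lowers it
    let mut scale = frac_part.len() as i32 - exp;
    // Normalize a negative scale by shifting the mantissa left
    while scale < 0 {
        mantissa = mantissa.checked_mul(10).ok_or("Overflow")?;
        scale += 1;
    }
    if negative {
        mantissa = -mantissa;
    }
    Ok((mantissa, scale))
}
```

For example, "1.23e4" yields (12300, 0) and "1E-5" yields (1, 5), matching the mantissa * 10^(-scale) convention used by the PR's `parse_decimal_str`.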
```scala
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
    }
  }
```
Bug: Test casts to FloatType instead of DoubleType.
The test is named "cast StringType to DoubleType" but line 690 casts to FloatType instead of DoubleType. This appears to be a copy-paste error.
```diff
 test("cast StringType to DoubleType") {
   Seq(true, false).foreach { v =>
-    castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
+    castTest(specialValues.toDF("a"), DataTypes.DoubleType, testAnsi = v)
   }
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.DoubleType, testAnsi = v)
    }
  }
```
🤖 Prompt for AI Agents
In spark/src/test/scala/org/apache/comet/CometCastSuite.scala around lines 688
to 692, the test named "cast StringType to DoubleType" mistakenly calls castTest
with DataTypes.FloatType; change the type argument to DataTypes.DoubleType so
the test actually casts to DoubleType, leaving the rest of the test logic
unchanged.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.
PR Review: Support string to non-int numeric types casting

Summary
This PR implements string-to-float, string-to-double, and string-to-decimal casting operations that were previously marked as incompatible. The implementation properly handles special values (inf, NaN), suffixes ('d', 'f'), and ANSI mode. Overall, this is a solid improvement that removes significant compatibility gaps.

Code Quality & Best Practices ✅
Strengths:

Minor Issues:

Potential Bugs 🐛

Performance Considerations ⚡
Good:

Minor optimization opportunity:

Security Concerns 🔒
No major security issues identified. The implementation properly handles:

Test Coverage ✅
Strong test coverage:

Missing test cases:

Documentation ✅
Documentation changes look good:

Action Items
Must Fix (Blocking):

Should Fix (Non-blocking but important):

Nice to Have:

Recommendation
REQUEST CHANGES - The missing function will cause compilation failure. Please fix the blocking issues and I'd be happy to approve! The overall approach is solid and this is a valuable feature addition. Great work on tackling these compatibility gaps! 🚀
```scala
      testAnsi = false)
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
```
Bug: Test for DoubleType incorrectly uses FloatType
The test named "cast StringType to DoubleType" actually tests DataTypes.FloatType instead of DataTypes.DoubleType. This appears to be a copy-paste error from the FloatType test immediately above it. As a result, DoubleType casting is not being tested at all, and the FloatType test is effectively run twice.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.
```rust
    }
    (Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
        cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
    }
```
Bug: LargeUtf8 decimal conversion will fail at runtime
The match patterns include LargeUtf8 for decimal conversions, but the implementation functions cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray. When a LargeUtf8 array (which is LargeStringArray) is passed, the downcast returns None and the function returns an internal error "Expected string array". This differs from the pattern used elsewhere (e.g., boolean conversion) which has separate arms for Utf8 and LargeUtf8.
Additional Locations (2)
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! A LargeUtf8 array cannot be downcast to StringArray, so the new code will need to use the generic GenericStringArray<O> instead. Prevents an internal error at runtime when LargeUtf8 is passed.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.
value:annoying; category:bug; feedback:The Claude AI reviewer is correct! is_validate_decimal_precision() is imported at line 35 (use arrow::datatypes::is_validate_decimal_precision).

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
2835: To review by AI