2835: feat: Support casting string to non-int numeric types #37

martin-augment wants to merge 7 commits into main from
Conversation
Walkthrough

This pull request extends cast compatibility for string-to-numeric conversions. The changes update documentation to reflect string-to-float, string-to-double, and string-to-decimal casts as compatible. The Rust implementation adds parsing logic for converting strings to Decimal128, Decimal256, and floating-point types, with support for special values (inf, nan) and scientific notation. Corresponding Scala code relaxes cast compatibility checks for these conversions. Test coverage is expanded with new test cases covering special value handling, high-precision decimals, and scientific notation across ANSI and non-ANSI modes.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Pull Request Review: Support casting string to non-int numeric types

Summary

This PR implements full support for casting strings to float, double, and decimal types in Apache DataFusion Comet, achieving Spark compatibility. The implementation removes previous incompatibility restrictions and adds comprehensive handling for special values, scientific notation, and edge cases.

✅ Strengths

1. Comprehensive Feature Implementation
2. Robust Edge Case Handling
3. Excellent Test Coverage

The test suite is comprehensive and well-structured:
4. Good Rust Practices
```rust
    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {
```
For DataType::Float16, this path builds a Float32 array (cast_string_to_float_impl::<Float32Type>), which won’t match the requested Float16; consider emitting a Float16 array (or casting the Float32 result) to avoid a type mismatch at runtime.
value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! And DataFusion Comet should behave as Spark does, i.e. it should map to the Spark's FloatType (32-bit floating point).
```rust
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
```
This downcast to StringArray will fail for LargeUtf8 inputs, yet the match arm includes LargeUtf8 for decimal casts; consider supporting LargeStringArray (e.g., via a generic string accessor) so both Utf8 and LargeUtf8 are handled (also applies to the Decimal256 impl).
value:useful; category:bug; feedback:The Augment AI reviewer is correct! Trying to cast a LargeUtf8 array to StringArray will fail at runtime. It should use GenericStringArray. Prevents application crashes when using LargeUtf8 values.
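As an illustration of the generic-accessor pattern this feedback points to, here is a minimal, dependency-free sketch: the parsing loop is written once, generic over the offset width, mirroring how arrow's `GenericStringArray<OffsetSize>` serves both `Utf8` (i32 offsets) and `LargeUtf8` (i64 offsets). The `GenericStrings` type and `parse_all` function are hypothetical stand-ins, not Comet code.

```rust
// Hypothetical stand-in for arrow's GenericStringArray<O>: values are slices
// of one buffer, delimited by N+1 offsets of width O (i32 or i64).
struct GenericStrings<O> {
    offsets: Vec<O>,
    data: String,
}

// One parsing loop covers both offset widths, so Utf8 and LargeUtf8 inputs
// take the same code path instead of failing a StringArray-only downcast.
fn parse_all<O: Copy + Into<i64>>(arr: &GenericStrings<O>) -> Vec<Option<f64>> {
    (0..arr.offsets.len() - 1)
        .map(|i| {
            let start = arr.offsets[i].into() as usize;
            let end = arr.offsets[i + 1].into() as usize;
            arr.data[start..end].trim().parse::<f64>().ok()
        })
        .collect()
}
```

The same function body then serves both array flavors, which is the shape of the fix the reviewers converge on.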
```rust
DataType::Utf8 if allow_incompat => {
    matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
}
DataType::Utf8 => matches!(
```
is_datafusion_spark_compatible doesn’t treat Utf8 -> Decimal as compatible (and omits Decimal256), but casting from string to decimal is implemented and docs mark it compatible; consider including Decimal128 and Decimal256 in the non-allow_incompat branch to reflect actual support.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! Casting a Utf8 to Decimal128/256 is implemented in this Pull Request, so it should be handled as "Supported". Prevents delegating the cast operation to Spark when it could be done in the native code by Comet.
Actionable comments posted: 2
🧹 Nitpick comments (1)
native/spark-expr/src/conversion_funcs/cast.rs (1)
969-975: Consider addingLargeUtf8support for float casting.The decimal paths at lines 970-975 support both
Utf8andLargeUtf8, but the float path at line 969 only supportsUtf8. For consistency, consider addingLargeUtf8support for float casting as well.-(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode), +(Utf8 | LargeUtf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/source/user-guide/latest/compatibility.md (1 hunks)
- native/spark-expr/src/conversion_funcs/cast.rs (6 hunks)
- spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
- spark/src/test/scala/org/apache/comet/CometCastSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.
Applied to files:
native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (2)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
Compatible(31-31)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)
native/spark-expr/src/conversion_funcs/cast.rs (2)
- v (2609-2609)
- values (2519-2519)

spark/src/test/scala/org/apache/comet/DataGenerator.scala (2)
- generateStrings (51-53)
- generateStrings (56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (12)
docs/source/user-guide/latest/compatibility.md (1)

162-164: Documentation correctly reflects the compatibility changes.

The additions of `string -> float`, `string -> double`, and `string -> decimal` to the Compatible Casts table align with the implementation changes in the Rust and Scala files.

spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: String-to-float/double/decimal casts correctly marked as Compatible.

The changes to return `Compatible()` for `FloatType`, `DoubleType`, and `DecimalType` in `canCastFromString` align with the native Rust implementation that now supports these casts.

native/spark-expr/src/conversion_funcs/cast.rs (7)
219-221: Cast support correctly extended.

The additions for `Float32`, `Float64`, `Decimal128`, and `Decimal256` in `can_cast_from_string` are consistent with the new implementation functions added below.

1058-1077: Clean dispatcher function.

The dispatcher correctly routes to the appropriate implementation based on the target decimal type.
1250-1323: Decimal string parsing logic is well-structured.

The parsing correctly handles:

- Scientific notation with positive/negative exponents
- Signed numbers
- Decimal points
- Overflow protection with checked arithmetic

The scale calculation at line 1321 (`fractional_scale - exponent`) correctly handles scientific notation.
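A worked sketch of that scale arithmetic (a simplified toy parser for illustration, not the PR's `parse_decimal_str`; precision and overflow handling beyond `i128`'s own parse are ignored here):

```rust
// Toy reduction of a decimal string to (unscaled digits, scale), where
// scale = fractional_digits - exponent, as described in the review comment.
fn parse_unscaled(s: &str) -> Option<(i128, i32)> {
    let (mantissa, exp) = match s.split_once(|c| c == 'e' || c == 'E') {
        Some((m, e)) => (m, e.parse::<i32>().ok()?),
        None => (s, 0),
    };
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    // Concatenate digits; a digit-free mantissa fails i128's parse and yields None.
    let unscaled: i128 = format!("{int_part}{frac_part}").parse().ok()?;
    Some((unscaled, frac_part.len() as i32 - exp))
}
```

For example, "1.23e2" reduces to unscaled 123 with scale 0, and "4.5e-1" to unscaled 45 with scale 2, matching the `fractional_scale - exponent` rule.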
1361-1370: ANSI mode error uses original untrimmed value.

At line 1366, the error reports `arr.value(i)` (untrimmed) while line 1361 uses `arr.value(i).trim()` for parsing. This is intentional and correct: it shows the user's original input in the error message.
1393-1401: Note: D/F suffix handling differs between float and decimal parsing.

Float parsing strips D/F suffixes (lines 1394-1398), while decimal parsing does not (per the comment at line 1200). This appears intentional to match Spark's behavior, where `"1.5f"` is valid for float but not for decimal.
1530-1536: Changes correctly reflect the custom implementation flow.

The `is_datafusion_spark_compatible` function is only reached for casts without custom implementations. Since `cast_string_to_float` and `cast_string_to_decimal` are now handled earlier in `cast_array`, this function's changes for Float32/Float64 handle edge cases that might bypass the custom path.
1207-1228: The rounding logic correctly implements Spark's HALF_UP semantics; the original concern is based on a misunderstanding.

Spark's `HALF_UP` rounds away from zero on ties (matching Java's `RoundingMode.HALF_UP`). The code at lines 1216-1227 correctly implements this: when `remainder.abs() >= half_divisor`, it rounds away from zero by adding 1 to the quotient for positive numbers and subtracting 1 for negative numbers, resulting in values like -1.5 -> -2, which is the expected behavior.

The original review incorrectly stated that HALF_UP rounds "towards positive infinity" and conflated it with a different rounding mode. In Spark/Java HALF_UP, -1.5 rounds to -2 (away from zero), not -1.
Likely an incorrect or invalid review comment.
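The tie behavior described above can be checked with a small self-contained sketch of round-half-away-from-zero on an integer quotient (assumed semantics mirroring the review's description, not the PR's exact code):

```rust
// HALF_UP (Java RoundingMode.HALF_UP): when |remainder| >= divisor/2, round
// the truncated quotient away from zero, so -1.5 -> -2 and 1.5 -> 2.
fn div_half_up(value: i128, divisor: i128) -> i128 {
    let quotient = value / divisor; // Rust integer division truncates toward zero
    let remainder = value % divisor;
    if remainder.abs() * 2 >= divisor.abs() {
        // Same signs: step the quotient up; opposite signs: step it down.
        if (value < 0) == (divisor < 0) { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}
```

With divisor 10 this maps 15 -> 2, -15 -> -2, and 14 -> 1, matching the HALF_UP table in the comment above.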
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (3)
655-679: Comprehensive set of special values for testing.

The `specialValues` sequence covers a good range of edge cases including:
- Type suffixes (f, F, d, D)
- Infinity variants (inf, Inf, INF, +inf, -inf, +Infinity, -Infinity)
- NaN variants (nan, NaN, NAN)
- Scientific notation
- Whitespace handling
- Zero variants (0.0, -0.0)
- Invalid inputs (empty string, "xyz", null)
681-785: Good test coverage for the new casting functionality.

The new tests comprehensively cover:

- Both ANSI and non-ANSI modes via `Seq(true, false).foreach`
- Float and Double types
- Various decimal precisions (10,2), (38,10), (23,8)
- Scientific notation with valid and invalid formats
- Edge cases like empty strings, invalid characters, and null values
693-714: Existing tests improved with ANSI mode coverage.

The modifications to existing float/double/decimal cast tests now iterate over both ANSI enabled and disabled modes, improving test coverage.
```rust
fn cast_string_to_decimal128_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = Decimal128Builder::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    decimal_builder.append_value(decimal_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}
```
LargeUtf8 dispatch exists but implementation only handles StringArray.
The cast_array dispatch at lines 970-975 routes both Utf8 and LargeUtf8 to cast_string_to_decimal, but cast_string_to_decimal128_impl only downcasts to StringArray (Utf8). This will cause a runtime error for LargeUtf8 inputs.
Consider using as_generic_string_array or add separate handling for LargeUtf8:
```diff
+use datafusion::common::cast::as_generic_string_array;
+
 fn cast_string_to_decimal128_impl(
     array: &ArrayRef,
     eval_mode: EvalMode,
     precision: u8,
     scale: i8,
 ) -> SparkResult<ArrayRef> {
-    let string_array = array
-        .as_any()
-        .downcast_ref::<StringArray>()
-        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
+    // Handle both Utf8 and LargeUtf8
+    if let Ok(string_array) = as_generic_string_array::<i32>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    if let Ok(string_array) = as_generic_string_array::<i64>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    Err(SparkError::Internal("Expected string array".to_string()))
+}
```
+}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1079 to 1126, the
function cast_string_to_decimal128_impl only downcasts the input to StringArray
(Utf8) but the dispatch also routes LargeUtf8 inputs, which will panic at
runtime; update the function to accept both Utf8 and LargeUtf8 by using a
generic string accessor (e.g., as_generic_string_array) or explicitly handle
both types (downcast_ref::<StringArray>() and
downcast_ref::<LargeStringArray>()), iterate over values via the generic API (or
branch accordingly), preserve null handling and trimming logic, and then feed
parsed values into the Decimal128Builder as before so LargeUtf8 inputs no longer
cause a runtime error.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.
```rust
fn cast_string_to_decimal256_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = PrimitiveBuilder::<Decimal256Type>::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    // Convert i128 to i256
                    let i256_value = i256::from_i128(decimal_value);
                    decimal_builder.append_value(i256_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}
```
Two critical issues with Decimal256 string casting.

1. Precision bottleneck: `parse_string_to_decimal` returns `i128`, limiting values to ~38 significant digits. However, `Decimal256` supports up to 76 digits. Strings representing high-precision values (39-76 digits) will fail to parse.
2. `LargeUtf8` incompatibility: Both `cast_string_to_decimal128_impl` (line 1087) and `cast_string_to_decimal256_impl` (line 1136) downcast to `StringArray` only. However, `LargeUtf8` uses `LargeStringArray` and requires `GenericStringArray<i64>`. This causes casting from `LargeUtf8` to `Decimal256` to fail with "Expected string array". Other string cast functions like `cast_string_to_int` correctly use `GenericStringArray<OffsetSize>` to handle both cases.
🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1128-1177, the
function both incorrectly downcasts only to StringArray (failing for
LargeUtf8/LargeStringArray) and relies on parse_string_to_decimal which returns
i128 (unable to parse >38 digits needed for Decimal256). Fix by changing the
array downcast to handle GenericStringArray<OffsetSize> (or branch to accept
both StringArray and LargeStringArray) so LargeUtf8 inputs are accepted, and
replace/use a parser that returns a 256-bit value (i256) — either implement a
new parse_string_to_decimal256 that parses arbitrary-precision decimal strings
into i256 respecting precision/scale up to 76 digits, or modify
parse_string_to_decimal to return i256 for the Decimal256 path; then use that
i256 directly when appending to the Decimal256 builder and preserve existing
EvalMode (Ansi/null) handling.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The support for i128 comes from is_validate_decimal_precision(). It does not support (yet) i256. Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.
```rust
}
(Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
    cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
}
```
Bug: LargeUtf8 to Decimal cast fails with runtime error
The pattern match accepts both Utf8 | LargeUtf8 for Decimal128 and Decimal256 conversions, but cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray (which is for Utf8). When a LargeUtf8 array is passed, the downcast will return None and the function will return an "Expected string array" error. The implementations need to also handle LargeStringArray or use GenericStringArray<OffsetSize> with a type parameter.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.
```rust
) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {
            cast_string_to_float_impl::<Float32Type>(array, eval_mode, "FLOAT")
```
Bug: Float16 cast returns wrong array type
When casting a string to Float16, the code uses Float32Type which returns a Float32Array instead of a Float16Array. This type mismatch could cause issues downstream when the caller expects a Float16 array but receives a Float32 array. The Float16 case is bundled with Float32 but uses the same implementation that produces Float32Type output.
value:incorrect-but-reasonable; category:bug; feedback:The Bugbot AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! And DataFusion Comet should behave as Spark does, i.e. it should map to the Spark's FloatType (32-bit floating point).
```rust
// Parse the string as a decimal number
// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input
match parse_decimal_str(s) {
```
Bug: Decimal parsing doesn't strip D/F suffixes like float does
The parse_string_to_float function strips 'D' and 'F' suffixes from input strings (line 1394), but parse_string_to_decimal explicitly does not strip these suffixes. This causes inputs like "123.45D" to return null for decimal casts while returning the correct value for float casts. The test includes "123.45D" as a test case, suggesting Spark supports this format for decimal casts. This inconsistency means decimal casts will fail where float casts succeed for the same input pattern.
value:annoying; category:bug; feedback:The Bugbot AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.
```rust
integral_part
    .parse()
    .map_err(|_| "Invalid integral part".to_string())?
};
```
Bug: Malformed scientific notation "e5" returns zero instead of null
In parse_decimal_str, when the integral part is empty (as in "e5" or ".e5"), integral_value is set to 0 instead of returning an error. This causes malformed inputs like "e5" to parse as 0 rather than returning null. The test suite includes "e5" as a test value, and if Spark returns null for this invalid input, the Comet implementation will produce incorrect results. The function treats an empty string before 'e' as valid with value 0, which is semantically incorrect for number parsing.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Apache Spark returns an error/exception when trying to parse "e5" as a decimal, not 0. DataFusion Comet should do the same.
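A minimal sketch of the guard this feedback agrees with: require at least one digit before the exponent marker, so "e5" is rejected instead of silently parsing as 0 (the helper name is hypothetical, for illustration only):

```rust
// Reject mantissas containing no digits ("e5", ".e5", "+e5") before any
// further decimal parsing; a digit-free mantissa must yield None/null, not 0.
fn validate_mantissa(s: &str) -> Option<&str> {
    let mantissa = s.split(|c| c == 'e' || c == 'E').next().unwrap_or(s);
    if mantissa.chars().any(|c| c.is_ascii_digit()) {
        Some(mantissa)
    } else {
        None
    }
}
```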
value:useful; category:bug; feedback:The Claude AI reviewer is correct! It would be good to add more test cases for such edge values to validate that both Comet and Spark round to the same values when the scale requires it.
value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018".
value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Executing "SELECT CAST('1e2147483648' AS DECIMAL(10,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '1e2147483648' of the type "STRING" cannot be cast to "DECIMAL(10,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018".
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The same could be achieved by using String::eq_ignore_ascii_case() without allocating a new temporary string.
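The allocation-free comparison mentioned here, sketched: `str::eq_ignore_ascii_case` compares in place, whereas `to_lowercase()` allocates a fresh `String` for every row. The helper below is an illustrative stand-in, not Comet code.

```rust
// Case-insensitive special-value check without per-row allocation: strip an
// optional sign, then compare in place against the infinity spellings.
fn is_infinity_token(s: &str) -> bool {
    let s = s.strip_prefix(|c| c == '+' || c == '-').unwrap_or(s);
    s.eq_ignore_ascii_case("inf") || s.eq_ignore_ascii_case("infinity")
}
```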
value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! Adding LargeUtf8 as supported type will fail later at https://github.com/martin-augment/datafusion-comet/pull/37/files#diff-c09cf0751fb1466958da9fc412fbd55fa910425f1f3c8f6457d8adefad9a6483R1352 because it expects only a StringArray. To support LargeUtf8 it should cast to GenericStringArray instead.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
2835: To review by AI