2835: feat: Support casting string non int numeric types #37

Open

martin-augment wants to merge 7 commits into main from pr-2835-2025-12-09-07-27-29

Conversation

@martin-augment
Owner

2835: To review by AI

@coderabbitai

coderabbitai bot commented Dec 9, 2025

Walkthrough

This pull request extends cast compatibility for string-to-numeric conversions. The changes update documentation to reflect string-to-float, string-to-double, and string-to-decimal casts as compatible. The Rust implementation adds parsing logic for converting strings to Decimal128, Decimal256, and floating-point types, with support for special values (inf, nan) and scientific notation. Corresponding Scala code relaxes cast compatibility checks for these conversions. Test coverage is expanded with new test cases covering special value handling, high-precision decimals, and scientific notation across ANSI and non-ANSI modes.


@claude

claude bot commented Dec 9, 2025

Pull Request Review: Support casting string to non-int numeric types

Summary

This PR implements full support for casting strings to float, double, and decimal types in Apache DataFusion Comet, achieving Spark compatibility. The implementation removes previous incompatibility restrictions and adds comprehensive handling for special values, scientific notation, and edge cases.


✅ Strengths

1. Comprehensive Feature Implementation

  • Successfully implements string-to-float, string-to-double, and string-to-decimal casts
  • Handles both Decimal128 and Decimal256 types
  • Properly supports ANSI and non-ANSI evaluation modes
  • Removes compatibility restrictions from the documentation (compatibility.md)

2. Robust Edge Case Handling

  • Special values: inf, +inf, -inf, infinity, nan (case-insensitive)
  • Scientific notation: 1.23E-5, 1e10, etc.
  • Format variations: 1.5f, 2.0d, 1.5F, 2.0D suffixes
  • Leading/trailing whitespace trimming
  • Empty strings and null values

3. Excellent Test Coverage

The test suite is comprehensive and well-structured:

  • Special values test: Tests for float/double/decimal suffixes, infinity, NaN, scientific notation, whitespace handling
  • Scientific notation test: Dedicated tests for various scientific notation formats
  • Decimal precision tests: Tests for basic values, high precision (38,10), overflow scenarios
  • Random data tests: Uses data generation for broader coverage
  • Both ANSI and non-ANSI modes tested

4. Good Rust Practices

  • Proper error handling with SparkResult and SparkError
  • Use of checked_mul and checked_add to prevent integer overflow
  • Clear function separation and naming conventions
  • Appropriate use of Arrow builders

⚠️ Issues and Concerns

1. Critical Bug: Incorrect Rounding Logic in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1216-1226

The rounding logic for decimal scale adjustment has a subtle but critical bug:

let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: This implements "round half up" incorrectly for the exact halfway case when divisor is even. When remainder.abs() == half_divisor, the code rounds away from zero, but Spark uses "round half up" which means:

  • Positive values: round up
  • Negative values: round toward zero when exactly halfway (not away from zero)

Example: For -1.5 with scale 0:

  • Current implementation: -2 (rounds away from zero)
  • Spark behavior: Should be -1 (rounds toward positive infinity)

Severity: High - This affects correctness of decimal conversions

2. Missing D/F Suffix Handling in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment says:

// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input

However, based on the test cases ("1.5f", "2.0d"), Spark does accept these suffixes for decimal parsing. The current implementation will return None for inputs like "123.45d" because Rust's parse() will fail.

Test Evidence: The test specialValues includes "1.5f", "1.5F", "2.0d", "2.0D" and expects them to work for decimal types.

Recommendation: Strip trailing 'd', 'D', 'f', 'F' before parsing in parse_decimal_str.
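A minimal sketch of such a stripping helper (the name and the digit/dot guard are illustrative assumptions, not code from the PR). Special values must be matched first, since `inf` itself ends in `f`; the guard below keeps `inf`/`nan` intact by requiring a digit or `.` before the suffix:

```rust
// Hypothetical helper: drop a single trailing d/D/f/F type suffix before
// numeric parsing, mirroring what the float path already does. The guard
// requires a digit or '.' before the suffix so "inf"/"nan" are untouched.
fn strip_type_suffix(s: &str) -> &str {
    let b = s.as_bytes();
    match b.last().copied() {
        Some(b'd' | b'D' | b'f' | b'F')
            if b.len() > 1 && (b[b.len() - 2].is_ascii_digit() || b[b.len() - 2] == b'.') =>
        {
            &s[..s.len() - 1]
        }
        _ => s,
    }
}
```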

3. Potential Integer Overflow in Scientific Notation

Location: native/spark-expr/src/conversion_funcs/cast.rs:1264-1266

let exp: i32 = exponent_part
    .parse()
    .map_err(|_| "Invalid exponent".to_string())?;

If a user provides extremely large exponents (e.g., "1e2147483648"), this could cause issues. While Spark likely handles this, consider:

  • Adding bounds checking on the exponent value
  • Testing with extreme exponent values
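One way to sketch that bounds check — parsing into a wider type first so out-of-range text fails cleanly rather than overflowing downstream scaling (the limit constant is an assumption for illustration, not a value taken from Spark or the PR):

```rust
// Assumed limit on |exponent|; any real bound should be derived from the
// maximum decimal precision/scale rather than this placeholder.
const MAX_ABS_EXPONENT: i64 = 10_000;

fn parse_bounded_exponent(exponent_part: &str) -> Result<i32, String> {
    // Parse as i64 so "1e2147483648" fails the range check instead of
    // failing (or wrapping) inside an i32 parse.
    let exp: i64 = exponent_part
        .parse()
        .map_err(|_| "Invalid exponent".to_string())?;
    if exp.abs() > MAX_ABS_EXPONENT {
        return Err(format!("Exponent out of range: {exp}"));
    }
    Ok(exp as i32)
}
```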

4. Performance Consideration: Repeated String Allocations

Location: native/spark-expr/src/conversion_funcs/cast.rs:1187

let s_lower = s.to_lowercase();

For every string in the array, this allocates a new lowercased string. For large datasets, this could impact performance.

Optimization: Consider using eq_ignore_ascii_case, which avoids the allocation and already implies a length check (so explicit s.len() comparisons are unnecessary):

if s.eq_ignore_ascii_case("inf")
    || s.eq_ignore_ascii_case("+inf")
    || s.eq_ignore_ascii_case("-inf")
    // ...

Severity: Low-Medium - Performance optimization, not a correctness issue

5. Missing Test: Decimal Precision Boundary Cases

While the tests are comprehensive, consider adding explicit tests for:

  • Maximum precision values (e.g., Decimal128(38, 0))
  • Values that exactly fit vs. overflow by one
  • Negative scale decimals (if supported by Spark)
  • Very large scientific notation that overflows during scaling

6. Code Duplication Between Decimal128 and Decimal256

Location: cast_string_to_decimal128_impl and cast_string_to_decimal256_impl

These two functions have nearly identical logic, differing only in the builder type and the i128 → i256 conversion. Consider:

  • Using a generic function or macro to reduce duplication
  • This reduces maintenance burden and potential for bugs
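As a rough illustration of the shape such a refactor could take, the shared per-row control flow (trim, null handling, ANSI error vs. append-null) can live in one function, with the type-specific append step supplied by each caller. The function below uses plain `i128` parsing and a `Vec` as hypothetical stand-ins for the PR's `parse_string_to_decimal` and Arrow builders:

```rust
// Sketch of the shared row loop; a Decimal128 caller would push values
// directly, a Decimal256 caller would convert via i256::from_i128 first.
fn parse_rows(values: &[Option<&str>], ansi: bool) -> Result<Vec<Option<i128>>, String> {
    let mut out = Vec::with_capacity(values.len());
    for v in values {
        match v {
            None => out.push(None),
            Some(s) => match s.trim().parse::<i128>() {
                Ok(n) => out.push(Some(n)),
                // ANSI mode: fail the whole cast on the first bad value.
                Err(_) if ansi => return Err(format!("invalid value: {s}")),
                // Non-ANSI mode: bad values become null.
                Err(_) => out.push(None),
            },
        }
    }
    Ok(out)
}
```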

7. Unclear Comment About D/F Suffix Behavior

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment contradicts the test expectations. Either:

  • The comment should be updated to reflect actual behavior
  • The implementation should strip D/F suffixes to match Spark
  • Tests should be added to verify the expected behavior

🔍 Additional Observations

Documentation Updates

The changes to compatibility.md correctly move string→float/double/decimal from "Incompatible" to "Compatible". This aligns with the implementation.

Scala Changes

The CometCast.scala changes correctly remove the incompatibility markers for these cast operations. Good consistency with the Rust implementation.

Float Parsing

The parse_string_to_float function correctly handles D/F suffixes by stripping them before parsing. This same logic should be applied to decimal parsing.


📋 Recommendations

High Priority (Must Fix)

  1. Fix the rounding logic in parse_string_to_decimal to match Spark's "round half up" behavior exactly
  2. Add D/F suffix stripping in parse_decimal_str to match float parsing behavior
  3. Add tests to verify D/F suffix handling for decimals works correctly

Medium Priority (Should Fix)

  1. Add bounds checking for scientific notation exponents
  2. Add edge case tests for decimal precision boundaries
  3. Optimize string comparison to avoid allocations in hot path

Low Priority (Nice to Have)

  1. Reduce code duplication between Decimal128 and Decimal256 implementations
  2. Update or clarify comments about D/F suffix handling

🎯 Overall Assessment

This is a solid PR that significantly improves Spark compatibility for numeric type casting. The implementation is well-structured, the test coverage is comprehensive, and it removes major compatibility gaps.

However, there are critical correctness issues (rounding logic, D/F suffix handling) that must be addressed before merging. Once these are fixed, this will be a valuable addition to the project.

Recommendation: Request changes to address the high-priority issues, particularly the decimal rounding logic and D/F suffix handling.


Great work on this comprehensive feature implementation! 🚀


@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {

For DataType::Float16, this path builds a Float32 array (cast_string_to_float_impl::<Float32Type>), which won’t match the requested Float16; consider emitting a Float16 array (or casting the Float32 result) to avoid a type mismatch at runtime.

Owner Author


value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! And DataFusion Comet should behave as Spark does, i.e. it should map to the Spark's FloatType (32-bit floating point).

) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()

This downcast to StringArray will fail for LargeUtf8 inputs, yet the match arm includes LargeUtf8 for decimal casts; consider supporting LargeStringArray (e.g., via a generic string accessor) so both Utf8 and LargeUtf8 are handled (also applies to the Decimal256 impl).

Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! Trying to cast a LargeUtf8 array to StringArray will fail at runtime. It should use GenericStringArray. Prevents application crashes when using LargeUtf8 values.

DataType::Utf8 if allow_incompat => {
    matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
}
DataType::Utf8 => matches!(

is_datafusion_spark_compatible doesn’t treat Utf8 -> Decimal as compatible (and omits Decimal256), but casting from string to decimal is implemented and docs mark it compatible; consider including Decimal128 and Decimal256 in the non-allow_incompat branch to reflect actual support.

Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! Casting a Utf8 to Decimal128/256 is implemented in this Pull Request, so it should be handled as "Supported". Prevents delegating the cast operation to Spark when it could be done in the native code by Comet.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
native/spark-expr/src/conversion_funcs/cast.rs (1)

969-975: Consider adding LargeUtf8 support for float casting.

The decimal paths at lines 970-975 support both Utf8 and LargeUtf8, but the float path at line 969 only supports Utf8. For consistency, consider adding LargeUtf8 support for float casting as well.

-(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
+(Utf8 | LargeUtf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ab8e12 and 227afcc.

📒 Files selected for processing (4)
  • docs/source/user-guide/latest/compatibility.md (1 hunks)
  • native/spark-expr/src/conversion_funcs/cast.rs (6 hunks)
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (2)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
  • Compatible (31-31)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)
native/spark-expr/src/conversion_funcs/cast.rs (2)
  • v (2609-2609)
  • values (2519-2519)
spark/src/test/scala/org/apache/comet/DataGenerator.scala (2)
  • generateStrings (51-53)
  • generateStrings (56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (12)
docs/source/user-guide/latest/compatibility.md (1)

162-164: Documentation correctly reflects the compatibility changes.

The additions of string -> float, string -> double, and string -> decimal to the Compatible Casts table align with the implementation changes in the Rust and Scala files.

spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: String-to-float/double/decimal casts correctly marked as Compatible.

The changes to return Compatible() for FloatType, DoubleType, and DecimalType in canCastFromString align with the native Rust implementation that now supports these casts.

native/spark-expr/src/conversion_funcs/cast.rs (7)

219-221: Cast support correctly extended.

The additions for Float32, Float64, Decimal128, and Decimal256 in can_cast_from_string are consistent with the new implementation functions added below.


1058-1077: Clean dispatcher function.

The dispatcher correctly routes to the appropriate implementation based on the target decimal type.


1250-1323: Decimal string parsing logic is well-structured.

The parsing correctly handles:

  • Scientific notation with positive/negative exponents
  • Signed numbers
  • Decimal points
  • Overflow protection with checked arithmetic

The scale calculation at line 1321 (fractional_scale - exponent) correctly handles scientific notation.


1361-1370: ANSI mode error uses original untrimmed value.

At line 1366, the error reports arr.value(i) (untrimmed) while line 1361 uses arr.value(i).trim() for parsing. This is intentional and correct—it shows the user's original input in the error message.


1393-1401: Note: D/F suffix handling differs between float and decimal parsing.

Float parsing strips D/F suffixes (lines 1394-1398), while decimal parsing does not (per comment at line 1200). This appears intentional to match Spark's behavior where "1.5f" is valid for float but not for decimal.


1530-1536: Changes correctly reflect the custom implementation flow.

The is_datafusion_spark_compatible function is only reached for casts without custom implementations. Since cast_string_to_float and cast_string_to_decimal are now handled earlier in cast_array, this function's changes for Float32/Float64 handle edge cases that might bypass the custom path.


1207-1228: The rounding logic correctly implements Spark's HALF_UP semantics; the original concern is based on a misunderstanding.

Spark's HALF_UP rounds away from zero on ties (matching Java's RoundingMode.HALF_UP). The code at lines 1216–1227 correctly implements this: when remainder.abs() >= half_divisor, it rounds away from zero by adding 1 to the quotient for positive numbers and subtracting 1 for negative numbers, resulting in values like −1.5 → −2, which is the expected behavior.

The original review incorrectly stated that HALF_UP rounds "towards positive infinity" and conflated it with a different rounding mode. In Spark/Java HALF_UP, −1.5 rounds to −2 (away from zero), not −1.

Likely an incorrect or invalid review comment.
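For reference, a self-contained sketch of ties-away-from-zero division, the HALF_UP semantics described above. This is illustrative only, not the PR's exact code; in the PR the divisor is a positive power of ten, so `divisor / 2` is exact for divisor >= 10:

```rust
// Reduce scale with HALF_UP (ties away from zero), matching Java's
// RoundingMode.HALF_UP as used by Spark. The `remainder != 0` guard
// keeps divisor == 1 (no scale reduction) from rounding spuriously.
fn div_round_half_up(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor; // Rust division truncates toward zero
    let remainder = mantissa % divisor;
    if remainder != 0 && remainder.abs() >= divisor / 2 {
        // On a tie or beyond, step away from zero.
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}
```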

spark/src/test/scala/org/apache/comet/CometCastSuite.scala (3)

655-679: Comprehensive set of special values for testing.

The specialValues sequence covers a good range of edge cases including:

  • Type suffixes (f, F, d, D)
  • Infinity variants (inf, Inf, INF, +inf, -inf, +Infinity, -Infinity)
  • NaN variants (nan, NaN, NAN)
  • Scientific notation
  • Whitespace handling
  • Zero variants (0.0, -0.0)
  • Invalid inputs (empty string, "xyz", null)

681-785: Good test coverage for the new casting functionality.

The new tests comprehensively cover:

  • Both ANSI and non-ANSI modes via Seq(true, false).foreach
  • Float and Double types
  • Various decimal precisions (10,2), (38,10), (23,8)
  • Scientific notation with valid and invalid formats
  • Edge cases like empty strings, invalid characters, and null values

693-714: Existing tests improved with ANSI mode coverage.

The modifications to existing float/double/decimal cast tests now iterate over both ANSI enabled and disabled modes, improving test coverage.

Comment on lines +1079 to +1126
fn cast_string_to_decimal128_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = Decimal128Builder::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    decimal_builder.append_value(decimal_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}

⚠️ Potential issue | 🟠 Major

LargeUtf8 dispatch exists but implementation only handles StringArray.

The cast_array dispatch at lines 970-975 routes both Utf8 and LargeUtf8 to cast_string_to_decimal, but cast_string_to_decimal128_impl only downcasts to StringArray (Utf8). This will cause a runtime error for LargeUtf8 inputs.

Consider using as_generic_string_array or add separate handling for LargeUtf8:

+use datafusion::common::cast::as_generic_string_array;
+
 fn cast_string_to_decimal128_impl(
     array: &ArrayRef,
     eval_mode: EvalMode,
     precision: u8,
     scale: i8,
 ) -> SparkResult<ArrayRef> {
-    let string_array = array
-        .as_any()
-        .downcast_ref::<StringArray>()
-        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
+    // Handle both Utf8 and LargeUtf8
+    if let Ok(string_array) = as_generic_string_array::<i32>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    if let Ok(string_array) = as_generic_string_array::<i64>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    Err(SparkError::Internal("Expected string array".to_string()))
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1079 to 1126, the
function cast_string_to_decimal128_impl only downcasts the input to StringArray
(Utf8) but the dispatch also routes LargeUtf8 inputs, which will panic at
runtime; update the function to accept both Utf8 and LargeUtf8 by using a
generic string accessor (e.g., as_generic_string_array) or explicitly handle
both types (downcast_ref::<StringArray>() and
downcast_ref::<LargeStringArray>()), iterate over values via the generic API (or
branch accordingly), preserve null handling and trimming logic, and then feed
parsed values into the Decimal128Builder as before so LargeUtf8 inputs no longer
cause a runtime error.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.

Comment on lines +1128 to +1177
fn cast_string_to_decimal256_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = PrimitiveBuilder::<Decimal256Type>::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    // Convert i128 to i256
                    let i256_value = i256::from_i128(decimal_value);
                    decimal_builder.append_value(i256_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

(Collapsed: CodeRabbit ran a series of repository searches — locating parse_string_to_decimal, parse_decimal_str, and is_validate_decimal_precision, inspecting the Decimal128/Decimal256 cast implementations and their StringArray/LargeStringArray usage, checking the Utf8/LargeUtf8 dispatch in cast_string_to_decimal, and scanning CometCastSuite.scala for Decimal256 coverage. The searches confirmed that parse_string_to_decimal returns i128 and that both decimal implementations downcast only to StringArray.)

Two critical issues with Decimal256 string casting.

  1. Precision bottleneck: parse_string_to_decimal returns i128, limiting values to ~38 significant digits. However, Decimal256 supports up to 76 digits. Strings representing high-precision values (39-76 digits) will fail to parse.

  2. LargeUtf8 incompatibility: Both cast_string_to_decimal128_impl (line 1087) and cast_string_to_decimal256_impl (line 1136) downcast to StringArray only. However, LargeUtf8 uses LargeStringArray and requires GenericStringArray<i64>. This causes casting from LargeUtf8 to Decimal256 to fail with "Expected string array". Other string cast functions like cast_string_to_int correctly use GenericStringArray<OffsetSize> to handle both cases.
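The i128 bottleneck is easy to demonstrate in isolation: i128::MAX has 39 decimal digits, so any parser that funnels the unscaled value through i128 cannot hold the 76 digits Decimal256 permits. A minimal, self-contained sketch (the `fits_in_i128` helper is hypothetical, not Comet code):

```rust
// Standalone illustration of the i128 precision ceiling.
fn fits_in_i128(unscaled_digits: &str) -> bool {
    unscaled_digits.parse::<i128>().is_ok()
}

fn main() {
    // i128::MAX = 170141183460469231731687303715884105727 (39 digits)
    assert_eq!(i128::MAX.to_string().len(), 39);
    // 38 nines fit...
    assert!(fits_in_i128(&"9".repeat(38)));
    // ...but 39 nines already overflow, far short of the 76 significant
    // digits a Decimal256 value may carry.
    assert!(!fits_in_i128(&"9".repeat(39)));
}
```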

🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1128-1177, the
function both incorrectly downcasts only to StringArray (failing for
LargeUtf8/LargeStringArray) and relies on parse_string_to_decimal which returns
i128 (unable to parse >38 digits needed for Decimal256). Fix by changing the
array downcast to handle GenericStringArray<OffsetSize> (or branch to accept
both StringArray and LargeStringArray) so LargeUtf8 inputs are accepted, and
replace/use a parser that returns a 256-bit value (i256) — either implement a
new parse_string_to_decimal256 that parses arbitrary-precision decimal strings
into i256 respecting precision/scale up to 76 digits, or modify
parse_string_to_decimal to return i256 for the Decimal256 path; then use that
i256 directly when appending to the Decimal256 builder and preserve existing
EvalMode (Ansi/null) handling.


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The i128 limitation comes from is_validate_decimal_precision(), which does not yet support i256. Casting to Decimal should also support LargeUtf8 values, so the passed array should be downcast to GenericStringArray instead of StringArray (which only covers Utf8). Prevents runtime errors due to failed downcasts.

}
(Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
    cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
}

Bug: LargeUtf8 to Decimal cast fails with runtime error

The pattern match accepts both Utf8 | LargeUtf8 for Decimal128 and Decimal256 conversions, but cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray (which is for Utf8). When a LargeUtf8 array is passed, the downcast will return None and the function will return an "Expected string array" error. The implementations need to also handle LargeStringArray or use GenericStringArray<OffsetSize> with a type parameter.



value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.

) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {
            cast_string_to_float_impl::<Float32Type>(array, eval_mode, "FLOAT")

Bug: Float16 cast returns wrong array type

When casting a string to Float16, the code uses Float32Type which returns a Float32Array instead of a Float16Array. This type mismatch could cause issues downstream when the caller expects a Float16 array but receives a Float32 array. The Float16 case is bundled with Float32 but uses the same implementation that produces Float32Type output.



value:incorrect-but-reasonable; category:bug; feedback:The Bugbot AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! DataFusion Comet should behave as Spark does, i.e. it should map Float16 to Spark's FloatType (a 32-bit floating point).


// Parse the string as a decimal number
// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input
match parse_decimal_str(s) {

Bug: Decimal parsing doesn't strip D/F suffixes like float does

The parse_string_to_float function strips 'D' and 'F' suffixes from input strings (line 1394), but parse_string_to_decimal explicitly does not strip these suffixes. This causes inputs like "123.45D" to return null for decimal casts while returning the correct value for float casts. The test includes "123.45D" as a test case, suggesting Spark supports this format for decimal casts. This inconsistency means decimal casts will fail where float casts succeed for the same input pattern.



value:annoying; category:bug; feedback:The Bugbot AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.
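The "fail naturally" behavior leans on the fact that Rust's standard float parser rejects Java-style type suffixes outright, which lines up with Spark's CAST_INVALID_INPUT error above. A quick standalone check:

```rust
// Rust's FromStr for f64 accepts plain and scientific notation,
// but not Java-style 'D'/'F' suffixes, so suffixed input errors out.
fn main() {
    assert!("123.45".parse::<f64>().is_ok());
    assert!("1.5e2".parse::<f64>().is_ok());
    assert!("123.45D".parse::<f64>().is_err());
    assert!("123.45f".parse::<f64>().is_err());
}
```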

integral_part
    .parse()
    .map_err(|_| "Invalid integral part".to_string())?
};

Bug: Malformed scientific notation "e5" returns zero instead of null

In parse_decimal_str, when the integral part is empty (as in "e5" or ".e5"), integral_value is set to 0 instead of returning an error. This causes malformed inputs like "e5" to parse as 0 rather than returning null. The test suite includes "e5" as a test value, and if Spark returns null for this invalid input, the Comet implementation will produce incorrect results. The function treats an empty string before 'e' as valid with value 0, which is semantically incorrect for number parsing.


The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Apache Spark returns an error/exception when trying to parse "e5" as a decimal, not 0. DataFusion Comet should do the same.
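One way to close that hole is to require at least one digit in the mantissa before accepting the input. A minimal sketch (the `parse_integral` helper is hypothetical, not the Comet function, and it ignores the sign and fraction for brevity):

```rust
// Reject scientific-notation input whose mantissa contains no digits,
// so "e5" and ".e5" error out instead of silently parsing as 0.
fn parse_integral(s: &str) -> Result<i128, String> {
    let (mantissa, _exponent) = match s.split_once(|c| c == 'e' || c == 'E') {
        Some((m, e)) => (m, Some(e)),
        None => (s, None),
    };
    // "e5", ".e5", "+.e5": a mantissa with no digits at all is malformed.
    if !mantissa.chars().any(|c| c.is_ascii_digit()) {
        return Err(format!("malformed numeric literal: {s}"));
    }
    let integral = mantissa.split_once('.').map_or(mantissa, |(i, _)| i);
    if integral.is_empty() || integral == "+" || integral == "-" {
        Ok(0) // e.g. ".5e2": the digits live only in the fraction
    } else {
        integral.parse().map_err(|_| "invalid integral part".to_string())
    }
}

fn main() {
    assert!(parse_integral("e5").is_err());
    assert!(parse_integral(".e5").is_err());
    assert_eq!(parse_integral("1e5"), Ok(1));
    assert_eq!(parse_integral(".5e2"), Ok(0));
    assert_eq!(parse_integral("-2.5"), Ok(-2));
}
```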

@martin-augment

1. Critical Bug: Incorrect Rounding Logic in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1216-1226

The rounding logic for decimal scale adjustment has a subtle but critical bug:

let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: This implements "round half up" incorrectly for the exact halfway case when divisor is even. When remainder.abs() == half_divisor, the code rounds away from zero, but Spark uses "round half up" which means:

  • Positive values: round up
  • Negative values: round toward zero when exactly halfway (not away from zero)

Example: For -1.5 with scale 0:

  • Current implementation: -2 (rounds away from zero)
  • Spark behavior: Should be -1 (rounds toward positive infinity)

Severity: High - This affects correctness of decimal conversions

value:useful; category:bug; feedback:The Claude AI reviewer is correct! It would be good to add more test cases for such edge values to validate that both Comet and Spark round to the same values when the scale requires it. Prevents bugs due to different behavior in Comet compared to Spark.
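The exact halfway case is the only place the two candidate modes disagree, which is precisely what the extra test cases should pin down against Spark. A standalone sketch of both behaviors (hypothetical helpers; which one matches Spark must be confirmed by the Spark tests; divisor is assumed positive, as it is a power of ten in the cast code):

```rust
// Round mantissa/divisor, resolving exact halves away from zero
// (Java RoundingMode.HALF_UP semantics). Doubling the remainder instead
// of halving the divisor avoids the even-divisor truncation pitfall.
fn div_half_away_from_zero(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    if remainder.abs() * 2 >= divisor {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

// Round mantissa/divisor, resolving exact halves toward +infinity.
fn div_half_toward_pos_inf(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    let doubled = remainder.abs() * 2;
    if doubled > divisor || (doubled == divisor && mantissa >= 0) {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

fn main() {
    // -1.5 at scale 0: the modes disagree only on the exact halfway case.
    assert_eq!(div_half_away_from_zero(-15, 10), -2);
    assert_eq!(div_half_toward_pos_inf(-15, 10), -1);
    // -1.6 rounds to -2 either way; 1.5 rounds to 2 either way.
    assert_eq!(div_half_away_from_zero(-16, 10), -2);
    assert_eq!(div_half_toward_pos_inf(-16, 10), -2);
    assert_eq!(div_half_away_from_zero(15, 10), 2);
    assert_eq!(div_half_toward_pos_inf(15, 10), 2);
}
```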

@martin-augment

2. Missing D/F Suffix Handling in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment says:

// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input

However, based on the test cases ("1.5f", "2.0d"), Spark does accept these suffixes for decimal parsing. The current implementation will return None for inputs like "123.45d" because Rust's parse() will fail.

Test Evidence: The test specialValues includes "1.5f", "1.5F", "2.0d", "2.0D" and expects them to work for decimal types.

Recommendation: Strip trailing 'd', 'D', 'f', 'F' before parsing in parse_decimal_str.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.

@martin-augment

3. Potential Integer Overflow in Scientific Notation

Location: native/spark-expr/src/conversion_funcs/cast.rs:1264-1266

let exp: i32 = exponent_part
    .parse()
    .map_err(|_| "Invalid exponent".to_string())?;

If a user provides extremely large exponents (e.g., "1e2147483648"), this could cause issues. While Spark likely handles this, consider:

  • Adding bounds checking on the exponent value
  • Testing with extreme exponent values

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Executing "SELECT CAST('1e2147483648' AS DECIMAL(10,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '1e2147483648' of the type "STRING" cannot be cast to "DECIMAL(10,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So Comet failing to parse this input is consistent with Spark.
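If an explicit guard is still wanted rather than relying on downstream overflow checks, parsing the exponent into a wider type and bounding it is cheap. A sketch (the `parse_exponent` helper is hypothetical; the 38-digit bound mirrors Decimal128 precision and is an assumption, as is rejecting large negative exponents instead of rounding them to zero):

```rust
// Parse into i64 first so values like "2147483648" do not wrap an i32,
// then reject anything outside a plausible decimal exponent range.
const MAX_DECIMAL128_DIGITS: i64 = 38; // assumed bound for illustration

fn parse_exponent(exponent_part: &str) -> Result<i32, String> {
    let exp: i64 = exponent_part
        .parse()
        .map_err(|_| "invalid exponent".to_string())?;
    if exp.abs() > MAX_DECIMAL128_DIGITS {
        return Err(format!("exponent {exp} out of range"));
    }
    Ok(exp as i32)
}

fn main() {
    assert_eq!(parse_exponent("10"), Ok(10));
    assert!(parse_exponent("2147483648").is_err()); // would wrap an i32
    assert!(parse_exponent("abc").is_err());
}
```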

@martin-augment

4. Performance Consideration: Repeated String Allocations

Location: native/spark-expr/src/conversion_funcs/cast.rs:1187

let s_lower = s.to_lowercase();

For every string in the array, this allocates a new lowercased string. For large datasets, this could impact performance.

Optimization: Consider using eq_ignore_ascii_case for ASCII strings or checking specific characters without full allocation:

if s.len() == 3 && s.eq_ignore_ascii_case("inf") 
    || s.len() == 4 && s.eq_ignore_ascii_case("+inf")
    // ...

Severity: Low-Medium - Performance optimization, not a correctness issue

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The same can be achieved with str::eq_ignore_ascii_case() without allocating a new temporary string.
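A sketch of the allocation-free variant (the `parse_special` helper is hypothetical, not Comet's): eq_ignore_ascii_case compares in place and already implies a length check, so neither to_lowercase() nor the explicit s.len() guards are needed.

```rust
// Classify Spark-style special float literals without per-row allocation.
fn parse_special(s: &str) -> Option<f64> {
    let t = s.trim();
    if t.eq_ignore_ascii_case("inf")
        || t.eq_ignore_ascii_case("+inf")
        || t.eq_ignore_ascii_case("infinity")
        || t.eq_ignore_ascii_case("+infinity")
    {
        Some(f64::INFINITY)
    } else if t.eq_ignore_ascii_case("-inf") || t.eq_ignore_ascii_case("-infinity") {
        Some(f64::NEG_INFINITY)
    } else if t.eq_ignore_ascii_case("nan") {
        Some(f64::NAN)
    } else {
        None
    }
}

fn main() {
    assert_eq!(parse_special("  Infinity "), Some(f64::INFINITY));
    assert_eq!(parse_special("-INF"), Some(f64::NEG_INFINITY));
    assert!(parse_special("NaN").unwrap().is_nan());
    assert_eq!(parse_special("1.5"), None);
}
```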

@martin-augment

969-975: Consider adding LargeUtf8 support for float casting.

The decimal paths at lines 970-975 support both Utf8 and LargeUtf8, but the float path at line 969 only supports Utf8. For consistency, consider adding LargeUtf8 support for float casting as well.

-(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
+(Utf8 | LargeUtf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),

value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! Adding LargeUtf8 as supported type will fail later at https://github.com/martin-augment/datafusion-comet/pull/37/files#diff-c09cf0751fb1466958da9fc412fbd55fa910425f1f3c8f6457d8adefad9a6483R1352 because it expects only a StringArray. To support LargeUtf8 it should cast to GenericStringArray instead.

@github-actions

github-actions bot commented Mar 3, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 3, 2026