2835: feat: Support casting string non int numeric types #37

Open

martin-augment wants to merge 7 commits into main from pr-2835-2025-12-09-07-27-29

Conversation

@martin-augment
Owner

2835: To review by AI

@coderabbitai

coderabbitai bot commented Dec 9, 2025

Walkthrough

This pull request extends cast compatibility for string-to-numeric conversions. The changes update documentation to reflect string-to-float, string-to-double, and string-to-decimal casts as compatible. The Rust implementation adds parsing logic for converting strings to Decimal128, Decimal256, and floating-point types, with support for special values (inf, nan) and scientific notation. Corresponding Scala code relaxes cast compatibility checks for these conversions. Test coverage is expanded with new test cases covering special value handling, high-precision decimals, and scientific notation across ANSI and non-ANSI modes.


@claude

claude bot commented Dec 9, 2025

Pull Request Review: Support casting string to non-int numeric types

Summary

This PR implements full support for casting strings to float, double, and decimal types in Apache DataFusion Comet, achieving Spark compatibility. The implementation removes previous incompatibility restrictions and adds comprehensive handling for special values, scientific notation, and edge cases.


✅ Strengths

1. Comprehensive Feature Implementation

  • Successfully implements string-to-float, string-to-double, and string-to-decimal casts
  • Handles both Decimal128 and Decimal256 types
  • Properly supports ANSI and non-ANSI evaluation modes
  • Removes compatibility restrictions from the documentation (compatibility.md)

2. Robust Edge Case Handling

  • Special values: inf, +inf, -inf, infinity, nan (case-insensitive)
  • Scientific notation: 1.23E-5, 1e10, etc.
  • Format variations: 1.5f, 2.0d, 1.5F, 2.0D suffixes
  • Leading/trailing whitespace trimming
  • Empty strings and null values

3. Excellent Test Coverage

The test suite is comprehensive and well-structured:

  • Special values test: Tests for float/double/decimal suffixes, infinity, NaN, scientific notation, whitespace handling
  • Scientific notation test: Dedicated tests for various scientific notation formats
  • Decimal precision tests: Tests for basic values, high precision (38,10), overflow scenarios
  • Random data tests: Uses data generation for broader coverage
  • Both ANSI and non-ANSI modes tested

4. Good Rust Practices

  • Proper error handling with SparkResult and SparkError
  • Use of checked_mul and checked_add to prevent integer overflow
  • Clear function separation and naming conventions
  • Appropriate use of Arrow builders

⚠️ Issues and Concerns

1. Critical Bug: Incorrect Rounding Logic in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1216-1226

The rounding logic for decimal scale adjustment has a subtle but critical bug:

let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: This implements "round half up" incorrectly for the exact halfway case when divisor is even. When remainder.abs() == half_divisor, the code rounds away from zero, but Spark uses "round half up" which means:

  • Positive values: round up
  • Negative values: round toward zero when exactly halfway (not away from zero)

Example: For -1.5 with scale 0:

  • Current implementation: -2 (rounds away from zero)
  • Spark behavior: Should be -1 (rounds toward positive infinity)

Severity: High - This affects correctness of decimal conversions

2. Missing D/F Suffix Handling in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment says:

// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input

However, based on the test cases ("1.5f", "2.0d"), Spark does accept these suffixes for decimal parsing. The current implementation will return None for inputs like "123.45d" because Rust's parse() will fail.

Test Evidence: The test specialValues includes "1.5f", "1.5F", "2.0d", "2.0D" and expects them to work for decimal types.

Recommendation: Strip trailing 'd', 'D', 'f', 'F' before parsing in parse_decimal_str.
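A minimal sketch of such a stripping helper (the name and the digit/dot guard are illustrative assumptions, not code from the PR). Special values must be matched first, since `inf` itself ends in `f`; the guard below keeps `inf`/`nan` intact by requiring a digit or `.` before the suffix:

```rust
// Hypothetical helper: drop a single trailing d/D/f/F type suffix before
// numeric parsing, mirroring what the float path already does. The guard
// requires a digit or '.' before the suffix so "inf"/"nan" are untouched.
fn strip_type_suffix(s: &str) -> &str {
    let b = s.as_bytes();
    match b.last().copied() {
        Some(b'd' | b'D' | b'f' | b'F')
            if b.len() > 1 && (b[b.len() - 2].is_ascii_digit() || b[b.len() - 2] == b'.') =>
        {
            &s[..s.len() - 1]
        }
        _ => s,
    }
}
```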

3. Potential Integer Overflow in Scientific Notation

Location: native/spark-expr/src/conversion_funcs/cast.rs:1264-1266

let exp: i32 = exponent_part
    .parse()
    .map_err(|_| "Invalid exponent".to_string())?;

If a user provides extremely large exponents (e.g., "1e2147483648"), this could cause issues. While Spark likely handles this, consider:

  • Adding bounds checking on the exponent value
  • Testing with extreme exponent values
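One way to sketch that bounds check — parsing into a wider type first so out-of-range text fails cleanly rather than overflowing downstream scaling (the limit constant is an assumption for illustration, not a value taken from Spark or the PR):

```rust
// Assumed limit on |exponent|; any real bound should be derived from the
// maximum decimal precision/scale rather than this placeholder.
const MAX_ABS_EXPONENT: i64 = 10_000;

fn parse_bounded_exponent(exponent_part: &str) -> Result<i32, String> {
    // Parse as i64 so "1e2147483648" fails the range check instead of
    // failing (or wrapping) inside an i32 parse.
    let exp: i64 = exponent_part
        .parse()
        .map_err(|_| "Invalid exponent".to_string())?;
    if exp.abs() > MAX_ABS_EXPONENT {
        return Err(format!("Exponent out of range: {exp}"));
    }
    Ok(exp as i32)
}
```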

4. Performance Consideration: Repeated String Allocations

Location: native/spark-expr/src/conversion_funcs/cast.rs:1187

let s_lower = s.to_lowercase();

For every string in the array, this allocates a new lowercased string. For large datasets, this could impact performance.

Optimization: Consider using eq_ignore_ascii_case, which avoids the allocation and already implies a length check (so explicit s.len() comparisons are unnecessary):

if s.eq_ignore_ascii_case("inf")
    || s.eq_ignore_ascii_case("+inf")
    || s.eq_ignore_ascii_case("-inf")
    // ...

Severity: Low-Medium - Performance optimization, not a correctness issue

5. Missing Test: Decimal Precision Boundary Cases

While the tests are comprehensive, consider adding explicit tests for:

  • Maximum precision values (e.g., Decimal128(38, 0))
  • Values that exactly fit vs. overflow by one
  • Negative scale decimals (if supported by Spark)
  • Very large scientific notation that overflows during scaling

6. Code Duplication Between Decimal128 and Decimal256

Location: cast_string_to_decimal128_impl and cast_string_to_decimal256_impl

These two functions have nearly identical logic, differing only in the builder type and the i128 → i256 conversion. Consider:

  • Using a generic function or macro to reduce duplication
  • This reduces maintenance burden and potential for bugs
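As a rough illustration of the shape such a refactor could take, the shared per-row control flow (trim, null handling, ANSI error vs. append-null) can live in one function, with the type-specific append step supplied by each caller. The function below uses plain `i128` parsing and a `Vec` as hypothetical stand-ins for the PR's `parse_string_to_decimal` and Arrow builders:

```rust
// Sketch of the shared row loop; a Decimal128 caller would push values
// directly, a Decimal256 caller would convert via i256::from_i128 first.
fn parse_rows(values: &[Option<&str>], ansi: bool) -> Result<Vec<Option<i128>>, String> {
    let mut out = Vec::with_capacity(values.len());
    for v in values {
        match v {
            None => out.push(None),
            Some(s) => match s.trim().parse::<i128>() {
                Ok(n) => out.push(Some(n)),
                // ANSI mode: fail the whole cast on the first bad value.
                Err(_) if ansi => return Err(format!("invalid value: {s}")),
                // Non-ANSI mode: bad values become null.
                Err(_) => out.push(None),
            },
        }
    }
    Ok(out)
}
```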

7. Unclear Comment About D/F Suffix Behavior

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment contradicts the test expectations. Either:

  • The comment should be updated to reflect actual behavior
  • The implementation should strip D/F suffixes to match Spark
  • Tests should be added to verify the expected behavior

🔍 Additional Observations

Documentation Updates

The changes to compatibility.md correctly move string→float/double/decimal from "Incompatible" to "Compatible". This aligns with the implementation.

Scala Changes

The CometCast.scala changes correctly remove the incompatibility markers for these cast operations. Good consistency with the Rust implementation.

Float Parsing

The parse_string_to_float function correctly handles D/F suffixes by stripping them before parsing. This same logic should be applied to decimal parsing.


📋 Recommendations

High Priority (Must Fix)

  1. Fix the rounding logic in parse_string_to_decimal to match Spark's "round half up" behavior exactly
  2. Add D/F suffix stripping in parse_decimal_str to match float parsing behavior
  3. Add tests to verify D/F suffix handling for decimals works correctly

Medium Priority (Should Fix)

  1. Add bounds checking for scientific notation exponents
  2. Add edge case tests for decimal precision boundaries
  3. Optimize string comparison to avoid allocations in hot path

Low Priority (Nice to Have)

  1. Reduce code duplication between Decimal128 and Decimal256 implementations
  2. Update or clarify comments about D/F suffix handling

🎯 Overall Assessment

This is a solid PR that significantly improves Spark compatibility for numeric type casting. The implementation is well-structured, the test coverage is comprehensive, and it removes major compatibility gaps.

However, there are critical correctness issues (rounding logic, D/F suffix handling) that must be addressed before merging. Once these are fixed, this will be a valuable addition to the project.

Recommendation: Request changes to address the high-priority issues, particularly the decimal rounding logic and D/F suffix handling.


Great work on this comprehensive feature implementation! 🚀


@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


    eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {

For DataType::Float16, this path builds a Float32 array (cast_string_to_float_impl::<Float32Type>), which won’t match the requested Float16; consider emitting a Float16 array (or casting the Float32 result) to avoid a type mismatch at runtime.

Owner Author


value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! And DataFusion Comet should behave as Spark does, i.e. it should map to the Spark's FloatType (32-bit floating point).

) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()

This downcast to StringArray will fail for LargeUtf8 inputs, yet the match arm includes LargeUtf8 for decimal casts; consider supporting LargeStringArray (e.g., via a generic string accessor) so both Utf8 and LargeUtf8 are handled (also applies to the Decimal256 impl).

Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! Trying to cast a LargeUtf8 array to StringArray will fail at runtime. It should use GenericStringArray. Prevents application crashes when using LargeUtf8 values.

DataType::Utf8 if allow_incompat => {
    matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
}
DataType::Utf8 => matches!(

is_datafusion_spark_compatible doesn’t treat Utf8 -> Decimal as compatible (and omits Decimal256), but casting from string to decimal is implemented and docs mark it compatible; consider including Decimal128 and Decimal256 in the non-allow_incompat branch to reflect actual support.

Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! Casting a Utf8 to Decimal128/256 is implemented in this Pull Request, so it should be handled as "Supported". Prevents delegating the cast operation to Spark when it could be done in the native code by Comet.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
native/spark-expr/src/conversion_funcs/cast.rs (1)

969-975: Consider adding LargeUtf8 support for float casting.

The decimal paths at lines 970-975 support both Utf8 and LargeUtf8, but the float path at line 969 only supports Utf8. For consistency, consider adding LargeUtf8 support for float casting as well.

-(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
+(Utf8 | LargeUtf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ab8e12 and 227afcc.

📒 Files selected for processing (4)
  • docs/source/user-guide/latest/compatibility.md (1 hunks)
  • native/spark-expr/src/conversion_funcs/cast.rs (6 hunks)
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (2)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
  • Compatible (31-31)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)
native/spark-expr/src/conversion_funcs/cast.rs (2)
  • v (2609-2609)
  • values (2519-2519)
spark/src/test/scala/org/apache/comet/DataGenerator.scala (2)
  • generateStrings (51-53)
  • generateStrings (56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (12)
docs/source/user-guide/latest/compatibility.md (1)

162-164: Documentation correctly reflects the compatibility changes.

The additions of string -> float, string -> double, and string -> decimal to the Compatible Casts table align with the implementation changes in the Rust and Scala files.

spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: String-to-float/double/decimal casts correctly marked as Compatible.

The changes to return Compatible() for FloatType, DoubleType, and DecimalType in canCastFromString align with the native Rust implementation that now supports these casts.

native/spark-expr/src/conversion_funcs/cast.rs (7)

219-221: Cast support correctly extended.

The additions for Float32, Float64, Decimal128, and Decimal256 in can_cast_from_string are consistent with the new implementation functions added below.


1058-1077: Clean dispatcher function.

The dispatcher correctly routes to the appropriate implementation based on the target decimal type.


1250-1323: Decimal string parsing logic is well-structured.

The parsing correctly handles:

  • Scientific notation with positive/negative exponents
  • Signed numbers
  • Decimal points
  • Overflow protection with checked arithmetic

The scale calculation at line 1321 (fractional_scale - exponent) correctly handles scientific notation.


1361-1370: ANSI mode error uses original untrimmed value.

At line 1366, the error reports arr.value(i) (untrimmed) while line 1361 uses arr.value(i).trim() for parsing. This is intentional and correct—it shows the user's original input in the error message.


1393-1401: Note: D/F suffix handling differs between float and decimal parsing.

Float parsing strips D/F suffixes (lines 1394-1398), while decimal parsing does not (per comment at line 1200). This appears intentional to match Spark's behavior where "1.5f" is valid for float but not for decimal.


1530-1536: Changes correctly reflect the custom implementation flow.

The is_datafusion_spark_compatible function is only reached for casts without custom implementations. Since cast_string_to_float and cast_string_to_decimal are now handled earlier in cast_array, this function's changes for Float32/Float64 handle edge cases that might bypass the custom path.


1207-1228: The rounding logic correctly implements Spark's HALF_UP semantics; the original concern is based on a misunderstanding.

Spark's HALF_UP rounds away from zero on ties (matching Java's RoundingMode.HALF_UP). The code at lines 1216–1227 correctly implements this: when remainder.abs() >= half_divisor, it rounds away from zero by adding 1 to the quotient for positive numbers and subtracting 1 for negative numbers, resulting in values like −1.5 → −2, which is the expected behavior.

The original review incorrectly stated that HALF_UP rounds "towards positive infinity" and conflated it with a different rounding mode. In Spark/Java HALF_UP, −1.5 rounds to −2 (away from zero), not −1.

Likely an incorrect or invalid review comment.
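For reference, a self-contained sketch of ties-away-from-zero division, the HALF_UP semantics described above. This is illustrative only, not the PR's exact code; in the PR the divisor is a positive power of ten, so `divisor / 2` is exact for divisor >= 10:

```rust
// Reduce scale with HALF_UP (ties away from zero), matching Java's
// RoundingMode.HALF_UP as used by Spark. The `remainder != 0` guard
// keeps divisor == 1 (no scale reduction) from rounding spuriously.
fn div_round_half_up(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor; // Rust division truncates toward zero
    let remainder = mantissa % divisor;
    if remainder != 0 && remainder.abs() >= divisor / 2 {
        // On a tie or beyond, step away from zero.
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}
```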

spark/src/test/scala/org/apache/comet/CometCastSuite.scala (3)

655-679: Comprehensive set of special values for testing.

The specialValues sequence covers a good range of edge cases including:

  • Type suffixes (f, F, d, D)
  • Infinity variants (inf, Inf, INF, +inf, -inf, +Infinity, -Infinity)
  • NaN variants (nan, NaN, NAN)
  • Scientific notation
  • Whitespace handling
  • Zero variants (0.0, -0.0)
  • Invalid inputs (empty string, "xyz", null)

681-785: Good test coverage for the new casting functionality.

The new tests comprehensively cover:

  • Both ANSI and non-ANSI modes via Seq(true, false).foreach
  • Float and Double types
  • Various decimal precisions (10,2), (38,10), (23,8)
  • Scientific notation with valid and invalid formats
  • Edge cases like empty strings, invalid characters, and null values

693-714: Existing tests improved with ANSI mode coverage.

The modifications to existing float/double/decimal cast tests now iterate over both ANSI enabled and disabled modes, improving test coverage.

Comment on lines +1079 to +1126
fn cast_string_to_decimal128_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = Decimal128Builder::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    decimal_builder.append_value(decimal_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}

⚠️ Potential issue | 🟠 Major

LargeUtf8 dispatch exists but implementation only handles StringArray.

The cast_array dispatch at lines 970-975 routes both Utf8 and LargeUtf8 to cast_string_to_decimal, but cast_string_to_decimal128_impl only downcasts to StringArray (Utf8). This will cause a runtime error for LargeUtf8 inputs.

Consider using as_generic_string_array or add separate handling for LargeUtf8:

+use datafusion::common::cast::as_generic_string_array;
+
 fn cast_string_to_decimal128_impl(
     array: &ArrayRef,
     eval_mode: EvalMode,
     precision: u8,
     scale: i8,
 ) -> SparkResult<ArrayRef> {
-    let string_array = array
-        .as_any()
-        .downcast_ref::<StringArray>()
-        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
+    // Handle both Utf8 and LargeUtf8
+    if let Ok(string_array) = as_generic_string_array::<i32>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    if let Ok(string_array) = as_generic_string_array::<i64>(array) {
+        return cast_string_to_decimal128_impl_generic(string_array, eval_mode, precision, scale);
+    }
+    Err(SparkError::Internal("Expected string array".to_string()))
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1079 to 1126, the
function cast_string_to_decimal128_impl only downcasts the input to StringArray
(Utf8) but the dispatch also routes LargeUtf8 inputs, which will panic at
runtime; update the function to accept both Utf8 and LargeUtf8 by using a
generic string accessor (e.g., as_generic_string_array) or explicitly handle
both types (downcast_ref::<StringArray>() and
downcast_ref::<LargeStringArray>()), iterate over values via the generic API (or
branch accordingly), preserve null handling and trimming logic, and then feed
parsed values into the Decimal128Builder as before so LargeUtf8 inputs no longer
cause a runtime error.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.

Comment on lines +1128 to +1177
fn cast_string_to_decimal256_impl(
    array: &ArrayRef,
    eval_mode: EvalMode,
    precision: u8,
    scale: i8,
) -> SparkResult<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

    let mut decimal_builder = PrimitiveBuilder::<Decimal256Type>::with_capacity(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            decimal_builder.append_null();
        } else {
            let str_value = string_array.value(i).trim();
            match parse_string_to_decimal(str_value, precision, scale) {
                Ok(Some(decimal_value)) => {
                    // Convert i128 to i256
                    let i256_value = i256::from_i128(decimal_value);
                    decimal_builder.append_value(i256_value);
                }
                Ok(None) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(invalid_value(
                            str_value,
                            "STRING",
                            &format!("DECIMAL({},{})", precision, scale),
                        ));
                    }
                    decimal_builder.append_null();
                }
                Err(e) => {
                    if eval_mode == EvalMode::Ansi {
                        return Err(e);
                    }
                    decimal_builder.append_null();
                }
            }
        }
    }

    Ok(Arc::new(
        decimal_builder
            .with_precision_and_scale(precision, scale)?
            .finish(),
    ))
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

(Collapsed: CodeRabbit ran a series of repository searches — locating parse_string_to_decimal, parse_decimal_str, and is_validate_decimal_precision, inspecting the Decimal128/Decimal256 cast implementations and their StringArray/LargeStringArray usage, checking the Utf8/LargeUtf8 dispatch in cast_string_to_decimal, and scanning CometCastSuite.scala for Decimal256 coverage. The searches confirmed that parse_string_to_decimal returns i128 and that both decimal implementations downcast only to StringArray.)

Two critical issues with Decimal256 string casting.

  1. Precision bottleneck: parse_string_to_decimal returns i128, limiting values to ~38 significant digits. However, Decimal256 supports up to 76 digits. Strings representing high-precision values (39-76 digits) will fail to parse.

  2. LargeUtf8 incompatibility: Both cast_string_to_decimal128_impl (line 1087) and cast_string_to_decimal256_impl (line 1136) downcast to StringArray only. However, LargeUtf8 uses LargeStringArray and requires GenericStringArray<i64>. This causes casting from LargeUtf8 to Decimal256 to fail with "Expected string array". Other string cast functions like cast_string_to_int correctly use GenericStringArray<OffsetSize> to handle both cases.
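The i128 bottleneck is easy to demonstrate in isolation: i128::MAX has 39 decimal digits, so any parser that funnels the unscaled value through i128 cannot hold the 76 digits Decimal256 permits. A minimal, self-contained sketch (the `fits_in_i128` helper is hypothetical, not Comet code):

```rust
// Standalone illustration of the i128 precision ceiling.
fn fits_in_i128(unscaled_digits: &str) -> bool {
    unscaled_digits.parse::<i128>().is_ok()
}

fn main() {
    // i128::MAX = 170141183460469231731687303715884105727 (39 digits)
    assert_eq!(i128::MAX.to_string().len(), 39);
    // 38 nines fit...
    assert!(fits_in_i128(&"9".repeat(38)));
    // ...but 39 nines already overflow, far short of the 76 significant
    // digits a Decimal256 value may carry.
    assert!(!fits_in_i128(&"9".repeat(39)));
}
```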

🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1128-1177, the
function both incorrectly downcasts only to StringArray (failing for
LargeUtf8/LargeStringArray) and relies on parse_string_to_decimal which returns
i128 (unable to parse >38 digits needed for Decimal256). Fix by changing the
array downcast to handle GenericStringArray<OffsetSize> (or branch to accept
both StringArray and LargeStringArray) so LargeUtf8 inputs are accepted, and
replace/use a parser that returns a 256-bit value (i256) — either implement a
new parse_string_to_decimal256 that parses arbitrary-precision decimal strings
into i256 respecting precision/scale up to 76 digits, or modify
parse_string_to_decimal to return i256 for the Decimal256 path; then use that
i256 directly when appending to the Decimal256 builder and preserve existing
EvalMode (Ansi/null) handling.


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The i128 limitation comes from is_validate_decimal_precision(), which does not yet support i256. Casting to Decimal should also support LargeUtf8 values, so the passed array should be downcast to GenericStringArray instead of StringArray (which only covers Utf8). Prevents runtime errors due to failed downcasts.

}
(Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
    cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
}

Bug: LargeUtf8 to Decimal cast fails with runtime error

The pattern match accepts both Utf8 | LargeUtf8 for Decimal128 and Decimal256 conversions, but cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray (which is for Utf8). When a LargeUtf8 array is passed, the downcast will return None and the function will return an "Expected string array" error. The implementations need to also handle LargeStringArray or use GenericStringArray<OffsetSize> with a type parameter.



value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.

) -> SparkResult<ArrayRef> {
    match to_type {
        DataType::Float16 | DataType::Float32 => {
            cast_string_to_float_impl::<Float32Type>(array, eval_mode, "FLOAT")

Bug: Float16 cast returns wrong array type

When casting a string to Float16, the code uses Float32Type which returns a Float32Array instead of a Float16Array. This type mismatch could cause issues downstream when the caller expects a Float16 array but receives a Float32 array. The Float16 case is bundled with Float32 but uses the same implementation that produces Float32Type output.



value:incorrect-but-reasonable; category:bug; feedback:The Bugbot AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! DataFusion Comet should behave as Spark does, i.e. it should map Float16 to Spark's FloatType (a 32-bit floating point).


// Parse the string as a decimal number
// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input
match parse_decimal_str(s) {

Bug: Decimal parsing doesn't strip D/F suffixes like float does

The parse_string_to_float function strips 'D' and 'F' suffixes from input strings (line 1394), but parse_string_to_decimal explicitly does not strip these suffixes. This causes inputs like "123.45D" to return null for decimal casts while returning the correct value for float casts. The test includes "123.45D" as a test case, suggesting Spark supports this format for decimal casts. This inconsistency means decimal casts will fail where float casts succeed for the same input pattern.



value:annoying; category:bug; feedback:The Bugbot AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.
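The "fail naturally" behavior leans on the fact that Rust's standard float parser rejects Java-style type suffixes outright, which lines up with Spark's CAST_INVALID_INPUT error above. A quick standalone check:

```rust
// Rust's FromStr for f64 accepts plain and scientific notation,
// but not Java-style 'D'/'F' suffixes, so suffixed input errors out.
fn main() {
    assert!("123.45".parse::<f64>().is_ok());
    assert!("1.5e2".parse::<f64>().is_ok());
    assert!("123.45D".parse::<f64>().is_err());
    assert!("123.45f".parse::<f64>().is_err());
}
```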

integral_part
    .parse()
    .map_err(|_| "Invalid integral part".to_string())?
};

Bug: Malformed scientific notation "e5" returns zero instead of null

In parse_decimal_str, when the integral part is empty (as in "e5" or ".e5"), integral_value is set to 0 instead of returning an error. This causes malformed inputs like "e5" to parse as 0 rather than returning null. The test suite includes "e5" as a test value, and if Spark returns null for this invalid input, the Comet implementation will produce incorrect results. The function treats an empty string before 'e' as valid with value 0, which is semantically incorrect for number parsing.


The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Apache Spark returns an error/exception when trying to parse "e5" as a decimal, not 0. DataFusion Comet should do the same.
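One way to close that hole is to require at least one digit in the mantissa before accepting the input. A minimal sketch (the `parse_integral` helper is hypothetical, not the Comet function, and it ignores the sign and fraction for brevity):

```rust
// Reject scientific-notation input whose mantissa contains no digits,
// so "e5" and ".e5" error out instead of silently parsing as 0.
fn parse_integral(s: &str) -> Result<i128, String> {
    let (mantissa, _exponent) = match s.split_once(|c| c == 'e' || c == 'E') {
        Some((m, e)) => (m, Some(e)),
        None => (s, None),
    };
    // "e5", ".e5", "+.e5": a mantissa with no digits at all is malformed.
    if !mantissa.chars().any(|c| c.is_ascii_digit()) {
        return Err(format!("malformed numeric literal: {s}"));
    }
    let integral = mantissa.split_once('.').map_or(mantissa, |(i, _)| i);
    if integral.is_empty() || integral == "+" || integral == "-" {
        Ok(0) // e.g. ".5e2": the digits live only in the fraction
    } else {
        integral.parse().map_err(|_| "invalid integral part".to_string())
    }
}

fn main() {
    assert!(parse_integral("e5").is_err());
    assert!(parse_integral(".e5").is_err());
    assert_eq!(parse_integral("1e5"), Ok(1));
    assert_eq!(parse_integral(".5e2"), Ok(0));
    assert_eq!(parse_integral("-2.5"), Ok(-2));
}
```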

@martin-augment

1. Critical Bug: Incorrect Rounding Logic in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1216-1226

The rounding logic for decimal scale adjustment has a subtle but critical bug:

let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: This implements "round half up" incorrectly for the exact halfway case when divisor is even. When remainder.abs() == half_divisor, the code rounds away from zero, but Spark uses "round half up" which means:

  • Positive values: round up
  • Negative values: round toward zero when exactly halfway (not away from zero)

Example: For -1.5 with scale 0:

  • Current implementation: -2 (rounds away from zero)
  • Spark behavior: Should be -1 (rounds toward positive infinity)

Severity: High - This affects correctness of decimal conversions

value:useful; category:bug; feedback:The Claude AI reviewer is correct! It would be good to add more test cases for such edge values to validate that both Comet and Spark round to the same values when the scale requires it. Prevents bugs due to different behavior in Comet compared to Spark.
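The exact halfway case is the only place the two candidate modes disagree, which is precisely what the extra test cases should pin down against Spark. A standalone sketch of both behaviors (hypothetical helpers; which one matches Spark must be confirmed by the Spark tests; divisor is assumed positive, as it is a power of ten in the cast code):

```rust
// Round mantissa/divisor, resolving exact halves away from zero
// (Java RoundingMode.HALF_UP semantics). Doubling the remainder instead
// of halving the divisor avoids the even-divisor truncation pitfall.
fn div_half_away_from_zero(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    if remainder.abs() * 2 >= divisor {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

// Round mantissa/divisor, resolving exact halves toward +infinity.
fn div_half_toward_pos_inf(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    let doubled = remainder.abs() * 2;
    if doubled > divisor || (doubled == divisor && mantissa >= 0) {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

fn main() {
    // -1.5 at scale 0: the modes disagree only on the exact halfway case.
    assert_eq!(div_half_away_from_zero(-15, 10), -2);
    assert_eq!(div_half_toward_pos_inf(-15, 10), -1);
    // -1.6 rounds to -2 either way; 1.5 rounds to 2 either way.
    assert_eq!(div_half_away_from_zero(-16, 10), -2);
    assert_eq!(div_half_toward_pos_inf(-16, 10), -2);
    assert_eq!(div_half_away_from_zero(15, 10), 2);
    assert_eq!(div_half_toward_pos_inf(15, 10), 2);
}
```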

@martin-augment

2. Missing D/F Suffix Handling in Decimal Parsing

Location: native/spark-expr/src/conversion_funcs/cast.rs:1200

The comment says:

// Note: We do NOT strip 'D' or 'F' suffixes - let rust's parsing fail naturally for invalid input

However, based on the test cases ("1.5f", "2.0d"), Spark does accept these suffixes for decimal parsing. The current implementation will return None for inputs like "123.45d" because Rust's parse() will fail.

Test Evidence: The test specialValues includes "1.5f", "1.5F", "2.0d", "2.0D" and expects them to work for decimal types.

Recommendation: Strip trailing 'd', 'D', 'f', 'F' before parsing in parse_decimal_str.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.

@martin-augment

3. Potential Integer Overflow in Scientific Notation

Location: native/spark-expr/src/conversion_funcs/cast.rs:1264-1266

let exp: i32 = exponent_part
    .parse()
    .map_err(|_| "Invalid exponent".to_string())?;

If a user provides extremely large exponents (e.g., "1e2147483648"), this could cause issues. While Spark likely handles this, consider:

  • Adding bounds checking on the exponent value
  • Testing with extreme exponent values

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Executing "SELECT CAST('1e2147483648' AS DECIMAL(10,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '1e2147483648' of the type "STRING" cannot be cast to "DECIMAL(10,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So Comet failing to parse this input is consistent with Spark.
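If an explicit guard is still wanted rather than relying on downstream overflow checks, parsing the exponent into a wider type and bounding it is cheap. A sketch (the `parse_exponent` helper is hypothetical; the 38-digit bound mirrors Decimal128 precision and is an assumption, as is rejecting large negative exponents instead of rounding them to zero):

```rust
// Parse into i64 first so values like "2147483648" do not wrap an i32,
// then reject anything outside a plausible decimal exponent range.
const MAX_DECIMAL128_DIGITS: i64 = 38; // assumed bound for illustration

fn parse_exponent(exponent_part: &str) -> Result<i32, String> {
    let exp: i64 = exponent_part
        .parse()
        .map_err(|_| "invalid exponent".to_string())?;
    if exp.abs() > MAX_DECIMAL128_DIGITS {
        return Err(format!("exponent {exp} out of range"));
    }
    Ok(exp as i32)
}

fn main() {
    assert_eq!(parse_exponent("10"), Ok(10));
    assert!(parse_exponent("2147483648").is_err()); // would wrap an i32
    assert!(parse_exponent("abc").is_err());
}
```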

@martin-augment

4. Performance Consideration: Repeated String Allocations

Location: native/spark-expr/src/conversion_funcs/cast.rs:1187

let s_lower = s.to_lowercase();

For every string in the array, this allocates a new lowercased string. For large datasets, this could impact performance.

Optimization: Consider using eq_ignore_ascii_case for ASCII strings or checking specific characters without full allocation:

if s.len() == 3 && s.eq_ignore_ascii_case("inf") 
    || s.len() == 4 && s.eq_ignore_ascii_case("+inf")
    // ...

Severity: Low-Medium - Performance optimization, not a correctness issue

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The same can be achieved with str::eq_ignore_ascii_case() without allocating a new temporary string.
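A sketch of the allocation-free variant (the `parse_special` helper is hypothetical, not Comet's): eq_ignore_ascii_case compares in place and already implies a length check, so neither to_lowercase() nor the explicit s.len() guards are needed.

```rust
// Classify Spark-style special float literals without per-row allocation.
fn parse_special(s: &str) -> Option<f64> {
    let t = s.trim();
    if t.eq_ignore_ascii_case("inf")
        || t.eq_ignore_ascii_case("+inf")
        || t.eq_ignore_ascii_case("infinity")
        || t.eq_ignore_ascii_case("+infinity")
    {
        Some(f64::INFINITY)
    } else if t.eq_ignore_ascii_case("-inf") || t.eq_ignore_ascii_case("-infinity") {
        Some(f64::NEG_INFINITY)
    } else if t.eq_ignore_ascii_case("nan") {
        Some(f64::NAN)
    } else {
        None
    }
}

fn main() {
    assert_eq!(parse_special("  Infinity "), Some(f64::INFINITY));
    assert_eq!(parse_special("-INF"), Some(f64::NEG_INFINITY));
    assert!(parse_special("NaN").unwrap().is_nan());
    assert_eq!(parse_special("1.5"), None);
}
```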

@martin-augment

969-975: Consider adding LargeUtf8 support for float casting.

The decimal paths at lines 970-975 support both Utf8 and LargeUtf8, but the float path at line 969 only supports Utf8. For consistency, consider adding LargeUtf8 support for float casting as well.

-(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
+(Utf8 | LargeUtf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),

value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! Adding LargeUtf8 as supported type will fail later at https://github.com/martin-augment/datafusion-comet/pull/37/files#diff-c09cf0751fb1466958da9fc412fbd55fa910425f1f3c8f6457d8adefad9a6483R1352 because it expects only a StringArray. To support LargeUtf8 it should cast to GenericStringArray instead.

@github-actions

github-actions bot commented Mar 3, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 3, 2026