
2835: feat: Support casting string non-int numeric types #39

Open
martin-augment wants to merge 9 commits into main from pr-2835-2025-12-11-09-16-34

Conversation

@martin-augment
Owner

2835: To review by AI

@coderabbitai

coderabbitai bot commented Dec 11, 2025

Walkthrough

This change extends Comet's casting capabilities to support string-to-numeric conversions for floating-point and decimal types. The documentation, native Rust implementation, and Scala compatibility logic are updated to mark string-to-float, string-to-double, and string-to-decimal conversions as compatible instead of incompatible. The Rust implementation adds parsing utilities for special float values and decimal formats, including exponent notation support. Test coverage is expanded with multiple ANSI-aware test scenarios covering edge cases and special values.


@claude

claude bot commented Dec 11, 2025

Pull Request Review: Support casting string to non-int numeric types

Summary

This PR adds comprehensive support for casting strings to Float, Double, and Decimal types, removing several compatibility restrictions. The implementation adds ~430 lines of Rust code and extensive test coverage.

Positive Aspects

1. Excellent Test Coverage

  • Comprehensive test cases covering edge cases (inf, -inf, nan, scientific notation, D/F suffixes)
  • Tests both ANSI and non-ANSI modes
  • Tests various decimal precisions and scales
  • Good use of parameterized testing with Seq(true, false).foreach

2. Complete Feature Implementation

  • Properly handles all special values (inf, -inf, nan, infinity variants)
  • Supports scientific notation (e.g., "1.23E-5")
  • Handles D/F suffixes for float literals (e.g., "1.5f", "2.0d")
  • Implements proper rounding for decimal conversions (half-up rounding)

3. Documentation Updates

  • Updated compatibility.md to reflect new capabilities
  • Removed incompatibility notes from CometCast.scala

Issues and Concerns

1. CRITICAL: Potential Division-by-Zero Logic Error

Location: native/spark-expr/src/conversion_funcs/cast.rs:1300-1304

let quotient_opt = mantissa.checked_div(divisor);
// Check if divisor is 0
if quotient_opt.is_none() {
    return Ok(None);
}

Issue: The comment says "Check if divisor is 0" but checked_div returns None on overflow, not division by zero. Since divisor = 10_i128.pow(abs_scale_adjustment) where abs_scale_adjustment <= 38, the divisor can never be zero. This check is misleading and the comment is incorrect.

Recommendation: Either remove this check entirely or clarify the comment to say "Check for overflow in division".


2. Bug: Incorrect Rounding Behavior

Location: native/spark-expr/src/conversion_funcs/cast.rs:1308-1318

// Round half up: if abs(remainder) >= divisor/2, round away from zero
let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: The rounding logic is incorrect for negative numbers. "Half-up" rounding means rounding towards positive infinity when exactly at 0.5, not "away from zero".

Example:

  • For -2.5 with scale 0, the current code rounds to -3 (away from zero)
  • Correct half-up rounding should give -2 (towards positive infinity)

Recommendation: The correct implementation should be:

let rounded = if remainder.abs() > half_divisor {
    // Clear case: round away from zero
    if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
} else if remainder.abs() == half_divisor {
    // Exactly at half: round towards positive infinity (half-up)
    if mantissa >= 0 { quotient + 1 } else { quotient }
} else {
    quotient
};

Or verify that Spark uses "round half away from zero" instead of "round half up". If so, update the comment.
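For context, Java's RoundingMode.HALF_UP (which Spark's Decimal uses when changing precision) rounds ties away from zero, since "UP" in Java means away from zero rather than toward positive infinity. The behavior in question can be checked against a standalone sketch (function name hypothetical, illustrative only):

```rust
// Ties round away from zero, matching Java's RoundingMode.HALF_UP
// semantics ("UP" in Java means away from zero).
fn round_half_away_from_zero(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor; // Rust integer division truncates toward zero
    let remainder = mantissa % divisor;
    if remainder.abs() * 2 >= divisor {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

fn main() {
    assert_eq!(round_half_away_from_zero(25, 10), 3);   // 2.5  -> 3
    assert_eq!(round_half_away_from_zero(-25, 10), -3); // -2.5 -> -3 (away from zero)
    assert_eq!(round_half_away_from_zero(-24, 10), -2); // -2.4 -> -2
    assert_eq!(round_half_away_from_zero(15, 10), 2);   // 1.5  -> 2
}
```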


3. Code Duplication

Location: cast_string_to_decimal128_impl and cast_string_to_decimal256_impl

These two functions have nearly identical logic (96% identical). The only difference is:

  • Builder type: Decimal128Builder vs PrimitiveBuilder::<Decimal256Type>
  • Decimal256 has one extra line converting i128 to i256

Recommendation: Consider extracting common logic or using a trait-based approach to reduce duplication and improve maintainability.
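One possible shape for that refactor, as a stdlib-only sketch (names hypothetical, arrow builder types elided): write the shared null-handling/parse loop once over a conversion closure, so the two variants differ only in how a parsed i128 is materialized.

```rust
// Shared loop over optional string values; `convert` is the only part
// that would differ between a Decimal128 path (identity on i128) and a
// Decimal256 path (i128 -> i256 widening).
fn cast_parsed_values<T>(
    values: &[Option<&str>],
    convert: impl Fn(i128) -> T,
) -> Vec<Option<T>> {
    values
        .iter()
        .map(|opt| opt.and_then(|s| s.trim().parse::<i128>().ok()).map(&convert))
        .collect()
}

fn main() {
    // Decimal128-style path: keep the i128 as-is.
    let d128 = cast_parsed_values(&[Some(" 123 "), None, Some("xyz")], |v| v);
    assert_eq!(d128, vec![Some(123), None, None]);
    // Decimal256-style path: widen via a stand-in for i256::from_i128.
    let d256 = cast_parsed_values(&[Some("-7")], |v| v.to_string());
    assert_eq!(d256, vec![Some("-7".to_string())]);
}
```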


4. Missing Edge Case Tests

While test coverage is good, consider adding tests for:

  1. Boundary values for i128 overflow: Test strings that would overflow i128 during parsing (e.g., values > 340282366920938463463374607431768211455)

  2. Rounding edge cases: Test the exact behavior of values like "2.5", "-2.5", "1.5", "-1.5" when casting to Decimal(1,0) to ensure rounding matches Spark exactly

  3. Very large exponents: Test strings like "1e100", "1e-100" to verify correct handling

  4. Malformed scientific notation: While you test "1.23e" and "e5", also test "1.23e+" and "1.23e-"

  5. Multiple signs: Test strings like "++1", "--1", "+-1" (though is_valid_decimal_format should catch these)
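For the i128 boundary case specifically, the overflow surfaces as a parse error or None rather than a panic, which a quick standalone check confirms:

```rust
fn main() {
    // i128::MAX has 39 digits; this 39-digit value (u128::MAX + 1) is
    // out of i128's range and fails to parse.
    let too_big = "340282366920938463463374607431768211456";
    assert!(too_big.parse::<i128>().is_err());
    // checked_mul / checked_add report overflow as None instead of panicking.
    assert_eq!(i128::MAX.checked_mul(10), None);
    assert_eq!(i128::MAX.checked_add(1), None);
}
```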


5. Performance Consideration

Location: parse_string_to_float:1472

let s_lower = s.to_lowercase();

Issue: This creates a new allocation for every string, even when the string doesn't contain special values.

Recommendation: Consider checking for special values using case-insensitive comparison before allocation:

// Fast path: if no special chars, try parse directly
if !s.contains(|c: char| matches!(c, 'i' | 'I' | 'n' | 'N' | 'd' | 'D' | 'f' | 'F')) {
    return s.parse::<F>().ok();
}
// Slow path: handle special cases
let s_lower = s.to_lowercase();
// ... rest of logic

6. Minor: Inconsistent Error Handling

Location: cast_string_to_decimal256_impl:1139

The function uses PrimitiveBuilder::<Decimal256Type> but the error message at line 1137 says "Expected string array". Consider making error messages more specific about which function failed.


7. Documentation: Comment Typo

Location: native/spark-expr/src/conversion_funcs/cast.rs:1226

// Exponent part must have optional sign followed by atleast a digit

Typo: "atleast" should be "at least" (two words)


8. Missing Unit Tests for Helper Functions

The helper functions is_valid_decimal_format, parse_decimal_str, and parse_string_to_float are not directly unit tested. While they're tested indirectly through integration tests, consider adding direct unit tests with #[cfg(test)] for better isolation and debugging.


Security Considerations

✅ No obvious security issues detected:

  • Proper bounds checking on exponents and scale adjustments
  • Uses checked_mul, checked_div, and checked_add to prevent overflow
  • Input validation via is_valid_decimal_format
  • No unsafe code

Performance Considerations

  • The implementation uses linear string parsing which is appropriate
  • Consider the allocation optimization mentioned in issue 2680: chore: various refactoring changes for iceberg #5
  • The validation function is_valid_decimal_format is called before parsing, which adds a small overhead but improves error messages

Code Quality

Strengths:

  • Clear function separation and good naming
  • Comprehensive comments explaining the logic
  • Proper error handling with EvalMode::Ansi support

Areas for improvement:

  • Reduce code duplication between Decimal128 and Decimal256 implementations
  • Fix the critical rounding bug
  • Add more edge case tests

Recommendations Summary

Must Fix (Critical):

  1. Fix the rounding logic for negative numbers (Issue 2634: fix: Fall back to Spark for trunc / date_trunc functions when format string is unsupported, or is not a literal value #2)
  2. Clarify or remove the misleading division-by-zero comment (Issue Add Claude Code GitHub Workflow #1)

Should Fix (Important):

  1. Add tests for rounding edge cases to verify Spark compatibility
  2. Reduce code duplication between decimal implementations (Issue 2644: chore: Rename COMET_EXPLAIN_VERBOSE_ENABLED to COMET_EXTENDED_EXPLAIN_FORMAT and change default #3)
  3. Fix typo "atleast" → "at least" (Issue 2595: fix: re-enable Comet abs #7)

Nice to Have:

  1. Add performance optimization for to_lowercase() allocation (Issue 2680: chore: various refactoring changes for iceberg #5)
  2. Add missing edge case tests (Issue 2643: Fix: Return null for null entries in CometPlainVector#getBinary #4)
  3. Add unit tests for helper functions (Issue 2689: feat: Add support for abs #8)

Verdict

This is a solid implementation with excellent test coverage. However, there is a critical bug in the rounding logic that needs to be fixed before merging. Once the rounding issue is addressed and verified against Spark's behavior, this PR will significantly improve Comet's compatibility with Spark.

Recommendation: Request Changes - Fix the rounding bug and verify against Spark's exact behavior, then approve.


@augmentcode augmentcode bot left a comment


Review completed. 4 suggestions posted.


eval_mode: EvalMode,
) -> SparkResult<ArrayRef> {
match to_type {
DataType::Float16 | DataType::Float32 => {

For to_type == Float16, this arm builds a Float32 array via Float32Type; that returns an array with the wrong Arrow type for a Float16 cast and could lead to downstream type mismatches. Consider ensuring the returned ArrayRef matches Float16 semantics (also applies to other locations in the PR).


Owner Author


value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct! Apache DataFusion supports the Float16 type but Apache Spark does not! And DataFusion Comet should behave as Spark does, i.e. it should map to Spark's FloatType (32-bit floating point).

cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
}
(Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),


Only Utf8 is handled for string→float casts; LargeUtf8 is not included here, which makes behavior inconsistent with decimal handling (Utf8 | LargeUtf8). Consider supporting LargeUtf8 here as well.


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! LargeUtf8 and Utf8View types are missed here. They could be added for broader support of TEXT types. Prevents bug reports when users try to cast LargeUtf8/Utf8View to Decimal values.

| DataType::Utf8 // note that there can be formatting differences
),
DataType::Utf8 if allow_incompat => matches!(
DataType::Utf8 if allow_incompat => {


With allow_incompat = true, Utf8 no longer reports compatibility for casts to Float32/Float64, while the default branch does—enabling the flag unexpectedly reduces compatibility. This looks inverted relative to intent and may incorrectly block valid casts when the flag is on.


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! It seems some functionality is lost with this PR/improvement. It would be nice to restore the support for casting a String to Float32/64. Prevents regression.


/// Parse a decimal string into mantissa and scale
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {


parse_decimal_str returns an i128 mantissa, so the string→Decimal256 path is effectively limited to values that fit in i128 magnitude. This may reject valid Decimal256 inputs with larger precision even though Decimal256 supports wider ranges.


Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The support for i128 comes from is_validate_decimal_precision(). It does not support (yet) i256. Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (1)

681-715: Potential duplicate test coverage.

The tests at lines 681-685 ("cast StringType to FloatType special values") and lines 711-715 ("ANSI support - cast StringType to FloatType special values") appear to test the same scenario with the same inputs (specialValues.toDF("a"), DataTypes.FloatType, and iterating over testAnsi = true/false).

Consider consolidating these into a single test or clarifying if there's an intentional difference.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dafa9a6 and 0a60b94.

📒 Files selected for processing (4)
  • docs/source/user-guide/latest/compatibility.md (1 hunks)
  • native/spark-expr/src/conversion_funcs/cast.rs (6 hunks)
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (2)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
  • Compatible (31-31)
spark/src/test/scala/org/apache/comet/CometCastSuite.scala (10)
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala (1)
  • test (48-55)
spark/src/test/scala/org/apache/comet/exec/CometExecSuite.scala (1)
  • test (59-68)
spark/src/test/scala/org/apache/comet/exec/CometWindowExecSuite.scala (1)
  • test (39-49)
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala (1)
  • test (37-44)
spark/src/test/scala/org/apache/comet/exec/CometColumnarShuffleSuite.scala (1)
  • test (56-68)
spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala (1)
  • test (38-48)
spark/src/test/scala/org/apache/spark/sql/comet/ParquetDatetimeRebaseSuite.scala (2)
  • test (142-147)
  • test (151-161)
spark/src/test/scala/org/apache/comet/exec/CometExec3_4PlusSuite.scala (1)
  • test (59-66)
native/spark-expr/src/conversion_funcs/cast.rs (2)
  • v (2699-2699)
  • values (2609-2609)
spark/src/test/scala/org/apache/comet/DataGenerator.scala (2)
  • generateStrings (51-53)
  • generateStrings (56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (10)
docs/source/user-guide/latest/compatibility.md (1)

162-164: LGTM!

The documentation correctly updates the compatibility table to reflect the new string-to-float, string-to-double, and string-to-decimal cast support. The changes are consistent with the implementation updates in the Scala and Rust code.

spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: LGTM!

The compatibility logic correctly marks string-to-float, string-to-double, and string-to-decimal casts as Compatible(), enabling native Comet execution for these cast operations. This aligns with the new Rust implementation that handles these conversions.

spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)

112-136: LGTM!

The specialValues sequence provides comprehensive coverage of edge cases for string-to-numeric casting, including:

  • Float/double suffixes (1.5f, 2.0d)
  • Infinity variants (inf, Inf, INF, +inf, -inf, +Infinity, -Infinity)
  • NaN variants (NaN, nan, NAN)
  • Scientific notation (1.23e4, 1.23E4, -1.23e-4)
  • Whitespace handling (" 123.456789 ")
  • Edge cases ("", "xyz", null)

This is a well-designed test input set.


734-791: LGTM!

Good coverage for string-to-decimal casting with:

  • Various precision/scale combinations: DecimalType(22,2), DecimalType(2,2), DecimalType(38,10)
  • Basic values including edge cases like ".5", "5.", "+123.45", "inf", empty string
  • Scientific notation support
  • ANSI mode testing for both true/false

The test cases align well with Spark's decimal parsing behavior.

native/spark-expr/src/conversion_funcs/cast.rs (6)

219-221: LGTM!

The can_cast_from_string function correctly enables casting to Float32, Float64, Decimal128, and Decimal256 types, which aligns with the new casting capabilities being added.


1341-1413: LGTM!

The parse_decimal_str function correctly handles:

  • Scientific notation with positive/negative exponents
  • Optional sign prefix
  • Decimal point handling (including ".5" format)
  • Overflow protection via checked_mul and checked_add
  • Proper scale calculation: fractional_scale - exponent

The implementation aligns with Spark's decimal parsing semantics.
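The mantissa/scale decomposition described above can be illustrated with a simplified standalone sketch (exponent handling elided; the function name mirrors the PR's but the body is illustrative only):

```rust
// "123.45" -> (12345, 2), "-0.001" -> (-1, 3): digits accumulate into an
// i128 mantissa, and the scale counts digits after the decimal point.
fn parse_decimal_str(s: &str) -> Option<(i128, i32)> {
    let (sign, rest) = match s.strip_prefix('-') {
        Some(r) => (-1i128, r),
        None => (1i128, s.strip_prefix('+').unwrap_or(s)),
    };
    let mut mantissa: i128 = 0;
    let mut scale = 0i32;
    let (mut seen_dot, mut seen_digit) = (false, false);
    for c in rest.chars() {
        match c {
            '0'..='9' => {
                // checked_* turns i128 overflow into None instead of a panic.
                mantissa = mantissa
                    .checked_mul(10)?
                    .checked_add((c as u8 - b'0') as i128)?;
                if seen_dot { scale += 1; }
                seen_digit = true;
            }
            '.' if !seen_dot => seen_dot = true,
            _ => return None,
        }
    }
    if !seen_digit { return None; }
    Some((sign * mantissa, scale))
}

fn main() {
    assert_eq!(parse_decimal_str("123.45"), Some((12345, 2)));
    assert_eq!(parse_decimal_str("-0.001"), Some((-1, 3)));
    assert_eq!(parse_decimal_str(".5"), Some((5, 1)));
    assert_eq!(parse_decimal_str("1e5"), None); // exponent support elided here
}
```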


1259-1339: LGTM!

The parse_string_to_decimal function correctly implements Spark's decimal parsing behavior:

  • Returns None for special float values (inf, nan) which aren't valid decimals
  • Uses format validation before parsing
  • Implements HALF_UP rounding (round half away from zero)
  • Handles overflow protection for extreme scale adjustments
  • Validates result fits target precision

The rounding logic at lines 1308-1318 correctly rounds away from zero for both positive and negative values when the remainder is >= half the divisor.


1415-1491: LGTM!

The float parsing implementation correctly handles:

  • Null values checked before accessing (per retrieved learnings)
  • Special float strings: inf, -inf, nan (case-insensitive)
  • Spark's D/F suffix convention (e.g., "1.5f", "2.0D")
  • Scientific notation via Rust's standard parse
  • ANSI mode error propagation with invalid_value error

The implementation aligns with Spark's string-to-float casting behavior.


1179-1257: LGTM!

The is_valid_decimal_format function provides robust validation:

  • Handles optional leading sign (+/-)
  • Rejects invalid sign combinations (++, +-)
  • Validates decimal point (only one allowed, not after exponent)
  • Validates exponent format (optional sign, at least one digit required)
  • Returns false for empty strings or strings without digits

The byte-based parsing is efficient and correctly handles all documented edge cases.
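A simplified standalone sketch of the validation rules listed above (illustrative only, not the PR's exact implementation):

```rust
// Byte-based check: optional sign, at most one '.', no '.' after the
// exponent, exponent requires at least one digit, and at least one digit
// overall.
fn is_valid_decimal_format(s: &str) -> bool {
    let b = s.as_bytes();
    let mut i = 0;
    if matches!(b.first(), Some(b'+') | Some(b'-')) { i = 1; } // leading sign
    let (mut digits, mut dot, mut exp) = (false, false, false);
    while i < b.len() {
        match b[i] {
            b'0'..=b'9' => digits = true,
            b'.' if !dot && !exp => dot = true,
            b'e' | b'E' if digits && !exp => {
                exp = true;
                // Exponent: optional sign followed by at least one digit.
                if matches!(b.get(i + 1), Some(b'+') | Some(b'-')) { i += 1; }
                if !matches!(b.get(i + 1), Some(b'0'..=b'9')) { return false; }
            }
            _ => return false,
        }
        i += 1;
    }
    digits
}

fn main() {
    assert!(is_valid_decimal_format("123.45"));
    assert!(is_valid_decimal_format("-1.23e-4"));
    assert!(is_valid_decimal_format(".5"));
    assert!(!is_valid_decimal_format("1.23e")); // exponent without digits
    assert!(!is_valid_decimal_format("++1"));   // invalid sign combination
    assert!(!is_valid_decimal_format("1.2.3")); // second decimal point
    assert!(!is_valid_decimal_format(""));      // no digits at all
}
```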


1620-1626: Remove unreachable DataFusion fallback logic for Utf8 -> Float32/Float64 conversions.

The cast_array function explicitly handles Utf8 -> Float32/Float64 conversions via cast_string_to_float at line 969, so the fallback match arm at line 1041 that calls is_datafusion_spark_compatible will never be reached for these types. The code at lines 1620–1626 that allows Float32/Float64 for Utf8 without allow_incompat is therefore unreachable and should either be removed or the custom handler should be removed for consistency.

Comment on lines +1128 to +1177
fn cast_string_to_decimal256_impl(
array: &ArrayRef,
eval_mode: EvalMode,
precision: u8,
scale: i8,
) -> SparkResult<ArrayRef> {
let string_array = array
.as_any()
.downcast_ref::<StringArray>()
.ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;

let mut decimal_builder = PrimitiveBuilder::<Decimal256Type>::with_capacity(string_array.len());

for i in 0..string_array.len() {
if string_array.is_null(i) {
decimal_builder.append_null();
} else {
let str_value = string_array.value(i).trim();
match parse_string_to_decimal(str_value, precision, scale) {
Ok(Some(decimal_value)) => {
// Convert i128 to i256
let i256_value = i256::from_i128(decimal_value);
decimal_builder.append_value(i256_value);
}
Ok(None) => {
if eval_mode == EvalMode::Ansi {
return Err(invalid_value(
str_value,
"STRING",
&format!("DECIMAL({},{})", precision, scale),
));
}
decimal_builder.append_null();
}
Err(e) => {
if eval_mode == EvalMode::Ansi {
return Err(e);
}
decimal_builder.append_null();
}
}
}
}

Ok(Arc::new(
decimal_builder
.with_precision_and_scale(precision, scale)?
.finish(),
))
}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Spark Decimal256 maximum precision digits Apache

💡 Result:

  • Apache Spark DecimalType: maximum precision = 38 digits. [1][2]
  • "Decimal256" (Apache Arrow / Decimal256 types) supports precision up to 76 digits. [3]
  • Spark does not currently support Decimal256 natively; there is an open ASF JIRA requesting Decimal256 / higher-precision decimal support (SPARK-50293). [4]

Sources:

  1. Spark DecimalType docs (MAX_PRECISION = 38).
  2. PySpark / Spark SQL docs (DecimalType precision up to 38).
  3. Apache Arrow pyarrow.decimal256 docs (precision must be between 1 and 76).
  4. ASF JIRA SPARK-50293 (request to support DECIMAL256 / higher precision).

🏁 Script executed:

# Search for precision validation and limits in the codebase
rg -B2 -A2 "precision" native/spark-expr/src/conversion_funcs/cast.rs | head -50

Repository: martin-augment/datafusion-comet

Length of output: 2126


🏁 Script executed:

# Check if there are other decimal-related functions or comments about Decimal256
rg "Decimal256|precision.*76|precision.*38" native/spark-expr/src/ -B2 -A2

Repository: martin-augment/datafusion-comet

Length of output: 13445


🏁 Script executed:

# Look at the parse_string_to_decimal function to understand its constraints
rg -B5 -A15 "fn parse_string_to_decimal" native/spark-expr/src/

Repository: martin-augment/datafusion-comet

Length of output: 1673


🏁 Script executed:

# Search for any constraints or validation on Decimal256 precision
rg "Decimal256|MAX_PRECISION|precision.*38" native/spark-expr/src/ -B3 -A3 | head -80

Repository: martin-augment/datafusion-comet

Length of output: 6383


🏁 Script executed:

# Check if there are TODOs or FIXMEs related to Decimal256 or precision
rg "TODO|FIXME|XXX" native/spark-expr/src/conversion_funcs/cast.rs

Repository: martin-augment/datafusion-comet

Length of output: 144


🏁 Script executed:

# Look for any documentation or comments about supported decimal ranges
rg -A10 "fn cast_string_to_decimal\(" native/spark-expr/src/conversion_funcs/cast.rs | head -30

Repository: martin-augment/datafusion-comet

Length of output: 388


The cast_string_to_decimal256_impl function has a precision limitation due to using i128 intermediate values.

The function relies on parse_string_to_decimal, which returns i128, then converts to i256 via i256::from_i128. This approach limits decimal values to i128's range, supporting only precision up to 38 digits. Decimal256 should support up to 76 digits of precision.

If higher precisions are needed in the future (or are already being passed in some code paths), the parsing logic must be extended to handle i256 directly, rather than converting from a limited i128 intermediate.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The support for i128 comes from is_validate_decimal_precision(). It does not support (yet) i256. Casting to Decimal should support LargeUtf8 values, so it should cast the passed array to GenericStringArray instead of StringArray (used for Utf8 values). Prevents runtime errors due to failed casting.

cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
}
(Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
(Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),


Bug: Missing LargeUtf8 handling for float cast

The match pattern at line 969 only handles Utf8 for string-to-float casts, but can_cast_from_string (line 215-221) claims both Utf8 and LargeUtf8 are supported for Float32 and Float64. When a LargeUtf8 array is passed for float conversion, it will not match the pattern, fall through to the catch-all case, and return an error since is_datafusion_spark_compatible doesn't handle LargeUtf8 either.


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! Support is missing for both LargeUtf8 and Utf8View types. They could easily be supported by using the GenericStringArray arrow data type in cast_string_to_float(). Prevents introducing an incomplete casting improvement.

}
(Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
}


Bug: LargeUtf8 decimal cast fails due to wrong downcast

The match patterns at lines 970-975 claim to handle Utf8 | LargeUtf8 for decimal casts, but the implementation functions cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray (which represents Utf8). When a LargeUtf8 array is passed, the downcast will fail and return an error because LargeUtf8 uses LargeStringArray, not StringArray.


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! To be able to support LargeUtf8 type it needs to cast to LargeStringArray or GenericStringArray. Prevents a runtime error when the caller passes an array of LargeUtf8 instance.

DataType::Utf8 if allow_incompat => matches!(
DataType::Utf8 if allow_incompat => {
matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
}


Bug: Decimal128/256 inconsistent with cast_supported for incompat mode

The is_datafusion_spark_compatible function requires allow_incompat to be true for Utf8 to Decimal128 delegation, but can_cast_from_string unconditionally returns true for Decimal128 casts. While the custom implementation at lines 970-972 now handles this case (making this code path less relevant), the documentation in compatibility.md was updated to show these casts as compatible, creating an inconsistency in the fallback logic. If the custom implementation were to fail or be removed, the fallback would unexpectedly require allow_incompat=true despite the cast being documented as compatible.


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! It seems some functionality is lost with this PR/improvement. It would be nice to restore the support for casting a String to Float32/64. Prevents regression.

@martin-augment
Owner Author

1. CRITICAL: Potential Division-by-Zero Logic Error

Location: native/spark-expr/src/conversion_funcs/cast.rs:1300-1304

let quotient_opt = mantissa.checked_div(divisor);
// Check if divisor is 0
if quotient_opt.is_none() {
    return Ok(None);
}

Issue: The comment says "Check if divisor is 0" but checked_div returns None on overflow, not division by zero. Since divisor = 10_i128.pow(abs_scale_adjustment) where abs_scale_adjustment <= 38, the divisor can never be zero. This check is misleading and the comment is incorrect.

Recommendation: Either remove this check entirely or clarify the comment to say "Check for overflow in division".

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! checked_div() does return None when the divisor is 0 or when the result is an overflow. https://doc.rust-lang.org/std/primitive.i32.html#method.checked_div.
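The documented behavior can be confirmed directly: checked_div returns None both for a zero divisor and for the single overflowing case (i128::MIN / -1).

```rust
fn main() {
    // None for division by zero...
    assert_eq!(10_i128.checked_div(0), None);
    // ...and None for the one overflowing case, i128::MIN / -1.
    assert_eq!(i128::MIN.checked_div(-1), None);
    // Ordinary division succeeds.
    assert_eq!(10_i128.checked_div(3), Some(3));
}
```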

@martin-augment
Owner Author

2. Bug: Incorrect Rounding Behavior

Location: native/spark-expr/src/conversion_funcs/cast.rs:1308-1318

// Round half up: if abs(remainder) >= divisor/2, round away from zero
let half_divisor = divisor / 2;
let rounded = if remainder.abs() >= half_divisor {
    if mantissa >= 0 {
        quotient + 1
    } else {
        quotient - 1
    }
} else {
    quotient
};

Issue: The rounding logic is incorrect for negative numbers. "Half-up" rounding means rounding towards positive infinity when exactly at 0.5, not "away from zero".

Example:

  • For -2.5 with scale 0, the current code rounds to -3 (away from zero)
  • Correct half-up rounding should give -2 (towards positive infinity)

Recommendation: The correct implementation should be:

let rounded = if remainder.abs() > half_divisor {
    // Clear case: round away from zero
    if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
} else if remainder.abs() == half_divisor {
    // Exactly at half: round towards positive infinity (half-up)
    if mantissa >= 0 { quotient + 1 } else { quotient }
} else {
    quotient
};

Or verify that Spark uses "round half away from zero" instead of "round half up". If so, update the comment.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! Trying to execute "SELECT CAST('123.45D' AS DECIMAL(5,2)) AS converted_number;" in Apache Spark 4.0.1 fails with "org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value '123.45D' of the type "STRING" cannot be cast to "DECIMAL(5,2)" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead. SQLSTATE: 22018". So, DataFusion Comet behaves the same way.
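Whichever mode Spark actually requires, the disputed distinction between "half up" and "half away from zero" is easy to pin down with plain scaled-integer arithmetic; the helpers below are a standalone sketch mirroring the mantissa/divisor style of the code above, not the PR's implementation:

```rust
// "Half away from zero": ties round to larger magnitude (-2.5 -> -3).
fn round_half_away_from_zero(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    if remainder.abs() * 2 >= divisor {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

// "Half up": ties round toward positive infinity (-2.5 -> -2).
fn round_half_up(mantissa: i128, divisor: i128) -> i128 {
    let quotient = mantissa / divisor;
    let remainder = mantissa % divisor;
    let doubled = remainder.abs() * 2;
    if doubled > divisor || (doubled == divisor && mantissa >= 0) {
        if mantissa >= 0 { quotient + 1 } else { quotient - 1 }
    } else {
        quotient
    }
}

fn main() {
    // -2.5 as mantissa -25 with divisor 10: the modes disagree.
    assert_eq!(round_half_away_from_zero(-25, 10), -3);
    assert_eq!(round_half_up(-25, 10), -2);
    // 2.5: both modes round to 3.
    assert_eq!(round_half_away_from_zero(25, 10), 3);
    assert_eq!(round_half_up(25, 10), 3);
    // Away from a tie (-2.6) the modes agree again.
    assert_eq!(round_half_away_from_zero(-26, 10), -3);
    assert_eq!(round_half_up(-26, 10), -3);
}
```

The two modes differ only on exact negative halves, which is why tests that omit such values cannot distinguish them.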

@martin-augment
Owner Author

5. Performance Consideration

Location: parse_string_to_float:1472

let s_lower = s.to_lowercase();

Issue: This creates a new allocation for every string, even when the string doesn't contain special values.

Recommendation: Consider checking for special values using case-insensitive comparison before allocation:

// Fast path: if no special chars, try parse directly
if !s.contains(|c: char| matches!(c, 'i' | 'I' | 'n' | 'N' | 'd' | 'D' | 'f' | 'F')) {
    return s.parse::<F>().ok();
}
// Slow path: handle special cases
let s_lower = s.to_lowercase();
// ... rest of logic

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The same could be achieved by using str::eq_ignore_ascii_case() without allocating a new temporary string.
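As the feedback suggests, `str::eq_ignore_ascii_case` can match the special spellings with no allocation at all; the function below is a simplified stand-in for the PR's `parse_string_to_float`, not its actual code:

```rust
// Allocation-free handling of Spark's special float spellings; ordinary
// numbers fall through to the standard parser.
fn parse_special_f64(s: &str) -> Option<f64> {
    let t = s.trim();
    if t.eq_ignore_ascii_case("inf")
        || t.eq_ignore_ascii_case("+inf")
        || t.eq_ignore_ascii_case("infinity")
        || t.eq_ignore_ascii_case("+infinity")
    {
        Some(f64::INFINITY)
    } else if t.eq_ignore_ascii_case("-inf") || t.eq_ignore_ascii_case("-infinity") {
        Some(f64::NEG_INFINITY)
    } else if t.eq_ignore_ascii_case("nan") {
        Some(f64::NAN)
    } else {
        // Fast path: no lowercase copy is ever created.
        t.parse::<f64>().ok()
    }
}

fn main() {
    assert_eq!(parse_special_f64("  Infinity "), Some(f64::INFINITY));
    assert_eq!(parse_special_f64("-INF"), Some(f64::NEG_INFINITY));
    assert!(parse_special_f64("NaN").unwrap().is_nan());
    assert_eq!(parse_special_f64("1.5"), Some(1.5));
    assert_eq!(parse_special_f64("abc"), None);
}
```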

@martin-augment
Owner Author

6. Minor: Inconsistent Error Handling

Location: cast_string_to_decimal256_impl:1139

The function uses PrimitiveBuilder::<Decimal256Type> but the error message at line 1137 says "Expected string array". Consider making error messages more specific about which function failed.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The error message is very generic; if the error is handled or logged higher up the call stack, it will be hard for the developer to tell which function returned it and in what context, i.e. what input was not accepted.
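One way to make such errors self-identifying is to include the failing function's name in the message; the helper and names below are illustrative, not Comet's actual API:

```rust
// Hypothetical helper that prefixes a downcast error with the function
// that raised it, so logs higher up the stack stay traceable.
fn downcast_err(func: &str, expected: &str) -> String {
    format!("{func}: expected {expected}, got a different array type")
}

fn main() {
    let msg = downcast_err("cast_string_to_decimal256_impl", "string array");
    assert!(msg.contains("cast_string_to_decimal256_impl"));
    assert!(msg.contains("string array"));
    println!("{msg}");
}
```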

@martin-augment
Owner Author

7. Documentation: Comment Typo

Location: native/spark-expr/src/conversion_funcs/cast.rs:1226

// Exponent part must have optional sign followed by atleast a digit

Typo: "atleast" should be "at least" (two words)

value:good-to-have; category:documentation; feedback:The Claude AI reviewer is correct! The comment also sounds wrong - "must have" + "optional". And "atleast" should be two words: "at" and "least". A better comment could be: "The exponent part should have an optional sign followed by at least one digit".

@martin-augment
Owner Author

681-715: Potential duplicate test coverage.

The tests at lines 681-685 ("cast StringType to FloatType special values") and lines 711-715 ("ANSI support - cast StringType to FloatType special values") appear to test the same scenario with the same inputs (specialValues.toDF("a"), DataTypes.FloatType, and iterating over testAnsi = true/false).

Consider consolidating these into a single test or clarifying if there's an intentional difference.

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The tests are very similar and could be combined into one; consolidating them prevents code duplication and reduces maintenance cost.

@github-actions

github-actions bot commented Mar 3, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 3, 2026
