
2835: feat: Support string non int numeric types#32

Open
martin-augment wants to merge 2 commits into main from pr-2835-2025-12-02-15-05-01

Conversation

@martin-augment
Owner

2835: To review by AI

@coderabbitai

coderabbitai bot commented Dec 2, 2025

Walkthrough

This pull request extends casting support from strings to numeric types in the Comet query engine. The documentation was updated to reflect string-to-float, string-to-double, and string-to-decimal casts moving from incompatible to compatible. Native Rust code implements parsing logic for converting strings to decimal and floating-point types, including special value handling. The Scala compatibility layer marks these casts as compatible rather than incompatible. Test coverage was expanded with special value test cases for float, double, and decimal casting operations across ANSI and non-ANSI modes.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.

Comment augment review to trigger a new review at any time.

```rust
        cast_string_to_timestamp(&array, to_type, eval_mode, &cast_options.timezone)
    }
    (Utf8, Date32) => cast_string_to_date(&array, to_type, eval_mode),
    (Utf8, Float16 | Float32 | Float64) => cast_string_to_float(&array, to_type, eval_mode),
```


This match arm includes Float16, but cast_string_to_float only handles Float32/Float64; attempting a STRING → FLOAT16 cast will hit the catch-all and return an internal error.


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! cast_string_to_float() (defined at line 1301) currently supports only Float32 and Float64, but not Float16. Trying to cast a string to Float16 will return an internal error.
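The fall-through behavior can be sketched in plain Rust (a mock enum and helper, not Comet's real types): the dispatch arm admits Float16, but the helper only implements the two wider floats, so Float16 lands in the catch-all error.

```rust
// Mock sketch of the bug pattern: the dispatch arm accepts Float16, but the
// helper only implements Float32/Float64, so Float16 falls through to an
// internal error at runtime. DataType here is a hypothetical stand-in.
#[derive(Debug)]
enum DataType {
    Float16,
    Float32,
    Float64,
}

fn cast_string_to_float(to: &DataType) -> Result<&'static str, String> {
    match to {
        DataType::Float32 => Ok("f32 path"),
        DataType::Float64 => Ok("f64 path"),
        // Anything else (including Float16) ends up here.
        other => Err(format!("internal error: unsupported {:?}", other)),
    }
}

fn main() {
    assert!(cast_string_to_float(&DataType::Float32).is_ok());
    // The arm `(Utf8, Float16 | Float32 | Float64)` would route Float16 here:
    assert!(cast_string_to_float(&DataType::Float16).is_err());
    println!("ok");
}
```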

```diff
     ),
-    DataType::Utf8 if allow_incompat => matches!(
+    DataType::Utf8 if allow_incompat => {
+        matches!(to_type, DataType::Binary | DataType::Decimal128(_, _))
```


In is_datafusion_spark_compatible, the Utf8 if allow_incompat branch excludes Float32/Float64 and Decimal256, so enabling allow_incompat actually removes float compatibility and omits decimal256—likely unintended given the new support.


Owner Author


value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct! Both the old and the new version of the code do not support Float32/64 and Decimal256 in any of the match arms, with or without allow_incompat. Prevents error results at runtime when any of these types are used.

```scala
      testAnsi = false)
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
```


In the "cast StringType to DoubleType" test, the cast uses DataTypes.FloatType instead of DataTypes.DoubleType, so it doesn't exercise double casting.


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca112c4 and c565a6d.

📒 Files selected for processing (4)
  • docs/source/user-guide/latest/compatibility.md (1 hunks)
  • native/spark-expr/src/conversion_funcs/cast.rs (7 hunks)
  • spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1 hunks)
  • spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-04T14:26:48.750Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
📚 Learning: 2025-11-11T20:44:05.014Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:05.014Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Applied to files:

  • native/spark-expr/src/conversion_funcs/cast.rs
🧬 Code graph analysis (1)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)
spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala (1)
  • Compatible (31-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (8)
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala (1)

187-190: LGTM - String to float/double and decimal casts are now marked as compatible.

The change from Incompatible to Compatible() aligns with the native Rust implementation that now provides Spark-compatible string parsing for these numeric types.

docs/source/user-guide/latest/compatibility.md (1)

162-164: Documentation update correctly reflects the new compatible casts.

The addition of string → float, string → double, and string → decimal to the Compatible Casts table is consistent with the implementation changes.

native/spark-expr/src/conversion_funcs/cast.rs (4)

214-216: LGTM - Extended can_cast_from_string to include new types.

The function now correctly reports Float32, Float64, Decimal128, and Decimal256 as valid target types for string casts.


1358-1408: LGTM - FloatParse trait implementations handle special values correctly.

The implementations properly handle infinity variants, NaN, and suffix stripping for Spark compatibility.


1537-1543: Clarify the conditional logic for Utf8 compatibility with Decimal128.

The can_cast_from_string function (line 215) returns true for Decimal128 unconditionally, while is_datafusion_spark_compatible (lines 1537-1543) requires allow_incompat to be true for Utf8→Decimal128 casts. This difference is inconsistent with how other compatibility-dependent casts (like Date32/Date64) are handled, where can_cast_from_string itself checks options.allow_incompat (lines 217-221). Although Utf8→Decimal128 casts have explicit custom handling in spark_cast (lines 965-970) that bypasses the compatibility check, the divergent conditional logic between the two functions should be reconciled or documented to prevent confusion during maintenance.


1194-1197: Inconsistent suffix handling between decimal and float parsing is correct and intentional.

The code correctly implements Spark's behavior: float parsing (FloatParse) explicitly strips 'd'/'f' suffixes before parsing (matching Java's Double.parseDouble()), while decimal parsing does NOT strip suffixes and lets them fail naturally (matching Spark's JavaBigDecimal(string) which rejects suffixes per Java's BigDecimal spec). This inconsistency mirrors the underlying Java APIs and is the correct behavior. No changes needed.

spark/src/test/scala/org/apache/comet/CometCastSuite.scala (2)

655-679: Good test coverage for special string values.

The specialValues sequence provides comprehensive coverage for edge cases including suffixed values, infinity/NaN variants, scientific notation, whitespace, and invalid inputs.


700-733: LGTM - Decimal casting tests cover basic and high precision scenarios.

The tests appropriately cover:

  • Basic decimal values with positive/negative numbers
  • Precision edge cases like "0.001" and high precision values
  • Invalid inputs (inf, empty string, non-numeric)
  • The testAnsi = false parameter is appropriate since ANSI mode has different error handling

Comment on lines +1080 to +1083

```rust
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
```


⚠️ Potential issue | 🟠 Major

LargeUtf8 input will cause a runtime error for decimal casts.

The match pattern at lines 965-970 accepts both Utf8 and LargeUtf8, but downcast_ref::<StringArray>() only works for Utf8. When the input is LargeUtf8, this will return None and cause an error.

```diff
-    let string_array = array
-        .as_any()
-        .downcast_ref::<StringArray>()
-        .ok_or_else(|| SparkError::Internal("Expected string array".to_string()))?;
+    // Handle both Utf8 (StringArray) and LargeUtf8 (LargeStringArray)
+    fn get_string_value<O: OffsetSizeTrait>(array: &ArrayRef, i: usize) -> Option<&str> {
+        array.as_any()
+            .downcast_ref::<GenericStringArray<O>>()
+            .map(|arr| if arr.is_null(i) { None } else { Some(arr.value(i)) })
+            .flatten()
+    }
```

Consider using GenericStringArray<O> with a generic offset type, similar to the pattern used in cast_string_to_int.
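The suggested pattern can be sketched without arrow at all (mock types, not arrow's real API): one function body parameterized over the offset width serves both the 32-bit (Utf8) and 64-bit (LargeUtf8) layouts, which is exactly why the single StringArray downcast is the odd one out.

```rust
// Pure-Rust mock of the "generic over offset width" idea. MockStringArray
// and OffsetSize are hypothetical stand-ins for arrow's GenericStringArray
// and OffsetSizeTrait.
struct MockStringArray<O> {
    offsets: Vec<O>, // n + 1 offsets into `data`
    data: String,
}

trait OffsetSize: Copy {
    fn as_usize(self) -> usize;
}
impl OffsetSize for i32 {
    fn as_usize(self) -> usize { self as usize }
}
impl OffsetSize for i64 {
    fn as_usize(self) -> usize { self as usize }
}

// The same body works unchanged for i32 and i64 offsets.
fn value<O: OffsetSize>(arr: &MockStringArray<O>, i: usize) -> &str {
    &arr.data[arr.offsets[i].as_usize()..arr.offsets[i + 1].as_usize()]
}

fn main() {
    let small = MockStringArray { offsets: vec![0i32, 3, 8], data: "foohello".into() };
    let large = MockStringArray { offsets: vec![0i64, 3, 8], data: "foohello".into() };
    assert_eq!(value(&small, 1), "hello");
    assert_eq!(value(&large, 1), "hello");
    println!("ok");
}
```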

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
native/spark-expr/src/conversion_funcs/cast.rs around lines 1080-1083: the code
unconditionally downcasts to StringArray which fails for LargeUtf8 inputs;
update the logic to handle both Utf8 and LargeUtf8 by using GenericStringArray
with a generic offset type (or by matching the DataType and downcasting to the
appropriate StringArray/LargeStringArray variant as done in cast_string_to_int).
Replace the single downcast_ref::<StringArray>() with a branch that constructs a
GenericStringArray<O> (or performs the correct downcast for each offset type),
then use that generic array for the decimal cast logic so LargeUtf8 inputs are
handled without runtime errors.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! LargeUtf8 cannot be downcast to StringArray, so the new code will need to use the generic type GenericStringArray instead. Prevents internal error at runtime when LargeUtf8 is passed.

Comment on lines +1141 to +1145
match parse_string_to_decimal(str_value, precision, scale) {
Ok(Some(decimal_value)) => {
// Convert i128 to i256
let i256_value = i256::from_i128(decimal_value);
decimal_builder.append_value(i256_value);
Copy link
Copy Markdown


⚠️ Potential issue | 🟠 Major

Decimal256 precision is limited to i128 range.

The implementation parses the string to i128 first, then converts to i256. This limits Decimal256 values to the i128 range (~38 decimal digits), which may not cover the full Decimal256 precision (up to 76 decimal digits).

For full Decimal256 support, consider parsing directly to i256 or using a big integer library. If this limitation is intentional, add a comment documenting it.
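The gap is easy to see numerically: i128 tops out at 39 decimal digits, well short of Decimal256's 76-digit precision. A quick stdlib-only check (not Comet code):

```rust
// i128 can hold at most 39 decimal digits, while Decimal256 allows up to 76,
// so parsing through i128 caps the usable precision of a Decimal256 cast.
fn main() {
    let max = i128::MAX; // 170141183460469231731687303715884105727
    assert_eq!(max.to_string().len(), 39);
    // A 40-digit mantissa already overflows i128:
    assert!("9999999999999999999999999999999999999999"
        .parse::<i128>()
        .is_err());
    println!("i128 max has {} digits", max.to_string().len());
}
```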

Owner Author


value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! This is caused by a limitation of parse_string_to_decimal() function. Currently it supports parsing up to the biggest integer type in the standard library - i128. i256 is a custom type provided by the Arrow-rs library and support for it would be nice to have.

Comment on lines +1246 to +1299

```rust
/// Parse a decimal string into (mantissa, scale)
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    if s.is_empty() {
        return Err("Empty string".to_string());
    }

    let negative = s.starts_with('-');
    let s = if negative || s.starts_with('+') {
        &s[1..]
    } else {
        s
    };

    // Split by decimal point
    let parts: Vec<&str> = s.split('.').collect();

    if parts.len() > 2 {
        return Err("Multiple decimal points".to_string());
    }

    let integral_part = parts[0];
    let fractional_part = if parts.len() == 2 { parts[1] } else { "" };

    // Parse integral part
    let integral_value: i128 = if integral_part.is_empty() {
        0
    } else {
        integral_part
            .parse()
            .map_err(|_| "Invalid integral part".to_string())?
    };

    // Parse fractional part
    let scale = fractional_part.len() as i32;
    let fractional_value: i128 = if fractional_part.is_empty() {
        0
    } else {
        fractional_part
            .parse()
            .map_err(|_| "Invalid fractional part".to_string())?
    };

    // Combine: value = integral * 10^scale + fractional
    let mantissa = integral_value
        .checked_mul(10_i128.pow(scale as u32))
        .and_then(|v| v.checked_add(fractional_value))
        .ok_or("Overflow in mantissa calculation")?;

    let final_mantissa = if negative { -mantissa } else { mantissa };

    Ok((final_mantissa, scale))
}
```


⚠️ Potential issue | 🟡 Minor

Missing scientific notation support in decimal string parsing.

The parse_decimal_str function does not handle scientific notation (e.g., "1.23e4", "1E-5"). The float parsing (FloatParse) handles such values, but they will fail when casting strings like "1.23e4" to decimal.

Spark's BigDecimal constructor supports scientific notation. Consider adding support or documenting this as a known limitation:

```diff
+    // Check for scientific notation - not yet supported
+    if s.contains('e') || s.contains('E') {
+        return Err("Scientific notation not supported".to_string());
+    }
+
     // Split by decimal point
     let parts: Vec<&str> = s.split('.').collect();
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```rust
/// Parse a decimal string into (mantissa, scale)
/// e.g., "123.45" -> (12345, 2), "-0.001" -> (-1, 3)
fn parse_decimal_str(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    if s.is_empty() {
        return Err("Empty string".to_string());
    }

    let negative = s.starts_with('-');
    let s = if negative || s.starts_with('+') {
        &s[1..]
    } else {
        s
    };

    // Check for scientific notation - not yet supported
    if s.contains('e') || s.contains('E') {
        return Err("Scientific notation not supported".to_string());
    }

    // Split by decimal point
    let parts: Vec<&str> = s.split('.').collect();

    if parts.len() > 2 {
        return Err("Multiple decimal points".to_string());
    }

    let integral_part = parts[0];
    let fractional_part = if parts.len() == 2 { parts[1] } else { "" };

    // Parse integral part
    let integral_value: i128 = if integral_part.is_empty() {
        0
    } else {
        integral_part
            .parse()
            .map_err(|_| "Invalid integral part".to_string())?
    };

    // Parse fractional part
    let scale = fractional_part.len() as i32;
    let fractional_value: i128 = if fractional_part.is_empty() {
        0
    } else {
        fractional_part
            .parse()
            .map_err(|_| "Invalid fractional part".to_string())?
    };

    // Combine: value = integral * 10^scale + fractional
    let mantissa = integral_value
        .checked_mul(10_i128.pow(scale as u32))
        .and_then(|v| v.checked_add(fractional_value))
        .ok_or("Overflow in mantissa calculation")?;

    let final_mantissa = if negative { -mantissa } else { mantissa };

    Ok((final_mantissa, scale))
}
```
🤖 Prompt for AI Agents
In native/spark-expr/src/conversion_funcs/cast.rs around lines 1246 to 1299,
parse_decimal_str currently only handles fixed-point strings and rejects
scientific notation (e.g. "1.23e4", "1E-5"); update the parser to detect an
optional exponent part (split on 'e' or 'E'), validate and parse the exponent as
a signed integer, then parse the mantissa portion with the existing logic to
produce (mantissa, scale) and adjust the scale by subtracting the exponent
(scale = scale - exponent) so the numeric value remains mantissa * 10^-scale;
ensure you apply checked arithmetic when shifting the mantissa (multiplying or
dividing by powers of ten) and surface clear errors for invalid exponent syntax
or overflow; keep handling of leading sign as before and preserve tests for
corner cases (multiple dots, empty parts).
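The exponent-handling approach described above can be sketched as a standalone stdlib-only function (a hypothetical extension, not the PR's code): split on 'e'/'E', parse the exponent, and fold it into the scale so the value stays mantissa * 10^-scale.

```rust
// Hypothetical exponent-aware decimal parser; a sketch of the suggested
// extension, not Comet's parse_decimal_str. Sign handling is simplified
// (a leading '+'/'-' run is stripped wholesale).
fn parse_decimal_with_exp(s: &str) -> Result<(i128, i32), String> {
    let s = s.trim();
    // Optional exponent: split on 'e' or 'E'.
    let (mantissa_str, exp) = match s.find(['e', 'E']) {
        Some(i) => {
            let e: i32 = s[i + 1..]
                .parse()
                .map_err(|_| "Invalid exponent".to_string())?;
            (&s[..i], e)
        }
        None => (s, 0),
    };
    let neg = mantissa_str.starts_with('-');
    let body = mantissa_str.trim_start_matches(['+', '-']);
    let (int_part, frac_part) = match body.split_once('.') {
        Some((i, f)) => (i, f),
        None => (body, ""),
    };
    let digits = format!("{int_part}{frac_part}");
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return Err("Invalid number".to_string());
    }
    let unscaled: i128 = digits.parse().map_err(|_| "Overflow".to_string())?;
    // value = unscaled * 10^-(frac digits) * 10^exp  =>  scale = frac - exp
    let scale = frac_part.len() as i32 - exp;
    Ok((if neg { -unscaled } else { unscaled }, scale))
}

fn main() {
    assert_eq!(parse_decimal_with_exp("123.45"), Ok((12345, 2)));
    assert_eq!(parse_decimal_with_exp("1.23e4"), Ok((123, -2))); // 12300
    assert_eq!(parse_decimal_with_exp("1E-5"), Ok((1, 5)));
    assert!(parse_decimal_with_exp("1.2e").is_err());
    println!("ok");
}
```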

Owner Author


value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! parse_decimal_str() does not support scientific notation. This should either be improved or documented as not supported.

Comment on lines +688 to 692
test("cast StringType to DoubleType") {
Seq(true, false).foreach { v =>
castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
}
}
Copy link
Copy Markdown


⚠️ Potential issue | 🔴 Critical

Bug: Test casts to FloatType instead of DoubleType.

The test is named "cast StringType to DoubleType" but line 690 casts to FloatType instead of DoubleType. This appears to be a copy-paste error.

```diff
   test("cast StringType to DoubleType") {
     Seq(true, false).foreach { v =>
-      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
+      castTest(specialValues.toDF("a"), DataTypes.DoubleType, testAnsi = v)
     }
   }
```
Suggested change

```scala
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.DoubleType, testAnsi = v)
    }
  }
```
🤖 Prompt for AI Agents
In spark/src/test/scala/org/apache/comet/CometCastSuite.scala around lines 688
to 692, the test named "cast StringType to DoubleType" mistakenly calls castTest
with DataTypes.FloatType; change the type argument to DataTypes.DoubleType so
the test actually casts to DoubleType, leaving the rest of the test logic
unchanged.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.

@claude

claude bot commented Dec 2, 2025

PR Review: Support string to non-int numeric types casting

Summary

This PR implements string-to-float, string-to-double, and string-to-decimal casting operations that were previously marked as incompatible. The implementation properly handles special values (inf, NaN), suffixes ('d', 'f'), and ANSI mode. Overall, this is a solid improvement that removes significant compatibility gaps.


Code Quality & Best Practices ✅

Strengths:

  • Well-structured code with clear separation of concerns (decimal vs float handling)
  • Good use of Rust traits (FloatParse) for type-generic implementations
  • Proper error handling with distinction between ANSI and non-ANSI modes
  • Comprehensive documentation via inline comments

Minor Issues:

  1. Bug in test at line 690 🐛

```scala
test("cast StringType to DoubleType") {
  Seq(true, false).foreach { v =>
    castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)  // Should be DoubleType!
  }
}
```

    This test is casting to FloatType instead of DoubleType, so it's not actually testing double conversion.

  2. Missing function: is_validate_decimal_precision at cast.rs:1229

    The code calls is_validate_decimal_precision(value, precision) but this function doesn't appear to be defined. Looking at similar code in the codebase (e.g., make_decimal.rs:63), you should use Arrow's built-in:

```rust
use arrow::datatypes::validate_decimal_precision;

// Then at line 1229:
if validate_decimal_precision(value, precision).is_ok() {
```

    This will cause a compilation error. Please verify the build passes.

  3. Decimal parsing doesn't handle 'D'/'F' suffixes

    In parse_string_to_decimal (line 1195-1196), the comment says:

    "We do NOT strip 'D' or 'F' suffixes - let parsing fail naturally"

    However, the test at line 707 includes "123.45D" which suggests these suffixes should be supported. Spark's Java BigDecimal constructor doesn't accept these suffixes, so the string parsing will fail. This is inconsistent with the float implementation (lines 1375-1376, 1401-1402) which explicitly strips these suffixes.

    Recommendation: Strip 'D' and 'F' suffixes in parse_decimal_str similar to the float implementation for consistency with Spark behavior.


Potential Bugs 🐛

  1. Compilation error: Missing is_validate_decimal_precision function (see above)
  2. Test bug: DoubleType test casts to FloatType
  3. Inconsistent suffix handling: Decimal doesn't strip D/F but float does

Performance Considerations ⚡

Good:

  • Uses builders efficiently with pre-allocated capacity
  • Avoids unnecessary string allocations
  • Checked arithmetic prevents panics

Minor optimization opportunity:

  • Lines 1182, 1360, 1386: to_lowercase() allocates a new string. For better performance with hot paths, consider using eq_ignore_ascii_case() for simple comparisons:
    `if s.eq_ignore_ascii_case("inf") || s.eq_ignore_ascii_case("+inf") ...`
    This is a micro-optimization and not critical.
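The difference between the two approaches can be shown with a small stdlib-only comparison (illustrative helpers, not Comet code): both give the same answer, but the first allocates a fresh String on every call.

```rust
// eq_ignore_ascii_case avoids the String allocation that to_lowercase()
// makes on every call; both helpers are hypothetical illustrations.
fn is_inf_alloc(s: &str) -> bool {
    let l = s.to_lowercase(); // allocates a new String each time
    l == "inf" || l == "+inf"
}

fn is_inf_noalloc(s: &str) -> bool {
    // Compares in place, no allocation.
    s.eq_ignore_ascii_case("inf") || s.eq_ignore_ascii_case("+inf")
}

fn main() {
    for s in ["INF", "+Inf", "inf", "1.5", "nan"] {
        assert_eq!(is_inf_alloc(s), is_inf_noalloc(s));
    }
    println!("ok");
}
```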

Security Concerns 🔒

No major security issues identified.

The implementation properly handles:

  • Integer overflow (checked arithmetic)
  • Buffer overflows (Arrow's safe abstractions)
  • Invalid inputs (returns errors or nulls appropriately)

Test Coverage ✅

Strong test coverage:

  • Comprehensive special values (inf, NaN, signed variants)
  • Suffix handling ('d', 'f')
  • Scientific notation
  • Edge cases (empty strings, whitespace, invalid inputs)
  • Both ANSI and non-ANSI modes
  • High precision decimals (38,10)

Missing test cases:

  • Very large numbers that exceed precision bounds
  • Multiple decimal points (e.g., "1.2.3")
  • Leading zeros (e.g., "000123.45")
  • Decimal numbers with scientific notation (if supported)
  • The commented-out ignore test at line 694 should potentially be re-enabled or removed

Documentation ✅

Documentation changes look good:

  • Properly moves cast operations from "incompatible" to "compatible" table
  • Removes outdated compatibility notes

Action Items

Must Fix (Blocking):

  1. Fix missing is_validate_decimal_precision function - use validate_decimal_precision from arrow::datatypes
  2. Fix test bug: Line 690 should cast to DataTypes.DoubleType not FloatType

Should Fix (Non-blocking but important):
3. Consider stripping 'D'/'F' suffixes in decimal parsing for consistency with Spark and float behavior
4. Verify the implementation actually works with test value "123.45D" at line 707

Nice to Have:
5. Add more edge case tests mentioned above
6. Consider micro-optimization with eq_ignore_ascii_case() if profiling shows string comparison as a hotspot


Recommendation

REQUEST CHANGES - The missing function will cause compilation failure. Please fix the blocking issues and I'd be happy to approve! The overall approach is solid and this is a valuable feature addition.

Great work on tackling these compatibility gaps! 🚀

```scala
      testAnsi = false)
  test("cast StringType to DoubleType") {
    Seq(true, false).foreach { v =>
      castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)
```


Bug: Test for DoubleType incorrectly uses FloatType

The test named "cast StringType to DoubleType" actually tests DataTypes.FloatType instead of DataTypes.DoubleType. This appears to be a copy-paste error from the FloatType test immediately above it. As a result, DoubleType casting is not being tested at all, and the FloatType test is effectively run twice.


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.

```rust
    }
    (Utf8 | LargeUtf8, Decimal256(precision, scale)) => {
        cast_string_to_decimal(&array, to_type, precision, scale, eval_mode)
    }
```


Bug: LargeUtf8 decimal conversion will fail at runtime

The match patterns include LargeUtf8 for decimal conversions, but the implementation functions cast_string_to_decimal128_impl and cast_string_to_decimal256_impl only downcast to StringArray. When a LargeUtf8 array (which is LargeStringArray) is passed, the downcast returns None and the function returns an internal error "Expected string array". This differs from the pattern used elsewhere (e.g., boolean conversion) which has separate arms for Utf8 and LargeUtf8.

Additional Locations (2)


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! LargeUtf8 cannot be downcast to StringArray, so the new code will need to use the generic type GenericStringArray instead. Prevents internal error at runtime when LargeUtf8 is passed.

@martin-augment
Owner Author

  1. Bug in test at line 690 🐛
```scala
test("cast StringType to DoubleType") {
  Seq(true, false).foreach { v =>
    castTest(specialValues.toDF("a"), DataTypes.FloatType, testAnsi = v)  // Should be DoubleType!
  }
}
```
    This test is casting to FloatType instead of DoubleType, so it's not actually testing double conversion.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! It looks like this is a copy/paste error. The test should use DoubleType as the test name suggests. Prevents regressions related to the support for DoubleType because it is not tested at the moment.

@martin-augment
Owner Author

  • Missing function: is_validate_decimal_precision at cast.rs:1229
    The code calls is_validate_decimal_precision(value, precision) but this function doesn't appear to be defined. Looking at similar code in the codebase (e.g., make_decimal.rs:63), you should use Arrow's built-in:
```rust
use arrow::datatypes::validate_decimal_precision;

// Then at line 1229:
if validate_decimal_precision(value, precision).is_ok() {
```
    This will cause a compilation error. Please verify the build passes.

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! is_validate_decimal_precision() is imported at line 35 (use arrow::datatypes::is_validate_decimal_precision).
https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/arrow/src/datatypes/mod.rs#L28 re-exports it from arrow-data::decimal (https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/arrow-data/src/decimal.rs#L1174)

@github-actions

github-actions bot commented Mar 3, 2026

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 3, 2026