
18642: feat: support spark csc#13

Open
martin-augment wants to merge 1 commit into main from pr-18642-2025-11-12-09-05-38

Conversation

@martin-augment
Owner

18642: To review by AI

@coderabbitai

coderabbitai bot commented Nov 12, 2025

Walkthrough

The pull request adds a new cosecant (csc) trigonometric function to the DataFusion Spark UDF library. It introduces a trigonometry module in the math functions, implements the SparkCsc scalar function that computes 1.0 divided by sin(x), registers the function with type checking and arity validation for one argument, and includes test cases for scalar inputs, array inputs, and edge cases including infinity and null values.
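The core computation described above can be sketched in plain Rust (std-only; the PR's actual kernel operates on Arrow arrays and is not reproduced here):

```rust
// Minimal sketch of the cosecant semantics: csc(x) = 1.0 / sin(x).
// Division by an exact zero yields infinity rather than a panic, and
// NaN propagates through both sin and the division.
fn csc(x: f64) -> f64 {
    1.0 / x.sin()
}

fn main() {
    assert_eq!(csc(0.0), f64::INFINITY); // sin(0.0) is exactly 0.0, so 1/0 → +inf
    assert!((csc(std::f64::consts::FRAC_PI_2) - 1.0).abs() < 1e-12); // csc(π/2) = 1
    assert!(csc(f64::NAN).is_nan()); // NaN in, NaN out
}
```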


@claude

claude bot commented Nov 12, 2025

PR Review: feat: support spark csc

Summary

This PR adds support for the Spark csc (cosecant) function. The implementation follows DataFusion patterns and includes a new trigonometry module for math functions.

Code Quality ✅

Strengths:

  • Follows existing code patterns (similar to expm1.rs)
  • Proper Apache license header
  • Appropriate use of error utilities
  • Clean struct implementation with Default trait

Areas for Improvement:

1. Duplicate Argument Validation (Minor)

datafusion/spark/src/function/math/trigonometry.rs:98-105

The spark_csc function validates arguments again, but this is already handled by coerce_types. Consider removing the duplicate check in the main function:

fn spark_csc(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    // Remove this check - already validated by coerce_types
    if args.len() != 1 {
        return Err(invalid_arg_count_exec_err(
            CSC_FUNCTION_NAME,
            (1, 1),
            args.len(),
        ));
    }
    // ... rest of implementation
}

2. Inconsistent Pattern with Codebase

Looking at rint.rs, similar math functions use make_scalar_function from datafusion_functions::utils for cleaner code organization. Consider refactoring to match:

use datafusion_functions::utils::make_scalar_function;

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    make_scalar_function(spark_csc_impl, vec![])(&args.args)
}

pub fn spark_csc_impl(args: &[ArrayRef]) -> Result<ArrayRef> {
    // Implementation using ArrayRef directly
}

This would make the code more consistent with rint.rs:71 and other functions.
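The shape of that helper pattern, wrapping a simple array kernel with a shared adapter, can be illustrated with a std-only stand-in (toy slice/Vec types; not the real `datafusion_functions::utils::make_scalar_function` signature):

```rust
// Toy illustration of the wrapper pattern: a higher-order function turns a
// plain array kernel into a reusable closure, so the kernel stays focused on
// the math and the adapter logic lives in one place.
fn make_scalar_function<F>(kernel: F) -> impl Fn(&[f64]) -> Vec<f64>
where
    F: Fn(&[f64]) -> Vec<f64>,
{
    move |args| kernel(args)
}

// The csc kernel itself: element-wise 1/sin(x) over the input slice.
fn csc_kernel(xs: &[f64]) -> Vec<f64> {
    xs.iter().map(|x| 1.0 / x.sin()).collect()
}

fn main() {
    let f = make_scalar_function(csc_kernel);
    let out = f(&[0.0, std::f64::consts::FRAC_PI_2]);
    assert_eq!(out[0], f64::INFINITY);
    assert!((out[1] - 1.0).abs() < 1e-12);
}
```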

Potential Bugs 🐛

1. Division by Zero Not Handled (High Priority)

datafusion/spark/src/function/math/trigonometry.rs:108

When sin(x) = 0, the function returns f64::INFINITY, which matches Spark behavior. However, this should be documented and tested more thoroughly:

  • Test case at line 156 covers csc(0) = INFINITY
  • Missing test for the negative side: csc(-0.0) should be -INFINITY; note that in f64, sin(-π) is a tiny nonzero value, so csc(-π) is large and finite rather than -INFINITY
  • Consider adding edge-case tests near multiples of π, where csc grows very large but stays finite in f64
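The distinction matters because `std::f64::consts::PI` is only the nearest f64 to π, so `sin(PI)` never reaches zero; exact infinities come from signed-zero inputs. A std-only check:

```rust
fn main() {
    // sin(PI) is tiny but nonzero, so csc(PI) is huge but finite.
    let s = std::f64::consts::PI.sin();
    assert!(s > 0.0 && s < 1e-15);
    let csc_pi = 1.0 / s;
    assert!(csc_pi.is_finite() && csc_pi > 1e15);
    // Exact infinities require signed zeros: sin(±0.0) = ±0.0.
    assert_eq!(1.0 / (0.0f64).sin(), f64::INFINITY);
    assert_eq!(1.0 / (-0.0f64).sin(), f64::NEG_INFINITY);
}
```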

2. NaN Handling Not Tested

The implementation doesn't test what happens with NaN inputs. Spark would return NaN, and your implementation will too, but it should be explicitly tested:

#[test]
fn test_csc_nan() {
    test_trig_float64_invoke!(SparkCsc::new(), Some(f64::NAN), Ok(Some(f64::NAN)));
}

Performance Considerations ⚡

The implementation is efficient:

  • Uses Arrow's unary operation for vectorization ✅
  • No unnecessary allocations ✅
  • Direct computation without intermediate buffers ✅

Minor optimization opportunity: The pattern matching in spark_csc could be simplified by relying on type coercion always converting to Float64, eliminating the need for runtime type checking.
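The simplification can be modeled std-only (the `Scalar` enum below is a simplified stand-in for Arrow's `ScalarValue`, not the actual type): once coercion guarantees Float64, the evaluator needs one arm plus an internal-error fallback.

```rust
// Stand-in for ScalarValue; only two variants are needed for the sketch.
#[derive(Debug)]
enum Scalar {
    Float64(Option<f64>),
    Utf8(Option<String>),
}

// After coercion, only Float64 should ever arrive; anything else is an
// internal error rather than a panic or a silent wrong answer.
fn spark_csc(v: Scalar) -> Result<Option<f64>, String> {
    match v {
        Scalar::Float64(x) => Ok(x.map(|x| 1.0 / x.sin())),
        other => Err(format!("csc: unexpected type after coercion: {other:?}")),
    }
}

fn main() {
    assert_eq!(spark_csc(Scalar::Float64(Some(0.0))), Ok(Some(f64::INFINITY)));
    assert_eq!(spark_csc(Scalar::Float64(None)), Ok(None)); // null in → null out
    assert!(spark_csc(Scalar::Utf8(Some("x".into()))).is_err());
}
```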

Security Concerns 🔒

No security issues identified. The function:

  • Doesn't perform any unsafe operations ✅
  • Uses bounded arithmetic (division by zero → infinity, not panic) ✅
  • Properly handles null values ✅

Test Coverage 📊

Good coverage overall but missing some edge cases:

Existing Tests ✅

  • Basic scalar value (csc(0))
  • Array processing ([1, 0, -1])
  • Scalar computation (csc(π/2) = 1)
  • SQL integration tests in .slt file

Missing Test Cases ❌

  1. Negative zero: csc(-0.0) should return -INFINITY (in f64, sin(-π) is tiny but nonzero, so csc(-π) is large and finite, not -INFINITY)
  2. NaN handling: csc(NaN) should return NaN
  3. Near-multiples of π: csc(π) evaluates to a huge but finite value because f64 π is inexact; pin down this behavior rather than expecting INFINITY
  4. Large values: test with very large inputs to ensure numerical stability
  5. Mixed null arrays: [Some(1.0), None, Some(2.0)]

Recommended Additional Tests:

#[test]
fn test_csc_edge_cases() {
    test_trig_float64_invoke!(SparkCsc::new(), Some(f64::NAN), Ok(Some(f64::NAN)));
    // std::f64::consts::PI only approximates π, so sin(PI) is a tiny nonzero
    // value and csc(PI) is large but finite; exact infinities require a
    // signed-zero input, where sin(±0.0) = ±0.0.
    test_trig_float64_invoke!(SparkCsc::new(), Some(0.0), Ok(Some(f64::INFINITY)));
    test_trig_float64_invoke!(SparkCsc::new(), Some(-0.0), Ok(Some(f64::NEG_INFINITY)));
}

#[test]
fn test_csc_null_handling() {
    test_trig_float64_invoke!(SparkCsc::new(), None, Ok(None));
}

SQL Logic Test Quality 📝

The .slt file test is good but could be enhanced:

  • Tests basic functionality ✅
  • Tests null handling ✅
  • Tests special values (π/2) ✅

Consider adding a test for csc(π) to pin down its behavior in the SQL context; note that with f64 inputs it yields a very large finite value (sin(π) is tiny but nonzero), not infinity.

Documentation 📚

Missing: Link to Spark documentation in the code. The struct has it for other functions. Add:

/// Spark CSC function: <https://spark.apache.org/docs/latest/api/sql/index.html#csc>
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct SparkCsc {

Overall Assessment

This is a solid implementation that follows DataFusion conventions well. The main issues are:

  1. Must Fix: Add edge case tests (NaN, ±π, nulls)
  2. Should Fix: Remove duplicate argument validation
  3. Consider: Refactor to use make_scalar_function for consistency with rint.rs
  4. Nice to Have: Add documentation link

The code is production-ready with the additional test coverage. Great work on maintaining consistency with the existing codebase structure!

Recommendation: Approve with minor changes requested for test coverage.


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49fc0 and 2e2e988.

📒 Files selected for processing (3)
  • datafusion/spark/src/function/math/mod.rs (4 hunks)
  • datafusion/spark/src/function/math/trigonometry.rs (1 hunks)
  • datafusion/sqllogictest/test_files/spark/math/csc.slt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
datafusion/spark/src/function/math/trigonometry.rs (2)
datafusion/spark/src/function/error_utils.rs (2)
  • invalid_arg_count_exec_err (23-40)
  • unsupported_data_type_exec_err (42-48)
datafusion/expr-common/src/signature.rs (1)
  • user_defined (1091-1097)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)

Comment on lines +80 to +96
if arg_types.len() != 1 {
    return Err(invalid_arg_count_exec_err(
        CSC_FUNCTION_NAME,
        (1, 1),
        arg_types.len(),
    ));
}
if arg_types[0].is_numeric() {
    Ok(vec![DataType::Float64])
} else {
    Err(unsupported_data_type_exec_err(
        CSC_FUNCTION_NAME,
        "Numeric Type",
        &arg_types[0],
    ))
}


⚠️ Potential issue | 🟠 Major

Handle NULL inputs without throwing: with the current type coercion you reject DataType::Null, so a plain SELECT csc(NULL) surfaces an Unsupported Data Type error instead of propagating Spark's expected NULL result. That is a logic bug users will hit when optional columns bubble through this UDF. Please accept Null during coercion and short-circuit to a Float64(None) result in the evaluator.

@@
-        if arg_types[0].is_numeric() {
+        if matches!(arg_types[0], DataType::Null) {
+            Ok(vec![DataType::Float64])
+        } else if arg_types[0].is_numeric() {
             Ok(vec![DataType::Float64])
         } else {
             Err(unsupported_data_type_exec_err(
                 CSC_FUNCTION_NAME,
                 "Numeric Type",
                 &arg_types[0],
             ))
         }
@@
-        ColumnarValue::Scalar(ScalarValue::Float64(value)) => Ok(ColumnarValue::Scalar(
-            ScalarValue::Float64(value.map(|x| 1.0 / x.sin())),
-        )),
+        ColumnarValue::Scalar(ScalarValue::Float64(value)) => Ok(ColumnarValue::Scalar(
+            ScalarValue::Float64(value.map(|x| 1.0 / x.sin())),
+        )),
+        ColumnarValue::Scalar(ScalarValue::Null) => Ok(ColumnarValue::Scalar(
+            ScalarValue::Float64(None),
+        )),

Also applies to: 107-128

🤖 Prompt for AI Agents
datafusion/spark/src/function/math/trigonometry.rs lines ~80-96 and ~107-128:
the coercion currently rejects DataType::Null and returns an
unsupported-data-type error instead of propagating NULL; update the type
coercion to accept DataType::Null (treat it as a valid input type alongside
numeric types) and in the evaluator short-circuit when the input scalar is null
to return a Float64(None) result; ensure the signature still reports Float64 as
output type but that runtime checks detect null input and return NULL without
attempting numeric conversion or throwing an error.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that Null handling is missing. Tests for this case should also be added. This would make the function behave like Apache Spark.

@martin-augment
Owner Author

Missing Test Cases ❌

  1. Negative zero: csc(-0.0) should return -INFINITY (in f64, sin(-π) is tiny but nonzero, so csc(-π) is large and finite, not -INFINITY)
  2. NaN handling: csc(NaN) should return NaN
  3. Near-multiples of π: csc(π) evaluates to a huge but finite value because f64 π is inexact; pin down this behavior rather than expecting INFINITY
  4. Large values: Test with very large inputs to ensure numerical stability
  5. Mixed null arrays: [Some(1.0), None, Some(2.0)]

value:useful; category:bug; feedback:The Claude AI reviewer is correct that more unit tests could be added to cover edge cases with boundary values like NaN, INFINITY, and NEG_INFINITY.
