
18662: Refactor Spark crc32/sha1 signatures#20

Open
martin-augment wants to merge 1 commit into main from pr-18662-2025-11-13-06-58-29
Conversation

@martin-augment
Owner

18662: To review by AI

@coderabbitai

coderabbitai bot commented Nov 13, 2025

Walkthrough

This change modifies the CRC32 and SHA1 hash functions in DataFusion's Spark implementation. The signature definitions are switched from user_defined to coercible, with implicit coercion of inputs to the native binary type. The coerce_types method implementations are removed from both functions, relying on the new coercible signatures instead. The functions now accept additional input types, including FixedSizeBinary and Null, with explicit null handling. Argument extraction is simplified using the take_function_args helper. Error reporting for unsupported types is improved. Test files are reorganized to consolidate type-casting test cases for both functions.


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c27d3f3 and b8d1938.

📒 Files selected for processing (4)
  • datafusion/spark/src/function/hash/crc32.rs (4 hunks)
  • datafusion/spark/src/function/hash/sha1.rs (4 hunks)
  • datafusion/sqllogictest/test_files/spark/hash/crc32.slt (1 hunks)
  • datafusion/sqllogictest/test_files/spark/hash/sha1.slt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/spark/src/function/hash/sha1.rs
🧬 Code graph analysis (2)
datafusion/spark/src/function/hash/crc32.rs (4)
datafusion/common/src/cast.rs (3)
  • as_binary_view_array (302-304)
  • as_fixed_size_binary_array (317-319)
  • as_large_binary_array (307-309)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
datafusion/expr-common/src/signature.rs (2)
  • coercible (1147-1153)
  • new_implicit (922-934)
datafusion/spark/src/function/hash/sha1.rs (1)
  • new (52-64)
datafusion/spark/src/function/hash/sha1.rs (4)
datafusion/common/src/cast.rs (3)
  • as_binary_view_array (302-304)
  • as_fixed_size_binary_array (317-319)
  • as_large_binary_array (307-309)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
datafusion/expr-common/src/signature.rs (2)
  • coercible (1147-1153)
  • new_implicit (922-934)
datafusion/spark/src/function/hash/crc32.rs (1)
  • new (50-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (9)
datafusion/spark/src/function/hash/crc32.rs (6)

24-34: LGTM! Import statements correctly updated for the refactoring.

All newly imported utilities are used appropriately in the implementation: as_fixed_size_binary_array for FixedSizeBinary support, take_function_args for argument extraction, and signature-related types for the coercible signature construction.


52-59: Excellent refactoring to coercible signature!

The signature change enables implicit string-to-binary coercion, which is more idiomatic than manually implementing coerce_types. The configuration correctly specifies Binary as the desired type with implicit coercion from string types, aligning with Spark's hash function behavior.
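A minimal sketch of what such a coercible signature construction can look like, based on the `coercible` and `new_implicit` helpers referenced above in `datafusion/expr-common/src/signature.rs`; the specific argument values below are illustrative assumptions, not copied from this PR:

```rust
// Sketch only (not the PR's exact code): ask for one binary argument and
// allow implicit coercion from string types, defaulting to NativeType::Binary.
Signature::coercible(
    vec![Coercion::new_implicit(
        TypeSignatureClass::Binary,                         // desired type
        vec![TypeSignatureClass::Native(logical_string())], // allowed source types
        NativeType::Binary,                                 // default cast target
    )],
    Volatility::Immutable,
)
```

With a signature like this, the planner inserts the string-to-binary cast before the function is invoked, which is why the manual coerce_types implementation becomes unnecessary.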


100-100: Good use of the take_function_args helper.

This simplifies argument extraction and provides clear, consistent error messages when the wrong number of arguments is provided.
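The helper's contract can be illustrated with a small self-contained stand-in (a hypothetical `take_n_args`, not DataFusion's actual implementation): convert the argument list into a fixed-size array, or report the actual arity in the error.

```rust
// Hypothetical stand-in for fixed-arity argument extraction, similar in
// spirit to DataFusion's `take_function_args`: convert a Vec into a
// fixed-size array, or report how many arguments were actually supplied.
fn take_n_args<T, const N: usize>(name: &str, args: Vec<T>) -> Result<[T; N], String> {
    let got = args.len();
    args.try_into()
        .map_err(|_| format!("{} expects {} argument(s), got {}", name, N, got))
}

fn main() {
    // Destructuring the returned array gives direct access to each argument.
    let [input] = take_n_args::<i32, 1>("crc32", vec![42]).unwrap();
    assert_eq!(input, 42);
    // Wrong arity surfaces as a clear error instead of a panic.
    assert!(take_n_args::<i32, 1>("crc32", vec![1, 2]).is_err());
}
```

Destructuring the returned array (`let [input] = …`) is what makes call sites short and keeps arity errors consistent across functions.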


103-103: LGTM! Correct NULL handling.

Returns a null Int64 array of appropriate length, which correctly implements SQL NULL propagation semantics for the hash function.
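The NULL-propagation semantics can be shown with a self-contained sketch: a bitwise CRC-32 using the ISO-HDLC polynomial (the variant Spark's `crc32` computes), with `Option` standing in for a nullable Arrow array.

```rust
// Self-contained sketch: CRC-32 (polynomial 0xEDB88320, ISO-HDLC) computed
// bitwise, with SQL NULL propagation modeled via Option.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0xEDB8_8320
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

// A NULL input row yields a NULL output row at the same position.
fn crc32_column(input: &[Option<&[u8]>]) -> Vec<Option<i64>> {
    input
        .iter()
        .map(|row| row.map(|bytes| crc32(bytes) as i64))
        .collect()
}

fn main() {
    // Standard CRC-32 check value for "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    assert_eq!(crc32(b""), 0);
    let out = crc32_column(&[Some(b"123456789".as_slice()), None]);
    assert_eq!(out, vec![Some(0xCBF4_3926_i64), None]);
}
```

The production code returns a null Int64 array for a Null-typed column up front rather than mapping row by row, but the observable semantics are the same: NULL in, NULL out.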


116-119: LGTM! FixedSizeBinary support correctly implemented.

The implementation correctly handles FixedSizeBinary arrays by downcasting and applying the same hash logic used for other binary types.


120-122: Good improvement to error message specificity.

Including the unsupported data type in the error message significantly improves debuggability.

datafusion/sqllogictest/test_files/spark/hash/crc32.slt (1)

43-82: Excellent test coverage for new type support!

The reorganized test structure with comprehensive type coverage (NULL, Utf8, Binary, FixedSizeBinary, LargeBinary, BinaryView) correctly validates the expanded input type handling and coercion model.

datafusion/sqllogictest/test_files/spark/hash/sha1.slt (1)

34-82: LGTM! Comprehensive test coverage matching crc32 test structure.

The tests correctly validate the 'sha' alias, NULL handling, and expanded type support including FixedSizeBinary. The consistent test structure across both hash functions improves maintainability.

datafusion/spark/src/function/hash/sha1.rs (1)

24-135: Excellent refactoring matching the crc32 implementation!

The changes mirror the crc32 refactoring precisely, ensuring consistency across both hash functions. The signature switch to coercible, use of take_function_args, NULL handling (correctly returning StringArray::new_null), FixedSizeBinary support, and improved error messages are all correctly implemented.



@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

@claude

claude bot commented Nov 13, 2025

PR Review: Refactor Spark crc32/sha1 signatures

Summary

This PR refactors the crc32 and sha1 Spark functions to use the new Signature::coercible API with Coercion::new_implicit, replacing the manual coerce_types implementation. Overall, this is a positive modernization that aligns with the codebase patterns.


Code Quality & Best Practices

Strengths:

  • Follows established patterns: The refactoring aligns with how similar functions like md5, sha256, etc. handle type coercion (see datafusion/functions/src/crypto/md5.rs)
  • Reduces boilerplate: Removes ~20 lines of manual type coercion logic per function by leveraging the framework
  • Consistent error handling: Uses take_function_args helper for argument validation
  • Good test coverage: Test files comprehensively cover all supported data types (Utf8, Utf8View, LargeUtf8, Binary, BinaryView, LargeBinary, FixedSizeBinary, Null)

Concerns:

  • Potential signature mismatch: The new signature uses TypeSignatureClass::Binary as the desired_type, while similar crypto functions in datafusion/functions/src/crypto/md5.rs:63-72 use TypeSignatureClass::Native(logical_binary()). The difference is:
    • TypeSignatureClass::Binary includes FixedSizeBinary
    • TypeSignatureClass::Native(logical_binary()) is more specific

Since the implementation explicitly handles FixedSizeBinary (crc32.rs:116-119, sha1.rs:127-130), this might be intentional, but it differs from the standard crypto function pattern.


Potential Issues

  1. Inconsistent type coercion semantics: The old implementation coerced:

    • Utf8/Utf8View to Binary
    • LargeUtf8 to LargeBinary

    The new implementation appears to coerce everything to Binary (based on NativeType::Binary). This could be a breaking change if downstream code expects LargeUtf8 to remain as LargeBinary.

  2. Removed clippy suppression in sha1.rs:98: The #[allow(clippy::unwrap_used)] was removed. While the unwrap() is still safe (writing to a String never fails), this might trigger clippy warnings in CI if the project has unwrap_used as a deny lint.
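Why that unwrap() is safe can be seen in isolation: `String`'s `fmt::Write` implementation never returns `Err`, so formatting a digest into a `String` cannot fail. A hypothetical `to_hex` helper (not the PR's code) demonstrates the pattern:

```rust
use std::fmt::Write;

// Hypothetical helper showing why unwrap() on write! into a String is safe:
// String's fmt::Write impl never fails, so the Result is always Ok.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(out, "{:02x}", b).unwrap(); // infallible for String
    }
    out
}

fn main() {
    assert_eq!(to_hex(&[0xde, 0xad, 0xbe, 0xef]), "deadbeef");
}
```

Whether CI accepts the bare unwrap() still depends on the project's clippy configuration, which is the point being raised here.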


Performance Considerations

  • No performance regressions expected: The core hash computation logic remains unchanged
  • Type coercion efficiency: Moving coercion to the signature system allows the query planner to optimize type conversions earlier

Security Concerns

  • No security issues identified: Hash functions are cryptographically sound for their use case
  • Input validation: Proper null handling and array bounds checking via take_function_args

Test Coverage

Excellent coverage:

  • Basic functionality (string input)
  • Edge cases (empty string, NULL)
  • All supported data types (Utf8, Utf8View, LargeUtf8, Binary, BinaryView, LargeBinary, FixedSizeBinary)
  • Alias testing (sha for sha1)

Potential gap:

  • Consider adding tests for error cases (if any are expected)
  • Verify behavior with mixed nullability in batches

Recommendations

  1. Verify type coercion behavior: Confirm that coercing LargeUtf8 to Binary (instead of LargeBinary) is intended and does not break Spark compatibility

  2. Consider alignment with crypto functions: If the difference from md5 pattern is intentional, document why TypeSignatureClass::Binary is used instead of Native(logical_binary())

  3. Clippy lint check: Ensure the removal of #[allow(clippy::unwrap_used)] does not cause CI failures

  4. Documentation: Consider adding a code comment explaining why FixedSizeBinary is supported (is this Spark-compatible?)


Verdict

Approve with minor suggestions. This is a solid refactoring that improves code maintainability. The potential type coercion behavior change should be verified, but otherwise the implementation looks correct and well-tested.
