Skip to content

18664: Refactor crypto functions code#21

Open
martin-augment wants to merge 3 commits intomainfrom
pr-18664-2025-11-13-13-06-55
Open

18664: Refactor crypto functions code#21
martin-augment wants to merge 3 commits intomainfrom
pr-18664-2025-11-13-13-06-55

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

18664: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 13, 2025

Walkthrough

This pull request refactors the DataFusion cryptographic functions module. Individual SHA algorithm implementations (sha224, sha256, sha384, sha512) are consolidated into a unified SHAFunc structure with per-algorithm constructors. The digest function is redesigned to use static DigestAlgorithm type dispatch instead of dynamic method selection, with helper methods made private. The md5 function is simplified to use exact type coercions and always return UTF-8 View. Internal error handling is updated to use internal_err in certain cases. Additionally, documentation examples and a test case for dictionary-encoded values are updated with actual computed hash results.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18664-2025-11-13-13-06-55

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 13, 2025

Code Review - PR #21: Refactor crypto functions code

Summary

This PR successfully consolidates the SHA hash functions (sha224, sha256, sha384, sha512) from four separate files into a single parameterized implementation. This is a good refactoring that reduces code duplication significantly (612 lines deleted vs 312 added).


✅ Strengths

1. Excellent Code Deduplication

  • Consolidating four nearly identical SHA function implementations into a single SHAFunc struct is a solid architectural improvement
  • The use of factory methods (sha224(), sha256(), sha384(), sha512()) maintains a clean API

2. Improved Type Safety

  • Added PartialEq, Eq, Hash derives to DigestAlgorithm enum (line 84 in basic.rs) - good for pattern matching and HashMap usage
  • Signature changes from new_implicit to new_exact in md5.rs and sha.rs provide stricter type checking

3. Better Error Handling

  • Changed from exec_err! to internal_err! in md5.rs:155 - more appropriate for "impossible" error conditions
  • Changed from internal_err! to not_impl_err! in digest.rs:129 - semantically correct for unimplemented features

4. Code Organization

  • Moving digest() function from basic.rs to digest.rs (lines 118-133) improves separation of concerns
  • Better visibility controls with methods like digest_scalar, digest_utf8_array_impl, and digest_binary_array_impl now marked as fn instead of pub fn

5. Documentation Improvements

  • SQL examples in user docs now show actual hash output instead of placeholders like <binary_hash_result>
  • More accurate descriptions

🔍 Issues & Concerns

1. Critical: Incorrect Return Type in Documentation
The SHA functions now return DataType::Binary (sha.rs:155) but the documentation structs describe them as returning a "binary string" which is misleading. The actual return type is raw binary data, not a hex-encoded string. Consider updating the doc strings:

// Current (sha.rs:36)
description = "Computes the SHA-224 hash of a binary string.",

// Suggested
description = "Computes the SHA-224 binary hash of a string or binary input.",

2. Inconsistency in Return Types

  • MD5 returns DataType::Utf8View (hex-encoded string)
  • SHA functions return DataType::Binary (raw binary)
  • digest() returns DataType::Binary

This inconsistency could be confusing for users. While this might be intentional for backward compatibility, it should be documented clearly or reconsidered for API consistency.

3. Unreachable Code Pattern
In sha.rs:169-172, the match arm uses unreachable!() for non-SHA algorithms:

DigestAlgorithm::Md5
| DigestAlgorithm::Blake2s
| DigestAlgorithm::Blake2b
| DigestAlgorithm::Blake3 => unreachable!(),

While this is technically safe given the constructor constraints, consider using a type-safe approach with a separate enum for SHA-specific algorithms, or add a debug assertion with context:

_ => unreachable!("SHAFunc should only be constructed with SHA algorithms, got {:?}", self.algorithm)

4. Removed Helper Function
The utf8_or_binary_to_binary_type() helper was removed from basic.rs. While it's no longer used after the refactor, verify that no external code depends on it (it was pub).


🎯 Code Quality Observations

1. Macro Usage
The macros digest_to_array! and digest_to_scalar! are well-defined but their definitions are now placed between the function implementations. Consider moving them to the top of the impl block for better readability:

  • digest_to_array! moved from before digest methods to after hex_encode (lines 171-185)
  • digest_to_scalar! moved from top of file to after digest_to_array! (lines 187-196)

2. Improved Code Flow
The refactored digest_process() function (basic.rs:280-329) now directly calls the impl methods without the intermediate wrapper methods - cleaner and more direct.

3. Signature Simplification
Removing the complex NativeType parameter from coercion signatures simplifies the type system usage. Good change.


⚡ Performance Considerations

No Performance Concerns Identified

  • The refactoring is primarily structural and shouldn't impact performance
  • All hash computations use the same underlying implementations
  • Memory allocation patterns remain unchanged

🧪 Test Coverage

Adequate Coverage Observed
From expr.slt tests:

  • Tests for all SHA variants with various inputs (empty string, NULL, normal strings)
  • Tests comparing sha256() with digest(..., 'sha256')
  • Tests with different string types (Utf8View)
  • Blake2 and Blake3 coverage via digest function

Suggestion: Consider adding a test specifically for the new consolidated implementation to verify all four SHA functions work correctly from the same code path.


🔒 Security Considerations

No Security Issues

  • Hash implementations use standard, well-vetted crates (sha2, md5, blake2, blake3)
  • No unsafe code introduced
  • Input validation handled appropriately

📝 Minor Nitpicks

  1. Import Cleanup: basic.rs:20-21 could be consolidated:

    // Current
    use arrow::array::{Array, ArrayRef, BinaryArray, BinaryArrayType};
    use arrow::array::{AsArray, StringViewArray};
    
    // Suggested
    use arrow::array::{Array, ArrayRef, AsArray, BinaryArray, BinaryArrayType, StringViewArray};
  2. Documentation Comment: digest.rs:120 mentions "sha1" in the comment but it's not in the supported list:

    // Line 120: "Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512."
    // Should remove "sha1" or add it to supported algorithms

✨ Recommendations

  1. Fix the sha1 documentation inconsistency in digest.rs:120
  2. Clarify return type documentation for SHA functions (binary vs string)
  3. Consider adding a debug assertion to the unreachable!() case for better error messages
  4. Document the MD5/SHA return type difference in user-facing docs if this is intentional
  5. Add a test that exercises all four SHA variants in a single test case to verify the consolidated implementation

📊 Overall Assessment

Rating: ✅ Approved with Minor Suggestions

This is a well-executed refactoring that significantly reduces code duplication while maintaining functionality. The code quality improvements (better error types, stricter signatures, improved organization) outweigh the minor issues identified. The concerns raised are mostly about documentation clarity and defensive programming practices rather than functional bugs.

Great work on consolidating this code! 🎉

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Comment augment review to trigger a new review at any time.


/// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`].
/// Second argument is the algorithm to use.
/// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs list 'sha1' as a supported algorithm, but DigestAlgorithm doesn't include SHA-1 and no implementation is provided here—consider removing 'sha1' from this list (also applies to other docs if present).

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:documentation; feedback:The Augment AI reviewer is correct! There is no support for SHA1 and since it is a weak algorithm it should not be advertised in the docs. Prevents confusion in the reader.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
datafusion/functions/src/crypto/sha.rs (1)

163-174: Consider using a type-safe approach to eliminate unreachable!().

While the unreachable!() is safe due to the private constructor, it introduces a runtime panic point that could be eliminated with stronger type safety. Consider defining a SHA-specific algorithm enum or using the type system to ensure only SHA algorithms can be stored in SHAFunc.

Example approach:

#[derive(Debug, PartialEq, Eq, Hash)]
enum SHAAlgorithm {
    Sha224,
    Sha256,
    Sha384,
    Sha512,
}

impl SHAAlgorithm {
    fn to_digest_algorithm(&self) -> DigestAlgorithm {
        match self {
            SHAAlgorithm::Sha224 => DigestAlgorithm::Sha224,
            SHAAlgorithm::Sha256 => DigestAlgorithm::Sha256,
            SHAAlgorithm::Sha384 => DigestAlgorithm::Sha384,
            SHAAlgorithm::Sha512 => DigestAlgorithm::Sha512,
        }
    }
}

This would eliminate the need for the unreachable!() branch entirely.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47a826f and 1bf193e.

📒 Files selected for processing (11)
  • datafusion/functions/src/crypto/basic.rs (8 hunks)
  • datafusion/functions/src/crypto/digest.rs (6 hunks)
  • datafusion/functions/src/crypto/md5.rs (4 hunks)
  • datafusion/functions/src/crypto/mod.rs (1 hunks)
  • datafusion/functions/src/crypto/sha.rs (1 hunks)
  • datafusion/functions/src/crypto/sha224.rs (0 hunks)
  • datafusion/functions/src/crypto/sha256.rs (0 hunks)
  • datafusion/functions/src/crypto/sha384.rs (0 hunks)
  • datafusion/functions/src/crypto/sha512.rs (0 hunks)
  • datafusion/sqllogictest/test_files/expr.slt (1 hunks)
  • docs/source/user-guide/sql/scalar_functions.md (6 hunks)
💤 Files with no reviewable changes (4)
  • datafusion/functions/src/crypto/sha384.rs
  • datafusion/functions/src/crypto/sha512.rs
  • datafusion/functions/src/crypto/sha256.rs
  • datafusion/functions/src/crypto/sha224.rs
🧰 Additional context used
🧬 Code graph analysis (5)
datafusion/functions/src/crypto/mod.rs (3)
datafusion/functions/src/crypto/digest.rs (1)
  • digest (121-133)
datafusion/functions/src/crypto/basic.rs (1)
  • md5 (138-157)
datafusion/functions/src/crypto/sha.rs (4)
  • sha224 (106-108)
  • sha256 (110-112)
  • sha384 (114-116)
  • sha512 (118-120)
datafusion/functions/src/crypto/sha.rs (3)
datafusion/functions/src/crypto/basic.rs (2)
  • digest_process (280-329)
  • Self (109-121)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
datafusion/expr-common/src/signature.rs (2)
  • one_of (1182-1188)
  • new_exact (914-916)
datafusion/functions/src/crypto/md5.rs (3)
datafusion/functions/src/crypto/digest.rs (2)
  • name (97-99)
  • return_type (105-107)
datafusion/functions/src/crypto/sha.rs (2)
  • name (146-148)
  • return_type (154-156)
datafusion/expr-common/src/signature.rs (1)
  • new_exact (914-916)
datafusion/functions/src/crypto/basic.rs (2)
datafusion/common/src/utils/mod.rs (3)
  • take_function_args (926-940)
  • a (540-540)
  • a (545-545)
datafusion/functions/src/crypto/digest.rs (1)
  • digest (121-133)
datafusion/functions/src/crypto/digest.rs (2)
datafusion/functions/src/crypto/basic.rs (1)
  • digest_process (280-329)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (926-940)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (11)
datafusion/functions/src/crypto/sha.rs (4)

34-96: LGTM: Clear documentation with concrete examples.

The documentation for each SHA variant includes practical SQL examples with actual hash outputs, which will help users understand the expected behavior.


98-120: LGTM: Clean struct design with well-encapsulated constructors.

The public constructors ensure type safety by only allowing valid SHA algorithm variants to be created.


122-138: LGTM: Proper signature configuration.

The use of exact coercions with logical_string() and logical_binary() provides strict type checking, ensuring only string or binary inputs are accepted. The Immutable volatility is correct for deterministic hash functions.


158-161: LGTM: Concise delegation to digest processing.

The implementation correctly extracts the single required argument and delegates to the centralized digest_process function.

datafusion/functions/src/crypto/basic.rs (7)

84-84: LGTM: Essential derives for the refactored architecture.

Adding PartialEq, Eq, Hash derives enables DigestAlgorithm to be used effectively in the new SHAFunc structure and supports comparison and hashing operations.


155-155: LGTM: Appropriate error type for unreachable code path.

Using internal_err! is correct here since digest_process with DigestAlgorithm::Md5 always returns either ColumnarValue::Array or ColumnarValue::Scalar(ScalarValue::Binary). Reaching this branch would indicate a programming error.


200-200: LGTM: Improved encapsulation through private helper methods.

Making digest_scalar, digest_utf8_array_impl, and digest_binary_array_impl private improves encapsulation by forcing all digest operations through the centralized digest_process function.

Also applies to: 217-217, 248-248


285-312: LGTM: Explicit type handling with clear error messages.

The refactored array processing explicitly handles all supported string and binary types with appropriate casting methods. The error message includes both the unsupported type and the digest algorithm, which aids debugging.


313-328: LGTM: Clean scalar value handling.

The scalar processing branch cleanly handles all string and binary variants using pattern matching with | operators, maintaining consistency with the array processing logic.


171-185: LGTM: Effective code deduplication through macros.

The digest_to_array and digest_to_scalar macros successfully eliminate repetitive digest finalization code across multiple algorithm variants, while allowing Blake3's unique API to be handled specially. This improves maintainability.

Also applies to: 200-246


187-196: The #[allow(deprecated)] annotation appears unnecessary.

As of sha2 0.10.9, finalize().as_slice() is the documented way to convert the digest output to a &[u8] slice and is not deprecated. The #[allow(deprecated)] at line 192 may be overly cautious or left over from an earlier version concern, but no actual deprecated API is being invoked in the current dependency versions.

If the suppress annotation was added defensively, it can be removed. If there's a specific compiler or version combination producing a false positive warning, that should be investigated separately—but the code itself is correct per the library's current documentation.

Comment on lines 60 to 68
signature: Signature::one_of(
vec![
TypeSignature::Coercible(vec![Coercion::new_implicit(
TypeSignatureClass::Native(logical_binary()),
vec![TypeSignatureClass::Native(logical_string())],
NativeType::String,
TypeSignature::Coercible(vec![Coercion::new_exact(
TypeSignatureClass::Native(logical_string()),
)]),
TypeSignature::Coercible(vec![Coercion::new_implicit(
TypeSignature::Coercible(vec![Coercion::new_exact(
TypeSignatureClass::Native(logical_binary()),
vec![TypeSignatureClass::Native(logical_binary())],
NativeType::Binary,
)]),
],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Restore unary signature for md5.

The new signature lists two coercions per variant, so the planner now expects two arguments for md5, breaking every existing md5(expr) call. Drop the extra coercion so each variant only covers a single parameter.

-                vec![
-                    TypeSignature::Coercible(vec![Coercion::new_exact(
-                        TypeSignatureClass::Native(logical_string()),
-                    )]),
-                    TypeSignature::Coercible(vec![Coercion::new_exact(
-                        TypeSignatureClass::Native(logical_binary()),
-                    )]),
-                ],
+                vec![
+                    TypeSignature::Coercible(vec![Coercion::new_exact(
+                        TypeSignatureClass::Native(logical_string()),
+                    )]),
+                    TypeSignature::Coercible(vec![Coercion::new_exact(
+                        TypeSignatureClass::Native(logical_binary()),
+                    )]),
+                ],

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In datafusion/functions/src/crypto/md5.rs around lines 60 to 68 the function
signature currently lists two coercions per variant (each vec contains two
Coercion entries) which makes the planner expect two arguments; change each
variant to contain a single Coercible entry with only one
Coercion::new_exact(TypeSignatureClass::Native(...)) so the signature represents
a unary function (md5(expr)) for both string and binary types.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! The signature type is Signature::one_of() (https://github.com/apache/datafusion/pull/18664/files#diff-bb3ee484c275734422691e252119fb89eacfc549f75c2a90f2c6b56472e9289dR60). https://github.com/apache/datafusion/pull/18664/files#diff-a43ffc888af37d9724f2aae6f6990449179aedfb0257a36c1a39724e6613fc96R4794 shows that using one parameter still works

@martin-augment
Copy link
Copy Markdown
Owner Author

2. Documentation Comment: digest.rs:120 mentions "sha1" in the comment but it's not in the supported list:
rust // Line 120: "Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512." // Should remove "sha1" or add it to supported algorithms

value:useful; category:documentation; feedback:The Claude AI reviewer is correct! There is no support for SHA1 and since it is a weak algorithm it should not be advertised in the docs. Prevents confusion in the reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants