Conversation
WalkthroughThis pull request refactors the DataFusion cryptographic functions module. Individual SHA algorithm implementations (sha224, sha256, sha384, sha512) are consolidated into a unified ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review - PR #21: Refactor crypto functions codeSummaryThis PR successfully consolidates the SHA hash functions (sha224, sha256, sha384, sha512) from four separate files into a single parameterized implementation. This is a good refactoring that reduces code duplication significantly (612 lines deleted vs 312 added). ✅ Strengths1. Excellent Code Deduplication
2. Improved Type Safety
3. Better Error Handling
4. Code Organization
5. Documentation Improvements
🔍 Issues & Concerns1. Critical: Incorrect Return Type in Documentation // Current (sha.rs:36)
description = "Computes the SHA-224 hash of a binary string.",
// Suggested
description = "Computes the SHA-224 binary hash of a string or binary input.",2. Inconsistency in Return Types
This inconsistency could be confusing for users. While this might be intentional for backward compatibility, it should be documented clearly or reconsidered for API consistency. 3. Unreachable Code Pattern DigestAlgorithm::Md5
| DigestAlgorithm::Blake2s
| DigestAlgorithm::Blake2b
| DigestAlgorithm::Blake3 => unreachable!(),While this is technically safe given the constructor constraints, consider using a type-safe approach with a separate enum for SHA-specific algorithms, or add a debug assertion with context: _ => unreachable!("SHAFunc should only be constructed with SHA algorithms, got {:?}", self.algorithm)4. Removed Helper Function 🎯 Code Quality Observations1. Macro Usage
2. Improved Code Flow 3. Signature Simplification ⚡ Performance ConsiderationsNo Performance Concerns Identified
🧪 Test CoverageAdequate Coverage Observed
Suggestion: Consider adding a test specifically for the new consolidated implementation to verify all four SHA functions work correctly from the same code path. 🔒 Security ConsiderationsNo Security Issues
📝 Minor Nitpicks
✨ Recommendations
📊 Overall AssessmentRating: ✅ Approved with Minor Suggestions This is a well-executed refactoring that significantly reduces code duplication while maintaining functionality. The code quality improvements (better error types, stricter signatures, improved organization) outweigh the minor issues identified. The concerns raised are mostly about documentation clarity and defensive programming practices rather than functional bugs. Great work on consolidating this code! 🎉 |
|
|
||
| /// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]. | ||
| /// Second argument is the algorithm to use. | ||
| /// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512. |
There was a problem hiding this comment.
The docs list 'sha1' as a supported algorithm, but DigestAlgorithm doesn't include SHA-1 and no implementation is provided here—consider removing 'sha1' from this list (also applies to other docs if present).
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:documentation; feedback:The Augment AI reviewer is correct! There is no support for SHA1 and since it is a weak algorithm it should not be advertised in the docs. Prevents confusion in the reader.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
datafusion/functions/src/crypto/sha.rs (1)
163-174: Consider using a type-safe approach to eliminate unreachable!().While the
unreachable!()is safe due to the private constructor, it introduces a runtime panic point that could be eliminated with stronger type safety. Consider defining a SHA-specific algorithm enum or using the type system to ensure only SHA algorithms can be stored inSHAFunc.Example approach:
#[derive(Debug, PartialEq, Eq, Hash)] enum SHAAlgorithm { Sha224, Sha256, Sha384, Sha512, } impl SHAAlgorithm { fn to_digest_algorithm(&self) -> DigestAlgorithm { match self { SHAAlgorithm::Sha224 => DigestAlgorithm::Sha224, SHAAlgorithm::Sha256 => DigestAlgorithm::Sha256, SHAAlgorithm::Sha384 => DigestAlgorithm::Sha384, SHAAlgorithm::Sha512 => DigestAlgorithm::Sha512, } } }This would eliminate the need for the
unreachable!()branch entirely.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
datafusion/functions/src/crypto/basic.rs(8 hunks)datafusion/functions/src/crypto/digest.rs(6 hunks)datafusion/functions/src/crypto/md5.rs(4 hunks)datafusion/functions/src/crypto/mod.rs(1 hunks)datafusion/functions/src/crypto/sha.rs(1 hunks)datafusion/functions/src/crypto/sha224.rs(0 hunks)datafusion/functions/src/crypto/sha256.rs(0 hunks)datafusion/functions/src/crypto/sha384.rs(0 hunks)datafusion/functions/src/crypto/sha512.rs(0 hunks)datafusion/sqllogictest/test_files/expr.slt(1 hunks)docs/source/user-guide/sql/scalar_functions.md(6 hunks)
💤 Files with no reviewable changes (4)
- datafusion/functions/src/crypto/sha384.rs
- datafusion/functions/src/crypto/sha512.rs
- datafusion/functions/src/crypto/sha256.rs
- datafusion/functions/src/crypto/sha224.rs
🧰 Additional context used
🧬 Code graph analysis (5)
datafusion/functions/src/crypto/mod.rs (3)
datafusion/functions/src/crypto/digest.rs (1)
digest(121-133)datafusion/functions/src/crypto/basic.rs (1)
md5(138-157)datafusion/functions/src/crypto/sha.rs (4)
sha224(106-108)sha256(110-112)sha384(114-116)sha512(118-120)
datafusion/functions/src/crypto/sha.rs (3)
datafusion/functions/src/crypto/basic.rs (2)
digest_process(280-329)Self(109-121)datafusion/common/src/utils/mod.rs (1)
take_function_args(926-940)datafusion/expr-common/src/signature.rs (2)
one_of(1182-1188)new_exact(914-916)
datafusion/functions/src/crypto/md5.rs (3)
datafusion/functions/src/crypto/digest.rs (2)
name(97-99)return_type(105-107)datafusion/functions/src/crypto/sha.rs (2)
name(146-148)return_type(154-156)datafusion/expr-common/src/signature.rs (1)
new_exact(914-916)
datafusion/functions/src/crypto/basic.rs (2)
datafusion/common/src/utils/mod.rs (3)
take_function_args(926-940)a(540-540)a(545-545)datafusion/functions/src/crypto/digest.rs (1)
digest(121-133)
datafusion/functions/src/crypto/digest.rs (2)
datafusion/functions/src/crypto/basic.rs (1)
digest_process(280-329)datafusion/common/src/utils/mod.rs (1)
take_function_args(926-940)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (11)
datafusion/functions/src/crypto/sha.rs (4)
34-96: LGTM: Clear documentation with concrete examples.The documentation for each SHA variant includes practical SQL examples with actual hash outputs, which will help users understand the expected behavior.
98-120: LGTM: Clean struct design with well-encapsulated constructors.The public constructors ensure type safety by only allowing valid SHA algorithm variants to be created.
122-138: LGTM: Proper signature configuration.The use of exact coercions with
logical_string()andlogical_binary()provides strict type checking, ensuring only string or binary inputs are accepted. TheImmutablevolatility is correct for deterministic hash functions.
158-161: LGTM: Concise delegation to digest processing.The implementation correctly extracts the single required argument and delegates to the centralized
digest_processfunction.datafusion/functions/src/crypto/basic.rs (7)
84-84: LGTM: Essential derives for the refactored architecture.Adding
PartialEq, Eq, Hashderives enablesDigestAlgorithmto be used effectively in the newSHAFuncstructure and supports comparison and hashing operations.
155-155: LGTM: Appropriate error type for unreachable code path.Using
internal_err!is correct here sincedigest_processwithDigestAlgorithm::Md5always returns eitherColumnarValue::ArrayorColumnarValue::Scalar(ScalarValue::Binary). Reaching this branch would indicate a programming error.
200-200: LGTM: Improved encapsulation through private helper methods.Making
digest_scalar,digest_utf8_array_impl, anddigest_binary_array_implprivate improves encapsulation by forcing all digest operations through the centralizeddigest_processfunction.Also applies to: 217-217, 248-248
285-312: LGTM: Explicit type handling with clear error messages.The refactored array processing explicitly handles all supported string and binary types with appropriate casting methods. The error message includes both the unsupported type and the digest algorithm, which aids debugging.
313-328: LGTM: Clean scalar value handling.The scalar processing branch cleanly handles all string and binary variants using pattern matching with
|operators, maintaining consistency with the array processing logic.
171-185: LGTM: Effective code deduplication through macros.The
digest_to_arrayanddigest_to_scalarmacros successfully eliminate repetitive digest finalization code across multiple algorithm variants, while allowing Blake3's unique API to be handled specially. This improves maintainability.Also applies to: 200-246
187-196: The #[allow(deprecated)] annotation appears unnecessary.As of sha2 0.10.9, finalize().as_slice() is the documented way to convert the digest output to a &[u8] slice and is not deprecated. The
#[allow(deprecated)]at line 192 may be overly cautious or left over from an earlier version concern, but no actual deprecated API is being invoked in the current dependency versions.If the suppress annotation was added defensively, it can be removed. If there's a specific compiler or version combination producing a false positive warning, that should be investigated separately—but the code itself is correct per the library's current documentation.
| signature: Signature::one_of( | ||
| vec![ | ||
| TypeSignature::Coercible(vec![Coercion::new_implicit( | ||
| TypeSignatureClass::Native(logical_binary()), | ||
| vec![TypeSignatureClass::Native(logical_string())], | ||
| NativeType::String, | ||
| TypeSignature::Coercible(vec![Coercion::new_exact( | ||
| TypeSignatureClass::Native(logical_string()), | ||
| )]), | ||
| TypeSignature::Coercible(vec![Coercion::new_implicit( | ||
| TypeSignature::Coercible(vec![Coercion::new_exact( | ||
| TypeSignatureClass::Native(logical_binary()), | ||
| vec![TypeSignatureClass::Native(logical_binary())], | ||
| NativeType::Binary, | ||
| )]), | ||
| ], |
There was a problem hiding this comment.
Restore unary signature for md5.
The new signature lists two coercions per variant, so the planner now expects two arguments for md5, breaking every existing md5(expr) call. Drop the extra coercion so each variant only covers a single parameter.
- vec![
- TypeSignature::Coercible(vec![Coercion::new_exact(
- TypeSignatureClass::Native(logical_string()),
- )]),
- TypeSignature::Coercible(vec![Coercion::new_exact(
- TypeSignatureClass::Native(logical_binary()),
- )]),
- ],
+ vec![
+ TypeSignature::Coercible(vec![Coercion::new_exact(
+ TypeSignatureClass::Native(logical_string()),
+ )]),
+ TypeSignature::Coercible(vec![Coercion::new_exact(
+ TypeSignatureClass::Native(logical_binary()),
+ )]),
+ ],Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In datafusion/functions/src/crypto/md5.rs around lines 60 to 68 the function
signature currently lists two coercions per variant (each vec contains two
Coercion entries) which makes the planner expect two arguments; change each
variant to contain a single Coercible entry with only one
Coercion::new_exact(TypeSignatureClass::Native(...)) so the signature represents
a unary function (md5(expr)) for both string and binary types.
There was a problem hiding this comment.
value:annoying; category:bug; feedback:The CodeRabbit AI reviewer is not correct! The signature type is Signature::one_of() (https://github.com/apache/datafusion/pull/18664/files#diff-bb3ee484c275734422691e252119fb89eacfc549f75c2a90f2c6b56472e9289dR60). https://github.com/apache/datafusion/pull/18664/files#diff-a43ffc888af37d9724f2aae6f6990449179aedfb0257a36c1a39724e6613fc96R4794 shows that using one parameter still works
value:useful; category:documentation; feedback:The Claude AI reviewer is correct! There is no support for SHA1 and since it is a weak algorithm it should not be advertised in the docs. Prevents confusion in the reader. |
18664: To review by AI