18448: Refactor create_hashes to accept array references #8

martin-augment wants to merge 1 commit into main from
Conversation
Change create_hashes and related functions to work with &dyn Array references instead of requiring ArrayRef (Arc-wrapped arrays). This avoids unnecessary Arc::clone() calls and lets callers that only have an &dyn Array use the hashing utilities.

Changes:
- Add a create_hashes_from_arrays(&[&dyn Array]) function
- Refactor hash_dictionary, hash_list_array, and hash_fixed_list_array to use references instead of cloning
- Extract a hash_single_array() helper for the common logic
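To illustrate the signature change the PR describes, here is a minimal, dependency-free sketch. The `Array` trait, `Int32Array` struct, and the two `total_len_*` functions are stand-ins invented for this example (not DataFusion's or arrow's actual types); the point is the difference between taking `&[ArrayRef]` and taking `&[&dyn Array]`:

```rust
use std::sync::Arc;

// Toy stand-in for arrow's `Array` trait; illustrative only.
trait Array {
    fn len(&self) -> usize;
}

struct Int32Array(Vec<i32>);
impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

type ArrayRef = Arc<dyn Array>;

// Old style: callers must hand over owned Arcs, typically via Arc::clone.
fn total_len_arcs(arrays: &[ArrayRef]) -> usize {
    arrays.iter().map(|a| a.len()).sum()
}

// New style: plain trait-object references, no Arc required.
fn total_len_refs(arrays: &[&dyn Array]) -> usize {
    arrays.iter().map(|a| a.len()).sum()
}

fn main() {
    let a: ArrayRef = Arc::new(Int32Array(vec![1, 2, 3]));
    // Old API: the caller clones the Arc to keep its own copy.
    assert_eq!(total_len_arcs(&[Arc::clone(&a)]), 3);
    // New API: `a.as_ref()` is a free borrow, no refcount traffic.
    assert_eq!(total_len_refs(&[a.as_ref()]), 3);
    println!("ok");
}
```

A caller that only holds an `&dyn Array` (say, a column borrowed from a RecordBatch) can now call the reference-based function directly, which was not possible with the Arc-only signature.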
Walkthrough

Introduces new hash utility functions.
PR Review: Refactor create_hashes to accept array references

Summary

This PR introduces a performance optimization by adding create_hashes_from_arrays.

Code Quality and Best Practices ✅

Strengths:

Minor observations:

Potential Bugs or Issues ✅

No critical bugs identified. The changes are straightforward and maintain the same logic flow:

Edge cases handled:
Performance Considerations ⚡

Performance improvements:

No performance regressions:
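As a rough illustration of the cost being avoided: `Arc::clone` performs an atomic reference-count increment (and a matching decrement on drop), while borrowing through `as_ref()` touches no shared state. A minimal stdlib-only sketch:

```rust
use std::sync::Arc;

fn main() {
    let data: Arc<Vec<u64>> = Arc::new(vec![1, 2, 3]);
    assert_eq!(Arc::strong_count(&data), 1);

    // Arc::clone bumps the atomic reference count...
    let cloned = Arc::clone(&data);
    assert_eq!(Arc::strong_count(&data), 2);
    drop(cloned);

    // ...while borrowing the contents does not touch it.
    let borrowed: &[u64] = data.as_ref();
    assert_eq!(borrowed.len(), 3);
    assert_eq!(Arc::strong_count(&data), 1);
    println!("ok");
}
```

The per-call cost of one atomic increment is small, so the win is mainly in hot loops and in the API flexibility, which matches the review's suggestion to benchmark rather than assume a large speedup.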
Benchmark recommendation:

Security Concerns ✅

No security issues identified:
Test Coverage ✅

Good test coverage:

Test quality:

Suggestions for additional testing:
Example test to consider adding:

```rust
#[test]
fn test_create_hashes_equivalence() {
    let array = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
    let random_state = RandomState::with_seeds(0, 0, 0, 0);

    let mut hashes1 = vec![0; array.len()];
    create_hashes(&[Arc::clone(&array)], &random_state, &mut hashes1).unwrap();

    let mut hashes2 = vec![0; array.len()];
    create_hashes_from_arrays(&[array.as_ref()], &random_state, &mut hashes2).unwrap();

    assert_eq!(hashes1, hashes2);
}
```

Summary

This is a high-quality refactoring that achieves its goal of reducing unnecessary Arc operations. The code changes are minimal, focused, and maintain full backward compatibility.

Recommendation: ✅ Approve with minor suggestions

The PR is ready to merge as-is, but consider:
Great work on this optimization! 🎉
Actionable comments posted: 0
🧹 Nitpick comments (2)
datafusion/common/src/scalar/mod.rs (1)
883-885: Slice-based hashing avoids extra allocation; consider expect over unwrap

Good change: passing &[arr.as_ref()] avoids creating a temporary Vec<ArrayRef> while preserving semantics for the first column seed. Consider replacing unwrap() with expect(...) for clearer panic context.

```diff
-        let hashes =
-            create_hashes_from_arrays(&[arr.as_ref()], &random_state, hashes_buffer).unwrap();
+        let hashes = create_hashes_from_arrays(&[arr.as_ref()], &random_state, hashes_buffer)
+            .expect("hash_nested_array: failed to create row hashes");
```

datafusion/common/src/hash_utils.rs (1)
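The expect-over-unwrap suggestion is about panic diagnostics: both panic on `Err`, but `expect` lets the author say where and why. A small sketch with a hypothetical helper (`parse_port` is invented for this example, not part of DataFusion):

```rust
// unwrap() panics with a generic "called `Result::unwrap()` on an `Err`"
// message; expect() attaches context naming the operation that failed.
fn parse_port(s: &str) -> u16 {
    s.parse::<u16>()
        .expect("parse_port: expected a valid u16 port number")
}

fn main() {
    assert_eq!(parse_port("8080"), 8080);
    // parse_port("oops") would panic with the descriptive message above,
    // which is far easier to trace in logs than unwrap()'s generic text.
    println!("ok");
}
```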
926-940: Consider enhancing test coverage.

The test verifies basic functionality, but could be strengthened by:

- Comparing hash values from create_hashes_from_arrays with create_hashes to ensure parity
- Verifying that multi-column hashing correctly combines hashes (e.g., checking that hashes[i] differs when hashing one vs. two columns)
- Testing with null values to ensure proper handling

Example enhancement:

```rust
#[test]
fn test_create_hashes_from_arrays() {
    let int_array = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
    let float_array = Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0, 4.0]));
    let random_state = RandomState::with_seeds(0, 0, 0, 0);

    // Test with new API
    let mut hashes_new = vec![0; int_array.len()];
    create_hashes_from_arrays(
        &[int_array.as_ref(), float_array.as_ref()],
        &random_state,
        &mut hashes_new,
    )
    .unwrap();

    // Compare with old API for compatibility
    let mut hashes_old = vec![0; int_array.len()];
    create_hashes(
        &[Arc::clone(&int_array), Arc::clone(&float_array)],
        &random_state,
        &mut hashes_old,
    )
    .unwrap();

    assert_eq!(hashes_new.len(), 4);
    assert_eq!(
        hashes_new, hashes_old,
        "New and old APIs should produce identical hashes"
    );

    // Verify multi-column hashing differs from single-column
    let mut hashes_single = vec![0; int_array.len()];
    create_hashes_from_arrays(&[int_array.as_ref()], &random_state, &mut hashes_single)
        .unwrap();
    assert_ne!(
        hashes_new, hashes_single,
        "Multi-column hashes should differ from single-column"
    );
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- datafusion/common/src/hash_utils.rs (5 hunks)
- datafusion/common/src/scalar/mod.rs (2 hunks)
- datafusion/physical-expr-common/src/binary_map.rs (2 hunks)
- datafusion/physical-expr-common/src/binary_view_map.rs (2 hunks)
- datafusion/physical-plan/src/joins/hash_join/exec.rs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
datafusion/physical-expr-common/src/binary_view_map.rs (1)
datafusion/common/src/hash_utils.rs (1)
create_hashes_from_arrays(444-455)
datafusion/physical-expr-common/src/binary_map.rs (1)
datafusion/common/src/hash_utils.rs (1)
create_hashes_from_arrays(444-455)
datafusion/common/src/hash_utils.rs (1)
datafusion/common/src/cast.rs (8)
as_boolean_array (152-154), as_string_view_array (142-144), as_binary_view_array (302-304), as_struct_array (57-59), as_list_array (157-159), as_large_list_array (183-185), as_map_array (195-197), as_fixed_size_list_array (312-314)
datafusion/physical-plan/src/joins/hash_join/exec.rs (1)
datafusion/common/src/hash_utils.rs (1)
create_hashes_from_arrays(444-455)
datafusion/common/src/scalar/mod.rs (1)
datafusion/common/src/hash_utils.rs (1)
create_hashes_from_arrays(444-455)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: codex
- GitHub Check: claude-review
🔇 Additional comments (13)
datafusion/physical-plan/src/joins/hash_join/exec.rs (3)
1518-1518: LGTM: Import updated to new hashing utility.

The import correctly switches to create_hashes_from_arrays, which accepts array references instead of Arc-wrapped arrays.
3457-3461: LGTM: Test code correctly adapted to new array reference API.

The changes properly use .as_ref() to convert ArrayRef (Arc) to &dyn Array, which matches the new create_hashes_from_arrays signature. This eliminates unnecessary Arc cloning.

Also applies to: 3490-3494
3528-3532: LGTM: Consistent usage of array references in tests.

The test code consistently applies the same pattern of using .as_ref() to pass array references to the hashing utility.

Also applies to: 3555-3559
datafusion/physical-expr-common/src/binary_view_map.rs (2)
26-26: LGTM: Import updated to new array reference hashing utility.

The import correctly switches to create_hashes_from_arrays.
246-246: LGTM: Improved hashing performance by using array references.

The change from Arc::clone(values) to values.as_ref() eliminates an unnecessary Arc clone and its atomic reference count increment. This is a performance improvement with no change in behavior, since create_hashes_from_arrays accepts array references.

datafusion/physical-expr-common/src/binary_map.rs (2)
30-30: LGTM: Import updated to new array reference hashing utility.

The import correctly switches to create_hashes_from_arrays, consistent with the refactoring across the codebase.
352-352: LGTM: Performance improvement through array references.

The change from Arc::clone(values) to values.as_ref() eliminates an unnecessary Arc clone, consistent with the same refactoring in binary_view_map.rs. This improves performance while maintaining correct behavior.

datafusion/common/src/scalar/mod.rs (1)
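The `values.as_ref()` pattern these comments describe reborrows the trait object behind the `Arc` without touching the reference count. A minimal sketch with a toy trait (`Values` and `Col` are invented stand-ins for arrow's types):

```rust
use std::sync::Arc;

// Toy trait standing in for arrow's `Array`; illustrative only.
trait Values {
    fn first(&self) -> u64;
}

struct Col(Vec<u64>);
impl Values for Col {
    fn first(&self) -> u64 {
        self.0[0]
    }
}

fn main() {
    let values: Arc<dyn Values> = Arc::new(Col(vec![42, 7]));

    // as_ref() reborrows the trait object behind the Arc:
    // no refcount change, no new allocation.
    let borrowed: &dyn Values = values.as_ref();
    assert_eq!(borrowed.first(), 42);
    assert_eq!(Arc::strong_count(&values), 1);
    println!("ok");
}
```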
54-54: Import updated to new hashing API

Both create_hashes and create_hashes_from_arrays coexist in hash_utils.rs with different signatures. The switch in scalar/mod.rs to create_hashes_from_arrays is correct: it passes &[&dyn Array] directly without Arc wrapping. The 20+ existing uses of the original create_hashes function remain unaffected and continue to work. No migration issues.

datafusion/common/src/hash_utils.rs (5)
215-217: LGTM: Efficient refactor to avoid Arc clones.

The change from create_hashes(&[dict_values], ...) to create_hashes_from_arrays(&[dict_values.as_ref()], ...) correctly uses the new API and avoids unnecessary Arc clones.
307-311: LGTM: Consistent refactor pattern.

The refactor correctly applies the same pattern as in hash_dictionary, using as_ref() to work with the new create_hashes_from_arrays API.
338-342: LGTM: Consistent application of the refactor.

The changes follow the established pattern and correctly use the new API.
365-421: Excellent refactor: Clean extraction of hashing logic.

The new hash_single_array function effectively consolidates the per-array hashing logic, making it reusable across both create_hashes and create_hashes_from_arrays. The comprehensive type dispatch and proper rehash handling maintain correctness while improving code organization.
444-474: LGTM: Well-designed API with backward compatibility.

The new create_hashes_from_arrays function provides a more efficient API for callers who already have &dyn Array references, while create_hashes maintains backward compatibility by delegating to the new internal implementation. The documentation clearly explains the relationship between the two functions.
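The delegation structure this comment praises can be sketched without any arrow dependencies. All names below mirror the PR's functions but the bodies are invented for illustration (a trivial mixing step stands in for the real per-type hashing):

```rust
use std::sync::Arc;

// Toy stand-ins for arrow's types; illustrative only.
trait Array {
    fn values(&self) -> &[u64];
}

struct U64Array(Vec<u64>);
impl Array for U64Array {
    fn values(&self) -> &[u64] {
        &self.0
    }
}

type ArrayRef = Arc<dyn Array>;

// Shared per-array logic, analogous to the extracted hash_single_array.
fn hash_single_array(array: &dyn Array, hashes: &mut [u64]) {
    for (h, v) in hashes.iter_mut().zip(array.values()) {
        // Placeholder mixing step; the real code dispatches on data type.
        *h = h.wrapping_mul(31).wrapping_add(*v);
    }
}

// Reference-based API: callers pass borrowed arrays.
fn create_hashes_from_arrays(arrays: &[&dyn Array], hashes: &mut [u64]) {
    for a in arrays {
        hash_single_array(*a, hashes);
    }
}

// Backward-compatible Arc-based API delegates by borrowing each Arc.
fn create_hashes(arrays: &[ArrayRef], hashes: &mut [u64]) {
    for a in arrays {
        hash_single_array(a.as_ref(), hashes);
    }
}

fn main() {
    let col: ArrayRef = Arc::new(U64Array(vec![1, 2, 3]));
    let mut h1 = vec![0u64; 3];
    let mut h2 = vec![0u64; 3];
    create_hashes(&[Arc::clone(&col)], &mut h1);
    create_hashes_from_arrays(&[col.as_ref()], &mut h2);
    // Both entry points share one code path, so hashes agree.
    assert_eq!(h1, h2);
    println!("ok");
}
```

Because both public functions funnel into one helper, the parity property the reviewers ask tests for (old and new APIs producing identical hashes) holds by construction.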
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that more unit tests could be added to make sure that the changed lines do not introduce any regressions.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that if it is not possible to return a Result::Err then the next best thing is to use expect() with a descriptive message.
Findings
value:annoying; category:bug; feedback:The Codex AI reviewer is hallucinating again. The mentioned file is not part of this Pull Request.
Note
Introduce create_hashes_from_arrays for &dyn Array, extract hash_single_array, and update hashing call sites to use the new API.

datafusion/common:
- Add create_hashes_from_arrays(&[&dyn Array], ...) and reuse it via the new internal hash_single_array.
- Refactor create_hashes(&[ArrayRef], ...) to delegate to hash_single_array.
- Replace Arc::clone with direct values() access in list/fixed list/dictionary hashing.
- Add test test_create_hashes_from_arrays.
- Use create_hashes_from_arrays in hash_nested_array.
- Use create_hashes_from_arrays in binary_map.rs and binary_view_map.rs.
- Use create_hashes_from_arrays for key hashing.

Written by Cursor Bugbot for commit 57f98ad. This will update automatically on new commits.