Skip to content

18448: Refactor create_hashes to accept array references#8

Open
martin-augment wants to merge 1 commit intomainfrom
pr-18448-2025-11-06-14-11-49
Open

18448: Refactor create_hashes to accept array references#8
martin-augment wants to merge 1 commit intomainfrom
pr-18448-2025-11-06-14-11-49

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

@martin-augment martin-augment commented Nov 6, 2025

18448: To review by AI


Note

Introduce create_hashes_from_arrays for &dyn Array, extract hash_single_array, and update hashing call sites to use the new API.

  • Common (datafusion/common):
    • Add create_hashes_from_arrays(&[&dyn Array], ...) and reuse via new internal hash_single_array.
    • Refactor create_hashes(&[ArrayRef], ...) to delegate to hash_single_array.
    • Replace redundant Arc::clone with direct values() access in list/fixed list/dictionary hashing.
    • Add test test_create_hashes_from_arrays.
  • Scalar:
    • Use create_hashes_from_arrays in hash_nested_array.
  • Physical Expr (binary maps):
    • Switch hashing to create_hashes_from_arrays in binary_map.rs and binary_view_map.rs.
  • Physical Plan (hash join tests):
    • Update tests to use create_hashes_from_arrays for key hashing.

Written by Cursor Bugbot for commit 57f98ad. This will update automatically on new commits. Configure here.

Change create_hashes and related functions to work with &dyn Array
references instead of requiring ArrayRef (Arc-wrapped arrays).
This avoids unnecessary Arc::clone() calls and enables calls that
only have an &dyn Array to use the hashing utilities.

Changes:
- Add create_hashes_from_arrays(&[&dyn Array]) function
- Refactor hash_dictionary, hash_list_array, hash_fixed_list_array
  to use references instead of cloning
- Extract hash_single_array() helper for common logic
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 6, 2025

Walkthrough

Introduces new hash utility functions create_hashes_from_arrays and hash_single_array and refactors hash computation across multiple files to use non-Arc array references instead of Arc-wrapped array patterns, reducing unnecessary allocations and simplifying the hashing API.

Changes

Cohort / File(s) Summary
Hash utilities refactoring
datafusion/common/src/hash_utils.rs
Adds public function create_hashes_from_arrays to process multiple arrays and internal helper hash_single_array for per-array hashing logic. Refactors existing hashing paths to use non-Arc references. Adds test test_create_hashes_from_arrays.
Hash utility consumers
datafusion/common/src/scalar/mod.rs,
datafusion/physical-expr-common/src/binary_map.rs,
datafusion/physical-expr-common/src/binary_view_map.rs
Updates imports from create_hashes to create_hashes_from_arrays and refactors call sites to pass array references using .as_ref() instead of Arc-wrapped arrays.
Hash join test code
datafusion/physical-plan/src/joins/hash_join/exec.rs
Updates test code to replace create_hashes usage with create_hashes_from_arrays and adjusts call sites to pass array references.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18448-2025-11-06-14-11-49

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 6, 2025

PR Review: Refactor create_hashes to accept array references

Summary

This PR introduces a performance optimization by adding create_hashes_from_arrays that accepts &dyn Array references instead of requiring ArrayRef (Arc-wrapped arrays). This is a well-executed refactoring that eliminates unnecessary Arc::clone() calls.


Code Quality and Best Practices ✅

Strengths:

  • Excellent API design: The new create_hashes_from_arrays function provides a more flexible API while maintaining backward compatibility with the existing create_hashes function
  • Code reuse: Smart extraction of hash_single_array helper eliminates code duplication between the two public APIs
  • Clear documentation: Doc comments properly explain the relationship between create_hashes and create_hashes_from_arrays
  • Consistent patterns: The refactoring maintains the existing code style and patterns throughout

Minor observations:

  • The #[cfg(not(feature = "force_hash_collisions"))] attributes are correctly placed on both versions of hash_single_array
  • Import cleanup (removing unused std::sync::Arc) is appropriate

Potential Bugs or Issues ✅

No critical bugs identified. The changes are straightforward and maintain the same logic flow:

  1. Dictionary hashing (hash_utils.rs:215-217): Correctly eliminates redundant Arc::clone and uses create_hashes_from_arrays with .as_ref()
  2. List array hashing (hash_utils.rs:311): Same pattern applied correctly
  3. Fixed list array hashing (hash_utils.rs:342): Consistent with other changes
  4. All call sites updated: binary_map.rs, binary_view_map.rs, scalar/mod.rs, and hash_join tests all properly migrated

Edge cases handled:

  • Null handling preserved in all hash functions
  • Dictionary value indexing unchanged
  • List offset logic intact

Performance Considerations ⚡

Performance Improvements:

  • Eliminates unnecessary Arc cloning: The previous code had Arc::clone(array.values()) followed by immediate usage in hash_utils.rs:215, 307, 338. This is now array.values() with direct .as_ref() usage - this removes atomic reference count increments/decrements
  • Zero-cost abstraction: The new API accepts &dyn Array which is what ArrayRef dereferences to anyway, so there's no runtime overhead
  • Reduced memory pressure: Fewer Arc operations means less work for the allocator

No performance regressions:

  • The create_hashes function now delegates to hash_single_array, which is the same code that was previously inlined - compiler should optimize this identically
  • All hash computations use the same algorithms as before

Benchmark recommendation:
While this is clearly an improvement, it would be valuable to add a microbenchmark showing the Arc overhead reduction, especially for dictionary and list arrays where this was most prevalent.


Security Concerns ✅

No security issues identified:

  • No changes to actual hashing algorithms or random state generation
  • No new unsafe code introduced
  • No changes to null handling that could cause panics
  • The .unwrap() calls in hash_utils.rs:385 are acceptable (downcasting to the correct type after pattern matching)

Test Coverage ✅

Good test coverage:

  • ✅ New test test_create_hashes_from_arrays added (hash_utils.rs:927-940) that verifies the new API works with multiple array types
  • ✅ Existing tests continue to use create_hashes, ensuring backward compatibility is maintained
  • ✅ Integration points updated in hash_join tests (exec.rs:3457, 3490, 3528, 3555)

Test quality:
The new test is simple but effective - it tests the multi-column hashing behavior which is the primary use case.

Suggestions for additional testing:

  1. Consider adding a test that verifies create_hashes_from_arrays produces identical results to create_hashes for the same input
  2. Test with dictionary arrays specifically since that's where the most significant change occurred
  3. Consider a test with nested arrays (list/struct) to verify the recursive hashing still works

Example test to consider adding:

#[test]
fn test_create_hashes_equivalence() {
    let array = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
    let random_state = RandomState::with_seeds(0, 0, 0, 0);
    
    let mut hashes1 = vec![0; array.len()];
    create_hashes(&[Arc::clone(&array)], &random_state, &mut hashes1).unwrap();
    
    let mut hashes2 = vec![0; array.len()];
    create_hashes_from_arrays(&[array.as_ref()], &random_state, &mut hashes2).unwrap();
    
    assert_eq!(hashes1, hashes2);
}

Summary

This is a high-quality refactoring that achieves its goal of reducing unnecessary Arc operations. The code changes are minimal, focused, and maintain full backward compatibility.

Recommendation: ✅ Approve with minor suggestions

The PR is ready to merge as-is, but consider:

  1. Adding an equivalence test between the two APIs
  2. Adding benchmark results showing the Arc overhead reduction (especially for dictionary arrays)

Great work on this optimization! 🎉

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
datafusion/common/src/scalar/mod.rs (1)

883-885: Slice-based hashing avoids extra allocation; consider expect over unwrap

Good change: passing &[arr.as_ref()] avoids creating a temporary Vec<ArrayRef> while preserving semantics for the first column seed. Consider replacing unwrap() with expect(...) for clearer panic context.

-    let hashes =
-        create_hashes_from_arrays(&[arr.as_ref()], &random_state, hashes_buffer).unwrap();
+    let hashes = create_hashes_from_arrays(&[arr.as_ref()], &random_state, hashes_buffer)
+        .expect("hash_nested_array: failed to create row hashes");
datafusion/common/src/hash_utils.rs (1)

926-940: Consider enhancing test coverage.

The test verifies basic functionality, but could be strengthened by:

  1. Comparing hash values from create_hashes_from_arrays with create_hashes to ensure parity
  2. Verifying that multi-column hashing correctly combines hashes (e.g., checking that hashes[i] differs when hashing one vs. two columns)
  3. Testing with null values to ensure proper handling

Example enhancement:

#[test]
fn test_create_hashes_from_arrays() {
    let int_array = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
    let float_array = Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0, 4.0]));

    let random_state = RandomState::with_seeds(0, 0, 0, 0);
    
    // Test with new API
    let mut hashes_new = vec![0; int_array.len()];
    create_hashes_from_arrays(
        &[int_array.as_ref(), float_array.as_ref()],
        &random_state,
        &mut hashes_new,
    )
    .unwrap();
    
    // Compare with old API for compatibility
    let mut hashes_old = vec![0; int_array.len()];
    create_hashes(
        &[Arc::clone(&int_array), Arc::clone(&float_array)],
        &random_state,
        &mut hashes_old,
    )
    .unwrap();
    
    assert_eq!(hashes_new.len(), 4);
    assert_eq!(hashes_new, hashes_old, "New and old APIs should produce identical hashes");
    
    // Verify multi-column hashing differs from single-column
    let mut hashes_single = vec![0; int_array.len()];
    create_hashes_from_arrays(
        &[int_array.as_ref()],
        &random_state,
        &mut hashes_single,
    )
    .unwrap();
    assert_ne!(hashes_new, hashes_single, "Multi-column hashes should differ from single-column");
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2059dcc and 57f98ad.

📒 Files selected for processing (5)
  • datafusion/common/src/hash_utils.rs (5 hunks)
  • datafusion/common/src/scalar/mod.rs (2 hunks)
  • datafusion/physical-expr-common/src/binary_map.rs (2 hunks)
  • datafusion/physical-expr-common/src/binary_view_map.rs (2 hunks)
  • datafusion/physical-plan/src/joins/hash_join/exec.rs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
datafusion/physical-expr-common/src/binary_view_map.rs (1)
datafusion/common/src/hash_utils.rs (1)
  • create_hashes_from_arrays (444-455)
datafusion/physical-expr-common/src/binary_map.rs (1)
datafusion/common/src/hash_utils.rs (1)
  • create_hashes_from_arrays (444-455)
datafusion/common/src/hash_utils.rs (1)
datafusion/common/src/cast.rs (8)
  • as_boolean_array (152-154)
  • as_string_view_array (142-144)
  • as_binary_view_array (302-304)
  • as_struct_array (57-59)
  • as_list_array (157-159)
  • as_large_list_array (183-185)
  • as_map_array (195-197)
  • as_fixed_size_list_array (312-314)
datafusion/physical-plan/src/joins/hash_join/exec.rs (1)
datafusion/common/src/hash_utils.rs (1)
  • create_hashes_from_arrays (444-455)
datafusion/common/src/scalar/mod.rs (1)
datafusion/common/src/hash_utils.rs (1)
  • create_hashes_from_arrays (444-455)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: codex
  • GitHub Check: claude-review
🔇 Additional comments (13)
datafusion/physical-plan/src/joins/hash_join/exec.rs (3)

1518-1518: LGTM: Import updated to new hashing utility.

The import correctly switches to create_hashes_from_arrays which accepts array references instead of Arc-wrapped arrays.


3457-3461: LGTM: Test code correctly adapted to new array reference API.

The changes properly use .as_ref() to convert ArrayRef (Arc) to &dyn Array, which matches the new create_hashes_from_arrays signature. This eliminates unnecessary Arc cloning.

Also applies to: 3490-3494


3528-3532: LGTM: Consistent usage of array references in tests.

The test code consistently applies the same pattern of using .as_ref() to pass array references to the hashing utility.

Also applies to: 3555-3559

datafusion/physical-expr-common/src/binary_view_map.rs (2)

26-26: LGTM: Import updated to new array reference hashing utility.

The import correctly switches to create_hashes_from_arrays.


246-246: LGTM: Improved hashing performance by using array references.

The change from Arc::clone(values) to values.as_ref() eliminates an unnecessary Arc allocation and reference count increment. This is a performance improvement with no change in behavior since create_hashes_from_arrays accepts array references.

datafusion/physical-expr-common/src/binary_map.rs (2)

30-30: LGTM: Import updated to new array reference hashing utility.

The import correctly switches to create_hashes_from_arrays, consistent with the refactoring across the codebase.


352-352: LGTM: Performance improvement through array references.

The change from Arc::clone(values) to values.as_ref() eliminates unnecessary Arc allocation, consistent with the same refactoring in binary_view_map.rs. This improves performance while maintaining correct behavior.

datafusion/common/src/scalar/mod.rs (1)

54-54: Import updated to new hashing API

Both create_hashes and create_hashes_from_arrays coexist in hash_utils.rs with different signatures. The switch in scalar/mod.rs to create_hashes_from_arrays is correct—it passes &[&dyn Array] directly without Arc wrapping. The 20+ existing uses of the original create_hashes function remain unaffected and continue to work. No migration issues.

datafusion/common/src/hash_utils.rs (5)

215-217: LGTM: Efficient refactor to avoid Arc clones.

The change from create_hashes(&[dict_values], ...) to create_hashes_from_arrays(&[dict_values.as_ref()], ...) correctly uses the new API and avoids unnecessary Arc clones.


307-311: LGTM: Consistent refactor pattern.

The refactor correctly applies the same pattern as in hash_dictionary, using as_ref() to work with the new create_hashes_from_arrays API.


338-342: LGTM: Consistent application of the refactor.

The changes follow the established pattern and correctly use the new API.


365-421: Excellent refactor: Clean extraction of hashing logic.

The new hash_single_array function effectively consolidates the per-array hashing logic, making it reusable across both create_hashes and create_hashes_from_arrays. The comprehensive type dispatch and proper rehash handling maintain correctness while improving code organization.


444-474: LGTM: Well-designed API with backward compatibility.

The new create_hashes_from_arrays function provides a more efficient API for callers who already have &dyn Array references, while create_hashes maintains backward compatibility by delegating to the new internal implementation. The documentation clearly explains the relationship between the two functions.

@martin-augment
Copy link
Copy Markdown
Owner Author

Suggestions for additional testing:

  1. Consider adding a test that verifies create_hashes_from_arrays produces identical results to create_hashes for the same input
  2. Test with dictionary arrays specifically since that's where the most significant change occurred
  3. Consider a test with nested arrays (list/struct) to verify the recursive hashing still works

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that more unit tests could be added to make sure that the changed lines do not introduce any regressions.

@martin-augment
Copy link
Copy Markdown
Owner Author

883-885: Slice-based hashing avoids extra allocation; consider expect over unwrap

Good change: passing &[arr.as_ref()] avoids creating a temporary Vec<ArrayRef> while preserving semantics for the first column seed. Consider replacing unwrap() with expect(...) for clearer panic context.

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that if it is not possible to return a Result::Err then the next best thing is to use .expect("...") instead of .unwrap(). This would give the eventual panic some context and it would make it easier to debug the problem.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Nov 6, 2025

Findings

  • datafusion/functions/src/datetime/now.rs:55 — NowFunc::new() now hardcodes timezone: Some(Arc::from("+00")). The offset format is missing the minutes component (+HH:MM), so we return an invalid timezone string. The old implementation delegated to ConfigOptions::default() and produced a valid value; this change regresses both correctness and compatibility for anyone still calling the (deprecated) ctor. Please restore the previous behaviour (or at least use "+00:00" and continue to honour the default config).

@martin-augment
Copy link
Copy Markdown
Owner Author

  • datafusion/functions/src/datetime/now.rs:55 — NowFunc::new() now hardcodes timezone: Some(Arc::from("+00")). The offset format is missing the minutes component (+HH:MM), so we return an invalid timezone string. The old implementation delegated to ConfigOptions::default() and produced a valid value; this change regresses both correctness and compatibility for anyone still calling the (deprecated) ctor. Please restore the previous behaviour (or at least use "+00:00" and continue to honour the default config).

value:annoying; category:bug; feedback:The Codex AI reviewer is hallucinating again. The mentioned file is not part of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants