Skip to content

Conversation

@leekeiabstraction
Copy link
Contributor

@leekeiabstraction leekeiabstraction commented Dec 27, 2025

Purpose

Linked issue: close #112

Add HashBucketAssigner as part of larger rust client put KV feature

Brief change log

  • Added HashBucketAssigner
  • Added DataLakeFormat enum
  • Added ported over Flink's flavor of murmur hash implementation
  • Ported over murmur3 hash implementation to avoid pulling in crates as current publicly available rust murmur3 implementations are small personal projects

Tests

  • Added UTs (UT testing against similar values on will be added on Java side)
  • Murmur3 UT includes testing against documented input / seed / output of murmur3
  • Manually verified by comparing against output from Java's FlussBucketingFunction, PaimonBucketingFunction and IcebergBucketingFunction

@leekeiabstraction
Copy link
Contributor Author

@Kelvinyu1117 @luoyuxia , would appreciate review here.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds HashBucketAssigner as part of the larger Rust client put KV feature. The implementation includes a custom MurmurHash port from Apache Flink to avoid external dependencies, a new DataLakeFormat enum for format selection, and bucketing functions that support Fluss, Paimon, Lance, and Iceberg formats.

Key Changes:

  • Added HashBucketAssigner to route records to buckets based on hash-based bucketing strategies
  • Ported MurmurHash3 implementations (standard and Flink variants) for consistent hashing with Java implementations
  • Introduced DataLakeFormat enum and BucketingFunction trait with format-specific implementations

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
crates/fluss/src/util/murmur_hash.rs Implements MurmurHash3 algorithms (standard and Flink variants) with comprehensive test coverage
crates/fluss/src/util/mod.rs Exposes the murmur_hash module
crates/fluss/src/metadata/data_lake_format.rs Defines DataLakeFormat enum for supported data lake formats
crates/fluss/src/metadata/mod.rs Adds and exports the data_lake_format module
crates/fluss/src/lib.rs Adds the bucketing module to the library
crates/fluss/src/client/write/bucket_assigner.rs Implements HashBucketAssigner for hash-based bucket assignment
crates/fluss/src/bucketing/mod.rs Defines BucketingFunction trait and format-specific implementations with tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@luoyuxia luoyuxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leekeiabstraction Thanks for the pr. Left some comments. PTAL

Copy link
Contributor Author

@leekeiabstraction leekeiabstraction left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comments

Copy link
Contributor

@luoyuxia luoyuxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leekeiabstraction Thanks for update.. Only one comment.


// Finalization mix - force all bits of a hash block to avalanche
#[inline(always)]
fn fmix(mut h1: i32, length: usize) -> i32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope it won’t upset you.....
The tricky part is that standard murmur_hash use usize...

But the Fluss uses int32, although it's not standard, we have to follow it.
https://github.com/apache/fluss/blob/f7062eae19be72e5c8dd70373d43cf1f8d283d07/fluss-common/src/main/java/org/apache/fluss/utils/MurmurHashUtils.java#L147

So, iceberg use standard murmur_hash, fluss doesn't.. we will need to keep two..
Maybe make standard murmur_hash a mod, but fluss murmur_hash another mod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries!

Hmm, I'm not sure if I fully understand what you mean by another mod? Also, I can add an assert to check for size smaller than i32::Max for fluss/flink variant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. I'm also fine to add an assert to check for size smaller than i32::Max for fluss/flink variant.
At least it can avoid the difference of hash between rust client and java client

@leekeiabstraction
Copy link
Contributor Author

Also added test on Java side: apache/fluss#2274

@luoyuxia luoyuxia merged commit 9bc2e9c into apache:main Dec 29, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introduce HashBucketAssigner

2 participants