-
Notifications
You must be signed in to change notification settings - Fork 2
Add faster (~6-10x) hashing algorithm #27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces the xxhash algorithm offering a much faster alternative (~6-10x) to MD5 for file hashing. Key changes include:
- Adding a new hash option 'binhash-xxh' that uses xxhash in the hashing logic.
- Updating tests in test_manifest.py to include the new 'binhash-xxh' algorithm.
- Updating dependency files (.conda/meta.yaml and .conda/env_dev.yml) to include the xxhash libraries.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| yamanifest/hashing.py | Updated hash function to support xxhash and modified supported_hashes ordering. |
| test/test_manifest.py | Adjusted tests to include the new 'binhash-xxh' hashing strategy. |
| .conda/meta.yaml | Included xxhash dependency. |
| .conda/env_dev.yml | Added python-xxhash to development environment. |
Comments suppressed due to low confidence (2)
yamanifest/hashing.py:35
- [nitpick] Consider adding a brief comment explaining why 'binhash-xxh' is placed at the top of supported_hashes to clarify its priority due to improved performance.
'binhash-xxh','binhash', 'binhash-nomtime', 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'
test/test_manifest.py:85
- [nitpick] Consider adding test assertions that specifically validate the output of the 'binhash-xxh' hash to verify that it behaves as expected in all covered scenarios.
mf1.add(os.path.join('test',filepath),['binhash-xxh','md5','sha1'])
|
This is cool @charles-turner-1. I'll add a positive review, but before merging I'd like @jo-basevi to do a few tests with |
yamanifest/hashing.py
Outdated
| m = hashlib.new('md5') | ||
| def _binhash(path, size, include_mtime, use_xxh=False): | ||
|
|
||
| m = xxhash.xxh64() if use_xxh else hashlib.new('md5') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to use xxh3 which is 50% faster and almost 2x small data velocity according to the benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked like it wasn't available in the Python version library when I looked yesterday afternoon.
I should have looked more carefully - turns out it is there, it's just called xxh3_64 and not xxh3 in the Python library.
I'v just had a quick look in a Jupyter notebook and it seems to only be maybe 20% faster, but I see no reason not to default to xxh3 if we're using the xxh hash
jo-basevi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me!
I've tested that this runs fine with payu as the manifests still defaults to using binhash. When testing changing the "fast-hash" from binhash to binhash-xxh, it seemed faster on the login node though timing was very variable. A number of configurations I tested seemed to not spend very long on calculating binhashes anyway (less than 1 second). Though I did find that ACCESS-OM2 01deg_jra55_iaf_bgc has a number of inputs and takes like ~10 seconds to calculate the input binhashes. Re-running payu setup on the login node didn't seem to make much difference using binhash or binhash-xxh, so time must be lost somewhere else.. Though running on payu setup on PBS jobs, it did cut the fast hash calculation time down by ~50% using binhash-xxh, so it might be worth making it the default for payu.
|
As it stands, I haven't changed the default from Or would that be a Payu specific change? |
Yeah, it will need to be a change to Payu here. |
|
Cool! I'll merge this & I can open the PR in Payu too if that makes life easier for you. |
Thanks! I've opened an issue in the payu repository. I'm just a bit hesitant of pushing it through to payu straight away as we'll need to let people know that the manifests will change, and all the slow |
if-elsechain inhashing.hashsince it's so much faster than the other hashes I'd rather sit the overhead elsewhere.xxhash: https://github.com/Cyan4973/xxHash
xxhash Python Bindings: https://github.com/ifduyue/python-xxhash
More context
The
build-esm-datastorecatalog utility uses yamanifest to track whether files in the catalogs have changed, but is basically way too slow to be user friendly - think on the order of a minute to a couple of minutes to run. I can't remember where, but I've seen a couple of notebooks where people are using that utility, but have commented it out to avoid rerunning it.I've been playing with performance improvements with the hashing on and off for a good while now, and didn't manage to make any headway with either Cython or Maturin - basically everything works out to be about the same speed. Presumably the MD5 hash that binhash uses is a compiled C extension (I haven't checked but it seems likely), which eats up nearly all the runtime & so any attempts to speed hashing by using compiled extensions do next to nothing to improve performance.
See also ACCESS-NRI/access-nri-intake-catalog#355
Anyhow, xxhash is just a much faster hashing algorithm than MD5. It looks relatively well maintained and has no dependencies, which is pretty crucial IMO.