@charles-turner-1
Collaborator

@charles-turner-1 charles-turner-1 commented Jun 20, 2025

  • Adds the xxhash algorithm, which comes in about 6-10x faster than an MD5 hash over the range of test cases I've run.
  • I've put it at the top of the if-else chain in hashing.hash: since it's so much faster than the other hashes, I'd rather the branching overhead land on the slower ones.
  • Adds relevant tests.
  • Updates meta.yaml, pyproject.toml, etc.
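
As a rough illustration of the dispatch order described in the list above, here is a minimal sketch. Only the `_binhash` signature, the `xxhash.xxh64() if use_xxh else hashlib.new('md5')` line and the `supported_hashes` ordering are taken from this PR; the names `hash_file` and `_fullhash`, the default `size`, and the exact bytes that `_binhash` feeds to the hasher are assumptions made for illustration and may not match yamanifest.

```python
import hashlib
import os

try:
    import xxhash  # third-party: python-xxhash
except ImportError:  # keep the sketch importable without it
    xxhash = None

# Ordering as quoted in this PR: the fast hash comes first.
supported_hashes = ['binhash-xxh', 'binhash', 'binhash-nomtime',
                    'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512']


def _binhash(path, size, include_mtime, use_xxh=False):
    """ASSUMED behaviour: hash the first `size` bytes plus file metadata."""
    if use_xxh and xxhash is None:
        raise RuntimeError("python-xxhash is not installed")
    m = xxhash.xxh64() if use_xxh else hashlib.new('md5')
    with open(path, 'rb') as f:
        m.update(f.read(size))
    st = os.stat(path)
    m.update(str(st.st_size).encode())
    if include_mtime:
        m.update(str(st.st_mtime).encode())
    return m.hexdigest()


def _fullhash(path, name, blocksize=1 << 20):
    """Hypothetical helper: hash the whole file in fixed-size blocks."""
    m = hashlib.new(name)
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            m.update(block)
    return m.hexdigest()


def hash_file(path, hashfn, size=1 << 20):
    """Hypothetical dispatcher: the cheapest hash is tested first."""
    if hashfn == 'binhash-xxh':
        return _binhash(path, size, True, use_xxh=True)
    elif hashfn == 'binhash':
        return _binhash(path, size, True)
    elif hashfn == 'binhash-nomtime':
        return _binhash(path, size, False)
    elif hashfn in supported_hashes:
        return _fullhash(path, hashfn)
    raise ValueError(f"Unsupported hash: {hashfn}")
```

With this ordering, a call that asks for 'binhash-xxh' returns after a single string comparison, while the slower full-file hashes absorb the extra comparisons.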

xxhash: https://github.com/Cyan4973/xxHash
xxhash Python Bindings: https://github.com/ifduyue/python-xxhash

More context

The build-esm-datastore catalog utility uses yamanifest to track whether files in the catalogs have changed, but it is too slow to be user friendly: a run takes somewhere between one and a couple of minutes. I can't remember where, but I've seen a couple of notebooks where people use that utility but have commented it out to avoid rerunning it.

I've been working on the hashing performance on and off for a good while now, and didn't manage to make any headway with either Cython or Maturin: everything works out to be about the same speed. Presumably the MD5 hash that binhash uses is already a compiled C extension (I haven't checked, but it seems likely) and accounts for nearly all the runtime, so moving the surrounding code into compiled extensions does next to nothing to improve performance.
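
One way to check the presumption above is to time the file reads and the MD5 updates separately. The sketch below is not part of this PR: `split_timings` is a hypothetical helper and the 1 MiB block size is an arbitrary choice.

```python
import hashlib
import time


def split_timings(path, blocksize=1 << 20):
    """Return (read_seconds, hash_seconds) for one full MD5 pass over `path`."""
    m = hashlib.md5()
    read_s = 0.0
    hash_s = 0.0
    with open(path, 'rb') as f:
        while True:
            t0 = time.perf_counter()
            block = f.read(blocksize)
            t1 = time.perf_counter()
            read_s += t1 - t0
            if not block:
                break
            m.update(block)
            hash_s += time.perf_counter() - t1
    return read_s, hash_s
```

If `hash_s` dominates `read_s` on a warm file cache, the time is going into the digest itself, and a faster algorithm is more likely to help than recompiling the Python code around it.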

See also ACCESS-NRI/access-nri-intake-catalog#355

Anyhow, xxhash is just a much faster hashing algorithm than MD5. It looks relatively well maintained and has no dependencies, which is pretty crucial IMO.

@charles-turner-1 charles-turner-1 requested a review from Copilot June 20, 2025 00:49

Copilot AI left a comment

Pull Request Overview

This PR introduces the xxhash algorithm offering a much faster alternative (~6-10x) to MD5 for file hashing. Key changes include:

  • Adding a new hash option 'binhash-xxh' that uses xxhash in the hashing logic.
  • Updating tests in test_manifest.py to include the new 'binhash-xxh' algorithm.
  • Updating dependency files (.conda/meta.yaml and .conda/env_dev.yml) to include the xxhash libraries.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| yamanifest/hashing.py | Updated hash function to support xxhash and modified supported_hashes ordering. |
| test/test_manifest.py | Adjusted tests to include the new 'binhash-xxh' hashing strategy. |
| .conda/meta.yaml | Included xxhash dependency. |
| .conda/env_dev.yml | Added python-xxhash to development environment. |

Comments suppressed due to low confidence (2)

yamanifest/hashing.py:35

  • [nitpick] Consider adding a brief comment explaining why 'binhash-xxh' is placed at the top of supported_hashes to clarify its priority due to improved performance.
    'binhash-xxh','binhash', 'binhash-nomtime', 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'

test/test_manifest.py:85

  • [nitpick] Consider adding test assertions that specifically validate the output of the 'binhash-xxh' hash to verify that it behaves as expected in all covered scenarios.
        mf1.add(os.path.join('test',filepath),['binhash-xxh','md5','sha1'])
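
A sketch of the kind of assertion the second nitpick has in mind, written against the python-xxhash hasher directly rather than the yamanifest API. The function name `check_xxh64_behaviour` is hypothetical, and the check reports "skipped" when python-xxhash is not installed.

```python
try:
    import xxhash  # third-party: python-xxhash
except ImportError:
    xxhash = None


def check_xxh64_behaviour():
    """Return 'skipped' without python-xxhash, else 'ok' once all checks pass."""
    if xxhash is None:
        return "skipped"
    payload = b"manifest test payload"
    one_shot = xxhash.xxh64(payload).hexdigest()

    # XXH64 digest of empty input with the default seed of 0.
    assert xxhash.xxh64(b"").hexdigest() == "ef46db3751d8e999"
    # Same input, same digest.
    assert xxhash.xxh64(payload).hexdigest() == one_shot
    # Feeding the data in pieces matches hashing it in one call.
    h = xxhash.xxh64()
    h.update(payload[:8])
    h.update(payload[8:])
    assert h.hexdigest() == one_shot
    # Changed content changes the digest.
    assert xxhash.xxh64(payload + b"!").hexdigest() != one_shot
    # A 64-bit digest is 16 hex characters.
    assert len(one_shot) == 16
    return "ok"
```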

@aidanheerdegen
Member

aidanheerdegen commented Jun 20, 2025

This is cool @charles-turner-1.

I'll add a positive review, but before merging I'd like @jo-basevi to do a few tests with payu, both to check that there aren't any unexpected problems and to quantify the positive effect this could have on the time taken to check manifests at the start of every model run.

m = hashlib.new('md5')
def _binhash(path, size, include_mtime, use_xxh=False):

m = xxhash.xxh64() if use_xxh else hashlib.new('md5')
Member

Is there a reason not to use xxh3, which is about 50% faster and has almost 2x the small-data velocity according to the benchmarks?

https://github.com/Cyan4973/xxHash#benchmarks

Collaborator Author

It just looked like it wasn't available in the Python library when I looked yesterday afternoon.

I should have looked more carefully: it is there, it's just called xxh3_64 rather than xxh3 in the Python library.

I've just had a quick look in a Jupyter notebook and it seems to be only about 20% faster, but I see no reason not to default to xxh3 if we're using the xxh hash.
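
For anyone wanting to repeat that quick comparison, a minimal timing sketch follows. It is not the notebook used above: `best_time` and `compare` are hypothetical helpers, the buffer size is arbitrary, and the xxhash entries are only included when python-xxhash is installed.

```python
import hashlib
import timeit

try:
    import xxhash  # third-party: python-xxhash
except ImportError:
    xxhash = None


def best_time(factory, data, number=5, repeat=3):
    """Best observed seconds per one-shot digest of `data`."""
    runs = timeit.repeat(lambda: factory(data).digest(),
                         number=number, repeat=repeat)
    return min(runs) / number


def compare(data):
    """Map hash name -> best seconds per digest for the available hashers."""
    factories = {"md5": hashlib.md5}
    if xxhash is not None:
        factories["xxh64"] = xxhash.xxh64
        factories["xxh3_64"] = xxhash.xxh3_64
    return {name: best_time(f, data) for name, f in factories.items()}


if __name__ == "__main__":
    results = compare(bytes(8 * 1024 * 1024))
    base = results["md5"]
    for name, seconds in results.items():
        print(f"{name:8s} {seconds * 1e3:8.2f} ms  ({base / seconds:.1f}x vs md5)")
```

The ratios will depend on the machine and on the input size, so a difference between a notebook measurement and the upstream benchmark table is not surprising.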

Contributor

@jo-basevi jo-basevi left a comment

Changes look good to me!

I've tested that this runs fine with payu, as the manifests still default to using binhash. When I changed the "fast-hash" from binhash to binhash-xxh, it seemed faster on the login node, though timing was very variable. A number of the configurations I tested spent very little time calculating binhashes anyway (less than 1 second). However, ACCESS-OM2 01deg_jra55_iaf_bgc has a number of inputs and takes ~10 seconds to calculate the input binhashes. Re-running payu setup on the login node didn't seem to make much difference between binhash and binhash-xxh, so the time must be lost somewhere else. When running payu setup in PBS jobs, though, binhash-xxh cut the fast hash calculation time by ~50%, so it might be worth making it the default for payu.

@charles-turner-1
Collaborator Author

charles-turner-1 commented Jun 20, 2025

As it stands, I haven't changed the default from binhash to binhash-xxh in this PR. I think we'd want to change this line to do so, right?

Or would that be a Payu-specific change?

@jo-basevi
Contributor

> Or would that be a Payu specific change?

Yeah, it will need to be a change to Payu here.

@charles-turner-1
Collaborator Author

Cool! I'll merge this & I can open the PR in Payu too if that makes life easier for you.

@charles-turner-1 charles-turner-1 merged commit 24c590e into master Jun 20, 2025
5 checks passed
@charles-turner-1 charles-turner-1 deleted the xxhash branch June 20, 2025 06:20
@jo-basevi
Contributor

> Cool! I'll merge this & I can open the PR in Payu too if that makes life easier for you.

Thanks! I've opened an issue in the payu repository. I'm just a bit hesitant about pushing it through to payu straight away, as we'll need to let people know that the manifests will change and that all the slow md5 hashes will need to be recalculated.
