@millerjoel
Collaborator

Description:

  • Improve notebooks 1 & 2 so they can be run from any stage without repeating downloads
  • Remove need for station ranges in file system method
  • Improve efficiency of converting to Zarr using dask delayed
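
To illustrate the dask delayed pattern mentioned in the last bullet, here is a minimal sketch. It is not the code from this PR: `convert_station` and `convert_all` are hypothetical names, and the conversion body is a placeholder for whatever reads one station file and writes it to Zarr.

```python
import dask


def convert_station(station_id: str) -> str:
    """Hypothetical per-station conversion.

    A real version would open the station's file and write it to a Zarr
    store; here it only returns the name of the store it would create.
    """
    return f"{station_id}.zarr"


def convert_all(station_ids: list[str]) -> list[str]:
    """Build one lazy task per station, then run them all in a single compute call."""
    # dask.delayed wraps the function so that calling it records a task
    # instead of running it straight away.
    tasks = [dask.delayed(convert_station)(sid) for sid in station_ids]
    # dask.compute runs the whole task list and returns a tuple of results.
    return list(dask.compute(*tasks))
```

Because the tasks are independent, the scheduler is free to run them in parallel, which is where any efficiency gain would come from.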

Testing:

  • Worth checking that notebooks 1, 2 & 3 work from start to finish. It should be possible to interrupt notebooks 1 & 2 and have them resume where they left off.
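
One common way to get this kind of resume behaviour is to skip any file that already exists and to write new files under a temporary name first. The sketch below assumes that approach; it is not taken from the notebooks, and `download_missing`, the `fetch` callable, and the `.nc` / `.part` naming are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


def download_missing(
    station_ids: Iterable[str],
    target_dir: Path,
    fetch: Callable[[str], bytes],
) -> list[str]:
    """Download only the stations that are not already on disk.

    `fetch` is a hypothetical callable that returns the raw bytes for one
    station. Returns the ids that were downloaded during this call.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for station_id in station_ids:
        final_path = target_dir / f"{station_id}.nc"
        if final_path.exists():
            continue  # fetched by an earlier run, so skip it
        # Write under a temporary name so an interrupted run never leaves
        # a half-written file that looks complete.
        partial_path = final_path.with_name(final_path.name + ".part")
        partial_path.write_bytes(fetch(station_id))
        partial_path.replace(final_path)
        downloaded.append(station_id)
    return downloaded
```

Running the function a second time with the same ids does nothing, and an interrupted run leaves at most one `.part` file that is overwritten on the next attempt.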

@millerjoel millerjoel requested a review from tennlee July 28, 2025 22:05
@coveralls

coveralls commented Jul 28, 2025

Pull Request Test Coverage Report for Build 17420997194

Details

  • 0 of 11 (0.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 61.015%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
packages/data/src/pyearthtools/data/transforms/variables.py | 0 | 1 | 0.0%
packages/utils/src/pyearthtools/utils/data/converter.py | 0 | 2 | 0.0%
packages/pipeline/src/pyearthtools/pipeline/operations/xarray/normalisation.py | 0 | 3 | 0.0%
packages/pipeline/src/pyearthtools/pipeline/operations/xarray/join.py | 0 | 5 | 0.0%

Totals
Change from base Build 17286816387: 0.03%
Covered Lines: 9480
Relevant Lines: 15108

💛 - Coveralls

@tennlee
Collaborator

tennlee commented Jul 31, 2025

I have done a lot of behind-the-scenes work on the static analysis on develop. Please rebase your branch, and ideally also make it compliant with 'ruff check'

@millerjoel millerjoel force-pushed the hadisd-simplify-146 branch from 8e6b190 to 45e7093 Compare August 11, 2025 08:52
@millerjoel millerjoel force-pushed the hadisd-simplify-146 branch from 45e7093 to d894f7f Compare August 11, 2025 10:21
@millerjoel
Collaborator Author

I've made sure my changes are technically ruff compliant. Some notebooks fail ruff checks because they use the run magic and rely on variable names defined in a separate notebook. I've told ruff to ignore these errors, specifically for the notebooks that cause the problem. Before hitting the merge button, I'll bring this up in our meeting tomorrow.

@tennlee
Collaborator

tennlee commented Aug 12, 2025

I'm just working through this. The download notebook worked well, looks good to me. The conversion to Zarr issues a lot of warnings. It would be nice if it quieted the warnings after the first one.
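
For reference, Python's standard `warnings` module can show a repeated warning only once through its "once" filter action. The sketch below is a generic illustration, not the notebook's code: `convert_station`, `convert_quietly`, and the warning text are hypothetical stand-ins, and `UserWarning` is an assumed category.

```python
import warnings


def convert_station(station_id: str) -> str:
    """Hypothetical stand-in for a conversion step that warns on every call."""
    warnings.warn("example conversion warning", UserWarning)
    return f"{station_id}.zarr"


def convert_quietly(station_ids: list[str]) -> tuple[list[str], int]:
    """Convert every station, letting each distinct warning through only once.

    Returns the outputs and the number of warnings that got past the filter.
    """
    # record=True collects the warnings in a list so they can be counted;
    # without it, the first warning would be printed as usual.
    with warnings.catch_warnings(record=True) as shown:
        # "once" lets a given message and category through a single time.
        warnings.simplefilter("once", UserWarning)
        outputs = [convert_station(sid) for sid in station_ids]
    return outputs, len(shown)
```

In a notebook, dropping `record=True` would print the first warning and silence the repeats, and the filter is undone when the `with` block exits.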

@tennlee tennlee closed this Aug 12, 2025
@tennlee tennlee reopened this Aug 12, 2025
@tennlee
Collaborator

tennlee commented Aug 12, 2025

The tutorial depends on seaborn, which isn't currently part of the dependencies. I don't think PET wants to influence people's choice of plotting library; there are lots of sensible options and people should feel free to use whatever they like. I don't think seaborn has many dependencies of its own, so I'm happy to add it, but you might also consider redoing the plotting with matplotlib or plotly. So no issue, but please confirm your choice, and add seaborn to the dependencies if it's needed. I'll just manually install it for my testing purposes for now.

@tennlee
Collaborator

tennlee commented Aug 12, 2025

I'm getting an error which seems to be related to the warning message I saw when converting to Zarr. The warning message is "ValueError: Invalid Zarr format 3 data_type: {'name': 'fixed_length_utf32', 'configuration': {'length_bytes': 48}}". Did you get that at any stage?

@millerjoel
Collaborator Author

I've made some changes to reflect your comments:

  1. Zarr warnings no longer seem to be an issue, so I have removed the code that suppresses them.

  2. Plotting is now done with matplotlib only, no seaborn.

In the meantime, I seem to have broken something whilst looking into why the Drop transform was getting rid of coordinates I didn't want it to. I've reverted to the way I thought things were, but there is still a problem I need to look into.

@tennlee
Collaborator

tennlee commented Sep 1, 2025

Thanks I'll take another look shortly.

@millerjoel
Collaborator Author

Right, should all be working again. Thanks again for your help!

I've adjusted variables.Drop so it doesn't drop coordinates that I haven't asked it to. I did start thinking this might be intended behaviour, but I don't really think it's that useful. This meant that I did have to manually drop some coordinates (lat, lon, elev) as part of the pipeline, but I think it's better that I explicitly have to drop them. This was done so that .ToNumpy() would work correctly.

The other numpy problem was that there was a station_id variable that only had a station dimension, no time. I have just dropped it during the preprocessing stage in notebook 2 since it's a bit redundant anyway. This leaves what I think is quite a sensible dataset.
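
As a rough illustration of the two clean-up steps described above, here is a sketch in plain xarray and numpy. It does not use pyearthtools' `variables.Drop` or `.ToNumpy()`, and the toy dataset, the `temperature` variable, and the `prepare_for_numpy` helper are made up for the example.

```python
import numpy as np
import xarray as xr


def prepare_for_numpy(ds: xr.Dataset) -> tuple[xr.Dataset, list]:
    """Drop the extra coordinates and any data variable without a time dimension."""
    # Explicitly remove the per-station coordinates named in the comment above.
    cleaned = ds.drop_vars(["lat", "lon", "elev"])
    # Variables such as station_id only span "station", so they cannot be
    # stacked with (station, time) variables into a single array.
    no_time = [name for name, var in cleaned.data_vars.items() if "time" not in var.dims]
    return cleaned.drop_vars(no_time), no_time


# Toy dataset: two stations, three time steps (illustrative values only).
toy = xr.Dataset(
    data_vars={
        "temperature": (("station", "time"), np.zeros((2, 3))),
        "station_id": (("station",), np.array(["a", "b"])),
    },
    coords={
        "station": [0, 1],
        "time": [0, 1, 2],
        "lat": ("station", [10.0, 20.0]),
        "lon": ("station", [100.0, 110.0]),
        "elev": ("station", [5.0, 50.0]),
    },
)

cleaned, dropped = prepare_for_numpy(toy)
# With every remaining variable sharing the same dims, stacking is well defined.
stacked = np.stack([cleaned[name].values for name in cleaned.data_vars])
```

After these steps every remaining data variable has the same `(station, time)` shape, which is what a conversion to a single numpy array needs.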

Perhaps give everything one more run to check it works and if it does then I'll merge it.

I'll then have another PR soon that addresses the Zarr chunking issues experienced on HPC systems.

@tennlee
Collaborator

tennlee commented Sep 2, 2025

This is also working for me now. I'll merge this increment in, but I can see some possible improvements for another day. Thanks very much for the work!

@tennlee tennlee merged commit 10c9c91 into ACCESS-Community-Hub:develop Sep 3, 2025
6 checks passed
3 participants