Aggregate data by NUTS_ID with exactextract #86
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | main | #86 | +/- |
|---|---|---|---|
| Coverage | 98.61% | 98.71% | +0.10% |
| Files | 9 | 9 | |
| Lines | 2159 | 2336 | +177 |
| Hits | 2129 | 2306 | +177 |
| Misses | 30 | 30 | |
Pull request overview
This PR adds support for aggregating NetCDF data by NUTS regions using two methods: GeoPandas spatial join (for small datasets) and exactextract (for large datasets), addressing issues #59 and #64.
Key changes:
- Introduces `exactextract` as the default aggregation method, with `geopandas` as a fallback for smaller datasets
- Refactors aggregation logic by extracting common preparation steps into `_prepare_for_aggregation()`
- Adds comprehensive parameterized tests covering both aggregation methods and various edge cases
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| requirements.txt | Adds dependencies: xagg, exactextract, rioxarray |
| pyproject.toml | Adds the same three dependencies to project configuration |
| heiplanet_data/preprocess.py | Refactors aggregation: extracts _prepare_for_aggregation(), renames original function to _aggregate_netcdf_nuts_gpd(), adds new _aggregate_netcdf_nuts_ee(), updates aggregate_data_by_nuts() to support both methods |
| heiplanet_data/test/test_preprocess.py | Adds extensive test coverage for both aggregation methods, including tests for data preparation, invalid inputs, custom aggregation dictionaries, large datasets, and variable names with hyphens |
| docs/source/notebooks/tutorial_C_postprocess_data.ipynb | Updates tutorial to demonstrate both aggregation methods with timing comparisons |
| docs/source/notebooks/tutorial_B_preprocess_data.ipynb | Updates JModel data download URL and hash |
| docs/source/notebooks/personal_explore.ipynb | Adds exploration notebook for debugging/testing aggregation methods (1336 lines) |
| docs/source/notebooks/explore_nuts.ipynb | Adds notebook for exploring NUTS hierarchy structure |
Overview
This PR addresses issues #59 and #64.
We now have two options to aggregate data by NUTS regions:
- `sjoin` from `geopandas` for datasets of size <= 360 * 720 * 24 * 2 (global data for 2 years, 2 variables, 0.5 degree resolution)
- `exactextract`. Note that the option `weighted_mean`, which is usually used for population-weighted average temperature, is not supported at the moment.

1. Functionality
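The size threshold quoted above can be made concrete with a little arithmetic. The sketch below is only an illustration: `n_values` and `suggest_method` are hypothetical helpers, not functions from this PR, and reading the product as latitude cells × longitude cells × monthly time steps × variables is an assumption based on the parenthetical description.

```python
# Threshold quoted in the PR description: 360 * 720 * 24 * 2 values.
SJOIN_MAX_VALUES = 360 * 720 * 24 * 2


def n_values(resolution_deg: float, n_timesteps: int, n_variables: int) -> int:
    """Number of values in a global regular lat/lon grid (hypothetical helper)."""
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat * n_lon * n_timesteps * n_variables


def suggest_method(size: int) -> str:
    """Illustrative rule: 'geopandas' up to the threshold, else 'exactextract'."""
    return "geopandas" if size <= SJOIN_MAX_VALUES else "exactextract"


half_degree = n_values(0.5, n_timesteps=24, n_variables=2)
tenth_degree = n_values(0.1, n_timesteps=24, n_variables=2)

print(half_degree, suggest_method(half_degree))    # 12441600 geopandas
print(tenth_degree, suggest_method(tenth_degree))  # 311040000 exactextract
```

Under these assumptions, the 0.1 degree case holds 25 times as many values as the 0.5 degree case, which is consistent with the kernel crash reported for `sjoin` at that resolution.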
Using geopandas sjoin
- Depending on the `nc` file, there may be cases where grid points fall outside the NUTS areas, resulting in NaN values after aggregation
- `sjoin` does not consider weights between grid points intersecting with the NUTS areas
- An alternative is `sjoin_nearest`, which also does not account for weighted areas.

Using exactextract
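To illustrate why weighting matters, here is a toy comparison in plain Python. It does not call `geopandas` or `exactextract`. It assumes the join-based approach treats each grid cell as a point at its centre, and it models coverage weighting as a mean weighted by the fraction of each cell that the region covers, which is how `exactextract` describes its `mean` operation. All cell values and fractions are made up.

```python
import math

# Each tuple: (cell value, fraction of the cell covered by the region,
# whether the cell centre lies inside the region). Numbers are invented.
large_region = [
    (10.0, 1.00, True),
    (12.0, 0.60, True),
    (20.0, 0.30, False),
    (30.0, 0.10, False),
]
# A region smaller than one grid cell: no cell centre falls inside it.
small_region = [(15.0, 0.05, False)]


def point_in_polygon_mean(cells):
    """Unweighted mean over cells whose centre falls inside the region."""
    inside = [value for value, _, centre_inside in cells if centre_inside]
    return sum(inside) / len(inside) if inside else math.nan


def coverage_weighted_mean(cells):
    """Mean weighted by the fraction of each cell covered by the region."""
    total_weight = sum(frac for _, frac, _ in cells)
    if total_weight == 0:
        return math.nan
    return sum(value * frac for value, frac, _ in cells) / total_weight


for name, cells in [("large", large_region), ("small", small_region)]:
    print(name, point_in_polygon_mean(cells), coverage_weighted_mean(cells))
```

In this toy setup the small region gets `NaN` from the point-based mean but a finite value from the coverage-weighted mean, which is one plausible reason for the different counts of unmapped NUTS_IDs reported in this PR.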
- `exactextract` aggregates large datasets faster than `geopandas` and `xagg`, e.g. a global dataset covering 24 months at 0.1 degree resolution

2. Statistics of NaN values after aggregation
ERA5-Land data for 2016 and 2017, `t2m` and `tp`

0.5 degree resolution
- `sjoin` from `geopandas`: 448 unmapped NUTS_IDs
- `xagg`: 57 unmapped NUTS_IDs

0.1 degree resolution
- `sjoin` from `geopandas`: can't handle dataset of this size (kernel crash)
- `xagg`: took more than 20 minutes for aggregating `t2m` data only (stopped before completion)
- `exactextract`: 1 unmapped NUTS_ID `MT002` with bounds `[14.18747152, 36.00413545, 14.35044442, 36.08029719]`

JModel for 2016, 2017, R0, 0.1 deg resolution
- `exactextract`: 964 NUTS_IDs with `NaN` values after aggregation
- `NaN` in the original file

3. Notes on using mean and sum aggregation
`mean` and `sum` are calculated only on non-NaN values. However:
- `mean`: if all values are `NaN`, the result is `NaN`
- `sum`: if all values are `NaN`, the result is `0.0`
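The behaviour described above can be reproduced with a small standard-library sketch. `nan_sum` and `nan_mean` are hypothetical names used only for illustration, not the functions used in this PR. They follow the same convention as NumPy's `nansum` and `nanmean`.

```python
import math


def nan_sum(values):
    """Sum of the non-NaN values; 0.0 when nothing is left to add."""
    return math.fsum(v for v in values if not math.isnan(v))


def nan_mean(values):
    """Mean of the non-NaN values; NaN when every value is NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return math.fsum(valid) / len(valid) if valid else math.nan


mixed = [1.0, math.nan, 3.0]
all_nan = [math.nan, math.nan]

print(nan_sum(mixed), nan_mean(mixed))      # 4.0 2.0
print(nan_sum(all_nan), nan_mean(all_nan))  # 0.0 nan
```

One consequence is that a `sum` of `0.0` for a region can mean "no valid data" rather than a true zero, so it may be worth checking the count of valid values alongside the sum when interpreting results.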