@kimlee87 kimlee87 commented Dec 19, 2025

Overview

This PR addresses issues #59 and #64.

We now have two options to aggregate data by NUTS regions:

  • sjoin from geopandas, for datasets of size <= 360 * 720 * 24 * 2 (global data for 2 years, 2 variables, 0.5 degree resolution)
  • exactextract. Note that the weighted_mean option, which is usually used for population-weighted average temperature, is not supported at the moment.

Tutorial C is updated accordingly.
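As a rough illustration of the size limit above, here is a minimal sketch of how one might pick a method from the dataset dimensions. The helper name `suggest_method` and the constant `SJOIN_MAX_SIZE` are hypothetical and not part of this PR, and reading the factors as latitude points, longitude points, time steps and variables is my assumption.

```python
# Hypothetical helper, not part of the PR: names are illustrative only.
# Assumed reading of the limit: 360 lat points * 720 lon points (0.5 degree,
# global) * 24 time steps (monthly data, 2 years) * 2 variables.
SJOIN_MAX_SIZE = 360 * 720 * 24 * 2


def suggest_method(n_lat: int, n_lon: int, n_time: int, n_vars: int) -> str:
    """Return 'geopandas' if the dataset is within the sjoin limit, else 'exactextract'."""
    size = n_lat * n_lon * n_time * n_vars
    return "geopandas" if size <= SJOIN_MAX_SIZE else "exactextract"


print(SJOIN_MAX_SIZE)                     # 12441600
print(suggest_method(360, 720, 24, 2))    # geopandas (0.5 degree grid)
print(suggest_method(1800, 3600, 24, 2))  # exactextract (0.1 degree grid)
```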

1. Functionality

Using geopandas sjoin

  • grid points vs. polygons/multipolygons
  • Depending on the resolution of the nc file, there may be cases where no grid point falls inside a NUTS area, which results in NaN values for that area after aggregation
  • at the moment, sjoin does not weight grid points by how much of their cell intersects the NUTS area
  • another option is sjoin_nearest, which also does not account for area weights.
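To make the NaN case concrete, here is a toy sketch of a point-in-region join in plain Python. It is a stand-in for the idea, not the geopandas code used in this PR: regions are simplified to axis-aligned boxes, and the region IDs, grid and values are made up.

```python
import math

# Toy regions as boxes (min_lon, min_lat, max_lon, max_lat); IDs are made up.
regions = {
    "BIG": (0.0, 0.0, 2.0, 2.0),
    "TINY": (2.1, 0.1, 2.4, 0.4),  # smaller than the grid spacing
}

# Grid points at 0.5 degree spacing, one value each: (lon, lat, value).
points = [(x * 0.5, y * 0.5, float(x + y)) for x in range(6) for y in range(5)]


def mean_by_region(regions, points):
    """Unweighted mean of the points inside each box; NaN if no point matches."""
    result = {}
    for region_id, (x0, y0, x1, y1) in regions.items():
        inside = [v for (x, y, v) in points if x0 <= x <= x1 and y0 <= y <= y1]
        result[region_id] = sum(inside) / len(inside) if inside else math.nan
    return result


stats = mean_by_region(regions, points)
print(stats["BIG"])   # 4.0
print(stats["TINY"])  # nan, because no grid point lies inside the region
```

Every matched point counts equally here, regardless of how much of its cell lies inside the region, which mirrors the missing area weighting noted above.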

Using exactextract

  • raster cells vs. polygons/multipolygons
  • The R version of this library was used by our collaborators
  • It does account for area weights, i.e. how much of each raster cell a polygon covers
  • exactextract aggregates large datasets faster than geopandas and xagg, e.g. a global dataset covering 24 months at 0.1 degree resolution
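The idea of area weighting can be sketched in a few lines of plain Python. This is only an illustration of a coverage-weighted mean, not exactextract's implementation or API: regions and cells are simplified to axis-aligned boxes, and all names and numbers are made up.

```python
# Toy coverage-weighted mean: each raster cell contributes in proportion to
# the fraction of its area that lies inside the region.

def overlap_1d(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def coverage_weighted_mean(region, cells):
    """region: (x0, y0, x1, y1); cells: list of ((x0, y0, x1, y1), value)."""
    rx0, ry0, rx1, ry1 = region
    weighted_sum = 0.0
    total_weight = 0.0
    for (cx0, cy0, cx1, cy1), value in cells:
        overlap = overlap_1d(cx0, cx1, rx0, rx1) * overlap_1d(cy0, cy1, ry0, ry1)
        fraction = overlap / ((cx1 - cx0) * (cy1 - cy0))  # share of the cell covered
        weighted_sum += fraction * value
        total_weight += fraction
    return weighted_sum / total_weight if total_weight > 0 else float("nan")


# Two unit cells; the region covers all of the first and a quarter of the second.
cells = [((0.0, 0.0, 1.0, 1.0), 10.0), ((1.0, 0.0, 2.0, 1.0), 20.0)]
region = (0.0, 0.0, 1.25, 1.0)
result = coverage_weighted_mean(region, cells)
print(result)  # 12.0
```

Because a region only needs to overlap a cell, not contain its centre point, small regions still receive a value under this scheme.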

2. Statistics of NaN values after aggregation

ERA5-Land data for 2016 and 2017, t2m and tp

0.5 degree resolution

  • sjoin from geopandas: 448 unmapped NUTS_IDs
  • xagg: 57 unmapped NUTS_IDs
  • exactextract: 57 unmapped NUTS_IDs

0.1 degree resolution

  • sjoin from geopandas: cannot handle a dataset of this size (kernel crash)
  • xagg: took more than 20 minutes to aggregate the t2m data alone (stopped before completion)
  • exactextract: 1 unmapped NUTS_ID

JModel data for 2016 and 2017, R0, 0.1 degree resolution

  • exactextract: 964 NUTS_IDs with NaN values after aggregation
    • 949 of them are NaN in the original file
    • 15 of them did not have any matching raster cells
    • I need to check this again with a different JModel file. If that data is also at 0.1 degree resolution, it should yield the same aggregation results as ERA5-Land

3. Notes on using mean and sum aggregation

  • This is related to issue #58 (change N/A value to 0)
  • Both mean and sum are calculated over non-NaN values only. However:
    • mean: if all values are NaN, it returns NaN
    • sum: if all values are NaN, it returns 0.0
      • This makes sense mathematically for population values, but I'm not sure about other cases, e.g. total precipitation, radiation, soil moisture, etc.
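The difference between the two reductions can be reproduced with a small plain-Python sketch. The functions `nan_mean` and `nan_sum` are hypothetical and not the code in this PR. The `min_count` parameter is likewise an assumption of this sketch, similar in spirit to the `min_count` argument of `sum` in pandas, and shows one possible way to get NaN instead of 0.0 when no valid value exists.

```python
import math


def nan_mean(values):
    """Mean over non-NaN values; NaN when nothing valid remains."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


def nan_sum(values, min_count=0):
    """Sum over non-NaN values; NaN when fewer than min_count values are valid."""
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))


all_nan = [math.nan, math.nan]
print(nan_mean(all_nan))              # nan
print(nan_sum(all_nan))               # 0.0
print(nan_sum(all_nan, min_count=1))  # nan
print(nan_sum([1.5, math.nan, 2.5]))  # 4.0
```

With such an option, a region with no data for total precipitation would be reported as missing rather than as zero rainfall.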

codecov bot commented Dec 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.71%. Comparing base (b1b716d) to head (08a585a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   98.61%   98.71%   +0.10%     
==========================================
  Files           9        9              
  Lines        2159     2336     +177     
==========================================
+ Hits         2129     2306     +177     
  Misses         30       30              


Copilot AI left a comment

Pull request overview

This PR adds support for aggregating NetCDF data by NUTS regions using two methods: GeoPandas spatial join (for small datasets) and exactextract (for large datasets), addressing issues #59 and #64.

Key changes:

  • Introduces exactextract as the default aggregation method, with geopandas as a fallback for smaller datasets
  • Refactors aggregation logic by extracting common preparation steps into _prepare_for_aggregation()
  • Adds comprehensive parameterized tests covering both aggregation methods and various edge cases

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 11 comments.

Summary per file

  • requirements.txt: Adds dependencies: xagg, exactextract, rioxarray
  • pyproject.toml: Adds the same three dependencies to project configuration
  • heiplanet_data/preprocess.py: Refactors aggregation: extracts _prepare_for_aggregation(), renames original function to _aggregate_netcdf_nuts_gpd(), adds new _aggregate_netcdf_nuts_ee(), updates aggregate_data_by_nuts() to support both methods
  • heiplanet_data/test/test_preprocess.py: Adds extensive test coverage for both aggregation methods, including tests for data preparation, invalid inputs, custom aggregation dictionaries, large datasets, and variable names with hyphens
  • docs/source/notebooks/tutorial_C_postprocess_data.ipynb: Updates tutorial to demonstrate both aggregation methods with timing comparisons
  • docs/source/notebooks/tutorial_B_preprocess_data.ipynb: Updates JModel data download URL and hash
  • docs/source/notebooks/personal_explore.ipynb: Adds exploration notebook for debugging/testing aggregation methods (1336 lines)
  • docs/source/notebooks/explore_nuts.ipynb: Adds notebook for exploring NUTS hierarchy structure


@kimlee87 kimlee87 marked this pull request as ready for review December 30, 2025 15:14
@kimlee87 kimlee87 requested a review from iulusoy December 30, 2025 18:09