Aggregate data by NUTS_ID with exactextract #86

kimlee87 · 2025-12-19T18:04:02Z

Overview

This PR addresses issues #59 and #64.

We now have two options to aggregate data by NUTS regions:

using sjoin from geopandas, for datasets of size <= 360 * 720 * 24 * 2 (global data for 2 years, 2 variables, 0.5 degree resolution)
using exactextract. Note that the weighted_mean option, which is usually used for population-weighted average temperature, is not supported at the moment.
Tutorial C is updated accordingly.
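
As a rough illustration of the size threshold quoted above, the choice between the two options could be sketched as follows. The helper name `choose_aggregation_method` and its return values are hypothetical and only for illustration; the dispatch and the threshold actually used in `heiplanet_data/preprocess.py` may differ.

```python
# Threshold quoted in the PR description: a global 0.5-degree grid
# (360 x 720 cells), 24 time steps, 2 variables.
MAX_SJOIN_ELEMENTS = 360 * 720 * 24 * 2  # 12_441_600


def choose_aggregation_method(n_lat: int, n_lon: int, n_time: int, n_vars: int) -> str:
    """Return "geopandas" for small datasets and "exactextract" otherwise.

    Hypothetical helper: the name and return values are illustrative only.
    """
    n_elements = n_lat * n_lon * n_time * n_vars
    return "geopandas" if n_elements <= MAX_SJOIN_ELEMENTS else "exactextract"


print(choose_aggregation_method(360, 720, 24, 2))    # exactly at the threshold
print(choose_aggregation_method(1800, 3600, 24, 2))  # global grid at 0.1 degree
```

A global 0.1-degree grid has 25 times as many cells as a 0.5-degree one, so it ends up far above this limit.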

1. Functionality

Using geopandas sjoin

grid points vs. polygons/multi-polygons
Depending on the resolution of the nc file, there may be NUTS areas that contain no grid points, which results in NaN values for those areas after aggregation
at the moment, sjoin does not apply any weights to the grid points intersecting the NUTS areas
another option is sjoin_nearest, which also does not account for weighted areas.
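
The effect of point-based matching can be sketched without geopandas. The snippet below is a simplified stand-in, not the code from this PR: it assumes rectangular toy regions instead of real NUTS polygons and a 0.5-degree lattice of points, and all names in it are made up for the example.

```python
import math

# Toy regions as (min_lon, min_lat, max_lon, max_lat) boxes. Real NUTS regions
# are polygons; boxes just keep the sketch dependency-free.
regions = {
    "BIG": (0.0, 0.0, 2.0, 2.0),
    "TINY": (2.1, 0.1, 2.4, 0.4),  # smaller than the grid spacing
}

# Grid points on a 0.5-degree lattice, each carrying one value.
points = [
    (lon / 2, lat / 2, float(lon + lat))
    for lon in range(0, 6)
    for lat in range(0, 5)
]


def mean_by_region(regions, points):
    """Average the values of the points that fall inside each region."""
    result = {}
    for name, (x0, y0, x1, y1) in regions.items():
        inside = [v for x, y, v in points if x0 <= x <= x1 and y0 <= y <= y1]
        # A region without any point inside gets NaN, like an unmatched row
        # in a left join.
        result[name] = sum(inside) / len(inside) if inside else math.nan
    return result


means = mean_by_region(regions, points)
print(means)
```

`TINY` lies between the lattice points at longitude 2.0 and 2.5, so no point falls inside it and its aggregated value is NaN, while every point inside `BIG` counts equally regardless of how much area it represents.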

Using exactextract

raster cells vs. polygons/multi-polygons
The R version of this library was used by our collaborators
It does account for weighted areas
exactextract aggregates large datasets faster than geopandas and xagg, e.g. a global dataset covering 24 months at 0.1 degree resolution
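
To show what "weighted areas" means for raster cells, here is a minimal sketch of a coverage-weighted mean. It assumes axis-aligned boxes for both the cells and the region, which is much simpler than what exactextract does with real polygons; the function names are hypothetical and this is not the library's API.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def coverage_weighted_mean(region, cells):
    """Mean of cell values, weighted by the fraction of each cell covered.

    `cells` is a list of ((x0, y0, x1, y1), value) pairs.
    """
    num = 0.0
    den = 0.0
    for box, value in cells:
        cell_area = (box[2] - box[0]) * (box[3] - box[1])
        frac = overlap_area(region, box) / cell_area
        num += frac * value
        den += frac
    return num / den if den > 0 else float("nan")


# Two neighbouring 0.5-degree cells with different values.
cells = [((0.0, 0.0, 0.5, 0.5), 10.0), ((0.5, 0.0, 1.0, 0.5), 20.0)]

# The region covers half of the first cell and a quarter of the second.
region = (0.25, 0.0, 0.625, 0.5)
print(coverage_weighted_mean(region, cells))

# A region much smaller than one cell still receives that cell's value.
small_region = (0.1, 0.1, 0.2, 0.2)
print(coverage_weighted_mean(small_region, cells))
```

In this simplified picture a region only ends up without a value when it intersects no cell at all, or when the cells it intersects carry no data.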

2. Statistics of NaN values after aggregation

ERA5-Land data for 2016 and 2017, `t2m` and `tp`

0.5 degree resolution

sjoin from geopandas: 448 unmapped NUTS_IDs
xagg: 57 unmapped NUTS_IDs
exactextract: 57 unmapped NUTS_IDs

0.1 degree resolution

sjoin from geopandas: cannot handle a dataset of this size (kernel crash)
xagg: took more than 20 minutes to aggregate the t2m data alone (stopped before completion)
exactextract: 1 unmapped NUTS_ID
- Aggregation time: 8 minutes (run on ssc14)
- Unmapped NUTS_ID: MT002 with bounds [14.18747152, 36.00413545, 14.35044442, 36.08029719]
- This NUTS region is too small; using a finer resolution might help (see issue #59, "handle cases where the NUTS area is smaller than grid scale")
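
To get a feeling for how small MT002 is relative to the grid, the bounds above can be compared with different grid spacings. This is only a sketch: it assumes grid nodes sit at integer multiples of the resolution, which may not match the actual file, and it takes a point-based view, whereas exactextract works on cell footprints. The helper `nodes_inside` is hypothetical.

```python
import math

# Bounds of MT002 as quoted above: (min_lon, min_lat, max_lon, max_lat).
MT002 = (14.18747152, 36.00413545, 14.35044442, 36.08029719)


def nodes_inside(bounds, resolution):
    """Count grid nodes at integer multiples of `resolution` inside `bounds`.

    Assumption: the lattice is aligned to multiples of the resolution.
    """
    x0, y0, x1, y1 = bounds
    n_lon = math.floor(x1 / resolution) - math.ceil(x0 / resolution) + 1
    n_lat = math.floor(y1 / resolution) - math.ceil(y0 / resolution) + 1
    return max(n_lon, 0) * max(n_lat, 0)


width = MT002[2] - MT002[0]   # extent in degrees of longitude
height = MT002[3] - MT002[1]  # extent in degrees of latitude
print(f"extent: {width:.4f} x {height:.4f} degrees")

for res in (0.5, 0.1, 0.05):
    print(res, nodes_inside(MT002, res))
```

Under these assumptions the bounding box is narrower than 0.1 degree in latitude and contains no grid node at 0.5 or 0.1 degree, while a 0.05-degree lattice would place a few nodes inside it.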

JModel for 2016, 2017, R0, 0.1 deg resolution

exactextract: 964 NUTS_IDs with NaN values after aggregation
- 949 of them are NaN in the original file
- 15 of them have no matching raster cells
- I need to check this again with a different JModel file. If that data is also at 0.1 degree resolution, it should yield the same aggregation results as ERA5-Land
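
The split into "NaN in the original file" and "no matched raster cells" could be reproduced with a small bookkeeping step like the one below. This is a sketch under assumptions: the per-region cell counts are assumed to be available from the aggregation step, and the function name, argument names and toy inputs are all hypothetical rather than taken from the PR.

```python
import math


def classify_nan_regions(aggregated, n_cells, n_valid_cells):
    """Split regions with NaN results by likely cause.

    aggregated:    region id -> aggregated value
    n_cells:       region id -> number of raster cells intersecting the region
    n_valid_cells: region id -> number of those cells holding a non-NaN value
    """
    causes = {"no_matching_cells": [], "nan_in_source": []}
    for region, value in aggregated.items():
        if not math.isnan(value):
            continue  # region was aggregated successfully
        if n_cells.get(region, 0) == 0:
            causes["no_matching_cells"].append(region)
        elif n_valid_cells.get(region, 0) == 0:
            causes["nan_in_source"].append(region)
    return causes


# Toy inputs with made-up region ids, not results from this PR.
aggregated = {"A": 1.5, "B": math.nan, "C": math.nan}
n_cells = {"A": 4, "B": 3, "C": 0}
n_valid_cells = {"A": 4, "B": 0, "C": 0}
print(classify_nan_regions(aggregated, n_cells, n_valid_cells))
```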

3. Notes on using mean and sum aggregation

This is related to issue #58 (change N/A value to 0).
Both mean and sum are calculated on non-NaN values only. However:
- mean: If all values are NaN, NaN is returned
- sum: If all values are NaN, 0.0 is returned
  - This makes sense mathematically for population values, but I'm not sure about other cases, e.g. total precipitation, radiation, soil moisture, etc.
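
The difference between the two statistics can be reproduced with a few lines of plain Python. This is a sketch of the semantics described above, not the code path used by the PR; the `min_count` argument is an illustrative addition modelled on the parameter of the same name in pandas' `sum`.

```python
import math


def nan_mean(values):
    """Mean over the non-NaN values; NaN if there are none."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


def nan_sum(values, min_count=0):
    """Sum over the non-NaN values; NaN if fewer than `min_count` are valid.

    With min_count=0 an all-NaN input gives 0.0 (the behaviour described
    above); with min_count=1 it gives NaN instead.
    """
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))


all_nan = [math.nan, math.nan]
mixed = [1.0, math.nan, 3.0]

print(nan_mean(all_nan), nan_sum(all_nan), nan_sum(all_nan, min_count=1))
print(nan_mean(mixed), nan_sum(mixed))
```

Requiring at least one valid value before returning a sum is one possible way to keep "no data" distinguishable from a true total of zero for variables such as total precipitation.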

codecov · 2025-12-19T20:01:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.71%. Comparing base (b1b716d) to head (08a585a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   98.61%   98.71%   +0.10%     
==========================================
  Files           9        9              
  Lines        2159     2336     +177     
==========================================
+ Hits         2129     2306     +177     
  Misses         30       30


Copilot

Pull request overview

This PR adds support for aggregating NetCDF data by NUTS regions using two methods: GeoPandas spatial join (for small datasets) and exactextract (for large datasets), addressing issues #59 and #64.

Key changes:

Introduces exactextract as the default aggregation method, with geopandas as a fallback for smaller datasets
Refactors aggregation logic by extracting common preparation steps into _prepare_for_aggregation()
Adds comprehensive parameterized tests covering both aggregation methods and various edge cases

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 11 comments.


| File | Description |
| --- | --- |
| requirements.txt | Adds dependencies: xagg, exactextract, rioxarray |
| pyproject.toml | Adds the same three dependencies to project configuration |
| heiplanet_data/preprocess.py | Refactors aggregation: extracts `_prepare_for_aggregation()`, renames original function to `_aggregate_netcdf_nuts_gpd()`, adds new `_aggregate_netcdf_nuts_ee()`, updates `aggregate_data_by_nuts()` to support both methods |
| heiplanet_data/test/test_preprocess.py | Adds extensive test coverage for both aggregation methods, including tests for data preparation, invalid inputs, custom aggregation dictionaries, large datasets, and variable names with hyphens |
| docs/source/notebooks/tutorial_C_postprocess_data.ipynb | Updates tutorial to demonstrate both aggregation methods with timing comparisons |
| docs/source/notebooks/tutorial_B_preprocess_data.ipynb | Updates JModel data download URL and hash |
| docs/source/notebooks/personal_explore.ipynb | Adds exploration notebook for debugging/testing aggregation methods (1336 lines) |
| docs/source/notebooks/explore_nuts.ipynb | Adds notebook for exploring NUTS hierarchy structure |


sonarqubecloud · 2025-12-30T15:13:59Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
1.2% Duplication on New Code

See analysis details on SonarQube Cloud

kimlee87 added 11 commits December 8, 2025 17:04

temp file for exploring NUTS data and sjoin options of geopandas

aa3de84

try different options

146a584

Merge branch 'main' into nuts_agg

5a5e80d

Merge branch 'main' into nuts_agg

1e62633

incorrect results, upload for checking later...

ec29832

try exactextract, still errors

d6fc18c

explore exactextract and compare its results with other libs

69b8254

test exactextract with 0.1 deg resolution

e05fae1

feat: aggregate data by NUTS_ID with exactextract

9c99e48

fix: sonarcloud issue

e5a023c

fix: exactextract needs rioxarray to prepare its data properly

4debea1

kimlee87 added 9 commits December 29, 2025 11:21

fix: sonarcloud issues and change threshold of large dataset

ac5f799

update pandas merge cases

b9e4135

fix: linter issue from sonarcloud

f5d831b

docs: update tutorial C, still error

0c47111

Merge branch 'main' into nuts_agg

e1c83b5

update heiBOX link to download JModel

f2e7485

fix: agg dfs with more than 2 data variables

e7a0b4a

move additional notebooks to under docs folder

f74a590

docs: update tutorial C

c9b4517

kimlee87 requested a review from Copilot December 30, 2025 14:54

Copilot started reviewing on behalf of kimlee87 December 30, 2025 14:54

Merge branch 'main' into nuts_agg

0b46ae3

Copilot AI reviewed Dec 30, 2025

fix: based on Copilot comments

08a585a

kimlee87 marked this pull request as ready for review December 30, 2025 15:14

kimlee87 requested a review from iulusoy December 30, 2025 18:09


2 participants
