
Conversation

@jveitchmichaelis jveitchmichaelis commented Feb 3, 2026

Description

This PR cleans up the evaluation code and makes it more agnostic to different scoring tasks. The main idea is that we perform the same steps regardless of the underlying geometry:

  1. Compute predictions for an image
  2. Perform 1:1 assignment of predictions to a supplied ground truth
  3. Filter matches based on some threshold (IoU, distance, etc.)
  4. Aggregate high level metrics like precision/recall for the validation dataset

We use Hungarian matching (i.e. linear_sum_assignment) for the assignment step, which is fast at the scale of inputs we work with and is agnostic to geometry provided we can define a pairwise matching cost. For boxes and polygons the cost is IoU; for points I've implemented a similar method that uses either the L1 or L2 norm. A sketch of this matching step is shown below.
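
To make the matching step concrete, here is a minimal sketch assuming SciPy is available; the helper name match_by_cost, the example coordinates, and the 0.4 IoU threshold are illustrative and not taken from the deepforest code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_by_cost(cost, maximize=False):
    """1:1 assignment of predictions to ground truth given a pairwise cost matrix."""
    row_idx, col_idx = linear_sum_assignment(cost, maximize=maximize)
    return list(zip(row_idx, col_idx))


# Points: the pairwise cost is a distance (L1 or L2), which we minimize.
preds = np.array([[10.0, 12.0], [55.0, 40.0]])
truth = np.array([[11.0, 11.0], [54.0, 42.0], [90.0, 5.0]])
point_matches = match_by_cost(cdist(preds, truth, metric="euclidean"))

# Boxes/polygons: the pairwise cost is IoU, which we maximize and then
# filter against a threshold (step 3 above).
iou = np.array([[0.8, 0.1, 0.0], [0.0, 0.6, 0.2]])
box_matches = [(i, j) for i, j in match_by_cost(iou, maximize=True)
               if iou[i, j] >= 0.4]  # keep only matches above the IoU threshold
```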

Changes:

I removed some duplication in evaluate.py and used library methods for dataframe checks and conversion, for consistency. The evaluate_x code is now a single function that handles the different geometries as requested (point, box, polygon). The output returns <task>_precision and <task>_recall; the other fields are the same regardless of geometry.

Renamed evaluate_boxes -> evaluate_geometry. I don't see this low-level function being called anywhere outside the tests, and it's usually called via the wrapper function anyway.

  • compute_class_recall doesn't check for mixed label types and will do weird things if you pass in Tree / 0 (for example, predictions with a numeric ID and ground truth with a string label). The current test suite misses this because it only checks the box_precision/recall values, but I did some debugging and I think this issue affects main as well. This PR coerces all labels to be numeric (in __evaluate_wrapper__) and then processes based on the union of labels, as in Feat torchmetrics eval #1071. Do we want this behavior, or do we want to only report classes present in the ground truth?

  • Bug in the (class-wise) precision calculation: the numerator should count matches among predictions with the target label, but the existing code counts matches among the ground truth (see the sketch after this list).

  • Changes point evaluation to be point vs. point, not point vs. box. We can add the point-vs-box option back, but I'm not sure what we need here.
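
To illustrate the label handling and the per-class precision fix together, here is a minimal sketch; it is not the deepforest implementation, the helper name class_metrics is made up, and it coerces labels to strings for simplicity whereas the PR itself coerces them to numeric IDs.

```python
import pandas as pd


def class_metrics(pred_labels, true_labels, matches):
    """Per-class precision/recall from matched (pred_label, true_label) pairs.

    Precision for class c counts correct matches among *predictions* labelled c
    (the numerator fix described above); recall counts them among the ground
    truth labelled c.
    """
    # Coerce both sides to a common dtype so "Tree" vs 0 cannot silently mismatch.
    pred_labels = pd.Series(pred_labels).astype(str)
    true_labels = pd.Series(true_labels).astype(str)
    matches = [(str(p), str(t)) for p, t in matches]

    results = {}
    # Report over the union of labels seen in predictions and ground truth.
    for c in sorted(set(pred_labels) | set(true_labels)):
        correct = sum(1 for p, t in matches if p == c and t == c)
        n_pred = int((pred_labels == c).sum())
        n_true = int((true_labels == c).sum())
        results[c] = {
            "precision": correct / n_pred if n_pred else 0.0,
            "recall": correct / n_true if n_true else 0.0,
        }
    return results


# Example: one correct "Tree" match and one spurious "Tree" prediction.
print(class_metrics(["Tree", "Tree"], ["Tree"], [("Tree", "Tree")]))
# {'Tree': {'precision': 0.5, 'recall': 1.0}}
```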

Draft until we've confirmed that eval results are unchanged and added improved test cases for all three geometries.

AI-Assisted Development

Mainly hand-coded, but I used Copilot for autocomplete, for identifying the cause of a test failure at the end, and for review.

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Copilot

@jveitchmichaelis jveitchmichaelis force-pushed the per-task-eval branch 3 times, most recently from fe2bead to 2803f92 on February 3, 2026 at 02:10

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 80.16529% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.33%. Comparing base (3146b96) to head (7dba0bf).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/deepforest/IoU.py | 69.23% | 16 Missing ⚠️ |
| src/deepforest/evaluate.py | 88.40% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
- Coverage   87.89%   87.33%   -0.56%     
==========================================
  Files          20       21       +1     
  Lines        2776     2835      +59     
==========================================
+ Hits         2440     2476      +36     
- Misses        336      359      +23     
| Flag | Coverage Δ |
|---|---|
| unittests | 87.33% <80.16%> (-0.56%) ⬇️ |

Copilot AI left a comment

Pull request overview

Refactors evaluation to support multiple geometry types (box/polygon/point) via a unified matching + metric-aggregation pipeline, and updates tests/callers accordingly.

Changes:

  • Replace box-specific evaluation with a generalized evaluate_geometry flow and a geometry-agnostic matching helper.
  • Update IoU utilities (rename polygon matcher, add point matching) and adjust call sites/tests.
  • Improve class-wise precision/recall computation and add/adjust tests for label conversion + point evaluation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
|---|---|
| src/deepforest/evaluate.py | Introduces evaluate_geometry, refactors matching and class metric computation, and updates wrapper behavior for label normalization. |
| src/deepforest/IoU.py | Renames polygon IoU matcher and adds Hungarian point matching utilities. |
| src/deepforest/main.py | Updates internal evaluation call to explicitly pass geometry_type="box". |
| tests/test_evaluate.py | Updates tests to call evaluate_geometry and adds point + label-conversion coverage. |
| tests/test_IoU.py | Updates test to use the renamed polygon matching function. |

Comments suppressed due to low confidence (1)

src/deepforest/IoU.py:75

  • Renaming compute_IoU to match_polygons removes an established function name from the IoU module. If users call IoU.compute_IoU(...) externally, this is a breaking API change; consider keeping compute_IoU as a thin deprecated alias to match_polygons for backwards compatibility (a sketch of such an alias follows the excerpt below).
def match_polygons(ground_truth: "gpd.GeoDataFrame", submission: "gpd.GeoDataFrame"):
    """Find area of overlap among all sets of ground truth and prediction.

    This function performs matching between a ground truth dataset and a
    submission or prediction dataset, typically the output from a validation or
    test run. In order to compute IoU, we must know which boxes correspond
    between the datasets. This is performed by Hungarian matching, or linear
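
A thin alias along the lines suggested above could look like the sketch below (assuming match_polygons is defined as in the excerpt); this is illustrative only, not code from this PR.

```python
import warnings


def compute_IoU(ground_truth, submission):
    """Deprecated alias for match_polygons, kept for backwards compatibility."""
    warnings.warn(
        "compute_IoU is deprecated; use match_polygons instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return match_polygons(ground_truth, submission)
```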


Co-authored-by: Martí Bosch martibosch@users.noreply.github.com