
Conversation

@jveitchmichaelis jveitchmichaelis commented Feb 3, 2026

Description

This PR cleans up the evaluation code and makes it more agnostic to different scoring tasks. The main idea is that we perform the same steps regardless of the underlying geometry:

  1. Compute predictions for an image
  2. Perform 1:1 assignment of predictions to a supplied ground truth
  3. Filter matches based on some threshold (IoU, distance, etc.)
  4. Aggregate high level metrics like precision/recall for the validation dataset

We use Hungarian matching (i.e. linear_sum_assignment) for the assignment step, which is fast at the scale of inputs we work with and is agnostic to geometry provided we can define a pairwise matching cost. For boxes and polygons the cost is IoU; for points I've implemented a similar method that uses either the L1 or L2 norm. A sketch of this matching step is shown below.
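
To make the matching step concrete, here is a minimal sketch assuming SciPy is available; the helper name match_by_cost, the example coordinates, and the 0.4 IoU threshold are illustrative and not taken from the deepforest code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_by_cost(cost, maximize=False):
    """1:1 assignment of predictions to ground truth given a pairwise cost matrix."""
    row_idx, col_idx = linear_sum_assignment(cost, maximize=maximize)
    return list(zip(row_idx, col_idx))


# Points: the pairwise cost is a distance (L1 or L2), which we minimize.
preds = np.array([[10.0, 12.0], [55.0, 40.0]])
truth = np.array([[11.0, 11.0], [54.0, 42.0], [90.0, 5.0]])
point_matches = match_by_cost(cdist(preds, truth, metric="euclidean"))

# Boxes/polygons: the pairwise cost is IoU, which we maximize and then
# filter against a threshold (step 3 above).
iou = np.array([[0.8, 0.1, 0.0], [0.0, 0.6, 0.2]])
box_matches = [(i, j) for i, j in match_by_cost(iou, maximize=True)
               if iou[i, j] >= 0.4]  # keep only matches above the IoU threshold
```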

Changes:

I removed some duplication in evaluate.py and used library methods for dataframe checks and conversion, for consistency. The evaluate_x code is now a single function that handles the different geometries as requested (point, box, polygon). The output returns <task>_precision and <task>_recall; the other fields are the same regardless of geometry.

Renamed evaluate_boxes -> evaluate_geometry. I don't see this low-level function being called anywhere outside the tests, and it's usually called via the wrapper function anyway.

  • compute_class_recall doesn't check for mixed label types and will do weird things if you pass in Tree / 0 (for example, predictions with a numeric ID and ground truth with a string label). The current test suite misses this because it only checks the box_precision/recall values, but I did some debugging and I think this issue affects main as well. This PR coerces all labels to be numeric (in __evaluate_wrapper__) and then processes based on the union of labels, as in Feat torchmetrics eval #1071. Do we want this behavior, or do we want to only report classes present in the ground truth?

  • Bug in the (class-wise) precision calculation: the numerator should count matches among predictions with the target label, but the existing code counts matches among the ground truth (see the sketch after this list).

  • Changes point evaluation to be point vs. point, not point vs. box. We can add the point-vs-box option back, but I'm not sure what we need here.
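
To illustrate the label handling and the per-class precision fix together, here is a minimal sketch; it is not the deepforest implementation, the helper name class_metrics is made up, and it coerces labels to strings for simplicity whereas the PR itself coerces them to numeric IDs.

```python
import pandas as pd


def class_metrics(pred_labels, true_labels, matches):
    """Per-class precision/recall from matched (pred_label, true_label) pairs.

    Precision for class c counts correct matches among *predictions* labelled c
    (the numerator fix described above); recall counts them among the ground
    truth labelled c.
    """
    # Coerce both sides to a common dtype so "Tree" vs 0 cannot silently mismatch.
    pred_labels = pd.Series(pred_labels).astype(str)
    true_labels = pd.Series(true_labels).astype(str)
    matches = [(str(p), str(t)) for p, t in matches]

    results = {}
    # Report over the union of labels seen in predictions and ground truth.
    for c in sorted(set(pred_labels) | set(true_labels)):
        correct = sum(1 for p, t in matches if p == c and t == c)
        n_pred = int((pred_labels == c).sum())
        n_true = int((true_labels == c).sum())
        results[c] = {
            "precision": correct / n_pred if n_pred else 0.0,
            "recall": correct / n_true if n_true else 0.0,
        }
    return results


# Example: one correct "Tree" match and one spurious "Tree" prediction.
print(class_metrics(["Tree", "Tree"], ["Tree"], [("Tree", "Tree")]))
# {'Tree': {'precision': 0.5, 'recall': 1.0}}
```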

Draft until we've confirmed that eval results are unchanged and added improved test cases for all three geometries.

AI-Assisted Development

Mainly hand-coded, but I used Copilot for autocomplete, for identifying the cause of a test failure at the end, and for review.

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Copilot

@jveitchmichaelis jveitchmichaelis force-pushed the per-task-eval branch 3 times, most recently from fe2bead to 2803f92 on February 3, 2026 at 02:10

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 80.16529% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.33%. Comparing base (3146b96) to head (7dba0bf).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/deepforest/IoU.py | 69.23% | 16 Missing ⚠️ |
| src/deepforest/evaluate.py | 88.40% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
- Coverage   87.89%   87.33%   -0.56%     
==========================================
  Files          20       21       +1     
  Lines        2776     2835      +59     
==========================================
+ Hits         2440     2476      +36     
- Misses        336      359      +23     
| Flag | Coverage Δ |
|---|---|
| unittests | 87.33% <80.16%> (-0.56%) ⬇️ |

Copilot AI left a comment

Pull request overview

Refactors evaluation to support multiple geometry types (box/polygon/point) via a unified matching + metric-aggregation pipeline, and updates tests/callers accordingly.

Changes:

  • Replace box-specific evaluation with a generalized evaluate_geometry flow and a geometry-agnostic matching helper.
  • Update IoU utilities (rename polygon matcher, add point matching) and adjust call sites/tests.
  • Improve class-wise precision/recall computation and add/adjust tests for label conversion + point evaluation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
|---|---|
| src/deepforest/evaluate.py | Introduces evaluate_geometry, refactors matching and class metric computation, and updates wrapper behavior for label normalization. |
| src/deepforest/IoU.py | Renames polygon IoU matcher and adds Hungarian point matching utilities. |
| src/deepforest/main.py | Updates internal evaluation call to explicitly pass geometry_type="box". |
| tests/test_evaluate.py | Updates tests to call evaluate_geometry and adds point + label-conversion coverage. |
| tests/test_IoU.py | Updates test to use the renamed polygon matching function. |

Comments suppressed due to low confidence (1)

src/deepforest/IoU.py:75

  • Renaming compute_IoU to match_polygons removes an established function name from the IoU module. If users call IoU.compute_IoU(...) externally, this is a breaking API change; consider keeping compute_IoU as a thin deprecated alias to match_polygons for backwards compatibility (a sketch of such an alias follows the excerpt below).
def match_polygons(ground_truth: "gpd.GeoDataFrame", submission: "gpd.GeoDataFrame"):
    """Find area of overlap among all sets of ground truth and prediction.

    This function performs matching between a ground truth dataset and a
    submission or prediction dataset, typically the output from a validation or
    test run. In order to compute IoU, we must know which boxes correspond
    between the datasets. This is performed by Hungarian matching, or linear
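
A thin alias along the lines suggested above could look like the sketch below (assuming match_polygons is defined as in the excerpt); this is illustrative only, not code from this PR.

```python
import warnings


def compute_IoU(ground_truth, submission):
    """Deprecated alias for match_polygons, kept for backwards compatibility."""
    warnings.warn(
        "compute_IoU is deprecated; use match_polygons instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return match_polygons(ground_truth, submission)
```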


Co-authored-by: Martí Bosch martibosch@users.noreply.github.com