Conversation

@jveitchmichaelis (Collaborator) commented Jan 2, 2026

Description

  • Use torchmetrics to collect results and call evaluate_boxes during validation.
  • All metrics now respect validate_on_epoch, but they should be fast enough to run with n=1 even on large datasets. Metrics are always reset regardless.
  • Simpler empty-frame tracking during validation_step.
  • Much cleaner logging in main; non-loggable metrics are dropped inside the torchmetric class.
  • Metric setup is delegated to trainer creation; see the next point.
  • If the user changes anything like the validation data after initialization, we also need to re-initialize the metrics, because precision/recall needs to know about label_dict and the annotation file when it calls evaluate under the hood. There are guards in the code to check this (e.g. metrics are reloaded in create_trainer and on model load).
  • There is also a guard that skips metric initialization unless a validation set is defined, which our internal metric requires (it needs the ground-truth dataframe to run). This covers the case where the user sets the validation csv/root after init instead of via a config arg. I would suggest we think about decoupling this in evaluate for training (i.e. the core eval methods should only care about pairs of predictions and targets, not image paths). A minimal sketch of the overall metric structure is given after this list.
  • The metric has a task arg, set to box for now, as I anticipate we will modify it to support different prediction types.
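
The sketch below shows, in broad strokes, how a torchmetrics wrapper of this kind can accumulate per-image prediction dataframes during validation and defer the dataframe-based evaluation to compute(). It is illustrative only: the class name, constructor arguments, injected evaluation callable, and returned keys are assumptions, not the actual deepforest implementation, and cross-GPU synchronization of the list state is not handled here.

```python
# Hedged sketch of a torchmetrics-based wrapper around a dataframe evaluation.
# All names here (BoxEvaluationMetric, evaluate_fn, returned keys) are
# illustrative assumptions, not the deepforest API.
import pandas as pd
import torch
from torchmetrics import Metric


class BoxEvaluationMetric(Metric):
    def __init__(self, evaluate_fn, ground_truth_df: pd.DataFrame, iou_threshold: float = 0.4):
        super().__init__()
        # The evaluation callable is injected so the metric itself only cares
        # about predictions and targets, not image paths or file locations.
        self.evaluate_fn = evaluate_fn
        self.ground_truth_df = ground_truth_df
        self.iou_threshold = iou_threshold
        # States registered via add_state are reset automatically between
        # validation runs. dist_reduce_fx=None means no cross-GPU reduction is
        # attempted for this non-tensor list state.
        self.add_state("predictions", default=[], dist_reduce_fx=None)

    def update(self, batch_predictions: list) -> None:
        # Accumulate per-image prediction dataframes from validation_step;
        # empty frames are simply skipped.
        self.predictions.extend(p for p in batch_predictions if not p.empty)

    def compute(self) -> dict:
        if not self.predictions:
            return {"box_precision": torch.tensor(0.0),
                    "box_recall": torch.tensor(0.0)}
        predictions = pd.concat(self.predictions, ignore_index=True)
        results = self.evaluate_fn(predictions,
                                   self.ground_truth_df,
                                   iou_threshold=self.iou_threshold)
        # Keep only scalar entries so everything returned here is loggable.
        return {k: torch.tensor(float(v))
                for k, v in results.items()
                if isinstance(v, (int, float))}
```

Because a metric like this is constructed with the ground-truth dataframe and label mapping, any later change to the validation csv/root or label_dict invalidates it, which is why the metrics are rebuilt in create_trainer and on model load, and only when a validation set actually exists.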

Related Issue(s)

#901 (step towards this)
#1254
#1245
Supports #1253

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Claude Code for some bug hunting and for experiments with GPU sync that turned out not to be required.

@jveitchmichaelis force-pushed the evaluation-torchmetric branch 3 times, most recently from 02bc30a to ce69a72 on January 2, 2026 02:50
codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.35%. Comparing base (3146b96) to head (a0d0c5b).
⚠️ Report is 6 commits behind head on main.

Files with missing lines     Patch %    Lines
src/deepforest/metrics.py    92.30%     5 Missing ⚠️
src/deepforest/main.py       94.28%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1256      +/-   ##
==========================================
- Coverage   87.89%   87.35%   -0.54%     
==========================================
  Files          20       21       +1     
  Lines        2776     2808      +32     
==========================================
+ Hits         2440     2453      +13     
- Misses        336      355      +19     
Flag         Coverage Δ
unittests    87.35% <93.00%> (-0.54%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@jveitchmichaelis marked this pull request as ready for review on January 2, 2026 03:11
@jveitchmichaelis force-pushed the evaluation-torchmetric branch 2 times, most recently from 3cefab1 to 4f6d7db on January 2, 2026 04:17
@jveitchmichaelis force-pushed the evaluation-torchmetric branch 5 times, most recently from bf2ffef to 911173c on January 5, 2026 18:16
@henrykironde (Contributor) commented

@jveitchmichaelis, can you please update this PR?

@bw4sz (Collaborator) commented Feb 2, 2026

Is this ready for review? I'm currently seeing multi-GPU core dumps after one epoch, so I'm pretty sure it's something about eval. Are you able to test this PR on 2 GPUs? I'm not sure if it's related; it could be logging images or logging predictions.

@jveitchmichaelis (Collaborator, Author) commented Feb 2, 2026

I tested this branch with the script you shared, but please double-check, as I haven't seen a hard crash.

Also, please check my logic for when these metrics are run; I think making everything respect every_n works for consistency. I'm less sure how Lightning behaves if you call validate on its own without training: does it need n=1 to be forcibly set? I think the canonical solution might be to use trainer.test (see the sketch below).

Should be good for review now.
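
For reference (not part of this PR), a minimal sketch of the two evaluation paths discussed here, assuming standard pytorch_lightning imports and a LightningModule named `model` with dataloaders configured: check_val_every_n_epoch only gates the validation loop scheduled by fit(), while trainer.validate() and trainer.test() run their loops unconditionally, so no n=1 override should be needed for standalone evaluation.

```python
# Hedged sketch: how check_val_every_n_epoch interacts with standalone
# evaluation in PyTorch Lightning. `model` is an assumed LightningModule.
import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=20, check_val_every_n_epoch=5)

# During training, the validation loop (and therefore the validation metrics)
# runs only every 5 epochs:
# trainer.fit(model)

# Standalone evaluation ignores check_val_every_n_epoch and always runs:
# trainer.validate(model)
# trainer.test(model)
```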

@bw4sz (Collaborator) commented Feb 2, 2026

Let me see if I can monkey-merge it into train_birds, just a moment; I'll run that branch, which shouldn't have any changes to src, and then see if it helps with the core dump.

@bw4sz (Collaborator) left a review

I am approving these changes in the belief that the multi-GPU errors I'm seeing in bird_training are unrelated. The fact that these tests pass is an indication that we have improved the validation setup, and it will help us with points and polygons.

@bw4sz merged commit 7fff5fe into weecology:main on Feb 2, 2026
7 checks passed
@jveitchmichaelis deleted the evaluation-torchmetric branch on February 2, 2026 03:47