Conversation

@jveitchmichaelis (Collaborator) commented Jan 2, 2026

Description

  • Use torchmetrics to collect results and call evaluate_boxes during validation.
  • All metrics now respect validate_on_epoch, but they should be fast enough to run with n=1 even on large datasets. Metrics are always reset regardless.
  • Simpler empty-frame tracking during validation_step.
  • Much cleaner logging in main; non-loggable metrics are dropped inside the torchmetric class.
  • Metric setup is delegated to trainer creation; see the next point.
  • If the user changes anything like the validation data after initialization, we also need to re-initialize the metrics, because precision/recall needs to know about label_dict and the annotation file when it calls evaluate under the hood. There are guards in the code to check this (e.g. metrics are reloaded in create_trainer and on model load).
  • There is also a guard that skips metric initialization unless a validation set is defined, which our internal metric requires (it needs the ground-truth dataframe to run). This covers the case where the user sets the validation csv/root after init instead of via a config arg. I would suggest we think about decoupling this in evaluate for training (i.e. the core eval methods should only care about pairs of predictions and targets, not image paths). A minimal sketch of the overall metric structure is given after this list.
  • The metric has a task arg, set to box for now, as I anticipate we will modify it to support different prediction types.
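
The sketch below shows, in broad strokes, how a torchmetrics wrapper of this kind can accumulate per-image prediction dataframes during validation and defer the dataframe-based evaluation to compute(). It is illustrative only: the class name, constructor arguments, injected evaluation callable, and returned keys are assumptions, not the actual deepforest implementation, and cross-GPU synchronization of the list state is not handled here.

```python
# Hedged sketch of a torchmetrics-based wrapper around a dataframe evaluation.
# All names here (BoxEvaluationMetric, evaluate_fn, returned keys) are
# illustrative assumptions, not the deepforest API.
import pandas as pd
import torch
from torchmetrics import Metric


class BoxEvaluationMetric(Metric):
    def __init__(self, evaluate_fn, ground_truth_df: pd.DataFrame, iou_threshold: float = 0.4):
        super().__init__()
        # The evaluation callable is injected so the metric itself only cares
        # about predictions and targets, not image paths or file locations.
        self.evaluate_fn = evaluate_fn
        self.ground_truth_df = ground_truth_df
        self.iou_threshold = iou_threshold
        # States registered via add_state are reset automatically between
        # validation runs. dist_reduce_fx=None means no cross-GPU reduction is
        # attempted for this non-tensor list state.
        self.add_state("predictions", default=[], dist_reduce_fx=None)

    def update(self, batch_predictions: list) -> None:
        # Accumulate per-image prediction dataframes from validation_step;
        # empty frames are simply skipped.
        self.predictions.extend(p for p in batch_predictions if not p.empty)

    def compute(self) -> dict:
        if not self.predictions:
            return {"box_precision": torch.tensor(0.0),
                    "box_recall": torch.tensor(0.0)}
        predictions = pd.concat(self.predictions, ignore_index=True)
        results = self.evaluate_fn(predictions,
                                   self.ground_truth_df,
                                   iou_threshold=self.iou_threshold)
        # Keep only scalar entries so everything returned here is loggable.
        return {k: torch.tensor(float(v))
                for k, v in results.items()
                if isinstance(v, (int, float))}
```

Because a metric like this is constructed with the ground-truth dataframe and label mapping, any later change to the validation csv/root or label_dict invalidates it, which is why the metrics are rebuilt in create_trainer and on model load, and only when a validation set actually exists.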

Related Issue(s)

#901 (step towards this)
#1254
#1245
Supports #1253

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Claude Code for some bug hunting and for experiments with GPU sync that turned out not to be required.

@jveitchmichaelis force-pushed the evaluation-torchmetric branch 3 times, most recently from 02bc30a to ce69a72 on January 2, 2026 02:50
codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.35%. Comparing base (3146b96) to head (a0d0c5b).
⚠️ Report is 6 commits behind head on main.

Files with missing lines     Patch %    Lines
src/deepforest/metrics.py    92.30%     5 Missing ⚠️
src/deepforest/main.py       94.28%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1256      +/-   ##
==========================================
- Coverage   87.89%   87.35%   -0.54%     
==========================================
  Files          20       21       +1     
  Lines        2776     2808      +32     
==========================================
+ Hits         2440     2453      +13     
- Misses        336      355      +19     
Flag         Coverage Δ
unittests    87.35% <93.00%> (-0.54%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@jveitchmichaelis marked this pull request as ready for review on January 2, 2026 03:11
@jveitchmichaelis force-pushed the evaluation-torchmetric branch 2 times, most recently from 3cefab1 to 4f6d7db on January 2, 2026 04:17
@jveitchmichaelis force-pushed the evaluation-torchmetric branch 5 times, most recently from bf2ffef to 911173c on January 5, 2026 18:16
@henrykironde (Contributor) commented

@jveitchmichaelis, can you please update this PR?

@bw4sz (Collaborator) commented Feb 2, 2026

Is this ready for review? I'm currently seeing multi-GPU core dumps after one epoch, so I'm pretty sure it's something about eval. Are you able to test this PR on 2 GPUs? I'm not sure if it's related; it could be logging images or logging predictions.

@jveitchmichaelis (Collaborator, Author) commented Feb 2, 2026

I tested this branch with the script you shared, but please double-check, as I haven't seen a hard crash.

Also, please check my logic for when these metrics are run; I think making everything respect every_n works for consistency. I'm less sure how Lightning behaves if you call validate on its own without training: does it need n=1 to be forcibly set? I think the canonical solution might be to use trainer.test (see the sketch below).

Should be good for review now.
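
For reference (not part of this PR), a minimal sketch of the two evaluation paths discussed here, assuming standard pytorch_lightning imports and a LightningModule named `model` with dataloaders configured: check_val_every_n_epoch only gates the validation loop scheduled by fit(), while trainer.validate() and trainer.test() run their loops unconditionally, so no n=1 override should be needed for standalone evaluation.

```python
# Hedged sketch: how check_val_every_n_epoch interacts with standalone
# evaluation in PyTorch Lightning. `model` is an assumed LightningModule.
import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=20, check_val_every_n_epoch=5)

# During training, the validation loop (and therefore the validation metrics)
# runs only every 5 epochs:
# trainer.fit(model)

# Standalone evaluation ignores check_val_every_n_epoch and always runs:
# trainer.validate(model)
# trainer.test(model)
```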

@bw4sz (Collaborator) commented Feb 2, 2026

Let me see if I can monkey-merge it into train_birds, just a moment; I'll run that branch, which shouldn't have any changes to src, and then see if it helps with the core dump.

@bw4sz (Collaborator) left a review

I am approving these changes in the belief that the multi-GPU errors I'm seeing in bird_training are unrelated. The fact that these tests pass is an indication that we have improved the validation setup, and it will help us with points and polygons.

@bw4sz merged commit 7fff5fe into weecology:main on Feb 2, 2026
7 checks passed
@jveitchmichaelis deleted the evaluation-torchmetric branch on February 2, 2026 03:47