add torchmetrics wrapper for evaluate_boxes #1256
Conversation
Force-pushed 02bc30a to ce69a72
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##             main    #1256      +/-   ##
==========================================
- Coverage   87.89%   87.35%   -0.54%
==========================================
  Files          20       21       +1
  Lines        2776     2808      +32
==========================================
+ Hits         2440     2453      +13
- Misses        336      355      +19
Force-pushed 3cefab1 to 4f6d7db
Force-pushed bf2ffef to 911173c
@jveitchmichaelis, can you please update this PR?
Force-pushed e048c3b to 8a74338
Force-pushed 7f63f2e to 27aff9c
Is this ready for review? I'm currently seeing multi-GPU core dumps after one epoch, so I'm fairly sure it's something in eval. Are you able to test this PR on 2 GPUs? Not sure if it's related; it could be logging images or logging predictions.
I tested this branch with the script you shared, but please double-check, as I haven't seen a hard crash. Also, please check my logic for when these metrics are run; I think making everything respect `validate_on_epoch` should cover it. Should be good for review now.
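The gating discussed here (run the potentially expensive evaluation only on selected epochs, but clear accumulated state every epoch) can be sketched framework-free. All names below are illustrative stand-ins, not the actual DeepForest classes or the PR's implementation:

```python
class IntervalGatedMetric:
    """Toy sketch of interval-gated validation (illustrative names only)."""

    def __init__(self, every_n=1):
        self.every_n = every_n
        self.preds = []

    def update(self, batch_preds):
        # Called once per validation step to accumulate predictions.
        self.preds.extend(batch_preds)

    def end_of_epoch(self, epoch):
        # Run the expensive evaluation only every `every_n` epochs;
        # a simple box count stands in for the real evaluate_boxes call.
        result = None
        if (epoch + 1) % self.every_n == 0:
            result = len(self.preds)
        # Reset regardless of whether we computed, so stale
        # predictions never carry over into the next interval.
        self.preds = []
        return result
```

With `every_n=1` this evaluates every epoch; with a larger interval the accumulator is still cleared on the skipped epochs, which matches "metrics are always reset regardless".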
Give me a moment: let me merge it into train_birds, run that branch (which shouldn't have any changes to src), and then see if it helps the core dump.
Force-pushed 27aff9c to a0d0c5b
bw4sz left a comment:
I am approving these changes with the belief that the multi-GPU errors I'm seeing in bird_training are unrelated. The fact that these tests pass is an indication that we have improved the validation setup, and it will help us with points and polygons.
Description
Uses `torchmetrics` to collect results and call `evaluate_boxes` during validation.

- Evaluation respects `validate_on_epoch`, but should be fast enough even with large datasets to run with `n=1`. Metrics are always reset regardless.
- Accumulation happens in `validation_step`; the branch is merged with main. Non-loggable metrics are dropped in the torchmetric class.
- Calls `evaluate` under the hood. There are some guards in the code to check this (e.g. reloading metrics in `create_trainer` and on model load).
- Adds a `task` arg, which is set to `box` for now, as I anticipate that we will modify it to support different prediction types.

Related Issue(s)
#901 (step towards this)
#1254
#1245
Supports #1253
AI-Assisted Development
AI tools used (if applicable):
Claude Code was used for some bug hunting and for experiments with GPU sync that turned out not to be required.