WAIT: Gather validation predictions from all ranks in DDP for multi-gpu inference. #1290
base: main
Conversation
In on_validation_epoch_end, when world_size > 1, gather predictions from all ranks via torch.distributed.all_gather_object before building the predictions DataFrame and calling __evaluate__. This ensures box_recall, box_precision, etc. are computed on the full validation set (matching 1-GPU behavior) instead of only one rank's subset. See docs/multi_gpu_validation_fix.md for problem description and verification scripts.
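A minimal sketch of the gather described above, assuming a Lightning-style module where `self.predictions` is a list of per-batch prediction DataFrames and `__evaluate__` consumes the combined DataFrame (names follow the PR description; the actual DeepForest internals may differ):

```python
import pandas as pd
import torch.distributed as dist


def on_validation_epoch_end(self):
    predictions = self.predictions  # per-batch prediction DataFrames from this rank

    # Under DDP each rank only sees its shard of the validation set. Gather the
    # (picklable) prediction lists from every rank so the metrics cover the full set.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, predictions)
        predictions = [df for rank_preds in gathered for df in rank_preds]

    predictions_df = pd.concat(predictions, ignore_index=True)
    return self.__evaluate__(predictions_df)
```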
Evidence that the current strategy does not work. Single GPU: box_recall: 0.4214581251144409. Two GPUs, same model and data: box_recall: 0.2121526151895523. Exactly half, since it's not collating all the batches. Fixed by this pull request.
I'd like to see if this is also fixed by https://github.com/weecology/DeepForest/pull/1256 so we don't need to manually include sync code. Do you have a standalone example that will test this? A triggering config/training script would be fine. Otherwise I'll run training with 2x GPUs and see if the metrics are borked. I looked to see if we can simulate this within the CPU-only CI, and it seems harder than you'd expect.
Here is the multi-GPU example I'm using. Yes, if you run the same for 1 and 2 GPUs, the results show the drop.
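For reference, a reproduction along these lines can be sketched as below; the checkpoint name, CSV paths, and config keys (`devices`, `accelerator`, `validation`) are assumptions about the installed DeepForest version, and only `devices` changes between the two runs:

```python
# Hypothetical 1-GPU vs 2-GPU comparison: run once with devices=1 and once with
# devices=2, then compare the logged box_recall. Paths are placeholders.
from deepforest import main

model = main.deepforest()
model.load_model("weecology/deepforest-tree")        # pretrained tree model
model.config["validation"]["csv_file"] = "validation.csv"
model.config["validation"]["root_dir"] = "images/"
model.config["accelerator"] = "gpu"
model.config["devices"] = 2                          # set to 1 for the baseline run

model.create_trainer()
model.trainer.validate(model)
print(model.trainer.callback_metrics.get("box_recall"))
```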
OK, I'm running tests on HPG with NEON eval vs deepforest/tree.
Seems to work OK; some small rounding errors, probably due to how torch splits and batches (#1256 needs to be rebased). There's a design choice here. We could have each rank compute its own subset of the metric and then average, and for exact results perform a single-worker validation at the end. The current method computes on every node simultaneously, which is wasted effort, but it's exact. B200, batch size 8.
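For contrast, a sketch of the approximate per-rank alternative mentioned here, using a hypothetical helper; the small discrepancy comes from each shard contributing equal weight to the average regardless of how many ground-truth boxes it holds:

```python
import torch
import torch.distributed as dist


def averaged_box_recall(local_recall: float) -> float:
    """Hypothetical helper: average per-rank recall instead of gathering predictions.

    Cheaper than all_gather_object, but only exact when every shard contains the
    same number of ground-truth boxes, hence the small 'wiggle' in the metrics.
    """
    recall = torch.tensor(local_recall, device="cuda")
    dist.all_reduce(recall, op=dist.ReduceOp.SUM)
    return (recall / dist.get_world_size()).item()
```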
I think, but am not 100% sure, that following the general philosophy of the package we want to reduce the mental load for less experienced users. They are going to want to see the exact same results on 1 GPU and multi-GPU. A more advanced user might be fine with understanding that each GPU has a subset and the metric is then averaged among those subsets, so there is a small wiggle, but my inclination is to waste the effort of having every GPU perform the computation.
That sounds good to me. I think this is straightforward to optimize in the future, as each prediction is still an independent contribution to the metric. We would need to move the final recall/precision reduction into the torchmetric. evaluate_boxes would only go as far as computing the match results (and an optional reduce arg if we want to run it standalone?). This also leads nicely to a unified box/keypoint/polygon eval.
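A rough sketch of what that torchmetrics-based reduction could look like; the class, state names, and update signature are hypothetical rather than existing DeepForest code, with `dist_reduce_fx="sum"` handling the cross-rank sync that this PR currently does by hand:

```python
import torch
from torchmetrics import Metric


class BoxPrecisionRecall(Metric):
    """Hypothetical metric: evaluate_boxes would emit per-image match counts,
    and torchmetrics' dist_reduce_fx="sum" replaces the manual DDP gather."""

    def __init__(self):
        super().__init__()
        self.add_state("true_positives", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("n_predictions", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("n_ground_truth", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, matched: int, n_pred: int, n_gt: int):
        # One call per image (or per batch) with the number of matched boxes,
        # total predictions, and total ground-truth boxes.
        self.true_positives += matched
        self.n_predictions += n_pred
        self.n_ground_truth += n_gt

    def compute(self):
        return {
            "box_precision": self.true_positives / self.n_predictions.clamp(min=1),
            "box_recall": self.true_positives / self.n_ground_truth.clamp(min=1),
        }
```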
I changed this to WAIT. When I tested this on bird_training, I had to do a merge (see the conflict above), and I'm not convinced I made the right choice, so I'm trying again.
Summary: With DDP, each rank only had its share of validation batches; we now gather from all ranks so metrics use the full validation set and 2-GPU metrics match 1-GPU.
Change: In on_validation_epoch_end, when world_size > 1, gather self.predictions from all ranks, concatenate, then call evaluate.
AI-Assisted Development
AI tools used (if applicable):
Cursor