
Conversation

@bw4sz
Collaborator

@bw4sz bw4sz commented Jan 29, 2026

In on_validation_epoch_end, when world_size > 1, gather predictions from all ranks via torch.distributed.all_gather_object before building the predictions DataFrame and calling __evaluate__. This ensures box_recall, box_precision, etc. are computed on the full validation set (matching 1-GPU behavior) instead of only one rank's subset.

Summary: With DDP, each rank only had its share of validation batches; we now gather from all ranks so metrics use the full validation set and 2-GPU metrics match 1-GPU.

Change: In on_validation_epoch_end, when world_size > 1, gather self.predictions from all ranks, concatenate, then call evaluate.
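A rough sketch of that gather (illustrative only; the attribute names and the exact DataFrame handling may differ slightly from the actual diff in this PR):

import pandas as pd
import torch.distributed as dist

def on_validation_epoch_end(self):
    predictions = self.predictions  # this rank's list of per-batch prediction DataFrames
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        # Each rank contributes its (picklable) prediction list; afterwards
        # every rank holds the predictions for the full validation set.
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, predictions)
        predictions = [df for rank_preds in gathered for df in rank_preds]
    predictions_df = pd.concat(predictions, ignore_index=True) if predictions else pd.DataFrame()
    # ...build the predictions DataFrame and call __evaluate__ as before...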

AI-Assisted Development

  • [x] I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • [x] I understand all the code I'm submitting
  • [x] I have reviewed and validated all AI-generated code

AI tools used (if applicable):

Cursor

See docs/multi_gpu_validation_fix.md for problem description and
verification scripts.
@bw4sz
Collaborator Author

bw4sz commented Jan 29, 2026

Evidence that the current strategy does not work.

Single GPU:

box_recall: 0.4214581251144409
box_precision: 0.5140703916549683

Two GPUs (same model and data):

box_recall: 0.2121526151895523
box_precision: 0.5238803029060364

Almost exactly half, since it's not collating the batches from all ranks.

Fixed by this pull request.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 30, 2026

I'd like to see if this is also fixed by https://github.com/weecology/DeepForest/pull/1256 so we don't need to manually include sync code.

Do you have a standalone example that will test this? A triggering config/training script would be fine. Otherwise I'll run training with 2x GPUs and see if the metrics are borked.

I looked to see if we can simulate this within the CPU-only CI and it seems harder than you'd expect.
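For reference, the gather itself can be exercised on CPU with a two-process gloo group; wiring it into the Lightning validation loop in CI is the harder part. A standalone sketch (hypothetical test, not part of the repo):

import pandas as pd
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    # gloo runs on CPU, so no GPUs are needed for this check.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size
    )
    # Each rank holds its own shard of predictions.
    local = pd.DataFrame({"rank": [rank], "score": [0.5]})
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local)
    merged = pd.concat(gathered, ignore_index=True)
    assert len(merged) == world_size  # every rank now sees all shards
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(_worker, args=(2,), nprocs=2)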

@bw4sz
Collaborator Author

bw4sz commented Jan 30, 2026

Here is the multi-GPU example I'm using.

"""Compare single-GPU vs multi-GPU validation metrics.

Runs trainer.validate() with 1 GPU and with 2 GPUs on the same test set.
If predictions are not gathered across DDP ranks, box_recall (and related
metrics) will be lower with 2 GPUs than with 1 GPU.

For 2-GPU runs under Slurm, use srun with 2 tasks (one process per GPU);
the script detects SLURM_NTASKS and uses devices=1 so DDP has 2 ranks
without Lightning spawning subprocesses.
"""

import argparse
import os

from deepforest import main as df_main


def run_eval(test_csv, root_dir, num_gpus, iou_threshold=0.4):
    """Run validation with a given number of GPUs and return logged metrics."""
    model = df_main.deepforest()
    model.load_model("weecology/deepforest-bird")
    model.config.score_thresh = 0.2
    model.model.score_thresh = 0.2
    model.label_dict = {"Bird": 0}
    model.numeric_to_label_dict = {0: "Bird"}
    model.config.label_dict = {"Bird": 0}
    model.config.num_classes = 1

    model.config.validation.csv_file = test_csv
    model.config.validation.root_dir = root_dir
    model.config.validation.iou_threshold = iou_threshold
    model.config.validation.val_accuracy_interval = 1
    model.config.workers = 0
    model.create_trainer(devices=num_gpus, strategy="ddp")
    validation_results = model.trainer.validate(model)
    return validation_results[0] if validation_results else {}


def main():
    parser = argparse.ArgumentParser(
        description="Compare single-GPU vs multi-GPU validation metrics."
    )
    parser.add_argument("--test_csv", required=True, help="Path to test CSV")
    parser.add_argument("--root_dir", required=True, help="Root dir for images")
    parser.add_argument("--iou_threshold", type=float, default=0.4)
    parser.add_argument("--num_gpus", type=int, required=True, help="1 or 2")
    args = parser.parse_args()

    results = run_eval(
        test_csv=args.test_csv,
        root_dir=args.root_dir,
        num_gpus=args.num_gpus,
        iou_threshold=args.iou_threshold,
    )

    print(f"num_gpus={args.num_gpus}")
    print(f"box_recall: {results.get('box_recall', 'N/A')}")
    print(f"box_precision: {results.get('box_precision', 'N/A')}")
    print(f"empty_frame_accuracy: {results.get('empty_frame_accuracy', 'N/A')}")


if __name__ == "__main__":
    main()

Yes, if you run the same script with 1 and 2 GPUs, the results show the drop.

@jveitchmichaelis
Collaborator

OK, I'm running tests on HPG with NEON eval vs deepforest-tree.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 31, 2026

Seems to work OK; there are some small rounding errors, probably due to how torch splits and batches (#1256 needs to be rebased).

There's a design choice here. We could have each rank compute the metric on its own subset and then average, and for exact results perform a single-worker validation at the end. The current method does the full computation on every rank simultaneously, which is wasted effort, but it's exact.

B200 batch size 8, deepforest-tree:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       box_precision       │    0.5413408279418945     │
│        box_recall         │     0.864090085029602     │
│            iou            │     0.671686053276062     │
│         iou/cl_0          │    0.6716858744621277     │
│            map            │    0.17844010889530182    │
│          map_50           │    0.4625597596168518     │
│          map_75           │    0.09784193336963654    │
│         map_large         │    0.48323848843574524    │
│        map_medium         │     0.259711891412735     │
│       map_per_class       │           -1.0            │
│         map_small         │    0.10286436975002289    │
│           mar_1           │   0.030833937227725983    │
│          mar_10           │    0.1222200021147728     │
│          mar_100          │    0.24777226150035858    │
│     mar_100_per_class     │           -1.0            │
│         mar_large         │    0.5671706199645996     │
│        mar_medium         │    0.3504551649093628     │
│         mar_small         │    0.1639861762523651     │
│    val_bbox_regression    │    0.3008114695549011     │
│    val_classification     │    0.2081366330385208     │
│         val_loss          │    0.5089481472969055     │
└───────────────────────────┴───────────────────────────┘
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133/133 0:00:11 • 0:00:00 11.91it/s 
num_gpus=2 results:
box_recall: 0.864090085029602
box_precision: 0.5413408279418945
empty_frame_accuracy: N/A


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       box_precision       │    0.5413156151771545     │
│        box_recall         │    0.8640548586845398     │
│            iou            │     0.671710193157196     │
│         iou/cl_0          │    0.6717101335525513     │
│            map            │    0.17843422293663025    │
│          map_50           │    0.46256953477859497    │
│          map_75           │    0.09779982268810272    │
│         map_large         │    0.48319825530052185    │
│        map_medium         │    0.25968924164772034    │
│       map_per_class       │           -1.0            │
│         map_small         │    0.10287656635046005    │
│           mar_1           │   0.030823951587080956    │
│          mar_10           │    0.1222274899482727     │
│          mar_100          │    0.24777474999427795    │
│     mar_100_per_class     │           -1.0            │
│         mar_large         │    0.5670626163482666     │
│        mar_medium         │    0.35044270753860474    │
│         mar_small         │    0.16400346159934998    │
│    val_bbox_regression    │    0.3007431924343109     │
│    val_classification     │    0.20711223781108856    │
│         val_loss          │    0.5078554749488831     │
└───────────────────────────┴───────────────────────────┘
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265/265 0:00:43 • 0:00:00 4.84it/s 
num_gpus=1 results:
box_recall: 0.8640548586845398
box_precision: 0.5413156151771545
empty_frame_accuracy: N/A

@bw4sz
Collaborator Author

bw4sz commented Jan 31, 2026

I think, but am not 100% sure, that following the general philosophy of the package we want to reduce the mental load for less experienced users. They are going to want to see exactly the same results with 1 GPU and with multiple GPUs. A more advanced user might be fine with understanding that each GPU gets a subset and the metric is averaged across those subsets, so there is a small wiggle, but my inclination is to accept the wasted effort of having every GPU perform the full computation.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 31, 2026

That sounds good to me.

I think this is straightforward to optimize in the future, since each prediction is still an independent contribution to the metric. We would need to move the final recall/precision reduction into the torchmetric; evaluate_boxes would only go as far as computing the match results (plus an optional reduce arg if we want to run it standalone?).

This also leads nicely to a unified box/keypoint/polygon eval.
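Roughly, one possible shape for that refactor (illustrative only; BoxRecallPrecision and the matches dict are hypothetical names, not existing DeepForest API, only the standard torchmetrics Metric interface is assumed):

import torch
from torchmetrics import Metric


class BoxRecallPrecision(Metric):
    """Accumulates per-image match counts; torchmetrics handles the DDP sync."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # dist_reduce_fx="sum" sums the counts across ranks before compute().
        self.add_state("true_positives", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("n_predictions", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("n_ground_truth", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, matches):
        # `matches` would be the per-image output of evaluate_boxes, reduced
        # only as far as counts, not final precision/recall.
        self.true_positives += matches["true_positives"]
        self.n_predictions += matches["n_predictions"]
        self.n_ground_truth += matches["n_ground_truth"]

    def compute(self):
        return {
            "box_recall": self.true_positives / self.n_ground_truth.clamp(min=1),
            "box_precision": self.true_positives / self.n_predictions.clamp(min=1),
        }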

@bw4sz bw4sz changed the title wip: gather validation predictions from all ranks in DDP for multi-gpu inference. Gather validation predictions from all ranks in DDP for multi-gpu inference. Feb 2, 2026
@bw4sz bw4sz requested review from jveitchmichaelis and removed request for jveitchmichaelis February 2, 2026 14:01
@bw4sz bw4sz changed the title Gather validation predictions from all ranks in DDP for multi-gpu inference. WAIT: Gather validation predictions from all ranks in DDP for multi-gpu inference. Feb 2, 2026
@bw4sz
Collaborator Author

bw4sz commented Feb 2, 2026

I changed this to WAIT. When I tested this on bird_training, I had to do a merge (see the conflict above), and I'm not convinced I made the right choice, so I'm trying again.
