
Conversation

@bw4sz
Collaborator

@bw4sz bw4sz commented Jan 29, 2026

In on_validation_epoch_end, when world_size > 1, gather predictions from all ranks via torch.distributed.all_gather_object before building the predictions DataFrame and calling __evaluate__. This ensures box_recall, box_precision, etc. are computed on the full validation set (matching 1-GPU behavior) instead of only one rank's subset.

Summary: With DDP, each rank only had its share of validation batches; we now gather from all ranks so metrics use the full validation set and 2-GPU metrics match 1-GPU.

Change: In on_validation_epoch_end, when world_size > 1, gather self.predictions from all ranks, concatenate, then call evaluate.
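A rough sketch of that gather (illustrative only; the attribute names and the exact DataFrame handling may differ slightly from the actual diff in this PR):

import pandas as pd
import torch.distributed as dist

def on_validation_epoch_end(self):
    predictions = self.predictions  # this rank's list of per-batch prediction DataFrames
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        # Each rank contributes its (picklable) prediction list; afterwards
        # every rank holds the predictions for the full validation set.
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, predictions)
        predictions = [df for rank_preds in gathered for df in rank_preds]
    predictions_df = pd.concat(predictions, ignore_index=True) if predictions else pd.DataFrame()
    # ...build the predictions DataFrame and call __evaluate__ as before...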

AI-Assisted Development

  • [x] I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • [x] I understand all the code I'm submitting
  • [x] I have reviewed and validated all AI-generated code

AI tools used (if applicable):

Cursor

See docs/multi_gpu_validation_fix.md for problem description and
verification scripts.
@bw4sz
Collaborator Author

bw4sz commented Jan 29, 2026

Evidence that the current strategy does not work.

Single GPU:

box_recall: 0.4214581251144409
box_precision: 0.5140703916549683

Two GPUs (same model and data):

box_recall: 0.2121526151895523
box_precision: 0.5238803029060364

Almost exactly half, since it's not collating the batches from all ranks.

Fixed by this pull request.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 30, 2026

I'd like to see if this is also fixed by https://github.com/weecology/DeepForest/pull/1256 so we don't need to manually include sync code.

Do you have a standalone example that will test this? A triggering config/training script would be fine. Otherwise I'll run training with 2x GPUs and see if the metrics are borked.

I looked to see if we can simulate this within the CPU-only CI and it seems harder than you'd expect.
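For reference, the gather itself can be exercised on CPU with a two-process gloo group; wiring it into the Lightning validation loop in CI is the harder part. A standalone sketch (hypothetical test, not part of the repo):

import pandas as pd
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    # gloo runs on CPU, so no GPUs are needed for this check.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size
    )
    # Each rank holds its own shard of predictions.
    local = pd.DataFrame({"rank": [rank], "score": [0.5]})
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local)
    merged = pd.concat(gathered, ignore_index=True)
    assert len(merged) == world_size  # every rank now sees all shards
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(_worker, args=(2,), nprocs=2)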

@bw4sz
Collaborator Author

bw4sz commented Jan 30, 2026

Here is the multi-GPU example I'm using.

"""Compare single-GPU vs multi-GPU validation metrics.

Runs trainer.validate() with 1 GPU and with 2 GPUs on the same test set.
If predictions are not gathered across DDP ranks, box_recall (and related
metrics) will be lower with 2 GPUs than with 1 GPU.

For 2-GPU runs under Slurm, use srun with 2 tasks (one process per GPU);
the script detects SLURM_NTASKS and uses devices=1 so DDP has 2 ranks
without Lightning spawning subprocesses.
"""

import argparse
import os

from deepforest import main as df_main


def run_eval(test_csv, root_dir, num_gpus, iou_threshold=0.4):
    """Run validation with a given number of GPUs and return logged metrics."""
    model = df_main.deepforest()
    model.load_model("weecology/deepforest-bird")
    model.config.score_thresh = 0.2
    model.model.score_thresh = 0.2
    model.label_dict = {"Bird": 0}
    model.numeric_to_label_dict = {0: "Bird"}
    model.config.label_dict = {"Bird": 0}
    model.config.num_classes = 1

    model.config.validation.csv_file = test_csv
    model.config.validation.root_dir = root_dir
    model.config.validation.iou_threshold = iou_threshold
    model.config.validation.val_accuracy_interval = 1
    model.config.workers = 0
    model.create_trainer(devices=num_gpus, strategy="ddp")
    validation_results = model.trainer.validate(model)
    return validation_results[0] if validation_results else {}


def main():
    parser = argparse.ArgumentParser(
        description="Compare single-GPU vs multi-GPU validation metrics."
    )
    parser.add_argument("--test_csv", required=True, help="Path to test CSV")
    parser.add_argument("--root_dir", required=True, help="Root dir for images")
    parser.add_argument("--iou_threshold", type=float, default=0.4)
    parser.add_argument("--num_gpus", type=int, required=True, help="1 or 2")
    args = parser.parse_args()

    results = run_eval(
        test_csv=args.test_csv,
        root_dir=args.root_dir,
        num_gpus=args.num_gpus,
        iou_threshold=args.iou_threshold,
    )

    print(f"num_gpus={args.num_gpus}")
    print(f"box_recall: {results.get('box_recall', 'N/A')}")
    print(f"box_precision: {results.get('box_precision', 'N/A')}")
    print(f"empty_frame_accuracy: {results.get('empty_frame_accuracy', 'N/A')}")


if __name__ == "__main__":
    main()

Yes, if you run the same script with 1 and 2 GPUs, the results show the drop.

@jveitchmichaelis
Collaborator

OK, I'm running tests on HPG with NEON eval vs deepforest-tree.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 31, 2026

Seems to work OK; there are some small rounding errors, probably due to how torch splits and batches (#1256 needs to be rebased).

There's a design choice here. We could have each rank compute the metric on its own subset and then average, and for exact results perform a single-worker validation at the end. The current method does the full computation on every rank simultaneously, which is wasted effort, but it's exact.

B200 batch size 8, deepforest-tree:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       box_precision       │    0.5413408279418945     │
│        box_recall         │     0.864090085029602     │
│            iou            │     0.671686053276062     │
│         iou/cl_0          │    0.6716858744621277     │
│            map            │    0.17844010889530182    │
│          map_50           │    0.4625597596168518     │
│          map_75           │    0.09784193336963654    │
│         map_large         │    0.48323848843574524    │
│        map_medium         │     0.259711891412735     │
│       map_per_class       │           -1.0            │
│         map_small         │    0.10286436975002289    │
│           mar_1           │   0.030833937227725983    │
│          mar_10           │    0.1222200021147728     │
│          mar_100          │    0.24777226150035858    │
│     mar_100_per_class     │           -1.0            │
│         mar_large         │    0.5671706199645996     │
│        mar_medium         │    0.3504551649093628     │
│         mar_small         │    0.1639861762523651     │
│    val_bbox_regression    │    0.3008114695549011     │
│    val_classification     │    0.2081366330385208     │
│         val_loss          │    0.5089481472969055     │
└───────────────────────────┴───────────────────────────┘
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133/133 0:00:11 • 0:00:00 11.91it/s 
num_gpus=2 results:
box_recall: 0.864090085029602
box_precision: 0.5413408279418945
empty_frame_accuracy: N/A


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       box_precision       │    0.5413156151771545     │
│        box_recall         │    0.8640548586845398     │
│            iou            │     0.671710193157196     │
│         iou/cl_0          │    0.6717101335525513     │
│            map            │    0.17843422293663025    │
│          map_50           │    0.46256953477859497    │
│          map_75           │    0.09779982268810272    │
│         map_large         │    0.48319825530052185    │
│        map_medium         │    0.25968924164772034    │
│       map_per_class       │           -1.0            │
│         map_small         │    0.10287656635046005    │
│           mar_1           │   0.030823951587080956    │
│          mar_10           │    0.1222274899482727     │
│          mar_100          │    0.24777474999427795    │
│     mar_100_per_class     │           -1.0            │
│         mar_large         │    0.5670626163482666     │
│        mar_medium         │    0.35044270753860474    │
│         mar_small         │    0.16400346159934998    │
│    val_bbox_regression    │    0.3007431924343109     │
│    val_classification     │    0.20711223781108856    │
│         val_loss          │    0.5078554749488831     │
└───────────────────────────┴───────────────────────────┘
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265/265 0:00:43 • 0:00:00 4.84it/s 
num_gpus=1 results:
box_recall: 0.8640548586845398
box_precision: 0.5413156151771545
empty_frame_accuracy: N/A

@bw4sz
Collaborator Author

bw4sz commented Jan 31, 2026

I think, but am not 100% sure, that following the general philosophy of the package we want to reduce the mental load for less experienced users. They are going to want to see exactly the same results with 1 GPU and with multiple GPUs. A more advanced user might be fine with understanding that each GPU gets a subset and the metric is averaged across those subsets, so there is a small wiggle, but my inclination is to accept the wasted effort of having every GPU perform the full computation.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jan 31, 2026

That sounds good to me.

I think this is straightforward to optimize in the future, since each prediction is still an independent contribution to the metric. We would need to move the final recall/precision reduction into the torchmetric; evaluate_boxes would only go as far as computing the match results (plus an optional reduce arg if we want to run it standalone?).

This also leads nicely to a unified box/keypoint/polygon eval.
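Roughly, one possible shape for that refactor (illustrative only; BoxRecallPrecision and the matches dict are hypothetical names, not existing DeepForest API, only the standard torchmetrics Metric interface is assumed):

import torch
from torchmetrics import Metric


class BoxRecallPrecision(Metric):
    """Accumulates per-image match counts; torchmetrics handles the DDP sync."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # dist_reduce_fx="sum" sums the counts across ranks before compute().
        self.add_state("true_positives", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("n_predictions", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("n_ground_truth", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, matches):
        # `matches` would be the per-image output of evaluate_boxes, reduced
        # only as far as counts, not final precision/recall.
        self.true_positives += matches["true_positives"]
        self.n_predictions += matches["n_predictions"]
        self.n_ground_truth += matches["n_ground_truth"]

    def compute(self):
        return {
            "box_recall": self.true_positives / self.n_ground_truth.clamp(min=1),
            "box_precision": self.true_positives / self.n_predictions.clamp(min=1),
        }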

@bw4sz bw4sz changed the title wip: gather validation predictions from all ranks in DDP for multi-gpu inference. Gather validation predictions from all ranks in DDP for multi-gpu inference. Feb 2, 2026
@bw4sz bw4sz requested review from jveitchmichaelis and removed request for jveitchmichaelis February 2, 2026 14:01
@bw4sz bw4sz changed the title Gather validation predictions from all ranks in DDP for multi-gpu inference. WAIT: Gather validation predictions from all ranks in DDP for multi-gpu inference. Feb 2, 2026
@bw4sz
Collaborator Author

bw4sz commented Feb 2, 2026

I changed this to WAIT. When I tested this on bird_training, I had to do a merge (see the conflict above), and I'm not convinced I made the right choice, so I'm trying again.
