Fix: Defensively close GPU device FDs in dataloader worker processes #3684
Draft
hexinw-nvidia wants to merge 2 commits into NVIDIA:main
Conversation
This ensures dataloader workers do not keep references into NVIDIA memory space after the fork, so GPU memory can be reclaimed even if a worker is delayed or fails to exit.
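The idea, sketched below, is a small hook that runs in each worker right after fork and closes any inherited /dev/nvidia* descriptors. The excerpt under review does not show where the hook is installed, so the worker_init_fn wiring and the name _close_inherited_nvidia_fds are illustrative assumptions, and the /proc/self/fd walk is Linux-specific:

import os
from torch.utils.data import DataLoader

def _close_inherited_nvidia_fds(worker_id: int) -> None:
    # Walk this worker's open file descriptors and close any that point
    # at /dev/nvidia* device files inherited from the parent process.
    for fd in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(f"/proc/self/fd/{fd}")
        except OSError:
            continue  # fd vanished between listdir and readlink
        if target.startswith("/dev/nvidia"):
            try:
                os.close(int(fd))
            except OSError:
                pass  # already closed elsewhere

# Assumed wiring: run the hook as each dataloader worker starts.
loader = DataLoader(range(8), num_workers=2, worker_init_fn=_close_inherited_nvidia_fds)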
How to Reproduce / Validate:
1. Force a long-running dataloader worker
Modify GPTDataset.__getitem__ to insert a long sleep (see the monkey-patch sketch after this list):
time.sleep(3600)
This simulates a stuck dataloader worker (e.g., one blocked on I/O).
2. Start training
Launch a 1-node Megatron-LM job with:
--num-workers > 0
3. Verify dataloader workers are alive
On the GPU node:
sudo fuser -v /dev/nvidia*
You should see the dataloader worker processes listed.
With this patch, they should not retain active /dev/nvidia* file
descriptors even though they are running.
4. Trigger a rank failure
Send SIGTERM to one of the training ranks:
kill -15 <rank_pid>
5. Observe GPU memory reclaim
Run:
nvidia-smi
The corresponding rank’s GPU memory usage should return to 0
immediately (assuming no other GPU-holding child processes such as async
checkpoint workers are present in this test).
6. Baseline (without this patch)
Repeat the same steps without this change.
After killing the rank in step 4, you will observe that GPU memory
remains non-zero in nvidia-smi, because the dataloader worker still
holds /dev/nvidia* references.
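For step 1, one way to inject the delay without editing the dataset source is a quick monkey-patch in the training script; the import path below matches the current Megatron-LM layout but is an assumption about your checkout:

import time
from megatron.core.datasets.gpt_dataset import GPTDataset  # path assumed

_original_getitem = GPTDataset.__getitem__

def _stuck_getitem(self, idx):
    time.sleep(3600)  # hold the worker for an hour to simulate stuck I/O
    return _original_getitem(self, idx)

GPTDataset.__getitem__ = _stuck_getitem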
Member
/claude review
if path.startswith("/dev/nvidia"):
    os.close(int(fd))
except Exception:
    pass
The bare except Exception: pass silently swallows all errors. At minimum, unexpected exceptions (anything other than OSError/FileNotFoundError) deserve a warning log; otherwise bugs here are invisible in production.
Suggested change
-except Exception:
-    pass
+except OSError:
+    pass
CUDA fd paths like /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm should reliably resolve, so only OSError (ENOENT for already-closed fds, EBADF, etc.) is expected here.
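Following that reasoning, a variant of the cleanup loop could keep the quiet path for OSError but surface anything unexpected; this is a sketch of the reviewer's suggestion, not the patch itself, and the logger setup is assumed:

import logging
import os

logger = logging.getLogger(__name__)

def _close_inherited_nvidia_fds() -> None:
    # Narrowed handler: OSError stays silent, anything else gets a warning.
    for fd in os.listdir("/proc/self/fd"):
        try:
            path = os.readlink(f"/proc/self/fd/{fd}")
            if path.startswith("/dev/nvidia"):
                os.close(int(fd))
        except OSError:
            pass  # expected: fd already closed or not resolvable
        except Exception:
            logger.warning("Unexpected error closing fd %s", fd, exc_info=True)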
Member
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.