
Fix: Defensively close GPU device FDs in dataloader worker processes #3684

Draft

hexinw-nvidia wants to merge 2 commits into NVIDIA:main from hexinw-nvidia:close_nvidia_fds

Conversation

Contributor

hexinw-nvidia commented Mar 4, 2026

This change defensively closes inherited /dev/nvidia* file descriptors in dataloader worker processes, so workers do not keep references into NVIDIA device memory after fork(). This helps ensure GPU memory can be reclaimed even if a dataloader worker is delayed or fails to exit.

How to Reproduce / Validate:

  1. Force a long-running dataloader worker
    Modify GPTDataset.__getitem__ to insert:

    time.sleep(3600)

    This simulates a stuck dataloader worker (e.g., blocked in I/O).

  2. Start training
    Launch a 1-node Megatron-LM job with:

    --num-workers > 0

  3. Verify dataloader workers are alive
    On the GPU node:

    sudo fuser -v /dev/nvidia*

    You should see the dataloader worker processes listed.
    With this patch, they should not retain active /dev/nvidia* file
    descriptors even though they are running.

  4. Trigger a rank failure
    Send SIGTERM to one of the training ranks:

    kill -15 <rank_pid>

  5. Observe GPU memory reclaim
    Run:

    nvidia-smi

    The corresponding rank’s GPU memory usage should return to 0
    immediately (assuming no other GPU-holding child processes such as async
    checkpoint workers are present in this test).

  6. Baseline (without this patch)
    Repeat the same steps without this change.

    After killing the rank in step 4, you will observe that GPU memory remains non-zero in nvidia-smi, because the dataloader worker still holds /dev/nvidia* references.
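The defensive close described above can be sketched as a scan of /proc/self/fd in each spawned worker. This is an illustrative sketch only, not the PR's actual code: the function name `close_nvidia_fds` and its use as a `worker_init_fn` are assumptions.

```python
import os

def close_nvidia_fds() -> None:
    """Close any /dev/nvidia* file descriptors inherited across fork().

    Intended to run early in a dataloader worker process, e.g. via a
    DataLoader worker_init_fn (hypothetical wiring).
    """
    fd_dir = "/proc/self/fd"
    for fd in os.listdir(fd_dir):
        try:
            # Resolve the fd's target path; may fail if the fd vanished
            # (e.g. the fd used by listdir itself).
            path = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue
        if path.startswith("/dev/nvidia"):
            try:
                os.close(int(fd))
            except OSError:
                pass  # already closed or raced with another close
```

On a machine without NVIDIA devices this is a no-op; unrelated file descriptors are left untouched.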


copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 02:37
Member

Phlip79 commented Mar 4, 2026

/claude review

        if path.startswith("/dev/nvidia"):
            os.close(int(fd))
    except Exception:
        pass
The bare except Exception: pass silently swallows all errors. At minimum, unexpected exceptions (not OSError/FileNotFoundError) deserve a warning log, otherwise bugs here are invisible in production.

Suggested change:

    except OSError:
        pass

CUDA fd paths like /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm should reliably resolve, so only OSError (ENOENT for already-closed fds, EBADF, etc.) is expected here.
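A minimal sketch of the narrower handling this review suggests, with a warning for genuinely unexpected failures. The helper name `safe_close` and the logging wiring are illustrative assumptions, not the PR's actual code:

```python
import errno
import logging
import os

logger = logging.getLogger(__name__)

def safe_close(fd: int) -> None:
    """Close fd, tolerating already-closed descriptors but logging surprises."""
    try:
        os.close(fd)
    except OSError as e:
        if e.errno == errno.EBADF:
            return  # fd already closed; expected during an fd sweep
        # Other OSErrors (e.g. EIO) are unusual here and worth surfacing.
        logger.warning("Unexpected error closing fd %d: %s", fd, e)
```

A double close then becomes silent by design instead of silently hiding every possible bug:

```python
r, w = os.pipe()
safe_close(r)
safe_close(r)  # EBADF swallowed; no exception raised
os.close(w)
```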

Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:48