Skip to content

Fixed terminate_mp_processes to cover failure workers.#270

Open
hexinw-nvidia wants to merge 2 commits intoNVIDIA:mainfrom
hexinw-nvidia:terminate_child
Open

Fixed terminate_mp_processes to cover failure workers.#270
hexinw-nvidia wants to merge 2 commits intoNVIDIA:mainfrom
hexinw-nvidia:terminate_child

Conversation

@hexinw-nvidia
Copy link
Contributor

Failure workers have orphan descendant processes left behind from SubprocessContext._close(). temrinate_mp_processes is the right place to catch all the orphan cleanup.

Note if the orphan processes are temprorily or permanently stuck in the kernel I/O wait. The user-level signl based kill would not help.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR broadens terminate_mp_processes in utils.py to clean up all orphaned descendant processes in worker process groups (not just multiprocessing-specific ones), which fixes missed cleanup for failure workers whose descendants are left behind by SubprocessContext._close(). It also fixes two previously flagged inefficiencies: pre-fetched ppid is now consumed from process.info['ppid'] instead of re-reading via process.ppid(), and the unused name and cmdline attrs are removed from the psutil.process_iter call.

Key changes:

  • Removed cmdline pattern matching (from multiprocessing.resource_tracker import main, from multiprocessing.spawn import spawn_main) — now any process with ppid=1 in a target pgid is terminated
  • process.ppid()process.info['ppid'] — correctly uses the pre-fetched cache, avoiding a redundant /proc read per process
  • attrs trimmed from ['pid', 'ppid', 'name', 'cmdline']['pid', 'ppid'] — removes two now-unnecessary pre-fetches
  • Docstring updated to reflect the broadened scope and document the intentional single-pass (no recursive children walk) design decision
  • launcher.py: Updated comments in _shutdown to clarify best-effort cleanup for orphan descendants

Confidence Score: 4/5

  • Safe to merge; the broadened termination scope is intentional and well-guarded by the pgid filter.
  • The logic change is focused and purposeful—broadening cleanup from multiprocessing-specific patterns to any process with ppid=1 in target pgids. This tighter filtering (ppid=1 AND pgid membership) actually improves safety compared to pattern matching. All previously flagged inefficiencies (ppid cache usage, unused attrs) have been fixed. The acknowledged limitation (D-state processes cannot be signaled) is a kernel constraint, not a code defect. No regressions are expected.
  • No files require special attention.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_shutdown(death_sig, timeout)"] --> B["_pcontext.close(death_sig, timeout)"]
    B --> C{"Any orphan<br/>processes remain?"}
    C -- "Worker died normally<br/>children reparented to PID 1" --> D["terminate_mp_processes<br/>allowed_ppids={1}<br/>allowed_pgids=_children_pgids"]
    C -- "Worker stuck in D-state<br/>children still parented to worker" --> E["⚠ Orphans NOT caught<br/>(ppid ≠ 1 yet)"]
    D --> F["psutil.process_iter<br/>attrs=['pid','ppid']"]
    F --> G["For each process:<br/>check process.info['ppid'] in {1}"]
    G -- "ppid match" --> H["check os.getpgid(pid)<br/>in _children_pgids"]
    G -- "no match" --> F
    H -- "pgid match" --> I["process.terminate()<br/>(SIGTERM)"]
    H -- "no match" --> F
    I --> J["Exception swallowed<br/>(best-effort)"]
    E --> K["Known limitation:<br/>D-state processes<br/>cannot be signaled"]
Loading

Last reviewed commit: 5d7afa6

Failure workers have orphan descendant processes left behind from
SubprocessContext._close(). temrinate_mp_processes is the right
place to catch all the orphan cleanup.

Note if the orphan processes are temprorily or permanently stuck
in the kernel I/O wait. The user-level signl based kill would not
help.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant