fix(proc-cap): record observed PGID for job process group#58
fix(proc-cap): record observed PGID for job process group#58heidi-dang merged 2 commits intomainfrom
Conversation
Summary of ChangesHello @heidi-dang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a specific flakiness in the proc-cap integration by refining how process group IDs are identified and stored for jobs. It introduces a more resilient method for determining the correct PGID, especially in scenarios where child processes might re-parent or change their PGID. The update also includes enhanced debug logging to facilitate future troubleshooting, all while maintaining the existing process enforcement behavior. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Notes: bounded retry (10×1ms), no busy loops; debug gated by HK_DEBUG_PROC_CAP; includes prior inspector procfs hardening already in branch. |
There was a problem hiding this comment.
Code Review
This pull request aims to address a bug in process group ID attribution by introducing a more robust mechanism to observe the actual PGID, including a retry loop. However, it introduces a critical race condition where the parent process may incorrectly record its own PGID, potentially leading to the parent process group being killed when a job is terminated. Furthermore, the use of fprintf in the child process after fork is unsafe and can cause deadlocks. Beyond these security concerns, the review also highlights opportunities to improve maintainability by removing code duplication, replacing magic numbers with named constants, and ensuring consistent debug logging. A temporary file also needs to be removed.
| for (int attempt = 0; attempt < 10; ++attempt) { | ||
| observed_pgid = getpgid(leader_pid); | ||
| if (observed_pgid != -1) | ||
| break; | ||
| pg_get_errno = errno; | ||
| std::this_thread::sleep_for(std::chrono::milliseconds(1)); | ||
| } |
There was a problem hiding this comment.
CRITICAL: Race condition in PGID attribution leading to process group hijacking. The current loop attempting to determine the child's PGID using getpgid(pid) (lines 174-180) is flawed. A child process initially inherits its parent's PGID, and getpgid will return this immediately, causing the loop to terminate before the child can call setpgid(0, 0). This results in the parent's own PGID being stored, which can lead to the parent process group being terminated when the job is cancelled. Additionally, the retry count 10 and sleep duration 1 millisecond are magic numbers; consider defining them as named constants (e.g., kGetPgidMaxAttempts) for improved readability and maintainability.
// Parent: explicitly set the child's PGID to avoid race conditions with getpgid.
// This ensures the child is in its own process group regardless of execution order.
if (setpgid(pid, pid) != 0 && errno != EACCES) {
// EACCES is ignored as it may mean the child has already performed exec()
}
observed_pgid = pid;| fprintf(stderr, | ||
| "SPAWN_DBG child pid=%d ppid=%d pgid_before=%d setpgid_res=%d setpgid_errno=%d " | ||
| "sid=%d\n", | ||
| cpid, getppid(), (int)cpgid_before, setres, set_err, (int)csid); |
There was a problem hiding this comment.
HIGH: Use of non-async-signal-safe function fprintf in child process. The call to fprintf(stderr, ...) (lines 140-143) after fork() but before execl() is dangerous in a multi-threaded application. It can lead to deadlocks if a lock was held by another thread in the parent at the time of fork. Additionally, the setpgid(0, 0) call is duplicated in both if and else branches (lines 132-147), which could be refactored to improve maintainability.
// Avoid fprintf in child after fork as it is not async-signal-safe.
// If logging is required, use the async-signal-safe write() system call.|
|
||
| closedir(proc_dir); | ||
| // Minimal structured debug output for triage (local-only) | ||
| if (matched >= 0) { |
There was a problem hiding this comment.
This debug logging appears to be unconditional, which could generate a lot of output in production. The PR description states that debug prints are gated by HK_DEBUG_PROC_CAP. This block should be wrapped in a conditional check like if (getenv("HK_DEBUG_PROC_CAP")). The current condition if (matched >= 0) is always true, as matched is initialized to 0 and only incremented.
| if (matched >= 0) { | |
| if (getenv("HK_DEBUG_PROC_CAP")) { |
| # temporary diag branch marker | ||
| /.tmpdiag |
| } | ||
|
|
||
| // Instrumentation: log what PGID we think we're inspecting and a cheap probe | ||
| if (getenv("HK_DEBUG_PROC_CAP")) { |
Problem: Release-only proc-cap integration flake where children were counted under a different PGID than the stored value due to assuming pid==pgid.
Fix: Record job.process_group from the observed PGID using getpgid(leader_pid) with a bounded retry (10 attempts � 1ms). If getpgid fails, fall back to pid.
Debug logging: Proc-cap debug prints are gated behind HK_DEBUG_PROC_CAP (off by default).
Behavioral impact: No change to enforcement logic; only correct PGID attribution.
Verification: