Skip to content

Comments

fix(proc-cap): record observed PGID for job process group#58

Merged
heidi-dang merged 2 commits intomainfrom
fix/proc-cap-record-observed-pgid
Feb 19, 2026
Merged

fix(proc-cap): record observed PGID for job process group#58
heidi-dang merged 2 commits intomainfrom
fix/proc-cap-record-observed-pgid

Conversation

@heidi-dang
Copy link
Owner

Problem: Release-only proc-cap integration flake where children were counted under a different PGID than the stored value due to assuming pid==pgid.

Fix: Record job.process_group from the observed PGID using getpgid(leader_pid) with a bounded retry (10 attempts � 1ms). If getpgid fails, fall back to pid.

Debug logging: Proc-cap debug prints are gated behind HK_DEBUG_PROC_CAP (off by default).

Behavioral impact: No change to enforcement logic; only correct PGID attribution.

Verification:

  • Focused integration test 20�x (Release) with debug enabled: all passed; inspector matched children; PROC_LIMIT transitions observed. Log: /tmp/proccap_pgidfix_release_r20_after_strip.log
  • ctest IntegrationTest subset: all passed.

@gemini-code-assist
Copy link

Summary of Changes

Hello @heidi-dang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a specific flakiness in the proc-cap integration by refining how process group IDs are identified and stored for jobs. It introduces a more resilient method for determining the correct PGID, especially in scenarios where child processes might re-parent or change their PGID. The update also includes enhanced debug logging to facilitate future troubleshooting, all while maintaining the existing process enforcement behavior.

Highlights

  • Problem Addressed: Resolved a release-only proc-cap integration flake where child processes were incorrectly attributed to a different Process Group ID (PGID) due to an assumption that pid == pgid.
  • PGID Recording Improvement: Implemented a robust mechanism to record the observed PGID for a job's process group by using getpgid(leader_pid) with up to 10 bounded retries (1ms each), falling back to the process ID (pid) if getpgid fails.
  • Debug Instrumentation: Introduced optional debug logging for proc-cap related operations, activated by setting the HK_DEBUG_PROC_CAP environment variable, to aid in diagnosis.
  • No Enforcement Logic Change: Confirmed that the changes only correct PGID attribution and do not alter the existing process enforcement logic.
Changelog
  • .gitignore.tmpdiag
    • Added a new entry to ignore temporary diagnostic branch markers.
  • src/job/job.cpp
    • Introduced debug logging for setpgid calls within child processes, enabled by g_proc_cap_enabled.
    • Modified the parent process's logic to retrieve the job.process_group by attempting getpgid(leader_pid) multiple times with a small delay, falling back to pid if unsuccessful.
    • Added debug logging to enforce_job_process_cap to display details about the stored PGID, getpgid results, and kill probe results when HK_DEBUG_PROC_CAP is set.
  • src/job/process_inspector_procfs.cpp
    • Updated the parsing of /proc//stat to correctly handle process command names (comm) that contain spaces by finding the closing parenthesis before extracting ppid and pgrp.
    • Added extensive debug logging within count_processes_in_pgid to output details like the scanned PGID, number of processes matched, a list of matched PIDs, and raw stat lines for the leader and the first child process, for diagnostic purposes.
Activity
  • No specific activity (comments, reviews, etc.) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@heidi-dang
Copy link
Owner Author

Notes: bounded retry (10×1ms), no busy loops; debug gated by HK_DEBUG_PROC_CAP; includes prior inspector procfs hardening already in branch.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to address a bug in process group ID attribution by introducing a more robust mechanism to observe the actual PGID, including a retry loop. However, it introduces a critical race condition where the parent process may incorrectly record its own PGID, potentially leading to the parent process group being killed when a job is terminated. Furthermore, the use of fprintf in the child process after fork is unsafe and can cause deadlocks. Beyond these security concerns, the review also highlights opportunities to improve maintainability by removing code duplication, replacing magic numbers with named constants, and ensuring consistent debug logging. A temporary file also needs to be removed.

Comment on lines +174 to +180
for (int attempt = 0; attempt < 10; ++attempt) {
observed_pgid = getpgid(leader_pid);
if (observed_pgid != -1)
break;
pg_get_errno = errno;
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

CRITICAL: Race condition in PGID attribution leading to process group hijacking. The current loop attempting to determine the child's PGID using getpgid(pid) (lines 174-180) is flawed. A child process initially inherits its parent's PGID, and getpgid will return this immediately, causing the loop to terminate before the child can call setpgid(0, 0). This results in the parent's own PGID being stored, which can lead to the parent process group being terminated when the job is cancelled. Additionally, the retry count 10 and sleep duration 1 millisecond are magic numbers; consider defining them as named constants (e.g., kGetPgidMaxAttempts) for improved readability and maintainability.

      // Parent: explicitly set the child's PGID to avoid race conditions with getpgid.
      // This ensures the child is in its own process group regardless of execution order.
      if (setpgid(pid, pid) != 0 && errno != EACCES) {
        // EACCES is ignored as it may mean the child has already performed exec()
      }
      observed_pgid = pid;

Comment on lines +140 to +143
fprintf(stderr,
"SPAWN_DBG child pid=%d ppid=%d pgid_before=%d setpgid_res=%d setpgid_errno=%d "
"sid=%d\n",
cpid, getppid(), (int)cpgid_before, setres, set_err, (int)csid);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

HIGH: Use of non-async-signal-safe function fprintf in child process. The call to fprintf(stderr, ...) (lines 140-143) after fork() but before execl() is dangerous in a multi-threaded application. It can lead to deadlocks if a lock was held by another thread in the parent at the time of fork. Additionally, the setpgid(0, 0) call is duplicated in both if and else branches (lines 132-147), which could be refactored to improve maintainability.

        // Avoid fprintf in child after fork as it is not async-signal-safe.
        // If logging is required, use the async-signal-safe write() system call.


closedir(proc_dir);
// Minimal structured debug output for triage (local-only)
if (matched >= 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This debug logging appears to be unconditional, which could generate a lot of output in production. The PR description states that debug prints are gated by HK_DEBUG_PROC_CAP. This block should be wrapped in a conditional check like if (getenv("HK_DEBUG_PROC_CAP")). The current condition if (matched >= 0) is always true, as matched is initialized to 0 and only incremented.

Suggested change
if (matched >= 0) {
if (getenv("HK_DEBUG_PROC_CAP")) {

Comment on lines +1 to +2
# temporary diag branch marker
/.tmpdiag

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This file appears to be a temporary artifact from a diagnostic branch. To keep the repository clean, please consider removing it before merging.

}

// Instrumentation: log what PGID we think we're inspecting and a cheap probe
if (getenv("HK_DEBUG_PROC_CAP")) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check uses getenv("HK_DEBUG_PROC_CAP") directly, while other parts of this file use the g_proc_cap_enabled flag. For consistency and to avoid repeated getenv calls, it would be better to use the global flag here as well.

  if (g_proc_cap_enabled) {

@heidi-dang heidi-dang merged commit 5bd0c4a into main Feb 19, 2026
4 checks passed
@heidi-dang heidi-dang deleted the fix/proc-cap-record-observed-pgid branch February 19, 2026 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant