feat(governor): group store wired + deterministic eviction tests#56
feat(governor): group store wired + deterministic eviction tests#56heidi-dang merged 2 commits intomainfrom
Conversation
…ministic eviction tests Wave2 P2-2: wire GroupPolicyStore into ProcessGovernor with: - GroupPolicyStore member (cap=256) and pid→group map (cap=8192) - LRU eviction for groups and pids - apply_group_policy and apply_cgroup_policy methods - Event callbacks for GROUP_EVICT/PIDMAP_EVICT/GOV_CGROUP_UNAVAILABLE - Rate-limited cgroup unavailability events - ViolationAction aligned to NONE/WARN/SOFT_KILL/HARD_KILL - Deterministic eviction unit tests using controllable time sequence - Fix memset-on-non-trivial warnings in GroupPolicyStore
Summary of ChangesHello @heidi-dang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Process Governor by introducing a new Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a GroupPolicyStore to manage group-based resource limits and wires it into the ProcessGovernor. However, critical security vulnerabilities were identified, including a logic error in PID-to-group mapping that allows bypassing security restrictions, a potential Denial of Service via unhandled exceptions in CPU affinity parsing, and a data race due to inconsistent synchronization of the group store. Beyond security, there are also correctness issues such as the upsert_group function failing to populate policy details and omitting resource limit fields, as well as problems with statistics reporting and the use of magic numbers for event types.
| const char* GroupPolicyStore::get_group_for_pid(int32_t pid) const { | ||
| for (size_t i = 0; i < kMaxPidGroupMap; ++i) { | ||
| if (pid_map_[i].in_use && static_cast<int32_t>(i) == pid) { | ||
| return pid_map_[i].group_id.data(); | ||
| } | ||
| } | ||
| return nullptr; | ||
| } |
There was a problem hiding this comment.
The GroupPolicyStore has a critical logic error in its PID-to-group mapping, leading to a security bypass. The get_group_for_pid function incorrectly assumes the array index i corresponds directly to the pid, while map_pid_to_group inserts into the first available slot. This means PIDs greater than or equal to 8192, or PIDs stored at indices not matching their value, will not be found. This allows processes with such PIDs to bypass all group-level restrictions enforced by the governor.
| struct PidEntry { | ||
| std::array<char, kMaxGroupIdLen + 1> group_id{}; | ||
| uint64_t last_seen_ns = 0; | ||
| bool in_use = false; | ||
| }; |
There was a problem hiding this comment.
The PidEntry struct is missing a field to store the actual process ID. Since PIDs are sparse and can exceed the fixed size of the pid_map_ array, the mapping logic must store the PID explicitly to perform correct lookups.
| struct PidEntry { | |
| std::array<char, kMaxGroupIdLen + 1> group_id{}; | |
| uint64_t last_seen_ns = 0; | |
| bool in_use = false; | |
| }; | |
| struct PidEntry { | |
| int32_t pid = -1; | |
| std::array<char, kMaxGroupIdLen + 1> group_id{}; | |
| uint64_t last_seen_ns = 0; | |
| bool in_use = false; | |
| }; |
| for (size_t i = 0; i < kMaxPidGroupMap; ++i) { | ||
| if (pid_map_[i].in_use && static_cast<int32_t>(i) == pid) { |
There was a problem hiding this comment.
The logic here incorrectly assumes that the array index i is the process ID. This will fail for any PID greater than kMaxPidGroupMap (8191) and will cause incorrect updates if PIDs are reused or sparse. The loop should check for a matching pid field within the PidEntry.
| for (size_t i = 0; i < kMaxPidGroupMap; ++i) { | |
| if (pid_map_[i].in_use && static_cast<int32_t>(i) == pid) { | |
| for (size_t i = 0; i < kMaxPidGroupMap; ++i) { | |
| if (pid_map_[i].in_use && pid_map_[i].pid == pid) { |
| for (size_t i = 0; i < kMaxGroups; ++i) { | ||
| if (!groups_[i].in_use) { | ||
| groups_[i] = GroupEntry{}; | ||
| std::strncpy(groups_[i].policy.group_id.data(), gid, gid_len); | ||
| groups_[i].policy.group_id[gid_len] = '\0'; | ||
| groups_[i].policy.last_update_ns = get_time(); | ||
| groups_[i].in_use = true; | ||
| group_count_++; | ||
| return true; | ||
| } |
There was a problem hiding this comment.
| if (msg.cpu) { | ||
| if (msg.cpu->max_pct) | ||
| groups_[i].policy.cpu_max_pct = msg.cpu->max_pct; | ||
| } |
There was a problem hiding this comment.
The upsert_group function is missing updates for the newly added cpu_quota_us and cpu_period_us fields in CpuPolicy. These should be synchronized from the msg to the stored policy.
if (msg.cpu) {
if (msg.cpu->max_pct)
groups_[i].policy.cpu_max_pct = msg.cpu->max_pct;
if (msg.cpu->quota_us)
groups_[i].policy.cpu_quota_us = msg.cpu->quota_us;
if (msg.cpu->period_us)
groups_[i].policy.cpu_period_us = msg.cpu->period_us;
}| auto store_stats = group_store_.get_stats(); | ||
| s.group_evictions = store_stats.group_evictions; | ||
| s.pidmap_evictions = store_stats.pidmap_evictions; | ||
| s.cgroup_unavailable_events = store_stats.cgroup_unavailable_count; |
There was a problem hiding this comment.
This line overwrites the cgroup_unavailable_events counter with a value from group_store_. However, GroupPolicyStore never increments its internal cgroup_unavailable_count, while ProcessGovernor correctly increments its own stats_.cgroup_unavailable_events at line 456. This assignment effectively breaks the statistic by always reporting 0.
| event_callback_(4, evict_msg, 0); | ||
| } | ||
| } | ||
| if (new_stats.pidmap_evictions > prev_stats.pidmap_evictions) { | ||
| stats_.pidmap_evictions++; | ||
| if (event_callback_) { | ||
| GovApplyMsg evict_msg; | ||
| evict_msg.pid = pid; | ||
| event_callback_(5, evict_msg, 0); | ||
| } | ||
| } | ||
|
|
||
| const char* group_id = group_store_.get_group_for_pid(pid); | ||
| if (group_id) { | ||
| const GroupPolicy* group_policy = group_store_.get_group(group_id); | ||
| if (group_policy) { | ||
| auto r = apply_cgroup_policy(pid, *group_policy); | ||
| if (!r.success) { | ||
| return r; | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| result.success = true; | ||
| return result; | ||
| } | ||
|
|
||
| ApplyResult ProcessGovernor::apply_cgroup_policy(int32_t pid, const GroupPolicy& group_policy) { | ||
| ApplyResult result; | ||
|
|
||
| if (!cgroup_driver_.is_available() || !cgroup_driver_.is_enabled()) { | ||
| uint64_t now = get_current_time_ns(); | ||
| if (now - last_cgroup_unavailable_ns_ > kCgroupUnavailableRateLimitNs) { | ||
| last_cgroup_unavailable_ns_ = now; | ||
| stats_.cgroup_unavailable_events++; | ||
| if (event_callback_) { | ||
| GovApplyMsg msg; | ||
| msg.pid = pid; | ||
| event_callback_(6, msg, 0); |
…ministic eviction tests Wave2 P2-2: wire GroupPolicyStore into ProcessGovernor with: - GroupPolicyStore member (cap=256) and pid→group map (cap=8192) - LRU eviction for groups and pids - apply_group_policy and apply_cgroup_policy methods - Event callbacks for GROUP_EVICT/PIDMAP_EVICT/GOV_CGROUP_UNAVAILABLE - Rate-limited cgroup unavailability events - ViolationAction aligned to NONE/WARN/SOFT_KILL/HARD_KILL - Deterministic eviction unit tests using controllable time sequence - Fix memset-on-non-trivial warnings in GroupPolicyStore - Add GovEventType enum registry to prevent event ID collisions
Wave2 P2-2: wire GroupPolicyStore into ProcessGovernor, deterministic eviction unit tests, remove memset-on-non-trivial warnings, align ViolationAction to NONE/WARN/SOFT_KILL/HARD_KILL.