Feat: Optimize pod cache to reduce memory usage #172
base: main
Conversation
Welcome @mahil-2040! It looks like this is your first PR to volcano-sh/agentcube 🎉
Summary of Changes: Hello @mahil-2040, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request addresses excessive memory usage in the workload-manager's pod cache.
Highlights
Code Review
This is a great optimization to reduce memory usage by filtering the pod cache. The pull request description is very clear and provides good validation of the improvement.
However, I've found a critical issue in the implementation of the label selector. The current implementation using labels.SelectorFromSet will only match pods with the label sandbox-name and an empty value, which is not the intention and would cause the cache to not work for sandbox pods. I've provided a suggestion to fix this.
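For illustration, here is a small standalone Go sketch (not part of the PR; the pod label value is made up) showing why `labels.SelectorFromSet` with an empty value fails to match labeled sandbox pods, while an `Exists` requirement matches any value:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

func main() {
	// Labels of a typical sandbox pod (the value here is made up).
	podLabels := labels.Set{"sandbox-name": "my-sandbox"}

	// SelectorFromSet builds equality requirements, so an empty value
	// only matches pods whose sandbox-name label is literally "".
	fromSet := labels.SelectorFromSet(labels.Set{"sandbox-name": ""})
	fmt.Println(fromSet.Matches(podLabels)) // false

	// An Exists requirement matches any pod that carries the key at all.
	req, err := labels.NewRequirement("sandbox-name", selection.Exists, nil)
	if err != nil {
		panic(err)
	}
	existsSelector := labels.NewSelector().Add(*req)
	fmt.Println(existsSelector.Matches(podLabels)) // true
}
```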
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #172 +/- ##
==========================================
+ Coverage 35.60% 35.65% +0.04%
==========================================
Files 29 29
Lines 2533 2558 +25
==========================================
+ Hits 902 912 +10
- Misses 1505 1520 +15
Partials 126 126
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Pull request overview
Adds a filtered Pod cache configuration to the workload-manager’s controller-runtime manager to reduce memory usage by avoiding caching unrelated cluster Pods.
Changes:
- Configure controller-runtime manager cache filtering for `corev1.Pod` using a label selector.
- Add a new E2E runner script for Kind-based testing (Go + Python).
- Add new hack scripts for copyright header injection and client-go code generation.
Reviewed changes
Copilot reviewed 1 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `cmd/workload-manager/main.go` | Adds controller-runtime cache ByObject filtering for Pods to reduce cache size. |
| `test/e2e/run_e2e.sh` | Introduces a full Kind + Helm + port-forward based E2E runner for Go and Python tests. |
| `hack/update-copyright.sh` | Adds a helper script to prepend boilerplate headers to Go/Python files. |
| `hack/update-codegen.sh` | Adds a helper script to run Kubernetes code-generator for `client-go/`. |
Force-pushed from 458caa6 to 052933c
Pull request overview
Copilot reviewed 2 out of 5 changed files in this pull request and generated 1 comment.
@hzxuzhonghu can you please review this PR?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
@hzxuzhonghu I have implemented all the suggestions, can you check it now?
pkg/workloadmanager/k8s_client.go
Outdated
    metav1.NamespaceAll,
    func(opts *metav1.ListOptions) {
        // Filter to only watch pods with sandbox-name label
        opts.LabelSelector = SandboxNameLabelKey
    requirement, err := labels.NewRequirement(workloadmanager.SandboxNameLabelKey, selection.Exists, nil)
    opts.LabelSelector = requirement.String()
I've updated the client-go informer filter to use the same explicit approach.
cmd/workload-manager/main.go
Outdated
    Cache: cache.Options{
        ByObject: map[client.Object]cache.ByObject{
            &corev1.Pod{}: {
                Label: labelSelector,
            },
Can you explain why the pod cache is added?
Sure! The pod cache is needed because the workload-manager needs to look up pod IPs when routing requests to sandbox sessions.
When a session is created, we need to find the corresponding pod's IP address to forward traffic to it. The GetSandboxPodIP function (in k8s_client.go) uses the pod lister to find pods by their sandbox-name label.
Without caching, we'd have to query the Kubernetes API server every time we need a pod IP, which would be slow and put unnecessary load on the API server. The cache keeps a local copy of pod information so lookups are fast.
The optimization here is that instead of caching ALL pods in the cluster (which wastes memory), we only cache the pods we actually care about - the ones with the sandbox-name label.
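As a rough illustration of the lookup path described above (a hypothetical helper, not the actual GetSandboxPodIP implementation; names and error handling are assumptions):

```go
package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// lookupSandboxPodIP lists cached pods carrying the sandbox-name label with
// the given value and returns the IP of the first pod that has one. It reads
// from the informer cache, so no API-server round trip is needed per lookup.
func lookupSandboxPodIP(podLister corelisters.PodLister, namespace, sandboxName string) (string, error) {
	selector := labels.SelectorFromSet(labels.Set{"sandbox-name": sandboxName})
	pods, err := podLister.Pods(namespace).List(selector)
	if err != nil {
		return "", fmt.Errorf("listing sandbox pods: %w", err)
	}
	for _, pod := range pods {
		if pod.Status.PodIP != "" {
			return pod.Status.PodIP, nil
		}
	}
	return "", fmt.Errorf("no pod with an IP found for sandbox %q", sandboxName)
}
```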
I think GetSandboxPodIP is already using the podLister.
Do not miss this; I think we can revert this change @mahil-2040
Yes, I am working on this. Give me some time; I will revert the changes.
pkg/workloadmanager/k8s_client.go
Outdated
    // WorkloadNameLabelKey labels key for workload name
    WorkloadNameLabelKey = "runtime.agentcube.io/workload-name"
    // SandboxNameLabelKey labels key for sandbox name
    SandboxNameLabelKey = "sandbox-name" // revive:disable-line:var-naming
Suggested change:
    - SandboxNameLabelKey = "sandbox-name" // revive:disable-line:var-naming
    + SandboxNameLabelKey = "runtime.agentcube.io/sandbox-name"
@hzxuzhonghu I’m concerned this change could break backward compatibility. Copilot is suggesting that while "sandbox-name" is already used by existing sandbox pods, changing it to "runtime.agentcube.io/sandbox-name" would cause the filtered cache to miss existing pods, and GetSandboxPodIP may fail during rolling upgrades. Should we keep "sandbox-name" to maintain compatibility, or do you have a migration strategy in mind?
Sooner or later we should update; don't worry about compatibility now, we have not released yet.
So I should implement this change now, right, ignoring the already existing pods?
I think we can do it now.
OK, I will implement it and push the changes.
@hzxuzhonghu Thanks for the suggestion about using "runtime.agentcube.io/sandbox-name"! I want to clarify that the "sandbox-name" label key already existed in the codebase before this PR - I didn't introduce it. This PR only filters which pods are cached using that existing label.
I think changing the label value from "sandbox-name" to "runtime.agentcube.io/sandbox-name" could break backward compatibility with any existing sandbox pods that are already labeled with "sandbox-name". Since this is just a refactoring/optimization PR focused on reducing memory usage, I kept the existing label value to avoid introducing breaking changes. Would you prefer I create a separate follow-up PR to migrate the label key to the namespaced format with proper deprecation handling once this PR is merged?
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
cmd/workload-manager/main.go
Outdated
        os.Exit(1)
    }
    labelSelector := labels.NewSelector().Add(*requirement)

Copilot AI (Jan 28, 2026):
The blank line after building labelSelector contains trailing whitespace. This will be normalized by gofmt and can cause noisy diffs; please run gofmt / remove the trailing whitespace.
    // Create informer factory for core resources (Pods, etc.)
    informerFactory := informers.NewSharedInformerFactory(clientset, 0)
    // Use filtered factory to only cache pods with sandbox-name label
    requirement, err := labels.NewRequirement(SandboxNameLabelKey, selection.Exists, nil)
    if err != nil {
        return nil, fmt.Errorf("failed to create label requirement: %w", err)
    }
    labelSelector := labels.NewSelector().Add(*requirement)

    informerFactory := informers.NewFilteredSharedInformerFactory(
        clientset,
        0, // resync period
        metav1.NamespaceAll,
        func(opts *metav1.ListOptions) {
            // Filter to only watch pods with sandbox-name label
            opts.LabelSelector = labelSelector.String()
        },
    )
Copilot AI (Jan 28, 2026):
NewK8sClient now constructs a filtered Pod informer cache (label selector exists requirement). There are unit tests for GetSandboxPodIP, but no tests asserting that the informer factory is actually filtering Pods as intended; please add a test that creates a fake clientset + informer factory and verifies only labeled Pods are observed/listed to prevent regressions.
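A test along the lines Copilot requests might look like the following sketch (in a _test.go file), assuming a fake clientset and the sandbox-name label key; this is not code from the PR:

```go
package example

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

func TestFilteredInformerOnlyCachesSandboxPods(t *testing.T) {
	// Seed the fake clientset with one labeled and one unlabeled pod.
	clientset := fake.NewSimpleClientset(
		&corev1.Pod{ObjectMeta: metav1.ObjectMeta{
			Name:      "sandbox-pod",
			Namespace: "default",
			Labels:    map[string]string{"sandbox-name": "demo"},
		}},
		&corev1.Pod{ObjectMeta: metav1.ObjectMeta{
			Name:      "unrelated-pod",
			Namespace: "default",
		}},
	)

	// Same style of filtering as the PR: a bare key parses as an "exists" selector.
	factory := informers.NewFilteredSharedInformerFactory(
		clientset, 0, metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			opts.LabelSelector = "sandbox-name"
		},
	)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	pods, err := podLister.List(labels.Everything())
	if err != nil {
		t.Fatalf("listing pods from the informer cache: %v", err)
	}
	if len(pods) != 1 || pods[0].Name != "sandbox-pod" {
		t.Fatalf("expected only the labeled sandbox pod in the cache, got %d pods", len(pods))
	}
}
```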
Force-pushed from 2aa45e8 to 9c8ba87
@hzxuzhonghu I've noticed something interesting: commit 9c8ba87 was passing in CI when I first pushed it, but now when I hard reset to that exact same commit, the E2E test is failing with a 500 Internal Server Error. Can you tell me how I can fix it? What could the issue be?
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Force-pushed from 86346d6 to 9c8ba87
hzxuzhonghu left a comment
I am not sure if the e2e failure is a flake
From the error image, it seems the image pull failed.
Earlier it was giving a 500 status code, which means the error was different; now it is giving a 502 status code on the same commit. So what should I do now? Should I ignore it? I am a bit confused and not able to figure out what the real problem is.
@YaoZengzeng shed some light on this failure.
Force-pushed from 56c4f46 to 4fc0d13
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
@hzxuzhonghu I've reverted the Cache section in main.go and implemented your suggested change to use labels.NewRequirement with selection.Exists for the label selector. I also investigated the failing TestCodeInterpreterWarmPool test and I believe it isn't related to my changes. The test was recently added to upstream main in commit bec1446, and when I rebased my branch it started failing. I tested locally both with my changes and with them reverted, and the test fails the same way in both cases: it can't find the warmpool pods in the default namespace. The Python CodeInterpreter tests all pass, which suggests my changes aren't breaking anything. Let me know if any more changes are required.

Closes #46
What does this PR do?
Overview
This document describes how to test the pod cache optimization that reduces memory usage in the workload-manager by implementing selective caching for pods.
Problem Statement
By default, controller-runtime caches all resources we access in full, which costs much more memory than needed. For example, when reading pods, we only care about pods created for sandboxes (those with the `sandbox-name` label), but controller-runtime would cache ALL pods in the cluster locally.
Solution Implemented
The solution implements a filtered cache using controller-runtime's `cache.Options` with `ByObject` configuration. This ensures that only pods with the `sandbox-name` label are cached, significantly reducing the memory footprint.
Changes Made
File: `cmd/workload-manager/main.go`
Added filtered cache configuration to the controller manager:
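As a rough sketch of the described configuration (the constant value, error handling, and surrounding manager setup are assumptions, not the PR's exact code):

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SandboxNameLabelKey mirrors the label key discussed in this PR.
const SandboxNameLabelKey = "sandbox-name"

func main() {
	// Match any pod that carries the sandbox label, whatever its value.
	requirement, err := labels.NewRequirement(SandboxNameLabelKey, selection.Exists, nil)
	if err != nil {
		os.Exit(1)
	}
	labelSelector := labels.NewSelector().Add(*requirement)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Only pods matching the selector are watched and cached;
			// other resource types keep the default cache behaviour.
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Pod{}: {Label: labelSelector},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // controller registration and mgr.Start(...) omitted for brevity
}
```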
Added required imports:
- `corev1 "k8s.io/api/core/v1"`
- `"k8s.io/apimachinery/pkg/labels"`
- `"sigs.k8s.io/controller-runtime/pkg/cache"`
- `"sigs.k8s.io/controller-runtime/pkg/client"`
How It Works
Label-Based Filtering
All sandbox pods are labeled with `sandbox-name` in the workload builder. The filtered cache uses this label as a selector, ensuring only relevant pods are cached.
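As a rough illustration of that labeling (a made-up helper; the real workload builder differs):

```go
package builder

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildSandboxPod is a hypothetical helper showing only the labeling aspect:
// every sandbox pod carries the sandbox-name label, which the filtered cache
// later uses as its selector.
func buildSandboxPod(sandboxName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   sandboxName + "-pod",
			Labels: map[string]string{"sandbox-name": sandboxName},
		},
	}
}
```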
Pod Access Points
The workload-manager accesses pods in the following locations:
- `pkg/workloadmanager/k8s_client.go` - `GetSandboxPodIP()`:
  - `podLister.Pods(namespace).Get(podName)` for direct lookup
  - `podLister.Pods(namespace).List(labels.SelectorFromSet(map[string]string{"sandbox-name": sandboxName}))` for label-based lookup
- `pkg/workloadmanager/garbage_collection.go`
Memory Usage Validation
To validate the filtered pod cache optimization, I tested the memory usage of `workloadmanager` in a Kind cluster with many non-sandbox pods.
Test Setup
Results
Baseline
After optimization
Conclusion
Memory usage increases only slightly with sandbox pods and does not scale
with the total number of non-sandbox pods in the cluster, confirming that
the filtered cache behaves as intended.