
Conversation

@mahil-2040

Closes #46

What does this PR do?

Overview

This PR implements a pod cache optimization that reduces memory usage in the workload-manager by caching only sandbox-related pods, and documents how the change was validated.

Problem Statement

controller-runtime maintains a full local cache for every resource type we access, which costs far more memory than needed. For example, when reading pods we only care about pods created for sandboxes (those with the sandbox-name label), but controller-runtime would cache ALL pods in the cluster locally.
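
For reference, a minimal sketch of the default, unfiltered setup that produces this behavior (the helper name is illustrative, not the project's actual main.go):

    import (
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // With default Options, the first Pod read through the manager's cached
    // client starts an informer that lists and watches every Pod in the cluster.
    func newUnfilteredManager() (ctrl.Manager, error) {
        return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    }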

Solution Implemented

The solution implements a filtered cache using controller-runtime's cache.Options with ByObject configuration. This ensures that only pods with the sandbox-name label are cached, significantly reducing memory footprint.

Changes Made

File: cmd/workload-manager/main.go

  1. Added filtered cache configuration to the controller manager:

    Cache: cache.Options{
        ByObject: map[client.Object]cache.ByObject{
            &corev1.Pod{}: {
                Label: labels.SelectorFromSet(labels.Set{
                    "sandbox-name": "", // Only cache pods with this label
                }),
            },
        },
    }
  2. Added required imports:

    • corev1 "k8s.io/api/core/v1"
    • "k8s.io/apimachinery/pkg/labels"
    • "sigs.k8s.io/controller-runtime/pkg/cache"
    • "sigs.k8s.io/controller-runtime/pkg/client"

How It Works

Label-Based Filtering

All sandbox pods are labeled with sandbox-name in the workload builder:

// From pkg/workloadmanager/workload_builder.go
sandbox.Spec.PodTemplate.ObjectMeta.Labels["sandbox-name"] = params.sandboxName

The filtered cache uses this label as a selector, ensuring only relevant pods are cached.

Pod Access Points

The workload-manager accesses pods in the following locations:

  1. pkg/workloadmanager/k8s_client.go - GetSandboxPodIP():

    • Uses podLister.Pods(namespace).Get(podName) for direct lookup
    • Uses podLister.Pods(namespace).List(labels.SelectorFromSet(map[string]string{"sandbox-name": sandboxName})) for label-based lookup
    • Both methods will work correctly with the filtered cache (see the sketch after this list)
  2. pkg/workloadmanager/garbage_collection.go:

    • Does NOT directly access pods
    • Works with Sandbox/SandboxClaim resources only
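
For item 1, a minimal sketch of the two lister-backed lookup paths (the helper name and error handling are illustrative, not the actual GetSandboxPodIP implementation):

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/labels"
        corelisters "k8s.io/client-go/listers/core/v1"
    )

    // lookupSandboxPodIP is a hypothetical helper showing both cache-backed paths.
    func lookupSandboxPodIP(podLister corelisters.PodLister, namespace, podName, sandboxName string) (string, error) {
        // Direct lookup by pod name.
        if pod, err := podLister.Pods(namespace).Get(podName); err == nil {
            return pod.Status.PodIP, nil
        }

        // Fallback: label-based lookup on the sandbox-name label.
        selector := labels.SelectorFromSet(map[string]string{"sandbox-name": sandboxName})
        pods, err := podLister.Pods(namespace).List(selector)
        if err != nil {
            return "", err
        }
        if len(pods) == 0 {
            return "", fmt.Errorf("no pod found for sandbox %q", sandboxName)
        }
        return pods[0].Status.PodIP, nil
    }

Both calls are served from the informer cache, so as long as the filter keeps every pod carrying the sandbox-name label, neither path changes behavior.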

Memory Usage Validation

To validate the filtered pod cache optimization, I tested memory usage of
workloadmanager in a Kind cluster with many non-sandbox pods.

Test Setup

  • ~60 total pods across multiple namespaces (non-sandbox)
  • 5–6 sandbox pods created via CodeInterpreter sessions
  • WorkloadManager deployed with filtered controller-runtime cache
  • Memory measured directly from the Kind node process (metrics-server is unstable in Kind)

Results

Baseline

[Screenshot: workload-manager RSS at baseline]

After optimization

[Screenshot: workload-manager RSS after optimization]

| Scenario           | Sandbox Pods | RSS Memory |
| ------------------ | ------------ | ---------- |
| Baseline           | 0            | ~42 MB     |
| After optimization | 5–6          | ~44 MB     |

Conclusion

Memory usage increases only slightly with sandbox pods and does not scale
with the total number of non-sandbox pods in the cluster, confirming that
the filtered cache behaves as intended.

Copilot AI review requested due to automatic review settings January 25, 2026 17:21
@volcano-sh-bot
Contributor

Welcome @mahil-2040! It looks like this is your first PR to volcano-sh/agentcube 🎉

@gemini-code-assist
Contributor

Summary of Changes

Hello @mahil-2040, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses excessive memory usage in the workload-manager by introducing a selective caching mechanism for Kubernetes Pods. Previously, the system would cache all Pods in the cluster, leading to unnecessary memory consumption. The implemented solution configures the controller manager to only cache Pods relevant to sandboxes, identified by a specific label, thereby optimizing resource utilization and ensuring the manager's memory footprint remains stable regardless of the total number of non-sandbox Pods.

Highlights

  • Memory Optimization: Implemented a filtered cache for Kubernetes Pods within the workload-manager to significantly reduce memory consumption.
  • Selective Caching: The cache now only stores Pods that possess the sandbox-name label, avoiding caching of all cluster Pods.
  • Configuration Update: The controller-runtime manager in cmd/workload-manager/main.go was updated with cache.Options to enable ByObject filtering for corev1.Pod.
  • Validation: Memory usage tests in a Kind cluster confirmed that the optimization effectively prevents memory growth with an increasing number of non-sandbox pods.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great optimization to reduce memory usage by filtering the pod cache. The pull request description is very clear and provides good validation of the improvement.

However, I've found a critical issue in the implementation of the label selector. The current implementation using labels.SelectorFromSet will only match pods with the label sandbox-name and an empty value, which is not the intention and would cause the cache to not work for sandbox pods. I've provided a suggestion to fix this.

@codecov-commenter

codecov-commenter commented Jan 25, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.65%. Comparing base (845b798) to head (4fc0d13).
⚠️ Report is 30 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/workloadmanager/k8s_client.go | 9.09% | 10 Missing ⚠️ |
| pkg/workloadmanager/workload_builder.go | 0.00% | 3 Missing ⚠️ |
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #172      +/-   ##
==========================================
+ Coverage   35.60%   35.65%   +0.04%     
==========================================
  Files          29       29              
  Lines        2533     2558      +25     
==========================================
+ Hits          902      912      +10     
- Misses       1505     1520      +15     
  Partials      126      126              
| Flag | Coverage Δ |
| --- | --- |
| unittests | 35.65% <7.14%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Pull request overview

Adds a filtered Pod cache configuration to the workload-manager’s controller-runtime manager to reduce memory usage by avoiding caching unrelated cluster Pods.

Changes:

  • Configure controller-runtime manager cache filtering for corev1.Pod using a label selector.
  • Add a new E2E runner script for Kind-based testing (Go + Python).
  • Add new hack scripts for copyright header injection and client-go code generation.

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cmd/workload-manager/main.go | Adds controller-runtime cache ByObject filtering for Pods to reduce cache size. |
| test/e2e/run_e2e.sh | Introduces a full Kind + Helm + port-forward based E2E runner for Go and Python tests. |
| hack/update-copyright.sh | Adds a helper script to prepend boilerplate headers to Go/Python files. |
| hack/update-codegen.sh | Adds a helper script to run Kubernetes code-generator for client-go/. |

@mahil-2040 mahil-2040 force-pushed the feat/optimize-pod-cache branch from 458caa6 to 052933c on January 25, 2026 18:04
Copilot AI review requested due to automatic review settings January 25, 2026 18:19
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 5 changed files in this pull request and generated 1 comment.

@mahil-2040
Author

@hzxuzhonghu can you please review this PR?

Copilot AI review requested due to automatic review settings January 27, 2026 05:54
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@mahil-2040
Author

@hzxuzhonghu I have implemented all the suggestions, can you check it now?

metav1.NamespaceAll,
func(opts *metav1.ListOptions) {
    // Filter to only watch pods with sandbox-name label
    opts.LabelSelector = SandboxNameLabelKey
Member

requirement, err := labels.NewRequirement(workloadmanager.SandboxNameLabelKey, selection.Exists, nil)
opts.LabelSelector = requirement.String()

Author

I've updated the client-go informer filter to use the same explicit approach.

Comment on lines 91 to 95
Cache: cache.Options{
    ByObject: map[client.Object]cache.ByObject{
        &corev1.Pod{}: {
            Label: labelSelector,
        },
Member

Can you explain why we add the pod cache?

Author

Sure! The pod cache is needed because the workload-manager needs to look up pod IPs when routing requests to sandbox sessions.

When a session is created, we need to find the corresponding pod's IP address to forward traffic to it. The GetSandboxPodIP function (in k8s_client.go) uses the pod lister to find pods by their sandbox-name label.

Without caching, we'd have to query the Kubernetes API server every time we need a pod IP, which would be slow and put unnecessary load on the API server. The cache keeps a local copy of pod information so lookups are fast.

The optimization here is that instead of caching ALL pods in the cluster (which wastes memory), we only cache the pods we actually care about - the ones with the sandbox-name label.
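
For context, a sketch of the manager-level cache filter being discussed here, using the exists-style selector suggested above (variable names and surrounding setup are illustrative; the thread later reverts this manager-cache change in favor of the client-go informer filter):

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
        "k8s.io/apimachinery/pkg/selection"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func newFilteredManager() (ctrl.Manager, error) {
        // Match any pod that carries the sandbox-name label, whatever its value.
        requirement, err := labels.NewRequirement("sandbox-name", selection.Exists, nil)
        if err != nil {
            return nil, err
        }
        labelSelector := labels.NewSelector().Add(*requirement)

        return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Cache: cache.Options{
                ByObject: map[client.Object]cache.ByObject{
                    // Only pods matching the selector are listed, watched, and cached.
                    &corev1.Pod{}: {Label: labelSelector},
                },
            },
        })
    }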

Member

I think GetSandboxPodIP is already using the podLister.

Member

Don't miss this: I think we can revert this change @mahil-2040


Author

Yes, I am working on this. Give me some time and I will revert the changes.

// WorkloadNameLabelKey labels key for workload name
WorkloadNameLabelKey = "runtime.agentcube.io/workload-name"
// SandboxNameLabelKey labels key for sandbox name
SandboxNameLabelKey = "sandbox-name" // revive:disable-line:var-naming
Member

Suggested change:
- SandboxNameLabelKey = "sandbox-name" // revive:disable-line:var-naming
+ SandboxNameLabelKey = "runtime.agentcube.io/sandbox-name"

Author

@hzxuzhonghu I’m concerned this change could break backward compatibility. Copilot is suggesting that while "sandbox-name" is already used by existing sandbox pods, changing it to "runtime.agentcube.io/sandbox-name" would cause the filtered cache to miss existing pods, and GetSandboxPodIP may fail during rolling upgrades. Should we keep "sandbox-name" to maintain compatibility, or do you have a migration strategy in mind?

Member

Sooner or later we should update; don't worry about compatibility now, we have not released yet.

Author

So I should implement this change now, right, ignoring the already existing pods?

Member

I think we can do it now.

Author

OK, I will implement it and push the changes.

Copilot AI review requested due to automatic review settings January 28, 2026 05:07
@mahil-2040
Author

mahil-2040 commented Jan 28, 2026

@hzxuzhonghu Thanks for the suggestion about using "runtime.agentcube.io/sandbox-name"!

I want to clarify that the "sandbox-name" label key already existed in the codebase before this PR - I didn't introduce it. This PR only:

  • Extracted the hardcoded string into a constant (SandboxNameLabelKey)
  • Used that constant consistently across the codebase
  • Applied it as a filter for the pod cache

I think changing the label value from "sandbox-name" to "runtime.agentcube.io/sandbox-name" can break backward compatibility with any existing sandbox pods that are already labeled with "sandbox-name". Since this is just a refactoring/optimization PR focused on reducing memory usage, I kept the existing label value to avoid introducing breaking changes.

Would you prefer I create a separate follow-up PR to migrate the label key to the namespaced format with proper deprecation handling once this PR is merged?

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

    os.Exit(1)
}
labelSelector := labels.NewSelector().Add(*requirement)

Copilot AI Jan 28, 2026

The blank line after building labelSelector contains trailing whitespace. This will be normalized by gofmt and can cause noisy diffs; please run gofmt / remove the trailing whitespace.

Comment on lines 120 to 137
- // Create informer factory for core resources (Pods, etc.)
- informerFactory := informers.NewSharedInformerFactory(clientset, 0)
+ // Use filtered factory to only cache pods with sandbox-name label
+ requirement, err := labels.NewRequirement(SandboxNameLabelKey, selection.Exists, nil)
+ if err != nil {
+     return nil, fmt.Errorf("failed to create label requirement: %w", err)
+ }
+ labelSelector := labels.NewSelector().Add(*requirement)
+
+ informerFactory := informers.NewFilteredSharedInformerFactory(
+     clientset,
+     0, // resync period
+     metav1.NamespaceAll,
+     func(opts *metav1.ListOptions) {
+         // Filter to only watch pods with sandbox-name label
+         opts.LabelSelector = labelSelector.String()
+     },
+ )
Copilot AI Jan 28, 2026

NewK8sClient now constructs a filtered Pod informer cache (label selector exists requirement). There are unit tests for GetSandboxPodIP, but no tests asserting that the informer factory is actually filtering Pods as intended; please add a test that creates a fake clientset + informer factory and verifies only labeled Pods are observed/listed to prevent regressions.

Copilot generated this review using guidance from repository custom instructions.
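
A lightweight starting point for such a test sketch, assuming it lives in the workloadmanager package so SandboxNameLabelKey is in scope. Note that the stock fake clientset does not necessarily apply label selectors when listing, so this sketch asserts on the selector carried by the informer's LIST request rather than on the cached contents:

    import (
        "testing"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/selection"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes/fake"
        k8stesting "k8s.io/client-go/testing"
    )

    func TestFilteredPodInformerUsesSandboxSelector(t *testing.T) {
        clientset := fake.NewSimpleClientset()

        // Capture the label selector attached to the informer's pod LIST calls.
        selectors := make(chan string, 8)
        clientset.PrependReactor("list", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) {
            selectors <- action.(k8stesting.ListAction).GetListRestrictions().Labels.String()
            return false, nil, nil // fall through to the default reactor
        })

        requirement, err := labels.NewRequirement(SandboxNameLabelKey, selection.Exists, nil)
        if err != nil {
            t.Fatal(err)
        }
        want := labels.NewSelector().Add(*requirement).String()

        factory := informers.NewFilteredSharedInformerFactory(
            clientset, 0, metav1.NamespaceAll,
            func(opts *metav1.ListOptions) { opts.LabelSelector = want },
        )
        factory.Core().V1().Pods().Informer() // register the pod informer

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)

        select {
        case got := <-selectors:
            if got != want {
                t.Fatalf("pod LIST used selector %q, want %q", got, want)
            }
        case <-time.After(2 * time.Second):
            t.Fatal("informer never issued a pod LIST")
        }
    }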
@mahil-2040 mahil-2040 force-pushed the feat/optimize-pod-cache branch from 2aa45e8 to 9c8ba87 on January 28, 2026 07:47
@mahil-2040
Author

@hzxuzhonghu I've noticed something interesting: commit 9c8ba87 was passing in CI when I first pushed it, but now when I hard reset to that exact same commit, the E2E test is failing with a 500 Internal Server Error.
[Screenshot: E2E test failure showing 500 Internal Server Error]

Can you tell me how I can fix it? What could be the issue?

Copilot AI review requested due to automatic review settings January 28, 2026 13:48
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

@mahil-2040 mahil-2040 force-pushed the feat/optimize-pod-cache branch from 86346d6 to 9c8ba87 on January 28, 2026 13:57
Member

@hzxuzhonghu hzxuzhonghu left a comment


I am not sure if the e2e failure is a flake

@hzxuzhonghu
Member

==> Deploying Redis (redis:7-alpine)
namespace/agentcube created
Pulling image 'redis:7-alpine'...
Error response from daemon: received unexpected HTTP status: 502 Bad Gateway
Cleaning up...
Killing any remaining kubectl port-forward processes...

From the error image, it seems the image pull failed

@mahil-2040
Author

mahil-2040 commented Jan 29, 2026

Earlier it was giving a 500 status code, which means the error was different; now it is giving a 502 status code on the same commit. So what should I do now, should I ignore it? I am a bit confused and not able to figure out what the real problem is.

@hzxuzhonghu
Member

@YaoZengzeng shed some light on this failure

=== RUN   TestCodeInterpreterWarmPool
    e2e_test.go:590: Applying e2e_code_interpreter_warmpool.yaml...
    e2e_test.go:626: Waiting for warmpool to be created with 2 pods...
    e2e_test.go:636: Executing code command...
    e2e_test.go:643: 
        	Error Trace:	/home/runner/work/agentcube/agentcube/test/e2e/e2e_test.go:643
        	            				/home/runner/work/agentcube/agentcube/test/e2e/e2e_test.go:597
        	Error:      	Received unexpected error:
        	            	python script failed: 500 Server Error: Internal Server Error for url: http://localhost:8080/v1/code-interpreter
        	Test:       	TestCodeInterpreterWarmPool
    e2e_test.go:603: Cleaning up code interpreter resources...
--- FAIL: TestCodeInterpreterWarmPool (7.19s)
FAIL
FAIL	github.com/volcano-sh/agentcube/test/e2e	94.911s

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@mahil-2040
Author

@hzxuzhonghu I've reverted the Cache section in main.go and implemented your suggested change to use labels.NewRequirement with selection.Exists for the label selector.

I also investigated the failing TestCodeInterpreterWarmPool test, and I believe it isn't related to my changes. The test was recently added to upstream main in commit bec1446, and it started failing when I rebased my branch. I tested locally both with my changes and with them reverted, and the test fails the same way in both cases: it can't find the warmpool pods in the default namespace. The Python CodeInterpreter tests all pass, which suggests my changes aren't breaking anything.

Let me know if any more changes are required.



Development

Successfully merging this pull request may close these issues.

Replace controller-runtime with raw k8s informer
