feat: Implement SandboxWarmPool recreate on template updates (#347)
shrutiyam-glitch wants to merge 4 commits into kubernetes-sigs:main
Conversation
dhenkel92 left a comment:
Thank you for working on this feature. It's painful to roll out changes to a warm pool right now 🙂
```go
// Pod belongs to this warmpool - check if it's stale
if tmplErr == nil && r.isPodStale(&pod, currentHash) {
	log.Info("Deleting stale pod from pool", "pod", pod.Name)
	if err := r.Delete(ctx, &pod); err != nil {
		log.Error(err, "Failed to delete stale pod", "pod", pod.Name)
		allErrors = errors.Join(allErrors, err)
	}
	continue
}
```
suggestion: Use a different rollout strategy
The current implementation would mean that all pods get deleted at the same time, so the pool has no available resources until new pods are started. This means that changing the sandbox template might cause unnecessary latency spikes for upstream applications and defeat the purpose of the warm pool.
We should find a better strategy for rotating these pods. Maybe a hardcoded percentage of unavailable pods, or something more opinionated like the rollout mechanism of a deployment.
I see, and I agree. Something more robust for updates would be to include an UpdateStrategy struct, similar to Deployments, so that admins can configure maxUnavailable. I'd defer to @janetkuo on the best Kubernetes path here.
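For illustration, a minimal sketch of what such an API could look like. This is a hypothetical shape modeled on the Deployment rolling-update strategy, not code from this PR; the names `UpdateStrategy` and `allowedUnavailable` are invented for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// UpdateStrategy is a hypothetical API shape, modeled on the
// Deployment rollingUpdate strategy; it is NOT part of this PR.
type UpdateStrategy struct {
	// Type is "Recreate" or "RollingUpdate".
	Type string
	// MaxUnavailable is an absolute number ("2") or a percentage
	// ("25%") of warm pods that may be deleted at once in a rollout.
	MaxUnavailable string
}

// allowedUnavailable resolves MaxUnavailable against the pool size,
// rounding percentages up so the rollout can always make progress.
func allowedUnavailable(s UpdateStrategy, poolSize int) (int, error) {
	if s.Type == "Recreate" {
		return poolSize, nil // delete everything at once
	}
	v := s.MaxUnavailable
	if strings.HasSuffix(v, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(v, "%"))
		if err != nil {
			return 0, err
		}
		n := (poolSize*pct + 99) / 100 // round up
		if n < 1 {
			n = 1
		}
		return n, nil
	}
	return strconv.Atoi(v)
}

func main() {
	s := UpdateStrategy{Type: "RollingUpdate", MaxUnavailable: "25%"}
	n, _ := allowedUnavailable(s, 10)
	fmt.Println(n) // 3: 25% of a 10-pod pool, rounded up
}
```

Under this shape, "Recreate" stays the trivial case (the whole pool may be rotated at once), while a rolling strategy bounds how many warm pods are missing at any moment.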
Please see the original discussion here #34 (comment):
The primary purpose of a rolling update strategy in standard Kubernetes workloads is to ensure service continuity by gradually replacing old pods with new ones without causing downtime. However, since the pods in a warm pool are, by definition, unclaimed and not serving live traffic, the risk of disruption is not a factor. You can either make it immutable, or support a simpler update mechanism.
When a template spec is updated, the (warm) pods created from the warmpool no longer match what sandboxes need, so they're not useable anyways. We should start with "recreate", and potentially add other rollout strategies if we see other use cases.
> ... by definition, unclaimed and not serving live traffic ...
I’m not sure about that claim. The idea of a warm pool is to have resources pre-warmed before they’re needed, so you don’t have to pay the latency of creating/scheduling the pod, downloading the image, and setting up the microVM. If you build applications on top of sandboxes that, for example, require claiming a sandbox per execution, then the warm pool is on the critical path of the application.
In my opinion, if you can’t rely on the pool’s availability, it defeats the purpose of the warm pool. End-to-end application latency becomes unpredictable, and developers have to start thinking about when and how to deploy updates.
The current behavior: updating a SandboxTemplate doesn't update old pods; only new pods use the updated template spec. When a sandbox claims from the pool, which pod gets adopted from the mixed-state warm pool is essentially non-deterministic (incorrect).
The new behavior we're proposing: Avoid the mixed-state warm pool situation. Updating SandboxTemplate should cause the pool to only contain the pods that match the template spec, sandbox should be able to deterministically only adopt pods that match the template.
Given that old/outdated pods aren't and won't be used, they should be scaled down as soon as possible. Note that claimed pods will disappear from the pool. Warmed pods aren't and shouldn't be in use before they're claimed.
Could we make this PR the "recreate" strategy, and create a separate issue for a "rollingupdate" strategy?
SGTM. Let’s discuss during next Monday’s meeting.
/retest
igooch left a comment:
Overall good work automating the warmpool pod recreation via the new spec hash label.
A few minor suggestions for performance and edge cases in the inline comments.
```go
// Pod belongs to this warmpool - check if it's stale
if tmplErr == nil && r.isPodStale(&pod, currentHash) {
	log.Info("Deleting stale pod from pool", "pod", pod.Name)
	if err := r.Delete(ctx, &pod); err != nil {
```
A SandboxClaim will try to claim stale pods while deletion is in progress, because tryAdoptPodFromPool only filters by template name. Recommend updating the SandboxClaim pod-selection logic to also verify the spec hash.
Missed this case! Thanks.
Added the check for this in sandboxclaim_controller.
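The claim-side check discussed above can be sketched roughly as follows. This is an illustrative, dependency-free sketch, not the PR's actual code: `podInfo` stands in for a `corev1.Pod`, and `selectClaimablePods` is an invented name for the filtering step inside the claim controller.

```go
package main

import "fmt"

const specHashLabel = "sandbox-template-spec-hash" // label introduced by this PR

// podInfo is a minimal stand-in for a corev1.Pod, so the sketch runs
// without Kubernetes dependencies.
type podInfo struct {
	Name   string
	Labels map[string]string
}

// selectClaimablePods mirrors the extra check added to the claim
// controller: besides filtering by template name (omitted here), only
// pods whose spec-hash label matches the template's current hash are
// claimable, so stale pods pending deletion are never adopted.
func selectClaimablePods(pods []podInfo, currentHash string) []podInfo {
	var out []podInfo
	for _, p := range pods {
		if p.Labels[specHashLabel] == currentHash {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []podInfo{
		{Name: "warm-1", Labels: map[string]string{specHashLabel: "abc123"}},
		{Name: "warm-2", Labels: map[string]string{specHashLabel: "old999"}}, // stale
	}
	fmt.Println(len(selectClaimablePods(pods, "abc123"))) // 1
}
```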
This feature would be really helpful for our use case! Looking forward to the merge. 👍
Fixes #323

This PR implements the rollout logic for `SandboxWarmPool` when its associated `SandboxTemplate` is updated. The implementation adopts a "Recreate" strategy for idle resources while ensuring that active, claimed `Sandboxes` remain uninterrupted.

**Key Changes:**

- **Controller Watches:** Updated `SandboxWarmPoolReconciler` to watch `SandboxTemplate` resources. It uses an `EnqueueRequestsFromMapFunc` to identify and reconcile all warmpools referencing a modified template.
- **Template Spec Hashing:** Introduced a `sandbox-template-spec-hash` label. This fingerprint allows the controller to distinguish between pods running the current template version versus stale versions.
- **Rotation Logic:** The reconciliation loop now filters for pods that are both "stale" (hash mismatch) and "unclaimed." Stale idle pods are deleted, triggering the standard replenishment logic to create fresh pods with the new spec.
- **Efficiency:** The `SandboxTemplate` is fetched once per reconciliation cycle to avoid `N+1` API calls during pod filtering.

**Testing Performed:**

- Updating the `SandboxTemplate` image triggers the deletion of idle pods in the associated `SandboxWarmPool`.
- Pods claimed by a `Sandbox` are NOT deleted during a template update.
- Changing the `templateRef` (with new spec) on the `SandboxWarmPool` itself triggers a full pool rotation.
- Changing the `templateRef` (with old spec) on the `SandboxWarmPool` itself triggers a full pool rotation.

**Steps followed:**

1. Created a `SandboxTemplate`, `SandboxWarmpool` and `SandboxClaim`. Pod `python-sdk-warmpool-nb2px` is adopted by the `sandbox-claim`.
2. Updated `python-counter-template` and applied it. The unclaimed warmpool pods are recreated with the new updated template spec.
3. Updated `spec.sandboxTemplateRef` in the `SandboxWarmPool` manifest to use the sandbox template `pct`. All the unclaimed pods are by default recreated.
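The spec-hash labeling described under Key Changes can be sketched as follows. This is a minimal, dependency-free illustration of the general technique (a truncated SHA-256 of the serialized spec, as the Deployment controller's `pod-template-hash` does with FNV); `templateSpec` and `specHash` are invented names, and the real controller hashes the actual `SandboxTemplate` spec struct.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// templateSpec is a stand-in for the SandboxTemplate spec; the real
// controller would hash the actual spec type.
type templateSpec struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// specHash produces a short, deterministic fingerprint of the spec,
// suitable as the value of the sandbox-template-spec-hash label. Any
// change to the spec yields a different hash, which is what marks old
// pods as stale during reconciliation.
func specHash(spec templateSpec) string {
	raw, _ := json.Marshal(spec) // json.Marshal sorts map keys, so this is deterministic
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum[:8]) // 16 hex chars, well under the 63-char label limit
}

func main() {
	a := specHash(templateSpec{Image: "python:3.12"})
	b := specHash(templateSpec{Image: "python:3.13"})
	fmt.Println(a != b) // true: an image change rotates the pool
}
```

The reconciler would compare this label on each warm pod against the hash of the currently referenced template, and `isPodStale` reduces to a string comparison.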