
feat: Implement SandboxWarmPool recreate on template updates #347

Open
shrutiyam-glitch wants to merge 4 commits into kubernetes-sigs:main from shrutiyam-glitch:swp-rollout

Conversation

shrutiyam-glitch (Contributor) commented Feb 26, 2026

Fixes #323

This PR implements the rollout logic for SandboxWarmPool when its associated SandboxTemplate is updated.

The implementation adopts a "Recreate" strategy for idle resources while ensuring that active, claimed Sandboxes remain uninterrupted.

Key Changes:

  • Controller Watches: Updated SandboxWarmPoolReconciler to watch SandboxTemplate resources. It uses an EnqueueRequestsFromMapFunc to identify and reconcile all warmpools referencing a modified template.

  • Template Spec Hashing: Introduced a sandbox-template-spec-hash label. This fingerprint allows the controller to distinguish between pods running the current template version versus stale versions.

  • Rotation Logic: The reconciliation loop now filters for pods that are both "stale" (hash mismatch) and "unclaimed." Stale idle pods are deleted, triggering the standard replenishment logic to create fresh pods with the new spec.

  • Efficiency: The SandboxTemplate is fetched once per reconciliation cycle to avoid N+1 API calls during pod filtering.
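The controller-watch change in the first bullet can be illustrated with a stdlib-only analogue of the map function. This is a hedged sketch: the `warmPool` type and `requestsForTemplate` helper are hypothetical stand-ins, and the real code returns `reconcile.Request` values through controller-runtime's `EnqueueRequestsFromMapFunc` rather than names.

```go
package main

import "fmt"

// warmPool is a hypothetical stand-in for the SandboxWarmPool object;
// TemplateRef mirrors spec.sandboxTemplateRef.
type warmPool struct {
	Name        string
	TemplateRef string
}

// requestsForTemplate mimics the watch's map function: given the name of
// a SandboxTemplate that changed, return the warm pools that reference it
// and therefore need to be reconciled.
func requestsForTemplate(pools []warmPool, templateName string) []string {
	var reqs []string
	for _, wp := range pools {
		if wp.TemplateRef == templateName {
			reqs = append(reqs, wp.Name)
		}
	}
	return reqs
}

func main() {
	pools := []warmPool{
		{Name: "python-sdk-warmpool", TemplateRef: "python-counter-template"},
		{Name: "other-pool", TemplateRef: "pct"},
	}
	// Only the pool referencing the edited template is enqueued.
	fmt.Println(requestsForTemplate(pools, "python-counter-template"))
}
```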

Testing Performed:

  • Verified that updating a SandboxTemplate image triggers the deletion of idle pods in the associated SandboxWarmPool.
  • Verified that pods already bound to a Sandbox are NOT deleted during a template update.
  • Verified that changing the templateRef on the SandboxWarmPool to a template with a new spec triggers a full pool rotation.
  • Verified that changing the templateRef on the SandboxWarmPool to a template with the old spec also triggers a full pool rotation.
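The "stale" checks verified above hinge on the spec-hash fingerprint behind the sandbox-template-spec-hash label. A minimal stdlib-only sketch of such a fingerprint (the function name and hashing details here are illustrative assumptions, not the PR's actual code; the approach is in the spirit of the pod-template-hash label on Deployments):

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// templateSpecHash fingerprints a template spec. json.Marshal sorts map
// keys, so equal specs always produce equal hashes.
func templateSpecHash(spec map[string]string) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err) // a plain string map always marshals
	}
	h := fnv.New32a()
	h.Write(b)
	return fmt.Sprintf("%x", h.Sum32())
}

func main() {
	v1 := map[string]string{"image": "python:3.11"}
	v1again := map[string]string{"image": "python:3.11"}
	v2 := map[string]string{"image": "python:3.12"}
	// Same spec, same hash: the pod is current and kept.
	fmt.Println(templateSpecHash(v1) == templateSpecHash(v1again))
	// Changed spec, different hash: unclaimed pods are rotated.
	fmt.Println(templateSpecHash(v1) == templateSpecHash(v2))
}
```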

Steps followed:

  • Created the SandboxTemplate, SandboxWarmPool, and SandboxClaim resources:
$ kubectl get sandboxtemplate,sandboxwarmpool,sandboxclaim,pods -n sandbox-test
NAME                                                                 AGE
sandboxtemplate.extensions.agents.x-k8s.io/pct                       23h
sandboxtemplate.extensions.agents.x-k8s.io/python-counter-template   25h

NAME                                                             READY   AGE
sandboxwarmpool.extensions.agents.x-k8s.io/python-sdk-warmpool   2       54s

NAME                                                    AGE
sandboxclaim.extensions.agents.x-k8s.io/sandbox-claim   29s

NAME                                            READY   STATUS    RESTARTS   AGE
pod/python-sdk-warmpool-62wpl                   1/1     Running   0          29s
pod/python-sdk-warmpool-dm6vq                   1/1     Running   0          54s
pod/python-sdk-warmpool-nb2px                   1/1     Running   0          54s

Pod python-sdk-warmpool-nb2px is adopted by the sandbox-claim.

  • Updated the spec of python-counter-template and applied it.
$ kubectl get pods -n sandbox-test
NAME                                            READY   STATUS    RESTARTS   AGE
pod/python-sdk-warmpool-8c5tx                   1/1     Running   0          11s
pod/python-sdk-warmpool-g5wk8                   1/1     Running   0          11s
pod/python-sdk-warmpool-nb2px                   1/1     Running   0          4m25s

The unclaimed warmpool pods are recreated with the new updated template spec.

  • Updated spec.sandboxTemplateRef in the SandboxWarmPool manifest to use the sandbox template pct. All unclaimed pods are recreated by default.
$ kubectl get pod -n sandbox-test
NAME                                        READY   STATUS    RESTARTS   AGE
python-sdk-warmpool-cc5tk                   1/1     Running   0          11s
python-sdk-warmpool-nb2px                   1/1     Running   0          8m36s
python-sdk-warmpool-qsvd9                   1/1     Running   0          11s

@netlify

netlify bot commented Feb 26, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 8e7c8b1
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69a8b7ed7156e60008ee080e

k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels on Feb 26, 2026
k8s-ci-robot (Contributor)

Hi @shrutiyam-glitch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) label on Feb 26, 2026
shrutiyam-glitch changed the title from "feat: Implement SandboxWarmPool rollout on template updates" to "feat: Implement SandboxWarmPool recreate on template updates" on Feb 26, 2026
dhenkel92 left a comment

Thank you for working on this feature. It's painful to roll out changes to a warm pool right now 🙂

Comment on lines +147 to +155
    // Pod belongs to this warmpool - check if it's stale
    if tmplErr == nil && r.isPodStale(&pod, currentHash) {
        log.Info("Deleting stale pod from pool", "pod", pod.Name)
        if err := r.Delete(ctx, &pod); err != nil {
            log.Error(err, "Failed to delete stale pod", "pod", pod.Name)
            allErrors = errors.Join(allErrors, err)
        }
        continue
    }


suggestion: Use a different rollout strategy

The current implementation would mean that all pods get deleted at the same time, so the pool has no available resources until new pods are started. This means that changing the sandbox template might cause unnecessary latency spikes for upstream applications and defeat the purpose of the warm pool.

We should find a better strategy for rotating these pods. Maybe a hardcoded percentage of unavailable pods, or something more opinionated like the rollout mechanism of a deployment.

Member

I see and I agree. Something more robust for updates would be to include an UpdateStrategy struct similar to deployments so that we allow admins to configure maxUnavailable. I'd defer to @janetkuo to know what's the best k8s path here.
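An UpdateStrategy struct along those lines could look roughly like the sketch below. This is purely illustrative of the idea being discussed, not the agreed design: all type and field names are hypothetical, and MaxUnavailable is an absolute count here, whereas Deployments also accept percentages via intstr.IntOrString.

```go
package main

import "fmt"

// UpdateStrategyType mirrors Deployment's strategy type field.
type UpdateStrategyType string

const (
	RecreateStrategy      UpdateStrategyType = "Recreate"
	RollingUpdateStrategy UpdateStrategyType = "RollingUpdate"
)

// RollingUpdateConfig bounds how many warm pods may be down at once.
type RollingUpdateConfig struct {
	MaxUnavailable int
}

// UpdateStrategy is the hypothetical knob an admin would set on the pool.
type UpdateStrategy struct {
	Type          UpdateStrategyType
	RollingUpdate *RollingUpdateConfig
}

// podsToRotate returns how many stale pods may be deleted in this
// reconcile cycle under the given strategy.
func podsToRotate(s UpdateStrategy, stale int) int {
	if s.Type == RecreateStrategy || s.RollingUpdate == nil {
		return stale // delete every stale pod at once
	}
	if stale < s.RollingUpdate.MaxUnavailable {
		return stale
	}
	return s.RollingUpdate.MaxUnavailable
}

func main() {
	fmt.Println(podsToRotate(UpdateStrategy{Type: RecreateStrategy}, 5))
	fmt.Println(podsToRotate(UpdateStrategy{
		Type:          RollingUpdateStrategy,
		RollingUpdate: &RollingUpdateConfig{MaxUnavailable: 2},
	}, 5))
}
```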

janetkuo (Member) commented Mar 3, 2026

Please see the original discussion here #34 (comment):

The primary purpose of a rolling update strategy in standard Kubernetes workloads is to ensure service continuity by gradually replacing old pods with new ones without causing downtime. However, since the pods in a warm pool are, by definition, unclaimed and not serving live traffic, the risk of disruption is not a factor. You can either make it immutable, or support a simpler update mechanism.

When a template spec is updated, the (warm) pods created from the warmpool no longer match what sandboxes need, so they're not usable anyway. We should start with "recreate", and potentially add other rollout strategies if we see other use cases.


... by definition, unclaimed and not serving live traffic, ....

I’m not sure about that claim. The idea of a warm pool is to have resources pre-warmed before they’re needed, so you don’t have to pay the latency of creating/scheduling the pod, downloading the image, and setting up the microVM. If you build applications on top of sandboxes that, for example, require claiming a sandbox per execution, then the warm pool is on the critical path of the application.

In my opinion, if you can’t rely on the pool availability, it defeats the purpose of the warm pool. End-to-end application latency becomes unpredictable, and developers have to start thinking about when and how to deploy updates.

Member

The current behavior: Updating SandboxTemplate won't get old pods to be updated, and only new pods will use updated template spec. When sandbox claims from the pool, it's essentially non-deterministic which pod gets adopted from a mixed-state warm pool (incorrect).

The new behavior we're proposing: Avoid the mixed-state warm pool situation. Updating SandboxTemplate should cause the pool to only contain the pods that match the template spec, sandbox should be able to deterministically only adopt pods that match the template.

Given that old/outdated pods aren't and won't be used, they should be scaled down as soon as possible. Note that claimed pods will disappear from the pool. Warmed pods aren't and shouldn't be in use before they're claimed.

Contributor

Could we make this PR the "recreate" strategy, and create a separate issue for a "rollingupdate" strategy?


SGTM. Let’s discuss during next Monday’s meeting.

janetkuo added the ok-to-test (indicates a non-member PR verified by an org member that is safe to test) label and removed the needs-ok-to-test label on Feb 27, 2026
shrutiyam-glitch (Contributor, Author)

/retest

k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shrutiyam-glitch
Once this PR has been reviewed and has the lgtm label, please assign vicentefb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

igooch (Contributor) left a comment


Overall good work automating the warmpool pod recreation via the new spec hash label.

A few minor suggestions for performance and edge cases in the inline comments.


    // Pod belongs to this warmpool - check if it's stale
    if tmplErr == nil && r.isPodStale(&pod, currentHash) {
        log.Info("Deleting stale pod from pool", "pod", pod.Name)
        if err := r.Delete(ctx, &pod); err != nil {
Contributor

A SandboxClaim may still claim stale pods while deletion is in progress, because tryAdoptPodFromPool only filters by template name. Recommend updating the SandboxClaim pod selection logic to also verify the spec hash.
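The suggested claim-side filter can be sketched with stdlib types only. The `adoptablePods` helper and `pod` struct are hypothetical illustrations of the idea, not the sandboxclaim_controller code; only the sandbox-template-spec-hash label name comes from this PR.

```go
package main

import "fmt"

// specHashLabel is the label this PR introduces on warm pool pods.
const specHashLabel = "sandbox-template-spec-hash"

// pod is a minimal stand-in for corev1.Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// adoptablePods keeps only pods whose spec hash matches the current
// template, so a claim never adopts a pod that is about to be rotated.
func adoptablePods(pods []pod, currentHash string) []pod {
	var out []pod
	for _, p := range pods {
		if p.Labels[specHashLabel] == currentHash {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "warm-a", Labels: map[string]string{specHashLabel: "abc"}},
		{Name: "warm-b", Labels: map[string]string{specHashLabel: "old"}},
	}
	// Only the up-to-date pod is eligible for adoption.
	for _, p := range adoptablePods(pods, "abc") {
		fmt.Println(p.Name)
	}
}
```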

Contributor (Author)

Missed this case! Thanks.
Added the check for this in sandboxclaim_controller.

@lyj7890

lyj7890 commented Mar 4, 2026

This feature would be really helpful for our use case! Looking forward to the merge. 👍
