feat: Implement SandboxWarmPool recreate on template updates (#347)
shrutiyam-glitch wants to merge 4 commits into kubernetes-sigs:main
Conversation
dhenkel92 left a comment:
Thank you for working on this feature. It's painful to roll out changes to a warm pool right now 🙂
```go
// Pod belongs to this warmpool - check if it's stale
if tmplErr == nil && r.isPodStale(&pod, currentHash) {
	log.Info("Deleting stale pod from pool", "pod", pod.Name)
	if err := r.Delete(ctx, &pod); err != nil {
		log.Error(err, "Failed to delete stale pod", "pod", pod.Name)
		allErrors = errors.Join(allErrors, err)
	}
	continue
}
```
suggestion: Use a different rollout strategy
The current implementation would mean that all pods get deleted at the same time, so the pool has no available resources until new pods are started. This means that changing the sandbox template might cause unnecessary latency spikes for upstream applications and defeat the purpose of the warm pool.
We should find a better strategy for rotating these pods. Maybe a hardcoded percentage of unavailable pods, or something more opinionated like the rollout mechanism of a deployment.
I see, and I agree. Something more robust for updates would be to include an UpdateStrategy struct, similar to Deployments, so that admins can configure maxUnavailable. I'd defer to @janetkuo on the best Kubernetes path here.
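For illustration, a minimal sketch of what such an API could look like. This is a hypothetical shape modeled on the Deployment rolling-update strategy, not code from this PR; the names `UpdateStrategy` and `allowedUnavailable` are invented for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// UpdateStrategy is a hypothetical API shape, modeled on the
// Deployment rollingUpdate strategy; it is NOT part of this PR.
type UpdateStrategy struct {
	// Type is "Recreate" or "RollingUpdate".
	Type string
	// MaxUnavailable is an absolute number ("2") or a percentage
	// ("25%") of warm pods that may be deleted at once in a rollout.
	MaxUnavailable string
}

// allowedUnavailable resolves MaxUnavailable against the pool size,
// rounding percentages up so the rollout can always make progress.
func allowedUnavailable(s UpdateStrategy, poolSize int) (int, error) {
	if s.Type == "Recreate" {
		return poolSize, nil // delete everything at once
	}
	v := s.MaxUnavailable
	if strings.HasSuffix(v, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(v, "%"))
		if err != nil {
			return 0, err
		}
		n := (poolSize*pct + 99) / 100 // round up
		if n < 1 {
			n = 1
		}
		return n, nil
	}
	return strconv.Atoi(v)
}

func main() {
	s := UpdateStrategy{Type: "RollingUpdate", MaxUnavailable: "25%"}
	n, _ := allowedUnavailable(s, 10)
	fmt.Println(n) // 3: 25% of a 10-pod pool, rounded up
}
```

Under this shape, "Recreate" stays the trivial case (the whole pool may be rotated at once), while a rolling strategy bounds how many warm pods are missing at any moment.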
Please see the original discussion here #34 (comment):
The primary purpose of a rolling update strategy in standard Kubernetes workloads is to ensure service continuity by gradually replacing old pods with new ones without causing downtime. However, since the pods in a warm pool are, by definition, unclaimed and not serving live traffic, the risk of disruption is not a factor. You can either make it immutable, or support a simpler update mechanism.
When a template spec is updated, the (warm) pods created from the warmpool no longer match what sandboxes need, so they're not useable anyways. We should start with "recreate", and potentially add other rollout strategies if we see other use cases.
> ... by definition, unclaimed and not serving live traffic ...
I’m not sure about that claim. The idea of a warm pool is to have resources pre-warmed before they’re needed, so you don’t have to pay the latency of creating/scheduling the pod, downloading the image, and setting up the microVM. If you build applications on top of sandboxes that, for example, require claiming a sandbox per execution, then the warm pool is on the critical path of the application.
In my opinion, if you can’t rely on the pool’s availability, it defeats the purpose of the warm pool. End-to-end application latency becomes unpredictable, and developers have to start thinking about when and how to deploy updates.
The current behavior: updating a SandboxTemplate doesn't update old pods; only new pods use the updated template spec. When a sandbox claims from the pool, which pod gets adopted from the mixed-state warm pool is essentially non-deterministic (incorrect).
The new behavior we're proposing: Avoid the mixed-state warm pool situation. Updating SandboxTemplate should cause the pool to only contain the pods that match the template spec, sandbox should be able to deterministically only adopt pods that match the template.
Given that old/outdated pods aren't and won't be used, they should be scaled down as soon as possible. Note that claimed pods will disappear from the pool. Warmed pods aren't and shouldn't be in use before they're claimed.
Could we make this PR the "recreate" strategy, and create a separate issue for a "rollingupdate" strategy?
SGTM. Let’s discuss during next Monday’s meeting.
/retest
igooch left a comment:
Overall good work automating the warmpool pod recreation via the new spec hash label.
A few minor suggestions for performance and edge cases in the inline comments.
```go
// Pod belongs to this warmpool - check if it's stale
if tmplErr == nil && r.isPodStale(&pod, currentHash) {
	log.Info("Deleting stale pod from pool", "pod", pod.Name)
	if err := r.Delete(ctx, &pod); err != nil {
```
A SandboxClaim will try to claim stale pods while deletion is in progress, because tryAdoptPodFromPool only filters by template name. Recommend updating the SandboxClaim pod-selection logic to also verify the spec hash.
Missed this case! Thanks.
Added the check for this in sandboxclaim_controller.
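The claim-side check discussed above can be sketched roughly as follows. This is an illustrative, dependency-free sketch, not the PR's actual code: `podInfo` stands in for a `corev1.Pod`, and `selectClaimablePods` is an invented name for the filtering step inside the claim controller.

```go
package main

import "fmt"

const specHashLabel = "sandbox-template-spec-hash" // label introduced by this PR

// podInfo is a minimal stand-in for a corev1.Pod, so the sketch runs
// without Kubernetes dependencies.
type podInfo struct {
	Name   string
	Labels map[string]string
}

// selectClaimablePods mirrors the extra check added to the claim
// controller: besides filtering by template name (omitted here), only
// pods whose spec-hash label matches the template's current hash are
// claimable, so stale pods pending deletion are never adopted.
func selectClaimablePods(pods []podInfo, currentHash string) []podInfo {
	var out []podInfo
	for _, p := range pods {
		if p.Labels[specHashLabel] == currentHash {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []podInfo{
		{Name: "warm-1", Labels: map[string]string{specHashLabel: "abc123"}},
		{Name: "warm-2", Labels: map[string]string{specHashLabel: "old999"}}, // stale
	}
	fmt.Println(len(selectClaimablePods(pods, "abc123"))) // 1
}
```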
This feature would be really helpful for our use case! Looking forward to the merge. 👍
Fixes #323

This PR implements the rollout logic for `SandboxWarmPool` when its associated `SandboxTemplate` is updated. The implementation adopts a "Recreate" strategy for idle resources while ensuring that active, claimed `Sandboxes` remain uninterrupted.

**Key Changes:**

- **Controller Watches:** Updated `SandboxWarmPoolReconciler` to watch `SandboxTemplate` resources. It uses an `EnqueueRequestsFromMapFunc` to identify and reconcile all warmpools referencing a modified template.
- **Template Spec Hashing:** Introduced a `sandbox-template-spec-hash` label. This fingerprint allows the controller to distinguish between pods running the current template version versus stale versions.
- **Rotation Logic:** The reconciliation loop now filters for pods that are both "stale" (hash mismatch) and "unclaimed." Stale idle pods are deleted, triggering the standard replenishment logic to create fresh pods with the new spec.
- **Efficiency:** The `SandboxTemplate` is fetched once per reconciliation cycle to avoid `N+1` API calls during pod filtering.

**Testing Performed:**

- Updating the `SandboxTemplate` image triggers the deletion of idle pods in the associated `SandboxWarmPool`.
- Pods claimed by a `Sandbox` are NOT deleted during a template update.
- Changing the `templateRef` (with new spec) on the `SandboxWarmPool` itself triggers a full pool rotation.
- Changing the `templateRef` (with old spec) on the `SandboxWarmPool` itself triggers a full pool rotation.

**Steps followed:**

1. Created a `SandboxTemplate`, `SandboxWarmpool` and `SandboxClaim`. Pod `python-sdk-warmpool-nb2px` is adopted by the `sandbox-claim`.
2. Updated `python-counter-template` and applied it. The unclaimed warmpool pods are recreated with the new updated template spec.
3. Updated `spec.sandboxTemplateRef` in the `SandboxWarmPool` manifest to use the sandbox template `pct`. All the unclaimed pods are by default recreated.
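The spec-hash labeling described under Key Changes can be sketched as follows. This is a minimal, dependency-free illustration of the general technique (a truncated SHA-256 of the serialized spec, as the Deployment controller's `pod-template-hash` does with FNV); `templateSpec` and `specHash` are invented names, and the real controller hashes the actual `SandboxTemplate` spec struct.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// templateSpec is a stand-in for the SandboxTemplate spec; the real
// controller would hash the actual spec type.
type templateSpec struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// specHash produces a short, deterministic fingerprint of the spec,
// suitable as the value of the sandbox-template-spec-hash label. Any
// change to the spec yields a different hash, which is what marks old
// pods as stale during reconciliation.
func specHash(spec templateSpec) string {
	raw, _ := json.Marshal(spec) // json.Marshal sorts map keys, so this is deterministic
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum[:8]) // 16 hex chars, well under the 63-char label limit
}

func main() {
	a := specHash(templateSpec{Image: "python:3.12"})
	b := specHash(templateSpec{Image: "python:3.13"})
	fmt.Println(a != b) // true: an image change rotates the pool
}
```

The reconciler would compare this label on each warm pod against the hash of the currently referenced template, and `isPodStale` reduces to a string comparison.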