feat: support inplace CPU resize for warm pool sandboxes#228
Conversation
767e305 to
1d13c3b
Compare
b3b25d4 to
87510f7
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #228 +/- ##
==========================================
+ Coverage 67.75% 69.28% +1.52%
==========================================
Files 112 112
Lines 7689 8106 +417
==========================================
+ Hits 5210 5616 +406
- Misses 2185 2193 +8
- Partials 294 297 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
87510f7 to
9ececb9
Compare
9ececb9 to
73c494a
Compare
8f0e5ed to
167d42c
Compare
|
@furykerry @zmberg ready to review, thanks! |
167d42c to
81938f3
Compare
c85baf0 to
6ecb3bf
Compare
6ecb3bf to
026e3ab
Compare
90d431e to
95e0097
Compare
95ba4af to
1e0b47e
Compare
189be43 to
e61dcaa
Compare
d75c962 to
369649a
Compare
3c2d542 to
db5c963
Compare
Signed-off-by: PersistentJZH <zhihao.kan17@gmail.com> Add the ability to resize CPU requests/limits on warm pool sandboxes during claim time without pod recreation, using the Kubernetes pod resize sub-resource API. - Introduce SandboxClaimInPlaceCPUResizeGate (default: true) to control the feature in both SandboxClaim controller and E2B server. - Implement pod inplace resize logic when Claim sandbox. - Fix json potentially unsafe quoting in CodeQL scan. - Update test/kind-conf.yaml to enable InPlacePodVerticalScaling feature. # Conflicts: # pkg/controller/sandboxclaim/core/common_control.go
Signed-off-by: PersistentJZH <zhihao.kan17@gmail.com> - Fail fast on cpu resize. - Shorten cpu resize related extension key names. - Add resize subresource fallback for K8s < 1.33. - Watch pod.Status.Resize field changes in pod event handler for K8s 1.27-1.32. - Sync Sandbox Ready condition with Pod Ready regardless of inplace update outcome. - Block checkSandboxReady during InplaceUpdating state to avoid premature ready signal before resize completes. - Enhance E2E tests: verify actual pod spec resources after resize.
…antics Signed-off-by: PersistentJZH <zhihao.kan17@gmail.com> - add RetryOnConflict for pod resize update and regenerate resize body with latest resourceVersion - set Sandbox InplaceUpdate condition status to False on Failed, keep Succeeded as True - avoid overwriting Ready condition transition metadata when pod ready status is unchanged - move CPU resize feature-gate check from buildClaimOptions to EnsureClaimClaiming precondition - remove unused `pods/resize` permission from sandbox-manager RBAC - clean up/add some comments - add cpu resize and image upgrade failed e2e cases
Signed-off-by: PersistentJZH <zhihao.kan17@gmail.com> - mark inplace update in-progress per successful sub-step - change feature gate name to SandboxInplaceUpdateReasonInplaceUpdating
db5c963 to
dcb00ed
Compare
| return infra.ClaimSandboxOptions{}, fmt.Errorf("resources must specify at least one of requests or limits") | ||
| } | ||
| for _, rl := range []corev1.ResourceList{res.Requests, res.Limits} { | ||
| if cpu, ok := rl[corev1.ResourceCPU]; ok { |
There was a problem hiding this comment.
this check should be generic enough and not limited to cpu resources
| if pod.Spec.Resources != nil { | ||
| processResourceList(requests, pod.Spec.Resources.Requests) | ||
| processResourceList(limits, pod.Spec.Resources.Limits) | ||
| qosLimitResources := getQOSResources(pod.Spec.Resources.Limits) |
There was a problem hiding this comment.
rename qosLimitResources to qosLimitsFound to make the name consistent with the one in else clause
| } | ||
| utils.SetSandboxCondition(newStatus, cond) | ||
|
|
||
| // Update ready condition to in-progress |
There was a problem hiding this comment.
can we not update the sandbox ready condition to false?
| if tmpl, ok := templateContainers[afterPod.Spec.Containers[i].Name]; ok { | ||
| afterPod.Spec.Containers[i].Resources = tmpl.Resources | ||
| } | ||
| } |
There was a problem hiding this comment.
besides qos changes, can we also check whether limit > request here in case the user only specify request, but the request is larger than the limit in the warm pool
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: furykerry The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Ⅰ. Describe what this PR does
Ⅱ. Does this pull request fix one issue?
#68
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
TODO