feat: Implement ProgressDeadlineSeconds for Sandbox Resource #307
igooch wants to merge 5 commits into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: igooch. Needs approval from an approver in each of the listed files.

Hi @igooch. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test.

/ok-to-test
```go
var deadlineRequeue time.Duration
// We only check the deadline if the sandbox is not yet in a "Ready" state and hasn't expired.
// TODO: Only check if the Sandbox is in a "Pending" status PR#121
if !expired && !isSandboxReady(sandbox) {
```
Nice addition overall. Small concern: this deadline check runs whenever Ready is false, even after a sandbox was previously Ready. Since elapsed time is from CreationTimestamp, a transient later NotReady could be marked ProgressDeadlineExceeded immediately. Would it make sense to gate this to initial provisioning only (or skip once Ready has ever been true)?
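One possible shape for that gate, sketched with simplified stand-in types rather than the project's actual API (the `ReadyOnce` status field is hypothetical, something the controller would record the first time Ready becomes true):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for the Sandbox API types.
type SandboxStatus struct {
	// ReadyOnce would be set by the controller the first time the
	// Ready condition becomes true (assumed field, for illustration).
	ReadyOnce bool
}

type Sandbox struct {
	CreationTimestamp time.Time
	Status            SandboxStatus
}

// shouldCheckDeadline gates the progress-deadline check to initial
// provisioning: once a sandbox has ever been Ready, a later transient
// NotReady no longer trips ProgressDeadlineExceeded.
func shouldCheckDeadline(s *Sandbox, expired, ready bool) bool {
	return !expired && !ready && !s.Status.ReadyOnce
}

func main() {
	s := &Sandbox{CreationTimestamp: time.Now()}
	fmt.Println(shouldCheckDeadline(s, false, false)) // still provisioning: true
	s.Status.ReadyOnce = true
	fmt.Println(shouldCheckDeadline(s, false, false)) // was Ready before: false
}
```

This keeps the deadline semantics of "time to become ready for the first time" even though elapsed time is still measured from CreationTimestamp.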
api/v1alpha1/sandbox_types.go (Outdated)
```diff
 // Lifecycle defines when and how the sandbox should be shut down.
 // +optional
-Lifecycle `json:",inline"`
+Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
```
I might be missing context, but switching from inlined lifecycle fields to spec.lifecycle looks like a breaking API shape change. Existing manifests using spec.shutdownTime/spec.shutdownPolicy may stop working after upgrade. Should we keep backward compatibility (or document migration/versioning) here?
I'd changed it to match SandboxClaim, although I don't have a particular preference. I've changed it back to the original inline implementation.
@vicentefb do you know what the history is between the difference in Lifecycle between SandboxClaim and Sandbox? Is there a reason one is a pointer and the other inline?
```go
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
if sandboxMarkedExpired(sandbox) || sandboxStalled(sandbox) {
```
Question: once Reason=ProgressDeadlineExceeded we return early forever. Is that intended as a hard terminal state even after spec updates? If yes, a short comment/doc note might help set expectations.
Because of limited status tracking in the Sandbox resource, the progress deadline is currently calculated from the CreationTimestamp. Even without the early return above this results in a terminal stalled state that persists even if the spec is updated or the underlying pod becomes ready. While using condition transition times could allow the resource to recover, it may lead to unintended timer resets.
This creates a terminal lock-in where the resource will never be reconciled again, even if the user fixes the underlying pod spec or if a transient infrastructure issue (e.g. node, network) resolves itself.
Given that this takes inspiration from Deployment controller, let's see how it's done there. In Deployment controller, spec.progressDeadlineSeconds is used to handle a stuck Deployment, but Deployment controller continues reconciling the Deployment even after its progressDeadlineSeconds has passed. Ref https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec:
The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.
If progress resumes (e.g., pods become Ready after a transient infra issue), the controller updates the Progressing condition to Status: True with Reason: NewRSAvailableReason.
(force-pushed from 4ad22fa to 5d7b025)
janetkuo left a comment:

This change stops a sandbox from making progress even if it becomes ready later, which isn't ideal. Please take a look at the review comments.
```go
type Lifecycle struct {

	// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox to become ready.
	// Defaults to 600 seconds.
```
Introducing this new default is a breaking change. Any sandbox that takes >600s to become ready will be marked failed even if it would eventually become ready.
True, the default here is 600 to be consistent with the default in Deployments. Should I remove the default?
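For context, if the default is kept, one conventional way to declare it (assuming kubebuilder markers, as typical controller-runtime projects use; field name and types are a sketch, not the project's actual code) is on the API field itself, which keeps the choice easy to change or drop later:

```go
// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox to
// become ready before the Ready condition is set to False with reason
// ProgressDeadlineExceeded. The 600s value matches the Deployment default.
// +optional
// +kubebuilder:default=600
ProgressDeadlineSeconds *int32 `json:"progressDeadlineSeconds,omitempty"`
```

Using a pointer with `+optional` also makes "no deadline" expressible by leaving the field unset if the default is removed instead.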
I agree with @janetkuo: you should not stop reconciling even after ProgressDeadlineSeconds has passed.
```go
// This keeps the controller code simple.
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
```
Please remove/rephrase this line. As I commented in #121, using phase is a legacy approach and now an anti-pattern in Kubernetes.
(force-pushed from 07dab61 to 68f85b5)
(force-pushed from 68f85b5 to 6e87e3d)
(force-pushed from 6e87e3d to 03c5614)
@igooch: The following test failed.

Full PR test history and your PR dashboard are linked above. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
```go
}

// TODO: This logic will need to be updated when Sandbox pause / resume is implemented. Issue #36.
elapsed := time.Since(sandbox.CreationTimestamp.Time)
```
This is not consistent with Deployment: you are counting time from creation, but the Deployment controller measures the duration from the last progress event.
This change introduces a `ProgressDeadlineSeconds` field to the Sandbox resource's `Lifecycle` spec, allowing administrators to define a maximum duration for a Sandbox to reach a "Ready" state during provisioning. Similar to Kubernetes Deployments, if this deadline is exceeded, the Sandbox will be marked as "Ready=False" with the reason `ProgressDeadlineExceeded`.

The primary goals of this feature are:

- Surface a `ProgressDeadlineExceeded` reason and a descriptive message, providing immediate insight into provisioning failures.

Working on #271
Notes for reviewer:

- …`ProgressDeadlineExceeded`, similar to how they are deleted in `handleSandboxExpiry`.
- `ProgressDeadlineSeconds` defaults to `600` seconds (10 minutes), which is a somewhat arbitrary choice and could be changed.