feat: Implement ProgressDeadlineSeconds for Sandbox Resource#307

Open
igooch wants to merge 5 commits into kubernetes-sigs:main from igooch:progressdeadline

Conversation

@igooch (Contributor) commented Feb 10, 2026

This change introduces a ProgressDeadlineSeconds field to the Sandbox resource's Lifecycle spec, allowing administrators to define a maximum duration for a Sandbox to reach a "Ready" state during provisioning. Similar to Kubernetes Deployments, if this deadline is exceeded, the Sandbox will be marked as "Ready=False" with the reason ProgressDeadlineExceeded.

The primary goals of this feature are:

  • Keep the Resource: The Sandbox object itself is retained in the cluster, but its status clearly indicates failure.
  • Conserve Resources & Bandwidth: The controller will stop actively reconciling the stalled Sandbox, preventing unnecessary API calls to the Kubernetes API server and stopping reconciliation attempts for child resources (Pods, Services).
  • Clear User Feedback: The Sandbox's Status Conditions will reflect the ProgressDeadlineExceeded reason and a descriptive message, providing immediate insight into provisioning failures.
  • Improved Observability: This provides a clear point for emitting Sandbox creation "failure" metrics (not yet implemented), enabling better monitoring of Sandbox provisioning reliability.

Working on #271

Notes for reviewer:

  • This does not delete the Sandbox, nor any underlying Pod and/or Service, meaning a Pod may still become Ready after the Sandbox has been marked "Ready=False". An alternative would be to delete the Pod and/or Service once ProgressDeadlineExceeded is set, similar to how they are deleted in handleSandboxExpiry.
  • ProgressDeadlineSeconds defaults to 600 seconds (10 minutes), which is a somewhat arbitrary choice and could be changed.

netlify bot commented Feb 10, 2026

Deploy Preview for agent-sandbox canceled.

🔨 Latest commit: 03c5614
🔍 Latest deploy log: https://app.netlify.com/projects/agent-sandbox/deploys/699c9b4289d5ce0008b3a591

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: igooch
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
k8s-ci-robot (Contributor) commented:

Hi @igooch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 10, 2026
vicentefb (Member) commented:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
```go
var deadlineRequeue time.Duration
// We only check the deadline if the sandbox is not yet in a "Ready" state and hasn't expired.
// TODO: Only check if the Sandbox is in a "Pending" status PR#121
if !expired && !isSandboxReady(sandbox) {
```
Contributor:

Nice addition overall. Small concern: this deadline check runs whenever Ready is false, even after a sandbox was previously Ready. Since elapsed time is from CreationTimestamp, a transient later NotReady could be marked ProgressDeadlineExceeded immediately. Would it make sense to gate this to initial provisioning only (or skip once Ready has ever been true)?
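One way to implement the suggested gate, sketched with a hypothetical helper over a simplified condition history (the real API keeps a single condition per type with transition times, so a dedicated status field tracking "has ever been Ready" might be needed instead):

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
}

// hasEverBeenReady reports whether a Ready=True condition was ever recorded,
// which could gate the deadline check to initial provisioning only:
// once the sandbox has been Ready, later transient NotReady states would
// not trigger ProgressDeadlineExceeded.
func hasEverBeenReady(history []Condition) bool {
	for _, c := range history {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	history := []Condition{
		{Type: "Ready", Status: "False"},
		{Type: "Ready", Status: "True"},
	}
	fmt.Println(hasEverBeenReady(history))
}
```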

```go
// Lifecycle defines when and how the sandbox should be shut down.
// +optional
Lifecycle `json:",inline"`
Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
```
Contributor:

I might be missing context, but switching from inlined lifecycle fields to spec.lifecycle looks like a breaking API shape change. Existing manifests using spec.shutdownTime/spec.shutdownPolicy may stop working after upgrade. Should we keep backward compatibility (or document migration/versioning) here?

Contributor (author):

I'd changed it to match with SandboxClaim although I don't have a particular preference. I changed it back to the original inline implementation.

```go
Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
```

@vicentefb do you know what the history is between the difference in Lifecycle between SandboxClaim and Sandbox? Is there a reason one is a pointer and the other inline?

```go
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
if sandboxMarkedExpired(sandbox) || sandboxStalled(sandbox) {
```
Contributor:
Question: once Reason=ProgressDeadlineExceeded we return early forever. Is that intended as a hard terminal state even after spec updates? If yes, a short comment/doc note might help set expectations.

Contributor (author):
Because of limited status tracking in the Sandbox resource, the progress deadline is currently calculated from the CreationTimestamp. Even without the early return above this results in a terminal stalled state that persists even if the spec is updated or the underlying pod becomes ready. While using condition transition times could allow the resource to recover, it may lead to unintended timer resets.

Contributor (author):
@janetkuo @barney-s do you have a preference on the behavior of the Sandbox reconciler after the Progress Deadline Exceeded is hit?

Member:
This creates a terminal lock-in where the resource will never be reconciled again, even if the user fixes the underlying pod spec or if a transient infrastructure issue (e.g. node, network) resolves itself.

Given that this takes inspiration from Deployment controller, let's see how it's done there. In Deployment controller, spec.progressDeadlineSeconds is used to handle a stuck Deployment, but Deployment controller continues reconciling the Deployment even after its progressDeadlineSeconds has passed. Ref https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec:

The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.

If progress resumes (e.g., pods become Ready after a transient infra issue), the controller updates the Progressing condition to Status: True with Reason: NewRSAvailableReason.
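The recovery behavior described here could be sketched as a simple condition upsert. These are illustrative types only; the real controllers use apimachinery condition helpers rather than hand-rolled slices:

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition upserts a condition by type, mirroring how the Deployment
// controller flips its Progressing condition back to True when progress
// resumes instead of leaving the failure reason in place forever.
func setCondition(conds []Condition, newCond Condition) []Condition {
	for i, c := range conds {
		if c.Type == newCond.Type {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}

func main() {
	conds := []Condition{{Type: "Ready", Status: "False", Reason: "ProgressDeadlineExceeded"}}
	// Pods became Ready after a transient issue: progress resumed, so the
	// failure reason is replaced rather than treated as terminal.
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "SandboxReady"})
	fmt.Println(conds[0].Status, conds[0].Reason)
}
```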

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2026
janetkuo (Member) left a comment:
This change stops reconciling a sandbox even if it becomes ready later, which isn't ideal. Please take a look at the review comments.

```go
type Lifecycle struct {

// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox to become ready.
// Defaults to 600 seconds.
```
Member:
Introducing this new default is a breaking change: any Sandbox that takes longer than 600s to become ready will be marked failed, even if it would eventually become ready.

Contributor (author):
True, the default here is 600 to be consistent with the default in Deployments. Should I remove the default?

Member:
I agree with @janetkuo; we should not stop reconciling even after ProgressDeadlineSeconds has passed.
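One non-breaking option for the default concern would be an optional pointer field, where nil means no deadline is enforced. This is a sketch with illustrative names, not the PR's actual API:

```go
package main

import "fmt"

// Lifecycle sketches one way to avoid a breaking default: make the field
// an optional pointer, with nil meaning "no deadline" (illustrative names).
type Lifecycle struct {
	// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox
	// to become ready. When nil, no deadline is enforced.
	// +optional
	ProgressDeadlineSeconds *int64 `json:"progressDeadlineSeconds,omitempty"`
}

// deadlineEnabled reports whether a deadline should be checked at all.
func deadlineEnabled(l *Lifecycle) bool {
	return l != nil && l.ProgressDeadlineSeconds != nil
}

func main() {
	// No lifecycle at all: deadline checks are skipped entirely.
	fmt.Println(deadlineEnabled(nil))
	// Explicitly opted in: the controller enforces the deadline.
	secs := int64(600)
	fmt.Println(deadlineEnabled(&Lifecycle{ProgressDeadlineSeconds: &secs}))
}
```

With this shape, existing manifests keep their current behavior and only users who set the field opt in to deadline enforcement.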

```go
// This keeps the controller code simple.
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
```
Member:
Please remove/rephrase this line. As I commented in #121, using phase is a legacy approach and now an anti-pattern in Kubernetes.


@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 20, 2026
k8s-ci-robot (Contributor) commented:

@igooch: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: presubmit-agent-sandbox-e2e-test
Commit: 03c5614
Required: yes
Rerun command: /test presubmit-agent-sandbox-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

```go
}

// TODO: This logic will need to be updated when Sandbox pause / resume is implemented. Issue #36.
elapsed := time.Since(sandbox.CreationTimestamp.Time)
```
Member:
This is not consistent with Deployment: you are counting time from creation, but the Deployment controller measures the duration since the last progress event.
