feat: Implement ProgressDeadlineSeconds for Sandbox Resource#307

Open
igooch wants to merge 5 commits into kubernetes-sigs:main from igooch:progressdeadline

Conversation

@igooch (Contributor) commented Feb 10, 2026

This change introduces a ProgressDeadlineSeconds field to the Sandbox resource's Lifecycle spec, allowing administrators to define a maximum duration for a Sandbox to reach a "Ready" state during provisioning. Similar to Kubernetes Deployments, if this deadline is exceeded, the Sandbox will be marked as "Ready=False" with the reason ProgressDeadlineExceeded.

The primary goals of this feature are:

  • Keep the Resource: The Sandbox object itself is retained in the cluster, but its status clearly indicates failure.
  • Conserve Resources & Bandwidth: The controller will stop actively reconciling the stalled Sandbox, preventing unnecessary API calls to the Kubernetes API server and stopping reconciliation attempts for child resources (Pods, Services).
  • Clear User Feedback: The Sandbox's Status Conditions will reflect the ProgressDeadlineExceeded reason and a descriptive message, providing immediate insight into provisioning failures.
  • Improved Observability: This provides a clear point for emitting Sandbox creation "failure" metrics (not yet implemented), enabling better monitoring of Sandbox provisioning reliability.

Working on #271

Notes for reviewer:

  • This does not delete the Sandbox, nor any underlying Pod and/or Service, meaning a Pod may still become Ready after the Sandbox has been marked "Ready=False". An alternative would be to delete the Pod and/or Service once ProgressDeadlineExceeded is set, similar to how they are deleted in handleSandboxExpiry.
  • ProgressDeadlineSeconds defaults to 600 seconds (10 minutes), which is a somewhat arbitrary choice and could be changed.

netlify bot commented Feb 10, 2026

Deploy Preview for agent-sandbox canceled.

🔨 Latest commit: 03c5614
🔍 Latest deploy log: https://app.netlify.com/projects/agent-sandbox/deploys/699c9b4289d5ce0008b3a591

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: igooch
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
k8s-ci-robot (Contributor) commented:

Hi @igooch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 10, 2026
vicentefb (Member) commented:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
```go
var deadlineRequeue time.Duration
// We only check the deadline if the sandbox is not yet in a "Ready" state and hasn't expired.
// TODO: Only check if the Sandbox is in a "Pending" status PR#121
if !expired && !isSandboxReady(sandbox) {
```
Contributor:

Nice addition overall. Small concern: this deadline check runs whenever Ready is false, even after a sandbox was previously Ready. Since elapsed time is from CreationTimestamp, a transient later NotReady could be marked ProgressDeadlineExceeded immediately. Would it make sense to gate this to initial provisioning only (or skip once Ready has ever been true)?
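One way to implement the suggested gate, sketched with a hypothetical helper over a simplified condition history (the real API keeps a single condition per type with transition times, so a dedicated status field tracking "has ever been Ready" might be needed instead):

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
}

// hasEverBeenReady reports whether a Ready=True condition was ever recorded,
// which could gate the deadline check to initial provisioning only:
// once the sandbox has been Ready, later transient NotReady states would
// not trigger ProgressDeadlineExceeded.
func hasEverBeenReady(history []Condition) bool {
	for _, c := range history {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	history := []Condition{
		{Type: "Ready", Status: "False"},
		{Type: "Ready", Status: "True"},
	}
	fmt.Println(hasEverBeenReady(history))
}
```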

```go
// Lifecycle defines when and how the sandbox should be shut down.
// +optional
Lifecycle `json:",inline"`
Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
```
Contributor:

I might be missing context, but switching from inlined lifecycle fields to spec.lifecycle looks like a breaking API shape change. Existing manifests using spec.shutdownTime/spec.shutdownPolicy may stop working after upgrade. Should we keep backward compatibility (or document migration/versioning) here?

Contributor (author):

I'd changed it to match with SandboxClaim although I don't have a particular preference. I changed it back to the original inline implementation.

```go
Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
```

@vicentefb do you know what the history is between the difference in Lifecycle between SandboxClaim and Sandbox? Is there a reason one is a pointer and the other inline?

```go
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
if sandboxMarkedExpired(sandbox) || sandboxStalled(sandbox) {
```
Contributor:
Question: once Reason=ProgressDeadlineExceeded we return early forever. Is that intended as a hard terminal state even after spec updates? If yes, a short comment/doc note might help set expectations.

Contributor (author):
Because of limited status tracking in the Sandbox resource, the progress deadline is currently calculated from the CreationTimestamp. Even without the early return above this results in a terminal stalled state that persists even if the spec is updated or the underlying pod becomes ready. While using condition transition times could allow the resource to recover, it may lead to unintended timer resets.

Contributor (author):
@janetkuo @barney-s do you have a preference on the behavior of the Sandbox reconciler after the Progress Deadline Exceeded is hit?

Member:
This creates a terminal lock-in where the resource will never be reconciled again, even if the user fixes the underlying pod spec or if a transient infrastructure issue (e.g. node, network) resolves itself.

Given that this takes inspiration from Deployment controller, let's see how it's done there. In Deployment controller, spec.progressDeadlineSeconds is used to handle a stuck Deployment, but Deployment controller continues reconciling the Deployment even after its progressDeadlineSeconds has passed. Ref https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec:

The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.

If progress resumes (e.g., pods become Ready after a transient infra issue), the controller updates the Progressing condition to Status: True with Reason: NewRSAvailableReason.
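The recovery behavior described here could be sketched as a simple condition upsert. These are illustrative types only; the real controllers use apimachinery condition helpers rather than hand-rolled slices:

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition upserts a condition by type, mirroring how the Deployment
// controller flips its Progressing condition back to True when progress
// resumes instead of leaving the failure reason in place forever.
func setCondition(conds []Condition, newCond Condition) []Condition {
	for i, c := range conds {
		if c.Type == newCond.Type {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}

func main() {
	conds := []Condition{{Type: "Ready", Status: "False", Reason: "ProgressDeadlineExceeded"}}
	// Pods became Ready after a transient issue: progress resumed, so the
	// failure reason is replaced rather than treated as terminal.
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "SandboxReady"})
	fmt.Println(conds[0].Status, conds[0].Reason)
}
```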

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2026
janetkuo (Member) left a comment:
This change stops reconciling a sandbox even if it becomes ready later, which isn't ideal. Please take a look at the review comments.

```go
type Lifecycle struct {

// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox to become ready.
// Defaults to 600 seconds.
```
Member:
Introducing this new default is a breaking change: any Sandbox that takes longer than 600s to become ready will be marked failed, even if it would eventually become ready.

Contributor (author):
True, the default here is 600 to be consistent with the default in Deployments. Should I remove the default?

Member:
I agree with @janetkuo; we should not stop reconciling even after ProgressDeadlineSeconds has passed.
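One non-breaking option for the default concern would be an optional pointer field, where nil means no deadline is enforced. This is a sketch with illustrative names, not the PR's actual API:

```go
package main

import "fmt"

// Lifecycle sketches one way to avoid a breaking default: make the field
// an optional pointer, with nil meaning "no deadline" (illustrative names).
type Lifecycle struct {
	// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox
	// to become ready. When nil, no deadline is enforced.
	// +optional
	ProgressDeadlineSeconds *int64 `json:"progressDeadlineSeconds,omitempty"`
}

// deadlineEnabled reports whether a deadline should be checked at all.
func deadlineEnabled(l *Lifecycle) bool {
	return l != nil && l.ProgressDeadlineSeconds != nil
}

func main() {
	// No lifecycle at all: deadline checks are skipped entirely.
	fmt.Println(deadlineEnabled(nil))
	// Explicitly opted in: the controller enforces the deadline.
	secs := int64(600)
	fmt.Println(deadlineEnabled(&Lifecycle{ProgressDeadlineSeconds: &secs}))
}
```

With this shape, existing manifests keep their current behavior and only users who set the field opt in to deadline enforcement.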

```go
// This keeps the controller code simple.
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
```
Member:
Please remove/rephrase this line. As I commented in #121, using phase is a legacy approach and now an anti-pattern in Kubernetes.


@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 20, 2026
k8s-ci-robot (Contributor) commented:

@igooch: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: presubmit-agent-sandbox-e2e-test
Commit: 03c5614
Required: yes
Rerun command: /test presubmit-agent-sandbox-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

```go
}

// TODO: This logic will need to be updated when Sandbox pause / resume is implemented. Issue #36.
elapsed := time.Since(sandbox.CreationTimestamp.Time)
```
Member:
This is not consistent with Deployment: you are counting time from creation, but the Deployment controller measures the duration since the last progress event.
