
Conversation

@hzxuzhonghu
Member

No description provided.

Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
Copilot AI review requested due to automatic review settings November 21, 2025 03:25
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Contributor

Summary of Changes

Hello @hzxuzhonghu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to AgentCube Sandbox management by implementing an idle-aware pause and resume lifecycle. Instead of outright deleting idle sandbox pods, the system will now leverage the containerd API to pause their containers, thereby preserving user state and significantly reducing resource consumption. This change aims to improve developer experience by offering near-instantaneous sandbox resumption. The proposal thoroughly outlines the technical design, including how containers are paused and resumed, the mechanisms for triggering a resume (via annotations or NATS JetStream), and the necessary API and CRD modifications to support this new functionality.

Highlights

  • Idle-aware Sandbox Lifecycle: Introduces a new idle-aware lifecycle management for AgentCube Sandboxes, replacing deletion with pausing via the containerd API to save resources, preserve user state, and shorten time-to-first-byte.
  • Container Pausing and Resuming: Details the mechanism for pausing and resuming sandbox containers using the containerd API, ensuring all workload containers are affected while init and ephemeral containers remain unchanged.
  • Resume Triggers: Implements two methods for resuming paused sandboxes: an annotation-based approach for synchronous requests and a message-queue-based system (specifically NATS JetStream) for asynchronous and cross-cluster signals.
  • API and CRD Changes: Proposes new annotations (sandbox.lifecycle.volcano.sh/state, resume-requested-at, last-paused-at, last-resumed-at) and status conditions (Paused, ResumePending) to reflect the sandbox's new lifecycle states.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for a sandbox pause/resume feature. The document is well-structured and covers motivation, design details, API changes, and implementation plans in great detail. My review focuses on improving the clarity and consistency of the proposal. I've suggested a change to the control flow diagram to better represent the reconciliation logic and pointed out a minor inconsistency in timestamp formatting.

Comment on lines +275 to +292
flowchart TD
Start["Reconcile Sandbox"] --> CheckIdle{Idle timeout exceeded?}
CheckIdle -- "No" --> Exit["Requeue/Exit"]
CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
UpdateState --> EmitPaused["Emit SandboxPaused event"]
EmitPaused --> Exit
Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
WaitResume -- "Annotation" --> HandleAnnotation["Process resume annotation"]
WaitResume -- "Message queue" --> HandleMQ["Process MQ resume message"]
HandleAnnotation --> ResumeContainers
HandleMQ --> ResumeContainers
ResumeContainers["Resume containers via containerd"] --> RefreshActivity["Refresh last-activity timestamp"]
RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
EmitResumed --> Exit
Contributor

medium

The control flow diagram is a bit confusing as it shows two separate Start nodes for what should be a single reconciliation loop. A reconciler typically has one entry point and then branches based on the state of the resource. I suggest restructuring the diagram to reflect a single, unified flow, which will improve the clarity of the proposed design. The suggested diagram below first checks for a resume signal and then for idleness, which is a more common pattern in controller logic.

Suggested change
flowchart TD
Start["Reconcile Sandbox"] --> CheckIdle{Idle timeout exceeded?}
CheckIdle -- "No" --> Exit["Requeue/Exit"]
CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
UpdateState --> EmitPaused["Emit SandboxPaused event"]
EmitPaused --> Exit
Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
WaitResume -- "Annotation" --> HandleAnnotation["Process resume annotation"]
WaitResume -- "Message queue" --> HandleMQ["Process MQ resume message"]
HandleAnnotation --> ResumeContainers
HandleMQ --> ResumeContainers
ResumeContainers["Resume containers via containerd"] --> RefreshActivity["Refresh last-activity timestamp"]
RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
EmitResumed --> Exit
flowchart TD
Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
WaitResume -- "Yes (Annotation or MQ)" --> ResumeContainers["Resume containers via containerd"]
ResumeContainers --> RefreshActivity["Refresh last-activity timestamp"]
RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
EmitResumed --> Exit["Requeue/Exit"]
WaitResume -- "No" --> CheckIdle{Idle timeout exceeded?}
CheckIdle -- "No" --> Exit
CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
UpdateState --> EmitPaused["Emit SandboxPaused event"]
EmitPaused --> Exit

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive design proposal for implementing pause/resume functionality for AgentCube sandbox pods. Instead of deleting idle sandbox pods, the proposal advocates for pausing containers via containerd's native pause/resume API, which preserves user state while freeing CPU resources. The design includes both annotation-based and NATS JetStream message queue-based resume triggers, comprehensive observability, and a detailed implementation plan.

Key Changes:

  • Design document for sandbox pod lifecycle management with pause/resume capabilities via containerd
  • Introduction of NATS JetStream as the message queue for asynchronous resume operations
  • New annotation-based API for tracking sandbox state and triggering resume operations



1. Establish a containerd client:
- `client, err := containerd.New("/run/containerd/containerd.sock")`.
- Use a short timeout context (e.g. 5–10 seconds) to avoid blocking reconciles.
Copilot AI Nov 21, 2025

The timeout value of "5–10 seconds" for containerd operations may be insufficient for pause/resume operations under heavy node load. When a node is under memory pressure or high I/O load, containerd operations can take longer than 10 seconds. Additionally, if pausing/resuming multiple containers in sequence (as described in line 90 "For each workload container"), the total time could exceed this timeout.

Consider:

  1. Using a per-operation timeout (e.g., 30 seconds per container) rather than a global timeout for all containers
  2. Making the timeout configurable
  3. Documenting the expected time for pause/resume operations under normal and stressed conditions
  4. Implementing proper cancellation handling if a context timeout occurs mid-operation

Without adequate timeouts, the reconciliation loop could block, preventing other sandboxes from being processed.

Suggested change
- Use a short timeout context (e.g. 5–10 seconds) to avoid blocking reconciles.
- For each container operation, use a per-operation timeout context (e.g., 30 seconds per container). The timeout should be configurable to accommodate different node loads and operational environments.
- Document the expected time for pause/resume operations under both normal and stressed conditions, and implement proper cancellation handling if a context timeout occurs mid-operation.

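To make the per-operation timeout concrete, here is a minimal Go sketch, not taken from the proposal: the function names, the `"k8s.io"` namespace, and the idea of one configurable timeout per containerd call are assumptions layered on the standard containerd Go client.

```go
package lifecycle

import (
	"context"
	"time"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

// pauseAll pauses each workload container under its own timeout so that one
// slow containerd call cannot consume the budget of the whole reconcile.
// perOpTimeout is assumed to come from controller configuration; "k8s.io"
// is the namespace containerd's CRI plugin uses by default.
func pauseAll(client *containerd.Client, containerIDs []string, perOpTimeout time.Duration) error {
	for _, id := range containerIDs {
		ctx, cancel := context.WithTimeout(
			namespaces.WithNamespace(context.Background(), "k8s.io"), perOpTimeout)
		err := pauseOne(ctx, client, id)
		cancel() // release the timer before moving to the next container
		if err != nil {
			return err
		}
	}
	return nil
}

func pauseOne(ctx context.Context, client *containerd.Client, id string) error {
	container, err := client.LoadContainer(ctx, id)
	if err != nil {
		return err
	}
	task, err := container.Task(ctx, nil)
	if err != nil {
		return err
	}
	return task.Pause(ctx)
}
```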
Comment on lines +167 to +170
- New annotations:
- `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
- `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
- `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
Copilot AI Nov 21, 2025

The proposed annotation keys use domain-qualified names (e.g., sandbox.lifecycle.volcano.sh/state, sandbox.lifecycle.volcano.sh/resume-requested-at) which is inconsistent with existing annotation conventions in the codebase. The existing annotations use simple keys without domain prefixes: last-activity-time and creator-service-account (see pkg/apiserver/k8s_client.go:26,28).

For consistency, consider either:

  1. Using simple annotation keys like sandbox-state, resume-requested-at, last-paused-at, last-resumed-at
  2. Or, if domain-qualified names are preferred for these new lifecycle annotations, document why this departure from the existing pattern is necessary and consider migrating existing annotations to the same pattern

Mixing annotation styles can lead to confusion about which convention to follow for future additions.

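Whichever key convention is adopted, defining the keys once keeps them consistent; a purely illustrative Go sketch (the package, constant names, and helper are hypothetical, not existing code):

```go
package lifecycle

import "time"

// Illustrative constants for the proposed annotation keys; switching to the
// repository's existing un-prefixed convention would only touch these values.
const (
	AnnotationState             = "sandbox.lifecycle.volcano.sh/state"
	AnnotationResumeRequestedAt = "sandbox.lifecycle.volcano.sh/resume-requested-at"
	AnnotationLastPausedAt      = "sandbox.lifecycle.volcano.sh/last-paused-at"
	AnnotationLastResumedAt     = "sandbox.lifecycle.volcano.sh/last-resumed-at"
)

// markResumeRequested stamps the resume trigger with the RFC3339 timestamp
// format the proposal calls for.
func markResumeRequested(annotations map[string]string) {
	annotations[AnnotationResumeRequestedAt] = time.Now().UTC().Format(time.RFC3339)
}
```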
Comment on lines +167 to +170
- New annotations:
- `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
- `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
- `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
Copilot AI Nov 21, 2025

The annotation sandbox.lifecycle.volcano.sh/state storing values like "Running|Paused|Resuming|Error" should be clarified. Based on Kubernetes conventions, runtime state should typically be stored in status fields rather than annotations. Annotations are intended for metadata and non-authoritative information. Consider:

  1. Moving the state to a proper status.phase or status.lifecycleState field
  2. If keeping as annotation, explain why this design choice was made over using status fields
  3. Document how conflicts are resolved if both annotation and status contain state information

This is especially important given that line 172 mentions status.conditions entries, suggesting status is already being used for related information.

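If the state does move into status, a hedged sketch of carrying it as a standard condition (the helper name and reason strings are assumptions; only `meta.SetStatusCondition` is an existing API):

```go
package lifecycle

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setPausedCondition records the pause state in status.conditions, the usual
// home for runtime state, instead of (or in addition to) an annotation.
func setPausedCondition(conditions *[]metav1.Condition, paused bool, reason, message string) {
	status := metav1.ConditionFalse
	if paused {
		status = metav1.ConditionTrue
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Paused",
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```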
- Use a short timeout context (e.g. 5–10 seconds) to avoid blocking reconciles.
2. For each workload container:
- Strip the `containerd://` prefix from `containerID`.
- Look up the corresponding task with `client.LoadTask(ctx, containerID)`.
Copilot AI Nov 21, 2025

The API call client.LoadTask(ctx, containerID) is incorrect. The containerd Go client API requires loading a container first and then getting its task. The correct pattern is: container, err := client.LoadContainer(ctx, containerID) followed by task, err := container.Task(ctx, nil). The LoadTask method does not exist on the containerd client in this way.

Suggested change
- Look up the corresponding task with `client.LoadTask(ctx, containerID)`.
- Look up the corresponding task:
- `container, err := client.LoadContainer(ctx, containerID)`
- `task, err := container.Task(ctx, nil)`

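A minimal sketch of the corrected lookup on the resume path, assuming the containerd Go client; the function name and error flow are illustrative:

```go
package lifecycle

import (
	"context"
	"strings"

	containerd "github.com/containerd/containerd"
)

// resumeContainer strips the CRI "containerd://" prefix from the pod status
// containerID, loads the container, and resumes its task, using the
// LoadContainer + Task pattern suggested above rather than a LoadTask call
// that does not exist on the client.
func resumeContainer(ctx context.Context, client *containerd.Client, statusContainerID string) error {
	id := strings.TrimPrefix(statusContainerID, "containerd://")

	container, err := client.LoadContainer(ctx, id)
	if err != nil {
		return err
	}
	task, err := container.Task(ctx, nil)
	if err != nil {
		return err
	}
	return task.Resume(ctx)
}
```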
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
Copilot AI review requested due to automatic review settings November 21, 2025 07:28
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.



Comment on lines +95 to +98
3. If any container fails to pause:
- Record a `Paused` condition with `status=False` and a reason (e.g. `ContainerdPauseError`).
- Emit a Kubernetes event on the `Sandbox` and the pod.
- Optionally retry with backoff; give up after a small number of attempts to keep reconciliation bounded.
Copilot AI Nov 21, 2025

The error handling description states that the controller should "give up after a small number of attempts to keep reconciliation bounded," but this conflicts with the idempotency design mentioned in lines 125-133. If the reconciler is supposed to be idempotent and converge to desired state, giving up on pause operations could leave the sandbox in an inconsistent state. Consider clarifying whether failed pause operations should transition to an error state and wait for manual intervention, or if they should continue retrying on subsequent reconciliation loops.

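One possible way to reconcile bounded retries with convergence, sketched with controller-runtime types (the reason string matches the proposal; the helper name and one-minute backoff are assumptions):

```go
package lifecycle

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// onPauseFailure surfaces the failure in status for observability but does
// not give up for good: each reconcile pass stays bounded, while the requeue
// lets later passes keep converging toward the desired Paused state.
func onPauseFailure(conditions *[]metav1.Condition, pauseErr error) (ctrl.Result, error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Paused",
		Status:  metav1.ConditionFalse,
		Reason:  "ContainerdPauseError",
		Message: pauseErr.Error(),
	})
	// Returning nil with a delayed requeue avoids hot-looping the workqueue
	// on a sandbox that keeps failing to pause.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
```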
Comment on lines +277 to +296
```mermaid
flowchart TD
Start["Reconcile Sandbox"] --> CheckIdle{Idle timeout exceeded?}
CheckIdle -- "No" --> Exit["Requeue/Exit"]
CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
UpdateState --> EmitPaused["Emit SandboxPaused event"]
EmitPaused --> Exit
Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
WaitResume -- "Annotation" --> HandleAnnotation["Process resume annotation"]
WaitResume -- "Message queue" --> HandleMQ["Process MQ resume message"]
HandleAnnotation --> ResumeContainers
HandleMQ --> ResumeContainers
ResumeContainers["Resume containers via containerd"] --> RefreshActivity["Refresh last-activity timestamp"]
RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
EmitResumed --> Exit
```
Copilot AI Nov 21, 2025

The mermaid diagram has a logic issue: it shows two separate flows both starting with Start["Reconcile Sandbox"], which creates duplicate starting nodes. The pause and resume flows should be part of a single flowchart with conditional branches, or they should be clearly separated as different scenarios. Currently, both CheckIdle and WaitResume appear to be evaluated simultaneously from the same start node, which doesn't make logical sense.

###### MQ consumer and controller integration

1. A dedicated `ResumeConsumer` process (can be part of the agentd binary) connects to the NATS server.
2. It subscribes to the `agentcube.sandbox.resume` subject using a **durable subscription** (queue group or durable name) so that resumes are processed even if there are multiple replicas.
Copilot AI Nov 21, 2025

Typo: "ensure sandboxes are processed" should be "ensures that resume messages are processed" or similar. The phrase "so that resumes are processed" is grammatically unclear - it should specify what "resumes" refers to (resume messages or resume operations).

Suggested change
2. It subscribes to the `agentcube.sandbox.resume` subject using a **durable subscription** (queue group or durable name) so that resumes are processed even if there are multiple replicas.
2. It subscribes to the `agentcube.sandbox.resume` subject using a **durable subscription** (queue group or durable name) so that resume messages are processed even if there are multiple replicas.


- **Simplicity and footprint**: NATS is a single small binary and easy to run as a Kubernetes StatefulSet. It has fewer moving parts than Kafka or RabbitMQ.
- **Cloud-native and Kubernetes-friendly**: There are mature Helm charts and operators; it integrates well with Kubernetes RBAC and multi-tenant setups.
- **At-least-once delivery via durable subscriptions**: While we are not using JetStream, NATS durable subscriptions and manual acknowledgements give us at-least-once behavior for consumers that stay connected. Combined with idempotent handling on the controller side, this is sufficient for resume signals.
Copilot AI Nov 21, 2025

The description mentions "durable subscriptions" for plain NATS core, but this could be misleading. In NATS terminology, "durable subscriptions" typically refer to queue groups with durable names, which provide at-most-once delivery per consumer group but don't persist messages when all consumers are offline. The document later acknowledges this limitation (line 260-263), but line 207 might give a false impression of stronger durability guarantees than plain NATS core actually provides. Consider clarifying that this refers to queue groups with durable consumer names, not JetStream's persistent message storage.

Suggested change
- **At-least-once delivery via durable subscriptions**: While we are not using JetStream, NATS durable subscriptions and manual acknowledgements give us at-least-once behavior for consumers that stay connected. Combined with idempotent handling on the controller side, this is sufficient for resume signals.
- **At-least-once delivery via queue groups with durable names (not JetStream; messages are not persisted if all consumers are offline)**: While we are not using JetStream, NATS queue groups with durable names and manual acknowledgements give us at-least-once behavior for consumers that stay connected. Combined with idempotent handling on the controller side, this is sufficient for resume signals.

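For reference, a minimal core-NATS consumer sketch; the subject comes from the proposal, while the queue group name and payload format are assumptions. It also illustrates the caveat above: without JetStream, nothing is retained while no consumer is connected.

```go
package lifecycle

import (
	"log"

	"github.com/nats-io/nats.go"
)

// subscribeResume joins a queue group on the resume subject so that, with
// multiple agentd replicas, each message is handled by exactly one of them.
// Core NATS only delivers to currently connected subscribers; persisting
// messages across a full consumer outage would require JetStream.
func subscribeResume(nc *nats.Conn, handle func(sandboxID string)) (*nats.Subscription, error) {
	return nc.QueueSubscribe("agentcube.sandbox.resume", "resume-consumers", func(m *nats.Msg) {
		// The payload format is left open by the proposal; a bare sandbox ID
		// is assumed here purely for illustration.
		handle(string(m.Data))
		log.Printf("processed resume signal for %q", string(m.Data))
	})
}
```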
- **Annotation path**: Users (or apiserver) set `sandbox.lifecycle.volcano.sh/resume-requested-at`; reconciler resumes and clears annotation.
- **Message queue path**: Integrate with a concrete MQ to emit durable resume requests into the controller queue.
4. Add feature flags and configuration knobs: idle timeout, pause enablement, containerd socket path, per-namespace opt-out, fallback behavior (delete vs keep running).
5. Update apiserver and SDKs to surface paused state and expose a `/v1/sandboxes/{sandboxId}/pause` API.
Copilot AI Nov 21, 2025

The API endpoint path mentioned here (/v1/sandboxes/{sandboxId}/pause) doesn't match the resume endpoint mentioned in line 173 (/v1/sandboxes/{sandboxId}/resume). If this proposal is introducing both pause and resume operations, both endpoints should be documented consistently. Additionally, the text only mentions exposing a pause API, but the context suggests resume should also be exposed. Consider documenting both endpoints or clarifying if only one is being added.

Suggested change
5. Update apiserver and SDKs to surface paused state and expose a `/v1/sandboxes/{sandboxId}/pause` API.
5. Update apiserver and SDKs to surface paused state and expose `/v1/sandboxes/{sandboxId}/pause` and `/v1/sandboxes/{sandboxId}/resume` APIs.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.



##### Security and configuration

- The controller needs sufficient permissions to talk to the containerd socket. There are two deployment options:
- Run the controller as a DaemonSet on each node with hostPath mount of `/run/containerd/containerd.sock`.
Copilot AI Nov 21, 2025

The implementation mentions "Run the controller as a DaemonSet on each node with hostPath mount" (line 146), but the existing AgentdReconciler appears to be a cluster-scoped controller, not a DaemonSet. This architectural change has significant implications for deployment, RBAC, and high availability that aren't fully addressed. Consider adding a dedicated section on deployment architecture changes and migration strategy, or clarify if this refers to a separate component from the existing AgentdReconciler.

- Document that environments requiring strong durability can swap in a persistent MQ (e.g. Kafka) behind the same interface.
- **Back-pressure**: If the Kubernetes API becomes slow, the consumer can slow its processing rate and rely on NATS’ internal buffering and flow control.
- **Security**: NATS can be configured with TLS and token-based or NKey authentication. The consumer uses a dedicated NATS account/creds with access only to the required subjects.

Copilot AI Nov 21, 2025

The security section mentions "token-based or NKey authentication" for NATS (line 265), but doesn't address authentication/authorization for the annotation-based resume path. Any client with permission to update Sandbox annotations can trigger a resume, potentially bypassing intended access controls. Consider documenting the required RBAC permissions and whether additional authorization checks are needed beyond Kubernetes RBAC.

Suggested change
- **Annotation-based resume path**: The annotation-based resume mechanism relies on Kubernetes RBAC for access control. Only trusted controllers or users should be granted permission to update the `Sandbox` resource's annotations (specifically, the `sandbox.lifecycle.volcano.sh/resume-requested-at` and related keys). Granting broader update access could allow unauthorized clients to trigger resume operations, potentially bypassing intended access controls. It is recommended to define fine-grained RBAC roles that restrict annotation updates to only those principals that are authorized to trigger resume actions. If additional authorization checks are required beyond Kubernetes RBAC, they should be implemented in the controller logic and documented here.


##### High-level flow

- For each `Sandbox` CRD, there is a bound pod (the sandbox pod) created by the apiserver.
Copilot AI Nov 21, 2025

The reference to "apiserver" on line 70 is inconsistent with the actual component that creates pods. Based on the codebase (pkg/apiserver/k8s_client.go), the apiserver creates Sandbox CRDs, but the Kubernetes controller (likely via the agent-sandbox operator) creates the actual pods from the Sandbox spec. Consider clarifying this to say "created by the Kubernetes controller" or "created from the Sandbox spec".

Suggested change
- For each `Sandbox` CRD, there is a bound pod (the sandbox pod) created by the apiserver.
- For each `Sandbox` CRD, there is a bound pod (the sandbox pod) created by the Kubernetes controller from the Sandbox spec.


- **Duplicates**: At-least-once delivery (via durable subscription and reconnect behavior) means the same resume intent may be delivered multiple times. We handle this by making annotations idempotent: setting `resume-requested-at` to the latest timestamp is safe even if resume is already in progress or completed.
- **Ordering**: If multiple resume requests arrive, we only care about the most recent one. NATS preserves ordering per subject, but the controller logic does not rely on strict ordering.
- **Lost messages**: For long consumer outages, plain NATS does not guarantee persistence. To mitigate this, we:
Copilot AI Nov 21, 2025

The statement "For long consumer outages, plain NATS does not guarantee persistence" (line 260) contradicts the earlier claim (line 207) that "NATS durable subscriptions and manual acknowledgements give us at-least-once behavior." This inconsistency should be resolved. The document should be clear upfront that core NATS without JetStream does not persist messages when no consumers are connected, which is a significant limitation for a resume signaling mechanism where reliability is important.

Comment on lines +161 to +173
5. Update apiserver and SDKs to surface paused state and expose a `/v1/sandboxes/{sandboxId}/pause` API.
6. Ship metrics/events/logging and document operational playbooks.
7. Add end-to-end tests that validate containerd pause/resume against a real or fake runtime.

#### API / CRD Changes

- New annotations:
- `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
- `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
- `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
- Sandbox status additions:
- `status.conditions` entries for `Paused` and `ResumePending`.
- Optional: extend apiserver REST/SDK to POST `/v1/sandboxes/{sandboxId}/resume`.
Copilot AI Nov 21, 2025

The proposal mentions a /v1/sandboxes/{sandboxId}/pause API (line 161) but later only describes /v1/sandboxes/{sandboxId}/resume (line 173 and 193). The document should clarify whether both pause and resume APIs are being added, or if pause is only automatic via timeout. If both are added, document the use cases for manual pause. If only resume is exposed, remove the reference to the pause API endpoint.

- The actual containerd task states.
- Any outstanding resume triggers.
- The controller then converges to the desired state (e.g. `Paused` or `Running`) without double-pausing or double-resuming.

Copilot AI Nov 21, 2025

The design doesn't address potential race conditions when multiple reconciliation loops or resume triggers occur simultaneously. For example, if a pause operation is in progress and a resume annotation is added, or if two resume signals (annotation and MQ) arrive concurrently. Consider documenting concurrency control mechanisms such as using optimistic locking (resourceVersion checks), adding an operation-in-progress indicator, or ensuring the reconciler uses proper locking/queuing to serialize operations on the same sandbox.

Suggested change
##### Concurrency control and race condition prevention
- To prevent race conditions when multiple reconciliation loops or resume triggers occur simultaneously (e.g., overlapping pause and resume requests, or concurrent triggers from annotations and MQ), the controller implements the following concurrency control mechanisms:
- **Optimistic locking:** All updates to the sandbox resource use Kubernetes `resourceVersion` checks to ensure that only the latest version is modified. If a concurrent update occurs, the reconcile is retried with the new state.
- **Operation-in-progress indicator:** The controller sets an explicit status field or annotation (e.g., `sandbox.lifecycle.volcano.sh/operation-in-progress`) to indicate when a pause or resume operation is underway. New operations are not started until the current one completes, ensuring serialization of state transitions.
- **Serialized reconciliation:** The reconciler ensures that only one operation (pause or resume) is performed at a time for a given sandbox, either by relying on the controller-runtime's per-resource queueing or by explicit locking if needed.
- These mechanisms ensure that the controller converges to the correct desired state without double-pausing, double-resuming, or inconsistent state transitions, even under high event rates or controller restarts.

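As a sketch of the optimistic-locking point, client-go's standard conflict retry can wrap the annotation or status update; the Sandbox object is abstracted behind client.Object here and the helper name is hypothetical, while the rest is the stock client-go/controller-runtime API:

```go
package lifecycle

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setAnnotationWithRetry updates one annotation under resourceVersion-based
// optimistic locking: if a concurrent writer wins, the conflict triggers a
// fresh Get and another attempt instead of a lost or stale update.
func setAnnotationWithRetry(ctx context.Context, c client.Client, key types.NamespacedName, obj client.Object, annKey, annVal string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := c.Get(ctx, key, obj); err != nil {
			return err
		}
		ann := obj.GetAnnotations()
		if ann == nil {
			ann = map[string]string{}
		}
		ann[annKey] = annVal
		obj.SetAnnotations(ann)
		return c.Update(ctx, obj)
	})
}
```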
@hzxuzhonghu changed the title from "Add pause resume sandbox proposal" to "[WIP] Add pause resume sandbox proposal" on Dec 19, 2025
@volcano-sh-bot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 28f8b5f Apply suggestion from @Copilot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
