Skip to content

osep: add pause and resume via rootfs snapshot for Kubernetes sandboxes#437

Open
fengcone wants to merge 2 commits intoalibaba:mainfrom
fengcone:docs/public-rootfs-pause-resume
Open

osep: add pause and resume via rootfs snapshot for Kubernetes sandboxes#437
fengcone wants to merge 2 commits intoalibaba:mainfrom
fengcone:docs/public-rootfs-pause-resume

Conversation

@fengcone
Copy link
Collaborator

@fengcone fengcone commented Mar 12, 2026

Summary

  • add pause and resume via rootfs snapshot for Kubernetes sandboxes

Testing

  • Not run (osep docs only)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d10de3ddc3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +409 to +410
imagePullSecrets:
- name: <imagePullSecretName>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Provide registry auth to the commit push step

In the commit Job spec, imagePullSecrets is set but the ctr ... images push command has no credentials wired in, and imagePullSecrets only helps kubelet pull the Job image, not authenticate registry operations done by the process inside the container. In environments where snapshotRegistry is private (the common case implied by imagePullSecretName), pause will fail at push and the snapshot will never reach Ready, blocking pause/resume.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original OSEP conflated kubelet image pull auth with the registry auth needed by the process running inside the commit Job container.

I updated the design to separate these two paths explicitly:

  • snapshotPushSecretName is now used for pause-time snapshot push auth.
  • resumeImagePullSecretName is used separately for resume-time kubelet image pulls.

@fengcone fengcone force-pushed the docs/public-rootfs-pause-resume branch from a398eca to 53c951d Compare March 13, 2026 02:28
Copy link
Collaborator

@hittyt hittyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change summary: Introducing pause and resume semantics for Kubernetes sandboxes via rootfs snapshotting into OCI images.

The proposal is well-structured and addresses a major gap in sandbox lifecycle management. However, there are a few architectural and security concerns regarding resource cleanup, container selection ambiguity, and host-level runtime access that should be addressed before finalization.

The workload object identity may change, but the public sandbox identity does
not.

### 9. List and get semantics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Missing SandboxSnapshot cleanup on sandbox deletion

The proposal defines the lifecycle of SandboxSnapshot during pause/resume, but it doesn't specify if or how these resources are cleaned up when a sandbox is explicitly deleted via DELETE /sandboxes/{id}. Leaving snapshots behind will lead to resource leaks in the Kubernetes cluster and the OCI registry.

SandboxID string `json:"sandboxId"`
Policy SnapshotPolicy `json:"policy"`
SourceBatchSandboxName string `json:"sourceBatchSandboxName"`
SourcePodName string `json:"sourcePodName"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Ambiguity in container selection for snapshotting

A Kubernetes Pod can have multiple containers (e.g., sidecars, init containers). The SandboxSnapshotSpec only identifies the SourcePodName. The design should specify which container's rootfs is being committed, or allow the SandboxSnapshotSpec to include a ContainerName field.

--target-image <imageUri> \
--registry-auth-file /var/run/opensandbox/registry/.dockerconfigjson
volumeMounts:
- name: containerd-sock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Security risk: Privileged access to containerd.sock

Mounting containerd.sock into the commit Job Pod grants it significant control over the node's container runtime. While necessary for the proposed implementation, this security trade-off should be explicitly documented in a "Security Considerations" section, and the committer image should be strictly controlled (e.g., via server-side configuration only).


### 4. Pause state model

State is derived from resource presence:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Undefined state for Ready snapshot with live BatchSandbox

The state model doesn't explicitly define the behavior when BatchSandbox still exists but the SandboxSnapshot phase is already Ready. This is a transient but possible state (e.g., between step 7 and 8 of the pause flow). It should probably still be reported as Pausing or a new intermediate state to avoid UI flickering or confusion.

`imageUri` alone. `SandboxSnapshot` must retain enough `resumeTemplate`
information for the server to reconstruct a new `BatchSandbox`.
- Registries with immutable tags are not compatible with this simplified
single-snapshot design unless the implementation changes the tag strategy in a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Registry tag immutability and overwrite risks

The deterministic tagging strategy (<sandboxId>:snapshot) will fail on registries with immutable tags. Additionally, concurrent pause requests for the same sandboxId could lead to race conditions in the registry. Consider using a unique suffix (e.g., timestamp or UID) for the tag and storing the exact imageUri in the SandboxSnapshot.status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants