osep: add pause and resume via rootfs snapshot for Kubernetes sandboxes by fengcone · Pull Request #437 · alibaba/OpenSandbox

fengcone · 2026-03-12T16:58:22Z

Summary

add pause and resume via rootfs snapshot for Kubernetes sandboxes

Testing

Not run (osep docs only)
Unit tests
Integration tests
e2e / manual verification

Breaking Changes

None
Yes (describe impact and migration path)

Checklist

Linked Issue or clearly described motivation
Added/updated docs (if needed)
Added/updated tests (if needed)
Security impact considered
Backward compatibility considered

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d10de3ddc3


chatgpt-codex-connector · 2026-03-12T17:03:40Z

oseps/0008-pause-resume-rootfs-snapshot.md

+      imagePullSecrets:
+        - name: <imagePullSecretName>


Provide registry auth to the commit push step

In the commit Job spec, imagePullSecrets is set, but the ctr ... images push command has no credentials wired in. imagePullSecrets only helps kubelet pull the Job image; it does not authenticate registry operations performed by the process inside the container. In environments where snapshotRegistry is private (the common case implied by imagePullSecretName), pause will fail at the push step and the snapshot will never reach Ready, blocking pause/resume.


fengcone · 2026-03-13

The original OSEP conflated kubelet image-pull auth with the registry auth needed by the process running inside the commit Job container.

I updated the design to separate these two paths explicitly:

snapshotPushSecretName is now used for pause-time snapshot push auth.

resumeImagePullSecretName is used separately for resume-time kubelet image pulls.
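To make the push-side path concrete, here is a minimal Go sketch of how a committer process might read push credentials from a mounted `.dockerconfigjson` file, such as the one a push secret could be projected into. The function name `pushCredentials` and the host `registry.example.com` are assumptions for illustration; they do not come from the OSEP.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// dockerConfig mirrors the part of a .dockerconfigjson file that holds
// per-registry credentials.
type dockerConfig struct {
	Auths map[string]struct {
		Username string `json:"username"`
		Password string `json:"password"`
		Auth     string `json:"auth"` // base64("username:password")
	} `json:"auths"`
}

// pushCredentials (hypothetical helper) returns the username and password
// stored for registryHost in the raw docker config JSON.
func pushCredentials(raw []byte, registryHost string) (string, string, error) {
	var cfg dockerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", "", fmt.Errorf("parse docker config: %w", err)
	}
	entry, ok := cfg.Auths[registryHost]
	if !ok {
		return "", "", errors.New("no credentials for registry " + registryHost)
	}
	// Prefer explicit fields; fall back to the combined base64 "auth" field.
	if entry.Username != "" {
		return entry.Username, entry.Password, nil
	}
	decoded, err := base64.StdEncoding.DecodeString(entry.Auth)
	if err != nil {
		return "", "", fmt.Errorf("decode auth field: %w", err)
	}
	user, pass, found := strings.Cut(string(decoded), ":")
	if !found {
		return "", "", errors.New("auth field is not in user:password form")
	}
	return user, pass, nil
}

func main() {
	// Example config; the host name is a placeholder.
	raw := []byte(`{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}`)
	user, _, err := pushCredentials(raw, "registry.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("push user:", user)
}
```

The point of the sketch is that these credentials are consumed by the process inside the container, which is a separate path from the pull secret kubelet uses at resume time.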


hittyt

Change summary: Introducing pause and resume semantics for Kubernetes sandboxes via rootfs snapshotting into OCI images.

The proposal is well-structured and addresses a major gap in sandbox lifecycle management. However, there are a few architectural and security concerns regarding resource cleanup, container selection ambiguity, and host-level runtime access that should be addressed before finalization.

hittyt · 2026-03-13T07:50:17Z

oseps/0008-pause-resume-rootfs-snapshot.md

+The workload object identity may change, but the public sandbox identity does
+not.
+
+### 9. List and get semantics


[P1] Missing SandboxSnapshot cleanup on sandbox deletion

The proposal defines the lifecycle of SandboxSnapshot during pause/resume, but it doesn't specify if or how these resources are cleaned up when a sandbox is explicitly deleted via DELETE /sandboxes/{id}. Leaving snapshots behind will lead to resource leaks in the Kubernetes cluster and the OCI registry.
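One way the requested cleanup could look is sketched below in Go. The `sandboxStore` interface, `errNotFound`, and `deleteSandbox` are hypothetical names standing in for whatever the server uses to talk to Kubernetes; the sketch only shows the ordering and the need to tolerate a missing workload when the sandbox is already paused. Removing the pushed image from the OCI registry would be an additional step that is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a stand-in for a "resource does not exist" API error.
var errNotFound = errors.New("not found")

// sandboxStore is a hypothetical abstraction over the cluster resources
// that belong to one public sandbox ID.
type sandboxStore interface {
	DeleteBatchSandbox(sandboxID string) error
	DeleteSnapshot(sandboxID string) error
}

// deleteSandbox removes the workload and then the snapshot. Either one may
// already be absent (a paused sandbox has no workload, a running sandbox
// may have no snapshot), so not-found errors are ignored.
func deleteSandbox(store sandboxStore, sandboxID string) error {
	if err := store.DeleteBatchSandbox(sandboxID); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete BatchSandbox: %w", err)
	}
	if err := store.DeleteSnapshot(sandboxID); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete SandboxSnapshot: %w", err)
	}
	return nil
}

// fakeStore is an in-memory implementation used only for the demo below.
type fakeStore struct {
	workloads map[string]bool
	snapshots map[string]bool
}

func (f *fakeStore) DeleteBatchSandbox(id string) error {
	if !f.workloads[id] {
		return errNotFound
	}
	delete(f.workloads, id)
	return nil
}

func (f *fakeStore) DeleteSnapshot(id string) error {
	if !f.snapshots[id] {
		return errNotFound
	}
	delete(f.snapshots, id)
	return nil
}

func main() {
	// A paused sandbox: the workload is gone, only the snapshot remains.
	store := &fakeStore{
		workloads: map[string]bool{},
		snapshots: map[string]bool{"sbx-1": true},
	}
	if err := deleteSandbox(store, "sbx-1"); err != nil {
		panic(err)
	}
	fmt.Println("snapshots left:", len(store.snapshots))
}
```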

hittyt · 2026-03-13T07:50:17Z

oseps/0008-pause-resume-rootfs-snapshot.md

+    SandboxID                 string                `json:"sandboxId"`
+    Policy                    SnapshotPolicy        `json:"policy"`
+    SourceBatchSandboxName    string                `json:"sourceBatchSandboxName"`
+    SourcePodName             string                `json:"sourcePodName"`


[P1] Ambiguity in container selection for snapshotting

A Kubernetes Pod can have multiple containers (e.g., sidecars, init containers). The SandboxSnapshotSpec only identifies the SourcePodName. The design should specify which container's rootfs is being committed, or allow the SandboxSnapshotSpec to include a ContainerName field.
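A possible shape for this suggestion, sketched in Go: an optional `ContainerName` field plus a resolution rule that refuses to guess when the Pod has more than one container. The `snapshotSource` type and `resolveContainer` function are illustrative assumptions, not part of the proposed `SandboxSnapshotSpec`.

```go
package main

import "fmt"

// snapshotSource is a trimmed, hypothetical view of the spec fields that
// identify what to snapshot. ContainerName is the field suggested in the
// review comment; it does not exist in the reviewed draft.
type snapshotSource struct {
	SourcePodName string `json:"sourcePodName"`
	ContainerName string `json:"containerName,omitempty"`
}

// resolveContainer picks the container whose rootfs should be committed.
// An explicit name must match a container in the pod; without a name the
// choice is only unambiguous when the pod has exactly one container.
func resolveContainer(src snapshotSource, podContainers []string) (string, error) {
	if src.ContainerName != "" {
		for _, name := range podContainers {
			if name == src.ContainerName {
				return name, nil
			}
		}
		return "", fmt.Errorf("container %q not found in pod %q", src.ContainerName, src.SourcePodName)
	}
	if len(podContainers) == 1 {
		return podContainers[0], nil
	}
	return "", fmt.Errorf("cannot infer container: pod %q has %d containers", src.SourcePodName, len(podContainers))
}

func main() {
	src := snapshotSource{SourcePodName: "sbx-1-pod", ContainerName: "sandbox"}
	name, err := resolveContainer(src, []string{"sandbox", "sidecar"})
	if err != nil {
		panic(err)
	}
	fmt.Println("commit container:", name)
}
```

Failing loudly on an ambiguous Pod is a deliberate choice in this sketch: silently defaulting to the first container could snapshot a sidecar instead of the sandbox.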

hittyt · 2026-03-13T07:50:17Z

oseps/0008-pause-resume-rootfs-snapshot.md

+                --target-image <imageUri> \
+                --registry-auth-file /var/run/opensandbox/registry/.dockerconfigjson
+          volumeMounts:
+            - name: containerd-sock


[P1] Security risk: Privileged access to containerd.sock

Mounting containerd.sock into the commit Job Pod grants it significant control over the node's container runtime. While necessary for the proposed implementation, this security trade-off should be explicitly documented in a "Security Considerations" section, and the committer image should be strictly controlled (e.g., via server-side configuration only).

hittyt · 2026-03-13T07:50:17Z

oseps/0008-pause-resume-rootfs-snapshot.md

+
+### 4. Pause state model
+
+State is derived from resource presence:


[P2] Undefined state for Ready snapshot with live BatchSandbox

The state model doesn't explicitly define the behavior when the BatchSandbox still exists but the SandboxSnapshot phase is already Ready. This is a transient but possible state (e.g., between steps 7 and 8 of the pause flow). It should probably still be reported as Pausing, or as a new intermediate state, to avoid UI flickering or confusion.
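The suggested rule can be sketched as a small Go function that derives the public state from resource presence. Only `Pausing` and the `Ready` snapshot phase appear in the review; the other state names (`Running`, `Paused`, `Unknown`, `NotFound`) and the function `deriveState` are assumptions. The sketch covers the pause direction only; a resume flow, where a new BatchSandbox coexists with a Ready snapshot, would need an extra discriminator that is not modeled here.

```go
package main

import "fmt"

// deriveState (hypothetical) maps resource presence to a public state.
// While the BatchSandbox still exists, any snapshot, including one that is
// already Ready, is reported as Pausing so the state does not flicker
// between the snapshot becoming Ready and the workload being removed.
func deriveState(batchSandboxExists, snapshotExists, snapshotReady bool) string {
	switch {
	case batchSandboxExists && snapshotExists:
		return "Pausing"
	case batchSandboxExists:
		return "Running"
	case snapshotExists && snapshotReady:
		return "Paused"
	case snapshotExists:
		// Workload gone but snapshot not Ready: left undefined in this sketch.
		return "Unknown"
	default:
		return "NotFound"
	}
}

func main() {
	// The transient case from the review: workload present, snapshot Ready.
	fmt.Println(deriveState(true, true, true))
}
```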

hittyt · 2026-03-13T07:50:17Z

oseps/0008-pause-resume-rootfs-snapshot.md

+  `imageUri` alone. `SandboxSnapshot` must retain enough `resumeTemplate`
+  information for the server to reconstruct a new `BatchSandbox`.
+- Registries with immutable tags are not compatible with this simplified
+  single-snapshot design unless the implementation changes the tag strategy in a


[P2] Registry tag immutability and overwrite risks

The deterministic tagging strategy (<sandboxId>:snapshot) will fail on registries with immutable tags. Additionally, concurrent pause requests for the same sandboxId could lead to race conditions in the registry. Consider using a unique suffix (e.g., timestamp or UID) for the tag and storing the exact imageUri in the SandboxSnapshot.status.

osep: add pause and resume via rootfs snapshot for Kubernetes sandboxes

d10de3d

fengcone requested review from Generalwin, Pangjiping, Spground, hittyt, jwx0925, kevinlynx and ninan-nn as code owners March 12, 2026 16:58

chatgpt-codex-connector bot reviewed Mar 12, 2026


docs(osep): OSEP-0008 separate snapshot push and resume pull authentication

53c951d

fengcone force-pushed the docs/public-rootfs-pause-resume branch from a398eca to 53c951d on March 13, 2026 02:28

hittyt reviewed Mar 13, 2026


fengcone wants to merge 2 commits into alibaba:main from fengcone:docs/public-rootfs-pause-resume
