diff --git a/oseps/0008-pause-resume-rootfs-snapshot.md b/oseps/0008-pause-resume-rootfs-snapshot.md new file mode 100644 index 00000000..6ca4f053 --- /dev/null +++ b/oseps/0008-pause-resume-rootfs-snapshot.md @@ -0,0 +1,630 @@ +--- +title: Pause and Resume via Rootfs Snapshot +authors: + - "@fengcone" +creation-date: 2026-03-11 +last-updated: 2026-03-13 +status: draft +--- + +# OSEP-0008: Pause and Resume via Rootfs Snapshot + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Requirements](#requirements) +- [Proposal](#proposal) + - [API Overview](#api-overview) + - [Kubernetes Resource Overview](#kubernetes-resource-overview) + - [Component Interaction Overview](#component-interaction-overview) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) +- [Test Plan](#test-plan) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed](#infrastructure-needed) +- [Upgrade & Migration Strategy](#upgrade--migration-strategy) + + +## Summary + +This proposal introduces pause and resume semantics for Kubernetes-backed +sandboxes by persisting the sandbox root filesystem as an OCI image. On pause, +the server creates a `SandboxSnapshot` CR for the running sandbox Pod, a +dedicated controller creates a commit Job on the same node, and the rootfs is +committed and pushed to a registry. After the snapshot becomes ready, the +original `BatchSandbox` is removed so compute resources are released. + +Resume is intentionally simpler. The server resolves the single retained +snapshot for the stable `sandboxId`, then creates a new `BatchSandbox` with +`replicas = 1` from the snapshot image. The public `sandboxId` remains stable +across pause and resume. + +```text +Time ------------------------------------------------------------------------> + +Sandbox lifecycle: [Running]--[Pausing]--[Paused]--[Resuming]--[Running] + | | + commit + push create new BatchSandbox + delete old BatchSandbox from snapshot image +``` + +## Motivation + +OpenSandbox users often need to temporarily stop a sandbox without losing the +filesystem state that has accumulated during a long-running task. Typical cases +include releasing cluster resources overnight, pausing an agent before a risky +step, or resuming a workspace later from the same working directory. + +Today, Kubernetes runtime returns `HTTP 501 Not Implemented` for both `pause` +and `resume`. Docker supports cgroup freeze, but that does not survive restart +or migration. Rootfs snapshot is the practical middle ground in the persistence +roadmap: + +- Phase 1: persistent volumes preserve explicit mounts. +- Phase 2: rootfs snapshot preserves the container filesystem. +- Phase 3: VM or process checkpoint preserves memory and execution state. + +This OSEP deliberately chooses a simple architecture: + +- keep `BatchSandbox` as the runtime workload resource used by the server today +- add a single `SandboxSnapshot` CR per `sandboxId` +- do not introduce a new per-instance lifecycle CR +- do not support multiple retained snapshots in v1 + +### Goals + +- Implement `pause` for Kubernetes sandboxes by committing a running sandbox Pod + rootfs into an OCI image and pushing it to a configurable registry. +- Keep the public `sandboxId` stable across pause and resume. +- Release compute resources after pause by deleting the original + `BatchSandbox`. +- Implement `resume` by creating a new `BatchSandbox` with `replicas = 1` from + the retained snapshot image. +- Expose `Pausing`, `Paused`, and `Resuming` through the existing Lifecycle API. +- Keep the design minimal by retaining only one snapshot per sandbox. + +### Non-Goals + +- Preserving in-memory process state, open sockets, or CPU registers. +- Supporting multiple historical snapshots per sandbox. +- Adding `GET /sandboxes/{sandboxId}/snapshots` in v1. +- Designing a general multi-instance pause model for `BatchSandbox` with + `replicas > 1`. +- Extending Docker runtime to rootfs snapshot. +- Implementing automatic scheduled snapshots. + +## Requirements + +- Public `sandboxId` must remain unchanged after pause and resume. +- A sandbox has at most one retained snapshot in v1. +- Pause must work from the currently running sandbox Pod and record the concrete + `podName` and `nodeName` that are being snapshotted. +- The commit Job must run on the same node as the source Pod. +- Pause must complete `commit -> push` before the original `BatchSandbox` is + deleted. +- Resume must work when the original `BatchSandbox` no longer exists. +- `GET /sandboxes/{sandboxId}` must still return `200` and state `Paused` while + the sandbox is represented only by a `SandboxSnapshot`. +- Registry credentials must be referenced via Kubernetes Secret, not inline API + credentials. +- `SandboxSnapshot` must carry enough policy and workload reconstruction data to + resume even after the original `BatchSandbox` has been deleted. +- The API shape must leave room for future snapshot backends, especially VM + snapshot, even though this revision only implements rootfs snapshot. +- The design must remain compatible with the current server behavior where + Kubernetes sandboxes are created as `BatchSandbox` with `replicas = 1`. + +## Proposal + +Pause and resume are modeled around two resources: + +- `BatchSandbox`: runtime workload resource used for the live sandbox +- `SandboxSnapshot`: persisted snapshot state for one stable `sandboxId` + +The public API stays sandbox-oriented, and the server remains the orchestrator. +The snapshot controller only handles snapshot execution. + +### API Overview + +```text +POST /sandboxes/{sandboxId}/pause -> create or update SandboxSnapshot, return 202 +POST /sandboxes/{sandboxId}/resume -> create new BatchSandbox from snapshot, return 202 +GET /sandboxes/{sandboxId} -> returns Running / Pausing / Paused / Resuming +``` + +There is no `GET /sandboxes/{sandboxId}/snapshots` endpoint in this version +because each sandbox retains only one snapshot. + +### Kubernetes Resource Overview + +```text +BatchSandbox (existing) + |- used by Server as the live workload resource + |- created with replicas = 1 for public sandbox lifecycle API + `- deleted after pause succeeds + +SandboxSnapshot (new, one per sandboxId) + |- metadata.name = + |- spec.sandboxId + |- spec.policy.type # Rootfs today, reserved for VMSnapshot later + |- spec.sourceBatchSandboxName + |- spec.sourcePodName + |- spec.sourceNodeName + |- spec.imageUri + |- spec.snapshotPushSecretName + |- spec.resumeImagePullSecretName + |- spec.resumeTemplate + |- status.phase # Pending | Committing | Pushing | Ready | Failed + |- status.readyAt + `- status.message +``` + +The `SandboxSnapshot` name is deterministic and equal to `sandboxId`, which +enforces the “one sandbox, one snapshot” rule. + +### Component Interaction Overview + +Pause flow: + +```mermaid +sequenceDiagram + participant Client + participant Server + participant Batch as BatchSandbox + participant Snapshot as SandboxSnapshot + participant Ctrl as SandboxSnapshotController + participant Job as Commit Job Pod + participant Registry + + Client->>Server: POST /sandboxes/{id}/pause + Server->>Batch: Read live BatchSandbox and Pod info + Server->>Snapshot: Create/Update SandboxSnapshot\n(sandboxId, podName, nodeName, imageUri, pushSecretRef, resumePullSecretRef) + Server-->>Client: 202 Accepted + Ctrl->>Job: Create same-node commit Job Pod + Job->>Registry: Push snapshot image + Job-->>Ctrl: Commit/push succeeded + Ctrl->>Snapshot: status.phase = Ready + Server->>Batch: Delete original BatchSandbox after snapshot Ready +``` + +Resume flow: + +```mermaid +sequenceDiagram + participant Client + participant Server + participant Snapshot as SandboxSnapshot + participant Batch as New BatchSandbox + participant Ctrl as BatchSandboxController + participant Pod as Sandbox Pod + + Client->>Server: POST /sandboxes/{id}/resume + Server->>Snapshot: Lookup SandboxSnapshot by sandboxId + Server->>Snapshot: Validate snapshot status.phase == Ready + Server->>Batch: Create new BatchSandbox\n(replicas=1, image=snapshot.imageUri, sandboxId unchanged) + Server-->>Client: 202 Accepted + Ctrl->>Pod: Create sandbox Pod + Pod-->>Ctrl: Pod becomes Running and Ready + Server->>Server: Aggregate state as Resuming -> Running +``` + +### Notes/Constraints/Caveats + +- `BatchSandbox` still supports broader semantics in the platform, but this + proposal only targets the current public server path where a sandbox maps to a + `BatchSandbox` with `replicas = 1`. +- The old `BatchSandbox` is deleted after a successful pause, so the paused + state exists only in `SandboxSnapshot`. +- The server remains the orchestration owner for pause and resume. The + snapshot controller is not responsible for creating or deleting + `BatchSandbox`. +- `SandboxSnapshot.spec.policy.type` is reserved for future snapshot backends. + This revision only supports `Rootfs`. +- Snapshot image URI should be stable for the single retained snapshot, for + example `/:snapshot`. +- Snapshot push authentication and resume-time image pull authentication are + modeled separately. They may reference the same Kubernetes Secret in some + deployments, but the design must not assume they are identical. +- Because the original `BatchSandbox` is deleted, resume cannot rely on + `imageUri` alone. `SandboxSnapshot` must retain enough `resumeTemplate` + information for the server to reconstruct a new `BatchSandbox`. +- Registries with immutable tags are not compatible with this simplified + single-snapshot design unless the implementation changes the tag strategy in a + future revision. +- Resume creates a new `BatchSandbox`; it does not resurrect the previous one. + +### Risks and Mitigations + +| Risk | Mitigation | +|------|------------| +| Pause succeeds in commit but old workload is deleted too early | Delete the original `BatchSandbox` only after `SandboxSnapshot.status.phase == Ready`. | +| Commit job lands on the wrong node | Store `sourceNodeName` in `SandboxSnapshot.spec` and pin the commit Job Pod to that node. | +| Server cannot represent a paused sandbox once `BatchSandbox` is gone | Use `SandboxSnapshot` as the source of truth for paused state in `GET /sandboxes/{sandboxId}`. | +| Repeated pause requests cause inconsistent state | Use deterministic `SandboxSnapshot.metadata.name = sandboxId` and treat pause as replace/update of the single snapshot. | +| Snapshot image is unavailable on resume | Require `status.phase == Ready` before resume and surface image-pull failures through normal sandbox startup state. | +| Single-snapshot design loses rollback ability | Accept as an intentional simplification for v1; multi-snapshot support is a future extension. | + +## Design Details + +### 1. Public Lifecycle API changes + +This OSEP keeps the public API minimal: + +- `CreateSandboxRequest.pausePolicy` is added as an optional field. +- `POST /sandboxes/{sandboxId}/pause` +- `POST /sandboxes/{sandboxId}/resume` +- `GET /sandboxes/{sandboxId}` + +There is no snapshots listing API in this version. + +Suggested request shape: + +```yaml +pausePolicy: + snapshotType: Rootfs + snapshotRegistry: registry.example.com/sandbox-snapshots + snapshotPushSecretName: snapshot-registry-push-secret + resumeImagePullSecretName: snapshot-registry-pull-secret +``` + +`pausePolicy.snapshotType` is reserved for future expansion and currently only +supports `Rootfs`. A later revision can add `VMSnapshot` without breaking the +API shape. + +### 2. PausePolicy on BatchSandbox + +Pause policy remains part of the live sandbox workload definition: + +```go +type PausePolicy struct { + SnapshotType string `json:"snapshotType,omitempty"` // Rootfs today, VMSnapshot reserved + SnapshotRegistry string `json:"snapshotRegistry"` + SnapshotPushSecretName string `json:"snapshotPushSecretName,omitempty"` + ResumeImagePullSecretName string `json:"resumeImagePullSecretName,omitempty"` +} + +type BatchSandboxSpec struct { + // existing fields... + PausePolicy *PausePolicy `json:"pausePolicy,omitempty"` +} +``` + +This policy is used by the server when constructing `SandboxSnapshot`. + +### 3. SandboxSnapshot CRD + +Introduce `SandboxSnapshot` under `sandbox.opensandbox.io/v1alpha1`. + +```go +type SandboxSnapshotPhase string + +const ( + SandboxSnapshotPhasePending SandboxSnapshotPhase = "Pending" + SandboxSnapshotPhaseCommitting SandboxSnapshotPhase = "Committing" + SandboxSnapshotPhasePushing SandboxSnapshotPhase = "Pushing" + SandboxSnapshotPhaseReady SandboxSnapshotPhase = "Ready" + SandboxSnapshotPhaseFailed SandboxSnapshotPhase = "Failed" +) + +type SandboxSnapshotSpec struct { + SandboxID string `json:"sandboxId"` + Policy SnapshotPolicy `json:"policy"` + SourceBatchSandboxName string `json:"sourceBatchSandboxName"` + SourcePodName string `json:"sourcePodName"` + SourceNodeName string `json:"sourceNodeName"` + ImageURI string `json:"imageUri"` + SnapshotPushSecretName string `json:"snapshotPushSecretName,omitempty"` + ResumeImagePullSecretName string `json:"resumeImagePullSecretName,omitempty"` + ResumeTemplate *runtime.RawExtension `json:"resumeTemplate,omitempty"` + PausedAt metav1.Time `json:"pausedAt"` +} + +type SandboxSnapshotStatus struct { + Phase SandboxSnapshotPhase `json:"phase,omitempty"` + Message string `json:"message,omitempty"` + ReadyAt *metav1.Time `json:"readyAt,omitempty"` + ImageDigest string `json:"imageDigest,omitempty"` +} + +type SnapshotPolicy struct { + Type string `json:"type"` // Rootfs today, VMSnapshot reserved +} +``` + +Key rules: + +- `metadata.name = sandboxId` +- one namespace contains at most one `SandboxSnapshot` for a given `sandboxId` +- creating a new pause request overwrites the retained snapshot +- `policy.type` must be set to `Rootfs` in this revision +- `SourcePodName` and `SourceNodeName` are mandatory because the commit workflow + is bound to a concrete live Pod +- `SnapshotPushSecretName` is used only for the in-container registry push + performed by the commit Job +- `ResumeImagePullSecretName` is used only when reconstructing the resumed + workload so kubelet can pull the retained snapshot image +- `ResumeTemplate` must preserve enough information to reconstruct a new + `BatchSandbox` after the original workload has been deleted + +### 4. Pause state model + +State is derived from resource presence: + +- `BatchSandbox` exists and is ready -> `Running` +- `BatchSandbox` exists and snapshot phase is `Pending|Committing|Pushing` -> `Pausing` +- `BatchSandbox` is absent and snapshot phase is `Ready` -> `Paused` +- `BatchSandbox` exists and was created from snapshot but is not ready yet -> + `Resuming` +- `SandboxSnapshot.status.phase == Failed` and no live replacement workload -> + `Failed` + +This means `GET /sandboxes/{sandboxId}` must consult both `BatchSandbox` and +`SandboxSnapshot`. + +### 5. Pause flow + +The pause flow is: + +```text +1. Client POST /sandboxes/{sandboxId}/pause +2. Server Resolve current BatchSandbox and running Pod for sandboxId +3. Server Validate: + - workload exists + - replicas == 1 for this server path + - pausePolicy is configured +4. Server Create or replace SandboxSnapshot(name=sandboxId) with: + - policy.type = Rootfs + - sourceBatchSandboxName + - sourcePodName + - sourceNodeName + - target imageUri + - snapshotPushSecretName + - resumeImagePullSecretName + - resumeTemplate + - pausedAt + - status.phase = Pending +5. Snapshot controller creates a same-node commit Job Pod +6. Job Pod commits container rootfs and pushes image +7. Snapshot controller updates phase: + Pending -> Committing -> Pushing -> Ready +8. Server-side pause orchestration deletes the original BatchSandbox +9. GET /sandboxes/{sandboxId} now returns Paused from SandboxSnapshot +``` + +Failure behavior: + +- If commit or push fails, `SandboxSnapshot.status.phase = Failed` +- The original `BatchSandbox` is not deleted +- The sandbox remains `Running` or transitions to `Failed` based on the final + server policy; this OSEP recommends keeping the workload running and exposing + the snapshot failure in the message + +### 6. Commit Job Pod + +The snapshot controller creates one short-lived Job Pod: + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: sbxsnap-commit- +spec: + ttlSecondsAfterFinished: 300 + template: + spec: + restartPolicy: Never + nodeName: + containers: + - name: committer + image: + command: ["/bin/sh", "-c"] + args: + - | + snapshot-committer \ + --containerd-namespace k8s.io \ + --container-id \ + --target-image \ + --registry-auth-file /var/run/opensandbox/registry/.dockerconfigjson + volumeMounts: + - name: containerd-sock + mountPath: /run/containerd/containerd.sock + - name: snapshot-push-auth + mountPath: /var/run/opensandbox/registry + readOnly: true + volumes: + - name: containerd-sock + hostPath: + path: /run/containerd/containerd.sock + type: Socket + - name: snapshot-push-auth + secret: + secretName: +``` + +The controller resolves the source container ID from `SourcePodName`. + +`snapshot-committer` in this example is a logical role, not a required product +name. The implementation may be a small in-house binary, a thin wrapper around +existing container tooling, or another committer client, as long as it +performs the following responsibilities explicitly: + +- commit the source container rootfs into a snapshot image +- read the mounted registry auth config from the Secret volume +- push the snapshot image to `spec.imageUri` +- return a clear success/failure signal so the controller can update + `SandboxSnapshot.status.phase` + +Important auth semantics: + +- `imagePullSecrets` on the Job Pod, if needed for the `committerImage`, only + affects kubelet pulling the Job image. It does not authenticate registry + operations performed by the process inside the container. +- `snapshotPushSecretName` is mounted into the committer container and must be + consumed explicitly by the snapshot push client as registry auth config. +- `resumeImagePullSecretName` is not used by the commit Job. It is propagated + to the resumed workload template so kubelet can pull `snapshot.spec.imageUri` + during resume. + +### 7. Resume flow + +The resume flow is: + +```text +1. Client POST /sandboxes/{sandboxId}/resume +2. Server Resolve SandboxSnapshot(name=sandboxId) +3. Server Validate: + - snapshot exists + - snapshot status.phase == Ready +4. Server Create a new BatchSandbox: + - metadata.name reuses the same public sandbox identity mapping + - replicas = 1 + - template reconstructed from snapshot.spec.resumeTemplate + - template image = snapshot.spec.imageUri + - template imagePullSecrets = snapshot.spec.resumeImagePullSecretName + - labels preserve sandboxId +5. Server Aggregate sandbox state as Resuming while the new BatchSandbox is + starting +6. BatchSandbox controller creates the new Pod +7. Once the new Pod is running and ready, GET /sandboxes/{sandboxId} returns Running +``` + +The snapshot is retained after resume so the sandbox can be paused and resumed +again later, but only the latest snapshot is kept. + +### 8. Stable sandbox ID + +The public `sandboxId` is stable across three states: + +- live workload exists: identify by `BatchSandbox` label `opensandbox.io/id` +- paused workload: identify by `SandboxSnapshot.metadata.name == sandboxId` +- resumed workload: identify by the new `BatchSandbox` label + +The workload object identity may change, but the public sandbox identity does +not. + +### 9. List and get semantics + +`GET /sandboxes/{sandboxId}` must: + +- first resolve the live `BatchSandbox` +- then resolve `SandboxSnapshot` +- merge both views into one lifecycle status + +`GET /sandboxes` should include: + +- running sandboxes from live `BatchSandbox` objects +- paused sandboxes from `SandboxSnapshot` objects that have no live + `BatchSandbox` + +This keeps paused sandboxes visible even though their workloads have been +deleted. + +### 10. Configuration + +Add a new server config section: + +```toml +[pause] +default_snapshot_registry = "" +committer_image = "containerd/containerd:1.7" +``` + +Semantics: + +- `default_snapshot_registry` is used when `pausePolicy.snapshotRegistry` is not + explicitly set. +- `committer_image` is the image used by the commit Job Pod. + +## Test Plan + +### Unit tests + +- Pause request creates or replaces `SandboxSnapshot(name=sandboxId)`. +- `SandboxSnapshot` contains `sourcePodName` and `sourceNodeName` from the live + Pod. +- Snapshot controller creates a Job pinned to the correct node. +- Server returns `Paused` when `BatchSandbox` is absent and snapshot is `Ready`. +- Server returns `Resuming` after new `BatchSandbox` is created from snapshot but + before readiness. +- Resume fails with `409` when snapshot is absent or not `Ready`. + +### Integration tests + +- End-to-end pause: + - running `BatchSandbox` + - snapshot becomes `Ready` + - original `BatchSandbox` is deleted + - `GET /sandboxes/{id}` returns `Paused` +- End-to-end resume: + - server finds snapshot by `sandboxId` + - creates new `BatchSandbox` + - new Pod comes up from snapshot image + - `GET /sandboxes/{id}` returns `Running` +- Repeat pause after resume: + - the same `SandboxSnapshot` resource is reused or replaced + - only one snapshot remains + +### Manual and operator validation + +- Confirm the committed image is present in the registry after pause. +- Confirm working directory contents survive pause and resume. +- Confirm CPU and memory are released after the old `BatchSandbox` is deleted. +- Confirm the commit Job Pod actually runs on the source node. + +## Drawbacks + +- Only one snapshot is retained, so rollback to older states is impossible. +- The design assumes the server-side Kubernetes path uses `replicas = 1`. +- The paused state is split from the live workload and must be reconstructed by + the server from multiple resources. +- Registries that enforce immutable tags are a poor fit for the simplified + single-snapshot design. +- Commit still requires node-local runtime access. + +## Alternatives + +### Introduce a dedicated SandboxInstance CR + +A more general design is possible, but rejected here because the user goal is a +simpler architecture aligned with the current server path. For v1, the single +snapshot CR plus existing `BatchSandbox` is sufficient. + +### Store pause state directly on BatchSandbox + +Rejected because the paused state must survive after the workload is deleted. +Once pause succeeds, the original `BatchSandbox` no longer exists. + +### Support multiple snapshots and `/snapshots` API in v1 + +Rejected to keep the architecture minimal. Multi-snapshot history can be added +later by changing `SandboxSnapshot` naming and list semantics. + +### Restore the old BatchSandbox instead of creating a new one + +Rejected because pause deletes the original workload to release resources. Resume +is cleaner if it always creates a fresh `BatchSandbox` from the retained image. + +## Infrastructure Needed + +- An OCI registry reachable from cluster nodes. +- A registry credential Secret of type `kubernetes.io/dockerconfigjson`. +- A committer image that can access `containerd.sock` on the source node. +- RBAC for `SandboxSnapshot`, Jobs, and reads on Pods and `BatchSandbox`. + +## Upgrade & Migration Strategy + +This change is additive for the public API and simple for operators. + +- Existing clients keep using the same sandbox lifecycle endpoints. +- Existing Kubernetes deployments without the new `SandboxSnapshot` CRD continue + to return `501` for pause and resume. +- Rollout sequence: + - install the `SandboxSnapshot` CRD + - deploy the `SandboxSnapshotController` + - deploy the updated server with pause/resume orchestration +- Existing running sandboxes do not require migration. Only new pause/resume + operations use the new flow.