Tart macOS VM Sandbox Backend #30

@rjernst

branch: tart-sandbox-backend

Spec: Tart macOS VM Sandbox Backend

Overview

Ralph currently runs Claude Code inside Docker (Linux) sandboxes. For iOS/Swift projects that need Xcode, SwiftUI, and SwiftData, this is insufficient — those frameworks require macOS. This feature adds Tart (a Virtualization.framework wrapper by Cirrus Labs) as an alternative sandbox backend, enabling full xcodebuild test validation in macOS VMs on Apple Silicon.

Projects opt in by adding .agent-loop/config.json with "type": "tart" to their repo. The existing Docker backend remains the default and is unchanged.

Key design decisions:

  • Backend abstraction: SandboxBackend base class with DockerSandbox (existing, renamed) and TartSandbox (new)
  • Image caching: Content-addressed Tart templates (hash of base image name + dependencies file). APFS copy-on-write (CoW) clones make creating per-branch VMs near-instant.
  • Command execution: tart exec via guest agent (no SSH needed)
  • VM user: Default admin user (Cirrus Labs image default, has sudo for brew/npm)
  • VM limit: Apple's macOS software license agreement (SLA) permits at most 2 macOS VMs per host; ralph checks the running count before starting a VM and fails with a clear error
  • Proxy reuse: Existing credential proxy stays as Docker container; Tart VMs reach it via host IP instead of host.docker.internal
  • Network isolation: Deferred to follow-up (pf firewall approach documented in plan)
  • Dependencies: .agent-loop/dependencies for Docker stays as apt packages; for Tart it's a shell script. The config.json type determines parsing.

Architecture

.agent-loop/                     (in target project repo)
├── config.json                  (NEW — sandbox type + settings)
│   {"type": "tart", "base_image": "ghcr.io/cirruslabs/macos-sequoia-xcode:latest"}
├── dependencies                 (existing — apt packages for Docker, shell script for Tart)
└── Dockerfile.sandbox           (existing — custom Docker image, Docker only)

scripts/ralph                    (in dotfiles repo)
├── SandboxBackend               (NEW — base class with interface)
├── DockerSandbox(SandboxBackend) (RENAMED from Sandbox)
├── TartSandbox(SandboxBackend)  (NEW)
├── load_sandbox_config()        (NEW — reads config.json)
└── create_sandbox_backend()     (NEW — factory)

Tart VM lifecycle:

Base image (OCI registry)
  → tart clone → Template VM (agent-loop-template-{agent}-{hash})
                    dependencies installed, stopped, cached
  → tart clone → Per-branch VM (agent-loop-{agent}-{branch})
                    tart run --no-graphics --dir=workspace:{worktree}
                    kept running between iterations
                    tart exec for all commands

Proxy access from Tart VM:

Tart VM (NAT) → host gateway IP (192.168.64.1) → proxy port → Docker proxy container

Implementation Plan

Step 1: Add .agent-loop/config.json support [done]

Files:

  • scripts/ralph — Add load_sandbox_config() function
  • tests/test_ralph.py — Tests for config loading

Implement:

  1. Add load_sandbox_config(project_dir) function near the existing find_project_config method (around line 144). It reads .agent-loop/config.json from the project directory. Returns a dict with at least {"type": "docker"} as default. If the file exists, parse it with json.load(). Validate that type is either "docker" or "tart", raise ValueError for unknown types.
  2. For "tart" type, the config may include: base_image (str, required for tart), cpu (int, optional), memory_gb (int, optional).
  3. This is a standalone module-level function, not a method on Sandbox.
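
The loading and validation rules above could be sketched roughly as follows. This is a minimal illustration, not the code in scripts/ralph: the VALID_SANDBOX_TYPES constant, the exact error wording, and the check that the file holds a JSON object are assumptions.

```python
import json
import os

VALID_SANDBOX_TYPES = ("docker", "tart")  # assumed constant name


def load_sandbox_config(project_dir):
    """Read .agent-loop/config.json, defaulting to the Docker backend."""
    path = os.path.join(project_dir, ".agent-loop", "config.json")
    if not os.path.exists(path):
        return {"type": "docker"}
    with open(path) as f:
        config = json.load(f)  # malformed JSON raises json.JSONDecodeError
    if not isinstance(config, dict):
        raise ValueError(f"ralph: {path} must contain a JSON object")
    config.setdefault("type", "docker")  # a missing type means docker
    if config["type"] not in VALID_SANDBOX_TYPES:
        raise ValueError(
            f"ralph: unknown sandbox type {config['type']!r} in {path} "
            f"(expected one of: {', '.join(VALID_SANDBOX_TYPES)})"
        )
    if config["type"] == "tart" and not config.get("base_image"):
        raise ValueError(f'ralph: {path} needs "base_image" when type is "tart"')
    return config
```

Because json.JSONDecodeError is a subclass of ValueError, a caller that wants one error path for both malformed JSON and bad values can catch ValueError alone.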

Test:

  • TestLoadSandboxConfig: test default when no config, test docker type, test tart type with base_image, test missing type defaults to docker, test unknown type raises ValueError, test missing config.json returns default, test malformed JSON raises.

Verify: Run pytest tests/test_ralph.py -v -k TestLoadSandboxConfig. Fix any failures.

Review: Ensure config validation catches bad input early with clear error messages.

Address feedback: Fix all review findings. Re-run tests.

Step 2: Extract SandboxBackend base class and rename Sandbox to DockerSandbox [done]

Files:

  • scripts/ralph — Add base class, rename existing class

Implement:

  1. Add a SandboxBackend base class between the Git class and the current Sandbox class. It defines interface methods that raise NotImplementedError: ensure_image, ensure_sandbox, setup_git_config, run_iteration, preflight_check, cleanup_sandbox, prune_sandboxes, remove_sandbox. Also add proxy_host() method. Move the sandbox_name static method here (shared logic).
  2. Rename class Sandbox to class DockerSandbox(SandboxBackend).
  3. Add def proxy_host(self): return "host.docker.internal" to DockerSandbox.
  4. Update all references to Sandbox throughout the file: main() line ~2243 (Sandbox(dotfiles_dir) → DockerSandbox(dotfiles_dir)), prune-sandboxes subcommand line ~2118, cleanup_sandbox static calls that reference Sandbox.sandbox_name or Sandbox.cleanup_sandbox, and selftest().
  5. Do NOT change any behavior — this is a pure rename/extract refactor.
  6. Note: Sandbox has duplicate method definitions (lines ~107-220 and ~221-317 have generate_project_dockerfile, find_project_config, project_image_tag, ensure_project_image defined twice). During the rename, remove the first set of duplicates (keep the second set that's actually used). Verify with tests that the kept methods are the ones being tested.
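
As a rough picture of the extracted interface, the base class might look like the sketch below. Method names come from the list above and signatures from Step 5. The body of sandbox_name is an assumption (the name pattern is taken from the architecture section, and replacing "/" is a guess), and the DockerSandbox constructor is reduced to a stand-in.

```python
class SandboxBackend:
    """Interface shared by sandbox backends; subclasses override each method."""

    @staticmethod
    def sandbox_name(agent, branch):
        # Assumed naming scheme; the real sanitization rules may differ.
        return f"agent-loop-{agent}-{branch.replace('/', '-')}"

    def proxy_host(self):
        raise NotImplementedError

    def ensure_image(self, agent, force_rebuild=False):
        raise NotImplementedError

    def ensure_sandbox(self, agent, branch, worktree_path, **kwargs):
        raise NotImplementedError

    def setup_git_config(self, sandbox_name, user, email):
        raise NotImplementedError

    def run_iteration(self, sandbox_name, spec_content, model, env_vars=None):
        raise NotImplementedError

    def preflight_check(self, sandbox_name, agent, proxy_port):
        raise NotImplementedError

    def cleanup_sandbox(self, agent, branch):
        raise NotImplementedError

    def prune_sandboxes(self, agent):
        raise NotImplementedError

    def remove_sandbox(self, name):
        raise NotImplementedError


class DockerSandbox(SandboxBackend):
    """Stand-in for the renamed class; only the new method is shown."""

    def __init__(self, dotfiles_dir):
        self.dotfiles_dir = dotfiles_dir

    def proxy_host(self):
        return "host.docker.internal"
```

Raising NotImplementedError instead of using abc.ABC keeps the base class instantiable, which is what lets a test call each method on SandboxBackend directly.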

Test:

  • All existing TestSandbox* test classes should pass with minimal changes (update any explicit Sandbox(...) constructor calls to DockerSandbox(...)).
  • Add a basic test that SandboxBackend methods raise NotImplementedError.

Verify: Run pytest tests/test_ralph.py -v. ALL existing tests must pass. Fix any failures.

Review: Verify this is a pure refactor — no behavior changes. Grep for any remaining references to Sandbox( that should be DockerSandbox(.

Address feedback: Fix all findings. Re-run full test suite.

Step 3: Add backend factory and update process_issue/main [done]

Files:

  • scripts/ralph — Add factory, update callers
  • tests/test_ralph.py — Update test fixtures

Implement:

  1. Add create_sandbox_backend(sandbox_type, dotfiles_dir, **kwargs) factory function. For "docker", returns DockerSandbox(dotfiles_dir). For "tart", it will return TartSandbox(...) (stub for now — raise NotImplementedError("tart backend not yet implemented")).
  2. Modify process_issue() signature: replace sandbox parameter with dotfiles_dir. Inside process_issue, after resolving the worktree (work_dir), call config = load_sandbox_config(repo_root) then sandbox = create_sandbox_backend(config["type"], dotfiles_dir, **config).
  3. Replace the hardcoded "host.docker.internal" in env_vars (line ~1871) with sandbox.proxy_host().
  4. Update main(): instead of sandbox = Sandbox(dotfiles_dir) → pass dotfiles_dir to process_issue.
  5. Update poll_loop() similarly: accept dotfiles_dir instead of sandbox, pass it to process_issue.
  6. Update all callers in main() that pass sandbox to pass dotfiles_dir instead.
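
A minimal sketch of the factory at this stage, with the tart branch still stubbed, might read as follows. DockerSandbox is a stand-in here, and proxy_env with its PROXY_URL key is a hypothetical helper that only illustrates asking the backend for the proxy host. The real env_vars keys are not shown in this spec.

```python
class DockerSandbox:
    """Stand-in for the real class, reduced to what the factory needs."""

    def __init__(self, dotfiles_dir):
        self.dotfiles_dir = dotfiles_dir

    def proxy_host(self):
        return "host.docker.internal"


def create_sandbox_backend(sandbox_type, dotfiles_dir, **kwargs):
    """Return the backend for sandbox_type; kwargs carry backend settings."""
    if sandbox_type == "docker":
        return DockerSandbox(dotfiles_dir)
    if sandbox_type == "tart":
        raise NotImplementedError("tart backend not yet implemented")
    raise ValueError(f"ralph: unknown sandbox type {sandbox_type!r}")


def proxy_env(sandbox, proxy_port):
    # Hypothetical helper: the backend supplies the host, so no caller
    # hardcodes "host.docker.internal".
    return {"PROXY_URL": f"http://{sandbox.proxy_host()}:{proxy_port}"}
```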

Test:

  • TestCreateSandboxBackend: docker returns DockerSandbox, tart raises NotImplementedError (for now), unknown raises ValueError.
  • Update TestProcessIssueSandbox fixtures: mock load_sandbox_config to return {"type": "docker"}, mock create_sandbox_backend to return a mock sandbox. Verify proxy_host() is called for env vars.
  • Update TestMainSandboxFlags to match new signatures.

Verify: Run pytest tests/test_ralph.py -v. All tests pass.

Review: Ensure no remaining hardcoded host.docker.internal strings. Ensure sandbox.proxy_host() is used everywhere the proxy URL is constructed.

Address feedback: Fix all findings. Re-run tests.

Step 4: Implement TartSandbox — image and template management [done]

Files:

  • scripts/ralph — TartSandbox class (first half)
  • tests/test_ralph.py — Unit tests

Implement:

  1. Add TartSandbox(SandboxBackend) class. Constructor accepts config dict (with base_image, optional cpu, memory_gb, dependencies_content). Store config. Initialize self._vm_procs = {} dict for tracking running VMs.
  2. Implement _template_name(self, agent): hash = SHA256(base_image + "\n" + dependencies_content)[:12], returns f"agent-loop-template-{agent}-{hash}".
  3. Implement _list_vms(self): runs tart list --format json, returns parsed list. Returns [] on failure.
  4. Implement _running_vm_count(self): filters _list_vms() for State == "Running", returns count.
  5. Implement _check_vm_limit(self): if _running_vm_count() >= 2, raise RuntimeError with message: "ralph: cannot start VM — {count} macOS VMs already running (Apple SLA permits max 2). Stop an existing VM with: tart stop <name>".
  6. Implement _wait_for_guest_agent(self, vm_name, timeout=120): poll tart exec <vm> -- echo ok every 2 seconds until success or timeout. Raise RuntimeError on timeout with message including VM name and timeout.
  7. Implement ensure_image(self, agent, force_rebuild=False):
    • Compute template name via _template_name(agent)
    • Check if template already exists in _list_vms() (by name match)
    • If exists and not force_rebuild, return template name
    • If force_rebuild and exists, delete old template first (tart delete)
    • Run _check_vm_limit() before starting any VM
    • Clone base image: tart clone {base_image} {template}
    • If dependencies_content is non-empty: start VM headless (tart run {template} --no-graphics as Popen), wait for guest agent, execute dependencies via tart exec -i {template} -- bash -e with dependencies as stdin, then stop VM (tart stop {template}) and wait for the Popen to finish
    • Return template name
  8. Read dependencies content: when create_sandbox_backend creates a TartSandbox, it should read .agent-loop/dependencies if it exists and pass the content as dependencies_content in the config dict. Update create_sandbox_backend to handle this.
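
The content-addressed template name and the VM limit check could be sketched as plain functions, as below. The function names and the limit parameter are illustrative; the real versions are methods on TartSandbox. The "State" key and "Running" value follow the wording of this spec and should be checked against the actual tart list --format json output.

```python
import hashlib
import json
import subprocess


def template_name(agent, base_image, dependencies_content=""):
    """Same inputs give the same name; any change yields a new template."""
    payload = (base_image + "\n" + dependencies_content).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"agent-loop-template-{agent}-{digest}"


def list_vms():
    """Return the parsed VM list, or [] if tart fails or is missing."""
    try:
        result = subprocess.run(
            ["tart", "list", "--format", "json"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return []


def check_vm_limit(vms, limit=2):
    """Raise if starting one more VM would exceed the limit."""
    count = sum(1 for vm in vms if vm.get("State") == "Running")
    if count >= limit:
        raise RuntimeError(
            f"ralph: cannot start VM — {count} macOS VMs already running "
            f"(Apple SLA permits max {limit}). "
            "Stop an existing VM with: tart stop <name>"
        )
```

Since the dependencies content is part of the hash input, editing .agent-loop/dependencies produces a different template name, so a stale template is never reused by accident.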

Test:

  • TestTartTemplateName: deterministic hash, changes with base_image, changes with dependencies
  • TestTartListVms: mock subprocess, test parse, test failure returns []
  • TestTartCheckVmLimit: 0 running passes, 1 running passes, 2 running raises with clear message
  • TestTartWaitForGuestAgent: succeeds on first try, succeeds after retries, times out raises RuntimeError
  • TestTartEnsureImage: template exists returns cached, force_rebuild deletes and recreates, dependencies installed via tart exec, no dependencies skips install step
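
For the guest agent tests, one possible shape is sketched below with a module-level version of the polling loop and two tests that mock subprocess.run and time.sleep. The function and test names are illustrative, and the tests use plain try/except so the sketch stays stdlib-only. The real suite would more likely use pytest.raises.

```python
import subprocess
import time
from unittest import mock


def wait_for_guest_agent(vm_name, timeout=120, interval=2):
    """Poll `tart exec <vm> -- echo ok` until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = subprocess.run(
            ["tart", "exec", vm_name, "--", "echo", "ok"], capture_output=True
        )
        if result.returncode == 0:
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"ralph: guest agent in VM {vm_name} "
                f"did not respond within {timeout}s"
            )
        time.sleep(interval)


def test_succeeds_after_retries():
    results = [mock.Mock(returncode=1), mock.Mock(returncode=1),
               mock.Mock(returncode=0)]
    with mock.patch("subprocess.run", side_effect=results) as run, \
            mock.patch("time.sleep") as sleep:
        wait_for_guest_agent("vm")
    assert run.call_count == 3
    assert sleep.call_count == 2  # no sleep after the successful attempt


def test_times_out():
    with mock.patch("subprocess.run", return_value=mock.Mock(returncode=1)), \
            mock.patch("time.sleep"):
        try:
            wait_for_guest_agent("vm", timeout=0)
        except RuntimeError as exc:
            assert "vm" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")
```

Patching time.sleep keeps the retry test instant, and timeout=0 forces the timeout path after a single failed attempt.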

Verify: Run pytest tests/test_ralph.py -v -k TestTart. Fix any failures.

Review: Check that VM limit errors include actionable guidance. Check that the Popen for tart run is properly cleaned up in ensure_image (stop + wait).

Address feedback: Fix all findings. Re-run tests.

Step 5: Implement TartSandbox — sandbox lifecycle and command execution [done]

Files:

  • scripts/ralph — TartSandbox remaining methods
  • tests/test_ralph.py — Unit tests

Implement:

  1. ensure_sandbox(self, agent, branch, worktree_path, **kwargs):
    • Generate name via self.sandbox_name(agent, branch)
    • Check if VM exists and is running → reuse
    • If exists but stopped → delete and recreate
    • Call _check_vm_limit() before starting
    • Clone from template: tart clone {template} {name}
    • Start headless with directory sharing: tart run {name} --no-graphics --dir=workspace:{worktree_path} as Popen, store in self._vm_procs[name]
    • Register atexit handler (once) to stop all VMs on exit
    • Wait for guest agent
    • Return name
  2. setup_git_config(self, sandbox_name, user, email): run tart exec {name} -- git config --global user.name {user}, same for email, same for safe.directory *.
  3. run_iteration(self, sandbox_name, spec_content, model, env_vars=None):
    • Write spec: tart exec -i {name} -- tee /tmp/spec.md with input=spec_content, stdout devnull
    • Build claude command with env vars: cd '/Volumes/My Shared Files/workspace' && env KEY=VAL ... claude -p '...' --model {model} --dangerously-skip-permissions --effort high
    • Execute: tart exec {name} -- bash -c "{claude_cmd}"
    • Read spec: tart exec {name} -- cat /tmp/spec.md
    • Return (exit_code, updated_spec)
  4. proxy_host(self): first try tart exec {name} -- route -n get default and parse gateway IP. Fallback to ipconfig getifaddr en0 on the host. Final fallback 192.168.64.1. Cache the result after first discovery.
  5. cleanup_sandbox(self, agent, branch): tart stop {name}; tart delete {name}. Remove from _vm_procs.
  6. remove_sandbox(self, name): same as cleanup but by name.
  7. prune_sandboxes(self, agent): list VMs with matching prefix, check if associated worktree path exists (derive from VM name), delete orphans.
  8. preflight_check(self, sandbox_name, agent, proxy_port): check token, proxy health, VM responsive (tart exec echo ok). Skip network isolation check (log note that Tart VMs don't have network isolation). Return list of failure messages.
  9. For exec_output, check_in_sync, reset_to_host, sync_to_host: these Docker-specific methods on DockerSandbox use docker sandbox exec. TartSandbox needs equivalent implementations using tart exec. For sync_to_host, since Tart uses VirtioFS (shared directory), host and VM see the same files — commits in the VM are immediately visible on the host. So sync_to_host can verify the commit exists on host and return True. check_in_sync always returns True (shared filesystem). reset_to_host is a no-op (shared filesystem).
  10. Update create_sandbox_backend to create TartSandbox for type "tart" (remove the NotImplementedError stub).
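
The proxy_host fallback chain in item 4 can be illustrated with two pure functions. Both function names are hypothetical, and the sample text in the usage note assumes the usual macOS route -n get default layout, in which the gateway appears on a line starting with "gateway:".

```python
import re

DEFAULT_GATEWAY = "192.168.64.1"  # final fallback named in the plan


def parse_gateway(route_output):
    """Extract the gateway address from `route -n get default` output."""
    match = re.search(r"^\s*gateway:\s*(\S+)", route_output, re.MULTILINE)
    return match.group(1) if match else None


def discover_proxy_host(route_output, host_ip=None):
    """Apply the fallback order: VM gateway, then host en0 IP, then default."""
    return parse_gateway(route_output or "") or host_ip or DEFAULT_GATEWAY
```

For example, discover_proxy_host("    gateway: 192.168.64.1\n") takes the first branch, while empty output with host_ip="10.0.0.5" falls through to the host address. Keeping the parsing separate from the subprocess calls makes each fallback easy to unit test.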

Test:

  • TestTartEnsureSandbox: creates new VM, reuses running VM, deletes stopped VM and recreates, VM limit check
  • TestTartSetupGitConfig: correct tart exec commands
  • TestTartRunIteration: writes spec, runs claude with env vars, reads spec back, handles write failure, handles read failure
  • TestTartProxyHost: gateway parsing, fallback to en0, final fallback
  • TestTartCleanupSandbox: stops and deletes
  • TestTartPruneSandboxes: removes orphans, keeps active
  • TestTartPreflightCheck: all pass, token missing, proxy down, VM unresponsive
  • TestTartSyncToHost: returns True (shared filesystem)
  • TestTartCheckInSync: returns True always
  • Update TestCreateSandboxBackend: tart type now returns TartSandbox instance

Verify: Run pytest tests/test_ralph.py -v. ALL tests pass.

Review: Check that env vars in run_iteration are properly shell-escaped to prevent injection. Check that atexit cleanup handles the case where VMs are already stopped. Check that proxy_host caching works correctly.
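
One way to address the shell-escaping concern is to pass every interpolated value through shlex.quote and to reject variable names that are not plain identifiers. The sketch below assumes a hypothetical build_claude_command helper. The command shape follows Step 5, but the real prompt text and env var names are not shown in this spec.

```python
import re
import shlex

SHARED_DIR = "/Volumes/My Shared Files/workspace"
_ENV_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_claude_command(prompt, model, env_vars=None):
    """Build the string handed to `bash -c` inside the VM."""
    env_parts = []
    for key, value in (env_vars or {}).items():
        if not _ENV_KEY.fullmatch(key):
            raise ValueError(f"ralph: invalid environment variable name {key!r}")
        env_parts.append(f"{key}={shlex.quote(str(value))}")
    parts = [
        "cd", shlex.quote(SHARED_DIR), "&&",
        "env", *env_parts,
        "claude", "-p", shlex.quote(prompt),
        "--model", shlex.quote(model),
        "--dangerously-skip-permissions", "--effort", "high",
    ]
    return " ".join(parts)


def exec_argv(vm_name, command):
    # Passing a list to subprocess avoids a second layer of host-side quoting.
    return ["tart", "exec", vm_name, "--", "bash", "-c", command]
```

With this shape, a value such as "a; rm -rf /" reaches env as a single argument instead of being parsed by bash as a second command.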

Address feedback: Fix all findings. Re-run full test suite.

Notes:

  • Moved ITERATION_PROMPT from DockerSandbox to SandboxBackend base class so both backends can reference it.
  • Added import atexit and import shlex to script imports.
  • _atexit_registered is a class-level flag; the actual atexit handler is a no-op safety net since cleanup_sandbox is the primary cleanup path.
  • prune_sandboxes uses stopped-state as the orphan heuristic since Tart VMs don't store workspace metadata. Running VMs and templates are preserved.
  • create_sandbox_backend already returned TartSandbox (no NotImplementedError stub to remove — that was handled in Step 4).

Step 6: Wire up CLI and prerequisites [done]

Files:

  • scripts/ralph — Update CLI, selftest, prerequisites

Implement:

  1. Update check_dependencies_prereq() to also check for tart when running in tart mode. Since we don't know the sandbox type at prereq check time (it depends on the project), check for tart in the Tart backend constructor instead, or make the check lazy. Best approach: add a check_prerequisites() method to SandboxBackend. DockerSandbox.check_prerequisites() verifies docker is available. TartSandbox.check_prerequisites() verifies both docker (for proxy) and tart are available.
  2. Update selftest(): accept an optional sandbox_type parameter (default "docker"). When "tart", use TartSandbox-specific checks: build template, clone test VM, verify tart exec works, verify proxy reachable from VM via host IP, verify Claude auth via proxy, cleanup. The tart selftest still uses the Docker proxy (proxy stays unchanged).
  3. Add --type docker|tart flag to the selftest subcommand parser in main().
  4. Update prune-sandboxes to detect sandbox type or accept --type flag. When type is "tart", use TartSandbox.prune_sandboxes. Default remains docker.
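
The per-backend prerequisite check might be sketched as below, using shutil.which to look for each tool on PATH. The missing_tools helper, the choice to return a list of failure messages, and the hint texts are assumptions. The spec only states which tools each backend verifies.

```python
import shutil


def missing_tools(tools):
    """Hypothetical helper: names from `tools` that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


class SandboxBackend:
    def check_prerequisites(self):
        raise NotImplementedError


class DockerSandbox(SandboxBackend):
    def check_prerequisites(self):
        return [f"ralph: {tool} not found on PATH"
                for tool in missing_tools(["docker"])]


class TartSandbox(SandboxBackend):
    # Assumed hints; docker is still required because the proxy runs there.
    HINTS = {
        "tart": "install with: brew install cirruslabs/cli/tart",
        "docker": "needed for the credential proxy container",
    }

    def check_prerequisites(self):
        return [f"ralph: {tool} not found on PATH ({self.HINTS[tool]})"
                for tool in missing_tools(["tart", "docker"])]
```

Running the check on the backend instance keeps it lazy: a Docker-only project never reports a missing tart.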

Test:

  • TestSelftest: add test for --type tart routing
  • TestMainSandboxFlags: test --type flag parsing
  • Test prerequisite checks for TartSandbox

Verify: Run pytest tests/test_ralph.py -v. All tests pass.

Review: Check that the selftest cleanup always runs (even on failure). Check error messages when tart is not installed.

Address feedback: Fix all findings. Re-run tests.

Notes:

  • check_prerequisites() added to SandboxBackend (abstract), DockerSandbox (checks docker), and TartSandbox (checks tart + docker).
  • selftest() refactored into shared preamble (token, prerequisites, proxy) + _selftest_docker() and _selftest_tart() helper functions.
  • Introduced _SelftestAbort exception to abort early from helper functions while preserving cleanup in the main selftest() finally block.
  • Cleanup is now always attempted (remove_sandbox is idempotent), rather than conditionally based on sandbox_created flag.
  • The prerequisites check adds one extra check to the selftest count (9 for docker without project image, 10 with).
  • --type flag validates against ("docker", "tart") with exit code 2 for unknown types.

Step 7: Run all checks [done]

Implement:

  1. Run the full test suite: pytest tests/test_ralph.py -v
  2. Run shellcheck on any shell scripts that were modified
  3. Run python3 -c "import py_compile; py_compile.compile('scripts/ralph', doraise=True)" to verify syntax
  4. Fix any failures and commit the fixes

Verify: All checks pass clean.

Step 8: Fix TartSandbox atexit cleanup with class-level VM tracking [done]

Files:

  • scripts/ralph — Move _vm_procs to class variable, implement _atexit_stop_all
  • tests/test_ralph.py — Tests for atexit behavior

Implement:

  1. Change _vm_procs from an instance variable to a class variable: add _vm_procs = {} at class level on TartSandbox, and remove self._vm_procs = {} from __init__.
  2. Implement _atexit_stop_all: iterate TartSandbox._vm_procs.items(), call tart stop for each VM (best-effort, suppress errors), then proc.wait() for each tracked Popen. Clear the dict afterward.
  3. All existing code that reads/mutates self._vm_procs continues to work since Python resolves instance attribute reads to the class variable when no instance attribute shadows it, and dict mutations (subscript assignment, .pop()) mutate the class-level dict in place.
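
A compact illustration of the class-level tracking and the stop-all handler follows. The track method is a hypothetical stand-in for the code in ensure_sandbox that records the Popen, and the 30-second timeout on tart stop is an assumption.

```python
import subprocess


class TartSandbox:
    _vm_procs = {}  # class-level: one dict shared by every instance

    def track(self, name, proc):
        # Subscript assignment mutates the shared dict in place. Writing
        # `self._vm_procs = {...}` instead would shadow it on this instance.
        self._vm_procs[name] = proc

    @classmethod
    def _atexit_stop_all(cls):
        """Best-effort stop of every tracked VM, then forget them all."""
        for name, proc in list(cls._vm_procs.items()):
            try:
                subprocess.run(["tart", "stop", name],
                               capture_output=True, timeout=30)
            except (OSError, subprocess.SubprocessError):
                pass  # VM may already be stopped, or tart may be missing
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                pass
        cls._vm_procs.clear()
```

A handler like this would be registered once with atexit.register(TartSandbox._atexit_stop_all). Iterating over a list copy of the items keeps the loop safe if the dict changes during shutdown.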

Test:

  • Test that _atexit_stop_all calls tart stop for each tracked VM and waits for each Popen process.
  • Test that _atexit_stop_all handles empty _vm_procs gracefully.
  • Test that _atexit_stop_all clears _vm_procs after cleanup.
  • Test that _vm_procs is shared across instances (class-level).

Verify: Run pytest tests/test_ralph.py -v. Fix any failures.

Review: Ensure no code path reassigns self._vm_procs = ... (which would shadow the class variable). All mutations must be in-place (subscript, .pop()).

Address feedback: Fix findings, re-run tests.

Notes:

  • _vm_procs = {} added as class variable on TartSandbox; self._vm_procs = {} removed from __init__.
  • _atexit_stop_all now iterates TartSandbox._vm_procs.items(), calls tart stop for each VM (suppressing errors), waits for each Popen with a 10-second timeout, then clears the dict.
  • Test classes that mutate _vm_procs (TestTartEnsureSandbox, TestTartProxyHost, TestTartCleanupSandbox, TestTartRemoveSandbox) now save/restore the class-level dict in setup_method/teardown_method to prevent cross-test pollution.

Step 9: Fix config unpacking, interface consistency, and VM list caching [done]

Files:

  • scripts/ralph — Multiple small fixes
  • tests/test_ralph.py — Update affected tests

Implement:

  1. Config unpacking (critical): In process_issue, pop "type" from the config dict before passing **config to create_sandbox_backend, so type is not passed twice (once as positional, once in kwargs). Also pop "project_dir" since it's added to config and consumed by create_sandbox_backend but not a valid TartSandbox/DockerSandbox constructor argument.
  2. Base class interface: Add check_in_sync, reset_to_host, and sync_to_host to SandboxBackend as abstract methods (raising NotImplementedError), since both DockerSandbox and TartSandbox implement them and process_issue calls them polymorphically.
  3. Signature consistency: Remove the unused workdir parameter from DockerSandbox.run_iteration. The base class signature is run_iteration(self, sandbox_name, spec_content, model, env_vars=None) and process_issue never passes workdir.
  4. Extract shared folder constant: Add SHARED_DIR = "/Volumes/My Shared Files/workspace" as a class constant on TartSandbox. Use it in run_iteration where the path is currently hardcoded in the cd command.
  5. VM list caching: Add a _cached_vm_list tuple of (timestamp, result) to TartSandbox (class-level). In _list_vms, return the cached result if it's less than 2 seconds old. This avoids redundant tart list subprocess calls when _vm_state and _check_vm_limit are called in quick succession within ensure_sandbox.
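
The 2-second cache in item 5 could be sketched as a small standalone helper. The plan stores the tuple directly on TartSandbox, so the VmListCache class, its injectable clock, and the invalidate method are illustrative assumptions that make the timing logic easy to test.

```python
import time


class VmListCache:
    """Cache the result of an expensive fetch for a short, fixed TTL."""

    TTL_SECONDS = 2.0

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch      # e.g. a function that runs `tart list`
        self._clock = clock      # injectable so tests can control time
        self._cached = None      # (timestamp, result) once populated

    def get(self):
        now = self._clock()
        if self._cached is not None and now - self._cached[0] < self.TTL_SECONDS:
            return self._cached[1]
        result = self._fetch()
        self._cached = (now, result)
        return result

    def invalidate(self):
        # Worth calling after clone, stop, or delete so state is not stale.
        self._cached = None
```

time.monotonic is preferable to time.time here because it cannot jump backwards when the system clock is adjusted.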

Test:

  • Verify process_issue no longer passes type in kwargs to create_sandbox_backend.
  • Test that SandboxBackend.check_in_sync, .reset_to_host, .sync_to_host raise NotImplementedError.
  • Verify DockerSandbox.run_iteration works without workdir parameter.
  • Verify TartSandbox.SHARED_DIR is used in run_iteration output.
  • Test VM list caching: two calls within 2 seconds hit cache, call after expiry fetches fresh.

Verify: Run pytest tests/test_ralph.py -v. All tests pass.

Review: Check that removing workdir from Docker doesn't break any callers (grep for workdir). Ensure cache TTL is short enough to not mask real VM state changes.

Address feedback: Fix findings, re-run tests.

Notes:

  • Config unpacking: type is now popped from the config dict in process_issue before **config is passed to create_sandbox_backend. project_dir was already handled by the factory's tart branch (popped in create_sandbox_backend).
  • Base class interface: check_in_sync, reset_to_host, sync_to_host added to SandboxBackend as abstract methods raising NotImplementedError.
  • DockerSandbox.run_iteration: removed workdir parameter; process_issue no longer passes workdir=work_dir to run_iteration. Docker sandbox exec's -w flag is still used by exec_output and check_in_sync (separate methods).
  • TartSandbox.SHARED_DIR = "/Volumes/My Shared Files/workspace" added as class constant, used in run_iteration.
  • VM list cache: implemented as the _vm_list_cache class variable (the plan above calls it _cached_vm_list), a (timestamp, result) tuple with a 2-second TTL measured with time.monotonic(). Tests save/restore the cache in setup/teardown.
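
The config unpacking fix can be illustrated with a small sketch. split_sandbox_config is a hypothetical helper and the factory is a stand-in that only records what it receives. The real code pops the key inline in process_issue.

```python
def create_sandbox_backend(sandbox_type, dotfiles_dir, **kwargs):
    """Stand-in factory that records its arguments for inspection."""
    return {"type": sandbox_type, "dotfiles_dir": dotfiles_dir, "kwargs": kwargs}


def split_sandbox_config(config):
    """Hypothetical helper: separate the type from the backend settings."""
    remaining = dict(config)  # copy, so the caller's dict is left untouched
    sandbox_type = remaining.pop("type", "docker")
    return sandbox_type, remaining


config = {
    "type": "tart",
    "base_image": "ghcr.io/cirruslabs/macos-sequoia-xcode:latest",
}
sandbox_type, settings = split_sandbox_config(config)
backend = create_sandbox_backend(sandbox_type, "/path/to/dotfiles", **settings)
```

Without the pop, the "type" key would ride along in **kwargs and could be forwarded to a backend constructor that does not accept it.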

Step 10: Run all checks

Implement:

  1. Run the full test suite: pytest tests/test_ralph.py -v
  2. Run python3 -c "import py_compile; py_compile.compile('scripts/ralph', doraise=True)" to verify syntax
  3. Fix any failures and commit the fixes

Verify: All checks pass clean.


Conventions

  • Language: Python 3 (stdlib only, no third-party dependencies)
  • Tests: pytest with unittest.mock for subprocess mocking. Test classes named TestXxx, test methods test_xxx.
  • Error messages: Prefix with ralph: (e.g., ralph: cannot start VM — ...)
  • Exit codes: 0=success, 1=runtime error, 2=usage error
  • Imports: stdlib only. subprocess.run for external commands, json for config parsing.

Labels: spec (Ralph spec for automated execution), status:done (Completed)
