branch: tart-sandbox-backend
Spec: Tart macOS VM Sandbox Backend
Overview
Ralph currently runs Claude Code inside Docker (Linux) sandboxes. For iOS/Swift projects that need Xcode, SwiftUI, and SwiftData, this is insufficient — those frameworks require macOS. This feature adds Tart (a Virtualization.framework wrapper by Cirrus Labs) as an alternative sandbox backend, enabling full xcodebuild test validation in macOS VMs on Apple Silicon.
Projects opt in by adding .agent-loop/config.json with "type": "tart" to their repo. The existing Docker backend remains the default and is unchanged.
Key design decisions:
- Backend abstraction: SandboxBackend base class with DockerSandbox (existing, renamed) and TartSandbox (new)
- Image caching: Content-addressed Tart templates (hash of base image name + dependencies file). APFS CoW clones make per-branch VMs instant.
- Command execution: tart exec via guest agent (no SSH needed)
- VM user: Default admin user (Cirrus Labs image default, has sudo for brew/npm)
- VM limit: Apple SLA permits max 2 macOS VMs; ralph checks and gives clear errors
- Proxy reuse: Existing credential proxy stays as Docker container; Tart VMs reach it via host IP instead of host.docker.internal
- Network isolation: Deferred to follow-up (pf firewall approach documented in plan)
- Dependencies: .agent-loop/dependencies for Docker stays as apt packages; for Tart it's a shell script. The config.json type determines parsing.
Architecture
.agent-loop/ (in target project repo)
├── config.json (NEW — sandbox type + settings)
│ {"type": "tart", "base_image": "ghcr.io/cirruslabs/macos-sequoia-xcode:latest"}
├── dependencies (existing — apt packages for Docker, shell script for Tart)
└── Dockerfile.sandbox (existing — custom Docker image, Docker only)
scripts/ralph (in dotfiles repo)
├── SandboxBackend (NEW — base class with interface)
├── DockerSandbox(SandboxBackend) (RENAMED from Sandbox)
├── TartSandbox(SandboxBackend) (NEW)
├── load_sandbox_config() (NEW — reads config.json)
└── create_sandbox_backend() (NEW — factory)
Tart VM lifecycle:
Base image (OCI registry)
  → tart clone → Template VM (agent-loop-template-{agent}-{hash})
                 dependencies installed, stopped, cached
    → tart clone → Per-branch VM (agent-loop-{agent}-{branch})
                   tart run --no-graphics --dir=workspace:{worktree}
                   kept running between iterations
                   tart exec for all commands
Proxy access from Tart VM:
Tart VM (NAT) → host gateway IP (192.168.64.1) → proxy port → Docker proxy container
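
The Tart VM lifecycle above boils down to three command shapes. A minimal sketch of how they might be assembled as argument lists for subprocess follows; the helper names clone_argv, run_argv, and exec_argv are hypothetical and do not appear in scripts/ralph, and the flags are taken only from the commands quoted in this spec.

```python
def clone_argv(source, target):
    """tart clone: base image -> template, or template -> per-branch VM."""
    return ["tart", "clone", source, target]


def run_argv(vm_name, worktree=None):
    """tart run, headless; optionally share the worktree as 'workspace'."""
    argv = ["tart", "run", vm_name, "--no-graphics"]
    if worktree:
        argv.append(f"--dir=workspace:{worktree}")
    return argv


def exec_argv(vm_name, *command):
    """tart exec via the guest agent, using the '--' form quoted in this spec."""
    return ["tart", "exec", vm_name, "--", *command]
```

Building argument lists rather than shell strings keeps paths with spaces intact without extra quoting.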
Implementation Plan
Step 1: Add .agent-loop/config.json support [done]
Files:
- scripts/ralph: Add load_sandbox_config() function
- tests/test_ralph.py: Tests for config loading
Implement:
- Add load_sandbox_config(project_dir) function near the existing find_project_config method (around line 144). It reads .agent-loop/config.json from the project directory. Returns a dict with at least {"type": "docker"} as default. If the file exists, parse it with json.load(). Validate that type is either "docker" or "tart", raise ValueError for unknown types.
- For "tart" type, the config may include: base_image (str, required for tart), cpu (int, optional), memory_gb (int, optional).
- This is a standalone module-level function, not a method on Sandbox.
Test:
TestLoadSandboxConfig: test default when no config, test docker type, test tart type with base_image, test missing type defaults to docker, test unknown type raises ValueError, test missing config.json returns default, test malformed JSON raises.
Verify: Run pytest tests/test_ralph.py -v -k TestLoadSandboxConfig. Fix any failures.
Review: Ensure config validation catches bad input early with clear error messages.
Address feedback: Fix all review findings. Re-run tests.
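
A minimal sketch of the config loader described in this step, assuming the behavior listed under Implement. The exact error wording and the check that the file holds a JSON object are assumptions, not taken from scripts/ralph.

```python
import json
from pathlib import Path

VALID_TYPES = ("docker", "tart")  # the two types named in this spec


def load_sandbox_config(project_dir):
    """Read .agent-loop/config.json; fall back to the Docker backend."""
    path = Path(project_dir) / ".agent-loop" / "config.json"
    if not path.exists():
        return {"type": "docker"}
    with path.open() as fh:
        config = json.load(fh)  # malformed JSON raises json.JSONDecodeError
    if not isinstance(config, dict):  # assumption: reject non-object JSON
        raise ValueError(f"ralph: {path} must contain a JSON object")
    config.setdefault("type", "docker")  # missing type defaults to docker
    if config["type"] not in VALID_TYPES:
        raise ValueError(f"ralph: unknown sandbox type {config['type']!r} in {path}")
    if config["type"] == "tart" and not config.get("base_image"):
        raise ValueError(f"ralph: {path} needs base_image when type is tart")
    return config
```

Since json.JSONDecodeError is a subclass of ValueError, a caller can catch ValueError to cover both malformed JSON and failed validation.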
Step 2: Extract SandboxBackend base class and rename Sandbox to DockerSandbox [done]
Files:
- scripts/ralph: Add base class, rename existing class
Implement:
- Add a SandboxBackend base class between the Git class and the current Sandbox class. It defines interface methods that raise NotImplementedError: ensure_image, ensure_sandbox, setup_git_config, run_iteration, preflight_check, cleanup_sandbox, prune_sandboxes, remove_sandbox. Also add proxy_host() method. Move the sandbox_name static method here (shared logic).
- Rename class Sandbox to class DockerSandbox(SandboxBackend).
- Add def proxy_host(self): return "host.docker.internal" to DockerSandbox.
- Update all references to Sandbox throughout the file: main() line ~2243 (Sandbox(dotfiles_dir) → DockerSandbox(dotfiles_dir)), prune-sandboxes subcommand line ~2118, cleanup_sandbox static calls that reference Sandbox.sandbox_name or Sandbox.cleanup_sandbox, and selftest().
- Do NOT change any behavior; this is a pure rename/extract refactor.
- Note: Sandbox has duplicate method definitions (lines ~107-220 and ~221-317 have generate_project_dockerfile, find_project_config, project_image_tag, ensure_project_image defined twice). During the rename, remove the first set of duplicates (keep the second set that's actually used). Verify with tests that the kept methods are the ones being tested.
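
The extraction above might look roughly like the following sketch. Only a few of the interface methods are shown, and the sanitizing rule inside sandbox_name is a hypothetical stand-in: this spec fixes only the agent-loop-{agent}-{branch} shape, not how branch names are cleaned.

```python
import re


class SandboxBackend:
    """Interface shared by the Docker and Tart backends (abridged)."""

    @staticmethod
    def sandbox_name(agent, branch):
        # Assumption: replace characters such as '/' so the name is VM-safe.
        safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
        return f"agent-loop-{agent}-{safe_branch}"

    def proxy_host(self):
        raise NotImplementedError

    def ensure_image(self, agent, force_rebuild=False):
        raise NotImplementedError

    def ensure_sandbox(self, agent, branch, worktree_path, **kwargs):
        raise NotImplementedError

    def run_iteration(self, sandbox_name, spec_content, model, env_vars=None):
        raise NotImplementedError


class DockerSandbox(SandboxBackend):
    """Renamed from Sandbox; existing behavior is omitted in this sketch."""

    def __init__(self, dotfiles_dir):
        self.dotfiles_dir = dotfiles_dir

    def proxy_host(self):
        return "host.docker.internal"
```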
Test:
- All existing TestSandbox* test classes should pass with minimal changes (update any explicit Sandbox(...) constructor calls to DockerSandbox(...)).
- Add a basic test that SandboxBackend methods raise NotImplementedError.
Verify: Run pytest tests/test_ralph.py -v. ALL existing tests must pass. Fix any failures.
Review: Verify this is a pure refactor — no behavior changes. Grep for any remaining references to Sandbox( that should be DockerSandbox(.
Address feedback: Fix all findings. Re-run full test suite.
Step 3: Add backend factory and update process_issue/main [done]
Files:
- scripts/ralph: Add factory, update callers
- tests/test_ralph.py: Update test fixtures
Implement:
- Add create_sandbox_backend(sandbox_type, dotfiles_dir, **kwargs) factory function. For "docker", returns DockerSandbox(dotfiles_dir). For "tart", it will return TartSandbox(...) (stub for now: raise NotImplementedError("tart backend not yet implemented")).
- Modify process_issue() signature: replace sandbox parameter with dotfiles_dir. Inside process_issue, after resolving the worktree (work_dir), call config = load_sandbox_config(repo_root) then sandbox = create_sandbox_backend(config["type"], dotfiles_dir, **config).
- Replace the hardcoded "host.docker.internal" in env_vars (line ~1871) with sandbox.proxy_host().
- Update main(): instead of sandbox = Sandbox(dotfiles_dir) → pass dotfiles_dir to process_issue.
- Update poll_loop() similarly: accept dotfiles_dir instead of sandbox, pass it to process_issue.
- Update all callers in main() that pass sandbox to pass dotfiles_dir instead.
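
A sketch of the factory as it stands at this step, with the Tart branch still stubbed. DockerSandbox here is a stand-in class so the example runs on its own, and the ValueError wording is an assumption.

```python
class DockerSandbox:
    """Stand-in for the real DockerSandbox in scripts/ralph."""

    def __init__(self, dotfiles_dir):
        self.dotfiles_dir = dotfiles_dir


def create_sandbox_backend(sandbox_type, dotfiles_dir, **kwargs):
    """Return the backend for sandbox_type; kwargs carry backend settings."""
    if sandbox_type == "docker":
        return DockerSandbox(dotfiles_dir)
    if sandbox_type == "tart":
        # Stub for this step; a later step replaces it with TartSandbox.
        raise NotImplementedError("tart backend not yet implemented")
    raise ValueError(f"ralph: unknown sandbox type {sandbox_type!r}")
```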
Test:
- TestCreateSandboxBackend: docker returns DockerSandbox, tart raises NotImplementedError (for now), unknown raises ValueError.
- Update TestProcessIssueSandbox fixtures: mock load_sandbox_config to return {"type": "docker"}, mock create_sandbox_backend to return a mock sandbox. Verify proxy_host() is called for env vars.
- Update TestMainSandboxFlags to match new signatures.
Verify: Run pytest tests/test_ralph.py -v. All tests pass.
Review: Ensure no remaining hardcoded host.docker.internal strings. Ensure sandbox.proxy_host() is used everywhere the proxy URL is constructed.
Address feedback: Fix all findings. Re-run tests.
Step 4: Implement TartSandbox — image and template management [done]
Files:
- scripts/ralph: TartSandbox class (first half)
- tests/test_ralph.py: Unit tests
Implement:
- Add TartSandbox(SandboxBackend) class. Constructor accepts config dict (with base_image, optional cpu, memory_gb, dependencies_content). Store config. Initialize self._vm_procs = {} dict for tracking running VMs.
- Implement _template_name(self, agent): hash = SHA256(base_image + "\n" + dependencies_content)[:12], returns f"agent-loop-template-{agent}-{hash}".
- Implement _list_vms(self): runs tart list --format json, returns parsed list. Returns [] on failure.
- Implement _running_vm_count(self): filters _list_vms() for State == "Running", returns count.
- Implement _check_vm_limit(self): if _running_vm_count() >= 2, raise RuntimeError with message: "ralph: cannot start VM — {count} macOS VMs already running (Apple SLA permits max 2). Stop an existing VM with: tart stop <name>".
- Implement _wait_for_guest_agent(self, vm_name, timeout=120): poll tart exec <vm> -- echo ok every 2 seconds until success or timeout. Raise RuntimeError on timeout with message including VM name and timeout.
- Implement ensure_image(self, agent, force_rebuild=False):
  - Compute template name via _template_name(agent)
  - Check if template already exists in _list_vms() (by name match)
  - If exists and not force_rebuild, return template name
  - If force_rebuild and exists, delete old template first (tart delete)
  - Run _check_vm_limit() before starting any VM
  - Clone base image: tart clone {base_image} {template}
  - If dependencies_content is non-empty: start VM headless (tart run {template} --no-graphics as Popen), wait for guest agent, execute dependencies via tart exec -i {template} -- bash -e with dependencies as stdin, then stop VM (tart stop {template}) and wait for the Popen to finish
  - Return template name
- Read dependencies content: when create_sandbox_backend creates a TartSandbox, it should read .agent-loop/dependencies if it exists and pass the content as dependencies_content in the config dict. Update create_sandbox_backend to handle this.
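
The content-addressed template name and the VM limit check listed above can be sketched as pure functions. They are written here as module-level helpers that take the parsed VM list as an argument (an assumption made for testability; the spec places them as methods on TartSandbox). The State key and "Running" value follow this spec rather than verified tart output.

```python
import hashlib

MAX_RUNNING_VMS = 2  # limit stated in this spec


def template_name(agent, base_image, dependencies_content=""):
    """Same base image and dependencies always map to the same template."""
    payload = (base_image + "\n" + dependencies_content).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"agent-loop-template-{agent}-{digest}"


def running_vm_count(vms):
    """Count entries whose State is Running in the parsed tart list output."""
    return sum(1 for vm in vms if vm.get("State") == "Running")


def check_vm_limit(vms):
    """Raise with actionable guidance when the VM limit is already reached."""
    count = running_vm_count(vms)
    if count >= MAX_RUNNING_VMS:
        raise RuntimeError(
            f"ralph: cannot start VM \u2014 {count} macOS VMs already running "
            "(Apple SLA permits max 2). Stop an existing VM with: tart stop <name>"
        )
```

Because the hash covers the dependencies content, editing .agent-loop/dependencies produces a new template name, so a stale cached template is never reused.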
Test:
- TestTartTemplateName: deterministic hash, changes with base_image, changes with dependencies
- TestTartListVms: mock subprocess, test parse, test failure returns []
- TestTartCheckVmLimit: 0 running passes, 1 running passes, 2 running raises with clear message
- TestTartWaitForGuestAgent: succeeds on first try, succeeds after retries, times out raises RuntimeError
- TestTartEnsureImage: template exists returns cached, force_rebuild deletes and recreates, dependencies installed via tart exec, no dependencies skips install step
Verify: Run pytest tests/test_ralph.py -v -k TestTart. Fix any failures.
Review: Check that VM limit errors include actionable guidance. Check that the Popen for tart run is properly cleaned up in ensure_image (stop + wait).
Address feedback: Fix all findings. Re-run tests.
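
The guest-agent wait from this step is a plain poll-until-deadline loop. The sketch below assumes injectable probe, sleep, and clock parameters (hypothetical, added so the loop can be tested without a VM); the default probe runs the tart exec command quoted in this spec, and the error wording is an assumption.

```python
import subprocess
import time


def wait_for_guest_agent(vm_name, timeout=120, interval=2, probe=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the guest agent answers, or raise after timeout seconds."""
    if probe is None:
        def probe(name):
            result = subprocess.run(
                ["tart", "exec", name, "--", "echo", "ok"],
                capture_output=True,
            )
            return result.returncode == 0
    deadline = clock() + timeout
    while True:
        if probe(vm_name):
            return
        if clock() >= deadline:
            raise RuntimeError(
                f"ralph: guest agent in VM {vm_name} not ready after {timeout}s"
            )
        sleep(interval)
```

Using a monotonic clock keeps the timeout correct even if the wall clock is adjusted while the VM boots.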
Step 5: Implement TartSandbox — sandbox lifecycle and command execution [done]
Files:
- scripts/ralph: TartSandbox remaining methods
- tests/test_ralph.py: Unit tests
Implement:
- ensure_sandbox(self, agent, branch, worktree_path, **kwargs):
  - Generate name via self.sandbox_name(agent, branch)
  - Check if VM exists and is running → reuse
  - If exists but stopped → delete and recreate
  - Call _check_vm_limit() before starting
  - Clone from template: tart clone {template} {name}
  - Start headless with directory sharing: tart run {name} --no-graphics --dir=workspace:{worktree_path} as Popen, store in self._vm_procs[name]
  - Register atexit handler (once) to stop all VMs on exit
  - Wait for guest agent
  - Return name
- setup_git_config(self, sandbox_name, user, email): run tart exec {name} -- git config --global user.name {user}, same for email, same for safe.directory *.
- run_iteration(self, sandbox_name, spec_content, model, env_vars=None):
  - Write spec: tart exec -i {name} -- tee /tmp/spec.md with input=spec_content, stdout devnull
  - Build claude command with env vars: cd '/Volumes/My Shared Files/workspace' && env KEY=VAL ... claude -p '...' --model {model} --dangerously-skip-permissions --effort high
  - Execute: tart exec {name} -- bash -c "{claude_cmd}"
  - Read spec: tart exec {name} -- cat /tmp/spec.md
  - Return (exit_code, updated_spec)
- proxy_host(self): first try tart exec {name} -- route -n get default and parse gateway IP. Fallback to ipconfig getifaddr en0 on the host. Final fallback 192.168.64.1. Cache the result after first discovery.
- cleanup_sandbox(self, agent, branch): tart stop {name}; tart delete {name}. Remove from _vm_procs.
- remove_sandbox(self, name): same as cleanup but by name.
- prune_sandboxes(self, agent): list VMs with matching prefix, check if associated worktree path exists (derive from VM name), delete orphans.
- preflight_check(self, sandbox_name, agent, proxy_port): check token, proxy health, VM responsive (tart exec echo ok). Skip network isolation check (log note that Tart VMs don't have network isolation). Return list of failure messages.
- For exec_output, check_in_sync, reset_to_host, sync_to_host: these Docker-specific methods on DockerSandbox use docker sandbox exec. TartSandbox needs equivalent implementations using tart exec. For sync_to_host, since Tart uses VirtioFS (shared directory), host and VM see the same files, so commits in the VM are immediately visible on the host. So sync_to_host can verify the commit exists on host and return True. check_in_sync always returns True (shared filesystem). reset_to_host is a no-op (shared filesystem).
- Update create_sandbox_backend to create TartSandbox for type "tart" (remove the NotImplementedError stub).
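
The three-level proxy_host fallback above can be sketched as a small parser plus a selection function. The helper names are hypothetical, the route output and host IP are passed in as strings so no VM is needed, and the "gateway:" line format is what macOS route -n get default is assumed to print.

```python
import re

DEFAULT_GATEWAY = "192.168.64.1"  # final fallback named in this spec


def parse_gateway(route_output):
    """Extract the gateway IP from 'route -n get default' output, if present."""
    match = re.search(r"^\s*gateway:\s*(\S+)", route_output, re.MULTILINE)
    return match.group(1) if match else None


def discover_proxy_host(route_output, host_ip=None):
    """Prefer the guest's gateway, then the host's en0 address, then the default."""
    return parse_gateway(route_output or "") or host_ip or DEFAULT_GATEWAY
```

In the real method the result would be cached after the first successful discovery, as the spec requires.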
Test:
- TestTartEnsureSandbox: creates new VM, reuses running VM, deletes stopped VM and recreates, VM limit check
- TestTartSetupGitConfig: correct tart exec commands
- TestTartRunIteration: writes spec, runs claude with env vars, reads spec back, handles write failure, handles read failure
- TestTartProxyHost: gateway parsing, fallback to en0, final fallback
- TestTartCleanupSandbox: stops and deletes
- TestTartPruneSandboxes: removes orphans, keeps active
- TestTartPreflightCheck: all pass, token missing, proxy down, VM unresponsive
- TestTartSyncToHost: returns True (shared filesystem)
- TestTartCheckInSync: returns True always
- Update TestCreateSandboxBackend: tart type now returns TartSandbox instance
Verify: Run pytest tests/test_ralph.py -v. ALL tests pass.
Review: Check that env vars in run_iteration are properly shell-escaped to prevent injection. Check that atexit cleanup handles the case where VMs are already stopped. Check that proxy_host caching works correctly.
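
One way to satisfy the shell-escaping review point is to quote every interpolated value with shlex.quote and to validate variable names. This is a sketch under assumptions: build_claude_command and exec_argv are hypothetical helper names, and the claude flags are copied from this spec without being checked against the CLI.

```python
import re
import shlex

SHARED_DIR = "/Volumes/My Shared Files/workspace"
_ENV_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_claude_command(prompt, model, env_vars=None):
    """Build the bash -c command string with every value shell-quoted."""
    parts = ["cd", shlex.quote(SHARED_DIR), "&&"]
    if env_vars:
        parts.append("env")
        for key, value in env_vars.items():
            if not _ENV_KEY.fullmatch(key):
                raise ValueError(f"ralph: invalid environment variable name {key!r}")
            parts.append(f"{key}={shlex.quote(str(value))}")
    parts += ["claude", "-p", shlex.quote(prompt), "--model", shlex.quote(model),
              "--dangerously-skip-permissions", "--effort", "high"]
    return " ".join(parts)


def exec_argv(vm_name, command):
    """Pass the command as one argv element so no second quoting layer is needed."""
    return ["tart", "exec", vm_name, "--", "bash", "-c", command]
```

A value such as "a b; rm -rf /" then reaches env as a single argument instead of being split or executed by the shell.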
Address feedback: Fix all findings. Re-run full test suite.
Notes:
- Moved ITERATION_PROMPT from DockerSandbox to SandboxBackend base class so both backends can reference it.
- Added import atexit and import shlex to script imports.
- _atexit_registered is a class-level flag; the actual atexit handler is a no-op safety net since cleanup_sandbox is the primary cleanup path.
- prune_sandboxes uses stopped-state as the orphan heuristic since Tart VMs don't store workspace metadata. Running VMs and templates are preserved.
- create_sandbox_backend already returned TartSandbox (no NotImplementedError stub to remove; that was handled in Step 4).
Step 6: Wire up CLI and prerequisites [done]
Files:
- scripts/ralph: Update CLI, selftest, prerequisites
Implement:
- Update check_dependencies_prereq() to also check for tart when running in tart mode. Since we don't know the sandbox type at prereq check time (it depends on the project), check for tart in the Tart backend constructor instead, or make the check lazy. Best approach: add a check_prerequisites() method to SandboxBackend. DockerSandbox.check_prerequisites() verifies docker is available. TartSandbox.check_prerequisites() verifies both docker (for proxy) and tart are available.
- Update selftest(): accept an optional sandbox_type parameter (default "docker"). When "tart", use TartSandbox-specific checks: build template, clone test VM, verify tart exec works, verify proxy reachable from VM via host IP, verify Claude auth via proxy, cleanup. The tart selftest still uses the Docker proxy (proxy stays unchanged).
- Add --type docker|tart flag to the selftest subcommand parser in main().
- Update prune-sandboxes to detect sandbox type or accept --type flag. When type is "tart", use TartSandbox.prune_sandboxes. Default remains docker.
Test:
- TestSelftest: add test for --type tart routing
- TestMainSandboxFlags: test --type flag parsing
- Test prerequisite checks for TartSandbox
Verify: Run pytest tests/test_ralph.py -v. All tests pass.
Review: Check that the selftest cleanup always runs (even on failure). Check error messages when tart is not installed.
Address feedback: Fix all findings. Re-run tests.
Notes:
- check_prerequisites() added to SandboxBackend (abstract), DockerSandbox (checks docker), and TartSandbox (checks tart + docker).
- selftest() refactored into shared preamble (token, prerequisites, proxy) + _selftest_docker() and _selftest_tart() helper functions.
- Introduced _SelftestAbort exception to abort early from helper functions while preserving cleanup in the main selftest() finally block.
- Cleanup is now always attempted (remove_sandbox is idempotent), rather than conditionally based on sandbox_created flag.
- The prerequisites check adds one extra check to the selftest count (9 for docker without project image, 10 with).
- --type flag validates against ("docker", "tart") with exit code 2 for unknown types.
Step 7: Run all checks [done]
Implement:
- Run the full test suite: pytest tests/test_ralph.py -v
- Run shellcheck on any shell scripts that were modified
- Run python3 -c "import py_compile; py_compile.compile('scripts/ralph', doraise=True)" to verify syntax
- Fix any failures and commit the fixes
Verify: All checks pass clean.
Step 8: Fix TartSandbox atexit cleanup with class-level VM tracking [done]
Files:
- scripts/ralph: Move _vm_procs to class variable, implement _atexit_stop_all
- tests/test_ralph.py: Tests for atexit behavior
Implement:
- Change _vm_procs from an instance variable to a class variable: add _vm_procs = {} at class level on TartSandbox, and remove self._vm_procs = {} from __init__.
- Implement _atexit_stop_all: iterate TartSandbox._vm_procs.items(), call tart stop for each VM (best-effort, suppress errors), then proc.wait() for each tracked Popen. Clear the dict afterward.
- All existing code that reads/mutates self._vm_procs continues to work since Python resolves instance attribute reads to the class variable when no instance attribute shadows it, and dict mutations (subscript assignment, .pop()) mutate the class-level dict in place.
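
A small sketch of the class-level tracking and the atexit handler described above. TartSandboxSketch is a stand-in class, and the stop parameter is a hypothetical injection point so the handler can be exercised without tart installed.

```python
import subprocess


class TartSandboxSketch:
    _vm_procs = {}  # class-level: one dict shared by every instance

    def track(self, name, proc):
        # Subscript assignment mutates the shared class dict in place.
        self._vm_procs[name] = proc

    @classmethod
    def _atexit_stop_all(cls, stop=None):
        """Best-effort: stop every tracked VM, wait for its process, clear."""
        if stop is None:
            def stop(name):
                subprocess.run(["tart", "stop", name], capture_output=True)
        for name, proc in list(cls._vm_procs.items()):
            try:
                stop(name)
                proc.wait(timeout=10)
            except Exception:
                pass  # suppress errors, e.g. VM already stopped or tart missing
        cls._vm_procs.clear()
```

Writing self._vm_procs = {} anywhere would create an instance attribute that shadows the class dict, and VMs tracked through that instance would be invisible to the handler.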
Test:
- Test that _atexit_stop_all calls tart stop for each tracked VM and waits for each Popen process.
- Test that _atexit_stop_all handles empty _vm_procs gracefully.
- Test that _atexit_stop_all clears _vm_procs after cleanup.
- Test that _vm_procs is shared across instances (class-level).
Verify: Run pytest tests/test_ralph.py -v. Fix any failures.
Review: Ensure no code path reassigns self._vm_procs = ... (which would shadow the class variable). All mutations must be in-place (subscript, .pop()).
Address feedback: Fix findings, re-run tests.
Notes:
- _vm_procs = {} added as class variable on TartSandbox; self._vm_procs = {} removed from __init__.
- _atexit_stop_all now iterates TartSandbox._vm_procs.items(), calls tart stop for each VM (suppressing errors), waits for each Popen with a 10-second timeout, then clears the dict.
- Test classes that mutate _vm_procs (TestTartEnsureSandbox, TestTartProxyHost, TestTartCleanupSandbox, TestTartRemoveSandbox) now save/restore the class-level dict in setup_method/teardown_method to prevent cross-test pollution.
Step 9: Fix config unpacking, interface consistency, and VM list caching [done]
Files:
- scripts/ralph: Multiple small fixes
- tests/test_ralph.py: Update affected tests
Implement:
- Config unpacking (critical): In process_issue, pop "type" from the config dict before passing **config to create_sandbox_backend, so type is not passed twice (once as positional, once in kwargs). Also pop "project_dir" since it's added to config and consumed by create_sandbox_backend but not a valid TartSandbox/DockerSandbox constructor argument.
- Base class interface: Add check_in_sync, reset_to_host, and sync_to_host to SandboxBackend as abstract methods (raising NotImplementedError), since both DockerSandbox and TartSandbox implement them and process_issue calls them polymorphically.
- Signature consistency: Remove the unused workdir parameter from DockerSandbox.run_iteration. The base class signature is run_iteration(self, sandbox_name, spec_content, model, env_vars=None) and process_issue never passes workdir.
- Extract shared folder constant: Add SHARED_DIR = "/Volumes/My Shared Files/workspace" as a class constant on TartSandbox. Use it in run_iteration where the path is currently hardcoded in the cd command.
- VM list caching: Add a _cached_vm_list tuple of (timestamp, result) to TartSandbox (class-level). In _list_vms, return the cached result if it's less than 2 seconds old. This avoids redundant tart list subprocess calls when _vm_state and _check_vm_limit are called in quick succession within ensure_sandbox.
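
The config-unpacking fix in the first item above can be illustrated with a tiny helper. split_sandbox_config is a hypothetical name, and copying the dict instead of popping in place is an assumption chosen so the caller's config stays intact.

```python
def split_sandbox_config(config):
    """Separate the backend type from the settings forwarded as kwargs."""
    remaining = dict(config)  # copy: the caller's dict is left untouched
    sandbox_type = remaining.pop("type", "docker")
    return sandbox_type, remaining


config = {"type": "tart", "base_image": "ghcr.io/cirruslabs/macos-sequoia-xcode:latest"}
sandbox_type, settings = split_sandbox_config(config)
# settings no longer carries "type", so it is safe to pass as **settings.
```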
Test:
- Verify process_issue no longer passes type in kwargs to create_sandbox_backend.
- Test that SandboxBackend.check_in_sync, .reset_to_host, .sync_to_host raise NotImplementedError.
- Verify DockerSandbox.run_iteration works without workdir parameter.
- Verify TartSandbox.SHARED_DIR is used in run_iteration output.
- Test VM list caching: two calls within 2 seconds hit cache, call after expiry fetches fresh.
Verify: Run pytest tests/test_ralph.py -v. All tests pass.
Review: Check that removing workdir from Docker doesn't break any callers (grep for workdir). Ensure cache TTL is short enough to not mask real VM state changes.
Address feedback: Fix findings, re-run tests.
Notes:
- Config unpacking: type is now popped from the config dict in process_issue before **config is passed to create_sandbox_backend. project_dir was already handled by the factory's tart branch (popped in create_sandbox_backend).
- Base class interface: check_in_sync, reset_to_host, sync_to_host added to SandboxBackend as abstract methods raising NotImplementedError.
- DockerSandbox.run_iteration: removed workdir parameter; process_issue no longer passes workdir=work_dir to run_iteration. Docker sandbox exec's -w flag is still used by exec_output and check_in_sync (separate methods).
- TartSandbox.SHARED_DIR = "/Volumes/My Shared Files/workspace" added as class constant, used in run_iteration.
- VM list cache: _vm_list_cache class variable as (timestamp, result) tuple with 2-second TTL using time.monotonic(). Tests save/restore the cache in setup/teardown.
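
The 2-second VM list cache from the notes above can be sketched as a small wrapper around whatever function runs tart list. VmListCache and its injectable clock are hypothetical; the spec stores the (timestamp, result) tuple directly on TartSandbox instead.

```python
import time


class VmListCache:
    TTL_SECONDS = 2.0  # short enough not to mask real VM state changes

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable returning the parsed VM list
        self._clock = clock
        self._cached = None    # (timestamp, result) once populated

    def get(self):
        """Return the cached list while fresh, otherwise fetch again."""
        now = self._clock()
        if self._cached is not None and now - self._cached[0] < self.TTL_SECONDS:
            return self._cached[1]
        result = self._fetch()
        self._cached = (now, result)
        return result

    def invalidate(self):
        """Drop the cache, e.g. right after starting or deleting a VM."""
        self._cached = None
```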
Step 10: Run all checks
Implement:
- Run the full test suite: pytest tests/test_ralph.py -v
- Run python3 -c "import py_compile; py_compile.compile('scripts/ralph', doraise=True)" to verify syntax
- Fix any failures and commit the fixes
Verify: All checks pass clean.
Conventions
- Language: Python 3 (stdlib only, no third-party dependencies)
- Tests: pytest with unittest.mock for subprocess mocking. Test classes named TestXxx, test methods test_xxx.
- Error messages: Prefix with ralph: (e.g., ralph: cannot start VM — ...)
- Exit codes: 0=success, 1=runtime error, 2=usage error
- Imports: stdlib only. subprocess.run for external commands, json for config parsing.