research: environment variable persistence for agent runtime modifications

## Context

Agents running in sandboxes can modify environment variables at runtime using standard mechanisms (`export`, `os.environ`, etc.), but these modifications are **not persisted** when a checkpoint is created. Checkpoints capture the artifacts directory (via VAS snapshots), conversation history, and volume versions, but not the environment. When a run resumes from a checkpoint or session, the environment is re-expanded from the original vm0.yaml template using vars/secrets, so any runtime modifications are lost.

This research explores the feasibility and design constraints for persisting agent-made environment variable modifications as part of the checkpoint system.

## Current Architecture

### Environment Variable Flow
1. **Configuration**: Variables defined in vm0.yaml using `${{ vars.X }}` and `${{ secrets.X }}` syntax
2. **Expansion**: Server-side expansion from vars/secrets provided via CLI flags
3. **Injection**: All env vars set at sandbox creation time (e2b-service.ts:143-193)
4. **Runtime**: Agent can read/modify environment during execution
5. **Checkpoint**: Only artifacts, volumes, and conversation history persisted - environment lost
6. **Resume**: Environment re-expanded from original template
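
To make steps 1 and 2 concrete, here is a minimal sketch of placeholder expansion. It is not the actual logic in `variable-expander.ts`: the function name `expand_environment`, the regex, and the error handling are assumptions made for illustration.

```python
import re

# Matches `${{ vars.NAME }}` or `${{ secrets.NAME }}`, tolerating inner whitespace.
_PLACEHOLDER = re.compile(
    r"\$\{\{\s*(vars|secrets)\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}"
)


def expand_environment(template, vars_, secrets):
    """Expand placeholders in a {name: template_string} mapping.

    Hypothetical helper: the real expander may support more syntax
    and report missing values differently.
    """
    sources = {"vars": vars_, "secrets": secrets}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        if name not in sources[kind]:
            raise KeyError(f"missing {kind}.{name}")
        return sources[kind][name]

    return {
        key: _PLACEHOLDER.sub(substitute, value)
        for key, value in template.items()
    }


template = {
    "DATABASE_URL": "${{ vars.DB_URL }}",
    "API_KEY": "${{ secrets.API_KEY }}",
}
expanded = expand_environment(
    template,
    vars_={"DB_URL": "postgres://prod"},
    secrets={"API_KEY": "example-secret"},
)
```

Because resume repeats this expansion from the template, anything the agent changed after injection has no path back into `expanded`.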

### What Gets Persisted Today
- ✅ Conversation history (session JSONL in R2)
- ✅ Artifact snapshots (filesystem via VAS)
- ✅ Volume version snapshots (resolved version IDs)
- ✅ Agent compose snapshots (config version, vars, secret names)
- ❌ Runtime environment modifications

## Key Findings

### Technical Challenges

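1. **No capture mechanism**: There is no standard way to extract the current environment of a running process from outside it. `/proc/$PID/environ` only shows the initial state at process start, not runtime modifications. Changes made with `export` or `os.environ` are also local to the process that makes them (and to children it spawns afterwards), so a separate checkpoint process does not see them.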

2. **Distinguishing modifications**: Need to differentiate:
   - Template-sourced vars (already in vars/secrets)
   - VM0 system vars (VM0_API_URL, VM0_RUN_ID, etc.)
   - Agent-created/modified vars (what we want to persist)

3. **Merge strategy**: If the vm0.yaml template changes between checkpoint and resume, how should persisted runtime modifications be merged with the new template?
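
The first two challenges can be illustrated with a small experiment. This is a sketch under assumptions, not a proposed implementation: the variable name `DEMO_MY_STATE` and the helper `diff_environment` are hypothetical. A child process changes its environment, the parent does not see the change, and the only reliable capture is the child reporting its own state, which is then diffed against a baseline.

```python
import json
import os
import subprocess
import sys


def diff_environment(baseline, current):
    """Return (changed, removed) relative to the baseline.

    `changed` holds vars that are new or have a different value;
    `removed` lists baseline vars that no longer exist.
    """
    changed = {k: v for k, v in current.items() if baseline.get(k) != v}
    removed = sorted(k for k in baseline if k not in current)
    return changed, removed


# Baseline: the environment as injected at sandbox creation (here, ours).
baseline = dict(os.environ)

# The child modifies its own environment, then reports it as JSON on stdout.
child_code = (
    "import json, os; "
    "os.environ['DEMO_MY_STATE'] = 'some_value'; "
    "print(json.dumps(dict(os.environ)))"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

# The parent's own environment is untouched by the child's modification.
parent_sees_change = "DEMO_MY_STATE" in os.environ

# Only the child's self-reported state contains the modification.
reported = json.loads(result.stdout)
changed, removed = diff_environment(baseline, reported)
```

The implication is that any capture mechanism probably needs cooperation from the process whose environment matters (for example, a shell hook or an explicit dump), since the checkpoint script runs as a different process.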

### Security Constraints

**Critical principle**: Secret values are NEVER stored in the database (only names, for validation).

**Risks of environment persistence**:
- Accidentally persisting secret values that agents write to new env vars
- Leaking sensitive data that agents compute/derive
- Breaking the "secrets never stored" principle

**Required filtering**:
- Exclude template-sourced vars (already captured in vars/secretNames)
- Exclude VM0 system vars
- Exclude variables matching secret patterns
- Possibly require explicit allowlist/denylist configuration
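
The filtering rules above could be combined roughly as follows. Everything here is an assumption for illustration: the `VM0_` prefix check is inferred from the system var names listed earlier, the name pattern is only one possible heuristic, and `filter_for_persistence` is a hypothetical helper. The value check assumes plaintext secret values are available inside the sandbox at capture time, which would not hold when seal secrets is enabled.

```python
import re

# Assumption: VM0 system vars share this prefix (VM0_API_URL, VM0_RUN_ID, ...).
SYSTEM_PREFIXES = ("VM0_",)

# Heuristic only: names that look like they hold credentials.
SECRET_NAME_PATTERN = re.compile(
    r"(SECRET|TOKEN|PASSWORD|PASSWD|API_?KEY|CREDENTIAL)", re.IGNORECASE
)


def filter_for_persistence(changed, template_names, secret_values, allowlist=None):
    """Return only the vars considered safe to persist.

    Fail-safe by construction: a var is dropped as soon as any rule matches.
    """
    persisted = {}
    for name, value in changed.items():
        if allowlist is not None and name not in allowlist:
            continue  # opt-in mode: only explicitly listed names
        if name in template_names:
            continue  # already covered by vars/secretNames
        if name.startswith(SYSTEM_PREFIXES):
            continue  # VM0 system var
        if SECRET_NAME_PATTERN.search(name):
            continue  # name looks secret-like
        if any(secret and secret in value for secret in secret_values):
            continue  # value embeds a known secret
        persisted[name] = value
    return persisted


changed = {
    "MY_STATE": "some_value",
    "VM0_RUN_ID": "run-123",
    "DATABASE_URL": "postgres://local",
    "GITHUB_TOKEN": "abc",
    "COPIED": "prefix-example-secret",
}
safe = filter_for_persistence(
    changed,
    template_names={"DATABASE_URL", "API_KEY"},
    secret_values=["example-secret"],
)
```

Note that none of these rules catches a value derived from a secret (a hash, a session token obtained with it), which is why an explicit allowlist may still be needed.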

### Conflict Scenarios

**Scenario 1: Agent modifies a template var**
```yaml
# vm0.yaml
environment:
  DATABASE_URL: "${{ vars.DB_URL }}"
```
- Initial: `DATABASE_URL=postgres://prod`
- Agent runs: `export DATABASE_URL=postgres://local`
- Resume: Which value wins, `postgres://local` or `postgres://prod`?
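
The two possible answers for Scenario 1 can be written down as explicit merge strategies. This is a sketch only; `MergeStrategy` and `merge_environment` are hypothetical names, and which strategy should be the default is an open design question.

```python
from enum import Enum


class MergeStrategy(Enum):
    TEMPLATE_WINS = "template_wins"  # re-expanded template overrides the snapshot
    RUNTIME_WINS = "runtime_wins"    # persisted runtime value overrides the template


def merge_environment(template_env, snapshot, strategy):
    """Merge a persisted snapshot into the freshly expanded template env."""
    merged = dict(template_env)
    for name, value in snapshot.items():
        if name in template_env and strategy is MergeStrategy.TEMPLATE_WINS:
            continue  # conflict resolved in favour of the template
        merged[name] = value
    return merged


template_env = {"DATABASE_URL": "postgres://prod"}
snapshot = {"DATABASE_URL": "postgres://local", "MY_STATE": "some_value"}

template_first = merge_environment(template_env, snapshot, MergeStrategy.TEMPLATE_WINS)
runtime_first = merge_environment(template_env, snapshot, MergeStrategy.RUNTIME_WINS)
```

Under either strategy a var that exists only in the snapshot (`MY_STATE`) is restored; the strategies differ only on names present in both.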

**Scenario 2: Agent creates new env var**
```bash
export MY_STATE="some_value"
```
- Should persist, but how to distinguish from secrets?

**Scenario 3: Seal secrets enabled**
- Sandbox receives encrypted token: `API_KEY=vm0_enc_abc123...`
- The agent can't decrypt it (only the proxy can)
- If the agent modifies `API_KEY`, the value is lost (not decryptable)
- What would persistence store: the encrypted token or the modified value?
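
One conservative answer to Scenario 3 is to never persist a var that is, or originally was, a sealed token. The sketch below assumes sealed tokens can be recognised by the `vm0_enc_` prefix seen in the example above; the real token format is not confirmed here, and `drop_sealed_vars` is a hypothetical helper.

```python
# Assumption: sealed tokens start with this prefix, as in the example above.
SEALED_PREFIX = "vm0_enc_"


def is_sealed_token(value):
    return value.startswith(SEALED_PREFIX)


def drop_sealed_vars(changed, baseline):
    """Drop vars whose current or original value is a sealed token.

    `changed` is the set of runtime modifications; `baseline` is the
    environment as injected at sandbox creation.
    """
    return {
        name: value
        for name, value in changed.items()
        if not is_sealed_token(value)
        and not is_sealed_token(baseline.get(name, ""))
    }


baseline = {"API_KEY": "vm0_enc_abc123"}
changed = {
    "API_KEY": "overwritten-by-agent",   # originally sealed: dropped
    "COPY_OF_KEY": "vm0_enc_abc123",     # sealed token copied: dropped
    "MY_STATE": "some_value",            # unrelated: kept
}
kept = drop_sealed_vars(changed, baseline)
```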

## Existing Patterns to Learn From

### Volume Versions Pattern (Best Match)
- Captures **resolved state at checkpoint time**, not template
- Stores in `volumeVersionsSnapshot` JSONB column
- Restored from snapshot on resume
- **Could apply similar pattern**: `environmentSnapshot` containing only runtime-modified vars

### Vars Pattern (Different)
- Vars are **user-controlled inputs**, not runtime state
- Stored in `agentRuns.vars` and copied to checkpoint
- Always come from CLI, never generated by agent

## Preliminary Design Constraints

Any environment persistence design must:

1. ✅ **Never persist secret values** (maintain existing security model)
2. ✅ **Distinguish template vars from runtime modifications** (avoid conflicts)
3. ✅ **Handle seal secrets correctly** (don't break encrypted token flow)
4. ✅ **Be opt-in or scoped** (not all env vars should persist)
5. ✅ **Support merge strategy** (handle template changes)
6. ✅ **Be backward compatible** (existing checkpoints must work)
7. ✅ **Fail safe** (if in doubt, don't persist)

## Architecture Integration Points

Where environment persistence would fit:

1. **Checkpoint creation script** (`checkpoint.py.ts`):
   - Add capture logic before calling checkpoint API
   - Filter environment state (exclude system vars, template vars, secrets)
   - Pass as `environmentSnapshot` in payload

2. **Checkpoint webhook** (`app/api/webhooks/agent/checkpoints/route.ts`):
   - Add `environmentSnapshot` to `CheckpointRequest` type
   - Store in `checkpoints.environmentSnapshot` (new JSONB column)

3. **Database migration**:
   - Add `environmentSnapshot` JSONB column to `checkpoints` table
   - Nullable for backward compatibility

4. **Resume flow** (`run-service.ts`):
   - Load `environmentSnapshot` from checkpoint
   - Merge with expanded template environment
   - Handle conflicts based on merge strategy
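
The integration points can be sketched end to end as below. The payload field `runId`, the `version` field, and both function names are assumptions rather than the existing checkpoint API; the JSON round trip stands in for storage in a JSONB column. The sketch defaults to the template winning on conflict, which is only one of the possible merge strategies.

```python
import json

SNAPSHOT_VERSION = 1  # hypothetical format version


def build_checkpoint_payload(run_id, persisted_vars):
    """Build the part of the checkpoint payload that carries the snapshot."""
    return {
        "runId": run_id,
        "environmentSnapshot": {
            "version": SNAPSHOT_VERSION,
            "vars": dict(persisted_vars),
        },
    }


def resolve_resume_environment(template_env, checkpoint_row):
    """Merge the stored snapshot into the re-expanded template env.

    Old checkpoints have no snapshot (NULL column), and unknown versions
    are ignored, so both fall back to today's behaviour.
    """
    snapshot = checkpoint_row.get("environmentSnapshot")
    if not snapshot or snapshot.get("version") != SNAPSHOT_VERSION:
        return dict(template_env)
    merged = dict(template_env)
    for name, value in snapshot.get("vars", {}).items():
        merged.setdefault(name, value)  # template wins on conflict
    return merged


payload = build_checkpoint_payload("run-123", {"MY_STATE": "some_value"})
stored_row = json.loads(json.dumps(payload))  # simulate the JSONB round trip

template_env = {"DATABASE_URL": "postgres://prod"}
resumed = resolve_resume_environment(template_env, stored_row)
legacy = resolve_resume_environment(template_env, {"environmentSnapshot": None})
```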

## Open Questions for Design Phase

1. **Scope**: Should ALL env vars be eligible, or only specific prefixes/patterns?
2. **Merge Strategy**: Do runtime modifications override template, or vice versa?
3. **User Control**: How do users specify which vars to persist? Config file? CLI flags?
4. **Security Detection**: What's the mechanism for identifying secret-like values?
5. **Seal Secrets Interaction**: Should this work with seal secrets? Mutually exclusive?
6. **Storage Format**: Flat map or categorized (system/template/runtime)?
7. **Capture Implementation**: How to extract env state from running process?
8. **Validation**: Limits on number of vars or size of values?
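
For question 8, validation could look something like the sketch below. The limits (`MAX_VARS`, `MAX_VALUE_BYTES`) are placeholder numbers, not recommendations, and `validate_snapshot` is a hypothetical helper; the name pattern follows the conventional portable form for environment variable names.

```python
import re

# Letters, digits, and underscores, not starting with a digit.
ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

MAX_VARS = 64           # placeholder limit
MAX_VALUE_BYTES = 4096  # placeholder limit


def validate_snapshot(vars_):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if len(vars_) > MAX_VARS:
        errors.append(f"too many vars: {len(vars_)} > {MAX_VARS}")
    for name, value in vars_.items():
        if not ENV_NAME.fullmatch(name):
            errors.append(f"invalid name: {name!r}")
        if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            errors.append(f"value too large: {name}")
    return errors


ok = validate_snapshot({"MY_STATE": "some_value"})
bad = validate_snapshot({"1BAD": "x", "BIG": "y" * (MAX_VALUE_BYTES + 1)})
```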

## Key Files Reference

### Environment Handling
- `turbo/packages/core/src/variable-expander.ts`: Variable expansion logic
- `turbo/apps/web/src/lib/run/run-service.ts`: expandEnvironmentFromCompose() (lines 55-169)
- `turbo/apps/web/src/lib/e2b/e2b-service.ts`: Sandbox env injection (lines 143-193)

### Checkpoint System
- `turbo/packages/core/src/sandbox/scripts/lib/checkpoint.py.ts`: Checkpoint creation (lines 68-200)
- `turbo/apps/web/src/lib/checkpoint/checkpoint-service.ts`: Server-side handling (lines 24-224)
- `turbo/apps/web/src/db/schema/checkpoint.ts`: Database schema

### Security
- `turbo/apps/web/src/lib/proxy/token-service.ts`: Seal secrets implementation
- `turbo/apps/web/src/db/schema/agent-run.ts`: secretNames storage
- `CLAUDE.md`: Project security principles

## Recommendation

This is exploratory research for future work, not an immediate requirement. Before implementing:

1. Determine if there's actual user demand for this feature
2. Prototype capture mechanism to validate feasibility
3. Define clear security filtering rules
4. Design merge strategy and user-facing API
5. Consider interaction with seal secrets feature

The core constraint, **never persist secret values**, must be maintained at all costs.

---

*This issue was created from a deep research session.*
