
feat: driver resilience — auto-recovery, PID tracking, and /restart endpoint#617

Open
AL-ZiLLA wants to merge 3 commits into RightNow-AI:main from AL-ZiLLA:driver-resilience-upstream

Conversation

@AL-ZiLLA

Summary

Adds resilience features for 24/7 autonomous agent deployments where crashed or hung agents need to recover without operator intervention:

  • Heartbeat auto-recovery: The heartbeat monitor now detects crashed agents and automatically resets them to Running, with configurable max attempts (default 3) and cooldown (default 60s). After exhausting attempts, agents are marked Terminated. Default timeout increased from 60s to 180s for browser/LLM workloads.
  • Claude Code PID tracking + timeout: Subprocess PIDs are tracked in a concurrent DashMap for external monitoring. A configurable message timeout (default 300s) kills hung CLI processes. Streaming mode now returns proper LlmError::Api on non-zero exit (previously silently ignored).
  • /restart endpoint: POST /api/agents/{id}/restart (and /start alias) cancels any active task, resets agent state to Running, and returns the previous state. Enables per-agent recovery through the API without bouncing the daemon.
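The attempt-counting logic described in the first bullet can be sketched as follows. This is an illustrative, std-only sketch: the struct shape, field names, and `try_recover`/`reset` methods are assumptions, not the PR's actual `RecoveryTracker` API; only the defaults (3 attempts, 60s cooldown) come from the PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical per-agent recovery bookkeeping (names are illustrative).
struct RecoveryTracker {
    attempts: HashMap<String, (u32, Instant)>, // agent id -> (attempt count, last attempt time)
    max_attempts: u32,                         // PR default: 3
    cooldown: Duration,                        // PR default: 60s
}

impl RecoveryTracker {
    fn new(max_attempts: u32, cooldown: Duration) -> Self {
        Self { attempts: HashMap::new(), max_attempts, cooldown }
    }

    // Returns true if a crashed agent may be reset to Running now, recording
    // the attempt; false while cooling down or once attempts are exhausted.
    fn try_recover(&mut self, agent_id: &str, now: Instant) -> bool {
        match self.attempts.get_mut(agent_id) {
            None => {
                self.attempts.insert(agent_id.to_string(), (1, now));
                true // first recovery attempt is always allowed
            }
            Some((count, _)) if *count >= self.max_attempts => false, // caller marks Terminated
            Some((_, last)) if now.duration_since(*last) < self.cooldown => false, // cooling down
            Some((count, last)) => {
                *count += 1;
                *last = now;
                true
            }
        }
    }

    // Clear the counter once an agent reports a healthy heartbeat again.
    fn reset(&mut self, agent_id: &str) {
        self.attempts.remove(agent_id);
    }
}

fn main() {
    let mut tracker = RecoveryTracker::new(3, Duration::from_secs(60));
    let t0 = Instant::now();
    assert!(tracker.try_recover("agent-1", t0));                             // attempt 1
    assert!(!tracker.try_recover("agent-1", t0 + Duration::from_secs(10)));  // in cooldown
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(61)));   // attempt 2
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(122)));  // attempt 3
    assert!(!tracker.try_recover("agent-1", t0 + Duration::from_secs(300))); // exhausted
    tracker.reset("agent-1");
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(301)));  // counter cleared
    println!("recovery tracker sketch ok");
}
```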

Changed files

File            Change
heartbeat.rs    RecoveryTracker, state field, 180s timeout, Crashed monitoring
kernel.rs       Auto-recovery loop in start_heartbeat_monitor()
claude_code.rs  DashMap PID tracking, tokio::time::timeout, error returns
routes.rs       restart_agent handler
server.rs       /restart and /start route registration

Test plan

  • cargo build --release — clean, 0 warnings
  • cargo test --workspace — 2,131 tests pass, 0 failures
  • cargo clippy -p openfang-{runtime,kernel,api} -- -D warnings — 0 warnings
  • Deployed and verified all agents boot (9/9 Running)
  • Tested /restart endpoint — agent resets to Running, returns previous state
  • Verify heartbeat auto-recovery in production (crashed agent recovers within 2 cycles)
  • Verify Claude Code CLI timeout kills hung subprocess after 300s

ZiLLA Dev and others added 3 commits March 14, 2026 20:26
Extend the heartbeat monitor to detect and automatically recover crashed
agents, reducing operator intervention for 24/7 autonomous deployments:

- Add RecoveryTracker: per-agent failure count with configurable cooldown
- Heartbeat now monitors both Running and Crashed agents
- Crashed agents auto-recover up to max_recovery_attempts (default 3)
- After exhausting attempts, agents are marked Terminated
- Unresponsive Running agents marked Crashed for next-cycle recovery
- Increase default timeout from 60s to 180s (browser/LLM tasks need time)
- Add HeartbeatStatus.state field for downstream consumers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
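The state machine this commit message describes can be condensed into one transition function. This is a minimal sketch with assumed names (`AgentState`, `next_state`, and the boolean inputs are illustrative); the real monitor additionally consults last-heartbeat timestamps and the recovery tracker to compute those booleans.

```rust
// Hypothetical sketch of one heartbeat cycle's decision for a single agent.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AgentState { Running, Crashed, Terminated }

fn next_state(state: AgentState, responsive: bool, may_recover_now: bool, exhausted: bool) -> AgentState {
    use AgentState::*;
    match state {
        Running if !responsive => Crashed,  // picked up for recovery next cycle
        Crashed if exhausted => Terminated, // max_recovery_attempts spent
        Crashed if may_recover_now => Running,
        other => other,                     // healthy, or crashed and still in cooldown
    }
}

fn main() {
    use AgentState::*;
    assert_eq!(next_state(Running, false, false, false), Crashed);
    assert_eq!(next_state(Crashed, false, true, false), Running);
    assert_eq!(next_state(Crashed, false, false, true), Terminated);
    assert_eq!(next_state(Crashed, false, false, false), Crashed); // waiting out cooldown
    assert_eq!(next_state(Running, true, false, false), Running);
    println!("state transition sketch ok");
}
```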
Add subprocess lifecycle management to prevent hung CLI processes from
blocking agents indefinitely:

- Track active subprocess PIDs in a concurrent DashMap for external monitoring
- Enforce configurable message timeout (default 300s) with automatic process kill
- Return proper LlmError::Api on non-zero exit in streaming mode (was silently ignored)
- Add with_timeout(), active_pids(), pid_map() public methods

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
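The timeout-and-kill pattern this commit adds can be illustrated without the async machinery. The PR itself wraps the CLI call in `tokio::time::timeout`; the sketch below is a std-only stand-in (poll `try_wait`, then kill on deadline) with an assumed function name, shown only to make the lifecycle concrete.

```rust
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Std-only stand-in for the tokio timeout wrapper (names are illustrative).
fn run_with_timeout(mut cmd: Command, limit: Duration) -> Result<(), String> {
    let mut child = cmd
        .stdout(Stdio::null())
        .spawn()
        .map_err(|e| format!("spawn failed: {e}"))?;
    // A real implementation would record child.id() in the shared PID map here
    // and remove it again once the process exits.
    let start = Instant::now();
    loop {
        match child.try_wait().map_err(|e| format!("wait failed: {e}"))? {
            Some(status) if status.success() => return Ok(()),
            // Non-zero exit becomes an error instead of being silently ignored.
            Some(status) => return Err(format!("CLI exited with {status}")),
            None if start.elapsed() >= limit => {
                let _ = child.kill(); // terminate the hung subprocess
                let _ = child.wait(); // reap it so no zombie is left behind
                return Err(format!("timed out after {limit:?}"));
            }
            None => thread::sleep(Duration::from_millis(50)),
        }
    }
}

fn main() {
    assert!(run_with_timeout(Command::new("true"), Duration::from_secs(5)).is_ok());
    let mut hung = Command::new("sleep");
    hung.arg("5");
    let err = run_with_timeout(hung, Duration::from_millis(200)).unwrap_err();
    assert!(err.contains("timed out"));
    println!("timeout sketch ok");
}
```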
…unce

POST /api/agents/{id}/restart and /api/agents/{id}/start both:
- Cancel any active task via stop_agent_run()
- Reset agent state to Running (updates last_active)
- Return JSON with previous state and whether a task was cancelled

Enables operators to recover individual crashed/stuck agents through the
API without restarting the entire daemon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
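The handler's core logic can be sketched framework-free. Everything below is illustrative: the `Agent`/`RestartOutcome` shapes are assumptions, and the `active_task.take()` stands in for the real `stop_agent_run()` call; only the behavior (cancel, reset to Running, report previous state and whether a task was cancelled) comes from the commit.

```rust
// Hypothetical core of the /restart handler, stripped of the API layer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AgentState { Running, Crashed, Terminated }

struct Agent {
    state: AgentState,
    active_task: Option<u64>, // illustrative task handle
}

struct RestartOutcome {
    previous_state: AgentState,
    task_cancelled: bool,
}

fn restart_agent(agent: &mut Agent) -> RestartOutcome {
    // Cancel any active task (stands in for stop_agent_run()).
    let task_cancelled = agent.active_task.take().is_some();
    let previous_state = agent.state;
    agent.state = AgentState::Running; // last_active would also be refreshed here
    RestartOutcome { previous_state, task_cancelled }
}

fn main() {
    let mut agent = Agent { state: AgentState::Crashed, active_task: Some(42) };
    let out = restart_agent(&mut agent);
    assert_eq!(out.previous_state, AgentState::Crashed);
    assert!(out.task_cancelled);
    assert_eq!(agent.state, AgentState::Running);
    println!("restart sketch ok");
}
```

Restarting an already-Running idle agent is a no-op apart from the state refresh, which is what makes the `/start` alias safe to call unconditionally.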