
feat: driver resilience — auto-recovery, PID tracking, and /restart endpoint#617

Open
AL-ZiLLA wants to merge 3 commits into RightNow-AI:main from AL-ZiLLA:driver-resilience-upstream

Conversation

@AL-ZiLLA

Summary

Adds resilience features for 24/7 autonomous agent deployments where crashed or hung agents need to recover without operator intervention:

  • Heartbeat auto-recovery: The heartbeat monitor now detects crashed agents and automatically resets them to Running, with configurable max attempts (default 3) and cooldown (default 60s). After exhausting attempts, agents are marked Terminated. Default timeout increased from 60s to 180s for browser/LLM workloads.
  • Claude Code PID tracking + timeout: Subprocess PIDs are tracked in a concurrent DashMap for external monitoring. A configurable message timeout (default 300s) kills hung CLI processes. Streaming mode now returns proper LlmError::Api on non-zero exit (previously silently ignored).
  • /restart endpoint: POST /api/agents/{id}/restart (and /start alias) cancels any active task, resets agent state to Running, and returns the previous state. Enables per-agent recovery through the API without bouncing the daemon.
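The attempt-counting logic described in the first bullet can be sketched as follows. This is an illustrative, std-only sketch: the struct shape, field names, and `try_recover`/`reset` methods are assumptions, not the PR's actual `RecoveryTracker` API; only the defaults (3 attempts, 60s cooldown) come from the PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical per-agent recovery bookkeeping (names are illustrative).
struct RecoveryTracker {
    attempts: HashMap<String, (u32, Instant)>, // agent id -> (attempt count, last attempt time)
    max_attempts: u32,                         // PR default: 3
    cooldown: Duration,                        // PR default: 60s
}

impl RecoveryTracker {
    fn new(max_attempts: u32, cooldown: Duration) -> Self {
        Self { attempts: HashMap::new(), max_attempts, cooldown }
    }

    // Returns true if a crashed agent may be reset to Running now, recording
    // the attempt; false while cooling down or once attempts are exhausted.
    fn try_recover(&mut self, agent_id: &str, now: Instant) -> bool {
        match self.attempts.get_mut(agent_id) {
            None => {
                self.attempts.insert(agent_id.to_string(), (1, now));
                true // first recovery attempt is always allowed
            }
            Some((count, _)) if *count >= self.max_attempts => false, // caller marks Terminated
            Some((_, last)) if now.duration_since(*last) < self.cooldown => false, // cooling down
            Some((count, last)) => {
                *count += 1;
                *last = now;
                true
            }
        }
    }

    // Clear the counter once an agent reports a healthy heartbeat again.
    fn reset(&mut self, agent_id: &str) {
        self.attempts.remove(agent_id);
    }
}

fn main() {
    let mut tracker = RecoveryTracker::new(3, Duration::from_secs(60));
    let t0 = Instant::now();
    assert!(tracker.try_recover("agent-1", t0));                             // attempt 1
    assert!(!tracker.try_recover("agent-1", t0 + Duration::from_secs(10)));  // in cooldown
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(61)));   // attempt 2
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(122)));  // attempt 3
    assert!(!tracker.try_recover("agent-1", t0 + Duration::from_secs(300))); // exhausted
    tracker.reset("agent-1");
    assert!(tracker.try_recover("agent-1", t0 + Duration::from_secs(301)));  // counter cleared
    println!("recovery tracker sketch ok");
}
```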

Changed files

File            Change
heartbeat.rs    RecoveryTracker, state field, 180s timeout, Crashed monitoring
kernel.rs       Auto-recovery loop in start_heartbeat_monitor()
claude_code.rs  DashMap PID tracking, tokio::time::timeout, error returns
routes.rs       restart_agent handler
server.rs       /restart and /start route registration

Test plan

  • cargo build --release — clean, 0 warnings
  • cargo test --workspace — 2,131 tests pass, 0 failures
  • cargo clippy -p openfang-{runtime,kernel,api} -- -D warnings — 0 warnings
  • Deployed and verified all agents boot (9/9 Running)
  • Tested /restart endpoint — agent resets to Running, returns previous state
  • Verify heartbeat auto-recovery in production (crashed agent recovers within 2 cycles)
  • Verify Claude Code CLI timeout kills hung subprocess after 300s

ZiLLA Dev and others added 3 commits March 14, 2026 20:26
Extend the heartbeat monitor to detect and automatically recover crashed
agents, reducing operator intervention for 24/7 autonomous deployments:

- Add RecoveryTracker: per-agent failure count with configurable cooldown
- Heartbeat now monitors both Running and Crashed agents
- Crashed agents auto-recover up to max_recovery_attempts (default 3)
- After exhausting attempts, agents are marked Terminated
- Unresponsive Running agents marked Crashed for next-cycle recovery
- Increase default timeout from 60s to 180s (browser/LLM tasks need time)
- Add HeartbeatStatus.state field for downstream consumers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
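The state machine this commit message describes can be condensed into one transition function. This is a minimal sketch with assumed names (`AgentState`, `next_state`, and the boolean inputs are illustrative); the real monitor additionally consults last-heartbeat timestamps and the recovery tracker to compute those booleans.

```rust
// Hypothetical sketch of one heartbeat cycle's decision for a single agent.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AgentState { Running, Crashed, Terminated }

fn next_state(state: AgentState, responsive: bool, may_recover_now: bool, exhausted: bool) -> AgentState {
    use AgentState::*;
    match state {
        Running if !responsive => Crashed,  // picked up for recovery next cycle
        Crashed if exhausted => Terminated, // max_recovery_attempts spent
        Crashed if may_recover_now => Running,
        other => other,                     // healthy, or crashed and still in cooldown
    }
}

fn main() {
    use AgentState::*;
    assert_eq!(next_state(Running, false, false, false), Crashed);
    assert_eq!(next_state(Crashed, false, true, false), Running);
    assert_eq!(next_state(Crashed, false, false, true), Terminated);
    assert_eq!(next_state(Crashed, false, false, false), Crashed); // waiting out cooldown
    assert_eq!(next_state(Running, true, false, false), Running);
    println!("state transition sketch ok");
}
```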
Add subprocess lifecycle management to prevent hung CLI processes from
blocking agents indefinitely:

- Track active subprocess PIDs in a concurrent DashMap for external monitoring
- Enforce configurable message timeout (default 300s) with automatic process kill
- Return proper LlmError::Api on non-zero exit in streaming mode (was silently ignored)
- Add with_timeout(), active_pids(), pid_map() public methods

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
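The timeout-and-kill pattern this commit adds can be illustrated without the async machinery. The PR itself wraps the CLI call in `tokio::time::timeout`; the sketch below is a std-only stand-in (poll `try_wait`, then kill on deadline) with an assumed function name, shown only to make the lifecycle concrete.

```rust
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Std-only stand-in for the tokio timeout wrapper (names are illustrative).
fn run_with_timeout(mut cmd: Command, limit: Duration) -> Result<(), String> {
    let mut child = cmd
        .stdout(Stdio::null())
        .spawn()
        .map_err(|e| format!("spawn failed: {e}"))?;
    // A real implementation would record child.id() in the shared PID map here
    // and remove it again once the process exits.
    let start = Instant::now();
    loop {
        match child.try_wait().map_err(|e| format!("wait failed: {e}"))? {
            Some(status) if status.success() => return Ok(()),
            // Non-zero exit becomes an error instead of being silently ignored.
            Some(status) => return Err(format!("CLI exited with {status}")),
            None if start.elapsed() >= limit => {
                let _ = child.kill(); // terminate the hung subprocess
                let _ = child.wait(); // reap it so no zombie is left behind
                return Err(format!("timed out after {limit:?}"));
            }
            None => thread::sleep(Duration::from_millis(50)),
        }
    }
}

fn main() {
    assert!(run_with_timeout(Command::new("true"), Duration::from_secs(5)).is_ok());
    let mut hung = Command::new("sleep");
    hung.arg("5");
    let err = run_with_timeout(hung, Duration::from_millis(200)).unwrap_err();
    assert!(err.contains("timed out"));
    println!("timeout sketch ok");
}
```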
…unce

POST /api/agents/{id}/restart and /api/agents/{id}/start both:
- Cancel any active task via stop_agent_run()
- Reset agent state to Running (updates last_active)
- Return JSON with previous state and whether a task was cancelled

Enables operators to recover individual crashed/stuck agents through the
API without restarting the entire daemon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
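The handler's core logic can be sketched framework-free. Everything below is illustrative: the `Agent`/`RestartOutcome` shapes are assumptions, and the `active_task.take()` stands in for the real `stop_agent_run()` call; only the behavior (cancel, reset to Running, report previous state and whether a task was cancelled) comes from the commit.

```rust
// Hypothetical core of the /restart handler, stripped of the API layer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AgentState { Running, Crashed, Terminated }

struct Agent {
    state: AgentState,
    active_task: Option<u64>, // illustrative task handle
}

struct RestartOutcome {
    previous_state: AgentState,
    task_cancelled: bool,
}

fn restart_agent(agent: &mut Agent) -> RestartOutcome {
    // Cancel any active task (stands in for stop_agent_run()).
    let task_cancelled = agent.active_task.take().is_some();
    let previous_state = agent.state;
    agent.state = AgentState::Running; // last_active would also be refreshed here
    RestartOutcome { previous_state, task_cancelled }
}

fn main() {
    let mut agent = Agent { state: AgentState::Crashed, active_task: Some(42) };
    let out = restart_agent(&mut agent);
    assert_eq!(out.previous_state, AgentState::Crashed);
    assert!(out.task_cancelled);
    assert_eq!(agent.state, AgentState::Running);
    println!("restart sketch ok");
}
```

Restarting an already-Running idle agent is a no-op apart from the state refresh, which is what makes the `/start` alias safe to call unconditionally.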