Description
Problem
The Redis task queue system (utils/task_queue.py) supports task retry with exponential backoff when tasks fail, but there is no failover mechanism that detects dead workers (expired heartbeats) and migrates their assigned tasks to healthy workers.
Current behavior:
- Workers register and send heartbeats via `npu:worker:{id}:status` keys with a TTL
- When a worker dies, its heartbeat TTL expires silently
- Tasks assigned to the dead worker remain in the `{queue}:running` ZSET indefinitely
- No background process scans for orphaned tasks or re-queues them
The CircuitBreaker pattern in services/load_balancer.py prevents new tasks from being sent to unhealthy workers, but doesn't recover existing tasks stuck on dead workers.
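The issue does not show the actual `services/load_balancer.py` code, but the pattern it names works roughly like this minimal sketch (all names, thresholds, and the injectable `clock` parameter are hypothetical, for illustration only):

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: after `threshold` consecutive
    failures the breaker opens and rejects new dispatches to a worker
    until `reset_timeout` elapses, then allows a trial request.

    NOTE: a sketch of the general pattern, not the project's actual
    implementation. The `clock` parameter is injected for testability.
    """

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened, or None

    def allow(self):
        """Return True if a new task may be sent to this worker."""
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if self.clock() - self.opened_at >= self.reset_timeout:
            # Half-open: reset and permit one trial request.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # open: reject dispatch

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # trip the breaker

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

As the issue notes, a breaker like this only gates *new* dispatches; it has no view of tasks a dead worker had already accepted, which is exactly the gap the failover monitor is meant to close.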
Discovered During
Documentation research for #1749 (Context7 benchmark: distributed task failover with Redis)
Location
- Task queue: `autobot-backend/utils/task_queue.py` (lines 155-918)
- Worker manager: `autobot-backend/services/npu_worker_manager.py` (lines 38-701)
- Load balancer: `autobot-backend/services/load_balancer.py`
Impact
Medium — Tasks can be permanently stuck if a worker crashes mid-execution. Currently only affects NPU inference tasks.
Suggested Fix
Add a background `failover_monitor` coroutine that:
- Periodically scans the `workers:registered` SET
- Checks the heartbeat TTL for each worker
- If a heartbeat has expired: re-queues that worker's tasks from `{queue}:running` back to `{queue}:pending`
- Increments `retry_count` and respects `max_retries`
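The scan logic above can be sketched as a pure function. To keep the example runnable without a Redis server, a hypothetical `FakeStore` stands in for the Redis structures the issue describes (heartbeat keys with TTL, the registered-worker SET, and the running/pending collections); the field names mirror the issue, but everything here is an assumption-laden sketch, not the project's code:

```python
class FakeStore:
    """In-memory stand-in for the Redis state described in the issue."""

    def __init__(self):
        self.heartbeats = {}   # worker_id -> heartbeat expiry timestamp
        self.registered = set()  # mirrors the workers:registered SET
        self.running = {}      # worker_id -> tasks (mirrors {queue}:running)
        self.pending = []      # mirrors {queue}:pending


def failover_scan(store, now, max_retries=3):
    """One pass of the proposed failover_monitor.

    Finds workers whose heartbeat has expired and migrates their tasks
    back to pending, incrementing retry_count and enforcing max_retries.
    Returns (requeued_tasks, dropped_tasks).
    """
    requeued, dropped = [], []
    for worker_id in list(store.registered):
        expiry = store.heartbeats.get(worker_id, 0)
        if expiry > now:
            continue  # heartbeat still alive; worker is healthy

        # Worker is dead: pull its tasks out of the running set.
        for task in store.running.pop(worker_id, []):
            task["retry_count"] = task.get("retry_count", 0) + 1
            if task["retry_count"] > max_retries:
                dropped.append(task)  # retry budget exhausted
            else:
                store.pending.append(task)  # re-queue for a healthy worker
                requeued.append(task)

        store.registered.discard(worker_id)  # deregister the dead worker
    return requeued, dropped
```

In a real implementation each pass would run inside the proposed coroutine on a timer, and the re-queue step would need to be atomic (e.g. a Lua script or a MULTI/EXEC transaction) so a task cannot be lost or duplicated if the monitor itself crashes mid-migration.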