
Bug: No automatic task migration from dead workers — only retry exists #1769

@mrveiss

Description
Problem

The Redis task queue system (utils/task_queue.py) supports task retry with exponential backoff when tasks fail, but there is no failover mechanism that detects dead workers (expired heartbeats) and migrates their assigned tasks to healthy workers.

Current behavior:

  • Workers register and send heartbeats via npu:worker:{id}:status keys with TTL
  • When a worker dies, its heartbeat TTL expires silently
  • Tasks assigned to the dead worker remain in {queue}:running ZSET indefinitely
  • No background process scans for orphaned tasks or re-queues them
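The heartbeat pattern above can be sketched roughly as follows. This is a minimal illustration, not the actual `task_queue.py` code: `HEARTBEAT_TTL` and `heartbeat_loop` are hypothetical names, and `redis` is assumed to be a `redis.asyncio`-style async client passed in by the caller.

```python
import asyncio

HEARTBEAT_TTL = 15  # seconds; assumed value, not taken from the codebase

async def heartbeat_loop(redis, worker_id: str) -> None:
    """Register a worker and keep its heartbeat key alive.

    `redis` is assumed to expose async sadd/set (e.g. redis.asyncio.Redis).
    """
    await redis.sadd("workers:registered", worker_id)
    while True:
        # SET ... EX refreshes the TTL on every beat. If the worker process
        # dies, this key silently expires -- and, as described above, nothing
        # notices or re-queues the tasks still sitting in {queue}:running.
        await redis.set(f"npu:worker:{worker_id}:status", "alive",
                        ex=HEARTBEAT_TTL)
        await asyncio.sleep(HEARTBEAT_TTL / 3)
```

Because liveness is expressed only as key expiry, detecting a dead worker reduces to an `EXISTS` check on its status key, which is what the suggested fix below relies on.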

The CircuitBreaker pattern in services/load_balancer.py prevents new tasks from being sent to unhealthy workers, but doesn't recover existing tasks stuck on dead workers.

Discovered During

Documentation research for #1749 (Context7 benchmark: distributed task failover with Redis)

Location

  • Task queue: autobot-backend/utils/task_queue.py (lines 155-918)
  • Worker manager: autobot-backend/services/npu_worker_manager.py (lines 38-701)
  • Load balancer: autobot-backend/services/load_balancer.py

Impact

Medium — Tasks can be permanently stuck if a worker crashes mid-execution. Currently only affects NPU inference tasks.

Suggested Fix

Add a background failover_monitor coroutine that:

  1. Periodically scans workers:registered SET
  2. Checks heartbeat TTL for each worker
  3. If heartbeat expired: re-queue tasks from {queue}:running back to {queue}:pending
  4. Increment retry_count, respect max_retries
