
Auto-retry failed Tasks in spawner dedup logic#428

Closed
axon-agent[bot] wants to merge 1 commit into main from axon-fake-strategist-20260225-1200

Conversation

axon-agent bot commented Feb 25, 2026

🤖 Axon Agent @gjkim42

Summary

When a Task created by a TaskSpawner fails, the spawner's dedup logic previously treated it the same as a succeeded task — the task name existed in the existingTasks map, so the work item was skipped on subsequent poll cycles. This meant:

  • Cron spawners: A failed task blocked retries until TTL cleanup. For axon-fake-strategist (TTL: 864000s = 10 days), a single failure could waste 10 days. For axon-workers (TTL: 3600s), it wasted an hour.
  • GitHub issue spawners: A failed task for an issue blocked any retry for that issue until TTL deletion.

Changes

cmd/axon-spawner/main.go:

  • Track failed tasks separately in the dedup phase
  • Include failed tasks in the newItems list (eligible for retry)
  • Before creating a retry task, delete the old failed Task object (required because Task spec is immutable; deleting frees the name so the retry can reuse it)
  • Handle NotFound gracefully in case TTL already cleaned it up
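The dedup change above can be sketched as follows. This is a minimal, hypothetical model: `TaskPhase`, `dedup`, and the map-based bookkeeping are illustrative stand-ins for the real types in cmd/axon-spawner/main.go, not the actual API.

```go
package main

import "fmt"

// TaskPhase models the observed status of a spawned Task (illustrative).
type TaskPhase string

const (
	PhaseRunning   TaskPhase = "Running"
	PhaseSucceeded TaskPhase = "Succeeded"
	PhaseFailed    TaskPhase = "Failed"
)

// dedup returns the work items that should get a Task created this cycle,
// plus the set of names whose old failed Task must be deleted first
// (Task spec is immutable, so a failed Task cannot be edited in place).
func dedup(existing map[string]TaskPhase, items []string) ([]string, map[string]bool) {
	var newItems []string
	needsDelete := make(map[string]bool)
	for _, item := range items {
		phase, exists := existing[item]
		switch {
		case !exists:
			newItems = append(newItems, item) // genuinely new work
		case phase == PhaseFailed:
			needsDelete[item] = true
			newItems = append(newItems, item) // now eligible for retry
		default:
			// Running or Succeeded: still deduplicated, skip.
		}
	}
	return newItems, needsDelete
}

func main() {
	existing := map[string]TaskPhase{
		"issue-12": PhaseSucceeded,
		"issue-34": PhaseFailed,
	}
	newItems, needsDelete := dedup(existing, []string{"issue-12", "issue-34", "issue-56"})
	fmt.Println(newItems)    // [issue-34 issue-56]
	fmt.Println(needsDelete) // map[issue-34:true]
}
```

The key difference from the old behavior is the `PhaseFailed` case: previously any name present in `existing` was skipped, regardless of phase.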

cmd/axon-spawner/main_test.go:

  • Add TestRunCycleWithSource_FailedTaskRetriedImmediately — verifies failed task is deleted and recreated
  • Add TestRunCycleWithSource_SucceededTaskNotRetried — verifies succeeded tasks remain deduplicated
  • Add TestRunCycleWithSource_FailedTaskRetryRespectsMaxConcurrency — verifies retries still respect concurrency limits
  • Update TestRunCycleWithSource_CompletedTasksDontCountTowardsLimit to use two succeeded tasks (the original used one failed task, which is now retried by design)

Design decisions

Why unconditional retry (no retry limit)? This is the simplest correct behavior — the spawner already has maxTotalTasks to cap total task creation, and maxConcurrency to limit parallelism. A retry-specific limit would add complexity for little gain. Users who want bounded retries can use maxTotalTasks. A future retryPolicy field (#298) could add more sophisticated control.

Why delete-and-recreate instead of updating? Task spec has an immutability validation rule (self == oldSelf), so the failed task cannot be updated. Deleting and recreating with the same name preserves the spawner's naming convention.
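The delete-and-recreate step, including NotFound tolerance, can be sketched as below. The real code would use the Kubernetes client and `apierrors.IsNotFound`; here `fakeClient` and `errNotFound` are stand-ins so the sketch is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found") // stands in for a k8s NotFound error

// fakeClient is a hypothetical in-memory stand-in for the Kubernetes client.
type fakeClient struct{ tasks map[string]string }

func (c *fakeClient) Delete(name string) error {
	if _, ok := c.tasks[name]; !ok {
		return errNotFound
	}
	delete(c.tasks, name)
	return nil
}

func (c *fakeClient) Create(name, spec string) error {
	if _, ok := c.tasks[name]; ok {
		return fmt.Errorf("task %q already exists", name)
	}
	c.tasks[name] = spec
	return nil
}

// retryTask deletes the old failed Task (tolerating NotFound, in case
// TTL cleanup already removed it) and recreates it under the same name,
// preserving the spawner's naming convention.
func retryTask(c *fakeClient, name, spec string) error {
	if err := c.Delete(name); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return c.Create(name, spec)
}

func main() {
	c := &fakeClient{tasks: map[string]string{"issue-34": "old failed spec"}}
	fmt.Println(retryTask(c, "issue-34", "fresh spec")) // <nil>: deleted, then recreated
	fmt.Println(retryTask(c, "issue-99", "fresh spec")) // <nil>: NotFound tolerated
	fmt.Println(c.tasks["issue-34"])                    // fresh spec
}
```

Updating in place is not an option because the immutability rule rejects any spec change; only a fresh object under the same name works.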

Relates to

Test plan

  • go test ./cmd/axon-spawner/ -v — all 27 tests pass
  • make verify — passes
  • make build — compiles cleanly
  • CI pipeline (build, verify, test, test-integration)

Summary by cubic

Automatically retries failed Tasks by detecting failures during spawner dedup, deleting the failed Task, and recreating it with the same name. This prevents cron and issue spawners from getting stuck until TTL cleanup and still honors maxConcurrency and maxTotalTasks.

  • Bug Fixes

Written for commit cf6c3bf.

When a Task created by a TaskSpawner fails, the spawner's dedup logic
previously treated it the same as a succeeded task — the task name
existed, so the work item was skipped. This meant failed cron tasks
would block retries until TTL cleanup (up to 10 days for the
strategist spawner).

Now the spawner detects failed Tasks during dedup, deletes them, and
recreates them for immediate retry. Succeeded tasks remain
deduplicated as before. The retry still respects maxConcurrency and
maxTotalTasks limits.

Partially addresses #287 (problem 1: cron dedup prevents retry after
failure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cubic-dev-ai bot left a comment

No issues found across 2 files

gjkim42 closed this Feb 27, 2026
