fix: add retry with exponential backoff for bot launch 409 Conflict #1241

Open
konard wants to merge 9 commits into main from issue-1240-8c8a60cb5e70

Conversation

@konard
Contributor

@konard konard commented Feb 8, 2026

Summary

  • Add exponential backoff retry for bot.launch() when it fails with a 409 Conflict error (e.g., due to restart overlap, stale TCP connections, or network issues)
  • Extract retry logic into telegram-bot-launcher.lib.mjs with pure, testable functions
  • Add comprehensive case study documenting root causes, Telegraf behavior, and community research

Root Cause

The Telegram Bot API allows only one active getUpdates connection per bot token. When a second request arrives (e.g., during process restart overlap, Docker container restart, or network reconnection), the API returns a 409 Conflict error. Telegraf treats this as fatal and throws immediately. The bot's error handler then called process.exit(1) with no retry logic, making the bot permanently unavailable until manually restarted.

Even with a single bot instance, this can happen due to:

  • Process restart overlap (new instance starts before old connection times out)
  • Docker restart policy (restart: unless-stopped) causing container overlap
  • Unclean termination (SIGKILL, OOM) preventing graceful bot.stop() call
  • Network-level half-open TCP connections after network partition

Full analysis: docs/case-studies/issue-1240/README.md

Changes

| File | Change |
|---|---|
| src/telegram-bot-launcher.lib.mjs | New: Retry logic with exponential backoff, error classification, delay formatting |
| src/telegram-bot.mjs | Replace deleteWebhook().then(bot.launch()).catch(exit) with launchBotWithRetry(), add launchAbortController for clean shutdown |
| tests/test-telegram-bot-launcher.mjs | New: 41 unit tests covering all retry scenarios |
| package.json | Add new test to test script |
| docs/case-studies/issue-1240/README.md | New: Case study with timeline, root cause analysis, 5 proposed solutions |
| docs/case-studies/issue-1240/telegraf-issues-research.md | New: Analysis of 8 relevant Telegraf GitHub issues |
| docs/case-studies/issue-1240/community-research.md | New: Cross-community research across 6+ frameworks |
| docs/case-studies/issue-1240/error-log.txt | New: Raw error log from production incident |
| .changeset/fix-bot-409-retry.md | Changeset for release |

How the retry works

Attempt 1: launch() → 409 → wait 1s
Attempt 2: launch() → 409 → wait 2s
Attempt 3: launch() → 409 → wait 4s
Attempt 4: launch() → 409 → wait 8s
...
Attempt 11: launch() → 409 → wait 10m (capped at max)
Attempt 12+: launch() → 409 → wait 10m (stays at max)
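
The schedule above can be reproduced with a small delay calculation. The following is a minimal sketch, not the code from telegram-bot-launcher.lib.mjs: the constant names, the function names baseRetryDelay and withJitter, and the choice to add jitter on top of the capped delay (rather than spreading it around the delay) are assumptions made for illustration. Only the 1s starting delay, the doubling, the 10 minute cap and the 10% jitter figure come from this PR.

```javascript
// Assumed constants, taken from the figures quoted in this PR.
const INITIAL_DELAY_MS = 1000; // first wait: 1s
const MAX_DELAY_MS = 10 * 60 * 1000; // cap: 10 minutes
const JITTER_RATIO = 0.1; // up to 10% extra wait

// Base delay for a 1-based attempt number: 1s, 2s, 4s, ... capped at 10m.
function baseRetryDelay(attempt) {
  const exponential = INITIAL_DELAY_MS * 2 ** (attempt - 1);
  return Math.min(exponential, MAX_DELAY_MS);
}

// Adds between 0% and 10% to a delay. `random` is injectable so that
// tests can make the result deterministic.
function withJitter(delayMs, random = Math.random) {
  return Math.round(delayMs + delayMs * JITTER_RATIO * random());
}

// Print the base schedule for the first 12 attempts.
for (let attempt = 1; attempt <= 12; attempt++) {
  console.log(`Attempt ${attempt}: wait ${baseRetryDelay(attempt) / 1000}s`);
}
```

With these constants, attempt 10 waits 512s and attempt 11 is the first one to hit the 600s cap, which matches the table above.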

Each attempt:

  1. Calls deleteWebhook({ drop_pending_updates: true }) to clear any stale webhook
  2. Calls bot.launch() with configured options
  3. On retryable error (409, 429, 5xx, network): waits with exponential backoff + 10% jitter
  4. On non-retryable error (401 Unauthorized): exits immediately
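
The retryable versus non-retryable split in steps 3 and 4 could be expressed as a pure predicate. This is a hedged sketch only: the PR names an isRetryableError function but does not show it, so the body below is a guess. It assumes Telegram API errors expose the HTTP status as error.response.error_code and that network failures carry a Node.js-style string in error.code. The list of network codes is illustrative, and treating every other 4xx status as non-retryable is an assumption (the PR only mentions 401 explicitly).

```javascript
// Illustrative set of Node.js network error codes; the real list is unknown.
const NETWORK_ERROR_CODES = new Set([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
  "ENOTFOUND",
  "EAI_AGAIN",
]);

// Returns true when a failed launch attempt is worth retrying.
function isRetryableError(error) {
  const status = error?.response?.error_code; // assumed error shape
  if (typeof status === "number") {
    // 409 Conflict, 429 Too Many Requests and any 5xx are retried;
    // everything else with a status (e.g. 401) is treated as fatal here.
    return status === 409 || status === 429 || (status >= 500 && status <= 599);
  }
  // No API status: retry only for recognised network-level failures.
  return NETWORK_ERROR_CODES.has(error?.code);
}

console.log(isRetryableError({ response: { error_code: 409 } })); // true
console.log(isRetryableError({ response: { error_code: 401 } })); // false
```

Keeping the classification free of side effects is what makes it straightforward to cover with unit tests.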

The retry loop is interruptible via AbortSignal: a SIGINT or SIGTERM received during a retry wait cleanly stops the process.
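
One way to make a retry wait interruptible is to race the timer against the signal's abort event. The sketch below illustrates the idea under stated assumptions: abortableDelay and launchWithRetry are hypothetical names, and the options object (signal, delayForAttempt, shouldRetry, onRetry) is not the signature of the PR's launchBotWithRetry, which is not shown here.

```javascript
// Builds an error that callers can recognise as a cancellation.
function makeAbortError(message) {
  const error = new Error(message);
  error.name = "AbortError";
  return error;
}

// Resolves after `ms`, or rejects as soon as `signal` is aborted.
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(makeAbortError("Aborted before retry wait"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // do not keep the process alive
      reject(makeAbortError("Aborted during retry wait"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Calls `launchOnce` until it succeeds, a non-retryable error occurs,
// or the signal is aborted. Attempts are numbered from 1.
async function launchWithRetry(launchOnce, options) {
  const { signal, delayForAttempt, shouldRetry, onRetry } = options;
  for (let attempt = 1; ; attempt++) {
    if (signal?.aborted) throw makeAbortError("Aborted before launch");
    try {
      return await launchOnce();
    } catch (error) {
      if (!shouldRetry(error)) throw error; // e.g. 401: give up at once
      const delayMs = delayForAttempt(attempt);
      onRetry?.({ attempt, delayMs, error });
      await abortableDelay(delayMs, signal);
    }
  }
}
```

In a setup like the one this PR describes, launchOnce would presumably wrap the deleteWebhook call and bot.launch(), and the SIGINT/SIGTERM handlers would call abort() on the controller that owns the signal.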

Test plan

  • All 41 new launcher tests pass (node tests/test-telegram-bot-launcher.mjs)
  • Existing message filter tests pass (51 tests)
  • Prettier formatting passes
  • Changeset added
  • Manual test: start bot with invalid token → should exit immediately (401)
  • Manual test: start bot while another instance is running → should retry and succeed after old instance stops

Fixes #1240


🤖 Generated with Claude Code

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: #1240
@konard konard self-assigned this Feb 8, 2026
konard and others added 6 commits February 8, 2026 11:21
When the Telegram Bot API returns a 409 Conflict error during bot startup
(e.g., due to restart overlap, stale connections, or network issues), the
bot now retries with exponential backoff instead of immediately exiting.

- Extract launch retry logic into telegram-bot-launcher.lib.mjs
- Retry schedule: 1s, 2s, 4s, 8s, 16s, 32s, 64s, ... up to 10 minutes max
- Non-retryable errors (401 Unauthorized) still cause immediate exit
- AbortSignal support for clean cancellation during shutdown
- 10% jitter on retry delays to prevent thundering herd

Fixes #1240

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
41 tests covering:
- isRetryableError: 401 (non-retryable) vs 409/429/5xx/network (retryable)
- calculateRetryDelay: exponential backoff schedule, jitter, cap at max
- formatDelay: human-readable delay formatting
- launchBotWithRetry: success, retry on 409, abort via signal, onRetry callback

Refs #1240

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive analysis including:
- Timeline reconstruction from production logs
- Root cause analysis (6 identified causes)
- Telegraf source code analysis (polling.ts error classification)
- Community research across 8+ bot libraries
- 5 proposed solutions with code examples
- References to relevant Telegraf GitHub issues

Refs #1240

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@konard konard changed the title from "[WIP] hive-telegram-bot: 409: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running" to "fix: add retry with exponential backoff for bot launch 409 Conflict" Feb 8, 2026
These are available as globals in Node.js 15+ (project requires 18+).
Needed for the launch retry abort signal.

Refs #1240

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@konard konard marked this pull request as ready for review February 8, 2026 10:33
@konard
Contributor Author

konard commented Feb 8, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $7.534719 USD
  • Calculated by Anthropic: $8.016356 USD
  • Difference: $0.481637 (+6.39%)
    📎 Log file uploaded as Gist (1395KB)
    🔗 View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

@konard konard added the ready Is ready to be merged label Feb 12, 2026