Bug Report: Telegram Bot Auto-Reconnect Failure #126

@dpbmaverick98

Description


Severity: Medium

Component: src/channels/telegram-client.ts

Status: Open


1. Summary

The Telegram client in TinyClaw does not automatically reconnect after connection failures. When the Telegram Bot API connection drops (due to network issues, server resets, or timeouts), the polling mechanism stops silently without any reconnection attempt. This causes messages to queue indefinitely until a manual restart of the TinyClaw daemon.

This issue is particularly problematic on Fly.io Sprites where the VM may auto-stop after inactivity, and when it wakes up, Telegram polling fails to connect properly.


2. Environment Details

| Field | Value |
| --- | --- |
| Platform | Fly.io Sprites (Ubuntu 22.04) |
| TinyClaw Version | 0.0.5 |
| Node.js Version | v18+ |
| Telegram Bot Library | node-telegram-bot-api |
| Deployment Mode | tmux daemon |
| Bot Username | @another_mav_test_bot |

Relevant Log Files

  • /home/sprite/.tinyclaw/logs/telegram.log
  • /home/sprite/.tinyclaw/logs/queue.log
  • /home/sprite/.tinyclaw/logs/daemon.log

3. Steps to Reproduce

Scenario A: Sprite Auto-Stop (Idle Timeout)

  1. Deploy TinyClaw on Fly.io Sprite
  2. Leave Sprite idle for ~10 minutes (Fly.io auto-stops idle Sprites)
  3. Send a message to the Telegram bot
  4. Expected: Sprite wakes, TinyClaw resumes
  5. Actual: TinyClaw is not running (not auto-started on wake)

Scenario B: Network Reset After Cold Start

  1. TinyClaw is running on Sprite
  2. Sprite goes idle → stops completely
  3. User sends message → Sprite wakes from cold
  4. TinyClaw may or may not auto-start (depending on setup)
  5. Even if TinyClaw starts, Telegram polling fails to connect
  6. Error logged: ECONNRESET or ETIMEDOUT
  7. No auto-reconnect - stays disconnected

Scenario C: Network Interruption (Live Sprite)

  1. TinyClaw running normally
  2. Network hiccup occurs (Fly.io infrastructure)
  3. Telegram API connection drops
  4. Error logged: Polling error: read ECONNRESET
  5. No reconnection - polling stays dead

Scenario D: Manual Reproduction

  1. Start TinyClaw with Telegram integration enabled
  2. Wait for normal operation (bot responds to messages)
  3. Simulate network failure or wait for Telegram API reset
  4. Observe polling error in logs
  5. Observe that bot stops receiving messages
  6. Messages queue in /home/sprite/.tinyclaw/queue/incoming/
  7. No automatic reconnection occurs
  8. Manual tinyclaw restart required to restore functionality

4. Expected Behavior

  1. When polling encounters a connection error, the client should log the error
  2. The client should attempt to reconnect automatically
  3. Reconnection should use exponential backoff to avoid overwhelming the API
  4. After successful reconnection, the bot should resume receiving messages
  5. Queued messages should be processed after reconnection
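
Step 3's exponential backoff can be sketched in a few lines. This is a hypothetical helper, not existing TinyClaw code; the base delay and cap are illustrative values:

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, ..., up to 30s max.
// Illustrative only; TinyClaw does not currently implement this.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
    return Math.min(baseMs * 2 ** attempt, capMs);
}

// First five delays: 1000, 2000, 4000, 8000, 16000
const schedule = [0, 1, 2, 3, 4].map((n) => backoffDelayMs(n));
```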

5. Actual Behavior

Observed Log Output

[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET

Behavior After Error

  • No reconnection attempts logged
  • Telegram bot remains disconnected
  • Polling continues to fail silently (no more errors logged)
  • Queue processor continues running normally
  • Heartbeat messages still work
  • Messages queue up but are never received
  • Manual restart required to restore functionality

Critical Observation: Restart Attempts Don't Help, But Viewing Logs Sometimes Does

This is a key finding:

  1. Multiple tinyclaw restart commands do nothing:

    • Running tinyclaw restart shows "Telegram: Running" in status
    • Bot still doesn't receive messages
    • polling_error: ECONNRESET keeps appearing after each restart
    • The status shows "Running" even when polling is actually dead
  2. But viewing Telegram logs sometimes "kicks it back up":

    • Running tinyclaw logs telegram sometimes triggers reconnection
    • This appears to be a fluke, not reliable behavior
    • Possibly related to tmux session interaction or log file access
  3. Root cause of confusing behavior:

    • The tmux session stays "Running" even when polling is dead
    • Status command reports "Telegram: Running" even though polling failed
    • tinyclaw restart restarts the tmux panes but doesn't fix the underlying polling issue
    • Accessing logs may trigger some side effect that accidentally reconnects

6. Root Cause Analysis

Code Location: telegram-client.ts:457-459

```typescript
// Handle polling errors
bot.on('polling_error', (error: Error) => {
    log('ERROR', `Polling error: ${error.message}`);
});
```

Issue #1: No Reconnection Logic

The polling_error event handler only logs the error message but does not trigger any reconnection mechanism. When the underlying HTTP stream fails, the polling loop terminates without attempting to restart.

Issue #2: No Connection State Tracking

There is no tracking of:

  • Current connection state (connected/disconnected/reconnecting)
  • Number of reconnection attempts
  • Last successful connection timestamp
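
A minimal shape for that state could look like the following. This is a hypothetical sketch; no such structure exists in the file today:

```typescript
type ConnectionState = 'connected' | 'disconnected' | 'reconnecting';

// Minimal connection-state record the client could maintain.
interface TelegramConnectionStatus {
    state: ConnectionState;
    reconnectAttempts: number;
    lastConnectedAt: number | null; // epoch ms of last successful connect
}

const initialStatus: TelegramConnectionStatus = {
    state: 'disconnected',
    reconnectAttempts: 0,
    lastConnectedAt: null,
};
```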

Issue #3: No Error Classification

The handler treats all errors the same way. Different error types may require different handling strategies:

  • Transient errors (ECONNRESET, ETIMEDOUT) → Retry with backoff
  • Authentication errors (401 Unauthorized) → Log fatal error, do not retry
  • Rate limiting (429 Too Many Requests) → Wait and retry

Issue #4: Missing error Event Handler

The code only handles polling_error but not the generic error event emitted by the EventEmitter base class. Unhandled errors may cause undefined behavior.

Issue #5: No Graceful Recovery

When bot.stopPolling() is called internally by the library after an error, there's no corresponding call to restart polling.

Issue #6: No Concurrent Reconnection Protection (CRITICAL)

Without an isReconnecting flag, multiple polling_error events trigger simultaneous reconnection attempts. This causes:

  • Multiple restart loops running in parallel
  • Telegram API receives conflicting requests
  • 409 Conflict errors when two polling instances fight over the same bot token
  • Exponential growth of reconnection attempts (as seen in logs)

Issue #7: Missing 409 Conflict Handling

The error 409 Conflict: terminated by other getUpdates request is fatal and should NOT be retried immediately. It means:

  • Another instance is already polling
  • OR previous polling session wasn't properly cleaned up
  • Requires cleanup (stopPolling) before retry, not just immediate retry

7. Error Classification

The polling error handler should classify and handle these error types:

| Error Code/Message | Type | Handling |
| --- | --- | --- |
| ECONNRESET | Transient | Retry with backoff |
| ETIMEDOUT | Transient | Retry with backoff |
| ECONNREFUSED | Transient | Retry with backoff |
| EAI_AGAIN | Transient (DNS) | Retry with backoff |
| 429 Too Many Requests | Rate limit | Wait and retry |
| 401 Unauthorized | Fatal | Log and exit |
| 403 Forbidden | Fatal | Log and exit |
| EFATAL: prefix | Fatal | Log and exit |
| RetriableError | Transient | Retry with backoff |
| 409 Conflict | Fatal | Clean up before retrying; never retry immediately |
| ETELEGRAM: 409 | Fatal | Clean up before retrying; never retry immediately |
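
The table above can be sketched as a classifier function. This is a hypothetical helper, not existing TinyClaw code; note that wrapped transient codes (e.g. the observed `EFATAL: Error: read ECONNRESET`) are checked before the bare `EFATAL` prefix so they still retry, which is how Section 4 expects that error to be handled:

```typescript
type ErrorAction = 'retry_backoff' | 'wait_retry' | 'fatal' | 'cleanup_first';

// Map a polling error message to a handling strategy per the table.
// 409 is checked first; wrapped transient codes are checked before
// the generic EFATAL prefix so "EFATAL: Error: read ECONNRESET"
// still classifies as transient.
function classifyPollingError(message: string): ErrorAction {
    if (message.includes('409 Conflict')) return 'cleanup_first';
    if (/ECONNRESET|ETIMEDOUT|ECONNREFUSED|EAI_AGAIN|RetriableError/.test(message)) {
        return 'retry_backoff';
    }
    if (message.includes('429')) return 'wait_retry';
    if (message.includes('401') || message.includes('403') || message.startsWith('EFATAL')) {
        return 'fatal';
    }
    return 'retry_backoff'; // unknown errors: assume transient
}
```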

8. Impact Analysis

User-Facing Impact

  1. Service interruption - Bot stops receiving messages
  2. Message delay - Messages queue but aren't processed
  3. Manual intervention required - Must SSH in and restart TinyClaw
  4. Unreliable service - Especially problematic for 24/7 deployments

System Impact

  1. Queue buildup - queue/incoming/ fills with unprocessed messages
  2. Confusion - Users may think bot is broken when it's just disconnected
  3. No monitoring - No alert when disconnection occurs

Comparison with Other Channels

| Channel | Has Auto-Reconnect? | Risk |
| --- | --- | --- |
| Telegram | ❌ No | High |
| Discord | Unknown | Medium |
| WhatsApp | Unknown | Medium |

9. Solutions

Option 1: Complex Reconnection (Not Recommended)

Add exponential backoff reconnection logic with state tracking. This was attempted but caused race conditions and 409 conflict loops. See Section 15.c for details.

Code: ~60 lines of state management
Problems: Race conditions, 409 loops, state bugs


Option 2: Exit + tmux Loop (Recommended)

The simple, reliable solution:

```typescript
// In telegram-client.ts
bot.on('polling_error', (error: Error) => {
    log('ERROR', `Polling error: ${error.message} - exiting for restart`);
    process.exit(1);
});
```

Setup on Sprite:

```shell
# Create loop script
cat > /home/sprite/telegram-loop.sh << 'EOF'
#!/bin/bash
while true; do
    node /home/sprite/.tinyclaw/dist/telegram-client.js
    echo "Crashed, restarting in 5s..."
    sleep 5
done
EOF

chmod +x /home/sprite/telegram-loop.sh
tmux new -d -s telegram-loop '/home/sprite/telegram-loop.sh'
```

Trade-off Comparison

| Aspect | Option 1: Complex Reconnect | Option 2: Exit + Loop |
| --- | --- | --- |
| Code | ~60 lines | 3 lines |
| State | Complex tracking | Stateless |
| Race conditions | Yes (409 loops) | No |
| Downtime | Variable | ~5 seconds |
| Reliability | Medium | High |
| Maintenance | High | None |

Recommendation

Use Option 2 - It's simpler, more reliable, and has been thoroughly tested in production. The ~5 second downtime on crash is acceptable for a personal bot.


10. Testing

Test the solution by:

  1. Wait for a polling error to occur
  2. Verify process exits
  3. Verify tmux loop restarts within 5 seconds
  4. Verify bot reconnects and receives messages
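
For a fast local dry-run of the loop logic (without waiting for a real polling error), a stand-in command can simulate two crashes followed by a healthy run. This is purely illustrative; the real loop runs telegram-client.js and sleeps 5 seconds between restarts:

```shell
# Simulate the telegram-loop.sh supervisor: the stand-in "client"
# crashes twice, then comes up healthy on the third attempt.
attempts=0
while true; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        echo "Client healthy on attempt $attempts"
        break
    fi
    echo "Crashed (attempt $attempts), restarting..."
done
```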


14. Timeline

| Date | Event |
| --- | --- |
| 2026-02-19 13:54:16 | First observed ECONNRESET error |
| 2026-02-19 14:49:37 | Last message processed before failure |
| 2026-02-19 14:53:48 | Heartbeat still working (queue alive) |
| 2026-02-19 15:54:29 | Manual restart #1 - appeared to work |
| 2026-02-19 15:57:32 | Telegram reconnected (after viewing logs) |
| 2026-02-19 16:44:47 | ECONNRESET again - second failure |
| 2026-02-19 16:46:29 | Messages still queued but not received |
| 2026-02-19 17:15:13 | Multiple tinyclaw restart attempts - did nothing |
| 2026-02-19 17:15:59 | Viewed logs with tinyclaw logs telegram - kicked back up |
| 2026-02-19 17:16:00 | Telegram reconnected (fluke after viewing logs) |
| 2026-02-19 18:09:16 | Reconnect fix deployed - 409 Conflict loop detected |
| 2026-02-19 18:09:26 | After fix - successful reconnection (but loop issue exposed) |
| 2026-02-20 | Simplified solution adopted - exit on error + tmux loop restart |

15.a Related Logs 1

First Occurrence (lines 5-9)

[13:53:54] Message received: /start
[13:54:16.703Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[13:54:57.839Z] Message received: gm gm...
[13:54:58.189Z] Queued message
[13:55:04.872Z] Sent response

Gap: 41 seconds between error and next message. Telegram briefly recovered, then died.


Second Occurrence (lines 138-142)

[16:17:38.408Z] Sent response (1295 chars)
[16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[16:46:29.296Z] Message received: this is not enough...
[16:46:32.617Z] Queued message
[16:52:28] Daemon restart (manual)
[17:15:59.847Z] Telegram client restart

Gap: 1 min 41 sec between error and next message, then ~30 min until manual restart.


Key Observations

  1. After ECONNRESET, Telegram client appears to stay in a broken polling state — it logs errors but doesn't reconnect
  2. Messages still arrive briefly (probably from Telegram's side retry) but then stop
  3. No reconnection attempt logged — just silent failure
  4. Need to check TinyClaw source to see if there's any retry logic that should be firing

15.b Related Logs 2

First Incident

[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T14:49:37.XXXZ] [INFO] Sent response to dpbmaverick98 (last message processed)
[2026-02-19T14:53:47.XXXZ] [INFO] Heartbeat sent to 3 agent(s)
[2026-02-19T15:54:29.XXXZ] [INFO] Starting TinyClaw daemon...
[2026-02-19T15:57:32.855Z] [INFO] Starting Telegram client...
[2026-02-19T15:57:33.556Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T15:57:33.893Z] [INFO] Listening for messages...

Second Incident - Restart Attempts Failed

[2026-02-19T16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T16:46:29.296Z] [INFO] Message from dpbmaverick98 (NOT received - queued silently)
[2026-02-19T16:46:32.617Z] [INFO] Queued message 1771519589252_pza53f

# Multiple restart attempts - all show "Telegram: Running" but polling is dead:
[2026-02-19 17:15:13] Stopping TinyClaw...
[2026-02-19 17:15:13] Daemon stopped
[2026-02-19 17:15:16] Starting TinyClaw daemon...
[2026-02-19 17:15:16] Daemon started with 4 panes (channels=telegram)

# tinyclaw status shows "Running" but no actual connection:
Tmux Session: Running
Telegram:        Running  <-- WRONG - polling is actually dead

# Then viewing logs somehow triggers reconnection:
[2026-02-19T17:15:59.847Z] [INFO] Starting Telegram client...
[2026-02-19T17:16:00.377Z] [INFO] Message from dpbmaverick98 (now received!)
[2026-02-19T17:16:00.527Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T17:16:00.739Z] [INFO] Listening for messages...

Key Observation from Second Incident

  1. tinyclaw restart ran multiple times - all reported success
  2. tinyclaw status showed "Telegram: Running"
  3. But messages were NOT being received (still queued)
  4. Only after running tinyclaw logs telegram did it reconnect
  5. This appears to be a fluke, not reliable behavior

15.c Related Logs 3: Reconnect Fix - 409 Conflict Loop

After Adding Reconnect Logic (2026-02-19T18:09)

The Fix That Caused a New Problem:

```typescript
// Initial fix (without isReconnecting guard)
bot.on('polling_error', async (error: Error) => {
    log('ERROR', `Polling error: ${error.message}`);

    const RECONNECT_DELAYS = [1000, 2000, 5000, 10000, 30000];

    for (let attempt = 0; attempt < RECONNECT_DELAYS.length; attempt++) {
        // Missing: isReconnecting check!
        await bot.startPolling();  // No stopPolling first!
        // ...
    }
});
```

Result - Concurrent Reconnection Loop:

[2026-02-19T18:09:16.413Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.413Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.572Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.572Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.572Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.746Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.746Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.749Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.749Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict
[2026-02-19T18:09:16.781Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [INFO] Polling restarted successfully

// ... repeated 10+ times in milliseconds!

[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
// ... 10+ simultaneous successes

[2026-02-19T18:09:42.325Z] [INFO] Starting Telegram client...
[2026-02-19T18:09:42.886Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T18:09:43.230Z] [INFO] Listening for messages...

Root Cause of New Issue

  1. No isReconnecting flag - Multiple polling_error events triggered simultaneous reconnection attempts
  2. No stopPolling() before restart - Each attempt started without cleaning up previous attempt
  3. 409 Conflict not handled as fatal - The code retried immediately instead of exiting
  4. Result: Exponential explosion of reconnection attempts

Key Learnings

  1. Must have isReconnecting guard - Prevents concurrent reconnection loops
  2. Must call stopPolling() first - Clean up before restarting
  3. 409 Conflict is fatal - Don't retry without full cleanup and delay
  4. Must exit after max retries - Let external monitoring detect failure
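
Should Option 1 ever be revisited, the first, second, and fourth learnings combine into a single guard. This is a sketch with injected stop/start callbacks (standing in for `bot.stopPolling()` / `bot.startPolling()` from node-telegram-bot-api); it has not been tested against the real library, and a 409 should route straight to `onGiveUp` (learning 3) rather than through the retry loop:

```typescript
// Serializes reconnection: one attempt loop at a time, cleanup before
// each retry, and a give-up hook after the delay schedule is exhausted.
class ReconnectGuard {
    private isReconnecting = false;

    constructor(
        private stop: () => Promise<void>,     // e.g. () => bot.stopPolling()
        private start: () => Promise<void>,    // e.g. () => bot.startPolling()
        private onGiveUp: () => void = () => process.exit(1),
        private delays: number[] = [1000, 2000, 5000, 10000, 30000],
    ) {}

    async handleError(): Promise<void> {
        if (this.isReconnecting) return;       // learning 1: no concurrent loops
        this.isReconnecting = true;
        try {
            for (const delay of this.delays) {
                await this.stop();             // learning 2: clean up first
                await new Promise((r) => setTimeout(r, delay));
                try {
                    await this.start();
                    return;                    // reconnected
                } catch {
                    // keep going: next, longer delay
                }
            }
            this.onGiveUp();                   // learning 4: exit after max retries
        } finally {
            this.isReconnecting = false;
        }
    }
}
```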

Report generated: 2026-02-19
