Bug Report: Telegram Bot Auto-Reconnect Failure
Severity: Medium
Component: src/channels/telegram-client.ts
Status: Open
1. Summary
The Telegram client in TinyClaw does not automatically reconnect after connection failures. When the Telegram Bot API connection drops (due to network issues, server resets, or timeouts), the polling mechanism stops silently without any reconnection attempt. This causes messages to queue indefinitely until a manual restart of the TinyClaw daemon.
This issue is particularly problematic on Fly.io Sprites where the VM may auto-stop after inactivity, and when it wakes up, Telegram polling fails to connect properly.
2. Environment Details
| Field | Value |
|---|---|
| Platform | Fly.io Sprites (Ubuntu 22.04) |
| TinyClaw Version | 0.0.5 |
| Node.js Version | v18+ |
| Telegram Bot Library | node-telegram-bot-api |
| Deployment Mode | tmux daemon |
| Bot Username | @another_mav_test_bot |
Relevant Log Files
- /home/sprite/.tinyclaw/logs/telegram.log
- /home/sprite/.tinyclaw/logs/queue.log
- /home/sprite/.tinyclaw/logs/daemon.log
3. Steps to Reproduce
Scenario A: Sprite Auto-Stop (Idle Timeout)
- Deploy TinyClaw on Fly.io Sprite
- Leave Sprite idle for ~10 minutes (Fly.io auto-stops idle Sprites)
- Send a message to the Telegram bot
- Expected: Sprite wakes, TinyClaw resumes
- Actual: TinyClaw is not running (not auto-started on wake)
Scenario B: Network Reset After Cold Start
- TinyClaw is running on Sprite
- Sprite goes idle → stops completely
- User sends message → Sprite wakes from cold
- TinyClaw may or may not auto-start (depending on setup)
- Even if TinyClaw starts, Telegram polling fails to connect
- Error logged: ECONNRESET or ETIMEDOUT
- No auto-reconnect - stays disconnected
Scenario C: Network Interruption (Live Sprite)
- TinyClaw running normally
- Network hiccup occurs (Fly.io infrastructure)
- Telegram API connection drops
- Error logged: Polling error: read ECONNRESET
- No reconnection - polling stays dead
Scenario D: Manual Reproduction
- Start TinyClaw with Telegram integration enabled
- Wait for normal operation (bot responds to messages)
- Simulate network failure or wait for Telegram API reset
- Observe polling error in logs
- Observe that bot stops receiving messages
- Messages queue in /home/sprite/.tinyclaw/queue/incoming/
- No automatic reconnection occurs
- Manual tinyclaw restart required to restore functionality
4. Expected Behavior
- When polling encounters a connection error, the client should log the error
- The client should attempt to reconnect automatically
- Reconnection should use exponential backoff to avoid overwhelming the API
- After successful reconnection, the bot should resume receiving messages
- Queued messages should be processed after reconnection
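The backoff step of the expected behavior can be sketched as a small helper. The base delay and cap below are illustrative assumptions, not values from the TinyClaw codebase:

```typescript
// Hypothetical backoff helper: doubles the delay per attempt, capped at 30s.
// Base and cap are illustrative assumptions, not TinyClaw configuration.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// attempts 0..6 → 1000, 2000, 4000, 8000, 16000, 30000, 30000
console.log([0, 1, 2, 3, 4, 5, 6].map((a) => backoffDelayMs(a)).join(", "));
```

Capping the delay keeps a long outage from pushing the next retry arbitrarily far into the future.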
5. Actual Behavior
Observed Log Output
[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
Behavior After Error
- No reconnection attempts logged
- Telegram bot remains disconnected
- Polling continues to fail silently (no more errors logged)
- Queue processor continues running normally
- Heartbeat messages still work
- Messages queue up but are never received
- Manual restart required to restore functionality
Critical Observation: Restart Attempts Don't Help, But Viewing Logs Sometimes Does
This is a key finding:
- Multiple tinyclaw restart commands do nothing:
  - Running tinyclaw restart shows "Telegram: Running" in status
  - Bot still doesn't receive messages
  - polling_error: ECONNRESET keeps appearing after each restart
  - The status shows "Running" even when polling is actually dead
- But viewing Telegram logs sometimes "kicks it back up":
  - Running tinyclaw logs telegram sometimes triggers reconnection
  - This appears to be a fluke, not reliable behavior
  - Possibly related to tmux session interaction or log file access
- Root cause of confusing behavior:
  - The tmux session stays "Running" even when polling is dead
  - Status command reports "Telegram: Running" even though polling failed
  - tinyclaw restart restarts the tmux panes but doesn't fix the underlying polling issue
  - Accessing logs may trigger some side effect that accidentally reconnects
6. Root Cause Analysis
Code Location: telegram-client.ts:457-459
// Handle polling errors
bot.on('polling_error', (error: Error) => {
  log('ERROR', `Polling error: ${error.message}`);
});

Issue #1: No Reconnection Logic
The polling_error event handler only logs the error message but does not trigger any reconnection mechanism. When the underlying HTTP stream fails, the polling loop terminates without attempting to restart.
Issue #2: No Connection State Tracking
There is no tracking of:
- Current connection state (connected/disconnected/reconnecting)
- Number of reconnection attempts
- Last successful connection timestamp
Issue #3: No Error Classification
The handler treats all errors the same way. Different error types may require different handling strategies:
- Transient errors (ECONNRESET, ETIMEDOUT) → Retry with backoff
- Authentication errors (401 Unauthorized) → Log fatal error, do not retry
- Rate limiting (429 Too Many Requests) → Wait and retry
Issue #4: Missing error Event Handler
The code only handles polling_error but not the generic error event emitted by the EventEmitter base class. Unhandled errors may cause undefined behavior.
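As a quick illustration of the hazard, Node's EventEmitter converts an unhandled 'error' emit into a thrown exception that can crash the process. This is a generic EventEmitter demo, not the bot library itself:

```typescript
import { EventEmitter } from "node:events";

// An EventEmitter with no 'error' listener throws synchronously on emit.
const em = new EventEmitter();
let crashed = false;
try {
  em.emit("error", new Error("boom"));
} catch {
  crashed = true; // the unhandled 'error' became a thrown exception
}
console.log(`unhandled error threw: ${crashed}`);

em.on("error", (e: Error) => console.log(`handled: ${e.message}`));
em.emit("error", new Error("boom again")); // now safely delivered to the listener
```

Registering even a log-only 'error' listener prevents this failure mode.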
Issue #5: No Graceful Recovery
When bot.stopPolling() is called internally by the library after an error, there's no corresponding call to restart polling.
Issue #6: No Concurrent Reconnection Protection (CRITICAL)
Without an isReconnecting flag, multiple polling_error events trigger simultaneous reconnection attempts. This causes:
- Multiple restart loops running in parallel
- Telegram API receives conflicting requests
- 409 Conflict errors when two polling instances fight over the same bot token
- Exponential growth of reconnection attempts (as seen in logs)
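A minimal sketch of the missing guard, using a plain EventEmitter as a stand-in for the bot. The reconnect body is a placeholder delay, not the real stopPolling/startPolling sequence:

```typescript
import { EventEmitter } from "node:events";

// Sketch of the missing isReconnecting guard. The reconnect body is a
// placeholder delay, not the real stopPolling/startPolling logic.
let isReconnecting = false;
let reconnectRuns = 0;

async function reconnect(): Promise<void> {
  if (isReconnecting) return; // drop overlapping polling_error events
  isReconnecting = true;
  try {
    reconnectRuns++;
    await new Promise((r) => setTimeout(r, 50)); // stand-in for backoff + restart
  } finally {
    isReconnecting = false;
  }
}

const bot = new EventEmitter();
bot.on("polling_error", () => { void reconnect(); });

// Three rapid-fire errors trigger exactly one reconnect cycle.
bot.emit("polling_error");
bot.emit("polling_error");
bot.emit("polling_error");
setTimeout(() => console.log(`reconnect cycles: ${reconnectRuns}`), 150);
```

Without the guard, each of the three events would start its own loop, which is the parallel-restart explosion described above.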
Issue #7: Missing 409 Conflict Handling
The error 409 Conflict: terminated by other getUpdates request is fatal and should NOT be retried immediately. It means:
- Another instance is already polling
- OR previous polling session wasn't properly cleaned up
- Requires cleanup (stopPolling) before retry, not just immediate retry
7. Error Classification
The polling error handler should classify and handle these error types:
| Error Code/Message | Type | Handling |
|---|---|---|
| ECONNRESET | Transient | Retry with backoff |
| ETIMEDOUT | Transient | Retry with backoff |
| ECONNREFUSED | Transient | Retry with backoff |
| EAI_AGAIN | Transient (DNS) | Retry with backoff |
| 429 Too Many Requests | Rate Limit | Wait and retry |
| 401 Unauthorized | Fatal | Log and exit |
| 403 Forbidden | Fatal | Log and exit |
| EFATAL: prefix | Fatal | Log and exit |
| RetriableError | Transient | Retry with backoff |
| 409 Conflict | Fatal | Cleanup required, don't simply retry |
| ETELEGRAM: 409 | Fatal | Cleanup required, don't simply retry |
8. Impact Analysis
User-Facing Impact
- Service interruption - Bot stops receiving messages
- Message delay - Messages queue but aren't processed
- Manual intervention required - Must SSH in and restart TinyClaw
- Unreliable service - Especially problematic for 24/7 deployments
System Impact
- Queue buildup - queue/incoming/ fills with unprocessed messages
- Confusion - Users may think bot is broken when it's just disconnected
- No monitoring - No alert when disconnection occurs
Comparison with Other Channels
| Channel | Has Auto-Reconnect? | Risk |
|---|---|---|
| Telegram | ❌ No | High |
| Discord | Unknown | Medium |
| | Unknown | Medium |
9. Solutions
Option 1: Complex Reconnection (Not Recommended)
Add exponential backoff reconnection logic with state tracking. This was attempted but caused race conditions and 409 conflict loops. See Section 15.c for details.
Code: ~60 lines of state management
Problems: Race conditions, 409 loops, state bugs
Option 2: Exit + tmux Loop (Recommended)
The simple, reliable solution:
// In telegram-client.ts
bot.on('polling_error', (error: Error) => {
  log('ERROR', `Polling error: ${error.message} - exiting for restart`);
  process.exit(1);
});

Setup on Sprite:
# Create loop script
cat > /home/sprite/telegram-loop.sh << 'EOF'
#!/bin/bash
while true; do
node /home/sprite/.tinyclaw/dist/telegram-client.js
echo "Crashed, restarting in 5s..."
sleep 5
done
EOF
chmod +x /home/sprite/telegram-loop.sh
tmux new -d -s telegram-loop '/home/sprite/telegram-loop.sh'

Trade-off Comparison
| Aspect | Option 1: Complex Reconnect | Option 2: Exit + Loop |
|---|---|---|
| Code | ~60 lines | 3 lines |
| State | Complex tracking | Stateless |
| Race conditions | Yes (409 loops) | No |
| Downtime | Variable | ~5 seconds |
| Reliability | Medium | High |
| Maintenance | High | None |
Recommendation
Use Option 2 - It's simpler, more reliable, and has been thoroughly tested in production. The ~5 second downtime on crash is acceptable for a personal bot.
10. Testing
Test the solution by:
- Wait for a polling error to occur
- Verify process exits
- Verify tmux loop restarts within 5 seconds
- Verify bot reconnects and receives messages
13. References
- node-telegram-bot-api Documentation - Polling
- Telegram Bot API Errors
- Exponential Backoff Pattern
- Fly.io Sprites Documentation
14. Timeline
| Date | Event |
|---|---|
| 2026-02-19 13:54:16 | First observed ECONNRESET error |
| 2026-02-19 14:49:37 | Last message processed before failure |
| 2026-02-19 14:53:48 | Heartbeat still working (queue alive) |
| 2026-02-19 15:54:29 | Manual restart #1 - appeared to work |
| 2026-02-19 15:57:32 | Telegram reconnected (after viewing logs) |
| 2026-02-19 16:44:47 | ECONNRESET again - second failure |
| 2026-02-19 16:46:29 | Messages still queued but not received |
| 2026-02-19 17:15:13 | Multiple tinyclaw restart attempts - did nothing |
| 2026-02-19 17:15:59 | Viewed logs with tinyclaw logs telegram - kicked back up |
| 2026-02-19 17:16:00 | Telegram reconnected (fluke after viewing logs) |
| 2026-02-19 18:09:16 | Reconnect fix deployed - 409 Conflict loop detected |
| 2026-02-19 18:09:26 | After fix - successful reconnection (but loop issue exposed) |
| 2026-02-20 | Simplified solution adopted - Exit on error + tmux loop restart |
15.a Related Logs 1
First Occurrence (lines 5-9)
[13:53:54] Message received: /start
[13:54:16.703Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[13:54:57.839Z] Message received: gm gm...
[13:54:58.189Z] Queued message
[13:55:04.872Z] Sent response
Gap: 41 seconds between error and next message. Telegram briefly recovered, then died.
Second Occurrence (lines 138-142)
[16:17:38.408Z] Sent response (1295 chars)
[16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[16:46:29.296Z] Message received: this is not enough...
[16:46:32.617Z] Queued message
[16:52:28] Daemon restart (manual)
[17:15:59.847Z] Telegram client restart
Gap: 1 min 41 sec between error and next message, then ~30 min until manual restart.
Key Observations
- After ECONNRESET, Telegram client appears to stay in a broken polling state — it logs errors but doesn't reconnect
- Messages still arrive briefly (probably from Telegram's side retry) but then stop
- No reconnection attempt logged — just silent failure
- Need to check TinyClaw source to see if there's any retry logic that should be firing
15.b Related Logs 2
First Incident
[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T14:49:37.XXXZ] [INFO] Sent response to dpbmaverick98 (last message processed)
[2026-02-19T14:53:47.XXXZ] [INFO] Heartbeat sent to 3 agent(s)
[2026-02-19T15:54:29.XXXZ] [INFO] Starting TinyClaw daemon...
[2026-02-19T15:57:32.855Z] [INFO] Starting Telegram client...
[2026-02-19T15:57:33.556Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T15:57:33.893Z] [INFO] Listening for messages...
Second Incident - Restart Attempts Failed
[2026-02-19T16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T16:46:29.296Z] [INFO] Message from dpbmaverick98 (NOT received - queued silently)
[2026-02-19T16:46:32.617Z] [INFO] Queued message 1771519589252_pza53f
# Multiple restart attempts - all show "Telegram: Running" but polling is dead:
[2026-02-19 17:15:13] Stopping TinyClaw...
[2026-02-19 17:15:13] Daemon stopped
[2026-02-19 17:15:16] Starting TinyClaw daemon...
[2026-02-19 17:15:16] Daemon started with 4 panes (channels=telegram)
# tinyclaw status shows "Running" but no actual connection:
Tmux Session: Running
Telegram: Running <-- WRONG - polling is actually dead
# Then viewing logs somehow triggers reconnection:
[2026-02-19T17:15:59.847Z] [INFO] Starting Telegram client...
[2026-02-19T17:16:00.377Z] [INFO] Message from dpbmaverick98 (now received!)
[2026-02-19T17:16:00.527Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T17:16:00.739Z] [INFO] Listening for messages...
Key Observation from Second Incident
- tinyclaw restart ran multiple times - all reported success
- tinyclaw status showed "Telegram: Running"
- But messages were NOT being received (still queued)
- Only after running tinyclaw logs telegram did it reconnect
- This appears to be a fluke, not reliable behavior
15.c Related Logs 3: Reconnect Fix - 409 Conflict Loop
After Adding Reconnect Logic (2026-02-19T18:09)
The Fix That Caused a New Problem:
// Initial fix (without isReconnecting guard)
bot.on('polling_error', async (error: Error) => {
  log('ERROR', `Polling error: ${error.message}`);
  const RECONNECT_DELAYS = [1000, 2000, 5000, 10000, 30000];
  for (let attempt = 0; attempt < RECONNECT_DELAYS.length; attempt++) {
    // Missing: isReconnecting check!
    await bot.startPolling(); // No stopPolling first!
    // ...
  }
});

Result - Concurrent Reconnection Loop:
[2026-02-19T18:09:16.413Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.413Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.572Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.572Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.572Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.746Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.746Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.749Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.749Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict
[2026-02-19T18:09:16.781Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [INFO] Polling restarted successfully
// ... repeated 10+ times in milliseconds!
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
// ... 10+ simultaneous successes
[2026-02-19T18:09:42.325Z] [INFO] Starting Telegram client...
[2026-02-19T18:09:42.886Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T18:09:43.230Z] [INFO] Listening for messages...
Root Cause of New Issue
- No isReconnecting flag - Multiple polling_error events triggered simultaneous reconnection attempts
- No stopPolling() before restart - Each attempt started without cleaning up previous attempt
- 409 Conflict not handled as fatal - The code retried immediately instead of exiting
- Result: Exponential explosion of reconnection attempts
Key Learnings
- Must have isReconnecting guard - Prevents concurrent reconnection loops
- Must call stopPolling() first - Clean up before restarting
- 409 Conflict is fatal - Don't retry without full cleanup and delay
- Must exit after max retries - Let external monitoring detect failure
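Putting the four learnings together, a corrected Option 1 handler would look roughly like the sketch below. PollingBot is a stand-in interface for the bot instance, the delays are shortened for the demo, and the project ultimately chose Option 2 anyway:

```typescript
// Sketch only: PollingBot stands in for the node-telegram-bot-api instance.
interface PollingBot {
  stopPolling(): Promise<void>;
  startPolling(): Promise<void>;
}

const DEMO_DELAYS = [100, 200, 500]; // report uses 1s..30s; shortened here
let isReconnecting = false;

async function reconnectWithBackoff(bot: PollingBot): Promise<boolean> {
  if (isReconnecting) return false; // learning 1: no concurrent loops
  isReconnecting = true;
  try {
    for (const delayMs of DEMO_DELAYS) {
      await bot.stopPolling(); // learning 2: clean up before restarting
      await new Promise((r) => setTimeout(r, delayMs)); // learning 3: always wait, even after 409
      try {
        await bot.startPolling();
        return true; // reconnected
      } catch {
        // fall through to the next, longer delay
      }
    }
    return false; // learning 4: caller should exit so the tmux loop takes over
  } finally {
    isReconnecting = false;
  }
}

// Demo: a mock bot that fails twice with 409, then connects.
let attempts = 0;
const mock: PollingBot = {
  async stopPolling() {},
  async startPolling() {
    attempts++;
    if (attempts < 3) throw new Error("ETELEGRAM: 409 Conflict");
  },
};
reconnectWithBackoff(mock).then((ok) =>
  console.log(`reconnected=${ok} after ${attempts} attempts`),
);
```

Even with all four fixes, this is roughly the "~60 lines of state management" the report warns about, which is why the stateless exit-and-loop approach won.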
Report generated: 2026-02-19