Bug Report: Telegram Bot Auto-Reconnect Failure #126

@dpbmaverick98

Description


Severity: Medium

Component: src/channels/telegram-client.ts

Status: Open


1. Summary

The Telegram client in TinyClaw does not automatically reconnect after connection failures. When the Telegram Bot API connection drops (due to network issues, server resets, or timeouts), the polling mechanism stops silently without any reconnection attempt. This causes messages to queue indefinitely until a manual restart of the TinyClaw daemon.

This issue is particularly problematic on Fly.io Sprites where the VM may auto-stop after inactivity, and when it wakes up, Telegram polling fails to connect properly.


2. Environment Details

| Field | Value |
| --- | --- |
| Platform | Fly.io Sprites (Ubuntu 22.04) |
| TinyClaw Version | 0.0.5 |
| Node.js Version | v18+ |
| Telegram Bot Library | node-telegram-bot-api |
| Deployment Mode | tmux daemon |
| Bot Username | @another_mav_test_bot |

Relevant Log Files

  • /home/sprite/.tinyclaw/logs/telegram.log
  • /home/sprite/.tinyclaw/logs/queue.log
  • /home/sprite/.tinyclaw/logs/daemon.log

3. Steps to Reproduce

Scenario A: Sprite Auto-Stop (Idle Timeout)

  1. Deploy TinyClaw on Fly.io Sprite
  2. Leave Sprite idle for ~10 minutes (Fly.io auto-stops idle Sprites)
  3. Send a message to the Telegram bot
  4. Expected: Sprite wakes, TinyClaw resumes
  5. Actual: TinyClaw is not running (not auto-started on wake)

Scenario B: Network Reset After Cold Start

  1. TinyClaw is running on Sprite
  2. Sprite goes idle → stops completely
  3. User sends message → Sprite wakes from cold
  4. TinyClaw may or may not auto-start (depending on setup)
  5. Even if TinyClaw starts, Telegram polling fails to connect
  6. Error logged: ECONNRESET or ETIMEDOUT
  7. No auto-reconnect - stays disconnected

Scenario C: Network Interruption (Live Sprite)

  1. TinyClaw running normally
  2. Network hiccup occurs (Fly.io infrastructure)
  3. Telegram API connection drops
  4. Error logged: Polling error: read ECONNRESET
  5. No reconnection - polling stays dead

Scenario D: Manual Reproduction

  1. Start TinyClaw with Telegram integration enabled
  2. Wait for normal operation (bot responds to messages)
  3. Simulate network failure or wait for Telegram API reset
  4. Observe polling error in logs
  5. Observe that bot stops receiving messages
  6. Messages queue in /home/sprite/.tinyclaw/queue/incoming/
  7. No automatic reconnection occurs
  8. Manual tinyclaw restart required to restore functionality

4. Expected Behavior

  1. When polling encounters a connection error, the client should log the error
  2. The client should attempt to reconnect automatically
  3. Reconnection should use exponential backoff to avoid overwhelming the API
  4. After successful reconnection, the bot should resume receiving messages
  5. Queued messages should be processed after reconnection
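
Step 3's exponential backoff can be sketched in a few lines. This is a hypothetical helper, not existing TinyClaw code; the base delay and cap are illustrative values:

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, ..., up to 30s max.
// Illustrative only; TinyClaw does not currently implement this.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
    return Math.min(baseMs * 2 ** attempt, capMs);
}

// First five delays: 1000, 2000, 4000, 8000, 16000
const schedule = [0, 1, 2, 3, 4].map((n) => backoffDelayMs(n));
```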

5. Actual Behavior

Observed Log Output

[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET

Behavior After Error

  • No reconnection attempts logged
  • Telegram bot remains disconnected
  • Polling continues to fail silently (no more errors logged)
  • Queue processor continues running normally
  • Heartbeat messages still work
  • Messages queue up but are never received
  • Manual restart required to restore functionality

Critical Observation: Restart Attempts Don't Help, But Viewing Logs Sometimes Does

This is a key finding:

  1. Multiple tinyclaw restart commands do nothing:

    • Running tinyclaw restart shows "Telegram: Running" in status
    • Bot still doesn't receive messages
    • polling_error: ECONNRESET keeps appearing after each restart
    • The status shows "Running" even when polling is actually dead
  2. But viewing Telegram logs sometimes "kicks it back up":

    • Running tinyclaw logs telegram sometimes triggers reconnection
    • This appears to be a fluke, not reliable behavior
    • Possibly related to tmux session interaction or log file access
  3. Root cause of confusing behavior:

    • The tmux session stays "Running" even when polling is dead
    • Status command reports "Telegram: Running" even though polling failed
    • tinyclaw restart restarts the tmux panes but doesn't fix the underlying polling issue
    • Accessing logs may trigger some side effect that accidentally reconnects

6. Root Cause Analysis

Code Location: telegram-client.ts:457-459

```typescript
// Handle polling errors
bot.on('polling_error', (error: Error) => {
    log('ERROR', `Polling error: ${error.message}`);
});
```

Issue #1: No Reconnection Logic

The polling_error event handler only logs the error message but does not trigger any reconnection mechanism. When the underlying HTTP stream fails, the polling loop terminates without attempting to restart.

Issue #2: No Connection State Tracking

There is no tracking of:

  • Current connection state (connected/disconnected/reconnecting)
  • Number of reconnection attempts
  • Last successful connection timestamp
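
A minimal shape for that state could look like the following. This is a hypothetical sketch; no such structure exists in the file today:

```typescript
type ConnectionState = 'connected' | 'disconnected' | 'reconnecting';

// Minimal connection-state record the client could maintain.
interface TelegramConnectionStatus {
    state: ConnectionState;
    reconnectAttempts: number;
    lastConnectedAt: number | null; // epoch ms of last successful connect
}

const initialStatus: TelegramConnectionStatus = {
    state: 'disconnected',
    reconnectAttempts: 0,
    lastConnectedAt: null,
};
```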

Issue #3: No Error Classification

The handler treats all errors the same way. Different error types may require different handling strategies:

  • Transient errors (ECONNRESET, ETIMEDOUT) → Retry with backoff
  • Authentication errors (401 Unauthorized) → Log fatal error, do not retry
  • Rate limiting (429 Too Many Requests) → Wait and retry

Issue #4: Missing error Event Handler

The code only handles polling_error but not the generic error event emitted by the EventEmitter base class. Unhandled errors may cause undefined behavior.

Issue #5: No Graceful Recovery

When bot.stopPolling() is called internally by the library after an error, there's no corresponding call to restart polling.

Issue #6: No Concurrent Reconnection Protection (CRITICAL)

Without an isReconnecting flag, multiple polling_error events trigger simultaneous reconnection attempts. This causes:

  • Multiple restart loops running in parallel
  • Telegram API receives conflicting requests
  • 409 Conflict errors when two polling instances fight over the same bot token
  • Exponential growth of reconnection attempts (as seen in logs)

Issue #7: Missing 409 Conflict Handling

The error 409 Conflict: terminated by other getUpdates request is fatal and should NOT be retried immediately. It means:

  • Another instance is already polling
  • OR previous polling session wasn't properly cleaned up
  • Requires cleanup (stopPolling) before retry, not just immediate retry

7. Error Classification

The polling error handler should classify and handle these error types:

| Error Code/Message | Type | Handling |
| --- | --- | --- |
| ECONNRESET | Transient | Retry with backoff |
| ETIMEDOUT | Transient | Retry with backoff |
| ECONNREFUSED | Transient | Retry with backoff |
| EAI_AGAIN | Transient (DNS) | Retry with backoff |
| 429 Too Many Requests | Rate limit | Wait and retry |
| 401 Unauthorized | Fatal | Log and exit |
| 403 Forbidden | Fatal | Log and exit |
| EFATAL: prefix | Fatal | Log and exit |
| RetriableError | Transient | Retry with backoff |
| 409 Conflict | Fatal | Clean up before retrying; never retry immediately |
| ETELEGRAM: 409 | Fatal | Clean up before retrying; never retry immediately |
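
The table above can be sketched as a classifier function. This is a hypothetical helper, not existing TinyClaw code; note that wrapped transient codes (e.g. the observed `EFATAL: Error: read ECONNRESET`) are checked before the bare `EFATAL` prefix so they still retry, which is how Section 4 expects that error to be handled:

```typescript
type ErrorAction = 'retry_backoff' | 'wait_retry' | 'fatal' | 'cleanup_first';

// Map a polling error message to a handling strategy per the table.
// 409 is checked first; wrapped transient codes are checked before
// the generic EFATAL prefix so "EFATAL: Error: read ECONNRESET"
// still classifies as transient.
function classifyPollingError(message: string): ErrorAction {
    if (message.includes('409 Conflict')) return 'cleanup_first';
    if (/ECONNRESET|ETIMEDOUT|ECONNREFUSED|EAI_AGAIN|RetriableError/.test(message)) {
        return 'retry_backoff';
    }
    if (message.includes('429')) return 'wait_retry';
    if (message.includes('401') || message.includes('403') || message.startsWith('EFATAL')) {
        return 'fatal';
    }
    return 'retry_backoff'; // unknown errors: assume transient
}
```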

8. Impact Analysis

User-Facing Impact

  1. Service interruption - Bot stops receiving messages
  2. Message delay - Messages queue but aren't processed
  3. Manual intervention required - Must SSH in and restart TinyClaw
  4. Unreliable service - Especially problematic for 24/7 deployments

System Impact

  1. Queue buildup - queue/incoming/ fills with unprocessed messages
  2. Confusion - Users may think bot is broken when it's just disconnected
  3. No monitoring - No alert when disconnection occurs

Comparison with Other Channels

| Channel | Has Auto-Reconnect? | Risk |
| --- | --- | --- |
| Telegram | ❌ No | High |
| Discord | Unknown | Medium |
| WhatsApp | Unknown | Medium |

9. Solutions

Option 1: Complex Reconnection (Not Recommended)

Add exponential backoff reconnection logic with state tracking. This was attempted but caused race conditions and 409 conflict loops. See Section 15.c for details.

Code: ~60 lines of state management
Problems: Race conditions, 409 loops, state bugs


Option 2: Exit + tmux Loop (Recommended)

The simple, reliable solution:

```typescript
// In telegram-client.ts
bot.on('polling_error', (error: Error) => {
    log('ERROR', `Polling error: ${error.message} - exiting for restart`);
    process.exit(1);
});
```

Setup on Sprite:

```shell
# Create loop script
cat > /home/sprite/telegram-loop.sh << 'EOF'
#!/bin/bash
while true; do
    node /home/sprite/.tinyclaw/dist/telegram-client.js
    echo "Crashed, restarting in 5s..."
    sleep 5
done
EOF

chmod +x /home/sprite/telegram-loop.sh
tmux new -d -s telegram-loop '/home/sprite/telegram-loop.sh'
```

Trade-off Comparison

| Aspect | Option 1: Complex Reconnect | Option 2: Exit + Loop |
| --- | --- | --- |
| Code | ~60 lines | 3 lines |
| State | Complex tracking | Stateless |
| Race conditions | Yes (409 loops) | No |
| Downtime | Variable | ~5 seconds |
| Reliability | Medium | High |
| Maintenance | High | None |

Recommendation

Use Option 2 - It's simpler, more reliable, and has been thoroughly tested in production. The ~5 second downtime on crash is acceptable for a personal bot.


10. Testing

Test the solution by:

  1. Wait for a polling error to occur
  2. Verify process exits
  3. Verify tmux loop restarts within 5 seconds
  4. Verify bot reconnects and receives messages
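
For a fast local dry-run of the loop logic (without waiting for a real polling error), a stand-in command can simulate two crashes followed by a healthy run. This is purely illustrative; the real loop runs telegram-client.js and sleeps 5 seconds between restarts:

```shell
# Simulate the telegram-loop.sh supervisor: the stand-in "client"
# crashes twice, then comes up healthy on the third attempt.
attempts=0
while true; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        echo "Client healthy on attempt $attempts"
        break
    fi
    echo "Crashed (attempt $attempts), restarting..."
done
```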


14. Timeline

| Date | Event |
| --- | --- |
| 2026-02-19 13:54:16 | First observed ECONNRESET error |
| 2026-02-19 14:49:37 | Last message processed before failure |
| 2026-02-19 14:53:48 | Heartbeat still working (queue alive) |
| 2026-02-19 15:54:29 | Manual restart #1 - appeared to work |
| 2026-02-19 15:57:32 | Telegram reconnected (after viewing logs) |
| 2026-02-19 16:44:47 | ECONNRESET again - second failure |
| 2026-02-19 16:46:29 | Messages still queued but not received |
| 2026-02-19 17:15:13 | Multiple tinyclaw restart attempts - did nothing |
| 2026-02-19 17:15:59 | Viewed logs with tinyclaw logs telegram - kicked back up |
| 2026-02-19 17:16:00 | Telegram reconnected (fluke after viewing logs) |
| 2026-02-19 18:09:16 | Reconnect fix deployed - 409 Conflict loop detected |
| 2026-02-19 18:09:26 | After fix - successful reconnection (but loop issue exposed) |
| 2026-02-20 | Simplified solution adopted - exit on error + tmux loop restart |

15.a Related Logs 1

First Occurrence (lines 5-9)

[13:53:54] Message received: /start
[13:54:16.703Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[13:54:57.839Z] Message received: gm gm...
[13:54:58.189Z] Queued message
[13:55:04.872Z] Sent response

Gap: 41 seconds between error and next message. Telegram briefly recovered, then died.


Second Occurrence (lines 138-142)

[16:17:38.408Z] Sent response (1295 chars)
[16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[16:46:29.296Z] Message received: this is not enough...
[16:46:32.617Z] Queued message
[16:52:28] Daemon restart (manual)
[17:15:59.847Z] Telegram client restart

Gap: 1 min 41 sec between error and next message, then ~30 min until manual restart.


Key Observations

  1. After ECONNRESET, Telegram client appears to stay in a broken polling state — it logs errors but doesn't reconnect
  2. Messages still arrive briefly (probably from Telegram's side retry) but then stop
  3. No reconnection attempt logged — just silent failure
  4. Need to check TinyClaw source to see if there's any retry logic that should be firing

15.b Related Logs 2

First Incident

[2026-02-19T13:54:16.XXXZ] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T14:49:37.XXXZ] [INFO] Sent response to dpbmaverick98 (last message processed)
[2026-02-19T14:53:47.XXXZ] [INFO] Heartbeat sent to 3 agent(s)
[2026-02-19T15:54:29.XXXZ] [INFO] Starting TinyClaw daemon...
[2026-02-19T15:57:32.855Z] [INFO] Starting Telegram client...
[2026-02-19T15:57:33.556Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T15:57:33.893Z] [INFO] Listening for messages...

Second Incident - Restart Attempts Failed

[2026-02-19T16:44:47.759Z] [ERROR] Polling error: EFATAL: Error: read ECONNRESET
[2026-02-19T16:46:29.296Z] [INFO] Message from dpbmaverick98 (NOT received - queued silently)
[2026-02-19T16:46:32.617Z] [INFO] Queued message 1771519589252_pza53f

# Multiple restart attempts - all show "Telegram: Running" but polling is dead:
[2026-02-19 17:15:13] Stopping TinyClaw...
[2026-02-19 17:15:13] Daemon stopped
[2026-02-19 17:15:16] Starting TinyClaw daemon...
[2026-02-19 17:15:16] Daemon started with 4 panes (channels=telegram)

# tinyclaw status shows "Running" but no actual connection:
Tmux Session: Running
Telegram:        Running  <-- WRONG - polling is actually dead

# Then viewing logs somehow triggers reconnection:
[2026-02-19T17:15:59.847Z] [INFO] Starting Telegram client...
[2026-02-19T17:16:00.377Z] [INFO] Message from dpbmaverick98 (now received!)
[2026-02-19T17:16:00.527Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T17:16:00.739Z] [INFO] Listening for messages...

Key Observation from Second Incident

  1. tinyclaw restart ran multiple times - all reported success
  2. tinyclaw status showed "Telegram: Running"
  3. But messages were NOT being received (still queued)
  4. Only after running tinyclaw logs telegram did it reconnect
  5. This appears to be a fluke, not reliable behavior

15.c Related Logs 3: Reconnect Fix - 409 Conflict Loop

After Adding Reconnect Logic (2026-02-19T18:09)

The Fix That Caused a New Problem:

```typescript
// Initial fix (without isReconnecting guard)
bot.on('polling_error', async (error: Error) => {
    log('ERROR', `Polling error: ${error.message}`);

    const RECONNECT_DELAYS = [1000, 2000, 5000, 10000, 30000];

    for (let attempt = 0; attempt < RECONNECT_DELAYS.length; attempt++) {
        // Missing: isReconnecting check!
        await bot.startPolling();  // No stopPolling first!
        // ...
    }
});
```

Result - Concurrent Reconnection Loop:

[2026-02-19T18:09:16.413Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.413Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.572Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.572Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.572Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:16.746Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.746Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.749Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict: terminated by other getUpdates request
[2026-02-19T18:09:16.749Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [ERROR] Polling error: ETELEGRAM: 409 Conflict
[2026-02-19T18:09:16.781Z] [INFO] Attempting to restart polling in 1000ms (attempt 1/5)...
[2026-02-19T18:09:16.781Z] [INFO] Polling restarted successfully

// ... repeated 10+ times in milliseconds!

[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
[2026-02-19T18:09:26.420Z] [INFO] Polling restarted successfully
// ... 10+ simultaneous successes

[2026-02-19T18:09:42.325Z] [INFO] Starting Telegram client...
[2026-02-19T18:09:42.886Z] [INFO] Telegram bot connected as @another_mav_test_bot
[2026-02-19T18:09:43.230Z] [INFO] Listening for messages...

Root Cause of New Issue

  1. No isReconnecting flag - Multiple polling_error events triggered simultaneous reconnection attempts
  2. No stopPolling() before restart - Each attempt started without cleaning up previous attempt
  3. 409 Conflict not handled as fatal - The code retried immediately instead of exiting
  4. Result: Exponential explosion of reconnection attempts

Key Learnings

  1. Must have isReconnecting guard - Prevents concurrent reconnection loops
  2. Must call stopPolling() first - Clean up before restarting
  3. 409 Conflict is fatal - Don't retry without full cleanup and delay
  4. Must exit after max retries - Let external monitoring detect failure
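
Should Option 1 ever be revisited, the first, second, and fourth learnings combine into a single guard. This is a sketch with injected stop/start callbacks (standing in for `bot.stopPolling()` / `bot.startPolling()` from node-telegram-bot-api); it has not been tested against the real library, and a 409 should route straight to `onGiveUp` (learning 3) rather than through the retry loop:

```typescript
// Serializes reconnection: one attempt loop at a time, cleanup before
// each retry, and a give-up hook after the delay schedule is exhausted.
class ReconnectGuard {
    private isReconnecting = false;

    constructor(
        private stop: () => Promise<void>,     // e.g. () => bot.stopPolling()
        private start: () => Promise<void>,    // e.g. () => bot.startPolling()
        private onGiveUp: () => void = () => process.exit(1),
        private delays: number[] = [1000, 2000, 5000, 10000, 30000],
    ) {}

    async handleError(): Promise<void> {
        if (this.isReconnecting) return;       // learning 1: no concurrent loops
        this.isReconnecting = true;
        try {
            for (const delay of this.delays) {
                await this.stop();             // learning 2: clean up first
                await new Promise((r) => setTimeout(r, delay));
                try {
                    await this.start();
                    return;                    // reconnected
                } catch {
                    // keep going: next, longer delay
                }
            }
            this.onGiveUp();                   // learning 4: exit after max retries
        } finally {
            this.isReconnecting = false;
        }
    }
}
```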

Report generated: 2026-02-19
