fix: WebSocket reconnect stability — eliminate kick storms and race conditions#140

Merged
yujiawei merged 5 commits into dmwork-org:main from
Jerry-Xin:fix/ws-reconnect-v2
Mar 31, 2026
@Jerry-Xin
Collaborator

Problem

Multiple WebSocket reconnect bugs caused persistent kick storms (~950 kicked/hour across 15 bots) when connecting to WuKongIM. The root cause was a cascade of race conditions between the transport layer (socket.ts) and the application layer (channel.ts).

Fixes #135, #139

Root Cause Analysis

9 confirmed issues found via Codex + Claude Code cross-review:

Critical (direct cause of kick loops)

  1. Stale socket events: open/message/error handlers lacked an if (this.ws !== ws) return guard, allowing DISCONNECT packets from an old WS to corrupt the new connection's state
  2. Zombie WebSocket connections: CONNACK rejection and DISCONNECT handlers didn't call ws.close(), leaving TCP connections alive
  3. Heartbeat timer not stopped on disconnect: the HTTP heartbeat timer kept running after WS disconnect, triggering competing reconnections with stale credentials
  4. Ping timeout dual reconnect: ping timeout called onDisconnected + scheduleReconnect() but not onError, so the channel layer didn't know to stop the heartbeat timer
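
For reviewers who want the stale-guard idea from items 1 and 2 in isolation, here is a minimal sketch. It is not the code in socket.ts: `FakeSocket` and `GuardedClient` are hypothetical stand-ins invented for illustration, and the real handlers deal with binary WuKongIM packets rather than strings.

```typescript
// Hypothetical stand-in for a WebSocket; only the surface this sketch needs.
type Handler = (data?: string) => void;

class FakeSocket {
  closed = false;
  private handlers = new Map<string, Handler[]>();

  on(event: string, fn: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn);
    this.handlers.set(event, list);
  }

  emit(event: string, data?: string): void {
    for (const fn of this.handlers.get(event) ?? []) fn(data);
  }

  close(): void {
    this.closed = true;
  }
}

// Hypothetical client: every handler captures the socket it was bound to
// and bails out if that socket is no longer the current one.
class GuardedClient {
  ws: FakeSocket | null = null;
  connected = false;
  received: string[] = [];

  connect(ws: FakeSocket): void {
    this.ws = ws;

    ws.on("open", () => {
      if (this.ws !== ws) return; // stale socket, ignore
      this.connected = true;
    });

    ws.on("message", (data) => {
      if (this.ws !== ws) return; // stale socket, ignore
      if (data === "DISCONNECT") {
        // Close explicitly so the connection does not linger as a zombie.
        ws.close();
        this.ws = null;
        this.connected = false;
        return;
      }
      this.received.push(data ?? "");
    });

    ws.on("error", () => {
      if (this.ws !== ws) return; // stale socket, ignore
      this.connected = false;
    });
  }
}
```

With the guard in place, a late DISCONNECT from a replaced socket leaves the new connection's state untouched, while a DISCONNECT on the current socket closes it and clears `this.ws`.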

Race condition fixes

  1. Synchronous disconnect+connect: ws.close() is async; the new WS opened before the old TCP FIN completed, so the server briefly saw 2 connections
  2. Token refresh before disconnect: registerBot(forceRefresh) was called while the old WS was still alive during the await
  3. Heartbeat failure zero backoff: immediate disconnect()+connect() bypassed exponential backoff
  4. Heartbeat failure counter not reset: consecutiveHeartbeatFailures was preserved across reconnects
  5. Dual reconnect ownership: both socket.ts and channel.ts independently drove reconnects
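
The first race above (opening a new WS before the old one has finished closing) is the one a wait-for-close helper addresses. The following is a rough sketch of that pattern only. The `ClosableSocket` interface, the `CloseOutcome` type, and the free-function shape are assumptions made for illustration; only the name `disconnectAndWait` and the 2000 ms default come from this PR.

```typescript
// Hypothetical minimal socket surface, not a real library's API.
interface ClosableSocket {
  close(): void; // graceful close; the close event arrives later
  terminate(): void; // hard close used when the graceful close stalls
  onClose(fn: () => void): void;
}

type CloseOutcome = "closed" | "terminated" | "idle";

// Resolve once the socket reports it is closed, or force-terminate it
// after timeoutMs so the caller never waits forever.
function disconnectAndWait(
  ws: ClosableSocket | null,
  timeoutMs = 2000,
): Promise<CloseOutcome> {
  if (ws === null) return Promise.resolve<CloseOutcome>("idle");

  return new Promise<CloseOutcome>((resolve) => {
    let settled = false;

    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      ws.terminate();
      resolve("terminated");
    }, timeoutMs);

    ws.onClose(() => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve("closed");
    });

    ws.close();
  });
}
```

A caller would `await disconnectAndWait(ws)` and only then refresh the token or call `connect()`, so the server never sees two live connections for the same UID from this client.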

Changes

socket.ts

  • Add stale guards (if (this.ws !== ws) return) to open, message, error handlers
  • Add ws.close(); this.ws = null in CONNACK rejection and DISCONNECT handlers
  • Add disconnectAndWait(timeoutMs=2000) — waits for WS close event or terminates on timeout
  • Make stopReconnectTimer() public for cross-layer coordination
  • Add stable timer, rapid disconnect detection, exponential backoff with jitter
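
As a reference point for "exponential backoff with jitter", a small sketch follows. The base delay, the cap, and the "equal jitter" split are assumptions chosen for illustration; this PR description does not state the constants or the jitter scheme used in socket.ts. `backoffDelayMs` is a hypothetical helper name.

```typescript
// Exponential backoff with "equal jitter": half of the capped exponential
// delay is fixed, the other half is randomized. All constants here are
// illustrative assumptions, not values taken from socket.ts.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random,
): number {
  const exponential = baseMs * 2 ** Math.max(0, attempt);
  const capped = Math.min(maxMs, exponential);
  return Math.floor(capped / 2 + random() * (capped / 2));
}
```

Injecting `random` keeps the function deterministic in tests. The jitter matters here because many bots reconnecting on the same schedule would otherwise hit the server in lockstep.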

channel.ts

  • Clear heartbeat timer in onDisconnected; restart in onConnected
  • Replace all disconnect()+connect() with await disconnectAndWait() + connect()
  • Reorder token refresh: disconnect first, then registerBot(forceRefresh)
  • Add 3-5s backoff delay on heartbeat failure reconnect
  • Reset consecutiveHeartbeatFailures in onConnected
  • Call stopReconnectTimer() before connect() in cooldown/heartbeat paths
  • Deduplicate cooldown reconnect timer to prevent self-kick storms
  • Add [accountId] to all log messages for per-bot diagnostics
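
The heartbeat-related bullets above boil down to tying the timer's lifetime to the connection's lifetime. A minimal sketch of that shape, assuming a hypothetical `HeartbeatSupervisor` class (channel.ts is not structured this way as far as this PR shows; only `consecutiveHeartbeatFailures`, `onConnected` and `onDisconnected` are names from the description):

```typescript
// Hypothetical helper: the heartbeat only runs while the WS is connected,
// and the failure counter starts fresh on every successful connect.
class HeartbeatSupervisor {
  consecutiveHeartbeatFailures = 0;
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly intervalMs: number;
  private readonly beat: () => void;

  constructor(intervalMs: number, beat: () => void) {
    this.intervalMs = intervalMs;
    this.beat = beat;
  }

  get running(): boolean {
    return this.timer !== null;
  }

  onConnected(): void {
    this.consecutiveHeartbeatFailures = 0; // do not carry failures across reconnects
    this.stop(); // never leave two timers running
    this.timer = setInterval(this.beat, this.intervalMs);
  }

  onDisconnected(): void {
    this.stop(); // a dead connection must not keep driving heartbeats
  }

  private stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```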

Tests

  • 20 new tests covering all 9 fixes in reconnect-fixes.test.ts
  • All 284 tests pass (264 existing + 20 new)

Verification

Deployed to 3 nodes with 30+ bots total:

  • Before: ~950 kicked/hour sustained, 170K+ total kicks
  • After: 0 kicked across all nodes (stable for 15+ minutes)

…d-disconnect detection

- Bug A: Defer reconnectAttempts reset until connection stable 30s+
  (prevents backoff from being defeated by short-lived connections)
- Bug B: Reconnect with current credentials during token refresh cooldown
  (5-10s random delay to avoid tight loop on persistent kicks)
- Bug C: Detect 3+ consecutive rapid disconnects (<5s each) and trigger
  token refresh via onError to break stale-token reconnect loops
- Add clearStableTimer() to all disconnect/reconnect paths
- Reset connection tracking state in disconnect()
- Add bug reproduction/verification tests (5 new tests)

Closes dmwork-org#139
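
Bugs A and C in the commit message above can be illustrated with a small state tracker. This sketch swaps the stable timer for timestamp comparison so it can be tested without waiting; the actual change uses a timer and clearStableTimer(). `ConnectionTracker` and its return values are hypothetical; the 30s, 5s and 3-disconnect thresholds are taken from the commit message.

```typescript
type DisconnectAction = "reconnect" | "refresh-token";

// Hypothetical tracker: decides after each disconnect whether to simply
// reconnect (with backoff) or to escalate to a token refresh.
class ConnectionTracker {
  reconnectAttempts = 0;
  private rapidDisconnects = 0;
  private connectedAt: number | null = null;
  private readonly stableMs: number;
  private readonly rapidMs: number;
  private readonly rapidThreshold: number;

  constructor(stableMs = 30000, rapidMs = 5000, rapidThreshold = 3) {
    this.stableMs = stableMs;
    this.rapidMs = rapidMs;
    this.rapidThreshold = rapidThreshold;
  }

  onConnected(nowMs: number): void {
    // Note: reconnectAttempts is deliberately NOT reset here (Bug A).
    this.connectedAt = nowMs;
  }

  onDisconnected(nowMs: number): DisconnectAction {
    // A disconnect without a prior connect counts as zero uptime.
    const uptime = this.connectedAt === null ? 0 : nowMs - this.connectedAt;
    this.connectedAt = null;

    // Only a connection that stayed up long enough earns a backoff reset.
    if (uptime >= this.stableMs) this.reconnectAttempts = 0;

    // Count consecutive short-lived connections (Bug C).
    if (uptime < this.rapidMs) this.rapidDisconnects += 1;
    else this.rapidDisconnects = 0;

    this.reconnectAttempts += 1;

    if (this.rapidDisconnects >= this.rapidThreshold) {
      this.rapidDisconnects = 0;
      return "refresh-token";
    }
    return "reconnect";
  }
}
```

Because short-lived connections no longer reset `reconnectAttempts`, the backoff keeps growing through a kick loop, and the third rapid disconnect in a row escalates instead of retrying with the same token.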
When a bot is repeatedly kicked during cooldown, each kick created a new
setTimeout for reconnect without clearing the previous one. Multiple
timers firing simultaneously caused the same bot to open parallel WS
connections, which WuKongIM treats as duplicate sessions for the same
UID — each new connection kicks the previous one, creating a self-kick
feedback loop (observed: 336 kicks/min across 15 bots).

Fix: track cooldownReconnectTimer per account, clearTimeout before
creating a new one, and clear on cleanup.

Closes dmwork-org#139
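
The per-account timer deduplication described in this commit message can be sketched as follows. `CooldownReconnects` is a hypothetical wrapper written for illustration; the commit itself describes tracking a cooldownReconnectTimer per account, and this sketch models that with a Map keyed by account id.

```typescript
type TimerHandle = ReturnType<typeof setTimeout>;

// Hypothetical wrapper: at most one pending cooldown reconnect per account.
class CooldownReconnects {
  private timers = new Map<string, TimerHandle>();

  get pending(): number {
    return this.timers.size;
  }

  schedule(accountId: string, delayMs: number, reconnect: () => void): void {
    // Clear any earlier timer first, so repeated kicks during cooldown
    // cannot stack up timers that later fire in parallel.
    const existing = this.timers.get(accountId);
    if (existing !== undefined) clearTimeout(existing);

    const handle = setTimeout(() => {
      this.timers.delete(accountId);
      reconnect();
    }, delayMs);
    this.timers.set(accountId, handle);
  }

  // Call on cleanup (for example when the account is stopped).
  cancel(accountId: string): void {
    const existing = this.timers.get(accountId);
    if (existing !== undefined) {
      clearTimeout(existing);
      this.timers.delete(accountId);
    }
  }
}
```

Scheduling three times for the same account leaves one pending timer and results in a single reconnect, which is what breaks the self-kick loop.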
Collaborator

@yujiawei yujiawei left a comment

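LGTM. The 9 WS reconnect race-condition fixes give complete coverage, the stable timer + rapid disconnect detection design is sound, and there are 20 new tests plus production verification (950 kicked/h → 0). CI ✅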

@yujiawei yujiawei merged commit 8c99ad3 into dmwork-org:main Mar 31, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

fix: WebSocket reconnection storm on IM token expiry

2 participants