feat: web socket health check (WPB-21876)#3837

Open
sbakhtiarov wants to merge 1 commit into develop from feat/socket-health-check
Conversation

sbakhtiarov commented Feb 3, 2026

Bug WPB-21876: [Android] Clients using websocket notifications stop receiving messages after 24–48 hours in forward-deployed VPN environment

https://wearezeta.atlassian.net/browse/WPB-21876

Issue

WebSocket connections can become stale: they appear connected but no longer receive events. This "zombie connection" state goes undetected because the connection technically remains open while the server has stopped sending events (e.g., due to network issues, server-side timeouts, or load balancer disconnections).

Solution

  • Add lastWebSocketEventInstant() and recordLastWebSocketEvent() methods to IncrementalSyncRepository to track WebSocket event timestamps
  • Use the @Volatile annotation on the timestamp property so that writes are visible to reads across threads and coroutine dispatchers
  • Record timestamps in IncrementalSyncManager when processing LIVE WebSocket events
  • Expose GetLastWebSocketEventInstantUseCase in UserSessionScope for the application layer to query event timing
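
A minimal sketch of the tracking described in the list above, under stated assumptions: the class name InMemoryWebSocketEventTracker and the injectable clock are hypothetical, and java.time.Instant stands in for whatever Instant type the repository actually uses. Only the two method names come from this PR.

```kotlin
import java.time.Instant

// Hypothetical stand-in for the tracking part of IncrementalSyncRepository.
// java.time.Instant is used only to keep the sketch self-contained.
class InMemoryWebSocketEventTracker(
    // Injectable clock (an assumption, added here for testability).
    private val now: () -> Instant = { Instant.now() }
) {
    // @Volatile makes a write on one thread visible to later reads on other threads.
    // It does not make read-modify-write sequences atomic; a plain overwrite
    // like the one below does not need that.
    @Volatile
    private var lastEvent: Instant? = null

    fun recordLastWebSocketEvent() {
        lastEvent = now()
    }

    fun lastWebSocketEventInstant(): Instant? = lastEvent
}

fun main() {
    val fixed = Instant.parse("2026-02-03T10:00:00Z")
    val tracker = InMemoryWebSocketEventTracker(now = { fixed })
    println(tracker.lastWebSocketEventInstant()) // null: nothing recorded yet
    tracker.recordLastWebSocketEvent()
    println(tracker.lastWebSocketEventInstant()) // 2026-02-03T10:00:00Z
}
```

The nullable return value lets callers tell "no event seen yet" apart from "last event seen at time T".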

See wireapp/wire-android#4561 for application update.

Copilot AI left a comment

Pull request overview

This PR implements a WebSocket health check mechanism to detect stale connections that appear active but have stopped receiving events. The solution tracks the timestamp of the last WebSocket event to enable detection of "zombie connections" in environments where network issues or load balancer disconnections can cause events to stop flowing without proper disconnection notification.

Changes:

  • Added timestamp tracking in IncrementalSyncRepository with lastWebSocketEventInstant() and recordLastWebSocketEvent() methods
  • Integrated timestamp recording in IncrementalSyncManager when processing LIVE WebSocket events
  • Exposed GetLastWebSocketEventInstantUseCase in UserSessionScope for application-layer queries
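
As an illustration of the application-layer query mentioned above, the following sketch shows one way a caller might turn the last-event timestamp into a staleness signal. The function name isConnectionLikelyStale and the 30-minute threshold are assumptions for illustration; neither appears in this PR, and java.time types stand in for the project's own.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical helper; the name and the threshold are not taken from the PR.
fun isConnectionLikelyStale(
    lastEvent: Instant?,
    now: Instant,
    maxSilence: Duration
): Boolean {
    // No event recorded yet: there is nothing to compare against, so do not flag it here.
    if (lastEvent == null) return false
    return Duration.between(lastEvent, now) > maxSilence
}

fun main() {
    val now = Instant.parse("2026-02-03T12:00:00Z")
    val maxSilence = Duration.ofMinutes(30)
    println(isConnectionLikelyStale(now.minus(Duration.ofMinutes(5)), now, maxSilence)) // false
    println(isConnectionLikelyStale(now.minus(Duration.ofHours(2)), now, maxSilence))   // true
    println(isConnectionLikelyStale(null, now, maxSilence))                             // false
}
```

A long silence can also mean that the account simply received no events, so a result of true is probably best treated as a hint to probe or reconnect rather than as proof of a dead socket.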

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File | Description
logic/src/commonMain/kotlin/com/wire/kalium/logic/data/sync/IncrementalSyncRepository.kt | Adds timestamp tracking with @Volatile field and methods to record/query last WebSocket event time
logic/src/commonMain/kotlin/com/wire/kalium/logic/sync/incremental/IncrementalSyncManager.kt | Integrates recordLastWebSocketEvent() call when transitioning to LIVE state
logic/src/commonMain/kotlin/com/wire/kalium/logic/feature/user/webSocketStatus/GetLastWebSocketEventInstantUseCase.kt | New use case interface and implementation to expose timestamp query functionality
logic/src/commonMain/kotlin/com/wire/kalium/logic/feature/UserSessionScope.kt | Exposes the new use case in the public API
logic/src/commonTest/kotlin/com/wire/kalium/logic/data/sync/IncrementalSyncRepositoryTest.kt | Comprehensive tests for timestamp tracking functionality

Comment on lines 169 to 173
    if (eventSource == EventSource.LIVE) {
        incrementalSyncRepository.recordLastWebSocketEvent()
        Uuid.random().toString() to Clock.System.now()
    } else {
        syncData
    }
Copilot AI Feb 3, 2026

Consider adding a test in IncrementalSyncManagerTest to verify that recordLastWebSocketEvent is called when processing LIVE events. This would ensure the integration between IncrementalSyncManager and IncrementalSyncRepository works as expected. A test similar to the existing "givenSyncIsLive_whenWorkerEmitsSources_thenShouldResetBackoffForUserConfigSync" test would be appropriate.
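
A rough sketch of the kind of check suggested here, written without a mocking library so that it stands alone. Everything in it is a hypothetical stand-in: WebSocketEventRecorder, CountingRecorder, onEventSource, and the PENDING value are assumptions for illustration, not Kalium's actual types or the real IncrementalSyncManagerTest setup.

```kotlin
// All names below are hypothetical stand-ins, not Kalium's actual types.
enum class EventSource { PENDING, LIVE }

interface WebSocketEventRecorder {
    fun recordLastWebSocketEvent()
}

// Hand-written fake that counts how often the recorder is invoked.
class CountingRecorder : WebSocketEventRecorder {
    var calls = 0
        private set

    override fun recordLastWebSocketEvent() {
        calls++
    }
}

// Simplified version of the branch under review: record only for LIVE events.
fun onEventSource(source: EventSource, recorder: WebSocketEventRecorder) {
    if (source == EventSource.LIVE) recorder.recordLastWebSocketEvent()
}

fun main() {
    val recorder = CountingRecorder()

    onEventSource(EventSource.PENDING, recorder)
    check(recorder.calls == 0) { "PENDING must not record" }

    onEventSource(EventSource.LIVE, recorder)
    check(recorder.calls == 1) { "LIVE must record exactly once" }

    println("ok")
}
```

In the real test suite the same two assertions would presumably be expressed with the project's existing mocking and arrangement helpers.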

github-actions bot commented Feb 3, 2026

Test Results

0 tests (-4 300) | 0 ✅ (-4 185) | 0s ⏱️ (-4m 43s)
0 suites (-723) | 0 💤 (-115)
0 files (-723) | 0 ❌ (±0)

Results for commit f5c53da. ± Comparison against base commit 06393f2.

♻️ This comment has been updated with latest results.

github-actions bot commented Feb 3, 2026

🐰 Bencher Report

Branch: feat/socket-health-check
Testbed: ubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

To only post results if a Threshold exists, set the --ci-only-thresholds flag.
Benchmark | Latency (µs)
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInFiles | 675.48 (no threshold)
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInMemory | 336,808.04 (no threshold)
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.messageInsertionBenchmark | 1,356,210.19 (no threshold)
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.queryMessagesBenchmark | 21,197.24 (no threshold)

codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.59%. Comparing base (59e901d) to head (f5c53da).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines | Patch % | Lines
...ocketStatus/GetLastWebSocketEventInstantUseCase.kt | 0.00% | 3 Missing ⚠️
...um/logic/sync/incremental/IncrementalSyncWorker.kt | 66.66% | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3837      +/-   ##
===========================================
+ Coverage    59.56%   59.59%   +0.02%     
===========================================
  Files         1898     1899       +1     
  Lines        59227    59293      +66     
  Branches      6417     6421       +4     
===========================================
+ Hits         35277    35334      +57     
- Misses       21037    21043       +6     
- Partials      2913     2916       +3     
Files with missing lines | Coverage Δ
...alium/logic/data/sync/IncrementalSyncRepository.kt | 100.00% <100.00%> (ø)
...m/logic/sync/incremental/IncrementalSyncManager.kt | 77.77% <100.00%> (+0.85%) ⬆️
...um/logic/sync/incremental/IncrementalSyncWorker.kt | 77.27% <66.66%> (-1.68%) ⬇️
...ocketStatus/GetLastWebSocketEventInstantUseCase.kt | 0.00% <0.00%> (ø)

... and 7 files with indirect coverage changes


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 59e901d...f5c53da.

sbakhtiarov force-pushed the feat/socket-health-check branch from f9cb4ee to dfb27ec on February 4, 2026 11:37
sbakhtiarov force-pushed the feat/socket-health-check branch from dfb27ec to f5c53da on February 4, 2026 11:46
sonarqubecloud bot commented Feb 4, 2026

    }
    .filterIsInstance<EventStreamData.NewEvents>()
    .collect { streamData ->
        if (isLive) {
Do we need to record only when we are live, or for each event we receive via the socket?
I'm thinking again of the async notifications: suppose some messages arrived only while we were offline. When the socket connects, we receive them as pending (non-live) events, which shows that the socket is healthy at that moment. With this if (isLive) check we don't record anything for those non-live events, so if no live events follow the pending ones, LastWebSocketEvent is never updated and becomes outdated. 🤔
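
The scenario in this comment can be made concrete with a small sketch that contrasts the two recording policies. The IncomingEvent type and both function names are hypothetical, and java.time.Instant stands in for the project's own Instant type; the sketch only models which events would refresh the timestamp, not the actual flow in IncrementalSyncWorker.

```kotlin
import java.time.Instant

// Hypothetical model of an incoming event; Kalium's real types differ.
data class IncomingEvent(val id: String, val isLive: Boolean, val receivedAt: Instant)

// Policy as reviewed: only live events refresh the timestamp.
fun lastSeenLiveOnly(events: List<IncomingEvent>): Instant? =
    events.filter { it.isLive }.maxOfOrNull { it.receivedAt }

// Policy the comment asks about: any event delivered over the socket counts.
fun lastSeenAnyEvent(events: List<IncomingEvent>): Instant? =
    events.maxOfOrNull { it.receivedAt }

fun main() {
    val t0 = Instant.parse("2026-02-04T09:00:00Z")
    // After reconnecting, only pending (non-live) events arrive.
    val events = listOf(
        IncomingEvent("a", isLive = false, receivedAt = t0),
        IncomingEvent("b", isLive = false, receivedAt = t0.plusSeconds(1))
    )
    println(lastSeenLiveOnly(events)) // null: the timestamp is never refreshed
    println(lastSeenAnyEvent(events)) // 2026-02-04T09:00:01Z
}
```

Under the live-only policy a healthy socket that delivered only pending events looks identical to one that delivered nothing, which is the gap the comment describes.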
