This document serves as a reliability map for the app-side system, detailing its major failure domains and the architectural mechanisms designed to contain, prevent, detect, or tolerate them. It highlights intentional engineering choices made to ensure data integrity and user experience even under adverse mobile operating system constraints, unreliable networks, and complex data flows.
Scope: This document focuses exclusively on client-side (iOS/Android) failure modes and containment mechanisms. It does not cover detailed backend failure analysis, exhaustive internal error codes, low-level third-party library internals, or historical bug-fix changelogs. Backend interactions are discussed as boundary conditions. It aligns with and cross-references: architecture.md, offline-sync.md, health-ingestion.md, nativeBLE.md, and data-flow.md.
The application's reliability model is built on an offline-first, local-first philosophy, where data integrity and graceful degradation are paramount. Failure handling is an integral part of the architecture, not an afterthought.
- Local-First / Offline-First: Critical functionality remains available and data can be recorded even without network connectivity. Changes are queued and synced later.
- Fail-Fast on Integrity: Critical data integrity violations are detected early and, where configurable, lead to immediate termination or explicit error states to prevent silent corruption.
- Retry and Degrade Gracefully: Transient failures (network, rate limits) trigger exponential backoff and retries, while non-critical features may degrade gracefully.
- Isolate Platform-Sensitive Native Work: Complex, platform-specific interactions (BLE, HealthKit, OS lifecycle) are encapsulated in native modules with robust, self-healing logic.
- Prevent Infinite Loops & Duplicate Processing: Mechanisms exist to break out of stuck states (e.g., `FactoryResetGuard`), ensure idempotent operations (`payloadHash`), and handle duplicate events (`EventBuffer`).
A key aspect of this system's reliability is its explicit distinction between how different types of failures are managed:
| Tier | Description | Example |
|---|---|---|
| Prevented | Architectural design actively makes certain failure modes impossible or highly improbable. | Reinstall loops bounded by FactoryResetGuard; cursor commits deferred until IntegrityGate passes. |
| Detected | Failures are explicitly identified and reported; the system may continue in a degraded state. | Orphaned foreign keys caught by IntegrityGate; invariant mismatches surfaced by catalog state checks. |
| Tolerated | The system continues to operate with reduced functionality while working towards recovery. | Offline mutations queued to OutboxRepository; sync cooldown/backoff; partial health backfill progress. |
| Surfaced | Failures are logged, exposed via status indicators, or made visible to the user for explicit retry/diagnosis. | Terminal auth failures triggering global logout; rate limiting messages in UI; network disconnection alerts. |
```mermaid
graph TD
A[App Launch] --> B{Startup & Reinstall}
B --> C[Native BLE Runtime]
B --> D[Health Ingestion]
C --> E[Local Persistence]
D --> E
E --> F[Entity Sync]
F --> G[API / Auth / Network]
E --> H[Realtime / Eventing]
G --> F
H --> F
H --> E
subgraph Orchestration
B --- I[Startup Orchestration]
end
```
This domain governs the application's first-run experience and resilience against problematic OS-level state. Failure handling here is strongly explicit in native iOS code.
iOS Keychain state persists across app uninstalls, leading to stale authentication tokens and corrupted user data upon reinstall. Complex multi-phase startup (native, JS, DB) introduces hazards like UI freezes and race conditions.
Failure Classes
- Stale Keychain state after uninstall/reinstall
- Partial wipe failure during factory reset
- Infinite reset loops across launches for the same build
- Startup-order hazards (e.g., accessing DB before migrations)
- Startup service-init race conditions and UI-freeze prevention
Containment
- Prevented: `ReinstallDetector.swift` detects reinstall early. `FactoryResetGuard.swift` bounds reset attempts. `AppDelegate.mm` orchestrates atomic reset and partial-failure recovery by re-seeding markers. `StartupOrchestrator` (`AppProvider.tsx`) phases initialization to prevent UI freezes and init races.
- Detected: `ReinstallDetector` and `FactoryResetModule.swift` log detection events and partial failures.
- Tolerated: Bounded retries for partial resets allow for self-healing across launches.
- Surfaced: A "Critical Application Error" screen in `AppProvider.tsx` offers "Retry Init" or "Factory Reset" options.
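The bounded-reset pattern above can be sketched as follows. This is an illustrative TypeScript model of the logic only; the real guard is implemented natively (`FactoryResetGuard.swift`), and the store interface, key format, and attempt limit here are assumptions:

```typescript
// Sketch of the bounded-reset pattern: allow at most N factory-reset
// attempts per build, so a persistent wipe failure cannot trap the app
// in an infinite reset loop across launches. All names are illustrative.
interface ResetMarkerStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

const MAX_RESET_ATTEMPTS = 3;

function shouldAttemptReset(store: ResetMarkerStore, buildId: string): boolean {
  const raw = store.get(`reset-attempts:${buildId}`);
  const attempts = raw === null ? 0 : parseInt(raw, 10);
  if (attempts >= MAX_RESET_ATTEMPTS) {
    // Bound reached: surface an explicit error state instead of looping.
    return false;
  }
  // Record the attempt *before* wiping, so a crash mid-reset still counts.
  store.set(`reset-attempts:${buildId}`, String(attempts + 1));
  return true;
}
```

Counting the attempt before the wipe begins is what makes the bound crash-safe: even if the reset itself dies partway through, the next launch sees the incremented marker.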
Residual Risk: Android equivalents for native reinstall/reset logic are not explicitly evidenced in the current codebase.
Related: architecture.md, nativeBLE.md
The mobile OS environment presents unique challenges for stable Bluetooth Low Energy communication. This domain is critical for device connectivity and data streaming.
iOS/Android aggressively manage background processes and Bluetooth resources. Bonding/pairing can be fragile, and the native-JS bridge introduces latency and synchronization issues.
Failure Classes
- iOS process death / restoration-sensitive startup timing (`willRestoreState` missed)
- Connection drops and reconnect behavior (transient RF interference, device out of range)
- Bonding/key mismatch or "ghost bond" class failures (`CBError.peerRemovedPairingInformation`, `CBError.encryptionTimedOut`)
- Device-initiated sleep (`MSG_SLEEP`) and dormant reconnection cases
- Native-JS event starvation or listener inactivity issues (`EventBuffer` overflow)
- BLE receive buffer overflow from corrupt/rapidly streaming data
Containment
- Prevented: `AppDeviceBLEInitializer.initializeBLECore()` for early `CBCentralManager` init. `AppDeviceBLECore.swift` `EC-SLEEP-BOND-FIX-001` prioritizes device sleep signal over false bonding errors. `AppDeviceBLEModule.swift` `EventBuffer` with `EC-BUFFER-OVERFLOW-001/002` for buffering and overflow notification. `PROTOCOL_ACK_TIMEOUT_MS` for protocol ACK timeouts.
- Detected: `AppDeviceBLECore.swift` comprehensive `ConnectionState` machine with explicit `DisconnectReason` classification and GATT pipeline faults (`EC-FAULT-001`). `AppDeviceBLEModule.swift` buffer overflow events via `onBufferOverflow`. `BluetoothHandler.ts` `checkBondHealth()` proactively identifies "ghost bond" issues.
- Tolerated: `BluetoothHandler.ts` `attemptReconnectionWithBackoff()` with `autoConnect: true` for persistent OS-level reconnection and "dormant reconnection" mode. `AppDeviceBLENative.setDeviceSleepFlag()` for graceful disconnect handling.
- Surfaced: UI alerts on `onBondingLost` for pairing issues, `onOperationRejected` for operation failures. `BluetoothContext.tsx` alerts for max reconnect attempts.
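The reconnection-with-backoff behavior can be sketched as below. This is a minimal model of the pattern, not the actual `attemptReconnectionWithBackoff()` implementation; the constants, jitter strategy, and attempt limit are assumptions:

```typescript
// Illustrative sketch of bounded reconnection backoff with full jitter.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// rand is injectable so the schedule is testable; defaults to Math.random.
function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  const exp = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  // Full jitter: pick uniformly in [0, exp) so many devices reconnecting
  // after the same outage do not all retry at the same instant.
  return Math.floor(exp * rand());
}

async function reconnectWithBackoff(
  connect: () => Promise<boolean>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await connect()) return true;
    await sleep(reconnectDelayMs(attempt));
  }
  // Out of attempts: callers fall back to a dormant-reconnection mode.
  return false;
}
```

Exhausting the attempt budget does not abandon the device; as described above, the handler drops into a dormant mode where the OS-level `autoConnect` request remains pending.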
Residual Risk: Detailed native Android BLE resilience implementation is less explicit in the provided codebase. Subtle OS-level power management can still affect long-term background BLE.
Related: nativeBLE.md
Health data ingestion involves native HealthKit APIs, local storage, and background processing. Reliability is paramount due to data sensitivity and volume.
Mobile OS imposes strict background execution limits. HealthKit queries can time out or hang. Large datasets (100k+ samples for backfill) require efficient, crash-safe processing.
Failure Classes
- Observer registration misses (missed background deliveries)
- Query timeout / hung query containment (`QUERY_TIMEOUT_SECONDS`)
- Background execution budget exhaustion (iOS app termination)
- Partial ingestion due to errors
- Empty-window backfill stalls for sparse metrics
- Crash between sample write and cursor advancement (data skipping)
- Local DB lock contention during native ingestion (`SQLITE_BUSY`)
- Stale or duplicated reads if cursor movement is incorrect
Containment
- Prevented: `HealthKitObserver.shared.registerDefaultObservers()` for early registration in `AppDelegate`. `HealthIngestCore.swift` `coldResumeIndex` for fairness, `coldBackfillEndTs`/`coldPageFromTs` for crash-safe incremental progress. `HealthIngestSQLite.swift` `BEGIN IMMEDIATE` transactions with CAS for atomic updates via `atomicInsertAndUpdateCursor()`.
- Detected: `HealthIngestCore.swift` `QUERY_TIMEOUT_SECONDS` for HK queries with `AtomicBool` cancellation flags. `HealthIngestSQLite.swift` `verifySchema()` for DB integrity.
- Tolerated: `HealthIngestCore.swift` `partial: true` for budget exceeded/cancellation; `coldCursorsAdvanced` tracks progress with 0 samples. `NativeHealthIngestModule.swift` hard 15-second timeout for the background `CHANGE` lane.
- Surfaced: `NativeHealthIngestModule.swift` emits `NativeHealthIngest_Error` events (`.queryTimeout`, `.sqliteWriteFailed`).
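The crash-safety property of "write samples and advance cursor atomically" can be modeled as follows. This is an in-memory TypeScript sketch of the pattern only; the real implementation (`atomicInsertAndUpdateCursor()`) lives in Swift over SQLite `BEGIN IMMEDIATE` transactions, and the types here are invented for illustration:

```typescript
// In-memory sketch of the atomic "write samples + advance cursor" pattern:
// both effects commit together, and a compare-and-swap on the expected
// cursor rejects stale writers so data can be neither skipped nor doubled.
interface Sample { id: string; ts: number }

class IngestStore {
  samples: Sample[] = [];
  cursorTs = 0;

  // Returns false (and changes nothing) if another writer already advanced
  // the cursor past what this caller observed, or if the move is backward.
  atomicInsertAndAdvance(batch: Sample[], expectedCursorTs: number, newCursorTs: number): boolean {
    if (this.cursorTs !== expectedCursorTs) return false; // CAS failed
    if (newCursorTs < this.cursorTs) return false;        // never move backward
    // "Transaction": apply both effects only after all checks pass.
    this.samples.push(...batch);
    this.cursorTs = newCursorTs;
    return true;
  }
}
```

Because the checks run before either effect is applied, a crash at any point leaves the store in a state where the batch and cursor are either both committed or both absent, which is exactly the invariant that prevents data skipping.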
Residual Risk: Android Health Connect integration is not explicitly detailed in the provided codebase. Scalability for extremely large backfills on resource-constrained devices remains an open concern.
Related: health-ingestion.md
As a local-first application, the integrity of the local SQLite database is foundational to correct behavior and user trust.
The local SQLite DB is the UI's source of truth. Offline mutations, multi-threaded access, schema migrations, and syncing logic all introduce opportunities for data corruption and inconsistency.
Failure Classes
- SQLite lock/contention risk from concurrent access
- Corrupted or inconsistent local state from partial writes or crashes
- Failed local mutations (database write errors)
- Stale persisted cache not reflecting true local or server state
- Orphaned foreign keys (child references non-existent parent)
- Inconsistent aggregates/read models (e.g., dashboard data)
- Crash-safety around atomic write-plus-cursor advancement
Containment
- Prevented: `db/client.ts` (WAL mode, `FULLMUTEX`, `busy_timeout`). `OutboxRepository.ts` atomic enqueue. `HealthSampleRepository.ts` `atomicInsertAndUpdateCursorAtomic()` ensures atomic data write/cursor update. `IntegrityGate.ts` runs before cursor commits, preventing bad cursors.
- Detected: `IntegrityGate.ts` uses `RELATION_GRAPH` to find orphaned FKs with `IntegrityViolation` reporting. `CursorRepository.ts` enforces monotonic cursor advancement and throws `CursorBackwardError`. `HealthIngestSQLite.swift` `verifySchema()` for table integrity.
- Tolerated: `IntegrityGate` in non-fail-fast mode logs violations and allows sync to proceed.
- Surfaced: `IntegrityGate.ts` `IntegrityReport` provides detailed results. `AppProvider.tsx` displays errors if core database initialization fails.
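The orphaned-FK scan can be sketched as below. This is a simplified stand-in for `IntegrityGate.ts`; the `Relation`/`Violation` shapes, the example relation, and the in-memory table representation are all assumptions for illustration:

```typescript
// Minimal sketch of an IntegrityGate-style orphaned-foreign-key scan: for
// each declared relation, every child row's FK must resolve to a parent id.
interface Relation { child: string; fkColumn: string; parent: string }
interface Violation { table: string; rowId: string; fkColumn: string }

type Tables = Record<string, Array<Record<string, unknown>>>;

// Illustrative relation graph; the real RELATION_GRAPH covers all entities.
const RELATION_GRAPH: Relation[] = [
  { child: "workout_samples", fkColumn: "workout_id", parent: "workouts" },
];

function findOrphans(tables: Tables, relations: Relation[] = RELATION_GRAPH): Violation[] {
  const violations: Violation[] = [];
  for (const rel of relations) {
    const parentIds = new Set((tables[rel.parent] ?? []).map((r) => r.id));
    for (const row of tables[rel.child] ?? []) {
      const fk = row[rel.fkColumn];
      // NULL FKs are allowed; only dangling references are violations.
      if (fk != null && !parentIds.has(fk)) {
        violations.push({ table: rel.child, rowId: String(row.id), fkColumn: rel.fkColumn });
      }
    }
  }
  return violations;
}
```

Running a scan like this before cursor commits is what turns a latent corruption into a detected, reportable state rather than a silently persisted one.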
Residual Risk: SQLite scalability for very large datasets, and the potential for non-FK data inconsistencies, remain open concerns.
Related: offline-sync.md, data-flow.md
The synchronization layer bridges the local offline-first state and the remote backend. Its reliability is critical for data consistency and user trust.
Sync is inherently distributed and asynchronous. Network latency, packet loss, server-side errors, and concurrent modifications create numerous opportunities for conflicts, data loss, or inconsistent states.
Failure Classes
- Offline transition during sync
- App crash during push/pull or local mutation
- Partial success / partial failure in batch responses
- Concurrent sync triggers ("sync storms")
- Cursor corruption or bad cursor advancement
- Conflict resolution mismatch
- Rate limiting and retry/backoff behavior
- Data corruption detection via `IntegrityGate`
- Hot-reload/multi-instance coordination hazards (`DataSyncService` singletons)
Containment
- Prevented: `DataSyncService` `sharedSyncInProgress` mutex, `debouncedSyncTrigger`, and per-source cooldowns. `PushEngineCore.computeDeterministicSyncOperationId()` SHA-256 for server-side idempotency. `OutboxRepository.dequeueDeduplicatedByEntity()` keeps one representative command per entity. `IntegrityGate` runs before cursor commits.
- Detected: `DataSyncService` `handle429RateLimitError()` applies exponential backoff. `PushEngineCore.shouldSkipCommand()` detects unpushable commands. `PullEngine.ts` does not advance cursors for entities with failures. `CursorRepository.ts` throws `InvalidCursorError`/`CursorBackwardError`.
- Tolerated: `OutboxRepository` queues offline mutations with `markFailed`/`markDeadLetter`. `BackendAPIClient` robust retry with exponential backoff/jitter. `SyncLeaseManager` admission control for bulk operations.
- Surfaced: `DataSyncService.getSyncState()` provides UI status, pending counts, and errors. `BackendAPIClient` logs `DebugRequestMetadata`. `DataSyncService` emits `WEBSOCKET_RECONNECT_FAILED`/`SYNC_CONFLICT_DETECTED` events.
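The deterministic-ID idea behind server-side idempotency can be sketched as follows. The exact field set and canonicalization used by `computeDeterministicSyncOperationId()` are not shown in this document, so the `SyncCommand` shape and encoding below are assumptions:

```typescript
import { createHash } from "node:crypto";

// Sketch of a deterministic sync-operation ID: the same logical mutation
// always hashes to the same ID, so a retried push is recognized server-side
// as a duplicate rather than applied twice.
interface SyncCommand {
  entityType: string;
  entityId: string;
  op: "create" | "update" | "delete";
  payload: Record<string, unknown>;
}

function computeSyncOperationId(cmd: SyncCommand): string {
  // Sort payload keys so key order never changes the hash.
  const canonicalPayload = Object.fromEntries(
    Object.entries(cmd.payload).sort(([a], [b]) => a.localeCompare(b)),
  );
  const canonical = JSON.stringify([cmd.entityType, cmd.entityId, cmd.op, canonicalPayload]);
  return createHash("sha256").update(canonical).digest("hex");
}
```

Canonicalization is the load-bearing step: without a stable key order, two serializations of the same mutation would hash differently and the backend could not deduplicate retries.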
Residual Risk: Highly complex semantic conflicts may still require manual intervention. Reliability depends on backend API's consistent implementation of idempotency and merge rules.
Related: offline-sync.md, data-flow.md
Real-time events provide immediate UI feedback, but their asynchronous nature and the React Native lifecycle introduce complexities that can lead to subtle bugs.
React Native's Hot Reload can create duplicate listeners leading to sync storms or memory leaks. Asynchronous event processing can lead to race conditions with stale data, and external WebSocket connections are inherently fragile.
Failure Classes
- Duplicate listeners after Hot Reload / re-instantiation
- Stale event application order
- Event delivery while services are disposed or not yet ready
- WebSocket-triggered sync storms
- Coordinator disposal and cleanup hazards (memory leaks)
Containment
- Prevented: `WebSocketClient.ts` singleton with `removeAllListeners()` on disconnect and `setUserIdGetter()` for tenancy validation. `EventEmitter.ts` `suppress()`/`resume()` for event storm prevention. `DataSyncService.debouncedSyncTrigger` batches local change events.
- Detected: `WebSocketClient` logs invalid event structures and detects `Backend userId mismatch`, forcing disconnect.
- Tolerated: `WebSocketClient` graceful degradation on connection issues. `EventEmitter` catches errors in individual listeners.
- Surfaced: `WebSocketClient` logs connection/reconnection events and emits local events for UI updates.
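The storm-prevention role of the debounced trigger can be sketched as below. This is a generic debounce model, not the actual `debouncedSyncTrigger` code; the quiet-window semantics are an assumption consistent with its described purpose:

```typescript
// Sketch of a debounced sync trigger: a burst of local-change or WebSocket
// events collapses into a single sync run once the burst goes quiet,
// preventing "sync storms".
function makeDebouncedTrigger(run: () => void, quietMs: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer !== undefined) clearTimeout(timer); // restart the quiet window
    timer = setTimeout(() => {
      timer = undefined;
      run();
    }, quietMs);
  };
}
```

Combined with the `sharedSyncInProgress` mutex described in the sync domain, this gives two independent layers of protection: bursts are coalesced before a sync starts, and concurrent starts are serialized if one slips through.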
Residual Risk: Subtle memory leaks could occur in complex React Native component lifecycles combined with external event systems. `socket.io-client` provides "at-most-once" delivery; critical consistency relies on the broader sync mechanisms.
Related: data-flow.md
Interactions with the backend API introduce failure modes related to network reliability, authentication state, and server behavior.
The mobile network environment is inherently unreliable. Authentication tokens expire, backend APIs can be rate-limited or return complex errors, and clock skew can invalidate timestamps.
Failure Classes
- Expired auth/session during sync (`401 Unauthorized`)
- Backend rate limiting (`429 Too Many Requests`)
- Transient network failures (connection drops, timeouts)
- Idempotency assumptions at the boundary (duplicate requests, side effects)
- Stale connectivity detection (`NetInfo` reporting false positives)
- Clock skew between client and server (invalid timestamps)
Containment
- Prevented: `BackendAPIClient.ensureTokensFresh()` for proactive JWT refresh. `BackendAPIClient.throttleRequest()` prevents API spam. `BackendAPIClient.isRetryableError()` distinguishes retryable errors. `BackendAPIClient` `X-Correlation-ID` for idempotency.
- Detected: `BackendAPIClient` `processError()` categorizes errors, `updateRateLimitStatus()` parses headers, and `serverTimeOffset` detects clock skew. `AuthContext.tsx` `handleAuthTerminalFailure()` for `401`s.
- Tolerated: `BackendAPIClient` robust retry with exponential backoff/jitter. `DataSyncService` static exponential backoff for rate limits.
- Surfaced: `BackendAPIClient` logs detailed request/response metadata. `AuthContext.tsx` sets error state and redirects on auth failure.
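The boundary-side retry classification can be sketched as follows. These helpers are illustrative, not the real `isRetryableError()`/`updateRateLimitStatus()` logic; the status list is an assumption, and `Retry-After` parsing follows the standard header's two allowed forms (delay-seconds or an HTTP-date):

```typescript
// Sketch of retryable-error classification and Retry-After parsing at the
// API boundary.
function isRetryableStatus(status: number): boolean {
  // Timeouts, rate limits, and server errors are worth retrying;
  // other 4xx responses indicate caller errors and should not be retried.
  return status === 408 || status === 429 || (status >= 500 && status < 600);
}

// Retry-After may be delay-seconds or an HTTP-date; returns a wait in ms,
// or null if the header is unparseable (caller falls back to backoff).
function retryAfterMs(header: string, nowMs: number): number | null {
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(header);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}
```

Clamping to zero matters for the HTTP-date form: with client clock skew (itself a listed failure class), a server date in the apparent past should mean "retry now", not a negative delay.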
Residual Risk: Backend API consistency and reliability is a critical external dependency. Extreme mobile network flakiness can still hinder effective communication.
Related: architecture.md
Application initialization and service management require careful sequencing to avoid deadlocks, crashes, and poor user experience.
A complex mobile app needs to initialize many services. Misordered dependencies can cause crashes (e.g., DB access before migration). Heavy I/O work can block the UI thread. Hot Reload during development can create ghost instances and duplicate listeners.
Failure Classes
- Heavy initialization work blocking first paint
- Service ordering problems (dependency not ready)
- Background initialization racing user-visible flows
- Partial startup success (some services failed)
- Cleanup and disposal of long-lived services on unmount/Hot Reload (memory leaks)
Containment
- Prevented: `StartupOrchestrator` (`AppProvider.tsx`) with phased startup, explicit task dependencies, and background tasks running after first paint. `DataSyncService.configureStartupGate()` delays heavy init. Singletons prevent multiple instances.
- Detected: `MainThreadBlockMonitor` monitors UI thread responsiveness.
- Tolerated: `StartupOrchestrator`'s `canFail` option for background tasks allows graceful degradation.
- Surfaced: `AppProvider.tsx` displays a "Critical Application Error" screen on essential startup failures. Logs emitted for background task warnings.
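The phased-startup model can be sketched as below. This is a minimal dependency-ordered runner illustrating the `canFail` distinction; the actual `StartupOrchestrator` task shape, scheduling, and error surfaces are not shown here, so everything below is an assumption:

```typescript
// Sketch of phased startup: tasks declare dependencies and run once those
// are satisfied; a failing task marked canFail degrades gracefully instead
// of blocking launch, while an essential failure propagates.
interface StartupTask {
  name: string;
  deps: string[];
  canFail: boolean;
  run: () => Promise<void>;
}

async function runStartup(tasks: StartupTask[]): Promise<{ ok: string[]; failed: string[] }> {
  const ok: string[] = [];
  const failed: string[] = [];
  const remaining = [...tasks];
  while (remaining.length > 0) {
    const idx = remaining.findIndex((t) => t.deps.every((d) => ok.includes(d)));
    if (idx === -1) throw new Error("startup deadlock: unsatisfiable dependencies");
    const task = remaining.splice(idx, 1)[0];
    try {
      await task.run();
      ok.push(task.name);
    } catch (err) {
      if (!task.canFail) throw err; // essential task: surface critical error
      failed.push(task.name);       // background task: degrade gracefully
    }
  }
  return { ok, failed };
}
```

A thrown error from an essential task corresponds to the "Critical Application Error" screen, while entries in `failed` correspond to the logged background-task warnings.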
Residual Risk: Complex interactions with third-party SDK initializations might cause unexpected delays. Subtle memory leaks on specific Hot Reload sequences.
Related: architecture.md
This section clarifies areas where the current implementation or documentation has acknowledged limitations.
- Android-Native Failure Handling: The current evidence is predominantly iOS-centric, especially for low-level BLE (`AppDeviceBLECore.swift`), HealthKit (`HealthIngestCore.swift`), and app lifecycle management. While cross-platform abstractions exist, detailed native Android failure containment strategies are less explicitly surfaced.
- Backend Internals: This document focuses on app-side resilience. Client components (`BackendAPIClient`, `DataSyncService`) make assumptions about backend behavior (`payloadHash` idempotency, `Retry-After` headers) documented as boundary conditions. A full backend failure analysis is a separate document.
- Health Read-Model Maturity: The core health ingestion pipeline (`HealthIngestCore`, `HealthIngestSQLite`) is robustly engineered. The downstream `HealthProjectionRefreshService` for computed projections is a newer feature. While designed for resilience, the long-term completeness of derived read models is still maturing.
- UI-Layer Failures: This document's scope is architectural resilience. It does not cover React rendering errors, component-level state inconsistencies, or UX degradation not directly tied to underlying system failures.
The reliability claims in this document are supported by testing across the following layers:
| Layer | Coverage |
|---|---|
| Native iOS | Unit and integration tests for ReinstallDetector, FactoryResetGuard, KeychainWipeModule, HealthIngestCore, and AppDeviceBLECore verify core native logic and OS interactions. |
| BLE Integration | End-to-end tests covering connection, reconnection, data transfer, and error handling through the native-JS bridge. |
| Health Ingestion | Tests for HealthIngestSQLite, HealthSampleRepository, and HealthDeletionQueueRepository verify atomicity, cursor integrity, and crash recovery. |
| Offline Sync | Comprehensive tests for OutboxRepository, CursorRepository, IdMapRepository, TombstoneRepository, and sync handlers verify transactional behavior, idempotency, FK resolution, and conflict resolution. |
| Startup | Tests for StartupOrchestrator verify correct service initialization order and error handling during startup phases. |
| Document | Focus Area |
|---|---|
| architecture.md | Layered system model, service boundaries, data ownership, and lifecycle orchestration |
| offline-sync.md | Transactional outbox, cursor-based pull, conflict resolution, and integrity validation |
| health-ingestion.md | Native iOS ingestion lanes, atomic persistence, upload path, and projection refresh |
| nativeBLE.md | CoreBluetooth runtime, state restoration, bridge design, and protocol boundaries |
| data-flow.md | End-to-end data movement across layers, pipelines, and external boundaries |
| decisions/ | Architectural Decision Records |