This document contains 21 ADRs for the AppPlatform backend, capturing the context, alternatives, rationale, and consequences of significant architectural decisions. ADRs explain why the system is designed the way it is — not just what it is.
| ADR | Decision |
|---|---|
| ADR-001 | Backend Dependency Injection and Composition Root |
| ADR-002 | Transactional Outbox Pattern for Durable Eventing |
| ADR-003 | Health Data Ingestion Pipeline (Three-Lane Architecture) |
| ADR-004 | Health Data Watermark-Based Freshness (P0-G) |
| ADR-005 | API Gateway Pattern and Cross-Cutting Concerns |
| ADR-006 | Client-Generated IDs for Idempotent Entity Sync |
| ADR-007 | Configurable Conflict Resolution Strategies |
| ADR-008 | Optimistic Locking for Data Integrity |
| ADR-009 | Cursor-Based Pagination for Incremental Sync |
| ADR-010 | Health Data Privacy Gating |
| ADR-011 | Health Sample Payload Hash for Request Idempotency |
| ADR-012 | Session Telemetry Precomputation and Caching (P0-G.1) |
| ADR-013 | Multi-Factor Inventory Prediction Engine |
| ADR-014 | Dirichlet-Multinomial Temporal Consumption Patterns |
| ADR-015 | BullMQ Job Payload Size Limit |
| ADR-016 | HTTP Compression for Health Data Uploads |
| ADR-017 | Server-Time Header for Client Clock Offset Calculation |
| ADR-018 | Centralized Health Metric Definitions (Shared Contract) |
| ADR-019 | Automated Stale Processing Cleanup (Reapers) |
| ADR-020 | Soft Delete for Health Samples with Purge Policy |
| ADR-021 | Global Stale Session Reconciliation |
ADRs serve as the architectural record of the backend's evolutionary path. Every ADR is deeply grounded in the actual implementation — referencing specific files, classes, configuration parameters, and code patterns.
Purpose:
- Onboarding — Accelerate understanding for new team members
- Maintenance — Justify existing designs and inform future changes
- Alignment — Ensure a shared understanding of core architectural principles
- Debugging — Provide historical context for troubleshooting complex issues
- Auditing — Document key security, privacy, and data integrity choices
Scope — ADRs are created for decisions that are:
- Significant — Impact multiple services, modules, or layers
- Non-trivial — Involve complex trade-offs or a choice between several viable alternatives
- Potentially Irreversible — Introduce new technologies, fundamental design patterns, or major data model changes
- High Impact — Affect performance, scalability, security, cost, or maintainability
Contribution Guide: New ADRs should be created when a significant architectural decision is made. Follow the template below and ensure the ADR is grounded in the codebase. All ADRs should be reviewed by at least one other senior engineer.
ADR-001: Backend Dependency Injection and Composition Root
As the AppPlatform backend evolved, it faced increasing complexity in managing service dependencies. Traditional patterns like global singletons or service locators led to:
- Tight coupling, making components hard to test in isolation
- Hidden dependencies, obscuring the true graph of service interactions
- Difficulty in replacing implementations (e.g., mock databases for testing)
- Challenges in configuring services with runtime parameters
This created a maintenance burden and hindered the adoption of high-quality testing practices.
The backend adopted Pure Constructor Dependency Injection (DI) as the primary mechanism for managing dependencies. bootstrap.ts was designated as the sole "composition root" responsible for instantiating and wiring all services, repositories, controllers, and middleware. No service should internally resolve its own dependencies (e.g., no Service.getInstance() calls within other services).
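To make the pattern concrete, here is a minimal, self-contained sketch of constructor injection with a single composition root. `ConsoleLogger`, `UserService`, and `bootstrap()` are illustrative stand-ins, not the actual backend classes:

```typescript
// A dependency is expressed as an abstraction the consumer depends on.
interface LoggerService {
  info(msg: string): void;
}

class ConsoleLogger implements LoggerService {
  info(msg: string): void {
    console.log(`[info] ${msg}`);
  }
}

// A service never resolves its own dependencies; they arrive via the constructor.
class UserService {
  constructor(private logger: LoggerService) {}

  greet(name: string): string {
    this.logger.info(`greeting ${name}`);
    return `Hello, ${name}`;
  }
}

// The composition root (the role bootstrap.ts plays): the only place where
// concrete implementations are chosen and wired together.
function bootstrap(): UserService {
  const logger: LoggerService = new ConsoleLogger();
  return new UserService(logger);
}

// In tests, the same constructor accepts a silent fake — no globals to patch.
const silentLogger: LoggerService = { info: () => {} };
const testService = new UserService(silentLogger);
```

The real `bootstrap.ts` plays the role of `bootstrap()` here, at a much larger scale.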
- Global Singletons with Internal `getInstance()`: A prevalent pattern in earlier iterations.
- Rejected: Leads to tight coupling and global state, makes testing difficult, and obscures the dependency graph.
- Service Locator Pattern: A central registry (`ServiceContainer`) where services register themselves and others look them up by name.
- Rejected: Still hides dependencies, making it hard to understand what a service needs without inspecting its runtime behavior. Can lead to runtime errors if services are requested before registration.
- Setter/Property Injection: Dependencies injected via setter methods or public properties after object creation.
- Rejected: Makes it harder to guarantee that dependencies are present (nullable types proliferate), introduces an extra step in object initialization, and can lead to runtime errors if setters are never called.
Pros:
- Testability — Enables easy mocking of dependencies during unit and integration testing. Services are isolated by default.
- Clarity — Dependencies are explicit in constructor signatures, making the codebase easier to read and understand.
- Maintainability — Promotes modularity and reduces coupling. Changing a dependency's implementation does not require changing its consumers.
- Architectural Enforcement — Reinforces SOLID principles, particularly SRP and DIP.
- Configuration — Simplifies runtime configuration by centralizing object creation.
Cons:
- Boilerplate — Can lead to long constructor signatures in complex services.
- Circular Dependencies — Requires careful design to avoid circular dependencies during wiring at the composition root (though TypeScript helps detect these).
- Composition Root Complexity — `bootstrap.ts` becomes a large and complex file responsible for orchestrating the entire application.
Consequences:
- `bootstrap.ts` grew significantly, becoming the single "knowledge hub" for the entire application's wiring.
- New engineers must understand the DI pattern and consult `bootstrap.ts` to grasp the full dependency graph.
- Unit tests for services became simpler, focusing purely on business logic without mocking complex global state.
- Easier to refactor and evolve services without cascading dependency changes.
- TypeScript's static analysis strongly enforces dependency contracts, catching many wiring errors at compile time.
This decision directly upholds Dependency Inversion Principle (DIP) by forcing services to depend on abstractions rather than concrete implementations. It reinforces Single Responsibility Principle (SRP) by making dependencies explicit. It aligns with the "Pure core, imperative shell" pattern where bootstrap.ts is the imperative shell and individual services form the pure core.
Implementation Details
- `packages/backend/src/bootstrap.ts` — The core wiring logic, where all services are instantiated and passed to their consumers.
- `packages/backend/src/app.ts` — The `App` class constructor explicitly takes dependencies:
```typescript
export class App {
  private app: Application;
  private initialized: boolean = false;

  constructor(
    private logger: LoggerService,
    private securityConfigService: ConfigSecurityService,
  ) { /* ... */ }
}
```
- Service Constructors — Almost all services demonstrate constructor injection, e.g. `packages/backend/src/services/consumption.service.ts`:
```typescript
export class ConsumptionService extends CorrelationAwareService {
  constructor(
    private consumptionRepository: ConsumptionRepository,
    private sessionRepository: SessionRepository,
    private dailyStatRepository: DailyStatRepository,
    // ... many other dependencies
    logger: LoggerService,
    private outboxService: OutboxService,
    private domainEventService: DomainEventService,
    performanceMonitoringService: PerformanceMonitoringService,
    correlationTracker: CorrelationTrackerService,
    private personalizedConsumptionRateService: PersonalizedConsumptionRateService,
    private db: DatabaseService,
  ) { /* ... */ }
}
```
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: ARCHITECTURE.MD, WORKER-SCALABILITY.MD
ADR-002: Transactional Outbox Pattern for Durable Eventing
When a service needs to both update its database state and publish an event to notify other services (e.g., a "consumption created" event), a "dual-write problem" arises. If the database transaction commits but the event publish fails (or vice versa), the system enters an inconsistent state (e.g., data is updated but other services are unaware, or an event is published for an uncommitted change). This leads to:
- Data integrity issues and eventual consistency failures for derived data
- Difficult debugging due to partial state changes
- Increased complexity in recovery from failures
This was especially critical for high-volume operations like health data ingestion and core entity CRUD.
The Transactional Outbox Pattern was implemented for all critical domain events where atomicity between database state changes and event publishing is paramount. A dedicated OutboxService manages an OutboxEvent table (in PostgreSQL), ensuring that events are written to the outbox within the same database transaction as the primary business data change. A separate, asynchronous OutboxProcessorService then polls the OutboxEvent table and dispatches these events.
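The decision can be sketched with an in-memory stand-in for the database transaction (the real code uses Prisma's `TransactionClient`); everything below except the outbox concept itself is illustrative:

```typescript
interface OutboxEvent { id: number; type: string; payload: unknown; dispatchedAt?: number }

// Toy "transaction": both writes are kept together or both are rolled back,
// mirroring BEGIN/COMMIT semantics.
class InMemoryDb {
  consumptions: { id: string }[] = [];
  outbox: OutboxEvent[] = [];

  transaction<T>(fn: (tx: InMemoryDb) => T): T {
    const snapshot = { c: [...this.consumptions], o: [...this.outbox] };
    try {
      return fn(this);
    } catch (err) {
      this.consumptions = snapshot.c; // roll back on failure
      this.outbox = snapshot.o;
      throw err;
    }
  }
}

let nextId = 1;
function createConsumptionWithEvent(db: InMemoryDb, consumptionId: string): void {
  db.transaction((tx) => {
    tx.consumptions.push({ id: consumptionId }); // business write
    tx.outbox.push({
      id: nextId++,
      type: "consumption.created",
      payload: { consumptionId },
    }); // outbox write in the SAME transaction
  });
}

// A separate poller dispatches undelivered events (at-least-once delivery).
function pollOutbox(db: InMemoryDb, dispatch: (e: OutboxEvent) => void): number {
  const pending = db.outbox.filter((e) => e.dispatchedAt === undefined);
  for (const e of pending) {
    dispatch(e);
    e.dispatchedAt = Date.now();
  }
  return pending.length;
}
```

The poller plays the role of `OutboxProcessorService`: because the event row commits atomically with the business row, a crash before dispatch loses nothing — the event is simply picked up on the next poll.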
- Direct Event Emission (Fire-and-Forget): Publishing events directly to an in-memory event bus (`DomainEventService`) or message queue (e.g., Redis Pub/Sub) after the database transaction commits.
- Rejected: Prone to dual-write inconsistencies. If the service crashes between DB commit and event publish, the event is lost.
- Two-Phase Commit (2PC): Coordinating a distributed transaction across the database and a message queue.
- Rejected: High complexity, performance overhead, and not all message queues support 2PC. Introduces external system coupling into the transaction.
- Change Data Capture (CDC): Using a database's transaction log (e.g., PostgreSQL WAL) to extract and publish events.
- Rejected: Higher infrastructure complexity (requires specialized CDC tools like Debezium), increased operational overhead, and tighter coupling to database internals.
Pros:
- Atomicity — Guarantees that the event is published if and only if the primary database transaction commits, ensuring strong consistency for the source data and at-least-once delivery for events.
- Resilience — Events persist in the outbox table even if the event dispatcher crashes, ensuring they are eventually processed.
- Decoupling — Primary services remain decoupled from the event dispatch mechanism.
- Auditability — The `OutboxEvent` table provides an audit log of all outgoing events.
Cons:
- Increased Latency — Events are not published instantly; there is a small delay (typically seconds) as `OutboxProcessorService` polls the `OutboxEvent` table.
- Infrastructure Overhead — Requires an `OutboxEvent` table and a dedicated polling processor.
- Complexity — Adds an extra layer of abstraction to event publishing.
Consequences:
- Significantly improved data integrity for all derived data products (analytics, projections, predictions) that rely on domain events.
- Services now call `outboxService.addEvent(tx, ...)` within their `Prisma.$transaction` instead of `domainEventService.emitEvent()` directly.
- `OutboxProcessorService` is responsible for handling event delivery failures (retries, dead-letter queue) rather than the original service.
- Introduction of a slight delay in event propagation, though acceptable for most domain events.
- Downstream subscribers must be idempotent, as events are delivered at-least-once.
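Because delivery is at-least-once, a subscriber must tolerate replays. A minimal sketch of dedup-by-event-ID handling (the real subscribers and their storage are not shown in this ADR, so `IdempotentSubscriber` is purely illustrative):

```typescript
// Records processed event IDs and skips duplicate deliveries, so the side
// effect is applied exactly once per logical event.
class IdempotentSubscriber {
  private processed = new Set<number>();
  public applied: unknown[] = [];

  handle(eventId: number, payload: unknown): boolean {
    if (this.processed.has(eventId)) return false; // replayed delivery: no-op
    this.processed.add(eventId);
    this.applied.push(payload); // side effect happens exactly once
    return true;
  }
}
```

In a real service, the processed-ID set would live in durable storage (or the handler would be naturally idempotent, e.g. an upsert keyed by the event ID).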
Strongly aligns with CQRS by separating the write model from event publishing. Supports Event-Driven Architecture principles by using events as the primary communication mechanism between loosely coupled services. The OutboxService implements an Asynchronous Messaging Pattern within the service boundaries.
Implementation Details
- `packages/backend/src/services/outbox.service.ts` — Core logic for adding and processing outbox events:
```typescript
export class OutboxService {
  // ...
  public async addEvent(
    tx: Prisma.TransactionClient,
    eventData: OutboxEventData,
  ): Promise<OutboxEvent> {
    // ... writes to tx.outboxEvent.create ...
  }
  // ...
}
```
- `packages/backend/src/services/healthSample.service.ts` — Demonstrates the transactional outbox for `health.samples.changed` events:
```typescript
export class HealthSampleService {
  // ...
  private createOutboxCallback( /* ... */ ): HealthIngestOutboxCallback {
    return async (tx: Prisma.TransactionClient, result: BatchUpsertResult, watermarkAfter: bigint) => {
      // ... constructs eventPayload ...
      await this.outboxRepository!.createInTransaction(tx, { /* ... eventPayload ... */ });
    };
  }
  // ...
}
```
- `packages/backend/src/services/outbox-processor.service.ts` — Background worker that polls the `OutboxEvent` table and dispatches events.
- `packages/backend/prisma/schema.prisma` — Defines the `OutboxEvent` model.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: ARCHITECTURE.MD, PROJECTION-PIPELINE.MD, DATA-INTEGRITY-GUARANTEES.MD
ADR-003: Health Data Ingestion Pipeline (Three-Lane Architecture)
Ingesting health data from client platforms (HealthKit, Health Connect) presented several challenges:
- UX Freshness — Users expect to see their latest vital signs immediately upon opening the app.
- Historical Backfill — New users have years of historical data that needs to be synced without blocking the UI.
- Deletion Detection — Health data can be deleted or edited on the source platform, requiring efficient detection and propagation to the backend.
- Resource Management — Querying large date ranges or using inefficient methods can drain battery, consume excessive memory, and hit API rate limits.
- Cross-Platform Consistency — Ensuring identical ingestion logic and behavior across iOS (HealthKit) and Android (Health Connect), and between JS and native code.
The initial approach of a single, monolithic ingestion loop struggled to balance these conflicting requirements.
A Three-Lane Ingestion Architecture (HOT, COLD, CHANGE) was adopted:
- HOT Lane — Prioritizes UI freshness, using date-range queries for recent data, potentially with a two-pass strategy (first-paint + catch-up).
- COLD Lane — Focuses on historical backfill in bounded chunks, progressing backward from a time-cursor. Operates in the background.
- CHANGE Lane — Detects deletions and edits using anchored queries, updating records accordingly.
This orchestration is managed client-side by HealthSyncService (JS) and delegated to an IHealthIngestionDriver (abstracting native Swift and JS fallback implementations).
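As a sketch of the lane-cursor mechanics, the following models the COLD lane walking backward one bounded chunk per tick. The cursor scopes and the 7-day window come from this ADR; `LaneCursors` and `coldLaneTick` are simplified stand-ins for `HealthCursorRepository` and the lane scheduler:

```typescript
// Each lane keeps its own cursor under a distinct scope, so lanes progress
// (and fail) independently.
type LaneScope = "hot_anchor" | "cold_time" | "change_anchor";

class LaneCursors {
  private cursors = new Map<LaneScope, number>();

  get(scope: LaneScope, fallback: number): number {
    return this.cursors.get(scope) ?? fallback;
  }
  set(scope: LaneScope, value: number): void {
    this.cursors.set(scope, value);
  }
}

const COLD_CHUNK_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day chunks (from this ADR)

// COLD lane: one bounded chunk per tick, moving backward in time. Bounded
// queries keep memory and battery in check, and the persisted cursor lets
// backfill resume after restarts without re-fetching completed chunks.
function coldLaneTick(cursors: LaneCursors, nowMs: number): { fromMs: number; toMs: number } {
  const toMs = cursors.get("cold_time", nowMs); // first run starts at "now"
  const fromMs = toMs - COLD_CHUNK_WINDOW_MS;
  cursors.set("cold_time", fromMs); // next tick resumes further back in time
  return { fromMs, toMs };
}
```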
- Single Monolithic Ingestion: A single loop attempting to do all three tasks (fetch recent, backfill history, detect changes).
- Rejected: Difficult to balance UX freshness with deep backfill. High risk of UI blocking, poor resource management.
- Time-Windowed Full Re-sync: Periodically fetching all data within a rolling 90-day window.
- Rejected: Inefficient for detecting deletions/edits (requires full diff), prone to missing data if queries time out for dense windows.
- Push-Only from Source: Relying entirely on HealthKit background delivery notifications.
- Rejected: Not reliable enough for critical functions (notifications can be missed), doesn't cover historical backfill.
Pros:
- Optimized UX — HOT lane ensures immediate display of fresh data.
- Efficient Backfill — COLD lane systematically backfills history in manageable chunks, preventing UI blocking.
- Data Correctness — CHANGE lane guarantees propagation of deletions and edits.
- Resource Management — Bounded queries prevent excessive battery drain and API rate limit hits.
- Resilience — Lane isolation ensures one lane's failure doesn't block others. Idempotent operations.
- Cross-Platform Alignment — Provides a unified, semantic contract for both iOS and Android ingestion.
Cons:
- Increased Complexity — Three distinct lanes require more sophisticated orchestration logic.
- Cursor Management — Each lane requires its own cursor (`hot_anchor`, `cold_time`, `change_anchor`) to operate independently.
- Native/JS Driver Abstraction — Adds an abstraction layer (`IHealthIngestionDriver`) to accommodate platform-specific implementations.
- Operational Visibility — Requires detailed logging and metrics per lane to monitor health and progress.
Consequences:
- `HealthSyncService` (app-side) became a central orchestrator, managing timers, app state, network status, and routing to the appropriate lanes.
- Required extending the `health_ingest_cursors` table with a `scope` column (`hot_anchor`, `cold_time`, `change_anchor`).
- Development of ingestion logic is now lane-specific, requiring an understanding of lane semantics.
- Significantly improved responsiveness for recent data and more reliable background backfill.
- More structured logging per lane aids in diagnosing ingestion issues.
Reinforces Separation of Concerns by clearly delineating the responsibilities of each ingestion lane. Uses Strategy Pattern via IHealthIngestionDriver for platform-specific implementations. The architecture is Resilient by isolating failures and supporting idempotent operations.
Implementation Details
- `packages/app/src/services/health/HealthSyncService.ts` — The main orchestrator of the three lanes on the app side.
- `packages/app/src/services/health/HealthIngestionEngine.ts` — The core ingestion logic (JS implementation), delegating to `HealthDataProviderAdapter`.
- `packages/app/src/services/health/HealthKitAdapter.ts` — The iOS-specific implementation of `HealthDataProviderAdapter`.
- `packages/app/src/services/health/types/ingestion-driver.types.ts` — Defines the `IHealthIngestionDriver` interface and `NativeErrorCode` enum.
- `packages/app/src/repositories/health/HealthCursorRepository.ts` — Manages lane-specific cursors.
- `packages/app/src/services/health/HealthSyncCoordinationState.ts` — Defines lane constants:
```typescript
public readonly HOT_OVERLAP_MS = 5 * 60 * 1000; // 5 minutes
public readonly COLD_CHUNK_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7 days
public readonly CHANGE_LANE_INTERVAL_MS = 21_600_000; // 6 hours
```
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: HEALTH-INGESTION-PIPELINE.MD, FAILURE-MODES.MD, WORKER-SCALABILITY.MD
ADR-004: Health Data Watermark-Based Freshness (P0-G)
Derived health data (e.g., daily rollups, sleep summaries, session impact) is computed asynchronously from raw health samples. This introduces a challenge: how can a client reliably know whether the displayed derived data is fresh relative to the underlying raw data, or stale and in need of recomputation?
Relying solely on time-based TTL (e.g., "cache valid for 5 minutes") is insufficient, as new raw data could arrive seconds after a cache is generated, rendering it immediately stale. This leads to:
- Users seeing outdated insights or aggregates
- Lack of a clear mechanism to trigger recomputation
- Ambiguity between "no data" and "stale data"
A Watermark-Based Freshness Model was implemented for all derived health data products (projections):
- A `UserHealthWatermark` table stores a monotonically increasing `sequenceNumber` for each user, incremented on any mutation to their raw `HealthSample` data.
- Derived projection rows (e.g., `UserHealthRollupDay`, `UserSleepNightSummary`) store the `sourceWatermark` (the `sequenceNumber` active when they were computed).
- During reads, `HealthProjectionReadService` compares the `currentWatermark` against the `derivedRow.sourceWatermark`. If `currentWatermark > derivedRow.sourceWatermark`, the derived row's freshness status is overridden to `STALE`.
- The API response includes `FreshnessMeta` fields (`status`, `computedAtMs`, `sourceWatermark`, `computeVersion`).
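The read-side comparison can be sketched as follows. The field names follow the ADR's `FreshnessMeta`, but the function below is a simplified stand-in for the logic inside `HealthProjectionReadService`:

```typescript
type FreshnessStatus = "FRESH" | "STALE";

interface FreshnessMeta {
  status: FreshnessStatus;
  computedAtMs: number;
  sourceWatermark: bigint; // sequenceNumber active when the row was computed
}

// If raw samples mutated after the projection was computed, the projection is
// stale regardless of how recently it was cached — this is what makes the
// watermark model more accurate than any time-based TTL.
function applyWatermarkFreshness(meta: FreshnessMeta, currentWatermark: bigint): FreshnessMeta {
  if (currentWatermark > meta.sourceWatermark) {
    return { ...meta, status: "STALE" };
  }
  return meta;
}
```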
- Time-Based TTL (Cache expiration): Simply invalidating derived data after a fixed time.
- Rejected: Inaccurate. New data could arrive well within the TTL, making the data stale but still "valid" by TTL. Conversely, no new data might arrive, leading to unnecessary recomputations.
- Event-Driven Invalidation (Direct): On `health.samples.changed`, directly invalidate all affected derived caches.
- Rejected: For complex, multi-day aggregates, direct invalidation is difficult to scope precisely and can lead to over-invalidation (thrashing). The watermark pattern offers a lazier, pull-based invalidation.
- Periodic Full Recomputation: Recomputing all derived data nightly.
- Rejected: Not granular enough for real-time responsiveness. Data could be stale for up to 24 hours.
Pros:
- Accuracy — Precisely tracks staleness relative to source data mutations, eliminating ambiguity.
- Efficiency — Recomputation is triggered only when needed (data is stale or missing).
- Transparent UI — Provides clear signals to the frontend on whether to show "live," "updating," "stale," or "error" states.
- Resilience — Handles out-of-order event processing and replica lag gracefully (`assertWatermarkFreshness` defers processing if the watermark is too low).
- Auditability — `sourceWatermark` provides a traceable link from derived data back to the state of the raw data.
Cons:
- Overhead — Requires an additional database query to fetch the current watermark on each read API call.
- Complexity — Adds the `UserHealthWatermark` table and logic to update it on every raw-data mutation.
- Client Implementation — The frontend must implement logic to interpret `FreshnessMeta` and trigger UI updates/refreshes.
Consequences:
- Introduction of the `UserHealthWatermark` table and `sourceWatermark`/`status` columns on projection tables.
- `HealthSampleService` is responsible for incrementing the watermark sequence number on `health.samples.changed` events.
- `HealthProjectionReadService` queries the watermark and applies freshness overrides.
- All derived health data DTOs (`HealthRollupDayDto`, `SleepNightSummaryDto`, etc.) now include `freshness: ProjectionFreshnessMeta`.
- UI components consuming derived health data must adapt to handle `FreshnessStatus` (e.g., show spinners for `COMPUTING`, badges for `STALE`).
Reinforces CQRS by explicitly separating the write side (mutating HealthSample and UserHealthWatermark) from the read side (querying derived projections and the watermark). Improves Data Integrity by providing a robust mechanism for staleness detection.
Implementation Details
- `packages/backend/prisma/schema.prisma` — Defines `UserHealthWatermark` and the `sourceWatermark`/`status` fields on projection models.
- `packages/backend/src/repositories/user-health-watermark.repository.ts` — Manages sequence numbers.
- `packages/backend/src/services/healthSample.service.ts` — Increments the watermark within the transactional outbox callback:
```typescript
// excerpt from createOutboxCallback
await this.outboxRepository!.createInTransaction(tx, { /* ... eventPayload ... */ });
await this.watermarkRepo!.incrementSequenceNumberInTransaction(tx, userId, watermarkAfter);
```
- `packages/backend/src/services/health-projection-read.service.ts` — Applies watermark freshness overrides:
```typescript
// excerpt from applyWatermarkFreshness
const currentWatermark = await this.watermarkRepo.getSequenceNumber(userId);
if (currentWatermark > derivedMeta.sourceWatermark) {
  // Override status to STALE
}
```
- `packages/backend/src/services/health-projection-coordinator.service.ts` — Projection handlers use `assertWatermarkFreshness` for read-replica lag detection:
```typescript
// excerpt from HealthRollupProjectionHandler.handle
assertWatermarkFreshness(currentWatermark, payload, 'HealthRollupProjectionHandler');
```
- `packages/shared/src/contracts/health-projection.contract.ts` — Defines `ProjectionFreshnessMeta` and related helpers.
- `packages/shared/src/health-config/freshness-types.ts` — Defines `FreshnessStatus` and `FreshnessMeta` utilities.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: PROJECTION-PIPELINE.MD, OBSERVABILITY.MD
ADR-005: API Gateway Pattern and Cross-Cutting Concerns
As the backend grew, managing cross-cutting concerns (authentication, logging, rate limiting, security headers, external service calls) within individual route handlers became cumbersome, repetitive, and error-prone. This led to:
- Duplicated code across controllers and routes
- Inconsistent application of security policies
- Difficulty in auditing and monitoring due to scattered logic
- Tight coupling of controllers to infrastructure concerns
A centralized, declarative approach was needed to ensure consistency, maintainability, and enforce global policies.
An API Gateway Pattern was implemented within the Express.js application, managed by APIGatewayManager and orchestrated by MiddlewareFactory. This centralizes the configuration and application of all cross-cutting concerns:
- Middleware Pipeline — All requests flow through a predefined middleware stack (`app.ts`).
- Authentication/Authorization — Centralized via `auth.middleware.ts`, `auth-monitoring.middleware.ts`, and `authorization.middleware.ts`.
- Logging/Correlation — Managed by `correlationContext.middleware.ts` and `logging.middleware.ts`.
- Rate Limiting — Enforced by `rateLimitQueue.middleware.ts` and monitored by `rate-limit-monitoring.middleware.ts`.
- Security Headers — Applied by `httpsEnforcement.middleware.ts`.
- External Service Calls — Abstracted through `APIGatewayManager` for circuit breaking, retries, and correlation.
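The pipeline idea can be sketched with two toy middleware and a tiny factory that fixes their order. The names are stand-ins for the real middleware files listed above, and the composition helper is illustrative rather than Express's actual internals:

```typescript
type Req = { headers: Record<string, string>; correlationId?: string };
type Middleware = (req: Req, next: () => void) => void;

// Runs first: every later middleware (and handler) can rely on a correlation id.
const correlationContext: Middleware = (req, next) => {
  req.correlationId = req.headers["x-correlation-id"] ?? "generated-id";
  next();
};

const trace: string[] = [];
// Runs later: logging sees the correlation id because the stack order
// guarantees correlationContext already executed.
const requestLogging: Middleware = (req, next) => {
  trace.push(`request ${req.correlationId}`);
  next();
};

// A tiny "factory": the one place that fixes the order of the stack, so every
// request is subject to the same concerns in the same sequence.
function buildPipeline(stack: Middleware[]): (req: Req) => void {
  return (req) =>
    stack.reduceRight<() => void>((next, mw) => () => mw(req, next), () => {})();
}
```

The design point is that route handlers never assemble this chain themselves; they receive it fully composed, which is what keeps policy application uniform.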
- Ad-hoc Middleware: Registering middleware directly in `app.use()` calls or on individual routes without a factory pattern.
- Rejected: Leads to boilerplate and inconsistencies, and makes it hard to enforce global policies or change the middleware stack order.
- External API Gateway: Using a dedicated service (e.g., AWS API Gateway, Nginx, Kong) outside the application.
- Rejected: Adds infrastructure complexity and operational overhead for a smaller application. Current Express.js-based solution is sufficient.
Pros:
- Consistency — Guarantees all requests are subject to the same set of cross-cutting concerns.
- Maintainability — Centralizes logic in dedicated middleware files and a factory, reducing code duplication.
- Security — Enforces global security policies (auth, rate limits, headers) uniformly.
- Observability — Integrates logging and performance monitoring at key points in the request lifecycle.
- Decoupling — Controllers focus solely on business logic, adhering to SRP.
- Flexibility — `MiddlewareFactory` allows dynamic composition of middleware chains for specific routes.
Cons:
- Learning Curve — Requires understanding the middleware factory and how the pipeline is constructed.
- Debugging — Tracing execution through multiple middleware layers can be challenging without proper correlation IDs.
- Overhead — Each request incurs the overhead of multiple middleware executions.
Consequences:
- `bootstrap.ts` orchestrates the initialization of `APIGatewayManager` and `MiddlewareFactory`, then passes the constructed middleware stack to `app.ts`.
- Route files now focus on defining routes and their specific validation schemas, relying on the middleware factory to provide common concerns.
- Changes to cross-cutting concerns are made in the middleware layer/factory, not in individual route handlers.
- Carefully designed middleware and optimized services (e.g., Redis for caching/rate limiting) mitigate the per-request overhead.
- `correlationContext.middleware.ts` is crucial for tracing requests through the extensive middleware pipeline.
Strongly reinforces Separation of Concerns by extracting cross-cutting concerns from business logic. Promotes Modularity and Maintainability by centralizing configuration and application of middleware. The use of factories adheres to Dependency Inversion Principle (DIP).
Implementation Details
- `packages/backend/src/app.ts` — Main Express app configuration where the middleware stack is applied:
```typescript
// excerpt from setupMiddleware
public async setupMiddleware(middleware: MiddlewareStack): Promise<void> {
  this.app.use(helmet({ /* ... */ }));
  this.app.use(compression({ /* ... */ }));
  this.app.use(cors(corsConfig));
  this.app.use(middleware.httpsEnforcement);
  this.app.use(middleware.correlationContext);
  this.app.use(middleware.serverTime); // ADR-017
  this.app.use(middleware.apiGateway);
  this.app.use(middleware.requestLogging);
  // ... and many more
}
```
- `packages/backend/src/core/middleware-factory.ts` — Factory responsible for creating all middleware instances with their dependencies.
- `packages/backend/src/api/v1/middleware/correlationContext.middleware.ts` — Initializes the correlation ID for tracing.
- `packages/backend/src/api/v1/middleware/auth.middleware.ts` — Handles JWT authentication logic.
- `packages/backend/src/api/v1/middleware/rateLimitQueue.middleware.ts` — Enforces API rate limits.
- `packages/backend/src/api/v1/middleware/server-time.middleware.ts` — Adds the `Server-Time` header (ADR-017).
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: ARCHITECTURE.MD, SECURITY-COMPLIANCE.MD, OBSERVABILITY.MD
ADR-006: Client-Generated IDs for Idempotent Entity Sync
In an offline-first mobile application, clients often create new entities (e.g., a Consumption, Purchase, Device) locally while disconnected. When the client later syncs, these entities are pushed to the backend. Without a robust mechanism to identify these client-generated entities:
- Duplicate Records — Retrying a failed `CREATE` could lead to multiple identical server-side records.
- Inconsistent ID Mapping — The backend might generate its own UUIDs, making it difficult for the client to map local IDs to server IDs.
- Referential Integrity — Dependent entities would have broken foreign keys if the parent's ID changed on the server.
The system adopted client-generated UUIDs for key entities (Consumption, Purchase, Device, JournalEntry, Product, InventoryItem) as idempotency keys. These client-generated UUIDs are stored in dedicated `client...Id` columns (e.g., `clientConsumptionId`, `clientPurchaseId`) and used for idempotent `CREATE` operations.
Upon successful server-side processing, the backend's internal UUID (which might be the client-generated UUID or a server-assigned one if an existing entity was found) is returned to the client for canonical mapping. This allows the client to update its local primary key and FK references.
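The idempotent-create path can be sketched with an in-memory unique index standing in for the Prisma unique constraint. `ConsumptionStore` and its method shapes are illustrative; only the `clientConsumptionId` column name comes from the ADR:

```typescript
interface Consumption { id: string; clientConsumptionId: string }

class ConsumptionStore {
  // Stand-in for the unique(userId, clientConsumptionId) index (single user here).
  private byClientId = new Map<string, Consumption>();

  // Retry-safe CREATE: replaying the same clientConsumptionId returns the
  // existing row's canonical server id instead of inserting a duplicate —
  // the behavior the unique constraint + conflict handling provides in SQL.
  create(clientConsumptionId: string, newServerId: string): Consumption {
    const existing = this.byClientId.get(clientConsumptionId);
    if (existing) return existing;
    const row: Consumption = { id: newServerId, clientConsumptionId };
    this.byClientId.set(clientConsumptionId, row);
    return row;
  }
}
```

The returned row's `id` is what the client uses for canonical mapping: on a retry it learns the server id assigned by the first, successful attempt.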
- Server-Generated IDs Only: The client sends data without an ID; the server generates a new ID for every `CREATE` and returns it.
- Rejected: Requires clients to manage a temporary ID-mapping table, complicating offline-first operation. Retries become difficult without a client-side idempotency key.
- Natural Keys for Idempotency: Using business-domain fields (e.g., `(timestamp, userId, productId)`) as a composite key.
- Rejected: Natural keys are often not globally unique, can change over time, and may not exist for all entities. UUIDs are robust and simple.
- Using Client-Generated ID as Primary Key: Client-generated UUID becomes the primary key in the backend.
- Rejected: Tightly couples the backend's primary key strategy to the client's. The current model uses the client ID for idempotency but allows the backend to return its canonical primary key.
Pros:
- Robust Offline-First — Enables clients to create data locally with globally unique identifiers without network connectivity.
- Idempotent `CREATE` Operations — Retrying network requests for `CREATE` is safe, preventing duplicate server-side records.
- Simplified Client Logic — Clients don't need complex temporary ID management.
- Clear ID Mapping — The backend explicitly returns the canonical server-assigned ID for the client to update its local state.
- Error Detection — Duplicate `client...Id` values can indicate client-side bugs or concurrency issues.
Cons:
- Database Overhead — Requires an extra `client...Id` column and unique index for each entity type.
- Client Implementation — Clients must reliably generate and store UUIDs for new records.
- Backend Complexity — Backend handlers must explicitly check for `client...Id` on `CREATE` and handle unique-constraint violations.
- Semantic Drift — If clients fail to provide these IDs, the backend auto-generates them, which breaks the original idempotency contract for that client and creates a risk of duplicate records in corner cases.
Consequences:
- `prisma/schema.prisma` defines `clientConsumptionId`, `clientPurchaseId`, `clientEntryId`, etc., with unique indexes.
- `PurchaseService.createPurchase`, `ConsumptionService.createConsumption`, and `JournalService.createJournalEntry` explicitly handle `client...Id` for deduplication.
- `CreateConsumptionSchema`, `CreatePurchaseSchema`, and `CreateJournalEntrySchema` (from `@shared/contracts`) include `client...Id` fields, making them mandatory for new entities.
- `SyncService.processPushSync` tracks `clientId`-to-`serverId` mappings to ensure FK references are correctly updated during cascade operations.
Supports Idempotency as a key principle for distributed systems. Facilitates an Offline-First Architecture and promotes Resilience against network failures. By making client IDs part of the contract, it encourages a Contract-First Development approach.
Implementation Details
`packages/backend/prisma/schema.prisma`:

```prisma
model Consumption {
  // ...
  clientConsumptionId String? @map("client_consumption_id")
  // ...
  @@unique([userId, clientConsumptionId], name: "user_clientConsumptionId_unique")
}
```

`packages/backend/src/services/purchase.service.ts` — the `createPurchase` method handles `clientPurchaseId`:

```typescript
// excerpt from createPurchase
const purchaseId = uuidv4();
const clientPurchaseIdProvided = !!data.clientPurchaseId;
const clientPurchaseId = data.clientPurchaseId || uuidv4(); // Generate if not provided
const result = await retryWithBackoff(async () => {
  return await this.db.getClient().$transaction(async (tx) => {
    // ... createWithOutboxEvent handles INSERT ... ON CONFLICT DO NOTHING ...
  });
});
```

`packages/shared/src/contracts/health.contract.ts` — Defines `BatchUpsertSamplesRequest.requestId` for batch-level idempotency and `sourceRecordId` for sample-level idempotency.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SYNC-ENGINE.MD`, `DATA-INTEGRITY-GUARANTEES.MD`, `FAILURE-MODES.MD`
In a multi-device, offline-first sync system, concurrent modifications to the same entity inevitably lead to conflicts. Without clear, deterministic rules:
- Data Loss — One device's changes might silently overwrite another's.
- Inconsistent State — Devices could end up with different versions of the same entity.
- User Confusion — Unpredictable behavior when syncing.
- Developer Burden — Manual, ad-hoc merge logic in each handler leads to inconsistencies and bugs.
The absence of a centralized, configurable conflict resolution policy made the sync engine fragile and hard to extend.
A Configurable Conflict Resolution System was adopted, formalizing merge strategies and field-level policies:
- Conflict strategies (`SERVER_WINS`, `CLIENT_WINS`, `LAST_WRITE_WINS`, `MERGE`, `MANUAL`) and field policies (`LOCAL_WINS`, `SERVER_WINS`, `MERGE_ARRAYS`, `MONOTONIC`, `MAX_VALUE`, `SERVER_DERIVED`) are centrally defined in `@shared/sync-config/conflict-strategies.ts`.
- Each `EntityType` has a specific `EntityConflictConfig` in `@shared/sync-config/conflict-configs.ts` that specifies its default strategy, field-level overrides, and server-derived fields.
- The `SyncService` utilizes the Strategy Pattern with `SyncEntityHandler` interfaces. Generic conflict resolution logic in `SyncService.resolveConflictStrategy` applies these configurations.
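A simplified sketch of how a config-driven, field-level merge can work, in the spirit of `SyncService.resolveConflictStrategy`. The policy names mirror the shared config, but the resolver below is illustrative, not the production implementation:

```typescript
// Config-driven field merge: each field resolves per its declared policy;
// fields without a policy default to the server copy.
type FieldPolicy = "LOCAL_WINS" | "SERVER_WINS" | "MAX_VALUE" | "MERGE_ARRAYS";

interface FieldRule { field: string; policy: FieldPolicy }

type Entity = Record<string, unknown>;

function mergeByPolicy(local: Entity, server: Entity, rules: FieldRule[]): Entity {
  const merged: Entity = { ...server };
  for (const { field, policy } of rules) {
    switch (policy) {
      case "LOCAL_WINS":
        merged[field] = local[field];
        break;
      case "SERVER_WINS":
        merged[field] = server[field];
        break;
      case "MAX_VALUE": // monotonic counters: never decrease
        merged[field] = Math.max(Number(local[field] ?? 0), Number(server[field] ?? 0));
        break;
      case "MERGE_ARRAYS": { // set union, e.g. for tags
        const l = (local[field] as unknown[]) ?? [];
        const s = (server[field] as unknown[]) ?? [];
        merged[field] = [...new Set([...l, ...s])];
        break;
      }
    }
  }
  return merged;
}

const merged = mergeByPolicy(
  { notes: "edited offline", count: 3, tags: ["a"] },
  { notes: "older note", count: 5, tags: ["b"] },
  [
    { field: "notes", policy: "LOCAL_WINS" },
    { field: "count", policy: "MAX_VALUE" },
    { field: "tags", policy: "MERGE_ARRAYS" },
  ],
) as { notes: string; count: number; tags: string[] };
console.log(merged.notes, merged.count, merged.tags); // edited offline 5 [ 'a', 'b' ]
```

Because the rules are data rather than code, the same resolver serves every entity type; only the `EntityConflictConfig` changes.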
- Hardcoded Merge Logic (per entity): Implementing `merge()` methods directly in each entity handler with custom logic.
- Rejected: Led to boilerplate, inconsistencies, and made it difficult to audit or change merge behavior globally.
- Client-Side Resolution Only: Always returning conflicts to the client for user resolution.
- Rejected: High friction for users, impractical for frequent background syncs.
- Last-Write-Wins (Global Default): A simple, universal policy where the most recent change always prevails.
- Rejected: Too blunt for complex entities. Could lead to loss of important fields.
Pros:
- Consistency — Ensures all entity types resolve conflicts according to a predefined, auditable policy.
- Flexibility — Allows fine-grained control over how individual fields merge (e.g., local wins for notes, max for counters, union for tags).
- Extensibility — Adding new entity types requires defining their `EntityConflictConfig`, not rewriting core merge logic.
- Maintainability — Centralizes policy definition, simplifies review, and reduces errors.
- Transparency — Explicit policies make conflict resolution predictable.
- Shared Contract — `@shared/sync-config` ensures frontend and backend use identical rules.
Cons:
- Complexity — Defining detailed policies for numerous fields adds configuration overhead.
- Learning Curve — Requires understanding the policy types and how they interact.
- Performance — Field-by-field merging can be slightly slower than a blunt "last write wins" for very large entities, but necessary for data integrity.
- Creation of `@shared/sync-config/conflict-strategies.ts` and `conflict-configs.ts` as the authoritative source.
- `SyncService` became more generic, delegating entity-specific logic to `SyncEntityHandler` implementations.
- Engineers now define conflict behavior declaratively in config files, rather than imperatively in code.
- Easier to test merge logic for specific fields by providing mock local/server entities and expected merged outputs.
- The API exposes `ConflictStrategy` and expects `resolvedData` for manual merges.
- Reduced semantic drift between frontend and backend due to shared configurations.
Reinforces Separation of Concerns by separating conflict resolution policy (config) from implementation logic (generic resolver). Adheres to the Strategy Pattern and Open/Closed Principle by allowing new entity types to extend the system without modifying core sync logic. Promotes Idempotency by ensuring predictable outcomes on retries.
Implementation Details
`packages/shared/src/sync-config/conflict-strategies.ts` — Defines `CONFLICT_STRATEGY`, `FIELD_POLICY`, and `IdStrategy`.

`packages/shared/src/sync-config/conflict-configs.ts` — Registry mapping each `EntityType` to its `EntityConflictConfig`:

```typescript
// excerpt from SESSIONS_CONFIG
const SESSIONS_CONFIG: EntityConflictConfig = {
  defaultStrategy: CONFLICT_STRATEGY.MERGE,
  fieldPolicies: [
    { field: 'purchaseId', policy: FIELD_POLICY.SERVER_WINS },
    { field: 'notes', policy: FIELD_POLICY.LOCAL_WINS },
    { field: 'status', policy: FIELD_POLICY.MONOTONIC, transitions: ['ACTIVE', 'PAUSED', 'CANCELLED', 'COMPLETED'] },
  ],
  serverDerivedFields: ['eventCount', 'totalDurationMs', 'sessionStartTimestamp'],
  conflictFree: false,
  requiresCustomMerge: true,
  idStrategy: ID_STRATEGY.PRIMARY_KEY_IS_SERVER_ID,
};
```

`packages/backend/src/services/sync.service.ts` — Entity processing methods delegate to `SyncEntityHandler`s; `resolveConflictStrategy` applies the config.

`packages/backend/src/services/sync/handlers/session.handler.ts` — Example of a concrete `SyncEntityHandler` with custom merge logic for sessions.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SYNC-ENGINE.MD`, `DATA-INTEGRITY-GUARANTEES.MD`
In a distributed, multi-device, offline-first environment, multiple clients or backend services can attempt to modify the same entity concurrently. Without a mechanism to detect and prevent conflicting updates:
- Lost Updates — A user's changes on one device might be silently overwritten by an older version from another device.
- Data Inconsistency — The database state does not reflect the intended logical sequence of operations.
- Debugging Complexity — Hard to trace why data appears corrupted or outdated.
Optimistic Locking was implemented for all syncable entities:
- A `version` integer column was added to all relevant Prisma models (`User`, `Product`, `Consumption`, `Session`, `JournalEntry`, etc.).
- During an `UPDATE` operation, the client provides the `expectedVersion` (the version it last read).
- The backend's `BaseRepository.update` method performs a conditional update: `UPDATE ... WHERE id = [id] AND version = [expectedVersion]`.
- If a `version` mismatch occurs, the update fails, signaling a `CONFLICT` (HTTP 409) to the caller.
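The conditional-update mechanics can be sketched with an in-memory stand-in for the database. The version check mirrors the `UPDATE ... WHERE id = ? AND version = ?` semantics described above, but the code itself is illustrative:

```typescript
// Optimistic locking sketch: the Map stands in for the database table, and
// the version guard emulates the conditional UPDATE in BaseRepository.update.
interface Row { id: string; version: number; notes: string }

const table = new Map<string, Row>([["e1", { id: "e1", version: 1, notes: "" }]]);

class VersionConflictError extends Error {} // surfaces as HTTP 409 at the API layer

function update(id: string, expectedVersion: number, notes: string): Row {
  const row = table.get(id);
  if (!row || row.version !== expectedVersion) {
    throw new VersionConflictError(`stale version for ${id}`);
  }
  const next: Row = { ...row, notes, version: row.version + 1 }; // increment on success
  table.set(id, next);
  return next;
}

const updated = update("e1", 1, "from device A"); // succeeds, version -> 2
let conflicted = false;
try {
  update("e1", 1, "from device B"); // device B still holds version 1: rejected
} catch (e) {
  conflicted = e instanceof VersionConflictError;
}
console.log(updated.version, conflicted); // 2 true
```

On a 409, the client's correct move is to re-fetch the entity (obtaining the new `version`), re-apply or merge its change, and retry.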
- Pessimistic Locking: Using database locks (`SELECT FOR UPDATE`) to prevent concurrent access.
- Rejected: Introduces contention, reduces concurrency, and is less suitable for long-running, disconnected operations typical of mobile sync.
- Last-Write-Wins (Implicit): Simply update the record without version checks.
- Rejected: Leads to lost updates and data corruption. Explicitly rejected for data integrity.
- Timestamp-Based Last-Write-Wins: Using `updatedAt` timestamps for conflict detection.
- Rejected: Timestamps can suffer from clock skew between client devices and servers; `version` numbers are atomic and deterministic.
Pros:
- Data Integrity — Prevents lost updates, ensuring that every valid modification is preserved or explicitly resolved.
- High Concurrency — Does not hold database locks, allowing multiple clients to read concurrently.
- Simple Implementation — Relatively straightforward with a `version` column and conditional `UPDATE`.
- Clear Conflict Signal — Explicitly returns HTTP 409 when a version mismatch occurs.
Cons:
- Client-Side Awareness — Clients must be aware of the `version` field and include it in `UPDATE` requests.
- Increased Conflict Rate — For highly contentious data, this could lead to more frequent conflicts (though this indicates a legitimate concurrency issue).
- Developer Workflow — Requires engineers to include `version` in `UPDATE` operations and handle `CONFLICT` errors.
- `prisma/schema.prisma` now includes a non-nullable `version` integer column (default `1`) on all syncable models.
- `BaseRepository.update` (and all derived repositories) automatically handles incrementing the `version` and using it in `WHERE` clauses.
- `Update...Schema` (from `@shared/contracts`) for all syncable entities includes an optional `version` field.
- `SyncService.detectConflict` explicitly compares `clientVersion` with `serverVersion`.
- Clients must fetch the `version` along with the entity and include it in subsequent `UPDATE` payloads.
Reinforces Data Integrity and Resilience in a distributed system. Promotes Explicit Design by making conflict detection a first-class concern. Aligns with SRP by encapsulating the locking logic within the repository layer.
Implementation Details
`packages/backend/prisma/schema.prisma` — The `version` column in models:

```prisma
model User {
  // ...
  version Int @default(1)
  // ...
}
```

`packages/backend/src/repositories/base.repository.ts` — The `update` method handles optimistic locking:

```typescript
// excerpt from update method
const updated = await tx[this.modelName].update({
  where: { id: id, version: updateData.expectedVersion }, // Conditional update
  data: {
    ...prismaUpdateData,
    version: { increment: 1 }, // Increment version
    updatedAt: new Date(),
  },
});
```

`packages/backend/src/services/sync.service.ts` — The `detectConflict` method checks `serverVersion` against `clientVersion`.

`packages/shared/src/contracts/sync-config/conflict-configs.ts` — The `EntityConflictConfig` for each entity details `idStrategy` and `monotonicFields`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SYNC-ENGINE.MD`, `DATA-INTEGRITY-GUARANTEES.MD`, `FAILURE-MODES.MD`
The initial sync implementation used offset-based pagination (OFFSET/LIMIT). This approach is problematic for large, frequently changing datasets:
- Performance Degradation — `OFFSET` queries require the database to scan all rows up to the offset, leading to O(N) performance for deep pages.
- Data Skew/Missing Data — Concurrent writes during pagination can cause rows to be skipped or duplicated.
- Scalability — Not suitable for high-volume, real-time sync scenarios.
These issues directly impacted sync reliability and performance, especially for users with extensive historical data.
Keyset (Cursor-Based) Pagination was implemented for fetching incremental changes (GET /sync/changes):
- The cursor is an opaque, base64-encoded string containing a composite key (`lastCreatedAt`, `lastId`) from the last record of the previous page.
- Queries use `WHERE (createdAt > [lastCreatedAt] OR (createdAt = [lastCreatedAt] AND id > [lastId])) ORDER BY createdAt ASC, id ASC LIMIT [limit]`.
- The `SyncService` encodes and decodes cursors using strict Zod schemas defined in `@shared/sync-config/cursor.ts`.
- `GET /health/samples/cursor` also adopted this pattern.
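The cursor mechanics can be sketched as follows. The shapes are illustrative, and the real implementation validates cursors with Zod schemas rather than the hand-rolled check shown here:

```typescript
// Composite keyset cursor sketch: base64-encoded (lastCreatedAt, lastId)
// with an id tie-breaker, mirroring the WHERE clause above.
interface CompositeCursor { lastCreatedAt: string; lastId: string }

function encodeCompositeCursor(c: CompositeCursor): string {
  return Buffer.from(JSON.stringify(c)).toString("base64");
}

function decodeCompositeCursor(encoded: string): CompositeCursor {
  const parsed = JSON.parse(Buffer.from(encoded, "base64").toString("utf8"));
  if (typeof parsed?.lastCreatedAt !== "string" || typeof parsed?.lastId !== "string") {
    throw new Error("InvalidCursor"); // the real code throws InvalidCursorError
  }
  return parsed;
}

// Keyset predicate: strictly after the cursor position, id breaks timestamp ties.
function isAfterCursor(row: { createdAt: string; id: string }, c: CompositeCursor): boolean {
  return row.createdAt > c.lastCreatedAt ||
    (row.createdAt === c.lastCreatedAt && row.id > c.lastId);
}

const cursor = encodeCompositeCursor({ lastCreatedAt: "2026-03-24T10:00:00Z", lastId: "b" });
const decoded = decodeCompositeCursor(cursor);
console.log(isAfterCursor({ createdAt: "2026-03-24T10:00:00Z", id: "c" }, decoded)); // true (tie broken by id)
console.log(isAfterCursor({ createdAt: "2026-03-24T09:00:00Z", id: "z" }, decoded)); // false
```

The tie-breaker is the reason the composite key beats a timestamp-only cursor: batch writes routinely share one `createdAt`.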
- Offset-Based Pagination (`OFFSET`/`LIMIT`): The existing solution.
- Rejected: Performance degrades on deep pages, susceptible to data skew during concurrent writes, not scalable.
- GraphQL Cursor Connections (Relay-style): A standardized way to implement cursor pagination with GraphQL.
- Rejected: Backend is REST-based. Introducing GraphQL solely for pagination would be overkill.
- Timestamp-Only Cursor: Using only `lastCreatedAt` for the cursor.
- Rejected: Not robust enough. Multiple records with the same `createdAt` timestamp (common for batch operations) would lose tie-breaking logic.
Pros:
- Performance — O(1) or O(log N) regardless of page depth, as the database jumps directly to the cursor position.
- Data Integrity — Immune to data skew from concurrent writes, ensuring consistent pagination.
- Scalability — Highly scalable for high-volume incremental sync operations.
- Deterministic — Cursor logic is deterministic, ensuring repeatable results.
- Shared Contract — `@shared/sync-config/cursor.ts` guarantees identical encoding/decoding on frontend and backend.
Cons:
- Complexity — Requires more complex query logic for building `WHERE` clauses with composite keys.
- Opaque Cursor — Cursors are opaque to clients, limiting their ability to "jump to page N."
- Debugging — Debugging corrupt or malformed cursors can be challenging without proper error reporting.
- Client Implementation — Clients must adapt to cursor-based pagination logic (storing `nextCursor`, looping until `hasMore`).
- `GET /sync/changes` and `GET /health/samples/cursor` APIs now return a `cursor` and `hasMore` field instead of `page`, `total`, `totalPages`.
- Introduction of `@shared/sync-config/cursor.ts` for cursor types, encoding, and decoding.
- `SyncChangeRepository` and `HealthSampleRepository` implemented specific methods for cursor-based queries.
- Requires new mental models for sync (cursors instead of page numbers); `InvalidCursorError` is a common error type.
- Significantly improved sync performance for users with large datasets.
Reinforces Performance Optimization and Scalability. Promotes Data Integrity by ensuring reliable pagination. The strict cursor contract adheres to Contract-First Development and Robustness Principle.
Implementation Details
`packages/shared/src/sync-config/cursor.ts` — Defines `EntityCursor`, `CompositeCursor`, and encoding/decoding functions:

```typescript
// excerpt from decodeCompositeCursor
export function decodeCompositeCursor(encoded: string): CompositeCursor {
  // ... base64 decoding ...
  // ... JSON parsing ...
  const result = CompositeCursorSchema.safeParse(parsed);
  if (!result.success) {
    throw new InvalidCursorError(encoded, 'Schema validation failed', result.error);
  }
  return result.data;
}
```

`packages/backend/src/api/v1/controllers/sync.controller.ts` — The `getIncrementalChanges` method.

`packages/backend/src/repositories/sync-change.repository.ts` — The `getChangesSince` method using a `WHERE (createdAt > ? OR (createdAt = ? AND id > ?))` clause.

`packages/backend/src/api/v1/schemas/sync.schemas.ts` — `syncChangesSchema` defines cursor validation.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SYNC-ENGINE.MD`, `WORKER-SCALABILITY.MD`
Health data is Protected Health Information (PHI) subject to stringent privacy regulations (e.g., HIPAA). Users must have explicit control over how their health data is collected, processed, and used. Without a robust, server-side enforcement mechanism:
- Privacy Violations — Accidental ingestion of data types a user has explicitly blocked.
- Compliance Risks — Failure to meet regulatory requirements for user data control.
- Erosion of Trust — Users losing confidence if their privacy preferences are not honored.
The initial ingestion pipeline lacked a centralized server-side privacy check, relying solely on client-side permission handling.
A Server-Side Privacy Gating mechanism was implemented in HealthSampleService. Before any batch of health samples is processed or stored, the service:
- Fetches the user's `privacySettings` (stored in the `User` model as a JSONB column).
- Parses a nested `health` object within `privacySettings` to retrieve `allowHealthDataUpload` (a global toggle) and `blockedMetrics` (an array of `metricCode`s).
- Rejects the entire batch with a `FORBIDDEN` error if `allowHealthDataUpload` is false.
- Filters out individual samples whose `metricCode` is in `blockedMetrics`, returning them as `PRIVACY_BLOCKED` failures in the batch response.
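The two-stage gate (global toggle, then per-sample filtering) can be sketched as follows, with simplified stand-ins for the shared contract types:

```typescript
// Privacy gate sketch: batch-level toggle check, then per-sample filtering.
// Types are simplified stand-ins for the shared contract schemas.
interface HealthPrivacySettings { allowHealthDataUpload: boolean; blockedMetrics: string[] }
interface Sample { sourceRecordId: string; metricCode: string }

class ForbiddenError extends Error {} // maps to the batch-level FORBIDDEN response

function gateSamples(settings: HealthPrivacySettings, samples: Sample[]) {
  if (!settings.allowHealthDataUpload) {
    throw new ForbiddenError("health uploads disabled by user"); // reject the whole batch
  }
  const blocked = new Set(settings.blockedMetrics);
  const allowed: Sample[] = [];
  const failures: { sourceRecordId: string; code: "PRIVACY_BLOCKED" }[] = [];
  for (const s of samples) {
    if (blocked.has(s.metricCode)) {
      failures.push({ sourceRecordId: s.sourceRecordId, code: "PRIVACY_BLOCKED" });
    } else {
      allowed.push(s);
    }
  }
  return { allowed, failures };
}

const result = gateSamples(
  { allowHealthDataUpload: true, blockedMetrics: ["hrv"] },
  [{ sourceRecordId: "r1", metricCode: "heart_rate" }, { sourceRecordId: "r2", metricCode: "hrv" }],
);

let rejected = false;
try {
  gateSamples({ allowHealthDataUpload: false, blockedMetrics: [] }, []);
} catch (e) {
  rejected = e instanceof ForbiddenError;
}
console.log(result.allowed.length, result.failures[0].code, rejected); // 1 PRIVACY_BLOCKED true
```

Returning blocked samples as structured failures (rather than dropping them silently) is what lets the client surface accurate per-sample feedback.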
- Client-Side Enforcement Only: Relying solely on the mobile client to respect privacy settings.
- Rejected: Insufficient for robust privacy enforcement. Clients can be buggy, compromised, or outdated.
- Post-Ingestion Filtering: Ingesting all data and then filtering for privacy after storage but before processing.
- Rejected: Inefficient (ingests unwanted data), higher storage cost, and creates a privacy risk as the data temporarily resides in the database.
- Separate UserHealthPreference Table: Storing health-specific privacy settings in a dedicated table.
- Rejected: Increased schema complexity for a relatively small, JSON-serializable preference set. JSONB on `User` is sufficient and more flexible.
Pros:
- Regulatory Compliance — Directly addresses HIPAA and other privacy regulations.
- Robust Enforcement — Guarantees user privacy preferences are honored server-side, regardless of client behavior.
- User Trust — Builds user confidence by providing clear, enforceable control.
- Efficiency — Filters unwanted data at the ingestion boundary.
- Flexibility — JSONB column allows easy evolution without schema migrations.
Cons:
- Performance Overhead — Requires a database read (user profile) and JSONB parsing on every health batch upload. Mitigated by careful indexing and caching.
- Complexity — Adds logic to `HealthSampleService` to fetch, parse, and enforce privacy settings.
- Client Communication — The client needs to understand `PRIVACY_BLOCKED` error codes and provide appropriate UI feedback.
- The `User` model in `prisma/schema.prisma` has a `privacySettings` JSONB column. `ExtendedUserPrivacySettingsSchema` and `HealthPrivacySettingsSchema` define the expected JSON structure.
- `HealthSampleService.batchUpsertSamples` now calls `assertHealthUploadAllowed` and `filterBlockedMetrics`.
- `BatchUpsertSamplesResponseSchema` includes `PRIVACY_BLOCKED` as a `SampleErrorCode`.
- Clients must handle `PRIVACY_BLOCKED` errors in `HealthUploadEngine` and inform the user.
Directly upholds Privacy by Design and Security by Design principles. Reinforces Separation of Concerns by encapsulating privacy enforcement within a dedicated layer. Improves Robustness by adding a server-side trust boundary.
Implementation Details
`packages/backend/src/services/healthSample.service.ts` — The `assertHealthUploadAllowed` and `filterBlockedMetrics` methods:

```typescript
// excerpt from batchUpsertSamples
const privacySettings = await this.assertHealthUploadAllowed(userId, requestId, samples.length);
if (privacySettings.blockedMetrics && privacySettings.blockedMetrics.length > 0) {
  const filterResult = this.filterBlockedMetrics(samples, privacySettings.blockedMetrics);
  samplesToProcess = filterResult.allowed;
  // ... push PRIVACY_BLOCKED failures ...
}
```

`packages/shared/src/contracts/health.contract.ts` — Defines `HealthPrivacySettingsSchema` and `SampleErrorCode.PRIVACY_BLOCKED`.

`packages/backend/src/models/index.ts` — Defines `ExtendedUserPrivacySettingsSchema` for JSONB parsing.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SECURITY-COMPLIANCE.MD`, `DATA-INTEGRITY-GUARANTEES.MD`
The batch upsert endpoint (POST /health/samples/batch-upsert) relies on a requestId for request-level idempotency. However, simply using requestId is insufficient for robust protection against:
- Data Tampering — A malicious actor (or buggy client) could reuse an old `requestId` with a modified payload, leading to data corruption.
- Client Bugs — A client might accidentally send different samples under the same `requestId` (e.g., due to local state corruption).
Without verifying payload integrity, the requestId alone provided incomplete idempotency.
A Payload Hash (`payloadHash`) was introduced in the `BatchUpsertSamplesRequest` contract:
- Clients compute a SHA-256 hash of the canonicalized `samples` (and deletions) array using `computeBatchPayloadHash` from `@shared/health-config/payload-hash.ts`.
- This `payloadHash` is sent alongside the `requestId`.
- The backend's `health.routes.ts` validation middleware verifies that the received `payloadHash` exactly matches the hash computed from the incoming `samples` and `deleted` arrays.
- A `configVersion` field handles backward compatibility for deletion hashing (specifically, `startAt` for precise deletion identity).
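An order-independent payload hash can be sketched as below. The sort key and field set are assumptions for illustration, not the actual shared canonicalization algorithm:

```typescript
import { createHash } from "node:crypto";

// Sketch of order-independent payload hashing: samples are canonicalized
// (sorted by a stable key, fields serialized in a fixed order) before SHA-256.
interface Sample { sourceRecordId: string; metricCode: string; value: number }

function canonicalize(samples: Sample[]): string {
  const sorted = [...samples].sort((a, b) => a.sourceRecordId.localeCompare(b.sourceRecordId));
  // Serialize each object with its keys in a fixed order for determinism.
  return JSON.stringify(
    sorted.map(s => ({ metricCode: s.metricCode, sourceRecordId: s.sourceRecordId, value: s.value })),
  );
}

function computePayloadHash(samples: Sample[]): string {
  return createHash("sha256").update(canonicalize(samples)).digest("hex");
}

const a = computePayloadHash([
  { sourceRecordId: "r1", metricCode: "hr", value: 62 },
  { sourceRecordId: "r2", metricCode: "hr", value: 64 },
]);
const b = computePayloadHash([
  { sourceRecordId: "r2", metricCode: "hr", value: 64 }, // same samples, different order
  { sourceRecordId: "r1", metricCode: "hr", value: 62 },
]);
const c = computePayloadHash([{ sourceRecordId: "r1", metricCode: "hr", value: 63 }]); // tampered value
console.log(a === b, a === c); // true false
```

The backend recomputes the same hash over the incoming arrays; any mismatch means the payload under this `requestId` is not the one originally hashed.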
- `requestId` Only: Relying solely on `requestId` for idempotency.
- Rejected: Vulnerable to data tampering and client bugs.
- Server-Side Content Hashing: The server computes the hash internally and stores it with the `requestId`.
- Rejected: The client wouldn't know if content changed without server re-hashing. Less efficient.
- Stronger Cryptographic Signatures: Using client-side digital signatures for the entire payload.
- Rejected: Overkill for the current threat model. SHA-256 hash provides sufficient integrity for idempotency.
Pros:
- Data Integrity — Guarantees batch content remains unchanged across retries for the same `requestId`.
- Security — Detects accidental or malicious data tampering.
- Robust Idempotency — Provides a strong, cryptographic guarantee for request-level idempotency.
- Deterministic — The canonicalization algorithm ensures the same set of samples (regardless of order) always produces the same hash.
- Backward Compatibility — `configVersion` allows the hash algorithm to evolve while maintaining compatibility.
Cons:
- Client-Side Complexity — Clients must implement SHA-256 hashing and deterministic JSON canonicalization.
- CPU Overhead — Hash computation adds a small CPU overhead per request.
- Payload Size — Adds 64 bytes for the hash string to each batch request.
- `BatchUpsertSamplesRequestSchema` (from `@shared/contracts`) now includes `payloadHash` (mandatory) and `configVersion` (optional). `DeletionItemSchema` includes an optional `startAt`.
- Introduction of `@shared/health-config/payload-hash.ts` with `computeBatchPayloadHash` and `verifyBatchPayloadHash`.
- Mobile clients (`HealthUploadEngine`) must implement hash computation for outgoing requests.
- `health.routes.ts` includes `createBatchUpsertValidationMiddleware` to verify the hash. A `BatchValidationError` with `PAYLOAD_HASH_MISMATCH` is thrown on verification failure.
Reinforces Data Integrity and Security by Design. Ensures Idempotency is cryptographically robust and promotes Contract-First Development through shared hashing logic.
Implementation Details
`packages/shared/src/health-config/payload-hash.ts` — Canonicalization and hashing functions:

```typescript
// excerpt from computeBatchPayloadHash
export async function computeBatchPayloadHash(input: BatchPayloadHashInput): Promise<string> {
  const { samples, deleted = [], configVersion = 1 } = input;
  const sortedSamples = canonicalizeSamplesForHash(samples);
  const sortedDeleted = canonicalizeDeletionsForHash(deleted, configVersion);
  const combinedPayload = { samples: sortedSamples, deleted: sortedDeleted };
  return computePayloadHash(combinedPayload as CanonicalPayload);
}
```

`packages/shared/src/contracts/health.contract.ts` — Defines `BatchUpsertSamplesRequestSchema` and `PayloadHashSchema`.

`packages/backend/src/api/v1/routes/health.routes.ts` — `createBatchUpsertValidationMiddleware` calls `validateBatchUpsertRequestWithHash`.

`packages/backend/src/services/health/HealthUploadEngine.ts` — Calls `computeBatchPayloadHash` before sending requests.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `SECURITY-COMPLIANCE.MD`, `DATA-INTEGRITY-GUARANTEES.MD`, `FAILURE-MODES.MD`
Displaying health vitals for a past consumption session (e.g., heart rate, HRV before, during, and after) is a critical UI feature. Naively querying raw HealthSample data for each request is problematic:
- High Latency — Joins across thousands of raw samples on every API request.
- Computational Overhead — On-the-fly downsampling and aggregation is CPU-intensive.
- Inconsistent Data — Without caching, different requests could return slightly different aggregates.
- Mobile Bandwidth — Fetching raw data for client-side aggregation is inefficient.
Session Telemetry Precomputation and Caching was implemented:
- Derived, downsampled health data for each completed session is precomputed by `SessionTelemetryService` and stored in `SessionTelemetryCache`.
- Computation is triggered asynchronously by `session.ended` domain events (via `SessionTelemetrySubscriber` and `SessionTelemetryQueueService`).
- The `SessionTelemetryService` API (`getSessionTelemetry`) prioritizes fetching from the cache. On a cache miss, it returns a `COMPUTING` state (with a `retryAfterSeconds` hint) and triggers async recomputation via `BoundedComputeCoordinator`.
- Cache entries use watermark-based freshness (`sourceWatermark`) for staleness detection.
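The read-through flow can be sketched as follows. The cache `Map` and compute-trigger stub are illustrative stand-ins for the database-backed cache and the BullMQ queue:

```typescript
// Read-through telemetry sketch: serve READY entries, flag stale ones via
// watermark comparison, answer misses with COMPUTING plus a retry hint.
type TelemetryState = "ready" | "stale" | "computing";

interface CacheEntry { metrics: unknown; sourceWatermark: number }

const cache = new Map<string, CacheEntry>();
const scheduled: string[] = []; // stand-in for enqueueing a compute job

function triggerAsyncCompute(sessionId: string): void {
  scheduled.push(sessionId);
}

function getSessionTelemetry(sessionId: string, currentWatermark: number):
    { state: TelemetryState; metrics?: unknown; retryAfterSeconds?: number } {
  const entry = cache.get(sessionId);
  if (!entry) {
    triggerAsyncCompute(sessionId); // cache miss: compute in the background
    return { state: "computing", retryAfterSeconds: 5 };
  }
  if (entry.sourceWatermark < currentWatermark) {
    triggerAsyncCompute(sessionId); // serve stale data, refresh asynchronously
    return { state: "stale", metrics: entry.metrics };
  }
  return { state: "ready", metrics: entry.metrics };
}

const miss = getSessionTelemetry("s1", 10);  // nothing cached yet
cache.set("s1", { metrics: { avgHr: 70 }, sourceWatermark: 10 });
const hit = getSessionTelemetry("s1", 10);   // fresh entry
const stale = getSessionTelemetry("s1", 11); // source mutated since compute
console.log(miss.state, hit.state, stale.state); // computing ready stale
```

Serving stale data while a refresh runs is a deliberate trade: the UI stays responsive and the watermark guarantees the next read converges to fresh data.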
- On-Demand Computation Only: Computing telemetry from raw samples on every API request.
- Rejected: High latency, high CPU/DB load, inconsistent results, poor UX.
- Client-Side Computation: Fetching raw samples for client-side downsampling/aggregation.
- Rejected: High mobile bandwidth/battery usage, increased client complexity.
- Materialized Views in PostgreSQL: Creating SQL materialized views for session telemetry.
- Rejected: Less flexible for dynamic windowing/resolution, complex to manage per-session state.
Pros:
- API Performance — Sub-second response times (cache hits are O(1)).
- Reduced Load — Minimizes queries against the raw `HealthSample` table.
- Consistent Data — Cached results ensure all clients see the same aggregates.
- Scalability — Computation is offloaded to background workers.
- UI Responsiveness — Returns the `COMPUTING` state instantly on cache miss, improving perceived performance.
- Freshness — Watermark-based staleness ensures data is consistent with source mutations.
Cons:
- Complexity — Requires a new table, background jobs, a queue service, and a sophisticated cache workflow.
- Eventual Consistency — A brief delay exists between `session.ended` and the cache becoming `READY`.
- Infrastructure Overhead — Requires BullMQ workers and potentially more Redis memory.
- Data Size — Storing aggregated telemetry (JSONB) increases database size.
- `prisma/schema.prisma` defines the `SessionTelemetryCache` model with `metricsJson` (JSONB) and `sourceWatermark`.
- `SessionTelemetryService` manages cache reads, computations, and interactions with `SessionTelemetryQueueService`.
- New BullMQ jobs (`SESSION_TELEMETRY_COMPUTE`, `SESSION_TELEMETRY_LOCK_REAPER`) were introduced.
- `SessionTelemetryPayload` (from `@shared/contracts`) includes freshness metadata.
- Clients must interpret `TelemetryQueryResult.state` (`ready`, `computing`, `stale`, `error`, `no_data`) and `retryAfterSeconds`.
Reinforces CQRS by precomputing read models. Promotes Event-Driven Architecture for triggering computation. Ensures Performance Optimization and Scalability for the telemetry API.
Implementation Details
- `packages/backend/prisma/schema.prisma` — The `SessionTelemetryCache` model.
- `packages/shared/src/contracts/session-telemetry.contract.ts` — Defines `SessionTelemetryPayload`, `TelemetryResolution`, `TelemetryFreshnessMeta`.
- `packages/backend/src/services/session-telemetry.service.ts` — Main service:

```typescript
// excerpt from getSessionTelemetry
const cacheResult = await this.checkCache(sessionId, userId, windowMinutes, resolvedResolution);
if (cacheResult) {
  // ... return cached, trigger async recompute if stale ...
}
// ... trigger async compute and return 'computing' state ...
```

- `packages/backend/src/services/sessionTelemetryQueue.service.ts` — Schedules `SESSION_TELEMETRY_COMPUTE` jobs.
- `packages/backend/src/jobs/job.types.ts` — Defines `SessionTelemetryComputeJobData`.
- `packages/backend/src/jobs/job-processor.ts` — Implements `processSessionTelemetryComputeJob`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `PROJECTION-PIPELINE.MD`, `WORKER-SCALABILITY.MD`
Predicting inventory depletion and suggesting purchase timing accurately for tracked products is a complex problem. Relying on simplistic linear models leads to:
- Inaccurate Predictions — Ignoring variations in user consumption, routine patterns, or safety concerns.
- Suboptimal Recommendations — Missing opportunities for timely purchases or failing to warn about low stock.
- Lack of Trust — Users distrusting predictions due to poor reliability.
A more sophisticated, multi-factor approach was needed.
A Multi-Factor Inventory Prediction Engine was implemented in InventoryPredictionService. This service orchestrates several data sources and models:
- EMA Learning — `UserConsumptionProfileService` provides learned average quantity-per-event and consumption rates.
- Temporal Patterns — `TemporalPatternService` (a Dirichlet-multinomial histogram) provides time-of-day/day-of-week consumption multipliers and routine stability.
- Safety Factors — `SafetyService` provides adjustments based on recent high-risk events.
- Loss Estimation — `UserConsumptionProfileService` provides personalized loss factors.

The service combines these factors into a single `effectiveDailyRate` to forecast depletion, assess risk, and generate purchase recommendations.
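The multiplicative combination can be sketched with plain numbers (the real service uses a decimal type; the factor values below are invented for illustration):

```typescript
// Multiplicative factor combination behind effectiveDailyRate; field names
// paraphrase the factors listed above, values are illustrative.
interface PredictionFactors {
  baseDailyRate: number;    // EMA-learned consumption rate (units/day)
  temporalFactor: number;   // Dirichlet-multinomial time-slot multiplier
  trendMultiplier: number;  // recent trend adjustment
  safetyAdjustment: number; // SafetyService adjustment
  lossFactor: number;       // personalized loss estimate
}

function effectiveDailyRate(f: PredictionFactors): number {
  return f.baseDailyRate * f.temporalFactor * f.trendMultiplier * f.safetyAdjustment * f.lossFactor;
}

function daysUntilDepletion(unitsOnHand: number, f: PredictionFactors): number {
  const rate = effectiveDailyRate(f);
  return rate > 0 ? unitsOnHand / rate : Infinity;
}

const factors: PredictionFactors = {
  baseDailyRate: 2.0,
  temporalFactor: 1.25,  // user consumes more in the current routine window
  trendMultiplier: 1.0,
  safetyAdjustment: 0.8, // recent high-risk events dampen the rate
  lossFactor: 1.1,       // ~10% of inventory lost/wasted
};
console.log(effectiveDailyRate(factors).toFixed(2));   // "2.20"
console.log(daysUntilDepletion(11, factors).toFixed(1)); // "5.0"
```

Because each factor is a multiplier near 1.0, any single signal can be disabled (set to 1) without restructuring the formula, which is what makes the factors modular.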
- Single-Factor Linear Regression: Predicting depletion solely based on average daily consumption.
- Rejected: Too simplistic, ignores individual variations, routines, and safety context.
- External ML Platform (e.g., AWS SageMaker): Building complex ML models on a dedicated platform.
- Rejected: Higher cost, increased operational complexity, latency overhead.
- Hardcoded Rules: Using a set of static rules for predictions.
- Rejected: Not personalized, not adaptive, and prone to poor accuracy.
Pros:
- Increased Accuracy — Combines multiple behavioral signals for robust, personalized predictions.
- Context-Aware — Integrates temporal routines and safety concerns.
- Transparency — `PredictionExplain` provides a breakdown of contributing factors, improving user trust.
- Extensibility — New prediction factors can be added as modular strategies.
- Scalability — Uses lightweight statistical models directly in the backend.
Cons:
- Increased Complexity — Orchestrating multiple services and models adds significant logic.
- Cold Start Problem — Requires sufficient user data for confident predictions. [MITIGATION]: Fall back to inventory-only predictions.
- Debugging — Tracing prediction logic through multiple factors can be challenging.
- `InventoryPredictionService` became a core prediction orchestrator with `UserConsumptionProfileService`, `UserRoutineService`, `SafetyService`, and `TemporalPatternService` as dependencies.
- `@shared/contracts/prediction.contract.ts` defines `InventoryPredictionResult`, `ProductInventoryPrediction`, `PredictionExplain`, and `PurchaseRecommendation`.
- The `PredictionRecord` table stores historical predictions for accuracy tracking.
Embraces Composition by combining multiple specialized services. Aligns with SRP by having InventoryPredictionService orchestrate, rather than implement, all underlying models. Promotes Data-Driven Decision Making.
Implementation Details
`packages/backend/src/services/inventory-prediction.service.ts` — The main orchestrator:

```typescript
// excerpt from predictInventoryDepletion
const effectiveDailyRate = baseRate
  .mul(dampedTemporalFactor)
  .mul(computedTrendMultiplier)
  .mul(safetyAdjustment)
  .mul(lossFactor);
// ...
const predictions = this.buildPredictions(predictionBuildParams);
```

`packages/shared/src/contracts/prediction.contract.ts` — Defines the prediction data structures.

`packages/backend/src/services/user-consumption-profile.service.ts` — Provides `learnedAvgQuantityPerEvent`.

`packages/backend/src/services/temporal-pattern.service.ts` — Provides `temporalMultiplier`, `routineStability`, and `temporalConfidence`.

`packages/backend/src/services/safety.service.ts` — Provides safety adjustments.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: `AI-INTEGRATION.MD`, `PROJECTION-PIPELINE.MD`
The previous implementation of user routine detection relied on a Hidden Markov Model (HMM). This approach proved to be:
- Data-Intensive — HMMs require substantial data to train effectively, leading to poor performance for new users or sparse consumption histories.
- Opaque — HMM states and transitions are hard to interpret, making explanations to users difficult.
- Complex — Training and managing HMMs added significant complexity to `UserRoutineService`.
The HMM-based routine engine was retired and replaced by a Dirichlet-Multinomial Temporal Histogram Engine implemented in TemporalPatternService:
- Maintains 168-bin histograms (24 hours × 7 days-of-week) of user consumption frequency and quantity.
- Uses exponential decay to weight recent consumption events more heavily.
- Bayesian posterior inference with a Dirichlet prior provides robust probability distributions even with sparse data.
- Computes a `temporalMultiplier` (consumption relative to average for a given time slot), `routineStability` (1 − entropy), and `confidence` (based on total decayed sessions).
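The core of the engine can be sketched as below. The decay rate and prior strength are invented values, but the posterior-mean smoothing shows why sparse data still yields a usable multiplier:

```typescript
// 168-bin (24h x 7d) decayed histogram with a symmetric Dirichlet prior.
// Decay rate and prior strength here are illustrative, not the tuned values.
const BINS = 24 * 7;

function slotIndex(dayOfWeek: number, hour: number): number {
  return dayOfWeek * 24 + hour;
}

class TemporalHistogram {
  counts = new Array<number>(BINS).fill(0);
  constructor(private priorStrength = 1.0, private decay = 0.99) {}

  record(dayOfWeek: number, hour: number): void {
    for (let i = 0; i < BINS; i++) this.counts[i] *= this.decay; // age old events
    this.counts[slotIndex(dayOfWeek, hour)] += 1;
  }

  // Posterior mean of the slot probability under Dirichlet(priorStrength).
  slotProbability(dayOfWeek: number, hour: number): number {
    const total = this.counts.reduce((a, b) => a + b, 0);
    const c = this.counts[slotIndex(dayOfWeek, hour)];
    return (c + this.priorStrength) / (total + BINS * this.priorStrength);
  }

  // Consumption in this slot relative to the uniform average (1.0 = average).
  temporalMultiplier(dayOfWeek: number, hour: number): number {
    return this.slotProbability(dayOfWeek, hour) * BINS;
  }
}

const h = new TemporalHistogram();
const coldStart = h.temporalMultiplier(1, 20); // no data yet: prior keeps this near 1.0
for (let i = 0; i < 50; i++) h.record(1, 20);  // heavy Monday-8pm routine
const trained = h.temporalMultiplier(1, 20);
console.log(coldStart.toFixed(2), trained > 1); // "1.00" true
```

With zero observations every slot's posterior mean collapses to the uniform prior (multiplier ≈ 1), which is exactly the cold-start behavior the HMM lacked; as decayed counts accumulate, routine slots rise above 1 and off-routine slots fall below it.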
- Retain HMM: Continue with the existing HMM.
- Rejected: Data-intensive, opaque, and complex — the problems this architecture solves.
- Simpler Moving Averages: Using standard moving averages for consumption patterns.
- Rejected: Not granular enough (no time-of-day/day-of-week specificity), poor sparse data handling.
- Neural Networks (e.g., RNNs): Using deep learning models for sequence prediction.
- Rejected: Overkill, extreme data hunger, opaque, high computational cost.
Pros:
- Cold Start Friendly — Bayesian approach works well with sparse data, providing reasonable priors.
- Interpretability — Histograms and multipliers are much easier to understand and explain.
- Robustness — Exponential decay handles changing user behavior over time.
- Simplicity — Simpler mathematical model, reducing implementation complexity.
- Efficiency — Faster computation for real-time inference compared to HMM.
- Confidence Metrics — Provides explicit confidence scores based on decayed session count.
Cons:
- Loss of State Sequence — Does not model transitions between states like HMMs. Focuses solely on when consumption occurs.
- Granularity — Fixed 168 bins might not capture extremely subtle patterns.
- Implementation Effort — Required a complete re-implementation of routine pattern detection.
- `UserRoutineService` was refactored to delegate pattern detection to `TemporalPatternService`; HMM-related code was removed.
- `UserRoutineProfile` now stores `temporalCounts`, `temporalQuantity`, `lastTemporalUpdateAt`, `totalDecayedSessions`, `priorStrength`, and `trendMultiplier`.
- `TemporalPatternSubscriber` was introduced to update the histogram on `session.ended` events.
- Improved accuracy and reliability of inventory predictions that rely on these temporal patterns.
Reinforces Simplicity and Robustness in model design. Aligns with SRP by separating routine profile management (UserRoutineService) from the pattern learning algorithm (TemporalPatternService). Data-Driven by directly inferring patterns from user consumption history.
Implementation Details
`packages/backend/src/services/temporal-pattern.service.ts` — Core Dirichlet-multinomial histogram logic:

```ts
// excerpt from updateTemporalHistogram
// Apply exponential decay
if (profile.lastTemporalUpdateAt) { /* ... */ }
// Increment bin counts with age-based weight
counts[binIndex] = (counts[binIndex] ?? 0) + sessionWeight;
quantityArr[binIndex] = (quantityArr[binIndex] ?? 0) + sessionQuantity * sessionWeight;
totalDecayed += sessionWeight;
// Compute posterior and derived metrics
const posterior = TemporalPatternService.computePosteriorDistribution(counts, priorStrength);
const temporalMultiplier = TemporalPatternService.computeMultiplier(posterior, binIndex);
```

- `packages/backend/src/subscribers/temporal-pattern.subscriber.ts` — Subscribes to `session.ended` events.
- `packages/backend/src/repositories/user-routine-profile.repository.ts` — Stores histogram data (`temporalCounts`, `temporalQuantity`).
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: AI-INTEGRATION.MD, PROJECTION-PIPELINE.MD
The HealthIngestBatchJob leverages BullMQ for asynchronous processing. BullMQ stores job data as JSON in Redis. Without explicit limits, very large payloads could lead to:
- Redis Memory Exhaustion (OOM) — Unbounded job data consuming excessive Redis memory, causing evictions or crashes.
- ioredis Socket Failures (EPIPE/ECONNRESET) — Overly large JSON writes to Redis causing socket errors.
- Job Processing Deadlocks — If enqueueing fails after the `HealthIngestRequest` is created but before the job enters the queue, the request remains stuck in "PROCESSING."
A hard payload size limit of 5MB was imposed on HealthIngestBatchJobData within the HealthIngestQueueService:
- Before enqueueing, the service serializes the job data and checks its byte length.
- If the payload exceeds 5MB, the job is rejected, the `HealthIngestRequest` is marked `FAILED`, and an `AppError(413)` (Payload Too Large) is returned.
- The client-side `HealthUploadEngine` implements auto-rechunking and pre-send byte size validation (`MAX_BATCH_BYTES` = 4.5MB).
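A sketch of the client-side guard described above. `estimateRequestBytes` and `MAX_BATCH_BYTES` are named in this ADR; the greedy `rechunk` helper is a hypothetical illustration of auto-rechunking, not the actual `HealthUploadEngine` logic:

```typescript
const MAX_BATCH_BYTES = 4.5 * 1024 * 1024; // client pre-send limit, below the server's 5MB

function estimateRequestBytes(payload: unknown): number {
  // Byte length of the serialized JSON, not string length:
  // multi-byte UTF-8 characters must count fully.
  return Buffer.byteLength(JSON.stringify(payload), 'utf-8');
}

// Greedily split samples into chunks whose serialized size stays under the limit.
// Hypothetical shape — assumes each individual sample fits within maxBytes.
function rechunk<T>(samples: T[], maxBytes: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  for (const sample of samples) {
    current.push(sample);
    if (estimateRequestBytes(current) > maxBytes && current.length > 1) {
      current.pop();          // roll back the overflowing sample
      chunks.push(current);   // seal the chunk that still fits
      current = [sample];     // start a new chunk with the overflow
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Keeping the client limit (4.5MB) below the server limit (5MB) leaves headroom for serialization differences, so a payload that passes the client check should never trip the server's 413.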
- No Limit: Allowing unbounded job payload sizes.
- Rejected: Directly leads to Redis OOM, socket failures, and system instability.
- Queue-Specific Redis Instance: A dedicated Redis instance for BullMQ with a `noeviction` policy.
- Rejected: Even with a dedicated instance, unbounded payloads still consume memory. This ADR addresses payload size, not eviction policy.
- Store Payload in S3, Pass Reference: Storing large payloads in S3, passing only the S3 key.
- Rejected: Adds significant complexity (S3 upload, pre-signed URLs, garbage collection), increased latency. The 5MB limit is sufficient to avoid this complexity.
Pros:
- Redis Stability — Prevents memory exhaustion and socket failures.
- System Resilience — Reduces the likelihood of `HealthIngestRequest` getting stuck in "PROCESSING."
- Predictability — Ensures robust job processing by handling oversized payloads gracefully.
- Client Auto-Rechunking — Proactively splits large batches, preventing server-side rejections.
- Clear Error Feedback — Clients receive specific `413 Payload Too Large` errors.
Cons:
- Client-Side Complexity — Clients must implement payload size estimation and auto-rechunking.
- Batch Splitting — Very large health data batches may need splitting.
- Arbitrary Limit — 5MB is a heuristic; optimal size depends on Redis configuration and network.
- `HealthIngestQueueService.maybeQueueBatch` now includes byte size validation.
- `HealthUploadEngine` has `estimateRequestBytes` and auto-rechunking logic.
- `BatchUpsertSamplesErrorResponseSchema` includes `PAYLOAD_TOO_LARGE` as a `BatchErrorCode`.
- Errors are logged with payload byte sizes, aiding in debugging client-side batching issues.
Reinforces System Resilience and Resource Management. Promotes Robustness by handling failure conditions gracefully. Ensures Operational Stability for the BullMQ job queue.
Implementation Details
`packages/backend/src/services/healthIngestQueue.service.ts` — Pre-enqueue byte size check:

```ts
// excerpt from maybeQueueBatch
const payloadBytes = Buffer.byteLength(JSON.stringify(jobData), 'utf-8');
if (payloadBytes > HealthIngestQueueService.MAX_JOB_PAYLOAD_BYTES) {
  // ... fail ingest request, throw AppError(413) ...
}
```

`packages/backend/src/services/health/HealthUploadEngine.ts` — Client-side payload estimation and auto-rechunking:

```ts
// excerpt from doUploadPendingSamplesInternal
// ... auto-rechunk logic ...
throw new PayloadTooLargeError( /* ... */ );
```

- `packages/shared/src/contracts/health.contract.ts` — Defines `BatchErrorCode.PAYLOAD_TOO_LARGE`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: FAILURE-MODES.MD, WORKER-SCALABILITY.MD
Health data payloads, especially high-frequency metrics (heart rate, steps) or detailed categorical data (sleep stages), can be quite large. Sending these uncompressed leads to:
- High Bandwidth Consumption — Increased data transfer costs.
- Increased Latency — Longer upload times on slow or congested mobile networks.
- Client Battery Drain — More data transfer consumes more battery.
HTTP Compression (gzip) was implemented for health data upload requests:
- The client (`HealthUploadHttpClientImpl`) compresses the `BatchUpsertSamplesRequest` payload using `pako.gzip` if its uncompressed size exceeds a `GZIP_MIN_BYTES` threshold.
- The `Content-Encoding: gzip` header is added to the request.
- The backend inflates gzip-encoded request bodies during JSON body parsing; `app.ts` also registers the `compression()` middleware for response compression.
- `health.controller.ts` validates the `Content-Encoding` header and logs byte sizes for observability.
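The conditional-compression decision can be sketched as follows, using Node's built-in `zlib` in place of `pako` to keep the example self-contained; the `GZIP_MIN_BYTES` value and the `encodeBody` shape are assumptions, not the client's actual API:

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

const GZIP_MIN_BYTES = 1024; // assumed threshold; the real value is configuration

// Compress only when the payload is large enough for gzip to pay off;
// tiny payloads can grow after adding the gzip header/trailer.
function encodeBody(request: unknown): { body: Buffer; headers: Record<string, string> } {
  const json = JSON.stringify(request);
  const uncompressed = Buffer.from(json, 'utf-8');
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  if (uncompressed.byteLength >= GZIP_MIN_BYTES) {
    headers['Content-Encoding'] = 'gzip';
    return { body: gzipSync(uncompressed), headers };
  }
  return { body: uncompressed, headers };
}
```

Repetitive JSON (metric codes, field names) compresses especially well, which is where the 60-80% savings cited below come from.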
- No Compression: Continue sending payloads uncompressed.
- Rejected: Directly leads to poor UX and higher costs.
- Application-Level Compression (Custom): Implementing a custom compression algorithm (e.g., LZ4).
- Rejected: Reinvents the wheel, higher complexity, not compatible with standard HTTP headers.
- Always Compress: Compress all payloads regardless of size.
- Rejected: For very small payloads, compression overhead can exceed the savings; the `GZIP_MIN_BYTES` threshold optimizes this.
Pros:
- Reduced Bandwidth — Significantly decreases transfer size (often 60-80% for JSON payloads).
- Improved Performance — Faster uploads, especially over cellular networks.
- Battery Savings — Lower network activity on mobile clients.
- Standardized — Uses standard `gzip` and `Content-Encoding` headers.
- Observability — Backend logs `bytes_uncompressed`, `bytes_compressed`, and `compression_ratio`.
Cons:
- CPU Overhead — Compression/decompression consumes CPU on both client and server. Mitigated by optimizing gzip level and threshold.
- Client-Side Complexity — Requires the `pako` library and conditional compression logic.
- Backend Configuration — Requires correct `compression()` middleware ordering in `app.ts`.
- `HealthUploadHttpClientImpl` now includes `pako.gzip` and `Content-Encoding` header logic.
- `app.ts` includes `compression()` middleware.
- `health.controller` and `HealthUploadEngine` log compression details for metrics.
Improves Performance Optimization and Resource Management. Adheres to Standardization by using common HTTP mechanisms. Contributes to a better User Experience.
Implementation Details
`packages/app/src/services/health/HealthUploadHttpClientImpl.ts` — Client-side compression:

```ts
// excerpt from uploadBatch
const requestJson = JSON.stringify(request);
const uncompressedBytes = utf8ByteLength(requestJson);
if (isFeatureEnabled('healthGzip') && uncompressedBytes >= GZIP_MIN_BYTES) {
  const compressed = pako.gzip(requestJson);
  requestBody = compressed;
  bodyType = 'raw';
  contentEncoding = 'gzip';
  compressedBytes = compressed.byteLength;
  headers['Content-Encoding'] = 'gzip';
  headers['Content-Type'] = 'application/json';
}
```

- `packages/backend/src/app.ts` — Configures the Express `compression` middleware.
- `packages/backend/src/api/v1/controllers/health.controller.ts` — Validates the `Content-Encoding` header and logs byte sizes.
- `packages/backend/src/api/v1/middleware/jsonBodyParser.middleware.ts` — `createJsonBodyParser` handles `Content-Encoding` during parsing.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: WORKER-SCALABILITY.MD, OBSERVABILITY.MD
Client devices often have inaccurate clocks due to drift, user manipulation, or network latency. This causes issues for a sync-heavy application that relies on accurate timestamps for:
- Event Ordering — Correctly sequencing events (`consumption.created`, `purchase.finished`).
- Stale Data Detection — Miscalculating local cache freshness.
- Optimistic UI Updates — Client-side updates appearing out of sync with the server's authoritative timeline.
A Server-Time HTTP header was added to all HTTP responses, containing the server's current UTC timestamp in ISO 8601 format (e.g., 2025-01-15T10:30:45.123Z):
- A dedicated `server-time.middleware.ts` adds this header early in the middleware pipeline.
- Clients parse this header and calculate their clock offset (`serverTime - clientTime`).
- This offset adjusts local timestamps for display and optimistic updates.
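On the client, the offset calculation might look like the following sketch. The `Server-Time` header format is defined by this ADR; the half-round-trip correction and the helper names are assumptions:

```typescript
// Estimate the client's clock offset from one request/response cycle.
// Assumes the server stamped the header roughly midway through the round trip,
// which halves the latency bias noted in the Cons section.
function computeClockOffsetMs(
  serverTimeHeader: string,     // e.g. "2025-01-15T10:30:45.123Z"
  requestSentAtMs: number,      // client clock when the request was sent
  responseReceivedAtMs: number, // client clock when the response arrived
): number {
  const serverMs = Date.parse(serverTimeHeader);
  const clientEstimateMs = (requestSentAtMs + responseReceivedAtMs) / 2;
  return serverMs - clientEstimateMs;
}

// Applying the offset converts a local timestamp onto the server's timeline.
const toServerTime = (localMs: number, offsetMs: number) => localMs + offsetMs;
```

A positive offset means the client's clock runs behind the server; smoothing the offset over several responses (e.g., a running median) would further dampen network jitter.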
- NTP Client on Device: Implementing a full NTP client on mobile devices.
- Rejected: Overkill, high complexity, security concerns. OS-level NTP exists but isn't exposed to apps.
- Dedicated Time API Endpoint: A specific `/time` endpoint.
- Rejected: Requires an extra HTTP request per cycle. A header is more efficient.
- Embedding Server Time in Response Body: Including `serverTime` in every API response body.
- Rejected: Modifies every API contract. A header is a cleaner cross-cutting concern.
Pros:
- Accuracy — Provides a reliable, server-authoritative UTC timestamp.
- Efficiency — Minimal overhead (one header per request), no extra network calls.
- Consistency — Enables deterministic clock offset calculations.
- Improved UX — Correct "just now" labels and optimistic UI timing.
- Decoupling — Implemented as a clean, cross-cutting middleware.
Cons:
- Client Implementation — Requires consistent header parsing and offset application.
- Latency Bias — The timestamp is captured when the response starts being sent. Network latency contributes to an inherent (small) offset.
- Security — Timestamp is not sensitive but needs correct formatting.
- Introduction of `server-time.middleware.ts` in the middleware layer, registered early in `app.ts`.
- Clients (`BackendAPIClient`) are expected to parse this header and compute/apply a clock offset.
- The `Server-Time` header is implicitly part of the API contract for all HTTP responses.
Improves Data Integrity for timestamp-sensitive operations and enhances User Experience. A clean implementation of a Cross-Cutting Concern via middleware.
Implementation Details
`packages/backend/src/api/v1/middleware/server-time.middleware.ts`:

```ts
export function serverTimeMiddleware(req: Request, res: Response, next: NextFunction): void {
  const serverTime = new Date().toISOString();
  res.setHeader('Server-Time', serverTime);
  next();
}
```

`packages/backend/src/app.ts` — The middleware is applied globally:

```ts
// excerpt from setupMiddleware
this.app.use(middleware.correlationContext);
this.app.use(middleware.serverTime); // ADR-017
this.app.use(middleware.apiGateway);
```

Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: OBSERVABILITY.MD
Managing health data consistently across a mobile frontend and a backend API is challenging. Disparate definitions for metric codes, units, value types, and validation rules lead to:
- Semantic Drift — Frontend and backend interpreting the same metric differently.
- Validation Mismatches — Client-side validation succeeding while server-side fails.
- Normalization Errors — Incorrect unit conversions or data transformations.
- API Inconsistencies — Difficulty extending the API with new metrics.
- Debugging Overhead — Tracing issues caused by subtle differences in metric definitions.
All canonical health metric definitions (codes, names, categories, value kinds, units, allowed value ranges, category code allowlists) were centralized in @shared/health-config/metric-types.ts:
- `HEALTH_METRIC_DEFINITIONS` acts as the single source of truth (registry).
- Utilities (`getMetricDefinition`, `getCanonicalUnit`, `isValueInBounds`, `normalizeToCanonicalUnit`, `isCategoryCodeAllowed`) provide consistent access and validation.
- `HEALTH_METRIC_CODES` is a frozen array derived from the definitions, used for strict Zod schema validation.
- Invariants are validated at module load time to catch configuration drift early.
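To illustrate how the registry utilities behave, here is a deliberately miniature stand-in. The real registry lives in `@shared/health-config/metric-types.ts` and is far richer; the `body_mass` entry and the factor-based unit table here are hypothetical, chosen only to make normalization testable:

```typescript
// Miniature stand-in for the shared metric registry (illustrative only).
type MetricDef = {
  canonicalUnit: string;
  allowedUnits: Record<string, number>; // unit → multiplicative factor to canonical
  minValue: number;
  maxValue: number;
};

const REGISTRY: Record<string, MetricDef> = {
  heart_rate: {
    canonicalUnit: 'bpm',
    allowedUnits: { bpm: 1, 'count/min': 1 },
    minValue: 20,
    maxValue: 400,
  },
  body_mass: { // hypothetical second metric for illustration
    canonicalUnit: 'kg',
    allowedUnits: { kg: 1, lb: 0.45359237 },
    minValue: 1,
    maxValue: 500,
  },
};

function normalizeToCanonicalUnit(code: string, value: number, unit: string): number {
  const def = REGISTRY[code];
  if (!def) throw new Error(`unknown metric ${code}`);
  const factor = def.allowedUnits[unit];
  if (factor === undefined) throw new Error(`unit ${unit} not allowed for ${code}`);
  return value * factor;
}

function isValueInBounds(code: string, canonicalValue: number): boolean {
  const def = REGISTRY[code];
  if (!def) throw new Error(`unknown metric ${code}`);
  return canonicalValue >= def.minValue && canonicalValue <= def.maxValue;
}
```

Because both client and server call the same functions against the same frozen data, a sample that validates locally cannot fail server-side validation for definitional reasons.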
- Separate Definitions (Frontend/Backend): Each layer maintains its own definitions.
- Rejected: Directly leads to semantic drift, inconsistencies, and high maintenance burden.
- Backend as SSOT (API-driven): Frontend fetches metric definitions from a backend API endpoint.
- Rejected: Adds runtime overhead, requires local caching, and still needs a shared type definition. Compile-time shared contract is more robust.
- Database-Driven (Dynamic): Storing metric definitions in a database and loading dynamically.
- Rejected: Higher complexity, runtime overhead, and loses compile-time type safety. For immutable metadata, a code-based registry is simpler.
Pros:
- Single Source of Truth — Eliminates semantic drift between frontend and backend.
- Type Safety — All metric handling is compile-time type-checked.
- Consistent Validation — Identical rules applied consistently across layers.
- Reliable Normalization — Guarantees correct unit conversions.
- Extensibility — Adding new metrics involves a single update to the shared registry.
- Fail-Fast — Invariant checks catch configuration errors at module load time.
- Performance — O(1) lookup from the frozen registry.
Cons:
- Shared Module Coupling — Requires a shared `@shared` module, increasing frontend build size.
- Deployment Coordination — Updates require a coordinated deployment of both frontend and backend.
- Immutable Configuration — Not suitable for highly dynamic definitions that change frequently at runtime.
- Creation of `@shared/health-config/metric-types.ts` as a foundational shared module.
- Mobile apps import `HEALTH_METRIC_DEFINITIONS` and utilities for local validation and normalization.
- `health.contract.ts` (Zod schemas) directly imports `HealthMetricCodeSchema`.
- `HealthSampleService` uses these definitions for ingestion.
- Engineers must update the shared registry for any new health metric.
- Module-level invariant checks ensure the integrity of the definitions at runtime.
Strongly reinforces Single Source of Truth and DRY. Adheres to Contract-First Development and enhances Type Safety. Improves Robustness by ensuring consistent validation and normalization.
Implementation Details
`packages/shared/src/health-config/metric-types.ts` — Defines the registry:

```ts
// excerpt from HEALTH_METRIC_DEFINITIONS
const _HEALTH_METRIC_DEFINITIONS = {
  heart_rate: {
    code: 'heart_rate',
    name: 'Heart Rate',
    category: 'vital_signs',
    valueKind: 'SCALAR_NUM',
    canonicalUnit: 'bpm',
    allowedUnits: ['bpm', 'count/min'] as const,
    minValue: 20, maxValue: 400, expectedSamplesPerHour: 60,
  },
  // ... more definitions ...
} as const satisfies Readonly<Record<HealthMetricCode, HealthMetricDefinition>>;
```

- `packages/shared/src/contracts/health.contract.ts` — Imports `HealthMetricCodeSchema` and uses `isValueInBounds`, `isCategoryCodeAllowed` for Zod schema refinements.
- `packages/backend/src/services/health/HealthIngestionEngine.ts` — Uses `getMetricDefinition`, `isValueInBounds`, `isUnitAllowedForMetric` during ingestion.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: ARCHITECTURE.MD, DATA-INTEGRITY-GUARANTEES.MD
In an asynchronous processing pipeline, operations can get stuck in an "in-progress" or "processing" state due to worker crashes, network failures, or timeouts. When HealthIngestRequest or SessionTelemetryCache entries get stuck in COMPUTING/PROCESSING:
- Deadlocks — Subsequent requests for the same item are blocked, waiting for an abandoned "lock."
- Stale Data — Caches remain in a `COMPUTING` state indefinitely.
- Operational Blindness — Difficulty identifying and recovering from stuck states.
Manual intervention was previously required — not scalable or reliable.
Automated Reaper Jobs were implemented using BullMQ for proactive cleanup:
- `HealthIngestReaperJob` — Periodically finds `HealthIngestRequest` entries stuck in `PROCESSING` for too long and marks them `FAILED`. Clients can then retry safely.
- `SessionTelemetryLockReaperJob` — Periodically finds `SessionTelemetryCache` entries stuck in `COMPUTING` and marks them `FAILED`. Subsequent requests can take over the abandoned computation lock.
These jobs run on a fixed schedule (e.g., every 5-15 minutes) via BullMQ cron.
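The reaper's selection rule can be sketched in-memory as follows. In the real system this is a single bounded repository update (`reapStaleProcessingIngestRequests`); the field and parameter names follow this ADR, but the function shape is a hypothetical illustration:

```typescript
// In-memory sketch of a reaper pass over stuck ingest requests.
type IngestRequest = {
  id: string;
  status: 'PROCESSING' | 'FAILED' | 'COMPLETED';
  updatedAt: Date;
};

function reapStaleProcessing(
  rows: IngestRequest[],
  staleAfterMinutes: number,
  maxRows: number,
  now: Date = new Date(),
): string[] {
  const cutoff = new Date(now.getTime() - staleAfterMinutes * 60_000);
  const reaped: string[] = [];
  for (const row of rows) {
    if (reaped.length >= maxRows) break; // bound work per reaper run
    if (row.status === 'PROCESSING' && row.updatedAt < cutoff) {
      row.status = 'FAILED'; // marked FAILED, not deleted, so clients can retry
      reaped.push(row.id);
    }
  }
  return reaped;
}
```

Marking rows `FAILED` rather than deleting them is what makes the recovery idempotent: a worker that finishes just after being reaped simply loses the race to a terminal-state update.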
- Manual Cleanup: Relying on human operators.
- Rejected: Not scalable, error-prone, leads to longer downtimes.
- Process-Specific Watchdogs: Implementing a watchdog thread within each worker process.
- Rejected: Complex to implement reliably, doesn't cover external failures.
- Longer Timeouts: Simply increasing the timeout for `PROCESSING` states.
- Rejected: Only delays the problem.
Pros:
- Proactive Recovery — Automatically recovers from abandoned processing states.
- System Resilience — Improves robustness against transient failures and crashes.
- Operational Visibility — Reaper actions are logged, providing clear signals of stuck processes.
- Idempotency — Marking as `FAILED` (instead of deleting) allows clients to retry.
- Configurable — Thresholds for "stale" are configurable.
Cons:
- False Positives — A legitimate long-running operation might be incorrectly classified as "stale." Mitigated by generous timeouts.
- Complexity — Adds background jobs and associated logic.
- Race Conditions (Small Window) — A tiny window exists where a reaper might mark a job `FAILED` just before the original worker completes. Handled by idempotent updates and retry logic.
- No new tables, but `HealthIngestRequest` and `SessionTelemetryCache` have their `status` and `updatedAt` fields used for stale detection.
- `HealthSampleRepository` and `SessionTelemetryCacheRepository` implement `reapStaleProcessingIngestRequests` and `reapStaleComputingRows`.
- Introduction of `HealthIngestReaperJobData` and `SessionTelemetryLockReaperJobData` in `job.types.ts`, implemented in `job-processor.ts`, and scheduled in `schedules.ts`.
- Reduced need for manual intervention, improved system reliability.
Reinforces System Resilience and Reliability. Promotes Automated Operations and improves Observability into asynchronous process state.
Implementation Details
- `packages/backend/src/jobs/job.types.ts` — Defines `HealthIngestReaperJobData` and `SessionTelemetryLockReaperJobData`.
- `packages/backend/src/jobs/job-processor.ts` — Implements reaper logic:

```ts
// excerpt from processHealthIngestReaperJob
const result = await this.healthSampleRepository.reapStaleProcessingIngestRequests(
  data.staleAfterMinutes,
  data.maxRows
);
```

- `packages/backend/src/jobs/schedules.ts` — Schedules these jobs via BullMQ cron.
- `packages/backend/src/repositories/health-sample.repository.ts` — Implements `reapStaleProcessingIngestRequests`.
- `packages/backend/src/repositories/session-telemetry-cache.repository.ts` — Implements `reapStaleComputingRows`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: FAILURE-MODES.MD, WORKER-SCALABILITY.MD
When health data is deleted from the client device's HealthKit or Health Connect, the backend needs to reflect this change. A naive hard-delete approach poses several problems:
- Audit Trail Loss — No record of the data ever existing or being deleted.
- Analytics Impact — Historical aggregates would suddenly drop values.
- Reconciliation Issues — Difficult to reconcile if the client re-sends a "deleted" sample.
- GDPR Compliance — Users may request data deletion, but a temporary retention period might be necessary for auditing.
However, indefinite storage of all deleted data is costly.
Soft Deletion was implemented for HealthSample records:
- Instead of hard-deleting, `HealthSampleService` sets an `isDeleted` boolean to `true` and a `deletedAt` timestamp.
- API queries (`GET /health/samples`) default to filtering `isDeleted = false`.
- A periodic `HealthSampleSoftDeletePurgerJob` (BullMQ worker) hard-deletes records older than a configurable retention period (e.g., 30 days).
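The soft-delete lifecycle can be sketched as follows — an in-memory illustration of the `isDeleted`/`deletedAt` flags and the retention-based purge. The field names follow this ADR; the function shapes are assumptions, not the service API:

```typescript
// In-memory sketch of soft delete plus retention-based purge.
type Sample = { id: string; isDeleted: boolean; deletedAt: Date | null };

function softDelete(sample: Sample, now: Date = new Date()): void {
  sample.isDeleted = true;   // hidden from default queries
  sample.deletedAt = now;    // starts the retention clock
}

// Hard-delete only soft-deleted rows whose retention period has elapsed;
// active rows and recently deleted rows (still needed for audit) survive.
function purgeExpired(samples: Sample[], retentionDays: number, now: Date = new Date()): Sample[] {
  const cutoff = new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
  return samples.filter(
    (s) => !(s.isDeleted && s.deletedAt !== null && s.deletedAt < cutoff),
  );
}
```

This two-phase scheme is what reconciles the GDPR tension in the Context: deletion takes effect immediately for reads, while the audit trail persists only for the bounded retention window.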
- Hard Delete Only: Permanently delete records immediately.
- Rejected: Leads to audit trail loss, analytics disruption, and reconciliation issues.
- Indefinite Soft Delete: Never hard-delete records.
- Rejected: High storage costs, performance degradation, unnecessary data retention beyond needs.
- TimescaleDB Retention Policy: Using `add_retention_policy`.
- Rejected: Operates on entire chunks by time, not filtered by `isDeleted`; it would delete all data older than the period, not just soft-deleted rows.
Pros:
- Auditability — Preserves a complete history of data and its deletion.
- Analytics Integrity — Historical aggregates remain stable.
- Reconciliation — Simplifies handling if deleted data reappears.
- GDPR Compliance — Supports deletion requests with a temporary audit retention period.
- Storage Optimization — The purger job manages long-term storage.
Cons:
- Increased Storage — Soft-deleted records temporarily consume space.
- Query Complexity — Default queries require `WHERE isDeleted = false` clauses.
- Background Job Overhead — Requires a dedicated BullMQ job for purging.
- `prisma/schema.prisma` includes `isDeleted` (boolean) and `deletedAt` (DateTime) on `HealthSample`.
- `HealthSampleRepository` methods handle `isDeleted` flags (e.g., `queryActiveByUserAndTimeRange` filters `isDeleted = false`).
- `HealthSampleService.deleteConsumption` performs soft deletion.
- Introduction of `HealthSampleSoftDeletePurgerJobData` in `job.types.ts`, scheduled in `schedules.ts`.
- `GET /health/samples` implicitly filters soft-deleted samples.
Improves Data Integrity, Auditability, and Resource Management. Adheres to Privacy by Design for data deletion and supports Automated Operations.
Implementation Details
`packages/backend/prisma/schema.prisma` — Soft delete columns:

```prisma
model HealthSample {
  // ...
  isDeleted Boolean   @default(false) @map("is_deleted")
  deletedAt DateTime? @map("deleted_at") @db.Timestamptz(6)
  // ...
  @@index([userId, isDeleted])
  @@index([deletedAt])
}
```

- `packages/backend/src/services/healthSample.service.ts` — `deleteConsumption` method for soft deleting.
- `packages/backend/src/repositories/health-sample.repository.ts` — `queryActiveByUserAndTimeRange` and `purgeAllOldDeletedSamplesForAdmin`.
- `packages/backend/src/jobs/job.types.ts` — Defines `HealthSampleSoftDeletePurgerJobData`.
- `packages/backend/src/jobs/job-processor.ts` — Implements `processHealthSampleSoftDeletePurgerJob`.
- `packages/backend/src/jobs/schedules.ts` — Schedules `scheduleHealthSampleSoftDeletePurger`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: DATA-INTEGRITY-GUARANTEES.MD, FAILURE-MODES.MD
In an event-driven system with a rolling window definition for active sessions, it's possible for sessions to become "stale active":
- A user exits the app abruptly without explicitly ending a session.
- Network issues prevent the client from sending the `session:end` event.
- A client's clock is significantly out of sync.
These stale active sessions lead to inaccurate counts of active sessions, blocking of downstream processing (e.g., session impact calculations), and potential data inconsistencies. Reactive cleanup (e.g., createSession closing prior active sessions) is insufficient for global cleanup.
A Global Stale Session Reconciliation Job was implemented:
- A periodic `StaleSessionReconciliationJob` (BullMQ worker) runs on a schedule (e.g., every 10 minutes).
- It sweeps across all users to find sessions still marked `ACTIVE` or `PAUSED` but whose `sessionEndTimestamp` is in the past.
- For each stale session, it invokes `SessionService.completeSession()` durably, ensuring the session status is set to `COMPLETED` and the `session.ended` domain event is emitted.
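The sweep's selection rule and idempotent completion can be sketched as follows. Status values and the `sessionEndTimestamp` predicate come from this ADR; the in-memory shapes are illustrative, and in the real service completion also emits the `session.ended` domain event:

```typescript
// In-memory sketch of the global stale-session sweep.
type Session = {
  id: string;
  status: 'ACTIVE' | 'PAUSED' | 'COMPLETED';
  sessionEndTimestamp: Date;
};

function findStaleSessions(sessions: Session[], now: Date, limit: number): Session[] {
  return sessions
    .filter(
      (s) =>
        (s.status === 'ACTIVE' || s.status === 'PAUSED') &&
        s.sessionEndTimestamp < now,
    )
    .slice(0, limit); // bounded, mirroring maxUsers × maxSessionsPerUser
}

// Idempotent: completing an already-COMPLETED session is a no-op, which
// covers the race where a late-arriving client session:end wins first.
function completeSession(session: Session): boolean {
  if (session.status === 'COMPLETED') return false;
  session.status = 'COMPLETED'; // the real service also emits session.ended here
  return true;
}
```

Routing completion through the service layer (rather than a direct SQL update) is the crux of the decision: it is the only path that guarantees the `session.ended` event fires.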
- Purely Reactive Cleanup: Relying solely on `createSession` to close prior active sessions, or `getActiveSessions` to perform inline cleanup.
- Rejected: Insufficient for global cleanup. Only fires when the specific user takes action.
- Database-Level Cron Job: A SQL cron job directly updating `session.status`.
- Rejected: Bypasses service-layer business logic and event emission, losing critical `session.ended` domain events.
- Longer Session Timeouts: Increasing `SESSION_IDLE_TIMEOUT_MS`.
- Rejected: Only delays the problem.
Pros:
- Data Integrity — Ensures session statuses are eventually consistent with their actual lifespan.
- Accuracy — Provides accurate counts of genuinely active sessions.
- Proactive Cleanup — Automatically clears abandoned sessions.
- Event Emission — Guarantees `session.ended` domain events for all completed sessions.
- Scalability — Bounded by `maxUsers` and `maxSessionsPerUser` to prevent runaway queries.
Cons:
- Background Job Overhead — Requires a dedicated BullMQ job.
- Complexity — Adds global sweep logic to the `SessionService`.
- Potential False Positives — A session might be active despite `sessionEndTimestamp` being in the past (extreme clock skew). Mitigated by idempotent completion.
- `SessionService` gained `reconcileStaleSessionsGlobal`, and `completeSession` was made robust for idempotent calls.
- Introduction of `StaleSessionReconciliationJobData` in `job.types.ts`, implemented in `job-processor.ts`, and scheduled in `schedules.ts`.
- Reduced manual intervention, improved data consistency for session-related analytics and projections.
- Queries for active sessions are more accurate.
Improves Data Integrity, System Resilience, and Automated Operations. Ensures Event-Driven Architecture consistency by guaranteeing session.ended events for all completed sessions.
Implementation Details
`packages/backend/src/services/session.service.ts` — Reconciliation logic:

```ts
// excerpt from reconcileStaleSessionsGlobal
const staleSessions = await this.sessionRepository.findManyAdmin({
  where: {
    status: { in: ['ACTIVE', 'PAUSED'] },
    sessionEndTimestamp: { lt: now },
  },
  orderBy: { userId: 'asc' },
  take: maxUsers * maxSessionsPerUser,
});
// ... then loops through and calls completeSession() ...
```

- `packages/backend/src/repositories/session.repository.ts` — `findManyAdmin` method for querying across users.
- `packages/backend/src/jobs/job.types.ts` — Defines `StaleSessionReconciliationJobData`.
- `packages/backend/src/jobs/job-processor.ts` — Implements `processStaleSessionReconciliationJob`.
- `packages/backend/src/jobs/schedules.ts` — Schedules `scheduleStaleSessionReconciliation`.
Status: Implemented · Date: 2026-03-24 · Reviewers: Core Engineering Team · Related: DATA-INTEGRITY-GUARANTEES.MD, FAILURE-MODES.MD, WORKER-SCALABILITY.MD
To ensure the ADRs remain a living, accurate, and high-value document:
When to Create a New ADR:
- Introduces a new technology or significant library
- Changes the core data model of an entity
- Impacts the system's scalability, performance, security, or resilience
- Modifies a cross-cutting concern (e.g., authentication, logging, error handling)
- Resolves a critical bug that exposed an architectural flaw
- Involves a significant trade-off or a choice between multiple complex alternatives
How to Update an ADR:
Once an ADR is marked Implemented, its core Decision, Alternatives, and Rationale sections are considered immutable. However, the Status can be updated (e.g., to Deprecated or Superseded with a link to a newer ADR). The Consequences section can be appended with new insights.
Review Process:
All proposed ADRs should undergo peer review to ensure clarity, accuracy, and alignment with architectural principles before being merged.
Tooling:
Use Markdown for easy readability and version control. Consider incorporating CI checks for ADR format consistency in the future.