feat: add health monitoring and health check management#2
Merged
Conversation
- Implemented a health monitoring system with a new Monitor struct to manage health checks for database connections. - Added functionality to probe connections for various database types, including PostgreSQL, MySQL, Convex, and Cloudflare D1. - Created a health check section in the UI to display the status of health checks, including the ability to configure settings and view notifications. - Enhanced the app's HTML template to include health check indicators and a detailed view of affected databases.
There was a problem hiding this comment.
3 issues found across 13 files
Confidence score: 3/5
- There is concrete regression risk in
internal/health/monitor.go:Stop()can return beforerunCheckgoroutines finish, so background work may still touchm.repo/m.notifierduring shutdown. internal/repository/repository.gohas a consistency gap: notification creation and health-check binding are committed separately, so a failure inattachNotificationToAllHealthCheckscan leave partially persisted state while returning an error.internal/notifications/dispatcher.gohas a user-facing formatting issue where an emptyConnection.Namecan produce a blank name in message bodies (subject has fallback, body does not), which is lower severity but still noticeable.- Pay close attention to
internal/health/monitor.go,internal/repository/repository.go, andinternal/notifications/dispatcher.go- shutdown synchronization, transactional consistency, and message fallback behavior need validation.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="internal/notifications/dispatcher.go">
<violation number="1" location="internal/notifications/dispatcher.go:312">
P2: Missing fallback for empty `Connection.Name` in the message body. `buildHealthSMTPSubject` falls back to `check.ConnectionID` when the name is empty, but `formatHealthMessage` writes the name directly, producing a blank `- Connection: ` line.</violation>
</file>
<file name="internal/health/monitor.go">
<violation number="1" location="internal/health/monitor.go:70">
P1: **`Stop()` returns before in-flight health-check goroutines finish.** Goroutines spawned inside `processDue` are not tracked, so `Stop()` can return while `runCheck` goroutines are still using `m.repo` and `m.notifier`. Use a `sync.WaitGroup` to track spawned goroutines and `wg.Wait()` inside `Stop()` after the ticker loop exits.</violation>
</file>
<file name="internal/repository/repository.go">
<violation number="1" location="internal/repository/repository.go:313">
P2: The notification create and the health-check binding create are two independent commits. If `attachNotificationToAllHealthChecks` fails, the notification is already persisted but the caller gets an error, leaving the system in an inconsistent state. Wrap both operations in a single transaction, similar to how `DeleteNotification` handles multi-table changes.</violation>
</file>
Architecture diagram
sequenceDiagram
participant UI as Client (UI/API)
participant HMon as NEW: Health Monitor (Background)
participant Repo as Repository (GORM)
participant Probe as NEW: Connection Prober
participant DB as Target DB (PG/MySQL/D1/Convex)
participant Notif as Dispatcher (SMTP/Discord)
Note over HMon, Repo: Periodic Background Loop
loop Every HEALTH_POLL_INTERVAL
HMon->>Repo: ListDueHealthChecks(now, batchSize)
Repo-->>HMon: HealthCheck metadata + Connection
loop For each due check
HMon->>Repo: NEW: ClaimHealthCheckRun(id, lease)
Note right of Repo: Prevents concurrent runs
Repo-->>HMon: Claim Success/Failure
opt Claimed successfully
HMon->>Probe: NEW: ProbeConnection(connection, timeout)
alt TCP (Postgres/MySQL)
Probe->>DB: Dial TCP Reachability
else API (Cloudflare D1)
Probe->>DB: POST /query (SELECT 1)
else Endpoint (Convex)
Probe->>DB: HTTPS reachability
end
DB-->>Probe: Result (Success/Error)
Probe-->>HMon: ProbeResult (Healthy/Unhealthy)
HMon->>Repo: NEW: SaveHealthCheckProbeResult(id, result)
Note right of Repo: Increments consecutive fail/success counts
alt NEW: Threshold met (Status change)
Repo-->>HMon: Return Event (DOWN or RECOVERED)
HMon->>Notif: NEW: NotifyHealthCheckEvent(check, event)
Notif->>Notif: Filter by Binding (OnDown/OnRecovered)
Notif->>UI: External Notification (Discord/Email)
else No status change
Repo-->>HMon: Return no event
end
end
end
end
Note over UI, Probe: Manual Trigger Path
UI->>Repo: NEW: POST /health-checks/{id}/run
Repo->>Repo: Check NEW: HEALTH_MANUAL_RUN_COOLDOWN
alt Cooldown Active
Repo-->>UI: 429 Too Many Requests (Retry-After)
else Cooldown Expired
Repo->>Probe: NEW: ProbeConnection(...)
Probe->>DB: Execute Probe
DB-->>Probe: Result
Probe-->>UI: Return immediate health status
end
Note over UI, Repo: Connection Lifecycle
UI->>Repo: Create NEW Connection
Repo->>Repo: Save Connection Record
Repo->>Repo: NEW: UpsertHealthCheckForConnection(...)
Note right of Repo: Seeds default intervals/thresholds from Config
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary by cubic
Adds automated health monitoring for database connections with a background monitor, API and UI to manage checks and notifications, plus a header status indicator with a down-database popover.
New Features
/health-checks(list/get),/health-checks/{id}(PATCH settings),/health-checks/{id}/run(manual run with cooldown),/health-checks/{id}/notifications(list/set bindings).Retry-Afterwhen invoked too soon.Migration
HEALTH_POLL_INTERVAL,HEALTH_DUE_BATCH_SIZE,MAX_CONCURRENT_HEALTH_CHECKS,DEFAULT_HEALTH_CHECK_INTERVAL,DEFAULT_HEALTH_CHECK_TIMEOUT,DEFAULT_HEALTH_FAILURE_THRESHOLD,DEFAULT_HEALTH_SUCCESS_THRESHOLD,HEALTH_MANUAL_RUN_COOLDOWN.CLOUDFLARE_ACCOUNT_ID,CLOUDFLARE_DATABASE_ID,CLOUDFLARE_API_KEY(optionalD1_API_BASE_URL). Existing connections are seeded with checks on startup.Written for commit f00d088. Summary will update on new commits.