refactor(gateway): merge standalone cron process into gateway #103
jacoblee-io merged 3 commits into main
Conversation
The cron service was a separate Node.js process that communicated with gateway via 17 HTTP internal endpoints — pure overhead, since all state lives in gateway's DB. This merges cron into an in-process CronService that calls configRepo directly.

Changes:
- New `CronService` (src/gateway/cron/cron-service.ts) wraps `CronScheduler` with direct DB access, job execution via localhost agent-prompt, and notification delivery (DB + WebSocket + channel callback)
- RPC methods and channel-bridge call `cronService.addOrUpdate`/`cancel` directly instead of HTTP broadcast via `notifyCronService`
- Removed ~170 lines of `/api/internal/cron/*` HTTP endpoints from server.ts
- Removed ~60 lines of the `/api/internal/cron-notify` handler
- Deleted 6 src files: cron-main, cron-coordinator, cron-executor, gateway-client, cron-api, cron/notify
- Deleted Dockerfile.cron, k8s/cron-deployment.yaml, 3 helm templates
- Updated Makefile, docs, CLAUDE.md, CONTRIBUTING.md

Net: -1100 lines, one fewer K8s deployment to manage.
LikiosSedo
left a comment
Review: refactor(gateway): merge standalone cron process into gateway
LGTM — clean, well-motivated refactor. Net -1100 lines, one fewer K8s deployment, and fixes local-mode cron not working.
What works well
- Eliminates the HTTP proxy layer: the old `gateway-client.ts` wrapped every DB operation as an HTTP call (register, heartbeat, claim-job, etc. — 20+ endpoints in `server.ts`). Now all DB access is direct via `configRepo`. Lower latency, fewer failure modes.
- Removes over-engineered multi-instance coordination: heartbeat, dead-instance detection, job claiming/reassigning — none of this was needed for single-instance deployment.
- Fixes local-mode cron: `CronService` is created whenever `configRepo && notifRepo` exist, so local web mode finally gets working cron.
- Consistent refactoring in rpc-methods + channel-bridge: all `notifyCronService()` calls uniformly replaced with `cronService.addOrUpdate()`/`cronService.cancel()`.
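The local-mode fix hinges on a simple gate: create the service whenever both repos exist. A minimal sketch of that gating, with the repo types and factory name assumed for illustration:

```typescript
// Assumed shapes: only the "configRepo && notifRepo" condition comes from
// the review; everything else here is illustrative.
type ConfigRepo = { kind: "config" };
type NotifRepo = { kind: "notif" };

class CronService {
  constructor(
    readonly configRepo: ConfigRepo,
    readonly notifRepo: NotifRepo,
  ) {}
}

// Hypothetical factory: cron is wired up whenever both dependencies are
// present, so local web mode (which has both) gets cron for free.
function maybeCreateCronService(
  configRepo?: ConfigRepo,
  notifRepo?: NotifRepo,
): CronService | undefined {
  return configRepo && notifRepo
    ? new CronService(configRepo, notifRepo)
    : undefined;
}
```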
Minor suggestions (non-blocking)
- Purge operations still use localhost HTTP: `purgeNotifications()` and `purgeSessions()` call `http://localhost:{port}/api/internal/...`, but `CronService` already holds `configRepo` and `notifRepo` references. Session purge has complex logic (soft-delete + hard-delete + stats) that lives in the route handler, so this avoids duplication — pragmatic for now. Could be extracted into a shared service method in a follow-up.
- K8s multi-replica: if gateway ever scales >1, each replica runs its own `CronService`. `lockJobForExecution` prevents duplicate execution and purge is idempotent, so correctness is fine. But background timers (stale lock cleanup, purge) would run redundantly. Not an issue today.
- `cronService.start()` is fire-and-forget: in `gateway-main.ts`, start failure is logged but gateway continues without cron. Consider adding a health indicator or retry for robustness.
- Vestigial `assignedTo` field: all call sites now pass `assignedTo: null`. If `CronScheduler` and the DB schema still carry this column, a follow-up cleanup would be nice.
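The multi-replica argument rests on the lock being first-claimant-wins. A hedged sketch of that property (the real `lockJobForExecution` would be an atomic conditional UPDATE in the DB, not an in-memory Set):

```typescript
// Illustrative model of a job lock where only the first claimant succeeds;
// a second replica firing the same job becomes a no-op. JobLocks and
// raceForJob are hypothetical names for demonstration.
class JobLocks {
  private locked = new Set<string>();

  // Returns true for the first caller only (models an atomic
  // "UPDATE ... WHERE locked_by IS NULL" in the real DB-backed version).
  tryLock(jobId: string): boolean {
    if (this.locked.has(jobId)) return false;
    this.locked.add(jobId);
    return true;
  }

  unlock(jobId: string): void {
    this.locked.delete(jobId);
  }
}

// N replicas race to claim the same job; count how many succeed.
function raceForJob(locks: JobLocks, jobId: string, replicas: number): number {
  let winners = 0;
  for (let i = 0; i < replicas; i++) {
    if (locks.tryLock(jobId)) winners++;
  }
  return winners;
}
```

This is why correctness holds at >1 replica even though the redundant background timers remain.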
Correctness check
- `execute()`: re-validate → lock → execute → unlock in finally ✓
- `recordResult()`: matches old `/api/internal/cron-notify` handler logic (insert run + notification + WS push + channel callback) ✓
- `buildCronPrompt()`: preserved NON-INTERACTIVE rules from old `cron-executor.ts` ✓
- `server.ts`: all `/api/internal/cron/*` endpoints removed (~240 lines), `/api/internal/agent-prompt` preserved ✓
- `stop()`: cleans up all 3 timers + `scheduler.stop()` ✓
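The `execute()` lifecycle being checked can be sketched as below. Names and the synchronous shape are illustrative (the real method is async and DB-backed); the point is the ordering: re-validate, lock, run, unlock in `finally`.

```typescript
// Hypothetical reduction of the execute() lifecycle from the checklist.
type Outcome = "skipped" | "done" | "failed";

function executeJob(
  job: { id: string; enabled: boolean },
  locks: { tryLock(id: string): boolean; unlock(id: string): void },
  run: () => void,
): Outcome {
  if (!job.enabled) return "skipped";           // re-validate before running
  if (!locks.tryLock(job.id)) return "skipped"; // another holder wins
  try {
    run();
    return "done";
  } catch {
    return "failed";
  } finally {
    locks.unlock(job.id);                       // always released, even on throw
  }
}
```

Unlocking in `finally` is what keeps a crashing job from wedging its lock until the stale-lock cleanup timer fires.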
chent1996
left a comment
Great simplification — -1100 lines, one fewer K8s deployment, all coordination complexity removed. Session/notification purge logic is correctly preserved (same endpoints, same intervals, loopback HTTP). Deletion is clean, no dangling imports.
One bug below, plus two non-blocking notes:
Note 1: docs/design/db-cleanup-spec.md still references cron-main.ts as a standalone process in 4 places — should be updated.
Note 2: src/shared/retry.ts exports (withRetry, shouldRetryHttp, HttpError) are now dead code — only consumers were the deleted gateway-client.ts and cron-main.ts.
```ts
} catch (err) {
  console.error(`[cron-service] Job ${job.id} failed:`, err);
  await this.configRepo.updateCronJobRun(job.id, "failure");
  await this.recordResult(job, "failure", "", err instanceof Error ? err.message : String(err));
```
Bug (Medium): updateCronJobRun not wrapped in try/catch — failure notification lost
If updateCronJobRun(job.id, "failure") throws (DB error), the next line recordResult() is skipped entirely. recordResult handles DB history insert, WebSocket push, AND channel notification — so the user gets zero feedback about the failed job.
The success path on line 132 has the same issue, but it's less critical since the result has already been captured.
Suggestion:

```ts
} catch (err) {
  try {
    await this.configRepo.updateCronJobRun(job.id, "failure");
  } catch (e) {
    console.warn(`[cron-service] Failed to update run status for ${job.id}:`, e);
  }
  await this.recordResult(job, "failure", "", err instanceof Error ? err.message : String(err));
}
```

This matches the defensive pattern already used inside `recordResult` itself (which wraps `insertCronJobRun` in try/catch).
jacoblee-io
left a comment
Thanks for the review!
- `src/shared/retry.ts`: deleted in 027cfc8 — confirmed no other consumers remain.
- `docs/design/db-cleanup-spec.md`: this file doesn't exist on `main` or this branch. Could you double-check which file you meant?
… loss

If `updateCronJobRun` throws (e.g. DB error), `recordResult` was skipped entirely — losing the execution history, WebSocket push, and channel notification. Now both success and failure paths handle the error defensively, matching the pattern already used inside `recordResult`.
chent1996
left a comment
Both fixes addressed:
- `updateCronJobRun` now wrapped in try/catch on both success and failure paths — `recordResult` will always execute.
- Dead `retry.ts` removed.

LGTM.
Summary
Merge the standalone cron process into the gateway, eliminating 17 HTTP internal endpoints, multi-instance coordination, and a separate K8s deployment. The cron service was stateless — all data lives in gateway's DB — so the separate process added unnecessary complexity (heartbeat, dead-instance detection, job claiming/reassignment, HTTP round-trips).
Solution
- `CronService` (src/gateway/cron/cron-service.ts) wraps the existing `CronScheduler` with direct `configRepo` DB calls.
- Job execution still goes through localhost `/api/internal/agent-prompt` (same as triggers).
- `cronService.addOrUpdate()`/`.cancel()` called directly instead of HTTP broadcast via `notifyCronService()`.
- Removed multi-instance coordination (`CronCoordinator`, `assignedTo` logic, `cron_instances` table reads).
- Deleted `Dockerfile.cron`, `k8s/cron-deployment.yaml`, 3 helm templates.

Net: -1100 lines, one fewer K8s deployment.
DB schema note
`cron_instances` table and `cron_jobs.assignedTo` column become unused — left in place to avoid migration, can be dropped in a follow-up.

Test plan
- `npx tsc --noEmit` passes
- `npx vitest run` — no new failures
- No dangling references to deleted files (`grep` verified)
- Gateway startup logs `[cron-service] Loaded 0 active jobs` + `Started`
- `siclaw-cron` deployment removed