refactor(gateway): merge standalone cron process into gateway #103

Merged
jacoblee-io merged 3 commits into main from refactor/merge-cron-into-gateway
Mar 13, 2026

Conversation

@jacoblee-io
Collaborator

Summary

Merge the standalone cron process into the gateway, eliminating 17 HTTP internal endpoints, multi-instance coordination, and a separate K8s deployment. The cron service was stateless — all data lives in gateway's DB — so the separate process added unnecessary complexity (heartbeat, dead-instance detection, job claiming/reassignment, HTTP round-trips).

Solution

  • New in-process CronService (src/gateway/cron/cron-service.ts) wraps the existing CronScheduler with direct configRepo DB calls. Job execution still goes through localhost/api/internal/agent-prompt (same as triggers).
  • RPC methods and channel-bridge call cronService.addOrUpdate()/.cancel() directly instead of HTTP broadcast via notifyCronService().
  • Removed multi-instance coordination (CronCoordinator, assignedTo logic, cron_instances table reads).
  • Deleted 6 source files, Dockerfile.cron, k8s/cron-deployment.yaml, 3 helm templates.
  • Updated Makefile, helm values, k8s manifests, docs.

Net: -1100 lines, one fewer K8s deployment.
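The direct-call wiring can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `addOrUpdate`/`cancel` names come from the PR description, while the `CronJob` shape, the `ConfigRepo` interface, and the timer handling are simplified assumptions.

```typescript
// Hypothetical sketch of in-process cron wiring. Only addOrUpdate/cancel
// are named in the PR; the rest is illustrative.
type CronJob = { id: string; schedule: string; enabled: boolean };

interface ConfigRepo {
  upsertCronJob(job: CronJob): Promise<void>;
  deleteCronJob(id: string): Promise<void>;
}

class CronService {
  private timers = new Map<string, ReturnType<typeof setTimeout>>();

  constructor(private configRepo: ConfigRepo) {}

  // Callers (RPC methods, channel-bridge) invoke this directly instead of
  // broadcasting over HTTP: one DB write plus an in-process reschedule.
  async addOrUpdate(job: CronJob): Promise<void> {
    await this.configRepo.upsertCronJob(job);
    const existing = this.timers.get(job.id);
    if (existing) clearTimeout(existing);
    this.timers.delete(job.id);
    if (job.enabled) {
      // A real scheduler would derive the delay from job.schedule;
      // a fixed placeholder delay is used here.
      this.timers.set(job.id, setTimeout(() => {}, 60_000));
    }
  }

  async cancel(id: string): Promise<void> {
    await this.configRepo.deleteCronJob(id);
    const existing = this.timers.get(id);
    if (existing) clearTimeout(existing);
    this.timers.delete(id);
  }
}
```

The point of the shape: what used to be `notifyCronService()` fanning out over HTTP collapses into a single awaited method call, so callers see DB errors synchronously instead of via a broadcast round-trip.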

DB schema note

cron_instances table and cron_jobs.assignedTo column become unused — left in place to avoid migration, can be dropped in a follow-up.

Test plan

  • npx tsc --noEmit passes
  • npx vitest run — no new failures
  • No remaining imports of deleted modules (grep verified)
  • Deployed to test cluster — gateway logs show [cron-service] Loaded 0 active jobs + Started
  • Helm upgrade automatically removed siclaw-cron deployment
  • Manual: create/pause/delete/rename cron job via UI, verify scheduler picks up changes
  • Manual: wait for a job to fire, verify agent-prompt execution + notification delivery

The cron service was a separate Node.js process that communicated with
gateway via 17 HTTP internal endpoints — pure overhead since all state
lives in gateway's DB. This merges cron into an in-process CronService
that calls configRepo directly.

Changes:
- New CronService (src/gateway/cron/cron-service.ts) wraps CronScheduler
  with direct DB access, job execution via localhost agent-prompt, and
  notification delivery (DB + WebSocket + channel callback)
- RPC methods and channel-bridge call cronService.addOrUpdate/cancel
  directly instead of HTTP broadcast via notifyCronService
- Removed ~170 lines of /api/internal/cron/* HTTP endpoints from server.ts
- Removed ~60 lines of /api/internal/cron-notify handler
- Deleted 6 src files: cron-main, cron-coordinator, cron-executor,
  gateway-client, cron-api, cron/notify
- Deleted Dockerfile.cron, k8s/cron-deployment.yaml, 3 helm templates
- Updated Makefile, docs, CLAUDE.md, CONTRIBUTING.md

Net: -1100 lines, one fewer K8s deployment to manage.

@LikiosSedo LikiosSedo left a comment


Review: refactor(gateway): merge standalone cron process into gateway

LGTM — clean, well-motivated refactor. Net -1100 lines, one fewer K8s deployment, and fixes local-mode cron not working.

What works well

  • Eliminates the HTTP proxy layer: The old gateway-client.ts wrapped every DB operation as an HTTP call (register, heartbeat, claim-job, etc. — 20+ endpoints in server.ts). Now all DB access is direct via configRepo. Lower latency, fewer failure modes.
  • Removes over-engineered multi-instance coordination: Heartbeat, dead instance detection, job claiming/reassigning — none of this was needed for single-instance deployment.
  • Fixes local-mode cron: CronService is created whenever configRepo && notifRepo exist, so local web mode finally gets working cron.
  • Consistent refactoring in rpc-methods + channel-bridge: All notifyCronService() calls uniformly replaced with cronService.addOrUpdate() / cronService.cancel().

Minor suggestions (non-blocking)

  1. Purge operations still use localhost HTTP: purgeNotifications() and purgeSessions() call http://localhost:{port}/api/internal/..., but CronService already holds configRepo and notifRepo references. Session purge has complex logic (soft-delete + hard-delete + stats) that lives in the route handler, so this avoids duplication — pragmatic for now. Could be extracted into a shared service method in a follow-up.

  2. K8s multi-replica: If gateway ever scales >1, each replica runs its own CronService. lockJobForExecution prevents duplicate execution and purge is idempotent, so correctness is fine. But background timers (stale lock cleanup, purge) would run redundantly. Not an issue today.

  3. cronService.start() is fire-and-forget: In gateway-main.ts, start failure is logged but gateway continues without cron. Consider adding a health indicator or retry for robustness.

  4. Vestigial assignedTo field: All call sites now pass assignedTo: null. If CronScheduler and the DB schema still carry this column, a follow-up cleanup would be nice.
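The duplicate-execution guard in point 2 relies on an atomic claim. `lockJobForExecution` is named in the review; the in-memory version below is a stand-in for what would presumably be an atomic DB update (e.g. `UPDATE ... SET locked_at = now() WHERE id = ? AND locked_at IS NULL`), plus the stale-lock cleanup that each replica's background timer would run redundantly.

```typescript
// In-memory stand-in for a DB-backed job lock; illustrative only.
class InMemoryJobLocks {
  private lockedAt = new Map<string, number>();

  // Atomic claim: exactly one caller per job wins until the lock is
  // released. Concurrent replicas racing on the same job lose here.
  lockJobForExecution(jobId: string, now = Date.now()): boolean {
    if (this.lockedAt.has(jobId)) return false;
    this.lockedAt.set(jobId, now);
    return true;
  }

  // Stale-lock cleanup: releases locks held longer than maxAgeMs, e.g.
  // after a replica crashed mid-execution. Idempotent, so running it on
  // every replica is redundant but harmless.
  releaseStale(maxAgeMs: number, now = Date.now()): string[] {
    const released: string[] = [];
    for (const [id, at] of this.lockedAt) {
      if (now - at > maxAgeMs) {
        this.lockedAt.delete(id);
        released.push(id);
      }
    }
    return released;
  }

  unlock(jobId: string): void {
    this.lockedAt.delete(jobId);
  }
}
```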

Correctness check

  • execute(): re-validate → lock → execute → unlock in finally ✓
  • recordResult(): matches old /api/internal/cron-notify handler logic (insert run + notification + WS push + channel callback) ✓
  • buildCronPrompt(): preserved NON-INTERACTIVE rules from old cron-executor.ts
  • server.ts: all /api/internal/cron/* endpoints removed (~240 lines), /api/internal/agent-prompt preserved ✓
  • stop(): cleans up all 3 timers + scheduler.stop()


@chent1996 chent1996 left a comment


Great simplification — -1100 lines, one fewer K8s deployment, all coordination complexity removed. Session/notification purge logic is correctly preserved (same endpoints, same intervals, loopback HTTP). Deletion is clean, no dangling imports.

One bug below, plus two non-blocking notes:

Note 1: docs/design/db-cleanup-spec.md still references cron-main.ts as a standalone process in 4 places — should be updated.

Note 2: src/shared/retry.ts exports (withRetry, shouldRetryHttp, HttpError) are now dead code — only consumers were the deleted gateway-client.ts and cron-main.ts.

```typescript
} catch (err) {
  console.error(`[cron-service] Job ${job.id} failed:`, err);
  await this.configRepo.updateCronJobRun(job.id, "failure");
  await this.recordResult(job, "failure", "", err instanceof Error ? err.message : String(err));
```

Bug (Medium): updateCronJobRun not wrapped in try/catch — failure notification lost

If updateCronJobRun(job.id, "failure") throws (DB error), the next line recordResult() is skipped entirely. recordResult handles DB history insert, WebSocket push, AND channel notification — so the user gets zero feedback about the failed job.

The success path on line 132 has the same issue, but it's less critical since the result has already been captured.

Suggestion:

```typescript
} catch (err) {
  try { await this.configRepo.updateCronJobRun(job.id, "failure"); } catch { /* warn */ }
  await this.recordResult(job, "failure", "", err instanceof Error ? err.message : String(err));
}
```

This matches the defensive pattern already used inside recordResult itself (which wraps insertCronJobRun in try/catch).
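A runnable illustration of that defensive pattern, assuming stubbed dependencies: the `updateCronJobRun`/`recordResult` names mirror the PR, but the signatures and implementations below are hypothetical. The invariant being demonstrated is that a failed bookkeeping write must not swallow the user-facing notification.

```typescript
// Hypothetical failure handler: bookkeeping write is best-effort,
// notification delivery is mandatory.
async function handleJobFailure(
  jobId: string,
  err: unknown,
  deps: {
    updateCronJobRun: (id: string, status: "failure") => Promise<void>;
    recordResult: (id: string, status: "failure", output: string, error: string) => Promise<void>;
    warn: (msg: string) => void;
  },
): Promise<void> {
  try {
    await deps.updateCronJobRun(jobId, "failure");
  } catch (dbErr) {
    // Swallow and warn: losing the run-counter update is tolerable,
    // losing the notification below is not.
    deps.warn(`updateCronJobRun failed for ${jobId}: ${String(dbErr)}`);
  }
  await deps.recordResult(jobId, "failure", "", err instanceof Error ? err.message : String(err));
}
```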


@jacoblee-io jacoblee-io left a comment


Thanks for the review!

  • src/shared/retry.ts: Deleted in 027cfc8 — confirmed no other consumers remain.
  • docs/design/db-cleanup-spec.md: This file doesn't exist on main or this branch. Could you double-check which file you meant?

… loss

If updateCronJobRun throws (e.g. DB error), recordResult was skipped
entirely — losing the execution history, WebSocket push, and channel
notification. Now both success and failure paths handle the error
defensively, matching the pattern already used inside recordResult.

@chent1996 chent1996 left a comment


Both fixes addressed:

  1. updateCronJobRun now wrapped in try/catch on both success and failure paths — recordResult will always execute.
  2. Dead retry.ts removed.

LGTM.

@jacoblee-io jacoblee-io merged commit c655c1f into main Mar 13, 2026
3 checks passed
@jacoblee-io jacoblee-io deleted the refactor/merge-cron-into-gateway branch March 13, 2026 10:25
