feat: add resilient background job retry & monitoring (#130) #611
Closed
DrGalio wants to merge 1 commit into rohitdash08:main from
Closes rohitdash08#130

Problem:
- Background reminder jobs had zero retry logic: if `send_reminder()` returned `False` or threw an exception, the reminder was still marked `sent=True`.
- There was no visibility into job health, failure rates, or exhausted jobs.

Changes:
- Model: add `retry_count` and `last_error` columns to the `Reminder` model
- Schema: add the new columns plus a partial index for efficient retry queries
- Routes: fix `run_due` to mark `sent=True` only on successful delivery
- Routes: add exponential backoff retry (1 min → 5 min → 25 min)
- Routes: add a `GET /reminders/stats` monitoring endpoint
- Routes: add `POST /reminders/:id/retry` to manually reset exhausted jobs
- Migration: auto-migrate the new columns in `_ensure_schema_compatibility`
- Tests: 6 new tests covering success, retry, exhaustion, exception handling, monitoring stats, and manual retry
- Docs: document the retry mechanism, monitoring API, and Prometheus metrics

Retry strategy: 3 attempts with exponential backoff (60s * 5^n). After `MAX_RETRIES` failures, the job is marked exhausted (`sent=False`). Exhausted jobs can be reset manually via the retry endpoint.
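The migration item says the new columns are added automatically in `_ensure_schema_compatibility`, but the PR does not show that function or name the database engine. The sketch below assumes SQLite through Python's standard `sqlite3` module and a minimal, hypothetical `reminders` table with `send_at` and `sent` columns; the indexed columns and the `WHERE sent = 0` predicate of `idx_reminders_retry` are also guesses. It illustrates the idempotent pattern: inspect the existing columns, add only the missing ones, then create the partial index if it is absent.

```python
import sqlite3

# Hypothetical stand-in for _ensure_schema_compatibility. SQLite and the
# "reminders" table layout are assumptions for illustration only.
NEW_COLUMNS = {
    "retry_count": "INTEGER NOT NULL DEFAULT 0",  # existing rows get 0
    "last_error": "VARCHAR(500)",                 # nullable, NULL for existing rows
}


def ensure_retry_columns(conn: sqlite3.Connection) -> None:
    """Add the retry columns and the partial index if missing (safe to re-run)."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(reminders)")}
    for name, ddl in NEW_COLUMNS.items():
        if name not in existing:
            # Column names come from the constant above, never from user input.
            conn.execute(f"ALTER TABLE reminders ADD COLUMN {name} {ddl}")
    # Partial index (assumed predicate): only unsent rows are retry candidates.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_reminders_retry "
        "ON reminders (send_at, retry_count) WHERE sent = 0"
    )
    conn.commit()
```

Because every step checks before it alters, calling it on each application start is harmless for additive changes like these.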
Author (DrGalio): Closing duplicate; superseded by #641, which has the complete implementation with tests and OpenAPI docs.
Summary
Closes #130 — Resilient background job retry & monitoring
Problem
The current `run_due` endpoint had zero retry logic: if `send_reminder()` returned `False` or threw an exception, the reminder was still marked `sent=True`.

Solution
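A minimal sketch of the corrected `run_due` behaviour, assuming a hypothetical in-memory `Reminder` dataclass (the real model is a database row) and a `send` callable standing in for `send_reminder()`. Each due reminder is attempted once; only a truthy result sets `sent=True`, while a `False` return or an exception increments `retry_count`, stores the error in `last_error`, and reschedules with a 60s * 5^n delay. The PR does not say exactly which failure maps to which step of the 1 min → 5 min → 25 min schedule; this sketch assumes the first failure waits 1 minute, the second 5 minutes, and the third marks the reminder exhausted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

MAX_RETRIES = 3          # from the PR: exhausted once retry_count >= 3
BASE_DELAY_SECONDS = 60  # from the PR: 60s * 5^n
BACKOFF_FACTOR = 5


@dataclass
class Reminder:
    # Hypothetical stand-in for the PR's Reminder model.
    id: int
    channel: str
    send_at: datetime
    sent: bool = False
    retry_count: int = 0
    last_error: Optional[str] = None


def backoff_delay(n: int) -> timedelta:
    """60s * 5^n: n=0 -> 1 min, n=1 -> 5 min, n=2 -> 25 min."""
    return timedelta(seconds=BASE_DELAY_SECONDS * BACKOFF_FACTOR ** n)


def process_due(
    reminders: list[Reminder],
    send: Callable[[Reminder], bool],
    now: datetime,
) -> None:
    """Attempt every due, unsent, non-exhausted reminder once."""
    for r in reminders:
        if r.sent or r.retry_count >= MAX_RETRIES or r.send_at > now:
            continue
        try:
            ok = send(r)
            error = None if ok else "send returned False"
        except Exception as exc:  # record the failure instead of aborting the run
            ok, error = False, str(exc)
        if ok:
            r.sent = True  # the fix: sent=True only after confirmed delivery
            continue
        r.retry_count += 1
        r.last_error = (error or "unknown error")[:500]  # fits VARCHAR(500)
        if r.retry_count < MAX_RETRIES:
            # Assumed mapping: failure k waits backoff_delay(k - 1).
            r.send_at = now + backoff_delay(r.retry_count - 1)
        # Otherwise the reminder is exhausted: it stays sent=False and is
        # skipped until manually reset.
```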
Retry with Exponential Backoff
After 3 failures, the reminder is marked exhausted (`sent=False`, `retry_count >= 3`).

Key Changes
- `retry_count` (INT, default 0) and `last_error` (VARCHAR 500) columns
- Partial index `idx_reminders_retry` for efficient retry queries
- `POST /reminders/run`: fixed; only marks `sent=True` on actual successful delivery
- `GET /reminders/stats`: new monitoring endpoint with per-channel breakdown
- `POST /reminders/:id/retry`: manual reset for exhausted reminders
- Auto-migration in `_ensure_schema_compatibility` for a zero-downtime deploy

Monitoring Response
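The field semantics of the stats payload are not spelled out, but the sample numbers are consistent with one reading: `pending` counts unsent reminders still below `max_retries`, `retrying` is the subset of those that have already failed at least once, `exhausted` counts unsent reminders at the limit, and each channel's `failed_or_pending` counts all unsent reminders. A minimal sketch of that aggregation, over a hypothetical in-memory `Reminder` type (the real endpoint presumably queries the database instead):

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

MAX_RETRIES = 3


@dataclass
class Reminder:
    # Hypothetical stand-in with only the fields the stats need.
    channel: str
    send_at: datetime
    sent: bool = False
    retry_count: int = 0


def reminder_stats(reminders: list[Reminder]) -> dict[str, Any]:
    """Aggregate counts in the shape of the sample GET /reminders/stats response."""
    channels: dict[str, dict[str, int]] = defaultdict(
        lambda: {"sent": 0, "failed_or_pending": 0}
    )
    sent = pending = exhausted = retrying = 0
    next_due: Optional[datetime] = None
    for r in reminders:
        bucket = channels[r.channel]
        if r.sent:
            sent += 1
            bucket["sent"] += 1
            continue
        bucket["failed_or_pending"] += 1  # every unsent reminder, exhausted or not
        if r.retry_count >= MAX_RETRIES:
            exhausted += 1
            continue
        pending += 1
        if r.retry_count > 0:
            retrying += 1  # pending reminders that have failed at least once
        if next_due is None or r.send_at < next_due:
            next_due = r.send_at  # earliest upcoming attempt
    return {
        "total": len(reminders),
        "sent": sent,
        "pending": pending,
        "exhausted": exhausted,
        "retrying": retrying,
        "max_retries": MAX_RETRIES,
        "channels": dict(channels),
        "next_due_at": next_due.isoformat() if next_due else None,
    }
```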
```json
{
  "total": 10,
  "sent": 7,
  "pending": 2,
  "exhausted": 1,
  "retrying": 2,
  "max_retries": 3,
  "channels": {
    "email": {"sent": 5, "failed_or_pending": 2},
    "whatsapp": {"sent": 2, "failed_or_pending": 1}
  },
  "next_due_at": "2026-03-22T10:00:00"
}
```

Test Results
27 of 28 existing tests pass; the one failure is pre-existing and unrelated to this change.