
feat: add resilient background job retry & monitoring (#130) #611

Closed
DrGalio wants to merge 1 commit into rohitdash08:main from DrGalio:feat/resilient-background-job-retry-monitoring

Conversation


@DrGalio DrGalio commented Mar 22, 2026

Summary

Closes #130 — Resilient background job retry & monitoring

Problem

Previously, the run_due endpoint had no retry logic:

  • If send_reminder() returned False or threw an exception, the reminder was still marked sent=True
  • No visibility into job health, failure rates, or exhausted jobs
  • No way to retry failed reminders

Solution

Retry with Exponential Backoff

| Attempt | Wait before retry |
| --- | --- |
| 1st failure | 1 minute |
| 2nd failure | 5 minutes |
| 3rd failure | 25 minutes |
After 3 failures, the reminder is marked exhausted (sent=False, retry_count >= 3).
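The schedule above follows the 60s * 5^n formula noted in the commit message. A minimal sketch of that calculation; the helper name and constants are illustrative, not taken from the actual diff:

```python
# Sketch of the backoff schedule above (60s * 5^n); names are illustrative.
BASE_DELAY_SECONDS = 60
BACKOFF_FACTOR = 5

def backoff_delay(failures: int) -> int:
    """Seconds to wait before the next attempt after `failures` failed sends."""
    return BASE_DELAY_SECONDS * BACKOFF_FACTOR ** (failures - 1)

# 1st failure -> 60 s (1 min), 2nd -> 300 s (5 min), 3rd -> 1500 s (25 min)
```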

Key Changes

  • Model: Added retry_count (INT, default 0) and last_error (VARCHAR 500) columns
  • Schema: New columns + partial index idx_reminders_retry for efficient retry queries
  • POST /reminders/run: Fixed — only marks sent=True on actual successful delivery
  • GET /reminders/stats: New monitoring endpoint with per-channel breakdown
  • POST /reminders/:id/retry: Manual reset for exhausted reminders
  • Migration: Auto-applied via _ensure_schema_compatibility — zero-downtime deploy
  • Tests: 6 new tests (all passing ✅)
  • Docs: README updated with retry/monitoring documentation
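The fixed run_due behavior described above (mark `sent=True` only on confirmed delivery, otherwise record the error and reschedule with backoff) could look roughly like this. This is a hedged sketch: `process_due_reminder`, the `send` callable, and the attribute layout are assumptions standing in for the real PR code.

```python
# Illustrative sketch only: process one due reminder, marking it sent solely
# on successful delivery, else recording the error and scheduling a retry.
from datetime import datetime, timedelta

MAX_RETRIES = 3

def process_due_reminder(reminder, send, now=None):
    """`reminder` is any object with sent/retry_count/last_error/due_at
    attributes; `send` is the delivery callable (send_reminder in the PR)."""
    now = now or datetime.utcnow()
    try:
        ok = send(reminder)
        error = None if ok else "send returned False"
    except Exception as exc:
        ok, error = False, str(exc)[:500]  # last_error column is VARCHAR(500)

    if ok:
        reminder.sent = True          # only on actual successful delivery
        reminder.last_error = None
    else:
        reminder.retry_count += 1
        reminder.last_error = error
        delay = 60 * 5 ** (reminder.retry_count - 1)   # 1 min, 5 min, 25 min
        reminder.due_at = now + timedelta(seconds=delay)
        # run_due is assumed to skip reminders with retry_count >= MAX_RETRIES,
        # which is what the partial index idx_reminders_retry would support.
    return reminder
```

The partial index mentioned above fits this access pattern: the retry query only touches unsent rows below the retry cap, so indexing that subset keeps the scan cheap.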

Monitoring Response

{
  "total": 10, "sent": 7, "pending": 2, "exhausted": 1,
  "retrying": 2, "max_retries": 3,
  "channels": {
    "email": {"sent": 5, "failed_or_pending": 2},
    "whatsapp": {"sent": 2, "failed_or_pending": 1}
  },
  "next_due_at": "2026-03-22T10:00:00"
}
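The stats payload above could be aggregated along these lines. Field names follow the sample response; the function name, input shape, and counting rules (pending = total − sent − exhausted, retrying = pending rows with at least one failure) are assumptions inferred from the sample numbers:

```python
# Hedged sketch of the GET /reminders/stats aggregation; not the actual code.
from collections import defaultdict

MAX_RETRIES = 3

def reminder_stats(reminders):
    channels = defaultdict(lambda: {"sent": 0, "failed_or_pending": 0})
    sent = exhausted = retrying = 0
    next_due = None
    for r in reminders:
        ch = channels[r["channel"]]
        if r["sent"]:
            sent += 1
            ch["sent"] += 1
            continue
        ch["failed_or_pending"] += 1
        if r["retry_count"] >= MAX_RETRIES:
            exhausted += 1            # exhausted rows are no longer scheduled
            continue
        if r["retry_count"] > 0:
            retrying += 1
        if r["due_at"] is not None and (next_due is None or r["due_at"] < next_due):
            next_due = r["due_at"]
    return {
        "total": len(reminders),
        "sent": sent,
        "pending": len(reminders) - sent - exhausted,
        "exhausted": exhausted,
        "retrying": retrying,
        "max_retries": MAX_RETRIES,
        "channels": dict(channels),
        "next_due_at": next_due,
    }
```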

Test Results

tests/test_reminders.py::test_run_due_marks_sent_on_success PASSED
tests/test_reminders.py::test_run_due_retries_on_failure_with_backoff PASSED
tests/test_reminders.py::test_run_due_exhausts_after_max_retries PASSED
tests/test_reminders.py::test_run_due_handles_exception_in_send PASSED
tests/test_reminders.py::test_reminder_stats_endpoint PASSED
tests/test_reminders.py::test_manual_retry_resets_exhausted_reminder PASSED

All existing tests continue to pass (27/28 — 1 pre-existing failure unrelated to this change).

Closes rohitdash08#130

Problem:
- Background reminder jobs had zero retry logic — if send_reminder() returned
  False or threw an exception, the reminder was still marked sent=True
- No visibility into job health, failure rates, or exhausted jobs

Changes:
- Model: Add retry_count and last_error columns to Reminder model
- Schema: Add new columns + partial index for efficient retry queries
- Routes: Fix run_due to only mark sent=True on successful delivery
- Routes: Add exponential backoff retry (1min → 5min → 25min)
- Routes: Add GET /reminders/stats monitoring endpoint
- Routes: Add POST /reminders/:id/retry for manual exhausted-job reset
- Migration: Auto-migrate new columns in _ensure_schema_compatibility
- Tests: 6 new tests covering success, retry, exhaustion, exception handling,
  monitoring stats, and manual retry
- Docs: Document retry mechanism, monitoring API, and Prometheus metrics

Retry strategy: 3 attempts with exponential backoff (60s * 5^n).
After MAX_RETRIES failures, job is marked exhausted (sent=False).
Exhausted jobs can be manually reset via the retry endpoint.
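The manual reset just described (POST /reminders/:id/retry) amounts to clearing the retry state so the scheduler picks the job up again. A minimal sketch, assuming a hypothetical `reset_exhausted` helper and the same attribute names as above:

```python
# Illustrative sketch of the manual-reset behavior; names are assumptions.
MAX_RETRIES = 3

def reset_exhausted(reminder):
    """Reset an exhausted reminder so run_due will attempt it again."""
    if reminder.sent or reminder.retry_count < MAX_RETRIES:
        raise ValueError("only exhausted reminders can be reset")
    reminder.retry_count = 0
    reminder.last_error = None
    return reminder
```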
@DrGalio DrGalio requested a review from rohitdash08 as a code owner March 22, 2026 15:12

DrGalio commented Mar 25, 2026

Closing duplicate — superseded by #641 which has the complete implementation with tests and OpenAPI docs.

@DrGalio DrGalio closed this Mar 25, 2026