feat: resilient background job retry & monitoring (fixes #130) #641

Open
DrGalio wants to merge 1 commit into rohitdash08:main from DrGalio:feat/resilient-job-retry-monitoring

DrGalio commented Mar 24, 2026

Summary

Implements a production-grade resilient background job execution system with exponential backoff retries, dead-letter queue, and admin monitoring — addressing Issue #130 ($250 bounty).

What's Built

Core Engine (services/jobs.py)

  • Exponential backoff: 30s → 2min → 8min between retries
  • Dead-letter queue: Permanently failed jobs after max retries are quarantined for manual review
  • @register_handler decorator: Pluggable job type system — just register a handler and enqueue jobs
  • process_due_jobs(): Batch processor with configurable limits
  • retry_dead_letter_job(): Manual reprocessing of dead-lettered jobs
  • get_job_stats(): Aggregated monitoring by status and job type
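The 30s → 2min → 8min schedule is consistent with a 30-second base delay multiplied by 4 on each retry. The following is a minimal sketch under that assumption; the names `BASE_DELAY_SECONDS`, `BACKOFF_FACTOR`, `backoff_delay`, and `next_run_at` are illustrative and not taken from services/jobs.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BASE_DELAY_SECONDS = 30  # assumed: the first delay listed in the PR
BACKOFF_FACTOR = 4       # assumed: 30s -> 2min -> 8min implies x4 per retry


def backoff_delay(retry_number: int) -> timedelta:
    """Delay to wait before retry `retry_number` (1-based)."""
    if retry_number < 1:
        raise ValueError("retry_number must be >= 1")
    seconds = BASE_DELAY_SECONDS * BACKOFF_FACTOR ** (retry_number - 1)
    return timedelta(seconds=seconds)


def next_run_at(retry_number: int, now: Optional[datetime] = None) -> datetime:
    """Timestamp at which the given retry becomes due."""
    now = now or datetime.now(timezone.utc)
    return now + backoff_delay(retry_number)
```

Under these assumptions, `backoff_delay(1)`, `backoff_delay(2)`, and `backoff_delay(3)` give 30, 120, and 480 seconds respectively.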

Job Model (models.py)

  • BackgroundJob model with full lifecycle tracking
  • JobStatus enum: PENDING → RUNNING → SUCCESS | FAILED | DEAD_LETTER
  • Tracks: attempt count, max_retries, next_run_at, last_error, result payload

Reminders Integration (routes/reminders.py)

  • /reminders/run now enqueues jobs instead of fire-and-forget sends
  • Failed sends are automatically retried with backoff
  • Permanently failed sends go to dead-letter queue for admin review

Admin Monitoring API (routes/jobs.py)

| Endpoint | Description |
|---|---|
| GET /admin/jobs/stats | Aggregated stats by status/type |
| GET /admin/jobs | Paginated list with ?status= and ?job_type= filters |
| GET /admin/jobs/:id | Full job details including payload and result |
| POST /admin/jobs/:id/retry | Manual dead-letter retry |
| POST /admin/jobs/process | Manual batch processing trigger |
| DELETE /admin/jobs/:id | Remove job record |

All endpoints require admin role (403 for non-admin, 401 for unauthenticated).
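The 401/403 split can be illustrated with a framework-agnostic sketch. The decorator name `admin_required`, the `user` dictionary shape, and the `(status, body)` return convention are all hypothetical; the actual check in routes/jobs.py may be structured differently.

```python
from functools import wraps


def admin_required(handler):
    """Reject unauthenticated callers with 401 and non-admins with 403."""
    @wraps(handler)
    def wrapper(user, *args, **kwargs):
        if user is None:  # no authenticated identity at all
            return 401, {"error": "authentication required"}
        if user.get("role") != "admin":  # authenticated, but not an admin
            return 403, {"error": "admin role required"}
        return handler(user, *args, **kwargs)
    return wrapper


@admin_required
def job_stats(user):
    # Placeholder body; a real handler would return aggregated job stats.
    return 200, {"by_status": {}, "by_type": {}}
```

Checking authentication before authorization keeps the two failure modes distinct: an anonymous caller never sees a 403.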

Observability

  • New Prometheus metric: finmind_job_events_total (labels: event, job_type, status)
  • Tracks: enqueued, succeeded, retried, dead_lettered, manual_retry, failed

Database

  • background_jobs table with proper indexes (status+next_run_at, job_type, created_at)
  • Migration file for existing deployments: migrations/001_background_jobs.sql
  • Auto-migration via _ensure_schema_compatibility() — zero-downtime deploy
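To show why a composite index on status and next_run_at suits the due-job lookup, here is a small SQLite sketch. The column names, index names, and the `fetch_due_job_ids` helper are assumptions for illustration; the real definitions live in schema.sql and 001_background_jobs.sql and may differ.

```python
import sqlite3

# Assumed table shape, derived from the fields the PR says it tracks.
SCHEMA = """
CREATE TABLE background_jobs (
    id          INTEGER PRIMARY KEY,
    job_type    TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'PENDING',
    attempts    INTEGER NOT NULL DEFAULT 0,
    max_retries INTEGER NOT NULL DEFAULT 3,
    next_run_at TEXT NOT NULL,
    last_error  TEXT,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_jobs_status_next_run ON background_jobs (status, next_run_at);
CREATE INDEX idx_jobs_job_type ON background_jobs (job_type);
CREATE INDEX idx_jobs_created_at ON background_jobs (created_at);
"""


def fetch_due_job_ids(conn, now_iso, limit=50):
    """Return ids of pending jobs whose next_run_at is at or before now_iso."""
    rows = conn.execute(
        "SELECT id FROM background_jobs "
        "WHERE status = 'PENDING' AND next_run_at <= ? "
        "ORDER BY next_run_at LIMIT ?",
        (now_iso, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The query filters on status with an equality and on next_run_at with a range, which is the access pattern a (status, next_run_at) index serves. Timestamps are stored as ISO 8601 strings here so that string comparison matches chronological order.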

OpenAPI

  • Full documentation for all 6 admin endpoints
  • BackgroundJob schema definition
  • New Jobs tag

Tests (tests/test_jobs.py)

20 tests covering:

  • Backoff calculation
  • Job enqueue and execution
  • Retry on transient failure (3 attempts)
  • Dead-letter after max retries
  • No-handler failure path
  • Batch processing
  • Skip-not-yet-due jobs
  • Manual dead-letter retry
  • get_job_stats()
  • All admin API endpoints
  • Auth enforcement (admin-only, 401/403)

Acceptance Criteria

  • ✅ Production-ready implementation
  • ✅ Includes tests (20 tests, service + API + auth)
  • ✅ Documentation updated (OpenAPI, inline docstrings, this PR)

Before → After

Before:

for r in items:
    send_reminder(r)  # Can fail silently
    r.sent = True      # Marked sent regardless of actual success

After:

for r in items:
    enqueue("send_reminder", {"reminder_id": r.id}, max_retries=3)
stats = process_due_jobs()  # Retry with backoff, dead-letter on exhaustion
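How the pieces fit together can be sketched with an in-memory stand-in for the engine. This is not the code in services/jobs.py: the `Job` dataclass, the module-level `HANDLERS` and `JOBS` stores, the stats keys, and the `now` parameter are assumptions. The sketch also assumes that a failed job is re-queued as PENDING, and that max_retries counts retries after the first attempt, so a job is dead-lettered once its attempts exceed max_retries.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, List, Optional

HANDLERS: Dict[str, Callable[[dict], Any]] = {}  # job_type -> handler
JOBS: List["Job"] = []                           # stand-in for the DB table


@dataclass
class Job:
    job_type: str
    payload: dict
    max_retries: int = 3
    status: str = "PENDING"
    attempts: int = 0
    next_run_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    last_error: Optional[str] = None
    result: Any = None


def register_handler(job_type: str):
    """Decorator that registers a function as the handler for a job type."""
    def decorator(fn):
        HANDLERS[job_type] = fn
        return fn
    return decorator


def enqueue(job_type: str, payload: dict, max_retries: int = 3) -> Job:
    job = Job(job_type, payload, max_retries)
    JOBS.append(job)
    return job


def process_due_jobs(limit: int = 50, now: Optional[datetime] = None) -> dict:
    """Run due jobs; retry failures with backoff, dead-letter on exhaustion."""
    now = now or datetime.now(timezone.utc)
    due = [j for j in JOBS
           if j.status == "PENDING" and j.next_run_at <= now][:limit]
    stats = {"succeeded": 0, "retried": 0, "dead_lettered": 0}
    for job in due:
        job.status = "RUNNING"
        job.attempts += 1
        try:
            handler = HANDLERS[job.job_type]  # KeyError if none registered
            job.result = handler(job.payload)
            job.status = "SUCCESS"
            stats["succeeded"] += 1
        except Exception as exc:
            job.last_error = repr(exc)
            if job.attempts > job.max_retries:  # retries exhausted
                job.status = "DEAD_LETTER"
                stats["dead_lettered"] += 1
            else:
                # Assumed schedule: 30s, 2min, 8min (x4 per retry).
                delay = 30 * 4 ** (job.attempts - 1)
                job.status = "PENDING"
                job.next_run_at = now + timedelta(seconds=delay)
                stats["retried"] += 1
    return stats
```

In this sketch a handler signals failure by raising, which matches the commit note that send_reminder raises on failure. A job with no registered handler takes the same failure path as a handler that raises.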

Files Changed

packages/backend/app/__init__.py              — auto-migration for new table
packages/backend/app/models.py                — BackgroundJob model + JobStatus enum
packages/backend/app/observability.py         — job_events_total Prometheus counter
packages/backend/app/openapi.yaml             — 6 admin endpoints + schema
packages/backend/app/routes/__init__.py       — jobs blueprint registration
packages/backend/app/routes/jobs.py           — admin monitoring API (NEW)
packages/backend/app/routes/reminders.py      — run_due uses job system
packages/backend/app/services/jobs.py         — retry engine (NEW)
packages/backend/app/services/reminders.py    — registered job handler
packages/backend/app/db/schema.sql            — background_jobs table
packages/backend/app/db/migrations/001_background_jobs.sql  — migration (NEW)
packages/backend/tests/test_jobs.py           — 20 tests (NEW)

12 files changed, 1211 insertions(+), 9 deletions(-)

Implements a production-grade background job execution system with:

Core Engine (services/jobs.py):
- Exponential backoff retries (30s → 2min → 8min)
- Dead-letter queue for permanently failed jobs
- @register_handler decorator for pluggable job types
- process_due_jobs() batch processor with configurable limits
- retry_dead_letter_job() for manual reprocessing
- get_job_stats() for monitoring aggregation

Job Model (models.py):
- BackgroundJob model with full lifecycle tracking
- JobStatus enum: PENDING, RUNNING, SUCCESS, FAILED, DEAD_LETTER
- Tracks: attempt count, max_retries, next_run_at, last_error, result

Reminders Integration (routes/reminders.py):
- run_due endpoint now enqueues jobs instead of fire-and-forget
- send_reminder raises on failure for proper retry handling
- Handler registered via @register_handler decorator

Admin Monitoring API (routes/jobs.py):
- GET /admin/jobs/stats — aggregated stats by status/type
- GET /admin/jobs — paginated list with status/type filters
- GET /admin/jobs/:id — full job details
- POST /admin/jobs/:id/retry — manual dead-letter retry
- POST /admin/jobs/process — manual batch processing trigger
- DELETE /admin/jobs/:id — remove job record
- All endpoints require admin role

Observability (observability.py):
- finmind_job_events_total Prometheus counter (event/job_type/status)
- track_job_event() helper function

Database:
- background_jobs table with indexes (schema.sql)
- Migration file for existing deployments (001_background_jobs.sql)
- Auto-migration via _ensure_schema_compatibility()

OpenAPI (openapi.yaml):
- Full documentation for all 6 admin endpoints
- BackgroundJob schema definition
- Jobs tag added

Tests (tests/test_jobs.py):
- 20 tests covering service layer and admin API
- Backoff calculation, enqueue, execute, retry, dead-letter
- Admin auth enforcement (403 for non-admin, 401 for unauth)