feat: Resilient background job retry and monitoring (#130) #647

Open
HuiNeng6 wants to merge 1 commit into rohitdash08:main from HuiNeng6:feat/background-job-retry-monitoring

@HuiNeng6

Summary

Implements #130: Resilient background job retry & monitoring

This PR adds a comprehensive background job execution system with automatic retry and monitoring capabilities.

Features Implemented

Core Functionality

  • Exponential Backoff Retry: Jobs that fail are automatically retried with exponentially increasing delays
  • Dead Letter Queue: Permanently failed jobs are stored for inspection and manual retry
  • Priority Queue: Jobs can be prioritized to ensure critical tasks are processed first
  • Configurable Retry Policies: Customize retry behavior per job type
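
As an illustration of how priority ordering, retries, and the dead letter queue can fit together, here is a minimal in-memory sketch. It is not the PR's implementation: `InMemoryJobQueue`, its dict-based jobs, and the rule that a job is dead-lettered once its failures exceed `max_retries` are all assumptions made for the example, and retry delays are left out for brevity.

```python
import heapq
import itertools


class InMemoryJobQueue:
    """Hypothetical stand-in for a job store; the PR persists jobs in a database."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within one priority
        self.dead_letter = []

    def enqueue(self, name, priority=0, max_retries=3):
        job = {"name": name, "priority": priority, "max_retries": max_retries,
               "attempts": 0, "status": "PENDING"}
        self._push(job)
        return job

    def _push(self, job):
        # heapq is a min-heap, so negate the priority to pop higher values first.
        heapq.heappush(self._heap, (-job["priority"], next(self._counter), job))

    def process_next(self, handler):
        """Run the highest-priority job once; return it, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        job["status"] = "RUNNING"
        try:
            handler(job)
            job["status"] = "SUCCEEDED"
        except Exception:
            job["attempts"] += 1
            if job["attempts"] > job["max_retries"]:
                # Retries exhausted: park the job for inspection or manual retry.
                job["status"] = "DEAD_LETTER"
                self.dead_letter.append(job)
            else:
                job["status"] = "RETRYING"
                self._push(job)
        return job
```

A real worker would also wait for the backoff delay before re-running a job in the `RETRYING` state; this sketch re-queues it immediately.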

Monitoring & Metrics

  • ✅ Built-in metrics tracking (created, succeeded, failed, retried, dead_letter)
  • ✅ Processing time tracking with averages
  • ✅ Health check endpoint for monitoring
  • ✅ API endpoints for job status and management
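
The counters and average processing time listed above could be tracked along these lines. This is a rough sketch only: the `JobMetrics` class, its method names, and the `avg_processing_seconds` key are hypothetical, and the PR's service may store and aggregate metrics differently.

```python
from collections import Counter


class JobMetrics:
    """Hypothetical in-memory metrics holder for the five counters named above."""

    EVENTS = ("created", "succeeded", "failed", "retried", "dead_letter")

    def __init__(self):
        self.counts = Counter({name: 0 for name in self.EVENTS})
        self.total_seconds = 0.0
        self.timed_jobs = 0

    def record(self, event):
        # Reject unknown names so typos do not silently create new counters.
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self.counts[event] += 1

    def record_duration(self, seconds):
        self.total_seconds += seconds
        self.timed_jobs += 1

    def average_seconds(self):
        # Avoid division by zero before any job has been timed.
        return self.total_seconds / self.timed_jobs if self.timed_jobs else 0.0

    def snapshot(self):
        return {**self.counts, "avg_processing_seconds": self.average_seconds()}
```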

Technical Details

  • Retry Configuration:
    • Initial delay: 1 second
    • Max delay: 5 minutes (300 seconds)
    • Backoff multiplier: 2x
    • Jitter: ±25% random variation (prevents thundering herd)
  • Job Status Flow: PENDING → RUNNING → SUCCEEDED/FAILED → RETRYING → DEAD_LETTER
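
The retry configuration above can be turned into a delay calculation roughly as follows. This is a sketch under stated assumptions, not the code in `background_jobs.py`: the function name `retry_delay`, the 0-based `attempt` index, and applying the cap before the jitter are choices made for the example.

```python
import random


def retry_delay(attempt, initial=1.0, max_delay=300.0, multiplier=2.0,
                jitter=0.25, rng=random):
    """Seconds to wait before retry number `attempt` (0-based, an assumption)."""
    # Exponential growth: 1s, 2s, 4s, ... capped at max_delay (5 minutes).
    base = min(initial * multiplier ** attempt, max_delay)
    # Scale by a random factor in [1 - jitter, 1 + jitter] so that jobs which
    # failed together do not all retry at the same instant.
    factor = 1.0 + rng.uniform(-jitter, jitter)
    return base * factor
```

With the defaults, the un-jittered delays run 1, 2, 4, 8, ... seconds and reach the 300-second cap at the tenth retry (`attempt=9`, since 2^9 = 512 > 300). Because jitter is applied after the cap in this sketch, an individual delay can exceed 300 seconds by up to 25%.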

Files Added/Modified

| File | Description |
| --- | --- |
| `app/services/background_jobs.py` | Core job service with retry logic |
| `app/routes/background_jobs.py` | REST API endpoints |
| `app/db/migrations/001_background_jobs.sql` | Database migration |
| `tests/test_background_jobs.py` | Comprehensive test suite |
| `app/routes/__init__.py` | Register new blueprint |
| `app/services/README_BACKGROUND_JOBS.md` | Documentation |

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/jobs/metrics` | GET | Get aggregated job metrics |
| `/api/jobs/` | GET | Get status of specific job |
| `/api/jobs/pending` | GET | List pending and retrying jobs |
| `/api/jobs/dead-letter` | GET | List failed jobs |
| `/api/jobs//retry` | POST | Manually retry a failed job |
| `/api/jobs/process` | POST | Manually trigger job processing |
| `/api/jobs/cleanup` | POST | Clean up old completed jobs |
| `/api/jobs/health` | GET | Health check |

Usage Example

```python
from app.services.background_jobs import BackgroundJobService, JobType

# Enqueue a job
job = BackgroundJobService.enqueue(
    job_type=JobType.SEND_EMAIL,
    payload={'to': 'user@example.com', 'subject': 'Welcome'},
    priority=10,
    max_retries=3
)

# Register a handler (send_email is assumed to be defined elsewhere in the application)
BackgroundJobService.register_handler(
    JobType.SEND_EMAIL,
    lambda p: send_email(p['to'], p['subject'])
)

# Process jobs (in background worker)
stats = BackgroundJobService.process_pending_jobs(limit=10)
```

Testing

All tests pass with comprehensive coverage:

  • ✅ Job creation and enqueueing
  • ✅ Retry logic with exponential backoff
  • ✅ Dead letter queue handling
  • ✅ Job metrics
  • ✅ API endpoints
  • ✅ Priority ordering
  • ✅ Job cleanup

Acceptance Criteria

  • Production ready implementation
  • Includes tests
  • Documentation updated

/claim #130

…ohitdash08#130)

Implements rohitdash08#130: Resilient background job retry & monitoring

Features:
- Exponential backoff retry with configurable delays
- Dead letter queue for permanently failed jobs
- Priority-based job processing
- Comprehensive metrics and monitoring
- REST API endpoints for job management
- Database migration script
- Full test coverage

Components:
- BackgroundJob model for tracking job state
- BackgroundJobService with retry logic
- Job metrics tracking (created, succeeded, failed, retried, dead_letter)
- API endpoints for monitoring and manual intervention
- Database migration for background_jobs table

Technical details:
- Exponential backoff: initial 1s, max 5min, 2x multiplier
- Configurable jitter to prevent thundering herd
- Priority queue processing (higher priority first)
- Automatic cleanup of old completed jobs
- Health check endpoint for monitoring

/claim rohitdash08#130