
Conversation

@mihow (Collaborator) commented Oct 30, 2025

Summary

Implements Docker healthchecks and automatic restart mechanisms for Celery workers to detect and recover from stuck/unresponsive workers without manual intervention.

List of Changes

  • Added healthcheck scripts for Celery workers (/celery/healthcheck.sh)
  • Added worker protections: --max-tasks-per-child=50 and --max-memory-per-child=4000000
  • Moved debugpy configuration from docker-compose.yml to start script (env var controlled)
  • Added healthcheck configuration to docker-compose.worker.yml and docker-compose.production.yml
  • Added autoheal container for automatic restart of unhealthy workers
  • Installed procps package in Docker images for healthcheck commands
  • Added healthchecks for celeryworker, celerybeat, and flower services

Related Issues

Addresses the problem of Celery workers periodically crashing or getting stuck and requiring manual restart.

Detailed Description

Problem

Celery workers can become stuck or unresponsive (deadlocked, out of memory, frozen tasks) but Docker doesn't detect this. Workers appear "running" to Docker but stop processing tasks, requiring manual intervention.

Solution

1. Healthcheck Scripts

  • Process-based healthcheck verifying worker process is running (pgrep)
  • Optional Redis broker connectivity check
  • Works without Django settings (no DATABASE_URL required)
  • Checks every 30s and marks unhealthy after 3 consecutive failures (90s total); a sketch of the script follows below
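
A minimal sketch of what such a healthcheck script can look like, incorporating the broker-URL parsing fix suggested in the review below (the actual /celery/healthcheck.sh may differ in detail):

#!/bin/bash
# Sketch of a Celery worker healthcheck. Assumes pgrep (procps) is installed
# and CELERY_BROKER_URL may be unset. Not the verbatim script from this PR.
set -euo pipefail

# Check 1: Is a Celery worker process running?
if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
    echo "ERROR: Celery worker process not found" >&2
    exit 1
fi

# Check 2 (optional): Can we reach the Redis broker?
if command -v redis-cli > /dev/null 2>&1 && [ -n "${CELERY_BROKER_URL:-}" ]; then
    # Extract the hostname from a URL like redis://redis:6379/0
    REDIS_HOST=$(echo "${CELERY_BROKER_URL}" | sed -E 's|^redis://([^:/@]+).*|\1|')
    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi

exit 0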

2. Worker Protections (Preventive)

  • --max-tasks-per-child=50: Restarts worker process after 50 tasks to prevent memory leaks
  • --max-memory-per-child=4000000: Restarts worker if memory exceeds 4GB
  • These prevent the resource buildup that causes workers to get stuck (see the command sketch below)
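
Concretely, the production start script launches the worker roughly as follows (a sketch based on the flags above; the real compose/production/django/celery/worker/start may add further options):

exec celery -A config.celery_app worker -l INFO \
    --max-tasks-per-child=50 \
    --max-memory-per-child=4000000  # value is in kilobytes, i.e. about 4GB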

3. Local Development Improvements

  • Moved debugpy and watchfiles configuration to start script
  • Environment variables: CELERY_DEBUG=1 (enable debugpy), CELERY_NO_RELOAD=1 (disable watchfiles)
  • Same worker protections applied as in production (start-script branching sketched below)
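
The branching in the local start script is assumed to look roughly like this (a sketch; the watchfiles invocation is taken from the review discussion further down and may differ from the actual script):

#!/bin/bash
set -o errexit
set -o nounset

# Shared worker arguments, including the same protections as production
WORKER_ARGS="-A config.celery_app worker -l INFO --max-tasks-per-child=50 --max-memory-per-child=4000000"

if [ "${CELERY_DEBUG:-0}" = "1" ]; then
    # Expose debugpy on port 5678 so a remote debugger can attach
    exec python -m debugpy --listen 0.0.0.0:5678 -m celery $WORKER_ARGS
elif [ "${CELERY_NO_RELOAD:-0}" = "1" ]; then
    # Plain worker, no auto-reload
    exec celery $WORKER_ARGS
else
    # Restart the worker whenever Python files change
    exec watchfiles --filter python celery.__main__.main --args "$WORKER_ARGS"
fi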

4. Automatic Recovery

  • Added an autoheal container that monitors Docker health status
  • When a worker is marked unhealthy, autoheal kills the container
  • Docker's restart: always policy brings the container back automatically
  • A fresh worker starts processing tasks again (compose wiring sketched below)
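
In compose terms, the recovery loop is wired roughly like this (a sketch; see docker-compose.worker.yml and docker-compose.production.yml for the actual definitions):

celeryworker:
  restart: always
  labels:
    - autoheal=true            # opt this container into autoheal monitoring
  healthcheck:
    test: ["CMD", "/celery/healthcheck.sh"]
    interval: 30s
    timeout: 15s
    retries: 3
    start_period: 60s

autoheal:
  image: willfarrell/autoheal:latest
  restart: always
  environment:
    - AUTOHEAL_CONTAINER_LABEL=autoheal   # only restart labeled containers
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock   # required to restart containers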

How to Test the Changes

Verify healthcheck is working:

# Build and start worker
docker compose build celeryworker
docker compose up -d celeryworker

# Wait ~60 seconds for start_period, then check health
docker ps | grep celeryworker
# Should show "healthy" status

Test unhealthy detection:

# Simulate stuck worker by pausing the process
docker exec ami-local-celeryworker-1 pkill -STOP -f "celery.*worker"

# Wait ~90 seconds (3 retries × 30s interval)
docker ps | grep celeryworker
# Should show "unhealthy"

# With autoheal running, worker will automatically restart
# Without autoheal, manually restart:
docker compose restart celeryworker

Test debugpy (local dev):

# Enable debugpy
CELERY_DEBUG=1 docker compose up celeryworker
# Attach debugger to localhost:5678

Screenshots

N/A - Infrastructure change

Deployment Notes

Docker Compose Deployment (Current Production)

  • The autoheal container will automatically restart unhealthy workers
  • No configuration changes needed beyond deploying the updated compose files
  • For worker VMs: docker compose -f docker-compose.worker.yml up -d

Future: Kubernetes/Swarm

  • Kubernetes has built-in liveness probes that can use the healthcheck script (see the probe sketch below)
  • Docker Swarm has built-in restart-on-unhealthy functionality
  • The healthcheck scripts are orchestration-agnostic
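
For illustration, a Kubernetes liveness probe reusing the same script could look like this (hypothetical; no Kubernetes manifests ship with this PR):

livenessProbe:
  exec:
    command: ["/celery/healthcheck.sh"]
  initialDelaySeconds: 60   # mirrors Docker's start_period
  periodSeconds: 30         # mirrors the 30s check interval
  timeoutSeconds: 15
  failureThreshold: 3       # mirrors the 3-retry threshold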

Production Considerations

  • Healthcheck intervals and timeouts can be adjusted per environment (see the override sketch below)
  • Worker protection limits (max-tasks, max-memory) may need tuning based on workload
  • Monitor logs for healthcheck failures to identify patterns
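
For example, an environment running long ML tasks might relax the timings in a compose override file (illustrative values only, not part of this PR):

celeryworker:
  healthcheck:
    interval: 60s   # check half as often
    timeout: 30s    # tolerate slower responses under heavy load
    retries: 5      # about 5 minutes of failures before marked unhealthy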

Checklist

  • I have tested these changes appropriately (manual testing of healthcheck and auto-restart)
  • I have added and/or modified relevant tests (infrastructure change, no unit tests needed)
  • I updated relevant documentation or comments (added inline documentation in scripts)
  • I have verified that this PR follows the project's coding standards
  • Any dependent changes have already been merged to main (no dependencies)

Summary by CodeRabbit

  • New Features

    • Added health monitoring for Celery workers, Celery Beat, and Flower services.
    • Implemented automatic container restart for unhealthy services.
    • Added optional debug and reload control for Celery workers in development.
    • Added resource protection limits to Celery workers.
  • Chores

    • Updated Docker configurations across local and production environments.

netlify bot commented Oct 30, 2025

Deploy Preview for antenna-preview ready!

🔨 Latest commit: ae584bc
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/6903f63d23972c00089ed617
😎 Deploy Preview: https://deploy-preview-1024--antenna-preview.netlify.app

Lighthouse (1 path audited)
Performance: 30 (🔴 down 1 from production)
Accessibility: 80 (no change from production)
Best Practices: 100 (no change from production)
SEO: 92 (no change from production)
PWA: 80 (no change from production)

coderabbitai bot (Contributor) commented Oct 30, 2025

Walkthrough

This PR introduces comprehensive healthcheck and auto-healing capabilities to Celery workers across local, staging, and production environments. Healthcheck scripts validate worker processes and broker connectivity, while Docker Compose configurations add automated restart policies and integrate an autoheal service to manage container health.

Changes

Changes by cohort / file(s):

• Docker Image Configuration (compose/local/django/Dockerfile, compose/production/django/Dockerfile): added the procps package to system dependencies; copied the Celery directory and made the healthcheck script executable in both local and production images.
• Local Celery Healthcheck (compose/local/django/celery/healthcheck.sh): new Bash script that validates Celery worker process status via pgrep and optionally checks Redis broker connectivity using redis-cli with CELERY_BROKER_URL.
• Production Celery Healthcheck (compose/production/django/celery/healthcheck.sh): new Bash script with strict error handling that checks worker process availability and Redis connectivity; intended for autoheal container restarts.
• Local Worker Startup (compose/local/django/celery/worker/start): replaced the static invocation with environment-driven branching: debug mode (debugpy on port 5678), no-reload mode, or default with watchfiles auto-reload; preserves the resource protection flags.
• Production Worker Startup (compose/production/django/celery/worker/start): enhanced the Celery worker launch with resource protection: added max-tasks-per-child=50 and max-memory-per-child=4000000.
• Base Docker Compose Configuration (docker-compose.yml): changed the celeryworker command from a debugpy invocation to the /start-celeryworker script; added a healthcheck block and commented environment variables for debug/reload control.
• Worker-Only Compose Configuration (docker-compose.worker.yml): added a healthcheck to celeryworker with interval/timeout/retry parameters; introduced the autoheal service label and autoheal service configuration.
• Production Compose Configuration (docker-compose.production.yml): added healthchecks to the celeryworker, celerybeat, and flower services with respective check intervals; introduced a new autoheal service (willfarrell/autoheal) with label-based container health management.

Sequence Diagram(s)

sequenceDiagram
    participant Docker as Docker<br/>Engine
    participant Container as Celery<br/>Worker Container
    participant Autoheal as Autoheal<br/>Service
    participant Redis as Redis<br/>Broker

    loop Every 30s (celeryworker) / 60s (celerybeat)
        Docker->>Container: Execute healthcheck.sh
        Container->>Container: Check worker process<br/>(pgrep)
        alt Process found
            Container->>Redis: Ping broker<br/>(redis-cli)
            alt Broker responds
                Container-->>Docker: Exit 0 (healthy)
            else Broker unreachable
                Container-->>Docker: Exit 1 (unhealthy)
            end
        else Process not found
            Container-->>Docker: Exit 1 (unhealthy)
        end
    end

    Docker->>Autoheal: Report container status<br/>(autoheal label detected)
    alt Container unhealthy for N retries
        Autoheal->>Docker: Request container restart
        Docker->>Container: Stop & restart
    else Container healthy
        Autoheal-->>Autoheal: Continue monitoring
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Areas requiring extra attention:
    • Healthcheck script logic in both local and production versions to ensure proper error handling and Redis connectivity checks
    • Local worker startup script branching logic for debug/no-reload modes; verify environment variable conditions and watchfiles fallback behavior
    • Production resource protection parameters (max-tasks-per-child=50, max-memory-per-child=4000000) alignment with application requirements
    • Autoheal service configuration in compose files (label matching, interval settings, socket mount security) for proper container restart behavior

Poem

🐰 A rabbit hops through containers with glee,
Healthchecks now keeping the workers so spry!
When Celery stumbles, autoheal springs free—
Docker containers dance, reaching for the sky! ✨
No more sleepy processes, all's well that heals. 💚

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

• Title Check ✅ Passed: The pull request title "Docker healthcheck and auto-restart for Celery workers" directly and clearly summarizes the main objective of the changeset. It captures the primary focus of the changes (implementing Docker healthchecks and automatic restart mechanisms for Celery workers), which aligns with the core functionality introduced across the modified files. The title is concise, specific, and avoids vague terminology, making it clear to reviewers scanning the repository history what the PR accomplishes.
• Description Check ✅ Passed: The pull request description comprehensively addresses all required sections of the template: Summary, List of Changes, Related Issues, Detailed Description, How to Test the Changes, Screenshots, Deployment Notes, and Checklist. The description is well-organized with clear subsections explaining the problem, solution components (healthcheck scripts, worker protections, local development improvements, and automatic recovery), and includes specific testing instructions with bash commands. All checklist items are completed with explanatory notes. The only minor gap is that the Related Issues section describes the problem without referencing a specific issue number using GitHub keywords, but it clearly contextualizes the changes being made.

@mihow changed the title from "feat: add Docker healthcheck and auto-restart for Celery workers" to "Docker healthcheck and auto-restart for Celery workers" on Oct 30, 2025
coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

🧹 Nitpick comments (4)
compose/local/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

The pattern "celery.*worker" could potentially match unrelated processes. Consider using a more specific pattern like "celery -A config.celery_app worker" to ensure you're matching the actual worker process.

-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
compose/production/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

The pattern "celery.*worker" could potentially match unrelated processes. Consider using a more specific pattern like "celery -A config.celery_app worker" to ensure you're matching the actual worker process.

-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
docker-compose.production.yml (2)

46-53: Consider increasing celerybeat start_period to match celeryworker.

The 30-second start period may be insufficient for Celery Beat to fully initialize and begin task scheduling. Celeryworker uses 60 seconds; consider matching this for consistency, or document why Beat requires faster readiness.

Additionally, the pgrep check only validates process existence—it does not detect if Beat is actively scheduling tasks. For production, consider adding a secondary check (e.g., verifying Beat has logged recent task scheduling activity) if available in the healthcheck script.


72-81: Pin autoheal image version and document privileged socket access.

The autoheal service uses latest tag, which risks introducing breaking changes on image updates. For production stability, pin to a specific version:

-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0

Additionally, the /var/run/docker.sock mount grants significant privileges to the autoheal container (ability to restart any container). Ensure this security consideration is documented in deployment notes or README for operators.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3b9711 and ae584bc.

📒 Files selected for processing (9)
  • compose/local/django/Dockerfile (2 hunks)
  • compose/local/django/celery/healthcheck.sh (1 hunks)
  • compose/local/django/celery/worker/start (1 hunks)
  • compose/production/django/Dockerfile (2 hunks)
  • compose/production/django/celery/healthcheck.sh (1 hunks)
  • compose/production/django/celery/worker/start (1 hunks)
  • docker-compose.production.yml (2 hunks)
  • docker-compose.worker.yml (1 hunks)
  • docker-compose.yml (1 hunks)
🔇 Additional comments (12)
compose/production/django/celery/worker/start (1)

7-26: LGTM! Excellent resource protection and documentation.

The worker protections are well-configured:

  • --max-tasks-per-child=50 prevents memory leaks from accumulating across tasks
  • --max-memory-per-child=4000000 (4GB) provides a reasonable ceiling for ML workloads
  • Comments clearly explain the interaction between healthcheck, autoheal, and restart policies
compose/local/django/celery/worker/start (1)

15-28: Verify the watchfiles command syntax.

The conditional startup logic is well-structured, but please verify the watchfiles invocation on Line 27 is correct:

exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker -l INFO ...'

Based on watchfiles documentation, the typical syntax is watchfiles [module] [args]. Confirm that:

  1. The module path celery.__main__.main is correctly specified
  2. The --args flag properly passes all arguments to Celery
  3. The command successfully reloads the worker when Python files change

Consider testing this locally with a file change to ensure auto-reload triggers as expected.

compose/production/django/Dockerfile (2)

48-49: LGTM! Correctly adds procps for healthcheck.

The procps package provides pgrep which is used by the healthcheck script to verify the Celery worker process is running.


86-88: LGTM! Properly installs healthcheck script.

The healthcheck directory is copied and made executable, enabling Docker's healthcheck mechanism to monitor worker status.

docker-compose.worker.yml (2)

28-35: LGTM! Well-configured healthcheck parameters.

The healthcheck configuration is appropriate:

  • 30s interval provides frequent checks without overhead
  • 15s timeout allows for slow responses during high load
  • 3 retries (90s total) prevents premature restarts from transient issues
  • 60s start_period accommodates worker initialization

37-46: Container names are unique and correctly environment-scoped—no changes needed.

The verification confirms the container names across compose files are distinct: ami_local_redis (local), ami_worker_autoheal (worker), and ami_production_autoheal (production). The naming convention prevents collisions even if files were deployed on the same host. The autoheal service's global label-based restart behavior is the standard expected behavior for the autoheal image—confirm this is intentional for your use case.

compose/local/django/Dockerfile (2)

44-45: LGTM! Correctly adds procps for healthcheck.

Consistent with the production Dockerfile, this ensures pgrep is available for the healthcheck script.


79-81: LGTM! Properly installs healthcheck script.

The healthcheck directory is correctly copied and made executable for local development.

docker-compose.yml (2)

89-107: LGTM! Healthcheck configuration is appropriate for local development.

The healthcheck is properly configured with the same parameters as production. Note that unlike docker-compose.worker.yml and docker-compose.production.yml, this local compose file:

  • Does not include an autoheal service
  • Does not set restart: always policy
  • Does not have the autoheal=true label

This is appropriate for local development where developers may want manual control over container restarts for debugging purposes.


93-100: Clear documentation for debugging options.

The commented environment variables provide helpful guidance for enabling debugpy remote debugging and controlling auto-reload behavior.

docker-compose.production.yml (2)

32-39: Celeryworker healthcheck configuration is sound.

The script-based approach allows for sophisticated health validation beyond simple process checks. The 60-second start period provides adequate buffer for Celery worker initialization, and the 30-second check interval with 3 retries (90-second total to unhealthy) balances responsiveness with stability.

Ensure /celery/healthcheck.sh exists in the production Django image and handles cases where the Redis broker is unavailable (per PR summary, it should work without broker connectivity).


63-70: Flower healthcheck configuration is appropriate.

The HTTP endpoint check is reliable and the timing (30s interval, 5s timeout, 3 retries) is well-suited for Flower's responsiveness characteristics. Internal port 5555 is correctly used despite the service being exposed on 5550 to the host.
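
As a reference, an HTTP-based compose healthcheck of the shape described here might look like the following (a sketch: the exact test command and endpoint in docker-compose.production.yml may differ, and it assumes curl is available in the image):

flower:
  healthcheck:
    test: ["CMD-SHELL", "curl -fsS http://localhost:5555/ > /dev/null || exit 1"]
    interval: 30s
    timeout: 5s
    retries: 3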

Comment on lines +21 to +28
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi

⚠️ Potential issue | 🔴 Critical

Fix Redis broker URL parsing.

Line 24 uses ${CELERY_BROKER_URL:-redis} directly as the hostname argument to redis-cli, but CELERY_BROKER_URL is typically a full URL like redis://redis:6379/0, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.

Apply this diff to properly extract the hostname from the broker URL:

 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.

🤖 Prompt for AI Agents
In compose/local/django/celery/healthcheck.sh around lines 21 to 28, the script
currently passes CELERY_BROKER_URL (a full URL) directly to redis-cli -h which
fails; instead parse CELERY_BROKER_URL to extract host and port (strip scheme
like redis://, remove any userinfo before @, cut off any path/query after first
/, then split host[:port] into host and port), default to host "redis" and port
"6379" if missing, and call redis-cli with both -h host -p port and a 5-second
timeout (redis-cli -h <host> -p <port> -t 5 ping) so the healthcheck correctly
connects and won’t hang.

Comment on lines +21 to +28
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi

⚠️ Potential issue | 🔴 Critical

Fix Redis broker URL parsing.

Line 24 uses ${CELERY_BROKER_URL:-redis} directly as the hostname argument to redis-cli, but CELERY_BROKER_URL is typically a full URL like redis://redis:6379/0, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.

Apply this diff to properly extract the hostname from the broker URL:

 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.

🤖 Prompt for AI Agents
In compose/production/django/celery/healthcheck.sh around lines 21 to 28, the
script currently passes ${CELERY_BROKER_URL:-redis} directly to redis-cli which
fails when CELERY_BROKER_URL is a full URL (e.g. redis://redis:6379/0); update
the script to parse CELERY_BROKER_URL to extract host (and optionally port)
using shell string manipulation or a simple URL parse (fall back to "redis" host
if unset), then call redis-cli with -h <host> and -p <port> as appropriate and
include a connection timeout (e.g. --connect-timeout 5 or use redis-cli -t 5) so
the healthcheck fails fast on unresponsive Redis.
