Docker healthcheck and auto-restart for Celery workers #1024
base: main
Conversation
Implements healthcheck scripts, worker protections (max-tasks-per-child, max-memory-per-child), and autoheal container for automatic recovery of stuck workers.
Walkthrough

This PR introduces comprehensive healthcheck and auto-healing capabilities to Celery workers across local, staging, and production environments. Healthcheck scripts validate worker processes and broker connectivity, while Docker Compose configurations add automated restart policies and integrate an autoheal service to manage container health.

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Docker as Docker<br/>Engine
participant Container as Celery<br/>Worker Container
participant Autoheal as Autoheal<br/>Service
participant Redis as Redis<br/>Broker
loop Every 30s (celeryworker) / 60s (celerybeat)
Docker->>Container: Execute healthcheck.sh
Container->>Container: Check worker process<br/>(pgrep)
alt Process found
Container->>Redis: Ping broker<br/>(redis-cli)
alt Broker responds
Container-->>Docker: Exit 0 (healthy)
else Broker unreachable
Container-->>Docker: Exit 1 (unhealthy)
end
else Process not found
Container-->>Docker: Exit 1 (unhealthy)
end
end
Docker->>Autoheal: Report container status<br/>(autoheal label detected)
alt Container unhealthy for N retries
Autoheal->>Docker: Request container restart
Docker->>Container: Stop & restart
else Container healthy
Autoheal-->>Autoheal: Continue monitoring
end
```

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Actionable comments posted: 2
🧹 Nitpick comments (4)
compose/local/django/celery/healthcheck.sh (1)
15-19: Consider a more specific process pattern.

The pattern "celery.*worker" could potentially match unrelated processes. Consider using a more specific pattern like "celery -A config.celery_app worker" to ensure you're matching the actual worker process.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

compose/production/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

The pattern "celery.*worker" could potentially match unrelated processes. Consider using a more specific pattern like "celery -A config.celery_app worker" to ensure you're matching the actual worker process.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

docker-compose.production.yml (2)
46-53: Consider increasing the celerybeat start_period to match celeryworker.

The 30-second start period may be insufficient for Celery Beat to fully initialize and begin task scheduling. The celeryworker service uses 60 seconds; consider matching this for consistency, or document why Beat requires faster readiness.

Additionally, the pgrep check only validates process existence; it does not detect whether Beat is actively scheduling tasks. For production, consider adding a secondary check (e.g., verifying Beat has logged recent task-scheduling activity) if available in the healthcheck script.
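To make the suggestion concrete, here is a minimal sketch of the adjusted stanza, assuming celerybeat reuses the same healthcheck script path as the worker (the timeout and retries values are illustrative):

```yaml
celerybeat:
  healthcheck:
    test: ["CMD", "/celery/healthcheck.sh"]  # script path used elsewhere in this PR
    interval: 60s      # celerybeat is checked every 60s per the sequence diagram
    timeout: 15s       # illustrative
    retries: 3         # illustrative
    start_period: 60s  # raised from 30s to match celeryworker's warm-up window
```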
72-81: Pin the autoheal image version and document privileged socket access.

The autoheal service uses the latest tag, which risks introducing breaking changes on image updates. For production stability, pin to a specific version:

```diff
- image: willfarrell/autoheal:latest
+ image: willfarrell/autoheal:1.2.0
```

Additionally, the /var/run/docker.sock mount grants significant privileges to the autoheal container (the ability to restart any container). Ensure this security consideration is documented in deployment notes or the README for operators.
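For reference, a hedged sketch of a pinned autoheal service; the version tag is illustrative, and AUTOHEAL_CONTAINER_LABEL is the image's standard mechanism for opting containers in via the autoheal=true label:

```yaml
autoheal:
  image: willfarrell/autoheal:1.2.0  # pinned; use whatever version you validate
  restart: always
  environment:
    - AUTOHEAL_CONTAINER_LABEL=autoheal  # restart only containers labeled autoheal=true
  volumes:
    # Privileged: this socket lets autoheal restart any container on the host.
    - /var/run/docker.sock:/var/run/docker.sock
```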
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)

- compose/local/django/Dockerfile (2 hunks)
- compose/local/django/celery/healthcheck.sh (1 hunks)
- compose/local/django/celery/worker/start (1 hunks)
- compose/production/django/Dockerfile (2 hunks)
- compose/production/django/celery/healthcheck.sh (1 hunks)
- compose/production/django/celery/worker/start (1 hunks)
- docker-compose.production.yml (2 hunks)
- docker-compose.worker.yml (1 hunks)
- docker-compose.yml (1 hunks)
🔇 Additional comments (12)
compose/production/django/celery/worker/start (1)
7-26: LGTM! Excellent resource protection and documentation.

The worker protections are well-configured (a sketch of the resulting invocation follows this list):

- --max-tasks-per-child=50 prevents memory leaks from accumulating across tasks
- --max-memory-per-child=4000000 (4GB) provides a reasonable ceiling for ML workloads
- Comments clearly explain the interaction between healthcheck, autoheal, and restart policies
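As referenced above, a sketch of what the protected invocation plausibly looks like; the module path config.celery_app comes from this review, while the log level and layout are assumptions:

```sh
# Recycle each child process after 50 tasks; --max-memory-per-child is measured
# in KB, so 4000000 KB is roughly a 4GB ceiling per child process.
exec celery -A config.celery_app worker \
    -l INFO \
    --max-tasks-per-child=50 \
    --max-memory-per-child=4000000
```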
compose/local/django/celery/worker/start (1)
15-28: Verify the watchfiles command syntax.

The conditional startup logic is well-structured, but please verify that the watchfiles invocation on line 27 is correct:

```sh
exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker -l INFO ...'
```

Based on the watchfiles documentation, the typical syntax is watchfiles [module] [args]. Confirm that:

- The module path celery.__main__.main is correctly specified
- The --args flag properly passes all arguments to Celery
- The command successfully reloads the worker when Python files change
Consider testing this locally with a file change to ensure auto-reload triggers as expected.
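A hedged way to run that local test (the service name matches this compose setup; the touched file path is a stand-in for any watched Python file):

```sh
# Start the worker and tail its logs, then modify a Python file.
docker compose up -d celeryworker
docker compose logs -f celeryworker &
touch config/celery_app.py   # illustrative path; any watched .py file works
# Expect watchfiles to report the change and restart the Celery worker process.
```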
compose/production/django/Dockerfile (2)
48-49: LGTM! Correctly adds procps for healthcheck.

The procps package provides pgrep, which is used by the healthcheck script to verify the Celery worker process is running.

86-88: LGTM! Properly installs healthcheck script.

The healthcheck directory is copied and made executable, enabling Docker's healthcheck mechanism to monitor worker status.
docker-compose.worker.yml (2)
28-35: LGTM! Well-configured healthcheck parameters.

The healthcheck configuration is appropriate (consolidated in the sketch after this list):
- 30s interval provides frequent checks without overhead
- 15s timeout allows for slow responses during high load
- 3 retries (90s total) prevents premature restarts from transient issues
- 60s start_period accommodates worker initialization
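Put together, the stanza these bullets describe would look roughly like this; the script path and label come from elsewhere in this review, and the exact key order is immaterial:

```yaml
celeryworker:
  restart: always
  labels:
    - autoheal=true   # opts this container into autoheal-managed restarts
  healthcheck:
    test: ["CMD", "/celery/healthcheck.sh"]
    interval: 30s     # frequent checks without much overhead
    timeout: 15s      # tolerates slow responses under load
    retries: 3        # 3 x 30s = 90s before Docker marks the container unhealthy
    start_period: 60s # grace period for worker initialization
```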
37-46: Container names are unique and correctly environment-scoped; no changes needed.

The verification confirms the container names across compose files are distinct: ami_local_redis (local), ami_worker_autoheal (worker), and ami_production_autoheal (production). The naming convention prevents collisions even if the files were deployed on the same host. The autoheal service's global label-based restart behavior is the standard expected behavior for the autoheal image; confirm this is intentional for your use case.

compose/local/django/Dockerfile (2)
44-45: LGTM! Correctly adds procps for healthcheck.

Consistent with the production Dockerfile, this ensures pgrep is available for the healthcheck script.

79-81: LGTM! Properly installs healthcheck script.

The healthcheck directory is correctly copied and made executable for local development.
docker-compose.yml (2)
89-107: LGTM! Healthcheck configuration is appropriate for local development.

The healthcheck is properly configured with the same parameters as production. Note that unlike docker-compose.worker.yml and docker-compose.production.yml, this local compose file:

- Does not include an autoheal service
- Does not set a restart: always policy
- Does not have the autoheal=true label

This is appropriate for local development, where developers may want manual control over container restarts for debugging purposes.

93-100: Clear documentation for debugging options.

The commented environment variables provide helpful guidance for enabling debugpy remote debugging and controlling auto-reload behavior.
docker-compose.production.yml (2)
32-39: Celeryworker healthcheck configuration is sound.

The script-based approach allows for sophisticated health validation beyond simple process checks. The 60-second start period provides adequate buffer for Celery worker initialization, and the 30-second check interval with 3 retries (90 seconds total to unhealthy) balances responsiveness with stability.

Ensure /celery/healthcheck.sh exists in the production Django image and handles cases where the Redis broker is unavailable (per the PR summary, it should work without broker connectivity).

63-70: Flower healthcheck configuration is appropriate.

The HTTP endpoint check is reliable, and the timing (30s interval, 5s timeout, 3 retries) is well-suited for Flower's responsiveness characteristics. Internal port 5555 is correctly used despite the service being exposed on 5550 to the host.
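A sketch of what that HTTP check plausibly looks like; the exact probe command is an assumption (any successful response from Flower's UI on its internal port would do):

```yaml
flower:
  ports:
    - "5550:5555"     # host 5550 -> container 5555
  healthcheck:
    test: ["CMD-SHELL", "curl -fsS http://localhost:5555/ > /dev/null || exit 1"]
    interval: 30s
    timeout: 5s
    retries: 3
```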
In compose/local/django/celery/healthcheck.sh:

```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing.
Line 24 uses ${CELERY_BROKER_URL:-redis} directly as the hostname argument to redis-cli, but CELERY_BROKER_URL is typically a full URL like redis://redis:6379/0, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.
Apply this diff to properly extract the hostname from the broker URL:
```diff
 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi
```

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.
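To sanity-check the suggested sed expression outside the container, it can be run standalone (the sample URLs are illustrative):

```sh
# Exercise the host-extraction expression against a few broker URL shapes.
extract_host() { echo "$1" | sed -E 's|^redis://([^:/@]+).*|\1|'; }

extract_host "redis://redis:6379/0"   # -> redis
extract_host "redis://myhost"         # -> myhost
extract_host "redis"                  # -> redis (no scheme: pattern doesn't match, input passes through)
# Caveat: URLs carrying userinfo (redis://:password@host:6379) start with ':' after
# the scheme, so this pattern won't match them and the raw URL passes through unchanged.
```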
🤖 Prompt for AI Agents
In compose/local/django/celery/healthcheck.sh around lines 21 to 28, the script
currently passes CELERY_BROKER_URL (a full URL) directly to redis-cli -h which
fails; instead parse CELERY_BROKER_URL to extract host and port (strip scheme
like redis://, remove any userinfo before @, cut off any path/query after first
/, then split host[:port] into host and port), default to host "redis" and port
"6379" if missing, and call redis-cli with both -h host -p port and a 5-second
timeout (redis-cli -h <host> -p <port> -t 5 ping) so the healthcheck correctly
connects and won’t hang.
In compose/production/django/celery/healthcheck.sh:

```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing.
Line 24 uses ${CELERY_BROKER_URL:-redis} directly as the hostname argument to redis-cli, but CELERY_BROKER_URL is typically a full URL like redis://redis:6379/0, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.
Apply this diff to properly extract the hostname from the broker URL:
```diff
 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi
```

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
🤖 Prompt for AI Agents
In compose/production/django/celery/healthcheck.sh around lines 21 to 28, the
script currently passes ${CELERY_BROKER_URL:-redis} directly to redis-cli which
fails when CELERY_BROKER_URL is a full URL (e.g. redis://redis:6379/0); update
the script to parse CELERY_BROKER_URL to extract host (and optionally port)
using shell string manipulation or a simple URL parse (fall back to "redis" host
if unset), then call redis-cli with -h <host> and -p <port> as appropriate and
include a connection timeout (e.g. --connect-timeout 5 or use redis-cli -t 5) so
the healthcheck fails fast on unresponsive Redis.
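A hedged sketch of the fuller host-and-port parsing the prompt describes, using only POSIX parameter expansion; the variable names and timeout wrapper are illustrative:

```sh
#!/bin/sh
# Derive host and port from a broker URL such as redis://user:pass@host:6379/0.
url="${CELERY_BROKER_URL:-redis://redis:6379}"
hostport="${url#*://}"      # strip the scheme      -> user:pass@host:6379/0
hostport="${hostport##*@}"  # strip any userinfo    -> host:6379/0
hostport="${hostport%%/*}"  # strip path/db suffix  -> host:6379
host="${hostport%%:*}"
port="${hostport#*:}"
[ "$port" = "$host" ] && port="6379"  # no ':' present: fall back to the default port

if ! timeout 5 redis-cli -h "${host:-redis}" -p "$port" ping > /dev/null 2>&1; then
    echo "ERROR: Cannot connect to Redis broker" >&2
    exit 1
fi
```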

Summary
Implements Docker healthchecks and automatic restart mechanisms for Celery workers to detect and recover from stuck/unresponsive workers without manual intervention.
List of Changes
- Healthcheck scripts for Celery workers (installed at /celery/healthcheck.sh)
- Worker protections: --max-tasks-per-child=50 and --max-memory-per-child=4000000
- autoheal container for automatic restart of unhealthy workers
- procps package in Docker images for healthcheck commands

Related Issues
Addresses the problem of Celery workers periodically crashing or getting stuck and requiring manual restart.
Detailed Description
Problem
Celery workers can become stuck or unresponsive (deadlocked, out of memory, frozen tasks) but Docker doesn't detect this. Workers appear "running" to Docker but stop processing tasks, requiring manual intervention.
Solution
1. Healthcheck Scripts
- Verify the Celery worker process is running (via pgrep)
- Check Redis broker connectivity when redis-cli is available

2. Worker Protections (Preventive)

- --max-tasks-per-child=50: Restarts the worker process after 50 tasks to prevent memory leaks
- --max-memory-per-child=4000000: Restarts the worker if memory exceeds 4GB

3. Local Development Improvements

- CELERY_DEBUG=1 (enable debugpy), CELERY_NO_RELOAD=1 (disable watchfiles)

4. Automatic Recovery

- autoheal container that monitors Docker health status
- restart: always policy brings the container back automatically

How to Test the Changes
Verify healthcheck is working:
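For example, assuming the celeryworker service name used throughout this PR:

```sh
# After the 60s start_period, the STATUS column should read "(healthy)".
docker compose ps celeryworker
# Inspect the recent healthcheck results recorded by Docker.
docker inspect --format '{{json .State.Health}}' "$(docker compose ps -q celeryworker)"
```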
Test unhealthy detection:
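One hedged way to trigger the unhealthy path, assuming the broker service is named redis:

```sh
# Option A: stop the broker so the healthcheck's Redis ping fails.
docker compose stop redis
docker compose ps celeryworker   # expect "(unhealthy)" after ~3 failed checks (~90s)
docker compose start redis

# Option B: kill the worker process inside the container. Note that if celery runs
# as PID 1, the container may simply exit and the restart policy takes over instead.
docker compose exec celeryworker pkill -f "celery.*worker"
```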
Test debugpy (local dev):
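Assuming CELERY_DEBUG=1 starts a debugpy server inside the worker (the port is an assumption; debugpy conventionally listens on 5678, but check the start script):

```sh
# Bring the worker up with debugging enabled, then attach from an IDE.
CELERY_DEBUG=1 docker compose up celeryworker
# e.g. in VS Code, use a "Python: Remote Attach" configuration pointed at localhost:5678.
```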
Screenshots
N/A - Infrastructure change
Deployment Notes
Docker Compose Deployment (Current Production)
- The autoheal container will automatically restart unhealthy workers
- Deploy with docker compose -f docker-compose.worker.yml up -d

Future: Kubernetes/Swarm
Production Considerations
Checklist
Summary by CodeRabbit
New Features
Chores