Description
I've been "bpftracing" a slow httpd restart problem that cropped up when we upgraded our AWS VM to their latest Amazon Linux 2023 a while back (and also from Python 3.8.5 to 3.11). Full httpd restarts (sudo systemctl restart httpd) almost always take 60+ seconds, during which all sites are inaccessible and return "connection refused" (edit: not 5xx as I originally wrote). I wasn't sure at first, but from tracing it looks like it may be mod_wsgi related.
What I've seen is that, when a restart is requested, the root httpd PID repeatedly issues kill() syscalls against the very last httpd child PID, once per second, until that 60-second timeout elapses. That last child PID is not a site-specific WSGI daemon process; it comes right after them. sudo systemctl status httpd shows something like:
├─676367 /usr/sbin/httpd -DFOREGROUND
├─676369 /usr/bin/python3.11 /etc/httpd/error_logger.py
├─676370 /usr/sbin/httpd -DFOREGROUND
├─676371 "(wsgi:lukerissa" -DFOREGROUND
├─676372 "(wsgi:ra) " -DFOREGROUND
├─676373 "(wsgi:analytics" -DFOREGROUND
....
├─676392 "(wsgi:finance) " -DFOREGROUND
├─676393 /usr/sbin/httpd -DFOREGROUND <--- root httpd repeatedly tries to kill this
When I trace that last child (e.g. 676393 above) during normal operation, I see it doing reads and writes over Unix domain sockets like \x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00, presumably proxying WSGI responses. It also opens and reads static content files from multiple different sites.
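For anyone reading the same trace output: the escaped bytes in the socket address above look like a raw Linux sockaddr_un, i.e. a 16-bit sa_family (1 = AF_UNIX) followed by a NUL-terminated path. Here is a minimal Python sketch for decoding such a buffer. The helper name decode_sockaddr_un is my own, and it assumes a little-endian host (x86_64 or aarch64):

```python
import struct

AF_UNIX = 1  # value of AF_UNIX on Linux


def decode_sockaddr_un(raw: bytes) -> str:
    """Return the socket path from a raw Linux sockaddr_un buffer.

    Hypothetical helper for reading trace output; assumes little-endian.
    """
    if len(raw) < 2:
        raise ValueError("buffer too short to hold sa_family")
    # sa_family is an unsigned 16-bit integer in host byte order.
    (family,) = struct.unpack_from("<H", raw, 0)
    if family != AF_UNIX:
        raise ValueError(f"not an AF_UNIX address: sa_family={family}")
    sun_path = raw[2:]
    if sun_path[:1] == b"\x00":
        # A leading NUL marks a Linux abstract socket (no filesystem entry).
        return "@" + sun_path[1:].decode(errors="replace")
    # Pathname socket: the path ends at the first NUL byte.
    return sun_path.split(b"\x00", 1)[0].decode(errors="replace")


raw = b"\x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00"
print(decode_sockaddr_un(raw))  # /etc/httpd/run/wsgi.676367.0.5.sock
```

So the address in my trace is an ordinary pathname socket under /etc/httpd/run, not an abstract one.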
This may be related to WSGIDaemonProcess's socket-timeout or response-socket-timeout options, which we don't set explicitly, so they resolve to the Timeout directive, which is also at its default (60).
I can keep digging but thought I'd check here if anything stands out (and have it here in case others can benefit). If it's related to HTTP keepalives or browser clients whose connections have stalled, I'd prefer to aggressively disconnect those to keep the restart quick, rather than take quite a few sites offline for 60+ seconds.
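To compare timeout settings objectively, it may help to measure how long the port actually refuses connections during a restart. Below is a rough Python sketch of how that could be done. The function names, the polling interval, and the 120-second cap are my own assumptions, not anything from httpd or mod_wsgi:

```python
import socket
import time


def port_accepts(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts.
        return False


def measure_outage(host: str, port: int, interval: float = 0.2,
                   max_wait: float = 120.0):
    """Poll the port and return the length in seconds of the first outage.

    Returns None if no complete down-then-up cycle is seen within max_wait.
    """
    deadline = time.monotonic() + max_wait
    down_since = None
    while time.monotonic() < deadline:
        up = port_accepts(host, port)
        now = time.monotonic()
        if not up and down_since is None:
            down_since = now          # first failed connect: outage starts
        elif up and down_since is not None:
            return now - down_since   # first success after the outage
        time.sleep(interval)
    return None
```

Usage would be to start measure_outage("127.0.0.1", 80) in one terminal, run sudo systemctl restart httpd in another, and compare the returned duration before and after changing a setting.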