Full-httpd restarts taking ~60 seconds and leave server inaccessible

I've been "bpftracing" a slow httpd restart problem that cropped up when we upgraded our AWS VM to the latest Amazon Linux 2023 a while back (which also moved us from Python 3.8.5 to 3.11). Full httpd restarts (`sudo systemctl restart httpd`) almost always take 60+ seconds, during which time all sites are inaccessible and return "connection refused" (edit: not 5xx as I originally wrote). I wasn't sure at first, but from tracing it looks like it may be mod_wsgi related.
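To put a number on the "connection refused" window, a small probe along these lines could be left running while the restart is triggered. This is a minimal sketch under stated assumptions: `port_accepts` and `measure_outage` are hypothetical helper names, and the host and port passed in are whatever httpd actually listens on.

```python
import socket
import time


def port_accepts(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket timeouts are both OSError subclasses.
        return False


def measure_outage(probe, interval=0.25, max_wait=120.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll probe() and time the gap between first failure and next success.

    Returns the outage length in seconds, 0.0 if probe() never failed
    within max_wait, or None if it failed and did not recover in time.
    """
    start = clock()
    down_since = None
    while clock() - start < max_wait:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now          # first refused connection
        elif ok and down_since is not None:
            return now - down_since   # service is accepting again
        sleep(interval)
    return 0.0 if down_since is None else None
```

For example, call `measure_outage(lambda: port_accepts("127.0.0.1", 80))` in one terminal and run `sudo systemctl restart httpd` in another; `127.0.0.1:80` is an assumption about the listener, and the result is only accurate to within the polling interval.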

What I've seen is that, on a requested restart, the root `httpd` PID repeatedly issues `kill()` syscalls against the very last httpd child PID, once per second, until the 60-second timeout elapses. That last child PID is not a site-specific WSGI daemon process; it comes right after them. `sudo systemctl status httpd` shows something like:

```
          ├─676367 /usr/sbin/httpd -DFOREGROUND
             ├─676369 /usr/bin/python3.11 /etc/httpd/error_logger.py
             ├─676370 /usr/sbin/httpd -DFOREGROUND
             ├─676371 "(wsgi:lukerissa" -DFOREGROUND
             ├─676372 "(wsgi:ra)      " -DFOREGROUND
             ├─676373 "(wsgi:analytics" -DFOREGROUND
              ....
             ├─676392 "(wsgi:finance) " -DFOREGROUND
             ├─676393 /usr/sbin/httpd -DFOREGROUND    <--- root httpd repeatedly tries to kill this
```

When I trace that last child (e.g. 676393 above) during normal operation, I see it doing reads and writes over Unix domain sockets like `\x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00`, presumably proxying WSGI responses. It also opens and reads static content files from multiple different sites.

This may be related to `WSGIDaemonProcess`'s `socket-timeout` or `response-socket-timeout` options. We don't set either explicitly, so they fall back to the `Timeout` directive, which is also at its default (60 seconds).
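The fallback chain described above can be written out as a small worked example. This is only a sketch of the documented defaults as understood here, not mod_wsgi's actual code: `resolve_timeouts` is a hypothetical function, and it assumes that an unset (or zero) `response-socket-timeout` falls back to `socket-timeout`, which in turn falls back to Apache's `Timeout`.

```python
def resolve_timeouts(apache_timeout=60, socket_timeout=None,
                     response_socket_timeout=None):
    """Return the effective timeouts in seconds under the assumed fallback rules."""
    # Assumption: None or 0 means "not set", so fall through to the next value.
    effective_socket = socket_timeout or apache_timeout
    effective_response = response_socket_timeout or effective_socket
    return {
        "socket-timeout": effective_socket,
        "response-socket-timeout": effective_response,
    }


# With nothing set explicitly, both values land on the Apache default of 60,
# which lines up with the ~60 second restart delay.
print(resolve_timeouts())
```

If that reading is right, setting `socket-timeout` explicitly on `WSGIDaemonProcess` would lower both values at once, which would be one way to test whether the delay tracks this setting.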

I can keep digging but thought I'd check here if anything stands out (and have it here in case others can benefit). If it's related to HTTP keepalives or browser clients whose connections have stalled, I'd prefer to aggressively disconnect those to keep the restart quick, rather than take quite a few sites offline for 60+ seconds.

Full-httpd restarts taking ~60 seconds and leave server inaccessible #925
