Full httpd restarts take ~60 seconds and leave the server inaccessible #925

@QuadrupleA

I've been "bpftracing" a slow httpd restart problem that cropped up a while back when we upgraded our AWS VM to the latest Amazon Linux 2023 (and also went from Python 3.8.5 to 3.11). Full httpd restarts (sudo systemctl restart httpd) almost always take 60+ seconds, during which all sites are inaccessible and return "connection refused" (edit: not 5xx as I originally wrote). I wasn't sure at first, but from tracing it looks like it may be mod_wsgi related.
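
For anyone wanting to reproduce the measurement, here is a minimal sketch of a poller that times how long a port refuses connections. The function names `probe` and `measure_outage`, and the host and port passed to them, are illustrative assumptions and not something from our setup.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Return 'open', 'refused', or 'error' for a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Timeouts, unreachable hosts, and similar failures.
        return "error"


def measure_outage(host: str, port: int, interval: float = 0.5,
                   max_wait: float = 120.0) -> float:
    """Poll for up to max_wait seconds and return how long the port was not 'open'.

    Returns 0.0 if the port never went down during the polling window.
    """
    start = time.monotonic()
    down_since = None
    while time.monotonic() - start < max_wait:
        state = probe(host, port)
        now = time.monotonic()
        if state != "open" and down_since is None:
            down_since = now  # first failed attempt marks the start of the outage
        elif state == "open" and down_since is not None:
            return now - down_since  # port is back: outage is over
        time.sleep(interval)
    return 0.0 if down_since is None else time.monotonic() - down_since
```

Running something like `measure_outage("127.0.0.1", 80)` in one terminal while issuing the restart in another should give a rough outage duration, accurate to about one polling interval.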

What I've seen is that, when a restart is requested, the root httpd PID repeatedly issues kill() syscalls against the very last httpd child PID, once per second, until the ~60 seconds elapse. That last child PID is not a site-specific WSGI daemon process; it comes right after them. sudo systemctl status httpd shows something like:

          ├─676367 /usr/sbin/httpd -DFOREGROUND
             ├─676369 /usr/bin/python3.11 /etc/httpd/error_logger.py
             ├─676370 /usr/sbin/httpd -DFOREGROUND
             ├─676371 "(wsgi:lukerissa" -DFOREGROUND
             ├─676372 "(wsgi:ra)      " -DFOREGROUND
             ├─676373 "(wsgi:analytics" -DFOREGROUND
              ....
             ├─676392 "(wsgi:finance) " -DFOREGROUND
             ├─676393 /usr/sbin/httpd -DFOREGROUND    <--- root httpd repeatedly tries to kill this

When I trace that last child (e.g. 676393 above) during normal operation, I see it doing reads and writes over Unix domain sockets such as \x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00, presumably proxying WSGI responses. It is also opening and reading static content files from multiple different sites.
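
As a side note on reading those traced addresses: the leading \x01\x00 looks like the sa_family field of a Linux struct sockaddr_un (AF_UNIX is 1), followed by the NUL-terminated path. A small sketch of decoding such a dump, assuming a little-endian host; `decode_sockaddr_un` is a hypothetical helper name:

```python
import struct

AF_UNIX = 1  # value of AF_UNIX on Linux


def decode_sockaddr_un(raw: bytes) -> str:
    """Decode a raw Linux sockaddr_un dump into its socket path.

    Assumes a little-endian host, so the 2-byte sa_family field reads as b"\\x01\\x00".
    """
    (family,) = struct.unpack_from("<H", raw, 0)
    if family != AF_UNIX:
        raise ValueError(f"not an AF_UNIX address (family={family})")
    path = raw[2:]
    if path.startswith(b"\x00"):
        # A leading NUL marks an abstract-namespace socket; show it with '@'.
        return "@" + path[1:].rstrip(b"\x00").decode("utf-8", "replace")
    # Filesystem socket: the path ends at the first NUL byte.
    return path.split(b"\x00", 1)[0].decode("utf-8", "replace")


print(decode_sockaddr_un(b"\x01\x00/etc/httpd/run/wsgi.676367.0.5.sock\x00"))
# prints: /etc/httpd/run/wsgi.676367.0.5.sock
```

The 676367 in the socket name matches the root httpd PID in the process tree above, which fits with these being the sockets to the mod_wsgi daemon processes.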

This may be related to WSGIDaemonProcess's socket-timeout or response-socket-timeout options. We don't set either explicitly, so they resolve to the Timeout directive, which is also at its default (60).
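
To make that fallback concrete, here is a rough sketch that reads a WSGIDaemonProcess line and reports which timeouts would apply. It encodes my reading above (an unset option falls back to Timeout) rather than anything verified against the mod_wsgi source; `effective_timeouts` and the example directive lines are hypothetical.

```python
import shlex

APACHE_TIMEOUT_DEFAULT = 60  # Timeout is at its default of 60 in our config

TIMEOUT_OPTIONS = ("socket-timeout", "response-socket-timeout")


def effective_timeouts(directive: str,
                       apache_timeout: int = APACHE_TIMEOUT_DEFAULT) -> dict:
    """Return the assumed effective value of each timeout option for one directive.

    Assumption (not verified): an option that is not set explicitly
    resolves to the Apache Timeout value.
    """
    tokens = shlex.split(directive)
    if not tokens or tokens[0].lower() != "wsgidaemonprocess":
        raise ValueError("expected a WSGIDaemonProcess directive")
    # tokens[1] is the daemon process group name; the rest are name=value options.
    options = dict(tok.split("=", 1) for tok in tokens[2:] if "=" in tok)
    return {
        name: int(options[name]) if name in options else apache_timeout
        for name in TIMEOUT_OPTIONS
    }


print(effective_timeouts("WSGIDaemonProcess finance processes=2 threads=15"))
# prints: {'socket-timeout': 60, 'response-socket-timeout': 60}
```

With neither option set, both come out at 60, which lines up with the restart delay I'm seeing, though that could be a coincidence.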

I can keep digging, but thought I'd check here in case anything stands out (and to have it on record in case others can benefit). If it's related to HTTP keepalives or browser clients whose connections have stalled, I'd prefer to aggressively disconnect those to keep the restart quick, rather than take quite a few sites offline for 60+ seconds.
