
httpserver: Fix race condition in server startup #457

Closed
HayaoSuzuki wants to merge 3 commits into csernazs:master from HayaoSuzuki:fix/wait-for-server-ready

Conversation

@HayaoSuzuki
Contributor

@HayaoSuzuki HayaoSuzuki commented Jan 28, 2026

Motivation

start() returns immediately after thread.start(), but the server may not be accepting connections yet. This causes intermittent connection failures, especially in CI environments.

Changes

  • Wait for server readiness in start() using a threading event.
  • Maintain backward compatibility: warn (don't fail) if a custom thread_target() doesn't set the event.
  • Add tests for server startup readiness.

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch 2 times, most recently from ff68652 to ada0679 on January 28, 2026 12:53
@HayaoSuzuki HayaoSuzuki marked this pull request as ready for review January 28, 2026 12:59
@csernazs
Owner

Hi @HayaoSuzuki ,

Your PR looks ok at first sight, but I want to take a closer look.

I'm just wondering how this issue hasn't appeared until now (as it seems clear that this is racy). Maybe it was pure luck, or Python got faster, so the window between the thread starting and the first request became shorter?

Which version of python did you use?

I'm also thinking about installing some internally used hook to check server readiness:

  1. Server binds
  2. It registers /ready uri returning status 200
  3. In the fixture we wait for this after the server started. So we poll the url periodically.
  4. We remove the /ready uri handler to not conflict with the actual tests (which might have this for their own tests).

Zsolt

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch from ada0679 to 1952ca0 on January 29, 2026 00:56
@HayaoSuzuki
Contributor Author

@csernazs
Thank you for reviewing my PR.

I'm using Python 3.14.2 for local development and for our projects.

Here's a minimal reproduction of how we're using pytest-httpserver. Multiple servers are started via a with statement, and requests are made immediately afterward:

import requests
from pytest_httpserver import HTTPServer


def test_race_condition():
    servers = [HTTPServer(port=7001 + i) for i in range(5)]

    with servers[0], servers[1], servers[2], servers[3], servers[4]:
        for server in servers:
            server.expect_request("/ping").respond_with_data("pong")

        for server in servers:
            response = requests.get(server.url_for("/ping"))
            assert response.status_code == 200

In our integration tests, we start 5–6 mock servers simultaneously and immediately send health-check requests. This pattern seems to trigger the race condition more frequently.

Regarding the /ready endpoint approach you mentioned, I'm happy to implement that if you prefer it over the current threading event implementation.

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch from 1952ca0 to 135be3a on January 30, 2026 07:54
@csernazs
Owner

Thanks for the explanation of your use case.

I checked how werkzeug binds to the port: the bind() happens in make_server() in the main thread. That means that after that point, clients can connect to the server, but they will be waiting to be accept()ed, up to the backlog limit specified for listen(). So for me, this means that the server is ready when start() returns.

I wanted to write a test which triggered the error you reported, but this code ran fine for me (on master):

import requests
from pytest_httpserver import HTTPServer
import time


class SlowStartServer(HTTPServer):
    """A server subclass that simulates slow startup."""

    def thread_target(self):
        assert self.server is not None
        time.sleep(5)  # Simulate slow initialization
        self.server.serve_forever()


def test_race_condition():
    servers = [SlowStartServer(port=7001 + i) for i in range(5)]

    with servers[0], servers[1], servers[2], servers[3], servers[4]:
        for server in servers:
            server.expect_request("/ping").respond_with_data("pong")

        for server in servers:
            response = requests.get(server.url_for("/ping"))
            assert response.status_code == 200

The above example relied on the timeout of requests.get(), which is infinite by default (IIRC).

Could you please check? I'm happy to merge your PR but I would like to understand/reproduce your issue first.

Zsolt

@HayaoSuzuki
Contributor Author

@csernazs
Thank you for the detailed analysis.
You're right that bind() / listen() happen in make_server(), so TCP connections can be accepted into the backlog before the server thread starts handling them.

To make the issue reproducible, I added tests that use short client timeouts. With startup_timeout=0.0 (old behavior), TCP connect succeeds, but the HTTP request times out because serve_forever() hasn’t started yet:

import socket
import time

import pytest
import requests
from pytest_httpserver import HTTPServer


class SlowServeServer(HTTPServer):
    def thread_target(self):
        time.sleep(1.0)  # delay before the accept loop starts
        self._server_ready_event.set()
        self.server.serve_forever()


def test_http_request_fails_before_serve_forever_without_wait():
    server = SlowServeServer(host="localhost", port=0, startup_timeout=0.0)
    server.expect_request("/ping").respond_with_data("pong")
    server.start()

    # TCP connection succeeds (queued in the listen backlog)
    sock = socket.create_connection((server.host, server.port), timeout=1)
    sock.close()

    # ...but the HTTP request with a short timeout fails, because
    # serve_forever() has not started handling connections yet
    with pytest.raises(requests.exceptions.Timeout):
        requests.get(server.url_for("/ping"), timeout=(0.5, 0.5))

    server.stop()

In containerized environments (e.g. Docker Compose), clients often set short timeouts, so this race shows up as HTTP failures even though TCP connects.
The fix makes start() wait for the ready event (set immediately before serve_forever() in the default thread_target), so HTTP requests work right after start() returns.

(Also, the previous reproduction didn’t fail because requests defaults to timeout=None, i.e., no timeout.)

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch from 180b92f to d57ddbc on January 30, 2026 08:48
@csernazs
Owner

hi @HayaoSuzuki ,

Ok, I see your point now. If the client uses a minimal timeout, it can happen that the server doesn't accept the connection in time.

Your PR adds an event that moves the readiness signal closer to request processing; note that there is still a small gap between serve_forever() starting and werkzeug calling accept() (just FYI).

So I'm also considering polling (e.g. issuing an HTTP request to verify the server is capable of serving HTTP), but that would add more complicated code, and polling can make tests slower, especially if the first try fails and the client sleeps before the next attempt. Your thread-Event-based implementation has no such problem.

Does this PR solve your issue? Could you test it before we merge it?

My other worry is that in tests (thanks for writing the tests!) it is nearly impossible to avoid sleeps and timing assumptions, and this opens the possibility of flakiness. So my ideas are:

  1. either reduce the number of tests to some minimum level (which can be still flaky, though)
  2. or add a flaky marker and run the tests in our CI only (and if the developer specifies it), but it would be excluded from the tests in general (so if a package maintainer wants to install our package, they would not run it by default).
  3. and/or increase the separation of the timings significantly (eg in SlowStartServer it would sleep longer but that would make the test run longer as well).

What do you think?

Zsolt

@HayaoSuzuki
Contributor Author

@csernazs
I'll test this PR in our Docker Compose environment, where we originally hit the problem. Please give me some time. The issue is flaky and doesn't always reproduce immediately.

I'm leaning toward option 3 (increasing timing separation). It seems the most straightforward way to make the tests stable without adding marker/CI complexity.

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch 3 times, most recently from ec85824 to af975f2 on January 30, 2026 10:37
@csernazs
Owner

hi @HayaoSuzuki ,

I think we could do this better: as the server is already bound to the address in the main thread, we could safely make one socket connection attempt, wait until the connection is established (i.e. connect() has returned and the server has accept()ed it), and then close the connection without making an HTTP request. No retry (or polling) would be needed, as the server is already bound, so connect() can't fail.

This would narrow the gap in your implementation, as it would be ensured that the client can connect to the server.

What do you think?

We could make the connection from the main thread (with an adjustable timeout - if needed) so there would be no thread event needed.
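The probe described here could be as small as the following sketch (`wait_until_connectable` is a hypothetical name; as the next comment points out, connect() returning only proves the socket is bound and listening, not that the accept loop is running):

```python
import socket


def wait_until_connectable(host, port, timeout=5.0):
    """Single connect attempt from the main thread: returns once the
    TCP handshake completes, then closes without sending anything."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.close()
    return True
```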

I'll be AFK for a few days but I'm happy to make an example (but it's your call).

Zsolt

@HayaoSuzuki
Contributor Author

HayaoSuzuki commented Feb 2, 2026

@csernazs
Thanks for the suggestion, but a socket-connect probe still leaves a race.

socket.create_connection() returns once the connection is queued in the backlog.
That only confirms bind()/listen() completed in make_server(), which happens before the server thread enters serve_forever().
So TCP connect can succeed while HTTP requests still time out waiting for accept() to start handling connections.
If we want to guarantee HTTP readiness for short client timeouts, we need a signal closer to serve_forever()/request handling (e.g., the ready event set immediately before serve_forever()).
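The gap described above can be demonstrated with plain stdlib sockets, independent of pytest-httpserver: a socket that listens but never accepts still lets clients connect, yet any read on that connection times out.

```python
import socket

# A listening socket that never calls accept(): connections are
# queued in the kernel backlog but never serviced.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("localhost", 0))
listener.listen(5)
port = listener.getsockname()[1]

# The TCP handshake completes as soon as the kernel queues it...
client = socket.create_connection(("localhost", port), timeout=1)

# ...but no application-level response will ever arrive, so a read
# with a short timeout fails even though connect() succeeded.
client.settimeout(0.2)
client.sendall(b"GET /ping HTTP/1.0\r\n\r\n")
try:
    client.recv(1024)
    got_timeout = False
except socket.timeout:
    got_timeout = True
finally:
    client.close()
    listener.close()
```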

This is the version that uses sockets.
httpserver.py

@HayaoSuzuki HayaoSuzuki force-pushed the fix/wait-for-server-ready branch from af975f2 to 0a3e6ac on February 2, 2026 03:10
@csernazs
Owner

csernazs commented Feb 2, 2026

hi @HayaoSuzuki,

Oh yes, you are right. I always keep forgetting for TCP the connect() returns before accept(). 🤦

I've implemented an HTTP-based readiness check in #462. It uses urllib to avoid adding dependencies, and it modifies the dispatch method so it can intercept the readiness check (in other words, the first request will be the readiness check).

This will have an issue with custom TLS sockets, so I think the readiness check should be kept disabled by default to avoid breaking TLS tests.


What do you think? Take my PR as a PoC.

ps: However, I'm also thinking about adding a fixture purely for creating httpserver arguments (besides make_httpserver), so this feature can be used more easily.

@HayaoSuzuki
Contributor Author

@csernazs
Thanks for the PoC.

The HTTP probe approach looks good to me. I have one concern: super().start() returns before serve_forever() is called in the server thread, so the first urlopen call might get ConnectionRefusedError. A retry loop would be needed to make this work reliably.

@csernazs
Owner

csernazs commented Feb 2, 2026

hi @HayaoSuzuki,

Thanks for the review!

I think that by the time the probe is sent, we have already bound the socket (i.e. werkzeug's make_server() has already run). Otherwise there would be no valid port to probe (in the case of ephemeral ports).

Zsolt

@HayaoSuzuki
Contributor Author

I'm closing this in favor of #462 (HTTP readiness probe).

If we end up needing the thread-event approach, we can always revisit it then.

Thanks, @csernazs

@HayaoSuzuki HayaoSuzuki closed this Feb 6, 2026
@csernazs
Owner

csernazs commented Feb 6, 2026

Thanks!

