httpserver: Fix race condition in server startup#457
httpserver: Fix race condition in server startup#457HayaoSuzuki wants to merge 3 commits intocsernazs:masterfrom
Conversation
ff68652 to
ada0679
Compare
|
Hi @HayaoSuzuki , Your PR looks ok for the first sight but I want to take a closer look. I'm just wondering how this issue has not appeared until now (as it seems clear that this is racey). Maybe it was just pure luck or python got faster so the time between thread starting and the first request became shorter? Which version of python did you use? I'm also thinking about installing some internally used hook to check server readiness:
Zsolt |
ada0679 to
1952ca0
Compare
|
@csernazs I'm using Python 3.14.2 for local development and for our projects. Here's a minimal reproduction of how we're using pytest-httpserver. Multiple servers are started via a with statement, and requests are made immediately afterward: import requests
from pytest_httpserver import HTTPServer
def test_race_condition():
servers = [HTTPServer(port=7001 + i) for i in range(5)]
with servers[0], servers[1], servers[2], servers[3], servers[4]:
for server in servers:
server.expect_request("/ping").respond_with_data("pong")
for server in servers:
response = requests.get(server.url_for("/ping"))
assert response.status_code == 200In our integration tests, we start 5–6 mock servers simultaneously and immediately send health-check requests. This pattern seems to trigger the race condition more frequently. Regarding the |
1952ca0 to
135be3a
Compare
|
Thanks for the explanation of your use. I checked how werkzeug binds to the port: the I wanted to write a test which triggered the error you reported, but this code ran fine for me (on master): import requests
from pytest_httpserver import HTTPServer
import time
class SlowStartServer(HTTPServer):
"""A server subclass that simulates slow startup."""
def thread_target(self):
assert self.server is not None
time.sleep(5) # Simulate slow initialization
self.server.serve_forever()
def test_race_condition():
servers = [SlowStartServer(port=7001 + i) for i in range(5)]
with servers[0], servers[1], servers[2], servers[3], servers[4]:
for server in servers:
server.expect_request("/ping").respond_with_data("pong")
for server in servers:
response = requests.get(server.url_for("/ping"))
assert response.status_code == 200The above example relied on the timeout of the requests.get() which is set to infinity by default (IIRC). Could you please check? I'm happy to merge your PR but I would like to understand/reproduce your issue first. Zsolt |
|
@csernazs To make the issue reproducible, I added tests that use short client timeouts. With class SlowServeServer(HTTPServer):
def thread_target(self):
time.sleep(1.0)
self._server_ready_event.set()
self.server.serve_forever()
def test_http_request_fails_before_serve_forever_without_wait():
server = SlowServeServer(host="localhost", port=0, startup_timeout=0.0)
server.expect_request("/ping").respond_with_data("pong")
server.start()
# TCP connection succeeds (queued in backlog)
sock = socket.create_connection((server.host, server.port), timeout=1)
sock.close()
# HTTP request with a short timeout fails
requests.get(server.url_for("/ping"), timeout=(0.5, 0.5))In containerized environments (e.g. Docker Compose), clients often set short timeouts, so this race shows up as HTTP failures even though TCP connects. (Also, the previous reproduction didn’t fail because |
180b92f to
d57ddbc
Compare
|
hi @HayaoSuzuki , Ok, I see your point now. If the client has minimal timeout then it could happen that the server can't accept the connection in time. Your PR makes an event which makes the code to get closer to the processing - there's still some gap between serve_forever() and calling accept() in werkzeug (just FYI). So I'm also thinking about the polling (eg trying to trigger some http request to verify the server is capable of serving http), but that would add a more complicated code and polling can make tests slower (vs your thead Event based implementation has no such problem) especially if the first try fails and then it goes to sleep for the next try. Does this PR solve your issue? Could you test it before we merge it? My other worry is that in tests (thanks for writing the tests!) this is nearly impossible to avoid using sleep and time management but this also opens the possibility of flakiness. So my idea is that:
What do you think? Zsolt |
|
@csernazs I'm leaning toward option 3 (increasing timing separation). It seems the most straightforward way to make the tests stable without adding marker/CI complexity. |
ec85824 to
af975f2
Compare
|
hi @HayaoSuzuki , I think we could do this better: as the server is already bound to the address in the main loop, then we could safely do one socket connection attempt, then wait until the connection established (eg. the connect() returned, the server accept()ed), and then we could close the connection without making a http request. There would be no retry (or polling) needed as the server is already bound, so the connect() can't go wrong. So this would reduce the gap of your implementation, as it would be ensured that your client can connect to the server. What do you think? We could make the connection from the main thread (with an adjustable timeout - if needed) so there would be no thread event needed. I'll be AFK for a few days but I'm happy to make an example (but it's your call). Zsolt |
|
@csernazs
This is the version that uses sockets. |
af975f2 to
0a3e6ac
Compare
|
hi @HayaoSuzuki, Oh yes, you are right. I always keep forgetting for TCP the connect() returns before accept(). 🤦 I've implemented http based readiness check in #462. This uses urllib to keep our deps, and have a modification of the dispatch method so it can intercept the readiness check (in other words the first request will be the readiness check). This will have an issue with custom TLS sockets so I think the readiness check would be kept disabled by default to not break TLS tests. What do you think? Take my PR as a PoC. ps: However I'm also thinking adding a fixture purely for creating httpserver arguments (besides |
|
@csernazs The HTTP probe approach looks good to me. I have one concern: |
|
hi @HayaoSuzuki, Thanks for the review! I think by that time the probe is sent, we already bound to the socket (eg Zsolt |
|
Thanks! |
Motivation
start()returns immediately afterthread.start(), but the server may not be accepting connections yet. This causes intermittent connection failures, especially in CI environments.Changes
start()using a threading event.thread_target()doesn't set the event.