Commit de5ae7d

fix(ci): disable rerunfailures crash recovery under xdist
The socket-based crash recovery threads in pytest-rerunfailures get TimeoutError on recv(1) during heavy xdist startup. Normal --reruns still works (each worker retries locally).
1 parent 0c5d0be commit de5ae7d

File tree

3 files changed: +12 −3 lines


docs/notes.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Scattering imports inside every method of a class (like the `Browser` class in s

 ## pytest-rerunfailures crash recovery under xdist

-The experiment branch disabled `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM`, claiming the socket-based crash recovery protocol deadlocks during heavy xdist startup due to connection timeouts. However, reading the actual source code (v15.0), the server thread (`ServerStatusDB`) calls `self.sock.accept()` in an infinite loop with no timeout, and the socket is set to `setblocking(1)` with no timeout on `recv(1)`. There is no connection window that workers can miss. The deadlock explanation from the experiment docs doesn't match the code. Skip this change and only revisit if we actually hit freezes when enabling xdist.
+The experiment branch was correct. The `run_connection` threads get `TimeoutError: timed out` on `conn.recv(1)` — the timeout is set on the accepted connection socket, not the listening socket (which is why reading the `__init__` code alone was misleading). During heavy xdist startup, workers take too long to send data, the connection threads die, and crash recovery breaks. Setting `HAS_PYTEST_HANDLECRASHITEM = False` disables crash recovery mode. Normal `--reruns` still works (each worker retries locally).
 ## Why hash-based sharding beats algorithmic LPT

docs/tiered-xdist-changes.md

Lines changed: 8 additions & 2 deletions
@@ -71,9 +71,15 @@ Region name RNG is seeded with `PYTEST_XDIST_TESTRUNUID` so all workers generate

 ### 2e. xdist CI workflow

-**New file:** `.github/workflows/backend-xdist.yml` — copy of `backend.yml` with minimal changes: triggers on `mchen/tiered-xdist-v2` branch, adds `PYTHONHASHSEED=0`, `XDIST_PER_WORKER_SNUBA=1`, `SENTRY_SKIP_SELENIUM_PLUGIN=1`, per-worker Snuba bootstrap step, and runs pytest with `-n 3 --dist=loadfile` instead of `make test-python-ci`.
+**New file:** `.github/workflows/backend-xdist.yml` — copy of `backend.yml` with minimal changes: triggers on `mchen/tiered-xdist-v2` branch, adds `PYTHONHASHSEED=0`, `XDIST_PER_WORKER_SNUBA=1`, `SENTRY_SKIP_SELENIUM_PLUGIN=1`, per-worker Snuba bootstrap step, and runs pytest with `-n 3 --dist=loadfile` instead of `make test-python-ci`. Per-worker Snuba bootstrap runs all 3 instances in parallel (`&` + `wait`) — sequential bootstrap takes ~55s per worker (~165s total), parallel brings it down to ~55s.
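The `&` + `wait` pattern is plain POSIX shell job control. A sketch with `sleep 1` standing in for the real bootstrap command (not shown here) illustrates why wall time tracks the slowest single bootstrap rather than the sum:

```shell
# Sketch of the parallel-bootstrap pattern; `sleep 1` stands in for the
# real per-worker Snuba bootstrap command (~55s each in CI).
start=$(date +%s)
for i in 0 1 2; do
  sleep 1 &              # launch each worker's bootstrap in the background
done
wait                     # block until every background job has finished
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"   # ~1s wall time, not 3s: the jobs overlap
```

One caveat: bare `wait` exits 0 even when a background job failed, so a CI step that must fail on bootstrap errors should collect each PID and `wait "$pid"` individually (or check statuses explicitly).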
-### 2f. Snowflake test fix
+### 2f. Disable rerunfailures crash recovery
+
+**Modified:** `tests/conftest.py`
+
+Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`. The socket-based crash recovery threads get `TimeoutError` on `recv(1)` during heavy xdist startup. Normal `--reruns` still works.
+
+### 2g. Snowflake test fix

 **Modified:** `tests/sentry/utils/test_snowflake.py`

tests/conftest.py

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@

 import psutil
 import pytest
+import pytest_rerunfailures
+
+pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False
 import responses
 import sentry_sdk
 from django.core.cache import cache
