Commit de5ae7d

fix(ci): disable rerunfailures crash recovery under xdist
The socket-based crash recovery threads in pytest-rerunfailures get TimeoutError on recv(1) during heavy xdist startup. Normal --reruns still works (each worker retries locally).
1 parent 0c5d0be commit de5ae7d

File tree

3 files changed: +12 −3 lines


docs/notes.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Scattering imports inside every method of a class (like the `Browser` class in s

 ## pytest-rerunfailures crash recovery under xdist

-The experiment branch disabled `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM`, claiming the socket-based crash recovery protocol deadlocks during heavy xdist startup due to connection timeouts. However, reading the actual source code (v15.0), the server thread (`ServerStatusDB`) calls `self.sock.accept()` in an infinite loop with no timeout, and the socket is set to `setblocking(1)` with no timeout on `recv(1)`. There is no connection window that workers can miss. The deadlock explanation from the experiment docs doesn't match the code. Skip this change and only revisit if we actually hit freezes when enabling xdist.
+The experiment branch was correct. The `run_connection` threads get `TimeoutError: timed out` on `conn.recv(1)` — the timeout is set on the accepted connection socket, not the listening socket (which is why reading the `__init__` code alone was misleading). During heavy xdist startup, workers take too long to send data, the connection threads die, and crash recovery breaks. Setting `HAS_PYTEST_HANDLECRASHITEM = False` disables crash recovery mode. Normal `--reruns` still works (each worker retries locally).
 ## Why hash-based sharding beats algorithmic LPT

docs/tiered-xdist-changes.md

Lines changed: 8 additions & 2 deletions
@@ -71,9 +71,15 @@ Region name RNG is seeded with `PYTEST_XDIST_TESTRUNUID` so all workers generate

 ### 2e. xdist CI workflow

-**New file:** `.github/workflows/backend-xdist.yml` — copy of `backend.yml` with minimal changes: triggers on `mchen/tiered-xdist-v2` branch, adds `PYTHONHASHSEED=0`, `XDIST_PER_WORKER_SNUBA=1`, `SENTRY_SKIP_SELENIUM_PLUGIN=1`, per-worker Snuba bootstrap step, and runs pytest with `-n 3 --dist=loadfile` instead of `make test-python-ci`.
+**New file:** `.github/workflows/backend-xdist.yml` — copy of `backend.yml` with minimal changes: triggers on `mchen/tiered-xdist-v2` branch, adds `PYTHONHASHSEED=0`, `XDIST_PER_WORKER_SNUBA=1`, `SENTRY_SKIP_SELENIUM_PLUGIN=1`, per-worker Snuba bootstrap step, and runs pytest with `-n 3 --dist=loadfile` instead of `make test-python-ci`. Per-worker Snuba bootstrap runs all 3 instances in parallel (`&` + `wait`) — sequential bootstrap takes ~55s per worker (~165s total), parallel brings it down to ~55s.
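The `&` + `wait` pattern is plain POSIX shell job control. A sketch with `sleep 1` standing in for the real bootstrap command (not shown here) illustrates why wall time tracks the slowest single bootstrap rather than the sum:

```shell
# Sketch of the parallel-bootstrap pattern; `sleep 1` stands in for the
# real per-worker Snuba bootstrap command (~55s each in CI).
start=$(date +%s)
for i in 0 1 2; do
  sleep 1 &              # launch each worker's bootstrap in the background
done
wait                     # block until every background job has finished
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"   # ~1s wall time, not 3s: the jobs overlap
```

One caveat: bare `wait` exits 0 even when a background job failed, so a CI step that must fail on bootstrap errors should collect each PID and `wait "$pid"` individually (or check statuses explicitly).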
-### 2f. Snowflake test fix
+### 2f. Disable rerunfailures crash recovery
+
+**Modified:** `tests/conftest.py`
+
+Sets `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False`. The socket-based crash recovery threads get `TimeoutError` on `recv(1)` during heavy xdist startup. Normal `--reruns` still works.
+
+### 2g. Snowflake test fix

 **Modified:** `tests/sentry/utils/test_snowflake.py`

tests/conftest.py

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@

 import psutil
 import pytest
+import pytest_rerunfailures
+
+pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM = False
 import responses
 import sentry_sdk
 from django.core.cache import cache
