Commit 0c5d0be

feat(ci): xdist per-worker isolation infrastructure + sanity check workflow
- New `xdist.py` module: per-worker Redis DB, Kafka topics, Snuba URL helpers
- `sentry.py`: wire xdist Redis DB + Snuba URL into settings before `initialize_app`
- `sentry.py`: deterministic region name (seeded RNG) + per-worker snowflake IDs
- `kafka.py`: per-worker topic names and consumer group IDs
- `relay.py`: per-worker container names, port offsets, Redis DB, Kafka topic template vars
- `template/config.yml`: parameterized Kafka topic names
- `skips.py`: `_requires_snuba` reads the per-worker port from the `SNUBA` env var
- `test_snowflake.py`: explicit region override so expected values are deterministic
- `backend-xdist.yml`: copy of `backend.yml` with xdist flags and per-worker Snuba bootstrap

All isolation changes are no-ops without xdist env vars.
1 parent fe1417f commit 0c5d0be

File tree

11 files changed: +829 −42 lines

.github/workflows/backend-xdist.yml

Lines changed: 652 additions & 0 deletions

docs/memory.md

Lines changed: 8 additions & 2 deletions
```diff
@@ -190,7 +190,7 @@ New file: `.github/workflows/classify-services.yml`
 
 `workflow_dispatch` only. Runs the classifier across 22 shards, merges per-shard reports into one artifact.
 
-### Phase 5: Collection Optimization (G1)
+### Phase 5: Collection Optimization (G1) + H1 Overlapped Startup Support
 
 **`pytest_ignore_collect` hook** in `sentry.py`:
 
@@ -206,9 +206,15 @@ Guards:
 
 Blocks until `/tmp/services-ready` sentinel file exists (created by the background service-startup script). Needed because G1 makes collection finish ~50s faster, potentially before services are ready.
 
+**`_requires_snuba` polling** in `skips.py`: add `_wait_for_service()` polling controlled by the `SNUBA_WAIT_TIMEOUT` env var. With H1 overlapped startup, Snuba may not be up when pytest starts; the polling waits instead of failing immediately.
+
 **Two-layer filtering note**: both G1 (`pytest_ignore_collect`) and `pytest_collection_modifyitems` filter by `SELECTED_TESTS_FILE`. G1 prevents import; `modifyitems` deselects after import. G1 is the performance win; `modifyitems` handles class/test-granularity filtering and shard assignment.
 
-### Phase 6: Tiered Workflow
+### Phase 6: Performance Optimizations
+
+**Relay container lifecycle**: broaden `relay_server_setup` and `_relay_container` from function/module scope to session scope. `live_server` is already session-scoped, so this is safe: one Docker container per worker session instead of per test. Only ~6 Relay test classes exist, saving ~50-60s. Add a session-scoped `_relay_container` fixture and make `relay_server` a thin wrapper calling `_ensure_relay_in_db()` + `adjust_settings_for_relay_tests()`.
+
+### Phase 7: Tiered Workflow
 
 **`backend-xdist-split-poc.yml`**: The full CI workflow.
 
```

docs/notes.md

Lines changed: 37 additions & 0 deletions
# Tiered xdist v2 — Design Notes

## Why Relay needs per-worker Docker containers

Each xdist worker needs its own Relay container because Relay's config is baked in at startup:

- Kafka topics: each worker writes to its own topics (`ingest-events-gw0` vs `ingest-events-gw1`)
- Redis DB: each worker uses its own DB number
- Snuba instance: each worker routes to its own Snuba on a different port

A single Relay container can hold only one config, so sharing across workers is not possible.

Within a single worker, a container cannot naively be shared across test classes either: `TransactionTestCase` flushes the DB between tests, deleting the Relay model row that Sentry uses to authenticate Relay (requests get 401s without it). The `_ensure_relay_in_db()` call before each test re-inserts the row, so the container itself can persist across tests in the same class.

Currently there is one container per test (function-scoped). This could be optimized to **one container per worker session**, since `live_server` (pytest-django) is session-scoped and only ~6 Relay test classes exist. That optimization is kept separate from the xdist correctness changes (per-worker naming/ports) to keep concerns clean. The function-scoped `relay_server` fixture would become a thin wrapper calling `_ensure_relay_in_db()` + `adjust_settings_for_relay_tests()`.
## Why Snuba URL must be set before `initialize_app()`

`sentry.utils.snuba` creates a module-level connection-pool singleton (`_snuba_pool`) from `settings.SENTRY_SNUBA` at import time. `initialize_app()` transitively triggers that import through the Django app-loading chain (100+ modules reference `sentry.utils.snuba`), so `settings.SENTRY_SNUBA` must be overridden before `initialize_app()` is called in `pytest_configure`.

We verified that `sentry.utils.snuba` is NOT imported during plugin loading (before `pytest_configure`), so overriding the setting in `pytest_configure` is early enough; no module-level env-var hack is needed.
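The import-time singleton hazard can be shown in miniature. The sketch below uses a hypothetical stand-in module, not Sentry's actual code; only the setting name `SENTRY_SNUBA` is taken from the text above:

```python
import sys
import types

# Stand-in for Django settings; SENTRY_SNUBA mirrors the real setting name.
settings = types.SimpleNamespace(SENTRY_SNUBA="http://127.0.0.1:1218")

def import_fake_snuba():
    # Mimics `import sentry.utils.snuba`: the connection pool captures the
    # setting exactly once, at import time.
    if "fake_snuba" not in sys.modules:
        mod = types.ModuleType("fake_snuba")
        mod.POOL_URL = settings.SENTRY_SNUBA  # module-level singleton
        sys.modules["fake_snuba"] = mod
    return sys.modules["fake_snuba"]

# Override BEFORE the first import: the pool picks up the per-worker URL.
settings.SENTRY_SNUBA = "http://127.0.0.1:1232"
snuba = import_fake_snuba()
assert snuba.POOL_URL == "http://127.0.0.1:1232"

# Overriding AFTER import has no effect on the already-built singleton.
settings.SENTRY_SNUBA = "http://127.0.0.1:1218"
assert import_fake_snuba().POOL_URL == "http://127.0.0.1:1232"
```

This is why the override lives in `pytest_configure` rather than in a fixture: it has to win the race against the first import.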
## Why lazy imports inside fixtures are OK but inside class methods are not

Pytest fixtures are lazily invoked: the import runs only when a test actually requests the fixture. This is standard pytest practice for optional or heavy dependencies. Moving imports from module level into a fixture function body means a single import per fixture, which is clean and sustainable.

Scattering imports inside every method of a class (like the `Browser` class in `selenium.py`) is unsustainable: anyone adding a new method must remember to add the import. The better approach for selenium is conditional plugin loading via an env var.
## pytest-rerunfailures crash recovery under xdist

The experiment branch disabled `pytest_rerunfailures.HAS_PYTEST_HANDLECRASHITEM`, claiming the socket-based crash-recovery protocol deadlocks during heavy xdist startup due to connection timeouts. However, reading the actual source (v15.0): the server thread (`ServerStatusDB`) calls `self.sock.accept()` in an infinite loop with no timeout, and the socket is set to `setblocking(1)` with no timeout on `recv(1)`. There is no connection window that workers can miss, so the deadlock explanation from the experiment docs doesn't match the code. Skip this change and revisit it only if we actually hit freezes when enabling xdist.
## Why hash-based sharding beats algorithmic LPT

With 17+ shards and ~32K tests, the law of large numbers gives hash-based sharding (`sha256(nodeid) % N`) good-enough balance (~90-130s spread). LPT algorithms failed because:

- Test count is a poor proxy for duration (files with a few slow integration tests get treated as "light")
- Flat-duration LPT optimizes `sum(worker_loads)`, but actual wall clock is `max(worker_loads)`; it ignores intra-shard parallelism
- Indivisible mega-scopes (large test classes) create unavoidable hotspots under scope-preserving algorithms
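A minimal sketch of the hash-based scheme (the function name is illustrative, not the actual code):

```python
import hashlib

def shard_for(nodeid: str, num_shards: int) -> int:
    """Map a pytest nodeid to a shard index in [0, num_shards)."""
    # sha256 is stable across processes and Python versions (unlike the
    # built-in hash() under PYTHONHASHSEED randomization), so every
    # process computes the same assignment with no coordination.
    digest = hashlib.sha256(nodeid.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Because the mapping is deterministic, each shard can independently select its own tests from the full collection by checking `shard_for(nodeid, N) == shard_index`.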

docs/tiered-xdist-changes.md

Lines changed: 45 additions & 0 deletions
**What:** Moved `sentry.testutils.pytest.selenium` out of the static `pytest_plugins` list. It's now appended conditionally, only when `SENTRY_SKIP_SELENIUM_PLUGIN != "1"`.

**Why:** selenium is a 23MB package imported at module level; we should avoid loading it when not running acceptance tests. We currently pass `--ignore tests/acceptance`, but that only prevents test collection, not plugin loading.
## 2. xdist Per-Worker Isolation Infrastructure

**Problem:** When pytest-xdist spawns multiple workers (`-n 3`) inside a single shard, all workers share the same Redis, Kafka, Snuba/ClickHouse, and Relay. Without isolation, workers corrupt each other: `flushdb()` wipes another worker's cache, Kafka events cross-pollinate between consumers, `reset_snuba` truncates another worker's data, and identical snowflake IDs cause `IntegrityError` on unique constraints.

**Approach:** Give each worker its own Redis DB number, Kafka topic names, Snuba instance, Relay container, and snowflake ID range. All of it is gated on xdist env vars: **no-ops without them**.
### 2a. xdist helpers + per-worker Redis and Snuba

**New file:** `src/sentry/testutils/pytest/xdist.py` resolves the worker ID once at module level and provides `get_redis_db()`, `get_kafka_topic()`, and `get_snuba_url()`.

**Modified:** `src/sentry/testutils/pytest/sentry.py`: Redis cluster settings call `xdist.get_redis_db()` instead of the hardcoded `TEST_REDIS_DB`. The Snuba URL is overridden via `settings.SENTRY_SNUBA = xdist.get_snuba_url()` in `pytest_configure`, before `initialize_app()`. It must happen first because `sentry.utils.snuba` creates a module-level connection-pool singleton (`_snuba_pool`) from `settings.SENTRY_SNUBA` at import time, and `initialize_app` transitively triggers that import.

**Modified:** `src/sentry/testutils/skips.py`: `_requires_snuba` reads the port from the `SNUBA` env var instead of hardcoding 1218 (per-worker Snuba uses 1230+N).
### 2b. Deterministic region name + per-worker snowflake IDs

**Modified:** `src/sentry/testutils/pytest/sentry.py` (`_configure_test_env_regions`)

The region-name RNG is seeded with `PYTEST_XDIST_TESTRUNUID`, so all workers generate the same name (xdist requires identical test collection). Each worker gets `region_snowflake_id = worker_num + 1`, so concurrent Project/Organization/Team creation produces unique snowflake IDs instead of colliding.
### 2c. Per-worker Kafka topic isolation

**Modified:** `src/sentry/testutils/pytest/kafka.py`: topic names and the consumer group ID use `xdist.get_kafka_topic()`.

**Modified:** `src/sentry/testutils/pytest/template/config.yml`: hardcoded `ingest-events`/`outcomes` replaced with `${KAFKA_TOPIC_EVENTS}`/`${KAFKA_TOPIC_OUTCOMES}` template variables.

**Modified:** `src/sentry/testutils/pytest/relay.py`: passes the per-worker topic names as template variables when rendering the Relay config.
### 2d. Per-worker Relay container isolation

**Modified:** `src/sentry/testutils/pytest/relay.py`

- Per-worker container names (`sentry_test_relay_server_gw0`) and port offsets (`33331 + worker_num * 100`) to avoid Docker name and port collisions.
- Per-worker Redis DB via `xdist.get_redis_db()`.
### 2e. xdist CI workflow

**New file:** `.github/workflows/backend-xdist.yml` is a copy of `backend.yml` with minimal changes: it triggers on the `mchen/tiered-xdist-v2` branch, adds `PYTHONHASHSEED=0`, `XDIST_PER_WORKER_SNUBA=1`, and `SENTRY_SKIP_SELENIUM_PLUGIN=1`, adds a per-worker Snuba bootstrap step, and runs pytest with `-n 3 --dist=loadfile` instead of `make test-python-ci`.
### 2f. Snowflake test fix

**Modified:** `tests/sentry/utils/test_snowflake.py`

Two tests hardcode expected snowflake values assuming `region_snowflake_id=0`. Under xdist, workers use `worker_num + 1` (from 2b). Fix: wrap the tests in `override_regions` with an explicit `Region("test-region", 0, ...)` so the expected values are deterministic.

src/sentry/testutils/pytest/kafka.py

Lines changed: 6 additions & 9 deletions
```diff
@@ -6,6 +6,8 @@
 from confluent_kafka import Consumer, Producer
 from confluent_kafka.admin import AdminClient
 
+from sentry.testutils.pytest import xdist
+
 _log = logging.getLogger(__name__)
 
 MAX_SECONDS_WAITING_FOR_EVENT = 16
@@ -71,10 +73,8 @@ def scope_consumers():
 
     """
     all_consumers: MutableMapping[str, Consumer | None] = {
-        # Relay is configured to use this topic for all ingest messages. See
-        # `templates/config.yml`.
-        "ingest-events": None,
-        "outcomes": None,
+        xdist.get_kafka_topic("ingest-events"): None,
+        xdist.get_kafka_topic("outcomes"): None,
     }
 
     yield all_consumers
@@ -106,10 +106,8 @@ def ingest_consumer(settings):
     from sentry.consumers import get_stream_processor
     from sentry.utils.batching_kafka_consumer import create_topics
 
-    # Relay is configured to use this topic for all ingest messages. See
-    # `template/config.yml`.
     cluster_name = "default"
-    topic_event_name = "ingest-events"
+    topic_event_name = xdist.get_kafka_topic("ingest-events")
 
     if scope_consumers[topic_event_name] is not None:
         # reuse whatever was already created (will ignore the settings)
@@ -120,8 +118,7 @@ def ingest_consumer(settings):
         admin.delete_topic(topic_event_name)
     create_topics(cluster_name, [topic_event_name])
 
-    # simulate the event ingestion task
-    group_id = "test-consumer"
+    group_id = xdist.get_kafka_topic("test-consumer")
 
     consumer = get_stream_processor(
         "ingest-attachments",
```

src/sentry/testutils/pytest/relay.py

Lines changed: 8 additions & 3 deletions
```diff
@@ -13,7 +13,7 @@
 import requests
 
 from sentry.runner.commands.devservices import get_docker_client
-from sentry.testutils.pytest.sentry import TEST_REDIS_DB
+from sentry.testutils.pytest import xdist
 
 _log = logging.getLogger(__name__)
 
@@ -23,6 +23,8 @@
 
 
 def _relay_server_container_name() -> str:
+    if xdist._worker_id:
+        return f"sentry_test_relay_server_{xdist._worker_id}"
     return "sentry_test_relay_server"
 
 
@@ -66,9 +68,10 @@ def relay_server_setup(live_server, tmpdir_factory):
     template_path = _get_template_dir()
     sources = ["config.yml", "credentials.json"]
 
-    relay_port = ephemeral_port_reserve.reserve(ip="127.0.0.1", port=33331)
+    worker_num = xdist._worker_num or 0
+    relay_port = ephemeral_port_reserve.reserve(ip="127.0.0.1", port=33331 + worker_num * 100)
 
-    redis_db = TEST_REDIS_DB
+    redis_db = xdist.get_redis_db()
 
     from sentry.relay import projectconfig_cache
     from sentry.relay.projectconfig_cache.redis import RedisProjectConfigCache
@@ -84,6 +87,8 @@ def relay_server_setup(live_server, tmpdir_factory):
         "KAFKA_HOST": "kafka",
         "REDIS_HOST": "redis",
         "REDIS_DB": redis_db,
+        "KAFKA_TOPIC_EVENTS": xdist.get_kafka_topic("ingest-events"),
+        "KAFKA_TOPIC_OUTCOMES": xdist.get_kafka_topic("outcomes"),
     }
 
     for source in sources:
```

src/sentry/testutils/pytest/sentry.py

Lines changed: 18 additions & 6 deletions
```diff
@@ -17,6 +17,7 @@
 import sentry_sdk
 from django.conf import settings
 
+from sentry.testutils.pytest import xdist
 from sentry.runner.importer import install_plugin_apps
 from sentry.silo.base import SiloMode
 from sentry.testutils.region import TestEnvRegionDirectory
@@ -32,9 +33,6 @@
     os.path.join(os.path.dirname(__file__), os.pardir, os.pardir, os.pardir, os.pardir, "tests")
 )
 
-TEST_REDIS_DB = 9
-
-
 def _use_monolith_dbs() -> bool:
     return os.environ.get("SENTRY_USE_MONOLITH_DBS", "0") == "1"
 
@@ -69,10 +67,21 @@ def _configure_test_env_regions() -> None:
     # Assign a random name on every test run, as a reminder that test setup and
     # assertions should not depend on this value. If you need to test behavior that
     # depends on region attributes, use `override_regions` in your test case.
-    region_name = "testregion" + "".join(random.choices(string.digits, k=6))
+    # Under xdist, seed deterministically so all workers generate the same name
+    # (divergent names break xdist's requirement for identical test collection).
+    xdist_uid = os.environ.get("PYTEST_XDIST_TESTRUNUID")
+    r = random.Random(xdist_uid) if xdist_uid else random
+    region_name = "testregion" + "".join(r.choices(string.digits, k=6))
+
+    # Under xdist, each worker gets a unique snowflake_id (1, 2, 3, ...) so
+    # concurrent model creation doesn't produce colliding IDs.
+    region_snowflake_id = xdist._worker_num + 1 if xdist._worker_num is not None else 0
 
     default_region = Region(
-        region_name, 0, settings.SENTRY_OPTIONS["system.url-prefix"], RegionCategory.MULTI_TENANT
+        region_name,
+        region_snowflake_id,
+        settings.SENTRY_OPTIONS["system.url-prefix"],
+        RegionCategory.MULTI_TENANT,
     )
 
     settings.SENTRY_REGION = region_name
@@ -196,14 +205,17 @@ def pytest_configure(config: pytest.Config) -> None:
     settings.SENTRY_RATELIMITER = "sentry.ratelimits.redis.RedisRateLimiter"
     settings.SENTRY_RATELIMITER_OPTIONS = {}
 
+    if snuba_url := xdist.get_snuba_url():
+        settings.SENTRY_SNUBA = snuba_url
+
     settings.SENTRY_ISSUE_PLATFORM_FUTURES_MAX_LIMIT = 1
 
     if not hasattr(settings, "SENTRY_OPTIONS"):
         settings.SENTRY_OPTIONS = {}
 
     settings.SENTRY_OPTIONS.update(
         {
-            "redis.clusters": {"default": {"hosts": {0: {"db": TEST_REDIS_DB}}}},
+            "redis.clusters": {"default": {"hosts": {0: {"db": xdist.get_redis_db()}}}},
             "mail.backend": "django.core.mail.backends.locmem.EmailBackend",
             "system.url-prefix": "http://testserver",
             "system.base-hostname": "testserver",
```

src/sentry/testutils/pytest/template/config.yml

Lines changed: 4 additions & 4 deletions
```diff
@@ -14,10 +14,10 @@ processing:
   kafka_config:
     - {name: 'bootstrap.servers', value: '${KAFKA_HOST}:9093'}
   topics:
-    events: ingest-events
-    attachments: ingest-events
-    transactions: ingest-events
-    outcomes: outcomes
+    events: ${KAFKA_TOPIC_EVENTS}
+    attachments: ${KAFKA_TOPIC_EVENTS}
+    transactions: ${KAFKA_TOPIC_EVENTS}
+    outcomes: ${KAFKA_TOPIC_OUTCOMES}
   redis: redis://${REDIS_HOST}:6379/${REDIS_DB}
 aggregator:
   bucket_interval: 1 # Use shortest possible interval to speed up tests
```
src/sentry/testutils/pytest/xdist.py

Lines changed: 27 additions & 0 deletions

```python
from __future__ import annotations

import os

_TEST_REDIS_DB = 9
_SNUBA_BASE_PORT = 1230

_worker_id: str | None = os.environ.get("PYTEST_XDIST_WORKER")
_worker_num: int | None = int(_worker_id.replace("gw", "")) if _worker_id else None


def get_redis_db() -> int:
    if _worker_num is not None:
        return _TEST_REDIS_DB + _worker_num
    return _TEST_REDIS_DB


def get_kafka_topic(base_name: str) -> str:
    if _worker_id:
        return f"{base_name}-{_worker_id}"
    return base_name


def get_snuba_url() -> str | None:
    if _worker_num is not None and os.environ.get("XDIST_PER_WORKER_SNUBA"):
        return f"http://127.0.0.1:{_SNUBA_BASE_PORT + _worker_num}"
    return None
```
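A worked example of the helpers above for worker `gw2` (values follow directly from the module's constants, reproduced locally rather than imported):

```python
# Reproduces the module's logic for PYTEST_XDIST_WORKER=gw2
# with XDIST_PER_WORKER_SNUBA=1 set.
_TEST_REDIS_DB = 9
_SNUBA_BASE_PORT = 1230
worker_id = "gw2"
worker_num = int(worker_id.replace("gw", ""))

redis_db = _TEST_REDIS_DB + worker_num                            # get_redis_db() -> 11
topic = f"ingest-events-{worker_id}"                              # get_kafka_topic(...) -> "ingest-events-gw2"
snuba_url = f"http://127.0.0.1:{_SNUBA_BASE_PORT + worker_num}"   # get_snuba_url() -> port 1232
```

Without the xdist env vars, the helpers return the classic shared values (DB 9, unsuffixed topics, `None` for Snuba), which is what makes the whole change a no-op outside xdist runs.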

src/sentry/testutils/skips.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,8 @@
 from __future__ import annotations
 
+import os
 import socket
+from urllib.parse import urlparse
 
 import pytest
 
@@ -22,7 +24,8 @@ def _requires_service_message(name: str) -> str:
 @pytest.fixture(scope="session")
 def _requires_snuba() -> None:
     # TODO: ability to ask devservices what port a service is on
-    if not _service_available("127.0.0.1", 1218):
+    port = urlparse(os.environ.get("SNUBA", "")).port or 1218
+    if not _service_available("127.0.0.1", port):
         pytest.fail(_requires_service_message("snuba"))
 
 
```
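The resulting port selection can be sketched as a standalone helper (the `snuba_port` function is illustrative; the real code inlines the expression):

```python
from urllib.parse import urlparse

def snuba_port(env: dict[str, str]) -> int:
    # Per-worker CI exports e.g. SNUBA=http://127.0.0.1:1232; without the
    # env var, urlparse("") has no port and we fall back to the classic
    # devservices port 1218.
    return urlparse(env.get("SNUBA", "")).port or 1218
```

This keeps the fixture a no-op outside xdist runs, matching the commit's "no-ops without xdist env vars" guarantee.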
