ref(ci): parallel devservices startup and more robust bootstrap-snuba #112381
Merged
13 commits
6f2aa2c ref(ci): overlap devservices startup with venv setup in backend tests (joshuarli)
e8287ab ref(ci): overlap devservices with webpack build in acceptance tests (joshuarli)
aa24e03 ref(ci): apply early devservices to all remaining setup-sentry callers (joshuarli)
74ab454 fix (joshuarli)
2dd7a01 fix(ci): address review feedback for early devservices (joshuarli)
f9069db fix(ci): check devservices exit code before snuba Phase 2 (joshuarli)
01d6123 refinements (joshuarli)
c8766fe test change for backend (joshuarli)
4325402 Revert "test change for backend" (joshuarli)
25e1ae1 port to more resilient bootstrap-snuba.py (joshuarli)
6d20efb Reapply "test change for backend" (joshuarli)
569c9a7 Revert "Reapply "test change for backend"" (joshuarli)
f1b04e1 fix (joshuarli)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| name: 'Early Devservices' | ||
| description: 'Starts devservices in the background so image pulls overlap with venv setup' | ||
| inputs: | ||
|   mode: | ||
|     description: 'devservices mode (must match the mode passed to setup-sentry)' | ||
|     required: true | ||
|   timeout-minutes: | ||
|     description: 'Maximum minutes for devservices up' | ||
|     required: false | ||
|     default: '10' | ||
| | ||
| runs: | ||
|   using: 'composite' | ||
|   steps: | ||
|     - uses: astral-sh/setup-uv@884ad927a57e558e7a70b92f2bccf9198a4be546 # v6 | ||
|       with: | ||
|         version: '0.9.28' | ||
|         enable-cache: false | ||
| | ||
|     - name: Start devservices in background | ||
|       shell: bash --noprofile --norc -euo pipefail {0} | ||
|       run: | | ||
|         DS_VERSION=$(python3 -c " | ||
|         import tomllib | ||
|         with open('uv.lock', 'rb') as f: | ||
|             lock = tomllib.load(f) | ||
|         for pkg in lock['package']: | ||
|             if pkg['name'] == 'devservices': | ||
|                 print(pkg['version']) | ||
|                 break | ||
|         ") | ||
|         echo "Installing devservices==${DS_VERSION}" | ||
|         uv venv /tmp/ds-venv --python python3 -q | ||
|         uv pip install --python /tmp/ds-venv/bin/python -q \ | ||
|           --index-url https://pypi.devinfra.sentry.io/simple \ | ||
|           "devservices==${DS_VERSION}" | ||
|         (set +e; timeout ${{ inputs.timeout-minutes }}m /tmp/ds-venv/bin/devservices up --mode ${{ inputs.mode }}; echo $? > /tmp/ds-exit) \ | ||
|           > /tmp/ds.log 2>&1 & | ||
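
The last `run` line above is the core of the action: a subshell runs `devservices up` under `timeout`, writes the exit status to `/tmp/ds-exit`, and is sent to the background so later steps can proceed. As a rough illustration only, the same handshake might look like this in Python. The function name `start_in_background`, its parameters, and the write-then-rename step are assumptions made for this sketch and are not part of the PR.

```python
import os
import subprocess
import threading
from pathlib import Path


def start_in_background(
    cmd: list[str], exit_file: Path, log_file: Path, timeout: float
) -> threading.Thread:
    """Run cmd without blocking the caller; record its exit code in exit_file."""

    def runner() -> None:
        with log_file.open("w") as log:
            try:
                rc = subprocess.run(
                    cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout
                ).returncode
            except subprocess.TimeoutExpired:
                rc = 124  # exit status GNU timeout(1) reports when the limit is hit
        # Assumption for this sketch: write to a temp file and rename it, so a
        # poller never observes a half-written exit file.
        tmp = exit_file.with_name(exit_file.name + ".tmp")
        tmp.write_text(str(rc))
        os.replace(tmp, exit_file)

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return thread
```

The exit file doubles as a completion signal: its existence means the command has finished, and its contents say whether it succeeded.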
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,250 @@ | ||
| #!/usr/bin/env python3 | ||
| """Bootstrap per-worker Snuba instances for CI. | ||
| | ||
| Overlaps the expensive ClickHouse table setup with the devservices | ||
| health-check wait. | ||
| | ||
| Phase 1 (early): As soon as ClickHouse is accepting queries, create | ||
| per-worker databases and run ``snuba bootstrap --force``. | ||
| Phase 2 (after devservices): Stop snuba-snuba-1 and start per-worker | ||
| API containers. We must wait for devservices to finish first: | ||
| stopping the container while devservices is health-checking it would | ||
| cause a timeout. | ||
| | ||
| Requires: XDIST_WORKERS env var | ||
| Reads: /tmp/ds-exit (written by the background process that the setup-devservices action starts) | ||
| Writes: /tmp/snuba-bootstrap-exit | ||
| """ | ||
| | ||
| from __future__ import annotations | ||
| | ||
| import os | ||
| import subprocess | ||
| import sys | ||
| import time | ||
| from concurrent.futures import ThreadPoolExecutor, as_completed | ||
| from functools import partial | ||
| from pathlib import Path | ||
| from typing import Any, Callable | ||
| from urllib.error import URLError | ||
| from urllib.request import urlopen | ||
| | ||
| DS_EXIT = Path("/tmp/ds-exit") | ||
| SNUBA_EXIT = Path("/tmp/snuba-bootstrap-exit") | ||
| | ||
| SNUBA_ENV = { | ||
|     "CLICKHOUSE_HOST": "clickhouse", | ||
|     "CLICKHOUSE_PORT": "9000", | ||
|     "CLICKHOUSE_HTTP_PORT": "8123", | ||
|     "DEFAULT_BROKERS": "kafka:9093", | ||
|     "REDIS_HOST": "redis", | ||
|     "REDIS_PORT": "6379", | ||
|     "REDIS_DB": "1", | ||
|     "SNUBA_SETTINGS": "docker", | ||
| } | ||
| | ||
| ENV_ARGS = [flag for k, v in SNUBA_ENV.items() for flag in ("-e", f"{k}={v}")] | ||
| | ||
| | ||
| def retry( | ||
|     fn: Callable[[], Any], *, attempts: int = 3, delay: int = 5, label: str = "operation" | ||
| ) -> Any: | ||
|     for i in range(attempts): | ||
|         try: | ||
|             return fn() | ||
|         except Exception: | ||
|             if i == attempts - 1: | ||
|                 raise | ||
|             log(f"{label} failed (attempt {i + 1}/{attempts}), retrying in {delay}s...") | ||
|             time.sleep(delay) | ||
| | ||
| | ||
| def log(msg: str) -> None: | ||
|     print(msg, flush=True) | ||
| | ||
| | ||
| def fail(msg: str) -> None: | ||
|     log(f"::error::{msg}") | ||
|     SNUBA_EXIT.write_text("1") | ||
|     sys.exit(1) | ||
| | ||
| | ||
| def http_ok(url: str) -> bool: | ||
|     try: | ||
|         with urlopen(url, timeout=3): | ||
|             return True | ||
|     except (URLError, OSError): | ||
|         return False | ||
| | ||
| | ||
| def docker( | ||
|     *args: str, check: bool = False, timeout: int | None = None | ||
| ) -> subprocess.CompletedProcess[str]: | ||
|     return subprocess.run( | ||
|         ["docker", *args], capture_output=True, text=True, check=check, timeout=timeout | ||
|     ) | ||
| | ||
| | ||
| def docker_inspect(container: str, fmt: str) -> str: | ||
|     r = docker("inspect", container, "--format", fmt) | ||
|     return r.stdout.strip() if r.returncode == 0 else "" | ||
| | ||
| | ||
| def inspect_snuba_container() -> tuple[str, str]: | ||
|     image = docker_inspect("snuba-snuba-1", "{{.Config.Image}}") | ||
|     network = docker_inspect( | ||
|         "snuba-snuba-1", | ||
|         "{{range $k, $v := .NetworkSettings.Networks}}{{$k}}{{end}}", | ||
|     ) | ||
|     if not image or not network: | ||
|         fail("Could not inspect snuba-snuba-1 container") | ||
|     return image, network | ||
| | ||
| | ||
| def run_parallel(fn: Callable[[int], Any], workers: range, *, fail_fast: bool = True) -> int: | ||
|     """Run fn(i) in parallel for each i in workers. Returns 0 on full success.""" | ||
|     rc = 0 | ||
|     with ThreadPoolExecutor(max_workers=len(workers)) as pool: | ||
|         futs = {pool.submit(fn, i): i for i in workers} | ||
|         for fut in as_completed(futs): | ||
|             try: | ||
|                 fut.result() | ||
|             except Exception as e: | ||
|                 if fail_fast: | ||
|                     fail(str(e)) | ||
|                 log(f"ERROR: {e}") | ||
|                 rc = 1 | ||
|     return rc | ||
| | ||
| | ||
| def wait_for_prerequisites(timeout: int = 300) -> None: | ||
|     log("Waiting for ClickHouse and Snuba container...") | ||
|     start = time.monotonic() | ||
|     while True: | ||
|         if time.monotonic() - start > timeout: | ||
|             fail("Timed out waiting for Snuba bootstrap prerequisites") | ||
|         if http_ok("http://localhost:8123/") and docker_inspect("snuba-snuba-1", "{{.Id}}"): | ||
|             break | ||
|         time.sleep(2) | ||
|     log(f"Prerequisites ready ({time.monotonic() - start:.0f}s)") | ||
| | ||
| | ||
| def wait_for_devservices(timeout: int = 300) -> None: | ||
|     start = time.monotonic() | ||
|     while not DS_EXIT.exists(): | ||
|         if time.monotonic() - start > timeout: | ||
|             fail("Timed out waiting for devservices to finish") | ||
|         time.sleep(1) | ||
|     rc = int(DS_EXIT.read_text().strip()) | ||
|     if rc != 0: | ||
|         fail(f"devservices failed (exit {rc}), skipping Phase 2") | ||
| | ||
| | ||
| def bootstrap_worker(worker_id: int, *, image: str, network: str) -> None: | ||
|     """Create a ClickHouse database and run snuba bootstrap.""" | ||
|     db = f"default_gw{worker_id}" | ||
| | ||
|     def create_db() -> None: | ||
|         with urlopen( | ||
|             "http://localhost:8123/", f"CREATE DATABASE IF NOT EXISTS {db}".encode(), timeout=30 | ||
|         ): | ||
|             pass | ||
| | ||
|     retry(create_db, label=f"CREATE DATABASE {db}") | ||
| | ||
|     def run_bootstrap() -> None: | ||
|         r = docker( | ||
|             "run", | ||
|             "--rm", | ||
|             "--network", | ||
|             network, | ||
|             "-e", | ||
|             f"CLICKHOUSE_DATABASE={db}", | ||
|             *ENV_ARGS, | ||
|             image, | ||
|             "bootstrap", | ||
|             "--force", | ||
|         ) | ||
|         for line in (r.stdout + r.stderr).strip().splitlines()[-3:]: | ||
|             log(line) | ||
|         if r.returncode != 0: | ||
|             raise RuntimeError(f"snuba bootstrap failed for worker {worker_id}") | ||
| | ||
|     retry(run_bootstrap, label=f"snuba bootstrap gw{worker_id}") | ||
| | ||
| | ||
| def start_worker_container(worker_id: int, *, image: str, network: str) -> None: | ||
|     """Start a per-worker Snuba API container and wait for health.""" | ||
|     db = f"default_gw{worker_id}" | ||
|     port = 1230 + worker_id | ||
|     name = f"snuba-gw{worker_id}" | ||
| | ||
|     docker("rm", "-f", name) | ||
| | ||
|     r = docker( | ||
|         "run", | ||
|         "-d", | ||
|         "--name", | ||
|         name, | ||
|         "--network", | ||
|         network, | ||
|         "-p", | ||
|         f"{port}:1218", | ||
|         "-e", | ||
|         f"CLICKHOUSE_DATABASE={db}", | ||
|         *ENV_ARGS, | ||
|         "-e", | ||
|         "DEBUG=1", | ||
|         image, | ||
|         "api", | ||
|     ) | ||
|     if r.returncode != 0: | ||
|         raise RuntimeError(f"docker run {name} failed: {r.stderr.strip()}") | ||
| | ||
|     for attempt in range(1, 31): | ||
|         if http_ok(f"http://127.0.0.1:{port}/health"): | ||
|             log(f"{name} healthy on port {port}") | ||
|             return | ||
|         if attempt == 30: | ||
|             r = docker("logs", name) | ||
|             for line in (r.stdout + r.stderr).strip().splitlines()[-20:]: | ||
|                 log(line) | ||
|             raise RuntimeError(f"{name} failed health check after 30 attempts") | ||
|         time.sleep(2) | ||
| | ||
| | ||
| def main() -> None: | ||
|     workers_str = os.environ.get("XDIST_WORKERS") | ||
|     if not workers_str: | ||
|         fail("XDIST_WORKERS must be set") | ||
|     workers = range(int(workers_str)) | ||
|     start = time.monotonic() | ||
| | ||
|     wait_for_prerequisites() | ||
|     image, network = inspect_snuba_container() | ||
| | ||
|     log("Phase 1: bootstrapping ClickHouse databases") | ||
|     run_parallel(partial(bootstrap_worker, image=image, network=network), workers) | ||
|     log(f"Phase 1 done ({time.monotonic() - start:.0f}s)") | ||
| | ||
|     wait_for_devservices() | ||
|     try: | ||
|         docker("stop", "snuba-snuba-1", timeout=30) | ||
|     except subprocess.TimeoutExpired: | ||
|         log("WARNING: docker stop snuba-snuba-1 timed out, killing") | ||
|         docker("kill", "snuba-snuba-1") | ||
| | ||
|     log("Phase 2: starting per-worker Snuba API containers") | ||
|     rc = run_parallel( | ||
|         partial(start_worker_container, image=image, network=network), | ||
|         workers, | ||
|         fail_fast=False, | ||
|     ) | ||
| | ||
|     log(f"Snuba bootstrap complete ({time.monotonic() - start:.0f}s total)") | ||
|     SNUBA_EXIT.write_text(str(rc)) | ||
|     sys.exit(rc) | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
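
Each xdist worker gets its own ClickHouse database, container name, and host port, all derived from the worker id. The sketch below builds the `docker run` argument list for one worker without executing anything, which makes the mapping easy to inspect. The helper names `env_flags` and `worker_run_args`, the image and network values, and the two-entry environment subset are hypothetical. The naming scheme (`default_gw{N}`, `snuba-gw{N}`, port `1230 + N` mapped to `1218`) is taken from the script above.

```python
def env_flags(env: dict[str, str]) -> list[str]:
    """Flatten {"K": "V"} into ["-e", "K=V", ...], like ENV_ARGS in the script."""
    return [flag for k, v in env.items() for flag in ("-e", f"{k}={v}")]


def worker_run_args(
    worker_id: int, *, image: str, network: str, env: dict[str, str]
) -> list[str]:
    """Build the argv that would start the API container for one worker."""
    db = f"default_gw{worker_id}"
    port = 1230 + worker_id
    name = f"snuba-gw{worker_id}"
    return [
        "docker", "run", "-d",
        "--name", name,
        "--network", network,
        "-p", f"{port}:1218",
        "-e", f"CLICKHOUSE_DATABASE={db}",
        *env_flags(env),
        "-e", "DEBUG=1",
        image,
        "api",
    ]


# Hypothetical image and network names; only a subset of SNUBA_ENV is shown.
args = worker_run_args(
    2,
    image="example/snuba:latest",
    network="example-network",
    env={"CLICKHOUSE_HOST": "clickhouse", "REDIS_HOST": "redis"},
)
print(" ".join(args))
```

With four workers, this scheme yields containers `snuba-gw0` through `snuba-gw3` listening on host ports 1230 through 1233.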
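
`run_parallel` is used in two modes: Phase 1 aborts on the first failure (`fail_fast=True`), while Phase 2 lets every worker finish and reports a non-zero code at the end. The script builds a future-to-worker mapping but only logs the exception text. A possible variant, shown purely as a sketch, uses that mapping to report which workers failed. The function name `collect_failures` and the sample task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable


def collect_failures(fn: Callable[[int], Any], workers: range) -> dict[int, str]:
    """Run fn(i) for every worker in parallel; return {worker_id: error message}."""
    failures: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futs = {pool.submit(fn, i): i for i in workers}
        for fut in as_completed(futs):
            try:
                fut.result()
            except Exception as e:
                # futs[fut] recovers the worker id that produced this future.
                failures[futs[fut]] = str(e)
    return failures


def sample_task(worker_id: int) -> None:
    # Hypothetical task: only worker 1 fails.
    if worker_id == 1:
        raise RuntimeError(f"worker {worker_id} failed")


failed = collect_failures(sample_task, range(3))
exit_code = 1 if failed else 0
print(failed, exit_code)
```

Like the original, this sketch passes `len(workers)` as `max_workers`, so it expects at least one worker.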
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| #!/bin/bash | ||
| set -euo pipefail | ||
| | ||
| # Wait for the background devservices process started by the setup-devservices action. | ||
| # Usage: wait.sh [timeout_seconds] | ||
| TIMEOUT=${1:-600} | ||
| | ||
| SECONDS=0 | ||
| while [ ! -f /tmp/ds-exit ]; do | ||
|   if [ $SECONDS -gt "$TIMEOUT" ]; then | ||
|     echo "::error::Timed out waiting for devservices after ${TIMEOUT}s" | ||
|     cat /tmp/ds.log | ||
|     exit 1 | ||
|   fi | ||
|   sleep 2 | ||
| done | ||
| | ||
| DS_RC=$(< /tmp/ds-exit) | ||
| if [ "$DS_RC" -ne 0 ]; then | ||
|   echo "::error::devservices up failed (exit $DS_RC)" | ||
|   cat /tmp/ds.log | ||
|   exit 1 | ||
| fi | ||
| | ||
| echo "DJANGO_LIVE_TEST_SERVER_ADDRESS=$(docker network inspect bridge --format='{{(index .IPAM.Config 0).Gateway}}')" >> "$GITHUB_ENV" | ||
| docker ps -a | ||
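
`wait.sh` is the consumer side of the exit-file handshake: it polls until `/tmp/ds-exit` appears, gives up after a timeout, and then checks the recorded status. A minimal Python sketch of the same loop follows. The function name `wait_for_exit_code` and its parameters are assumptions for illustration, and the empty-file check is an extra guard that the shell script does not have.

```python
import time
from pathlib import Path


def wait_for_exit_code(exit_file: Path, timeout: float, poll: float = 2.0) -> int:
    """Block until exit_file holds an exit code, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        if exit_file.exists():
            text = exit_file.read_text().strip()
            if text:  # extra guard: skip a file that exists but is still empty
                return int(text)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no exit code in {exit_file} after {timeout}s")
        time.sleep(poll)
```

A caller would treat a non-zero return value the way `wait.sh` does: print the captured log and fail the step.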
nit: This action + `wait.sh` could live inside `setup-sentry` instead of being a separate action. `setup-sentry` already has `skip-devservices` and `devservices-timeout-minutes` inputs. We can add a `parallel-devservices` mode to let it start the background process early (before venv setup) and wait at the end (where it currently runs `devservices up` synchronously). I notice we're repeating the same 3-step pattern across all the 10 or so jobs, but just `parallel-devservices: 'true'` on the existing `setup-sentry` step would be a lot easier.
I would like to cut down on the boilerplate too, but we'd lose most of the benefit for acceptance, because webpacking requires `setup-sentry` and we want to be pulling images while webpacking.
sure, let's just tackle this later.