Open
Description
Summary
Evaluation runs are intermittently failing with HTTP 503 Service Unavailable: no available server when polling conversation runtimes. The runtime pods (agent-server) in the eval-runtime cluster frequently fail their startup probe (/server_info on port 60000) with connection refused, leading to pods being force-stopped and clients seeing 503s.
Evidence
- Eval job logs (evaluation-jobs/eval-eval-21233988879-deepseek-v-w57bc) show repeated 503s during conversation polling for runtime IDs such as sqjmuliagwzeacmk:
  - httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://sqjmuliagwzeacmk.eval-runtime.all-hands.dev/api/conversations/...'
  - Warnings: Error polling status (will retry): HTTP 503 Service Unavailable.
- Runtime cluster events (gke_evaluation-092424_us-central1_eval-runtime, namespace runtime-pods) show many startup probe failures:
  - Examples: runtime-jalsgbfyluprngys, runtime-gjaikorgkbzxfehz, runtime-cucnnwjkxeohxwqs, runtime-himglxfvjzitsqhq, runtime-olxcbueiuaukcidi, runtime-rzdvidyrmlgaimdn, runtime-psivfraokikdgbty, etc.
  - Event message: Startup probe failed: Get "http://<pod-ip>:60000/server_info": dial tcp <pod-ip>:60000: connect: connection refused.
- Runtime-api logs in the core cluster confirm that runtimes returning 503 were force-stopped:
  - Force-stopping runtime sqjmuliagwzeacmk → Runtime stopped successfully (total runtime ~366s).
- The runtime-api deployment is healthy (HPA range 3–6 replicas, CPU ~4% against a 75% target), so the issue is per-runtime pod readiness, not the runtime-api service.
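To gauge how widespread the startup probe failures are, the events can be tallied per pod. The following is a minimal Python sketch, assuming the events were exported with `kubectl get events -n runtime-pods -o json` and parsed into a dict. The function name and the sample data are hypothetical and not taken from the cluster.

```python
from collections import Counter


def count_startup_probe_failures(events: dict) -> Counter:
    """Count 'Startup probe failed' events per pod in a parsed event list."""
    counts: Counter = Counter()
    for item in events.get("items", []):
        message = item.get("message", "")
        if not message.startswith("Startup probe failed"):
            continue
        pod = item.get("involvedObject", {}).get("name", "<unknown>")
        # Kubernetes may aggregate repeated events; `count` holds the repeat
        # total and can be missing or null, in which case we count one.
        counts[pod] += item.get("count") or 1
    return counts


# Illustrative sample only (hypothetical counts, not real cluster output).
sample_events = {
    "items": [
        {
            "reason": "Unhealthy",
            "message": 'Startup probe failed: Get "http://<pod-ip>:60000/server_info": '
                       "dial tcp <pod-ip>:60000: connect: connection refused",
            "involvedObject": {"kind": "Pod", "name": "runtime-jalsgbfyluprngys"},
            "count": 3,
        },
        {
            "reason": "Pulled",
            "message": "Successfully pulled image",
            "involvedObject": {"kind": "Pod", "name": "runtime-jalsgbfyluprngys"},
            "count": 1,
        },
        {
            "reason": "Unhealthy",
            "message": "Startup probe failed: connection refused",
            "involvedObject": {"kind": "Pod", "name": "runtime-gjaikorgkbzxfehz"},
        },
    ]
}

failures = count_startup_probe_failures(sample_events)
for pod_name, total in failures.most_common():
    print(f"{pod_name}: {total}")
```

Running the same tally before and after a probe change gives a rough per-pod comparison.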
Root cause hypothesis
Cold/slow startup of agent-server containers causes the startup probe on port 60000 to fail (connection refused) before the service is ready. Pods are then force-stopped, and the client sees 503 “no available server” when polling that runtime.
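On the client side, a 503 during startup is transient until the runtime is force-stopped, so polling could be bounded by a deadline rather than retried indefinitely. The following is a minimal sketch of that idea, not the eval client's actual retry logic. `poll_until_ready`, its parameters, and the injected `fetch_status` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def poll_until_ready(
    fetch_status: Callable[[], int],
    deadline_s: float = 600.0,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until HTTP 200, treating 503 as transient.

    Returns True once a 200 is seen, or False if the next retry would
    exceed the deadline. Any other status raises immediately.
    """
    start = clock()
    delay = base_delay_s
    while True:
        status = fetch_status()
        if status == 200:
            return True
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        # Give up if waiting again would take us past the deadline.
        if clock() - start + delay > deadline_s:
            return False
        sleep(delay)
        # Exponential backoff, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

Returning False at the deadline lets the caller mark the instance as failed or request a fresh runtime instead of stalling.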
Impact
Intermittent 503s during evaluation runs, causing retries and stalled/failed instances.
Proposed fix
- Relax startupProbe for runtime pods (agent-server): increase initialDelaySeconds, failureThreshold, and/or periodSeconds (and optionally timeoutSeconds) to allow slower boot.
- Consider pre-pulling large images on runtime nodes or keeping a small warm pool to reduce cold-start latency.
- Monitor Startup probe failed events in runtime-pods to confirm improvement.
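When choosing new probe values, it helps to estimate how long a container has to start before the kubelet gives up, which is roughly initialDelaySeconds + failureThreshold × periodSeconds. The sketch below computes that budget in Python. The class is hypothetical, its defaults mirror the Kubernetes probe defaults, and the "relaxed" values are illustrative assumptions, not the current or recommended settings for agent-server.

```python
from dataclasses import dataclass


@dataclass
class StartupProbe:
    # Defaults mirror the Kubernetes probe defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3
    timeout_seconds: int = 1

    def startup_budget_seconds(self) -> int:
        """Approximate time the container has to start answering the probe."""
        return self.initial_delay_seconds + self.failure_threshold * self.period_seconds

    def allows(self, observed_startup_seconds: float) -> bool:
        """True if a startup of the given duration fits within the budget."""
        return observed_startup_seconds <= self.startup_budget_seconds()


default_probe = StartupProbe()
# Hypothetical relaxed settings, chosen only to illustrate the arithmetic.
relaxed_probe = StartupProbe(
    initial_delay_seconds=10, period_seconds=10, failure_threshold=30, timeout_seconds=5
)

print(default_probe.startup_budget_seconds())  # 0 + 3 * 10 = 30
print(relaxed_probe.startup_budget_seconds())  # 10 + 30 * 10 = 310
```

Raising failureThreshold extends the budget without delaying the first probe, so fast-starting pods still become ready promptly.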