Runtime pods: startup probe failures cause 503s to evaluation runs #352

@simonrosenberg

Summary

Evaluation runs are intermittently failing with HTTP 503 Service Unavailable ("no available server") when polling conversation runtimes. The runtime pods (agent-server) in the eval-runtime cluster frequently fail their startup probe (/server_info on port 60000) with connection refused. The affected pods are then force-stopped, and clients polling them see 503s.

Evidence

  • Eval job logs (evaluation-jobs/eval-eval-21233988879-deepseek-v-w57bc) show repeated 503s during conversation polling for runtime IDs such as sqjmuliagwzeacmk:
    • httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://sqjmuliagwzeacmk.eval-runtime.all-hands.dev/api/conversations/...'
    • Warnings: Error polling status (will retry): HTTP 503 Service Unavailable.
  • Runtime cluster events (gke_evaluation-092424_us-central1_eval-runtime, namespace runtime-pods) show many startup probe failures:
    • Examples: runtime-jalsgbfyluprngys, runtime-gjaikorgkbzxfehz, runtime-cucnnwjkxeohxwqs, runtime-himglxfvjzitsqhq, runtime-olxcbueiuaukcidi, runtime-rzdvidyrmlgaimdn, runtime-psivfraokikdgbty, etc.
    • Event message: Startup probe failed: Get "http://<pod-ip>:60000/server_info": dial tcp <pod-ip>:60000: connect: connection refused.
  • Runtime-api logs in the core cluster confirm 503’d runtimes were force-stopped:
    • Force-stopping runtime sqjmuliagwzeacmk
    • Runtime stopped successfully (total runtime ~366s).
  • The runtime-api deployment is healthy (HPA at 3–6 replicas, CPU ~4% against a 75% target), so the issue is the startup of individual runtime pods, not the runtime-api service.
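
To quantify the probe failures listed above, a small script can tally them per pod. This is a minimal sketch: the line layout (loosely modeled on `kubectl get events` output, with the object shown as `pod/<name>`) and the function name `count_startup_probe_failures` are assumptions, not something taken from the cluster tooling.

```python
import re
from collections import Counter

# Assumed line shape (hypothetical, loosely modeled on `kubectl get events`):
#   "<age> Warning Unhealthy pod/<pod-name> Startup probe failed: ..."
POD_RE = re.compile(r"pod/(runtime-[a-z]+)")


def count_startup_probe_failures(event_lines):
    """Return a Counter mapping pod name -> number of startup probe failure events."""
    counts = Counter()
    for line in event_lines:
        # Ignore events that are not startup probe failures.
        if "Startup probe failed" not in line:
            continue
        match = POD_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Pod names come from the issue; <pod-ip> is left as a placeholder.
    message = (
        'Startup probe failed: Get "http://<pod-ip>:60000/server_info": '
        "dial tcp <pod-ip>:60000: connect: connection refused"
    )
    sample = [
        f"2m Warning Unhealthy pod/runtime-jalsgbfyluprngys {message}",
        f"1m Warning Unhealthy pod/runtime-jalsgbfyluprngys {message}",
        f"1m Warning Unhealthy pod/runtime-gjaikorgkbzxfehz {message}",
    ]
    for pod, n in count_startup_probe_failures(sample).most_common():
        print(pod, n)
```

Running the same tally before and after a probe change gives a simple per-pod comparison.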

Root cause hypothesis

Cold/slow startup of agent-server containers causes the startup probe on port 60000 to fail (connection refused) before the service is ready. Pods are then force-stopped, and the client sees 503 “no available server” when polling that runtime.
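
The hypothesis can be illustrated with a simplified model of the probe timeline. This is a sketch under stated assumptions: the defaults are the Kubernetes probe defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3), not the values configured for the runtime pods, which the issue does not state. The 45-second boot time is hypothetical, and the model ignores probe timeouts and kubelet jitter.

```python
def startup_probe_outcome(boot_seconds, initial_delay=0, period=10, failure_threshold=3):
    """Simulate a startup probe against a server that listens after boot_seconds.

    Returns ("started", t) if the probe at time t succeeds, or ("killed", t) if
    failure_threshold consecutive probes fail first. A refused connection is
    treated as an immediate failure, so the probe timeout is ignored.
    """
    failures = 0
    t = initial_delay  # time of the first probe
    while True:
        if t >= boot_seconds:
            return ("started", t)
        failures += 1
        if failures >= failure_threshold:
            return ("killed", t)
        t += period


if __name__ == "__main__":
    # Hypothetical 45 s boot: killed on the third failed probe with the defaults.
    print(startup_probe_outcome(45))                        # ('killed', 20)
    # Same boot time with a larger failure threshold: the pod survives.
    print(startup_probe_outcome(45, failure_threshold=30))  # ('started', 50)
```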

Impact

Intermittent 503s during evaluation runs, causing retries and stalled/failed instances.

Proposed fix

  • Relax startupProbe for runtime pods (agent-server): increase initialDelaySeconds, failureThreshold, and/or periodSeconds (and optionally timeoutSeconds) to allow slower boot.
  • Consider pre-pulling large images on runtime nodes or keeping a small warm pool to reduce cold-start latency.
  • Monitor Startup probe failed events in runtime-pods to confirm improvement.
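
When relaxing the probe, one way to pick a value is to size the startup window from a measured worst-case boot time. The helper below is a rough sketch: `min_failure_threshold`, the 1.5x safety margin, and the 120-second boot time are all hypothetical, and the window is approximated as initialDelaySeconds + failureThreshold × periodSeconds.

```python
import math


def min_failure_threshold(worst_boot_seconds, period=10, initial_delay=0, margin=1.5):
    """Smallest failureThreshold whose probe window covers worst_boot_seconds * margin.

    Approximates the window as initial_delay + failure_threshold * period.
    """
    target = worst_boot_seconds * margin
    remaining = max(0.0, target - initial_delay)
    return max(1, math.ceil(remaining / period))


if __name__ == "__main__":
    # Hypothetical worst-case boot of 120 s, probed every 10 s.
    print(min_failure_threshold(120))                    # 18
    print(min_failure_threshold(120, initial_delay=30))  # 15
```

Raising failureThreshold rather than initialDelaySeconds keeps fast-booting pods from waiting longer than they need to, since the startup probe passes as soon as the server responds.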
