Runtime pods: startup probe failures cause 503s to evaluation runs #352

### Summary
Evaluation runs are intermittently failing with `HTTP 503 Service Unavailable: no available server` when polling conversation runtimes. The runtime pods (agent-server) in the eval-runtime cluster frequently fail their startup probe (`/server_info` on port 60000) with connection refused, leading to pods being force-stopped and clients seeing 503s.

### Evidence
- Eval job logs (`evaluation-jobs/eval-eval-21233988879-deepseek-v-w57bc`) show repeated 503s during conversation polling for runtime IDs such as `sqjmuliagwzeacmk`:
  - `httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://sqjmuliagwzeacmk.eval-runtime.all-hands.dev/api/conversations/...'`
  - Warnings: `Error polling status (will retry): HTTP 503 Service Unavailable`.
- Runtime cluster events (`gke_evaluation-092424_us-central1_eval-runtime`, namespace `runtime-pods`) show many startup probe failures:
  - Examples: `runtime-jalsgbfyluprngys`, `runtime-gjaikorgkbzxfehz`, `runtime-cucnnwjkxeohxwqs`, `runtime-himglxfvjzitsqhq`, `runtime-olxcbueiuaukcidi`, `runtime-rzdvidyrmlgaimdn`, `runtime-psivfraokikdgbty`, etc.
  - Event message: `Startup probe failed: Get "http://<pod-ip>:60000/server_info": dial tcp <pod-ip>:60000: connect: connection refused`.
- Runtime-api logs in the core cluster confirm that the runtimes returning 503 were force-stopped:
  - `Force-stopping runtime sqjmuliagwzeacmk` → `Runtime stopped successfully` (total runtime ~366s).
- The runtime-api deployment itself is healthy (HPA at 3–6 replicas, CPU ~4% against a 75% target), so the issue is per-runtime pod readiness, not the runtime-api service.

### Root cause hypothesis
Cold/slow startup of agent-server containers causes the startup probe on port 60000 to fail (connection refused) before the service is ready. Pods are then force-stopped, and the client sees 503 “no available server” when polling that runtime.
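
To make the hypothesis concrete: the time a container gets before the kubelet gives up on its startup probe is roughly `initialDelaySeconds + failureThreshold * periodSeconds`. The sketch below computes that budget and compares it with a boot time. The class and function names are illustrative, the defaults are the Kubernetes probe defaults, and the 120s boot time is a hypothetical placeholder, not a measured value for agent-server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupProbe:
    """Subset of Kubernetes probe fields that determine the startup budget."""
    initial_delay_seconds: int = 0   # Kubernetes default
    period_seconds: int = 10         # Kubernetes default
    failure_threshold: int = 3       # Kubernetes default


def startup_budget_seconds(probe: StartupProbe) -> int:
    """Approximate seconds the container has to start answering the probe."""
    return probe.initial_delay_seconds + probe.failure_threshold * probe.period_seconds


def survives_boot(probe: StartupProbe, boot_seconds: float) -> bool:
    """True if a container that needs `boot_seconds` fits inside the budget."""
    return boot_seconds <= startup_budget_seconds(probe)


if __name__ == "__main__":
    hypothetical_boot = 120  # assumed cold-start time, for illustration only
    for probe in (StartupProbe(), StartupProbe(failure_threshold=30)):
        print(probe, startup_budget_seconds(probe), survives_boot(probe, hypothetical_boot))
```

With default settings the budget is only 30s, so any cold start slower than that would be killed before port 60000 starts accepting connections. The actual probe settings on the runtime pods should be checked before drawing conclusions.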

### Impact
Intermittent 503s during evaluation runs, causing retries and stalled/failed instances.

### Proposed fix
- Relax the `startupProbe` for runtime pods (agent-server): increase `initialDelaySeconds`, `failureThreshold`, and/or `periodSeconds` (and optionally `timeoutSeconds`) to allow for slower boot.
- Consider pre-pulling large images on runtime nodes or keeping a small warm pool to reduce cold-start latency.
- Monitor `Startup probe failed` events in `runtime-pods` to confirm improvement.
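
For the monitoring step, one option is to tally `Startup probe failed` events per pod from the output of `kubectl get events -n runtime-pods -o json`. This is a minimal sketch, not existing tooling: the function name is hypothetical, and it assumes the standard core/v1 Event fields (`message`, `involvedObject.name`, `count`).

```python
import json
from collections import Counter


def count_startup_probe_failures(events_json: str) -> Counter:
    """Count 'Startup probe failed' events per pod from a kubectl JSON event list."""
    events = json.loads(events_json).get("items", [])
    counts = Counter()
    for event in events:
        message = event.get("message") or ""
        if not message.startswith("Startup probe failed"):
            continue
        pod = (event.get("involvedObject") or {}).get("name", "<unknown>")
        # Repeated identical events are aggregated; `count` holds the repeats
        # and may be missing or null, in which case the event counts once.
        counts[pod] += event.get("count") or 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: kubectl get events -n runtime-pods -o json | python this_script.py
    for pod, n in count_startup_probe_failures(sys.stdin.read()).most_common():
        print(f"{pod}\t{n}")
```

Comparing these counts before and after a probe change gives a rough signal of whether the relaxed settings help. Kubernetes events expire after a retention window, so the counts only cover recent activity.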
