feat: stateless round-robin router for Claude Fleet warm pool#303
feat: stateless round-robin router for Claude Fleet warm pool#303kulvirgit wants to merge 1 commit intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kulvirgit The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
Welcome @kulvirgit! |
|
Hi @kulvirgit. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
18fc772 to
ac38b02
Compare
govindpawa
left a comment
There was a problem hiding this comment.
Code Review
Overall
Good conversion to stateless round-robin routing. The backward compatibility with X-Sandbox-ID is clean. A few issues to address before merge.
Required Changes
1. K8s API call on every request — add a TTL cache
sandbox_router.py:53-72 — get_warm_pod_ips() hits the K8s API on every single request (wrapped in asyncio.to_thread). Under load this will hammer the API server and add latency.
Fix: Add a short TTL cache (2-5 seconds):
import time
_pod_cache: list[str] = []
_pod_cache_time: float = 0
POD_CACHE_TTL = 3.0 # seconds
def get_warm_pod_ips() -> list[str]:
global _pod_cache, _pod_cache_time
if time.monotonic() - _pod_cache_time < POD_CACHE_TTL:
ips = _pod_cache.copy()
random.shuffle(ips)
return ips
# ... existing K8s query ...
_pod_cache = pod_ips
_pod_cache_time = time.monotonic()
random.shuffle(pod_ips)
return pod_ips2. Last-pod retry logic is fragile
sandbox_router.py:136,155 — The check target_host != target_hosts[-1] has two problems:
- With only 1 pod, it never retries (last pod == first pod)
- If the last pod returns 503, it returns that 503 directly to the client instead of the friendlier "All pods are busy" message at the bottom
Fix: Use an index-based loop:
for idx, target_host in enumerate(target_hosts):
is_last = (idx == len(target_hosts) - 1)
# ... use is_last instead of target_host != target_hosts[-1]3. Resource leak on exception
sandbox_router.py:157-160 — if 'resp' in locals() is brittle. If resp was set in a previous loop iteration, this could close the wrong response. Track the response explicitly:
current_resp = None
try:
current_resp = await http_client.send(req, stream=True)
...
except Exception:
if current_resp:
await current_resp.aclose()Recommendations
4. requirements.txt Python version change
The file header changed from pip-compile with Python 3.13 to 3.10. Is this intentional? Could cause compatibility issues.
5. Consider adding a /ready endpoint
The router has /healthz but no readiness probe. If the K8s client fails to init, the router still accepts traffic and will 500 on every request.
6. random.shuffle load distribution
Random distribution can cause hot-spotting under load. Consider round-robin with an atomic counter for more even distribution.
Summary
| Category | Items |
|---|---|
| Required fixes | 3 |
| Recommendations | 3 |
Verdict: Approve after required changes are addressed.
|
@kulvirgit Please sign the CLA and add a more descriptive PR description |
No description provided.