feat: stateless round-robin router for Claude Fleet warm pool by kulvirgit · Pull Request #303 · kubernetes-sigs/agent-sandbox

kulvirgit · 2026-02-10T00:42:36Z

No description provided.

netlify · 2026-02-10T00:42:42Z

✅ Deploy Preview for agent-sandbox canceled.

Name	Link
🔨 Latest commit	`ac38b02`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-sandbox/deploys/698a80087d4ce6000855fb4b

k8s-ci-robot · 2026-02-10T00:42:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kulvirgit
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

clients/python/agentic-sandbox-client/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

linux-foundation-easycla · 2026-02-10T00:42:44Z

❌ - login: @kulvirgit / name: Kulvir . The commit (ac38b02) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

k8s-ci-robot · 2026-02-10T00:42:45Z

Welcome @kulvirgit!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-02-10T00:42:46Z

Hi @kulvirgit. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

govindpawa

Code Review

Overall

Good conversion to stateless round-robin routing. The backward compatibility with X-Sandbox-ID is clean. A few issues to address before merge.

Required Changes

1. K8s API call on every request — add a TTL cache

sandbox_router.py:53-72 — get_warm_pod_ips() hits the K8s API on every single request (wrapped in asyncio.to_thread). Under load this will hammer the API server and add latency.

Fix: Add a short TTL cache (2-5 seconds):

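import random
import time

_pod_cache: list[str] = []
_pod_cache_time: float | None = None  # None until the first successful refresh
POD_CACHE_TTL = 3.0  # seconds

def get_warm_pod_ips() -> list[str]:
    global _pod_cache, _pod_cache_time
    now = time.monotonic()
    if _pod_cache_time is not None and now - _pod_cache_time < POD_CACHE_TTL:
        ips = _pod_cache.copy()  # shuffle a copy so the cached list is untouched
        random.shuffle(ips)
        return ips
    # ... existing K8s query ...
    _pod_cache = list(pod_ips)  # store a copy; the shuffle below must not alias the cache
    _pod_cache_time = time.monotonic()
    random.shuffle(pod_ips)
    return pod_ips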

2. Last-pod retry logic is fragile

sandbox_router.py:136,155 — The check target_host != target_hosts[-1] has two problems:

With only 1 pod, it never retries (last pod == first pod)
If the last pod returns 503, it returns that 503 directly to the client instead of the friendlier "All pods are busy" message at the bottom

Fix: Use an index-based loop:

for idx, target_host in enumerate(target_hosts):
    is_last = (idx == len(target_hosts) - 1)
    # ... use is_last instead of target_host != target_hosts[-1]
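
To make the intended control flow concrete, here is a minimal, self-contained sketch of an index-based failover loop. It is not the router's actual code: `forward_with_failover` and the `send` callable are hypothetical stand-ins for the per-pod proxy call, and the sketch assumes a busy pod signals itself with a 503 status.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-in for the router's per-pod proxy call; returns an HTTP status code.
SendFn = Callable[[str], Awaitable[int]]

ALL_BUSY = (503, "All pods are busy")


async def forward_with_failover(target_hosts: list[str], send: SendFn) -> tuple[int, str]:
    """Try each pod in order and return (status, host) for the first pod that accepts.

    If every pod is busy or unreachable, return the shared "all busy" result
    instead of relaying the last pod's own 503.
    """
    for idx, target_host in enumerate(target_hosts):
        # Position-based check: still correct if the same IP appears twice in the list.
        is_last = idx == len(target_hosts) - 1
        try:
            status = await send(target_host)
        except ConnectionError:
            continue  # unreachable pod: try the next one, if any
        if status == 503:
            if is_last:
                break  # fall through to the shared message below
            continue
        return status, target_host
    return ALL_BUSY
```

With this shape, a single-pod list and an empty list both end at the same "All pods are busy" result, so the client never sees a raw 503 passed through from the final pod.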

3. Resource leak on exception

sandbox_router.py:157-160 — if 'resp' in locals() is brittle. If resp was set in a previous loop iteration, this could close the wrong response. Track the response explicitly:

current_resp = None
try:
    current_resp = await http_client.send(req, stream=True)
    ...
except Exception:
    if current_resp is not None:
        await current_resp.aclose()

Recommendations

4. `requirements.txt` Python version change

The file header changed from pip-compile with Python 3.13 to 3.10. Is this intentional? Could cause compatibility issues.

5. Consider adding a `/ready` endpoint

The router has /healthz but no readiness probe. If the K8s client fails to init, the router still accepts traffic and will 500 on every request.
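
A readiness check along these lines could be sketched as follows. This is framework-agnostic on purpose, since the PR thread does not show how the router registers routes: `readiness_status` is a hypothetical helper, and it assumes the Kubernetes client handle is left as `None` when initialization fails.

```python
from typing import Any, Optional


def readiness_status(k8s_client: Optional[Any]) -> tuple[int, dict[str, str]]:
    """Return (HTTP status, JSON body) for a /ready probe.

    Assumption: the router keeps its Kubernetes client in a variable that
    stays None when client initialization fails.
    """
    if k8s_client is None:
        # Not ready: Kubernetes should stop routing traffic to this replica.
        return 503, {"status": "not ready", "reason": "kubernetes client not initialized"}
    return 200, {"status": "ready"}
```

A `/ready` handler would return this status and body, and the Deployment's `readinessProbe` would point at that path while `/healthz` stays as the liveness check.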

6. `random.shuffle` load distribution

Random distribution can cause hot-spotting under load. Consider round-robin with an atomic counter for more even distribution.
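
One way to sketch the round-robin alternative is below. The `RoundRobin` class is hypothetical, not part of the PR. It assumes the pod list is kept in a stable order (for example, sorted) between calls, and it guards the counter with a lock because the review notes that the pod lookup runs via `asyncio.to_thread`.

```python
import itertools
import threading


class RoundRobin:
    """Rotate a host list so that successive calls start at successive hosts."""

    def __init__(self) -> None:
        self._counter = itertools.count()
        self._lock = threading.Lock()  # makes the increment safe across worker threads

    def rotate(self, hosts: list[str]) -> list[str]:
        if not hosts:
            return []
        with self._lock:
            start = next(self._counter) % len(hosts)
        # The first element is the primary target; the rest is the failover order.
        return hosts[start:] + hosts[:start]
```

Calling `rotate(sorted(pod_ips))` in place of `random.shuffle(pod_ips)` would give each pod the first attempt in turn while keeping the remaining pods as ordered fallbacks.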

Summary

Category	Items
Required fixes	3
Recommendations	3

Verdict: Approve after required changes are addressed.

natasha41575 · 2026-03-04T16:51:03Z

@kulvirgit Please sign the CLA and add a more descriptive PR description

k8s-ci-robot requested review from janetkuo and justinsb February 10, 2026 00:42

k8s-ci-robot added the needs-ok-to-test (requires an org member to verify it is safe to test) and cncf-cla: no (author has not signed the CNCF CLA) labels Feb 10, 2026

k8s-ci-robot added the size/L label (changes 100-499 lines, ignoring generated files) Feb 10, 2026

feat: stateless random-failover router for Claude Fleet warm pool

ac38b02

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kulvirgit force-pushed the feat/claude-fleet-stateless branch from 18fc772 to ac38b02 February 10, 2026 00:47

govindpawa reviewed Feb 10, 2026

janetkuo added the ok-to-test label (non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label Feb 10, 2026

kulvirgit wants to merge 1 commit into kubernetes-sigs:main from kulvirgit:feat/claude-fleet-stateless

5 participants
