feat: stateless round-robin router for Claude Fleet warm pool #303

Open
kulvirgit wants to merge 1 commit into kubernetes-sigs:main from kulvirgit:feat/claude-fleet-stateless

@kulvirgit

No description provided.

@netlify

netlify bot commented Feb 10, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit ac38b02
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/698a80087d4ce6000855fb4b

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kulvirgit
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 10, 2026

CLA Not Signed

@k8s-ci-robot
Contributor

Welcome @kulvirgit!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and cncf-cla: no (indicates the PR's author has not signed the CNCF CLA) labels Feb 10, 2026
@k8s-ci-robot
Contributor

Hi @kulvirgit. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) label Feb 10, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kulvirgit force-pushed the feat/claude-fleet-stateless branch from 18fc772 to ac38b02 on February 10, 2026 00:47

@govindpawa left a comment

Code Review

Overall

Good conversion to stateless round-robin routing. The backward compatibility with X-Sandbox-ID is clean. A few issues to address before merge.


Required Changes

1. K8s API call on every request — add a TTL cache

sandbox_router.py:53-72: get_warm_pod_ips() hits the K8s API on every single request (wrapped in asyncio.to_thread). Under load this will hammer the API server and add a K8s round trip of latency to every proxied request.

Fix: Add a short TTL cache (2-5 seconds):

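import random
import time

_pod_cache: list[str] = []
# Start at -inf so the first call always misses, whatever time.monotonic() returns.
_pod_cache_time: float = float("-inf")
POD_CACHE_TTL = 3.0  # seconds

def get_warm_pod_ips() -> list[str]:
    global _pod_cache, _pod_cache_time
    # Serve from the cache while it is fresh; shuffle a copy so the cache is untouched.
    if time.monotonic() - _pod_cache_time < POD_CACHE_TTL:
        ips = _pod_cache.copy()
        random.shuffle(ips)
        return ips
    # ... existing K8s query ...
    _pod_cache = list(pod_ips)  # store a copy so callers cannot mutate the cache
    _pod_cache_time = time.monotonic()
    random.shuffle(pod_ips)
    return pod_ips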

2. Last-pod retry logic is fragile

sandbox_router.py:136,155 — The check target_host != target_hosts[-1] has two problems:

  • With only 1 pod, it never retries (last pod == first pod)
  • If the last pod returns 503, it returns that 503 directly to the client instead of the friendlier "All pods are busy" message at the bottom

Fix: Use an index-based loop:

for idx, target_host in enumerate(target_hosts):
    is_last = (idx == len(target_hosts) - 1)
    # ... use is_last instead of target_host != target_hosts[-1]
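
To make the suggestion concrete, here is a minimal sketch of an index-based failover loop. It is not the router's actual code: route_with_failover and the send callable are hypothetical names, and send stands in for the real proxied request by returning only a status code. The sketch assumes a 503 or a connection error means "try the next pod", and that exhausting the list should produce the shared "All pods are busy" response rather than relaying the last pod's 503.

```python
from typing import Callable

BUSY = 503  # status a pod returns when it cannot take another session


def route_with_failover(
    target_hosts: list[str],
    send: Callable[[str], int],  # hypothetical: performs the request, returns a status code
) -> tuple[int, str]:
    """Try each pod in order; return (status, detail)."""
    for idx, target_host in enumerate(target_hosts):
        # Positional check: stays correct with a single pod or duplicate IPs.
        is_last = idx == len(target_hosts) - 1
        try:
            status = send(target_host)
        except ConnectionError:
            status = BUSY  # treat an unreachable pod like a busy one
        if status != BUSY:
            return status, target_host
        if not is_last:
            continue  # more candidates remain, so retry on the next pod
        # The last candidate was busy too: fall out of the loop instead of
        # relaying its raw 503 to the client.
    return BUSY, "All pods are busy"
```

Because is_last is derived from the position rather than from comparing host strings, the single-pod case and the last-pod-503 case both end at the same final return.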

3. Resource leak on exception

sandbox_router.py:157-160: Checking if 'resp' in locals() is brittle. If resp was set in a previous loop iteration, the handler could close the wrong response. Track the response explicitly:

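current_resp = None  # reset on every attempt, before the request is sent
try:
    current_resp = await http_client.send(req, stream=True)
    ...
except Exception:
    # Close only the response opened by this attempt, if there is one.
    if current_resp is not None:
        await current_resp.aclose()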
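
The following self-contained sketch shows the same pattern inside a retry loop, so the per-attempt reset is visible. It makes several assumptions: FakeResponse, forward, send, and consume are hypothetical stand-ins (the real router streams through an HTTP client), and a failed attempt is assumed to move on to the next pod.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a streamed response; records whether it was closed."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


async def forward(hosts, send, consume):
    """Try each host; on failure close only the response opened by that attempt."""
    for host in hosts:
        current_resp = None  # reset per attempt so a stale response is never closed
        try:
            current_resp = await send(host)
            return await consume(current_resp)
        except Exception:
            if current_resp is not None:
                await current_resp.aclose()
    return None  # every attempt failed


opened: list[FakeResponse] = []


async def send(host: str) -> FakeResponse:
    resp = FakeResponse(host)
    opened.append(resp)
    return resp


async def consume(resp: FakeResponse) -> str:
    if resp.host == "10.0.0.1":
        raise RuntimeError("stream broke")  # simulate a mid-stream failure
    return resp.host


result = asyncio.run(forward(["10.0.0.1", "10.0.0.2"], send, consume))
```

In this run the first response is closed after its stream fails, and the second is returned to the caller still open, which is what a streaming proxy needs.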

Recommendations

4. requirements.txt Python version change

The pip-compile header in requirements.txt changed from Python 3.13 to 3.10. Is this intentional? Pins resolved under 3.10 can differ from those resolved under 3.13, which could cause compatibility issues.

5. Consider adding a /ready endpoint

The router has /healthz but no readiness probe. If the K8s client fails to init, the router still accepts traffic and will 500 on every request.
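
As a rough illustration, the readiness decision can be kept as a small function, separate from the web framework. This is only a sketch: readiness and its k8s_client parameter are hypothetical names, and wiring the result to a /ready route is left to whatever framework the router already uses.

```python
from typing import Optional


def readiness(k8s_client: Optional[object]) -> tuple[int, dict[str, str]]:
    """Return (status_code, body) for a hypothetical /ready handler."""
    if k8s_client is None:
        # Client never initialized: report not-ready so the kubelet stops
        # routing traffic here instead of every request failing with a 500.
        return 503, {"status": "not ready", "reason": "Kubernetes client not initialized"}
    return 200, {"status": "ready"}
```

Pointing the pod's readinessProbe at such an endpoint while keeping /healthz for liveness would let a router with a broken client stay out of rotation without being restarted.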

6. random.shuffle load distribution

Random distribution can cause hot-spotting under load. Consider round-robin with an atomic counter for more even distribution.
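
One possible shape for this, as a sketch only: a lock-protected counter that rotates the pod list so each request starts at the next pod. RoundRobin and order are hypothetical names. The sketch assumes the pod list arrives in a stable order (for example, sorted by IP) and that the counter is per process, so multiple router replicas would each rotate independently.

```python
import threading


class RoundRobin:
    """Rotate the starting pod on each call using a thread-safe counter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = 0

    def order(self, hosts: list[str]) -> list[str]:
        """Return hosts rotated so successive calls start at successive pods."""
        if not hosts:
            return []
        with self._lock:
            start = self._next % len(hosts)
            self._next += 1
        # The remaining pods follow in order and act as failover candidates.
        return hosts[start:] + hosts[:start]


rr = RoundRobin()
pods = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
first = rr.order(pods)   # starts at 10.0.0.1
second = rr.order(pods)  # starts at 10.0.0.2
```

The lock is needed because the pod lookup runs via asyncio.to_thread; if the ordering were only ever computed on the event loop thread, a plain integer would do.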


Summary

Category          Items
Required fixes    3
Recommendations   3

Verdict: Approve after required changes are addressed.

@janetkuo added ok-to-test (indicates a non-member PR verified by an org member that is safe to test) and removed needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels Feb 10, 2026
@natasha41575

@kulvirgit Please sign the CLA and add a more descriptive PR description

Labels

cncf-cla: no (Indicates the PR's author has not signed the CNCF CLA.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

5 participants