
Conversation

@LeslieKuo
Contributor

add ReadinessProbe to fix issue #64

VanderChen and others added 6 commits December 10, 2025 23:51
Signed-off-by: VanderChen <vanderchen@outlook.com>
Signed-off-by: LeslieKuo <676365950@qq.com>
Signed-off-by: LeslieKuo <676365950@qq.com>
Signed-off-by: VanderChen <vanderchen@outlook.com>
Signed-off-by: VanderChen <vanderchen@outlook.com>
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

@gemini-code-assist
Contributor

Summary of Changes

Hello @LeslieKuo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new agentcube-router service to efficiently route HTTP requests to appropriate sandboxes. It includes the full implementation of the router, its build and deployment configurations, and a detailed design document. A significant enhancement is the addition of ReadinessProbe capabilities for sandboxes, improving their reliability and addressing a specific issue. The router is designed for high performance, integrates with Redis for session activity, and incorporates concurrency controls.

Highlights

  • New Router Component: Introduced a new agentcube-router component designed for high-performance request routing to sandboxes, integrating with the existing SessionManager and Redis for session handling and activity tracking.
  • Readiness Probe Support: Added support for specifying ReadinessProbes in sandbox creation requests, allowing for more robust health checking of sandbox containers. This addresses issue #64 ("Invoke request fails with 'Connection refused' for complex agent").
  • Build and Deployment Updates: Modified the Makefile and Kubernetes deployment manifests to support building, deploying, and managing the new agentcube-router service within the Kubernetes cluster.
  • Session Management Enhancements: Implemented a new SessionManager within the router to handle sandbox session retrieval from Redis and dynamically create new sandboxes via the workload manager when a session ID is not provided.
  • Concurrency Limiting: The new router includes a concurrency limiting middleware to prevent server overload by gracefully rejecting requests when the maximum concurrent request limit is reached.
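
As a rough illustration of the concurrency-limiting idea in the last highlight, a middleware like the following could gracefully reject excess requests. This is a minimal sketch assuming Gin (which the handler snippets in this PR appear to use), not the PR's actual implementation:

```go
// Hedged sketch of a concurrency-limiting middleware; the real router's
// implementation in this PR may differ.
package middleware

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// ConcurrencyLimit rejects requests with 503 once maxConcurrent requests
// are already in flight, instead of letting the server overload.
func ConcurrencyLimit(maxConcurrent int) gin.HandlerFunc {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	return func(c *gin.Context) {
		select {
		case sem <- struct{}{}: // slot acquired
			defer func() { <-sem }() // release when the handler chain returns
			c.Next()
		default: // all slots busy: reject gracefully
			c.AbortWithStatusJSON(http.StatusServiceUnavailable,
				gin.H{"error": "too many concurrent requests"})
		}
	}
}
```
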
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new agentcube-router component and adds support for ReadinessProbe when creating sandboxes. The changes are extensive, including a new Dockerfile, Makefile targets, Kubernetes manifests, and the core router implementation with handlers, server logic, and a session manager. Overall, the implementation is solid and well-tested. I've identified a few areas for improvement, including a race condition in the graceful shutdown logic, a redundant Redis call, and some minor inconsistencies in the Makefile and documentation. Addressing these points will enhance the robustness and maintainability of the new router component.

Comment on lines +63 to +72
```go
select {
case <-ctx.Done():
	log.Println("Received shutdown signal, shutting down gracefully...")
	// Cancel the context to trigger server shutdown
	cancel()
	// Wait for server goroutine to exit after graceful shutdown is complete
	<-errCh
case err := <-errCh:
	log.Fatalf("Server error: %v", err)
}
```

**high**

There's a race condition in the graceful shutdown logic. If the server fails to start (e.g., port already in use) and a shutdown signal is received at the same time, the select statement has non-deterministic behavior. If the <-ctx.Done() case is selected, the startup error from errCh will be consumed but ignored, causing the program to exit silently without reporting the fatal startup error. This can make debugging startup issues very difficult.

To fix this, you should ensure that any error from errCh is handled correctly, regardless of which select case is chosen. Here is a suggested change to at least log the error in the shutdown path, which mitigates the issue of silently ignoring it.

Suggested change

```diff
 select {
 case <-ctx.Done():
 	log.Println("Received shutdown signal, shutting down gracefully...")
-	// Cancel the context to trigger server shutdown
-	cancel()
-	// Wait for server goroutine to exit after graceful shutdown is complete
-	<-errCh
+	// The server's own context handling will trigger shutdown.
+	// We wait for it to complete and check for any unexpected errors.
+	// A `http.ErrServerClosed` is expected on graceful shutdown, so we ignore it.
+	if err := <-errCh; err != nil && err.Error() != "http: Server closed" {
+		log.Printf("Server shutdown with error: %v", err)
+	}
 case err := <-errCh:
 	log.Fatalf("Server error: %v", err)
 }
```
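
As a side note, a more idiomatic way to express that check than comparing the error string is `errors.Is` with the standard library's `http.ErrServerClosed` sentinel (a sketch; assumes `errors` and `net/http` are imported):

```go
// Equivalent check using the errors package instead of string comparison.
if err := <-errCh; err != nil && !errors.Is(err, http.ErrServerClosed) {
	log.Printf("Server shutdown with error: %v", err)
}
```
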

@echo "Cleaning..."
rm -rf bin/
rm -f agentcube-apiserver agentd
rm -f agentcube-router agentd

**medium**

The clean target was modified to remove agentcube-router but it no longer removes agentcube-apiserver. Since the build target for agentcube-apiserver still exists, it should probably be included in the clean target to avoid leaving stale artifacts.

```make
rm -f agentcube-apiserver agentcube-router agentd
```
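
For context, the resulting `clean` target would then look something like this (the `@echo` and `rm -rf` lines are taken from the snippet above; the rest of the Makefile is assumed):

```make
clean:
	@echo "Cleaning..."
	rm -rf bin/
	rm -f agentcube-apiserver agentcube-router agentd
```
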

Comment on lines +293 to +294
```md
- **Liveness Check**: `/health/live`
- **Readiness Check**: `/health/ready`
```

**medium**

The "Health Checks" section lists /health/live and /health/ready endpoints. However, the implementation in pkg/router/server.go also exposes a generic /health endpoint, which is used by the Kubernetes readiness and liveness probes in k8s/agentcube-router.yaml. For consistency and completeness, this endpoint should also be documented here.

Suggested change

```diff
+- **Health Check**: `/health`
 - **Liveness Check**: `/health/live`
 - **Readiness Check**: `/health/ready`
```
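
For reference, here is a minimal runnable sketch of how the three endpoints could coexist. It assumes Gin, which the handler snippets in this PR appear to use; the handler bodies are placeholders, not the PR's actual implementation in `pkg/router/server.go`:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()
	// Placeholder handler; the real checks would verify Redis connectivity etc.
	ok := func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"status": "ok"}) }
	r.GET("/health", ok)       // generic endpoint referenced by the k8s probes
	r.GET("/health/live", ok)  // liveness check
	r.GET("/health/ready", ok) // readiness check
	_ = r.Run(":8080")
}
```
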

Comment on lines +107 to +109
```go
if err := s.redisClient.UpdateSessionLastActivity(c.Request.Context(), sandbox.SessionID, time.Now()); err != nil {
	log.Printf("Failed to update sandbox last activity for request: %v", err)
}
```

**medium**

The call to s.redisClient.UpdateSessionLastActivity is duplicated. It's called once before s.forwardToSandbox (lines 98-102) and again immediately after this. Since forwardToSandbox blocks until the request is served, the second call is redundant and causes an unnecessary write to Redis. You should remove this second call.

Additionally, the log message on line 100 "Failed to update sandbox last activity" is slightly misleading, as the function UpdateSessionLastActivity updates the activity for a session, not a sandbox.
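
Concretely, the single-update flow could look like this (a sketch based on the handler snippets quoted in this review; the `forwardToSandbox` signature and surrounding code are assumed):

```go
// Update the session's activity once, before forwarding. forwardToSandbox
// blocks until the request has been served, so a second update afterwards
// would be redundant.
if err := s.redisClient.UpdateSessionLastActivity(c.Request.Context(), sandbox.SessionID, time.Now()); err != nil {
	log.Printf("Failed to update session last activity for request: %v", err)
}
s.forwardToSandbox(c, sandbox)
```
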

Comment on lines +87 to +88
```go
if len(sandbox.Spec.PodTemplate.Spec.Containers) > 0 {
	container := &sandbox.Spec.PodTemplate.Spec.Containers[0]
```

**medium**

The logic for adding a readiness probe applies it to the first container of the pod spec (.Spec.PodTemplate.Spec.Containers[0]). This assumes that the pod has only one container or that the probe is always intended for the first one. For AgentRuntime workloads, which can have multiple containers in their pod template, this might lead to incorrect behavior if the probe is intended for a different container.

Consider making this more robust, for example by allowing the ReadinessProbe configuration to specify a container name, and falling back to the first container if no name is provided.
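
One possible shape for that suggestion (a sketch only; the `name` parameter and `pickTargetContainer` helper are hypothetical, not part of this PR):

```go
package workload

import corev1 "k8s.io/api/core/v1"

// pickTargetContainer returns the container a probe should apply to: the one
// whose name matches, or the first container when name is empty or not found.
// (Hypothetical helper, not in the PR.)
func pickTargetContainer(containers []corev1.Container, name string) *corev1.Container {
	if len(containers) == 0 {
		return nil
	}
	if name != "" {
		for i := range containers {
			if containers[i].Name == name {
				return &containers[i]
			}
		}
	}
	return &containers[0]
}
```
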

@MahaoAlex
Contributor

Issue #64 is just a quick draft idea for the readiness probe. You’d better add a fully designed proposal for it in this PR.

```go
sandbox.Spec.PodTemplate.ObjectMeta.Labels[SessionIdLabelKey] = params.sessionID
sandbox.Spec.PodTemplate.ObjectMeta.Labels["sandbox-name"] = params.sandboxName

// Handle Readiness Probe
```

Is the readiness probe only being added to `Containers[0]`? If this follows the design and is a deliberate limitation, please add a code comment to mark this point clearly.

@acsoto
Contributor

acsoto commented Dec 23, 2025

-> #89
This can be closed?

