
Conversation

@LeslieKuo
Contributor

add ReadinessProbe to fix issue #64

VanderChen and others added 6 commits December 10, 2025 23:51
Signed-off-by: VanderChen <vanderchen@outlook.com>
Signed-off-by: LeslieKuo <676365950@qq.com>
Signed-off-by: LeslieKuo <676365950@qq.com>
Signed-off-by: VanderChen <vanderchen@outlook.com>
Signed-off-by: VanderChen <vanderchen@outlook.com>
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

@gemini-code-assist
Contributor

Summary of Changes

Hello @LeslieKuo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new agentcube-router service to efficiently route HTTP requests to appropriate sandboxes. It includes the full implementation of the router, its build and deployment configurations, and a detailed design document. A significant enhancement is the addition of ReadinessProbe capabilities for sandboxes, improving their reliability and addressing a specific issue. The router is designed for high performance, integrates with Redis for session activity, and incorporates concurrency controls.

Highlights

  • New Router Component: Introduced a new agentcube-router component designed for high-performance request routing to sandboxes, integrating with the existing SessionManager and Redis for session handling and activity tracking.
  • Readiness Probe Support: Added support for specifying ReadinessProbes in sandbox creation requests, allowing for more robust health checking of sandbox containers. This addresses issue #64 ("Invoke request fails with 'Connection refused' for complex agent").
  • Build and Deployment Updates: Modified the Makefile and Kubernetes deployment manifests to support building, deploying, and managing the new agentcube-router service within the Kubernetes cluster.
  • Session Management Enhancements: Implemented a new SessionManager within the router to handle sandbox session retrieval from Redis and dynamically create new sandboxes via the workload manager when a session ID is not provided.
  • Concurrency Limiting: The new router includes a concurrency limiting middleware to prevent server overload by gracefully rejecting requests when the maximum concurrent request limit is reached.
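
As a rough illustration of the concurrency-limiting idea in the last highlight, a middleware like the following could gracefully reject excess requests. This is a minimal sketch assuming Gin (which the handler snippets in this PR appear to use), not the PR's actual implementation:

```go
// Hedged sketch of a concurrency-limiting middleware; the real router's
// implementation in this PR may differ.
package middleware

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// ConcurrencyLimit rejects requests with 503 once maxConcurrent requests
// are already in flight, instead of letting the server overload.
func ConcurrencyLimit(maxConcurrent int) gin.HandlerFunc {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	return func(c *gin.Context) {
		select {
		case sem <- struct{}{}: // slot acquired
			defer func() { <-sem }() // release when the handler chain returns
			c.Next()
		default: // all slots busy: reject gracefully
			c.AbortWithStatusJSON(http.StatusServiceUnavailable,
				gin.H{"error": "too many concurrent requests"})
		}
	}
}
```
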
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new agentcube-router component and adds support for ReadinessProbe when creating sandboxes. The changes are extensive, including a new Dockerfile, Makefile targets, Kubernetes manifests, and the core router implementation with handlers, server logic, and a session manager. Overall, the implementation is solid and well-tested. I've identified a few areas for improvement, including a race condition in the graceful shutdown logic, a redundant Redis call, and some minor inconsistencies in the Makefile and documentation. Addressing these points will enhance the robustness and maintainability of the new router component.

Comment on lines +63 to +72
```go
select {
case <-ctx.Done():
	log.Println("Received shutdown signal, shutting down gracefully...")
	// Cancel the context to trigger server shutdown
	cancel()
	// Wait for server goroutine to exit after graceful shutdown is complete
	<-errCh
case err := <-errCh:
	log.Fatalf("Server error: %v", err)
}
```

**high**

There's a race condition in the graceful shutdown logic. If the server fails to start (e.g., port already in use) and a shutdown signal is received at the same time, the select statement has non-deterministic behavior. If the <-ctx.Done() case is selected, the startup error from errCh will be consumed but ignored, causing the program to exit silently without reporting the fatal startup error. This can make debugging startup issues very difficult.

To fix this, you should ensure that any error from errCh is handled correctly, regardless of which select case is chosen. Here is a suggested change to at least log the error in the shutdown path, which mitigates the issue of silently ignoring it.

Suggested change

```diff
 select {
 case <-ctx.Done():
 	log.Println("Received shutdown signal, shutting down gracefully...")
-	// Cancel the context to trigger server shutdown
-	cancel()
-	// Wait for server goroutine to exit after graceful shutdown is complete
-	<-errCh
+	// The server's own context handling will trigger shutdown.
+	// We wait for it to complete and check for any unexpected errors.
+	// A `http.ErrServerClosed` is expected on graceful shutdown, so we ignore it.
+	if err := <-errCh; err != nil && err.Error() != "http: Server closed" {
+		log.Printf("Server shutdown with error: %v", err)
+	}
 case err := <-errCh:
 	log.Fatalf("Server error: %v", err)
 }
```
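
As a side note, a more idiomatic way to express that check than comparing the error string is `errors.Is` with the standard library's `http.ErrServerClosed` sentinel (a sketch; assumes `errors` and `net/http` are imported):

```go
// Equivalent check using the errors package instead of string comparison.
if err := <-errCh; err != nil && !errors.Is(err, http.ErrServerClosed) {
	log.Printf("Server shutdown with error: %v", err)
}
```
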

@echo "Cleaning..."
rm -rf bin/
rm -f agentcube-apiserver agentd
rm -f agentcube-router agentd

**medium**

The clean target was modified to remove agentcube-router but it no longer removes agentcube-apiserver. Since the build target for agentcube-apiserver still exists, it should probably be included in the clean target to avoid leaving stale artifacts.

```make
rm -f agentcube-apiserver agentcube-router agentd
```
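
For context, the resulting `clean` target would then look something like this (the `@echo` and `rm -rf` lines are taken from the snippet above; the rest of the Makefile is assumed):

```make
clean:
	@echo "Cleaning..."
	rm -rf bin/
	rm -f agentcube-apiserver agentcube-router agentd
```
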

Comment on lines +293 to +294
```md
- **Liveness Check**: `/health/live`
- **Readiness Check**: `/health/ready`
```

**medium**

The "Health Checks" section lists /health/live and /health/ready endpoints. However, the implementation in pkg/router/server.go also exposes a generic /health endpoint, which is used by the Kubernetes readiness and liveness probes in k8s/agentcube-router.yaml. For consistency and completeness, this endpoint should also be documented here.

Suggested change

```diff
+- **Health Check**: `/health`
 - **Liveness Check**: `/health/live`
 - **Readiness Check**: `/health/ready`
```
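
For reference, here is a minimal runnable sketch of how the three endpoints could coexist. It assumes Gin, which the handler snippets in this PR appear to use; the handler bodies are placeholders, not the PR's actual implementation in `pkg/router/server.go`:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()
	// Placeholder handler; the real checks would verify Redis connectivity etc.
	ok := func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"status": "ok"}) }
	r.GET("/health", ok)       // generic endpoint referenced by the k8s probes
	r.GET("/health/live", ok)  // liveness check
	r.GET("/health/ready", ok) // readiness check
	_ = r.Run(":8080")
}
```
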

Comment on lines +107 to +109
```go
if err := s.redisClient.UpdateSessionLastActivity(c.Request.Context(), sandbox.SessionID, time.Now()); err != nil {
	log.Printf("Failed to update sandbox last activity for request: %v", err)
}
```

**medium**

The call to s.redisClient.UpdateSessionLastActivity is duplicated. It's called once before s.forwardToSandbox (lines 98-102) and again immediately after this. Since forwardToSandbox blocks until the request is served, the second call is redundant and causes an unnecessary write to Redis. You should remove this second call.

Additionally, the log message on line 100 "Failed to update sandbox last activity" is slightly misleading, as the function UpdateSessionLastActivity updates the activity for a session, not a sandbox.
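
Concretely, the single-update flow could look like this (a sketch based on the handler snippets quoted in this review; the `forwardToSandbox` signature and surrounding code are assumed):

```go
// Update the session's activity once, before forwarding. forwardToSandbox
// blocks until the request has been served, so a second update afterwards
// would be redundant.
if err := s.redisClient.UpdateSessionLastActivity(c.Request.Context(), sandbox.SessionID, time.Now()); err != nil {
	log.Printf("Failed to update session last activity for request: %v", err)
}
s.forwardToSandbox(c, sandbox)
```
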

Comment on lines +87 to +88
```go
if len(sandbox.Spec.PodTemplate.Spec.Containers) > 0 {
	container := &sandbox.Spec.PodTemplate.Spec.Containers[0]
```

**medium**

The logic for adding a readiness probe applies it to the first container of the pod spec (.Spec.PodTemplate.Spec.Containers[0]). This assumes that the pod has only one container or that the probe is always intended for the first one. For AgentRuntime workloads, which can have multiple containers in their pod template, this might lead to incorrect behavior if the probe is intended for a different container.

Consider making this more robust, for example by allowing the ReadinessProbe configuration to specify a container name, and falling back to the first container if no name is provided.
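
One possible shape for that suggestion (a sketch only; the `name` parameter and `pickTargetContainer` helper are hypothetical, not part of this PR):

```go
package workload

import corev1 "k8s.io/api/core/v1"

// pickTargetContainer returns the container a probe should apply to: the one
// whose name matches, or the first container when name is empty or not found.
// (Hypothetical helper, not in the PR.)
func pickTargetContainer(containers []corev1.Container, name string) *corev1.Container {
	if len(containers) == 0 {
		return nil
	}
	if name != "" {
		for i := range containers {
			if containers[i].Name == name {
				return &containers[i]
			}
		}
	}
	return &containers[0]
}
```
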

@MahaoAlex
Contributor

Issue #64 is just a quick draft idea for the readiness probe. You’d better add a fully designed proposal for it in this PR.

```go
sandbox.Spec.PodTemplate.ObjectMeta.Labels[SessionIdLabelKey] = params.sessionID
sandbox.Spec.PodTemplate.ObjectMeta.Labels["sandbox-name"] = params.sandboxName

// Handle Readiness Probe
```

Is the readiness probe only being added to `Containers[0]`? If this follows the design and is a deliberate limitation, please add a code comment to mark this point clearly.

@acsoto
Contributor

acsoto commented Dec 23, 2025

-> #89
This can be closed?

