-
Notifications
You must be signed in to change notification settings - Fork 497
[Store][WIP] Add retry logic for auto port binding in client setup #1328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @chenkaiyue, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the Mooncake client setup process by introducing an intelligent retry mechanism for automatically assigned ports. This change directly addresses common issues like port binding race conditions and metadata registration conflicts that can arise in distributed or high-concurrency environments, ensuring more reliable client initialization and preventing service startup failures. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a retry mechanism for automatic port binding during client setup, which is a solid improvement for reliability in high-concurrency environments. The implementation is clear and directly addresses the issues of port conflicts and metadata registration failures. I have a few suggestions to further enhance the robustness and configurability of the new logic:
- Make the maximum number of retries configurable via an environment variable.
- Use a more descriptive error code when all retry attempts fail.
- Introduce a small delay between retries to better handle transient resource contention.
| if (!success) { | ||
| LOG(ERROR) << "Failed to create client after " << kMaxRetries | ||
| << " retries"; | ||
| return tl::unexpected(ErrorCode::INVALID_PARAMS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning ErrorCode::INVALID_PARAMS when all retries for auto-port binding fail is misleading. This error code implies that the user provided incorrect parameters, but in this scenario, the port is automatically generated. A failure after multiple retries suggests a persistent environmental issue (e.g., resource exhaustion, network problem). A more appropriate error code would be ErrorCode::INTERNAL_ERROR to indicate a persistent failure to set up the client.
return tl::unexpected(ErrorCode::INTERNAL_ERROR);| this->local_hostname = local_hostname; | ||
| } | ||
| // Auto port binding with retry on metadata registration failure | ||
| constexpr int kMaxRetries = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maximum number of retries is hardcoded to 20. While this is a reasonable default, making it configurable via an environment variable (e.g., MC_STORE_CLIENT_SETUP_RETRIES) would provide more flexibility for different deployment environments where more or fewer retries might be appropriate. You can use the GetEnvOr helper from utils.h for this.
| constexpr int kMaxRetries = 20; | |
| const int kMaxRetries = GetEnvOr<int>("MC_STORE_CLIENT_SETUP_RETRIES", 20); |
| // conflict), release port and retry with a different port | ||
| LOG(WARNING) << "Failed to create client on port " << port | ||
| << ", retry " << (retry + 1) << "/" << kMaxRetries; | ||
| port_binder_.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry loop attempts to create a client immediately after a failure. In high-concurrency scenarios where failures might be due to transient resource contention (like port binding races), immediate retries can contribute to the problem. Adding a small delay between retries would make the process more robust by giving the system time to recover. A simple sleep would be beneficial, and an exponential backoff strategy could be even better for more complex scenarios.
port_binder_.reset();
std::this_thread::sleep_for(std::chrono::milliseconds(100));|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Description
fix this issue sgl-project/sglang#13395
This PR adds a retry mechanism for automatic port binding during client initialization in
RealClient::setup_internal(). When multiple clients start simultaneously on the same host, they may encounter port conflicts or metadata registration failures. This change improves reliability by automatically retrying with different ports.Problem
In high-concurrency scenarios (e.g., multiple workers starting at the same time), the current implementation may fail to create a client due to:
hostname:porthostname:portentries.The original code would fail immediately without any retry, causing service startup failures in distributed deployments.
Solution
Separate handling for user-specified vs auto-assigned ports:
hostname:8080), the behavior remains unchanged (no retry)Retry mechanism:
Code restructuring:
device_nameinitialization before the port binding logic for cleaner code flowChanges
mooncake-store/src/real_client.cpp: Modifiedsetup_internal()functionType of Change
How Has This Been Tested?
Checklist