
Conversation

@stmatengss
Collaborator

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust health monitoring and cleanup feature for the HTTP metadata server. The server will now actively verify the health status of registered segments and clients by integrating with the MasterService. This ensures that any metadata corresponding to unhealthy or inactive entities is automatically purged, thereby preventing the accumulation of stale data and improving the overall data integrity and efficiency of the metadata service.

Highlights

  • Health Monitoring for HTTP Metadata Server: Implemented a new health monitoring mechanism within the HttpMetadataServer to automatically detect and clean up stale metadata.
  • Automatic Metadata Cleanup: The server now periodically checks the health of registered segments and clients. If a segment or client is deemed unhealthy or inactive by the MasterService, its associated metadata is automatically removed from the store.
  • MasterService Integration: The HttpMetadataServer now accepts a MasterService instance during initialization, allowing it to perform health checks by querying the master service for active segments and clients.
  • Configuration Options: New configuration parameters (enable_http_metadata_server, http_metadata_server_port, http_metadata_server_host) have been added to MasterServiceSupervisorConfig to control the HTTP metadata server's behavior.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a health monitoring and cleanup mechanism for the HTTP metadata server to remove stale metadata for segments and clients. The changes are a good step towards better resource management. However, I've identified several critical bugs and areas for improvement related to correctness, performance, and maintainability. The most critical issues are an incorrect string conversion for UUIDs, which will cause client cleanup to fail, and an overly broad substring match for segment cleanup that could delete metadata for the wrong segments. There are also several medium-severity issues related to performance and code robustness that should be addressed.

Comment on lines +248 to +254
if (it->first.find(segment_name) != std::string::npos) {
    LOG(INFO) << "Cleaning up metadata for segment: " << segment_name
              << ", key: " << it->first;
    it = store_.erase(it);
} else {
    ++it;
}
Contributor


critical

The use of it->first.find(segment_name) is dangerously broad and can lead to deleting metadata for the wrong segments. For example, if you are cleaning up metadata for a segment named "seg1", this logic will also match and delete metadata for a segment named "seg10". This is because find performs a substring search. You should use a more precise matching logic to ensure only the correct segment's metadata is removed.

        std::string rpc_meta_key = "rpc_meta_" + segment_name;
        if (it->first == segment_name || it->first == rpc_meta_key) {
            LOG(INFO) << "Cleaning up metadata for segment: " << segment_name
                      << ", key: " << it->first;
            it = store_.erase(it);
        } else {
            ++it;
        }

std::lock_guard<std::mutex> lock(store_mutex_);

// Convert UUID to string for matching
std::string client_id_str = std::to_string(client_id.first) + "-" + std::to_string(client_id.second);
Contributor


critical

There's a critical bug in how the client_id is converted to a string for cleanup. The code uses std::to_string, which produces a decimal representation of the uint64_t parts of the UUID. However, the UUIDs are parsed from keys using std::stoull with base 16 (hexadecimal). This mismatch means client_id_str will never match the keys in the store, and client metadata cleanup will fail.

You should convert the UUID to a hexadecimal string to match how it's stored. You will also need to include <sstream> and <iomanip>.

    std::ostringstream oss;
    oss << std::hex << client_id.first << "-" << client_id.second;
    std::string client_id_str = oss.str();
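
One caveat worth noting: the UUID regex used when scanning keys (`[0-9a-fA-F]{16}-[0-9a-fA-F]{16}`) expects each half to be exactly 16 hex digits. If the stored keys are zero-padded that way (an assumption, not confirmed by this PR), the conversion would need padding as well, which is where <iomanip> comes in:

#include <iomanip>
#include <sstream>

// Assumption: keys zero-pad each 64-bit half to 16 hex digits, as the
// regex implies. setw() must be repeated because it resets after each
// insertion; setfill() and hex persist.
std::ostringstream oss;
oss << std::hex << std::setfill('0') << std::setw(16) << client_id.first
    << "-" << std::setw(16) << client_id.second;
std::string client_id_str = oss.str();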

Comment on lines 156 to 207
void HttpMetadataServer::check_and_cleanup_metadata() {
    if (!master_service_) {
        return;
    }

    // Get all current keys from the metadata store
    std::vector<std::string> keys_to_check;
    {
        std::lock_guard<std::mutex> lock(store_mutex_);
        for (const auto& pair : store_) {
            keys_to_check.push_back(pair.first);
        }
    }

    // Check each key to see if it corresponds to segment or client metadata
    // that should be cleaned up
    for (const auto& key : keys_to_check) {
        // Check if this key corresponds to segment metadata (e.g., keys
        // containing "segment")
        if (key.find("segment") != std::string::npos) {
            std::string segment_name = key;
            // Extract segment name from key if needed
            if (key.find("rpc_meta_") == 0) {
                segment_name = key.substr(9);  // Remove "rpc_meta_" prefix
            }
            if (!is_segment_healthy(segment_name)) {
                cleanup_segment_metadata(segment_name);
            }
        }
        // Check if this key corresponds to client metadata (e.g., keys
        // containing UUID patterns)
        else if (key.find("client") != std::string::npos ||
                 std::regex_match(key, std::regex(
                     "[0-9a-fA-F]{16}-[0-9a-fA-F]{16}"))) {  // UUID pattern
            // Extract UUID from key if needed
            UUID client_id;
            if (key.find("rpc_meta_") == 0) {
                std::string uuid_str = key.substr(9);  // Remove "rpc_meta_" prefix
                // Parse UUID string to UUID object
                size_t pos = uuid_str.find('-');
                if (pos != std::string::npos) {
                    try {
                        client_id.first =
                            std::stoull(uuid_str.substr(0, pos), nullptr, 16);
                        client_id.second =
                            std::stoull(uuid_str.substr(pos + 1), nullptr, 16);
                        if (!is_client_healthy(client_id)) {
                            cleanup_client_metadata(client_id);
                        }
                    } catch (...) {
                        // Invalid UUID format, skip
                        continue;
                    }
                }
            }
        }
    }
}
Contributor


medium

The current implementation of check_and_cleanup_metadata is inefficient. It first copies all keys, then iterates through them. For each key that needs cleanup, it calls a function (cleanup_segment_metadata or cleanup_client_metadata) which then iterates over the entire metadata store_ again. This results in multiple full scans of the store.

A more performant approach would be to collect a list of all keys to be deleted in the first loop, and then perform a single pass to erase them from the store_ under a single lock. This would reduce the complexity from O(N*M) to roughly O(N), where N is the number of keys in the store and M is the number of keys to delete.
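
A minimal sketch of that single-pass approach, assuming store_ is a mutex-guarded associative container keyed by std::string; needs_cleanup() is a hypothetical stand-in for the segment/client health checks above:

// First pass (no store lock held): decide which keys to drop.
std::vector<std::string> keys_to_erase;
for (const auto& key : keys_to_check) {
    if (needs_cleanup(key)) {  // hypothetical: wraps the health checks
        keys_to_erase.push_back(key);
    }
}

// Second pass: erase everything under a single lock acquisition.
{
    std::lock_guard<std::mutex> lock(store_mutex_);
    for (const auto& key : keys_to_erase) {
        store_.erase(key);
    }
}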

// Check each key to see if it corresponds to segment or client metadata
// that should be cleaned up
for (const auto& key : keys_to_check) {
    // Check if this key corresponds to segment metadata (e.g., keys
    // containing "segment")
    if (key.find("segment") != std::string::npos) {
Contributor


medium

Identifying keys based on key.find("segment") (and similarly key.find("client") on line 184) is fragile. This can lead to misclassifying keys and either failing to clean them up or incorrectly trying to clean them up as the wrong type. For example, a key named "my_client_backup_segment" might be misidentified. It would be more robust to use a structured key format, like segment:<name> or client:<id>, to avoid ambiguity.
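
For illustration, a sketch of prefix-based classification; the segment:/client: key scheme is the suggestion here, not the format the PR currently uses:

// Classify keys by an unambiguous prefix instead of a substring search.
const std::string kSegmentPrefix = "segment:";
const std::string kClientPrefix = "client:";

if (key.rfind(kSegmentPrefix, 0) == 0) {  // i.e., key starts with the prefix
    std::string segment_name = key.substr(kSegmentPrefix.size());
    // ... segment health check as before ...
} else if (key.rfind(kClientPrefix, 0) == 0) {
    std::string client_id_str = key.substr(kClientPrefix.size());
    // ... client UUID parse and health check as before ...
}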

Comment on lines +199 to +202
} catch (...) {
    // Invalid UUID format, skip
    continue;
}
Contributor


medium

Using catch (...) is generally discouraged as it catches all exceptions, including system-level ones, and hides the type of error that occurred. This makes debugging difficult. It's better to catch specific exceptions that std::stoull can throw, such as std::invalid_argument and std::out_of_range, or at least std::exception and log the error message.

Suggested change
-} catch (...) {
-    // Invalid UUID format, skip
-    continue;
-}
+} catch (const std::exception& e) {
+    LOG(WARNING) << "Failed to parse UUID from key '" << key
+                 << "': " << e.what();
+    continue;
+}

Comment on lines 215 to 223
auto segments_result = master_service_->GetAllSegments();
if (segments_result.has_value()) {
    const auto& segments = segments_result.value();
    for (const auto& segment : segments) {
        if (segment == segment_name) {
            return true;  // Segment exists and is healthy
        }
    }
}
Contributor


medium

The is_segment_healthy function calls master_service_->GetAllSegments() and then performs a linear search. This is done for every potential segment key inside the check_and_cleanup_metadata loop. If there are many segments and many keys, this can become a performance bottleneck. Consider adding a more efficient method to MasterService, like bool DoesSegmentExist(const std::string& segment_name), which could use a hash set for O(1) lookups.
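
A sketch of the proposed method; DoesSegmentExist is the suggested addition, and segment_mutex_ and segment_names_ are assumed members, not existing MasterService internals:

// Proposed O(1) membership check. segment_names_ is an assumed
// std::unordered_set<std::string> kept current on segment (un)mount.
bool MasterService::DoesSegmentExist(const std::string& segment_name) {
    std::lock_guard<std::mutex> lock(segment_mutex_);  // assumed member
    return segment_names_.count(segment_name) > 0;
}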

Comment on lines 159 to 174
std::unique_ptr<mooncake::HttpMetadataServer> http_metadata_server;
if (config_.enable_http_metadata_server) {
    http_metadata_server = std::make_unique<mooncake::HttpMetadataServer>(
        config_.http_metadata_server_port,
        config_.http_metadata_server_host,
        &wrapped_master_service.master_service_);
    http_metadata_server->start();

    if (!http_metadata_server->is_running()) {
        LOG(ERROR) << "Failed to start HTTP metadata server";
        return -1;
    }

    LOG(INFO) << "HTTP metadata server started successfully";
}
Contributor


medium

The logic for starting the HttpMetadataServer is duplicated here and in mooncake-store/src/master.cpp. This duplication can lead to maintenance issues, as changes might be applied in one place but not the other (as seen with the inconsistent use of sleep_for). Consider extracting this logic into a common helper function to improve code reuse and maintainability.
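
For illustration, the shared logic could be extracted along these lines; the helper's name, port type, and failure convention are illustrative:

// Shared startup helper, usable from both master.cpp and ha_helper.cpp.
// Returns nullptr on failure so each call site can decide how to react.
std::unique_ptr<mooncake::HttpMetadataServer> StartHttpMetadataServer(
    uint16_t port, const std::string& host,
    mooncake::MasterService* master_service) {
    auto server = std::make_unique<mooncake::HttpMetadataServer>(
        port, host, master_service);
    server->start();
    if (!server->is_running()) {
        LOG(ERROR) << "Failed to start HTTP metadata server";
        return nullptr;
    }
    LOG(INFO) << "HTTP metadata server started successfully";
    return server;
}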

}

// Give the server some time to start
std::this_thread::sleep_for(std::chrono::seconds(1));
Contributor


medium

Using std::this_thread::sleep_for to wait for the server to start is unreliable and considered a code smell. The server might take more or less time to initialize depending on system load. A more robust solution would be to use a proper synchronization mechanism. For example, the start() method could return a std::future that is fulfilled when the server is fully initialized and ready to accept connections, or use a condition variable. Note that this sleep is also inconsistently applied, as it's missing in the HA-mode startup logic in ha_helper.cpp.
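
A sketch of a promise-based readiness handshake, assuming start() spawns the listener thread; bind_and_listen() and serve_forever() are hypothetical stand-ins for the server's socket setup and request loop:

#include <future>
#include <thread>

class HttpMetadataServer {
public:
    // Returns a future that becomes ready once the listener is bound,
    // so callers can wait() for readiness instead of sleeping.
    std::future<void> start() {
        auto ready = ready_promise_.get_future();
        server_thread_ = std::thread([this] {
            bind_and_listen();           // hypothetical socket setup
            ready_promise_.set_value();  // signal: accepting connections
            serve_forever();             // hypothetical request loop
        });
        return ready;
    }

private:
    void bind_and_listen() { /* hypothetical: bind the listening socket */ }
    void serve_forever() { /* hypothetical: accept/handle request loop */ }

    std::promise<void> ready_promise_;
    std::thread server_thread_;
};

// Call site: http_metadata_server->start().wait();  // replaces the sleep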

@stmatengss requested a review from YiXR as a code owner on January 4, 2026 at 13:03
@stmatengss requested a review from whybeyoung on January 4, 2026 at 13:07
@stmatengss
Collaborator Author

@whybeyoung This PR periodically cleans up stale metadata in the metadata server.

@codecov-commenter

codecov-commenter commented Jan 4, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 5.17241% with 110 lines in your changes missing coverage. Please review.

Files with missing lines                    | Patch % | Lines
mooncake-store/src/http_metadata_server.cpp | 6.89%   | 81 Missing ⚠️
mooncake-store/src/ha_helper.cpp            | 0.00%   | 11 Missing ⚠️
mooncake-store/src/master.cpp               | 0.00%   | 9 Missing ⚠️
mooncake-store/src/rpc_service.cpp          | 0.00%   | 6 Missing ⚠️
mooncake-store/include/master_config.h      | 0.00%   | 3 Missing ⚠️


@stmatengss
Collaborator Author

@ykwd PTAL

WrappedMasterService* wrapped_master_service_;
std::thread health_monitor_thread_;
std::atomic<bool> health_monitor_running_{false};
static constexpr uint64_t kHealthMonitorSleepMs = 5000; // 5 seconds
Collaborator


This check doesn’t seem to need to run very frequently. Would once every 10 minutes be sufficient?
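
If so, the change would be a one-liner (illustrative value only):

static constexpr uint64_t kHealthMonitorSleepMs = 10ULL * 60 * 1000;  // 10 minutes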

bool running_;

// Health monitoring
WrappedMasterService* wrapped_master_service_;
Collaborator


We'd better use a shared_ptr here for easier maintenance in the future.
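
Illustratively, the member would become:

// Shared ownership keeps the service alive for as long as the monitor
// thread may touch it.
std::shared_ptr<WrappedMasterService> wrapped_master_service_;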


// Check each key to see if it corresponds to segment or client metadata
// that should be cleaned up
for (const auto& key : keys_to_check) {
Collaborator


The implementation here is strange.

When checking segments, is_segment_healthy is called once for each segment to be checked, and this function in turn calls GetAllSegments to fetch all segments just to verify a single specific segment. In fact, for the entire flow, we only need to call GetAllSegments once.

When checking clients, the health check is done by calling ping on the master, which feels a bit hacky. Moreover, ping_master(client_id) is not side-effect free. It updates the client’s last-seen timestamp from the master’s perspective, which prevents the master from correctly determining the client’s actual liveness.
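
A sketch of the suggested flow, taking one snapshot of master state per cleanup pass; GetAllClients() is a hypothetical side-effect-free query (unlike Ping, it would not refresh the client's last-seen timestamp):

// One snapshot per cleanup pass; keys are then checked against these
// local sets instead of issuing a per-key RPC to the master.
std::unordered_set<std::string> live_segments;
if (auto res = master_service_->GetAllSegments(); res.has_value()) {
    const auto& segments = res.value();
    live_segments.insert(segments.begin(), segments.end());
}
std::set<UUID> live_clients;  // UUID is a pair of uint64_t, so std::set works
if (auto res = master_service_->GetAllClients(); res.has_value()) {
    // hypothetical read-only API; must not update liveness state
    const auto& clients = res.value();
    live_clients.insert(clients.begin(), clients.end());
}

// In the loop: a segment is live iff live_segments.count(segment_name) > 0,
// and a client is live iff live_clients.count(client_id) > 0.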
