[Feature] Intranode NVLINK Transport support in Hopper GPU #1197
Conversation
…tect Fabric/Posix handle | design gracefully shutdown mechanism
Summary of Changes

Hello @TTThanos, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a significant enhancement to the intra-node NVLink transfer capabilities within the Mooncake Transfer Engine. The primary goal is to provide a more robust and flexible method for sharing GPU memory between processes on the same machine. It achieves this by implementing a dynamic system that prioritizes CUDA Fabric handles for optimal performance but gracefully falls back to a POSIX file-descriptor-based inter-process communication (IPC) mechanism when Fabric handles are unavailable. This fallback involves a dedicated Unix domain socket server to manage the secure exchange of memory handles, ensuring broad compatibility across different NVIDIA GPU architectures and CUDA versions. The changes involve substantial refactoring of the NvlinkTransport and TransferMetadata components to support this new, metadata-driven memory sharing approach.
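The Fabric-vs-POSIX decision described above can be probed at startup. A minimal sketch of such a capability check, assuming CUDA 12.3+ (where the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute is available); the helper name is hypothetical and not part of this PR:

```cpp
#include <cuda.h>

// Hypothetical helper: returns true when the device can export
// CU_MEM_HANDLE_TYPE_FABRIC handles; otherwise the transport would
// fall back to the POSIX file-descriptor path added in this PR.
bool supportsFabricHandles(CUdevice dev) {
    int supported = 0;
    CUresult rc = cuDeviceGetAttribute(
        &supported, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    return rc == CUDA_SUCCESS && supported != 0;
}
```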
Code Review
This pull request introduces a significant new feature for intra-node communication using NVLink, with a fallback mechanism to POSIX file descriptors over Unix domain sockets when CUDA Fabric is unavailable. The implementation is comprehensive, involving a client-server model for exchanging memory handles. While the feature is a great addition, the code contains several critical issues, including security vulnerabilities due to overly permissive file modes on sockets and a buffer overflow bug that could lead to crashes. Additionally, there are numerous debug log statements that should be removed before merging. My review provides specific feedback on these critical areas, along with suggestions to improve code clarity and maintainability.
```cpp
#ifdef USE_MNNVL
        LOG(INFO) << "Prepare to shutdown nvlink transport";
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr[i]);
        xport->shutdownServer();
```
The shutdownServer() function is being called inside a for loop. This function is intended to be called only once to shut down the transport server. Calling it multiple times, especially within a loop, can lead to undefined behavior or crashes on subsequent iterations as it attempts to operate on an already-closed transport. This should be moved outside and after the loop.
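A minimal sketch of the suggested restructuring (loop shape and names taken from the surrounding diff; treat it as illustrative, not the final patch):

```cpp
for (int i = 0; i < buffer_num; ++i) {
    engine->unregisterLocalMemory(addr[i]);
#ifdef USE_MNNVL
    mooncake::NvlinkTransport::freePinnedLocalMemory(addr[i]);
#endif
}
#ifdef USE_MNNVL
// Shut the transport server down exactly once, after all buffers are released.
xport->shutdownServer();
#endif
```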
```cpp
        close(client_sock);
        return -1;
    }
    chmod(client_socket_path.c_str(), 0777);
```
```cpp
    if (result == CUDA_SUCCESS) {
        // Blackwell GB200
        memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));
```
This memcpy writes to the buffer of an empty std::vector<uint8_t> shm_data. This will result in a buffer overflow and undefined behavior, likely crashing the application. You must resize the vector to the appropriate size before copying data into it.
Suggested change:

```cpp
shm_data.resize(sizeof(fabric_handle));
memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));
```

```cpp
        return;
    }

    chmod(path.c_str(), 0777);
```
Setting the socket file permission to 0777 is a significant security risk, as it gives world-readable, -writable, and -executable permissions. This could allow any user on the system to interact with the socket, potentially leading to unauthorized access or denial of service. The permissions should be restricted to the owner, for example, 0700.
Suggested change:

```cpp
chmod(path.c_str(), 0700);
```
```cpp
    desc.metadata["handle_type"] = "2";  // POSIX_FD
    desc.metadata["export_pid"] = std::to_string(getpid());
    desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd";
```
| LOG(INFO) << "register memory: addr " << addr << ", length " << length; | ||
| if (globalConfig().trace) { | ||
| LOG(INFO) << "register memory: addr " << addr << ", length " << length; |
```cpp
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
    // prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
```
```cpp
   private:

    void startExportServer();
```
```cpp
    struct ShareableHandle {
        int type;  // 2 = POSIX_FD
        union {
            int fd;
            uint8_t fabric[32];
        } value;
    };
```
```cpp
    result = cuMemAddressReserve((CUdeviceptr*)&mapped_addr, entry.length, 0, 0, 0);
    if (result != CUDA_SUCCESS) goto fail;

    result = cuMemMap((CUdeviceptr)mapped_addr, entry.length, 0, imported_fd, 0);
    if (result != CUDA_SUCCESS) goto fail;

    // Grant access
    for (int i = 0; i < device_count; ++i) {
        access_descs[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access_descs[i].location.id = i;
        access_descs[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    }

    result = cuMemSetAccess((CUdeviceptr)mapped_addr, entry.length,
                            access_descs.data(), device_count);
    if (result != CUDA_SUCCESS) goto fail;

    OpenedShmEntry shm_entry;
    shm_entry.shm_addr = mapped_addr;
    shm_entry.length = length;
    remap_entries_[std::make_pair(target_id, entry.addr)] = shm_entry;

    dest_addr = dest_addr - entry.addr + (uint64_t)mapped_addr;
    return 0;

fail:
```
The use of goto fail for error handling is C-style and can make the code harder to follow and maintain in C++. Consider using RAII principles for resource management. For example, you could use std::unique_ptr with custom deleters or a scope guard object to ensure that cleanup logic (like cuMemUnmap, cuMemRelease, cleanupSocket) is automatically executed when the scope is exited, whether normally or due to an error. This would eliminate the need for goto and make the resource handling more robust.
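As one illustration, a minimal scope-guard sketch; the ScopeGuard class is hypothetical, and the commented usage mirrors the names in the diff:

```cpp
#include <utility>

// Minimal scope guard: runs the cleanup callable on scope exit
// unless dismiss() was called first.
template <typename F>
class ScopeGuard {
   public:
    explicit ScopeGuard(F f) : f_(std::move(f)) {}
    ~ScopeGuard() {
        if (active_) f_();
    }
    void dismiss() { active_ = false; }

   private:
    F f_;
    bool active_ = true;
};

// Sketch of the import path without goto (names from the diff):
//   result = cuMemAddressReserve((CUdeviceptr *)&mapped_addr, entry.length, 0, 0, 0);
//   if (result != CUDA_SUCCESS) return -1;
//   ScopeGuard free_va([&] { cuMemAddressFree((CUdeviceptr)mapped_addr, entry.length); });
//
//   result = cuMemMap((CUdeviceptr)mapped_addr, entry.length, 0, imported_fd, 0);
//   if (result != CUDA_SUCCESS) return -1;
//   ScopeGuard unmap([&] { cuMemUnmap((CUdeviceptr)mapped_addr, entry.length); });
//
//   ... cuMemSetAccess and bookkeeping ...
//   unmap.dismiss();
//   free_va.dismiss();  // success: keep the mapping alive
//   return 0;
```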
Please use clang-format or pre-commit for format checking.
Pull request overview
This PR adds intranode NVLINK transport support for Hopper GPUs by implementing a POSIX file descriptor-based memory handle sharing mechanism. The implementation creates a Unix domain socket server to export GPU memory handles between processes when the CUDA Fabric memory API is not available, enabling NVLINK communication on Hopper architecture GPUs that don't support the CU_MEM_HANDLE_TYPE_FABRIC handle type.
Key changes:
- Implements fallback from FABRIC handles to POSIX_FD handles for Hopper GPU compatibility
- Adds a Unix domain socket-based IPC mechanism for sharing file descriptors across processes (see the sketch after this list)
- Enhances metadata serialization to include handle type and socket path information
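The descriptor exchange itself relies on standard SCM_RIGHTS ancillary data over AF_UNIX sockets. A minimal, self-contained sketch of the sending side (the sendFd helper name is hypothetical; the PR's actual server loop lives in nvlink_transport.cpp):

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>

// Send one file descriptor over a connected AF_UNIX socket using
// SCM_RIGHTS ancillary data. Returns 0 on success, -1 on failure.
int sendFd(int sock, int fd) {
    char dummy = 'x';  // at least one byte of regular payload is required
    struct iovec iov{&dummy, sizeof(dummy)};

    alignas(struct cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    struct msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```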
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 36 comments.
Summary per file:
| File | Description |
|---|---|
| mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp | Core implementation of POSIX FD export/import server, memory backend detection, and socket-based IPC communication |
| mooncake-transfer-engine/include/transport/nvlink_transport/nvlink_transport.h | Added MemoryBackend enum, ExportedBuffer struct, and server management methods |
| mooncake-transfer-engine/include/transfer_metadata.h | Added metadata map to BufferDesc for storing handle type and socket information |
| mooncake-transfer-engine/src/transfer_metadata.cpp | Implemented JSON serialization/deserialization for new metadata field |
| mooncake-transfer-engine/src/transfer_metadata_plugin.cpp | Added debug logging for connection string parsing |
| mooncake-transfer-engine/example/transfer_engine_bench.cpp | Added shutdown call for NvlinkTransport server cleanup |
| mooncake-transfer-engine/src/multi_transport.cpp | Added commented debug logging |
| mooncake-transfer-engine/nvlink-allocator/nvlink_allocator.cpp | Added commented alternative handle type |
```cpp
    desc.metadata["handle_type"] = "2";  // POSIX_FD
    desc.metadata["export_pid"] = std::to_string(getpid());
    desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd";
```

Copilot AI (Dec 12, 2025):
Comment contains Chinese characters and incomplete text. Use English and complete the comment properly.
| desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd"; | |
| desc.metadata["socket_path"] = getSocketPath(); // e.g., /tmp/nvlink_export_12345; directly send fd |
```cpp
        desc.shm_name = serializeBinaryData(&fd, sizeof(int));
        return metadata_->addLocalMemoryBuffer(desc, true);
    }
    LOG(INFO) << "still use shm_data";
```

Copilot AI (Dec 12, 2025):
This LOG(INFO) statement appears to be leftover debug logging. Consider removing it for production code.
| LOG(INFO) << "still use shm_data"; |
```cpp
    server_running_ = true;
    export_server_thread_ = std::thread(&NvlinkTransport::exportServerLoop, this);
    LOG(INFO) << "NvlinkTransport: FD export server started at " << getSocketPath();
}
```

Copilot AI (Dec 12, 2025):
Race condition: The export server thread is started in the constructor before the object is fully constructed. If the server thread accesses member variables during initialization, it could lead to undefined behavior. Consider moving the thread start to a separate init() method that is called after construction is complete.
Suggested change:

```cpp
    server_running_ = true;
    // Thread is now started in init()
}

void NvlinkTransport::init() {
    export_server_thread_ = std::thread(&NvlinkTransport::exportServerLoop, this);
    LOG(INFO) << "NvlinkTransport: FD export server started at " << getSocketPath();
}
```
```cpp
    if (result == CUDA_SUCCESS) {
        // Blackwell GB200
        memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));
        handle_type = 1;  // 标记为 FABRIC
```

Copilot AI (Dec 12, 2025):
Comment contains Chinese characters. Use English for consistency with the rest of the codebase.
Suggested change:

```cpp
        handle_type = 1;  // Mark as FABRIC
```
| LOG(INFO) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle); | ||
| LOG(INFO) << "use_fabric_mem_ " << use_fabric_mem_; | ||
| LOG(INFO) << "size of output buffer " << output_buffer.size(); |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These LOG(INFO) statements appear to be leftover debug logging. Consider removing them or making them conditional on a trace flag.
| LOG(INFO) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle); | |
| LOG(INFO) << "use_fabric_mem_ " << use_fabric_mem_; | |
| LOG(INFO) << "size of output buffer " << output_buffer.size(); | |
| VLOG(1) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle); | |
| VLOG(1) << "use_fabric_mem_ " << use_fabric_mem_; | |
| VLOG(1) << "size of output buffer " << output_buffer.size(); |
```cpp
    for (int i = 0; i < buffer_num; ++i) {
        engine->unregisterLocalMemory(addr[i]);
#ifdef USE_MNNVL
        LOG(INFO) << "Prepare to shutdown nvlink transport";
```

Copilot AI (Dec 12, 2025):
This LOG(INFO) statement appears to be leftover debug logging. Consider removing it for production code.
| LOG(INFO) << "Prepare to shutdown nvlink transport"; |
```cpp
    void startExportServer();
    void exportServerLoop();
    void cleanupExportServer();
```

Copilot AI (Dec 12, 2025):
The cleanupExportServer method is declared but never defined or used in the implementation. Consider removing it or implementing it if it's needed.
Suggested change: remove the cleanupExportServer() declaration.
It is useful
| LOG(INFO) << "Using Fabric Memory backend"; | ||
| prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC; | ||
| break; | ||
|
|
||
| case MemoryBackend::IPC_POSIX_FD: | ||
| LOG(INFO) << "Using POSIX_FD IPC backend"; |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These LOG(INFO) statements appear to be leftover debug logging. Consider removing them or making them conditional on a trace flag to reduce noise in production.
| LOG(INFO) << "The value of parse conn string: " << parsed_conn_string.first; | ||
| #ifdef USE_ETCD | ||
| LOG(INFO) << "Inside USE_ETCD"; |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These LOG statements appear to be leftover debug logging. Consider removing them for production code.
| LOG(INFO) << "The value of parse conn string: " << parsed_conn_string.first; | |
| #ifdef USE_ETCD | |
| LOG(INFO) << "Inside USE_ETCD"; | |
| #ifdef USE_ETCD |
```cpp
        close(received_fd);
        cleanupSocket(client_sock, client_socket_path);
        return -1;
    }
```

Copilot AI (Dec 12, 2025):
The received_fd is closed after a failed import, but it should also be closed in the success path after the handle is imported. File descriptors should be closed once they are no longer needed to avoid resource leaks.
Suggested change:

```cpp
    }
    close(received_fd);
```
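For reference, a minimal sketch of the intended descriptor lifecycle, assuming cuMemImportFromShareableHandle with CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR (which keeps its own reference to the allocation, so the caller may close its copy of the fd on both paths); surrounding names are taken from the diff:

```cpp
CUmemGenericAllocationHandle imported_handle;
CUresult result = cuMemImportFromShareableHandle(
    &imported_handle,
    (void *)(uintptr_t)received_fd,  // fd is passed by value for this handle type
    CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

// Whether the import succeeded or failed, this process no longer
// needs its copy of the received descriptor.
close(received_fd);

if (result != CUDA_SUCCESS) {
    cleanupSocket(client_sock, client_socket_path);
    return -1;
}
```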
mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp (outdated review comment; resolved)
No further technical problems for me. But I'd like to resolve the issues opened by Copilot (mostly removing logging).
Hi Feng, I've removed the redundant log prints and added an automatic detection method for nvlink_allocator.
```cpp
#include <string>
#include <vector>
#include <utility>
#include <cuda.h>
```
Now this xport will be used by NV, HIP, and other vendors, so consider changing it to "cuda_alike.h".
In the current implementation, we use the regular cudaIpcXxx APIs to share device memory if use_fabric_mem_ is true. So is the POSIX-based method necessary? They seem to achieve the same goal (i.e., intra-node NVLink communication).
It seems like the cudaIpcXXX API is not feasible on 8 * H20; I have tested it and it cannot achieve the goal of intra-node NVLink transport. I'll give it a try again.
I have no idea why …
Description
Type of Change
How Has This Been Tested?
This patch is validated by transfer_engine_bench, which uses one GPU to transport data to another GPU on an 8 * H20 machine.
The topology of the machine is shown below:

The screenshot of the test result is:

You can observe the NVLink traffic through nvidia-smi dmon when running the test:
[image: nvidia-smi dmon output]
Checklist