
Conversation

@TTThanos (Contributor) commented Dec 11, 2025

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

This patch was validated with transfer_engine_bench, which uses one GPU to transfer data to another GPU on an 8 * H20 machine.

The topology of the machine is shown below:
[image: machine topology]

The screenshot of the test result:
[image: benchmark result]

You can observe the NVLink traffic volume through nvidia-smi dmon while running the test:
[image: nvidia-smi dmon output]

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist commented:

Summary of Changes

Hello @TTThanos, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the intra-node NVLink transfer capabilities within the Mooncake Transfer Engine. The primary goal is to provide a more robust and flexible method for sharing GPU memory between processes on the same machine. It achieves this by implementing a dynamic system that prioritizes CUDA Fabric handles for optimal performance but gracefully falls back to a POSIX file descriptor-based inter-process communication (IPC) mechanism when Fabric handles are unavailable. This fallback involves a dedicated Unix domain socket server to manage the secure exchange of memory handles, ensuring broad compatibility across different NVIDIA GPU architectures and CUDA versions. The changes involve substantial refactoring of the NvlinkTransport and TransferMetadata components to support this new, metadata-driven memory sharing approach.

Highlights

  • Enhanced NVLink Memory Sharing: Implements a new mechanism for intra-node GPU memory sharing via NVLink, supporting both CUDA Fabric handles and a fallback to POSIX file descriptors for broader compatibility.
  • Unix Domain Socket IPC: Introduces a server-client model using Unix domain sockets to facilitate the secure exchange of GPU memory file descriptors between processes on the same node (see the sketch after this list).
  • Dynamic Memory Backend Detection: Adds logic to dynamically detect whether CUDA Fabric memory handles are supported and automatically falls back to POSIX file descriptor-based IPC if not.
  • Refactored NVLink Transport: The NvlinkTransport class is significantly updated to manage the export server thread, handle memory export/import, and integrate with the new metadata-driven sharing approach.
  • Metadata for Memory Descriptors: Extends TransferMetadata to include a generic metadata map within BufferDesc, allowing for flexible storage of handle-specific information (e.g., handle type, process ID, socket path).
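As referenced in the highlights above, the fallback path exchanges GPU memory file descriptors over a Unix domain socket. Below is a minimal, self-contained sketch of the standard SCM_RIGHTS mechanism such an exchange typically relies on; the helper names send_fd and recv_fd are illustrative, not the PR's actual functions.

#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

// Send one file descriptor over a connected Unix domain socket. The kernel
// duplicates the descriptor into the receiving process.
static int send_fd(int sock, int fd) {
    char payload = 'F';  // at least one data byte must accompany the fd
    struct iovec iov = {&payload, sizeof(payload)};
    alignas(struct cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    struct msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

// Receive the descriptor on the peer side; returns the new fd or -1.
static int recv_fd(int sock) {
    char payload;
    struct iovec iov = {&payload, sizeof(payload)};
    alignas(struct cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    struct msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(sock, &msg, 0) < 0) return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

In such a design, the exporter would obtain the fd from cuMemExportToShareableHandle with CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, and the importer would hand the received fd to cuMemImportFromShareableHandle.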

@TTThanos changed the title from "Feature intra node nvlink" to "[Feature] Intranode NVLINK Transport support in Hopper GPU" on Dec 11, 2025
@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant new feature for intra-node communication using NVLink, with a fallback mechanism to POSIX file descriptors over Unix domain sockets when CUDA Fabric is unavailable. The implementation is comprehensive, involving a client-server model for exchanging memory handles. While the feature is a great addition, the code contains several critical issues, including security vulnerabilities due to overly permissive file modes on sockets and a buffer overflow bug that could lead to crashes. Additionally, there are numerous debug log statements that should be removed before merging. My review provides specific feedback on these critical areas, along with suggestions to improve code clarity and maintainability.

#ifdef USE_MNNVL
LOG(INFO) << "Prepare to shutdown nvlink transport";
mooncake::NvlinkTransport::freePinnedLocalMemory(addr[i]);
xport->shutdownServer();

critical

The shutdownServer() function is being called inside a for loop. This function is intended to be called only once to shut down the transport server. Calling it multiple times, especially within a loop, can lead to undefined behavior or crashes on subsequent iterations as it attempts to operate on an already-closed transport. This should be moved outside and after the loop.
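For illustration, the shutdown could be restructured like this (a sketch based on the quoted snippet; the surrounding loop is assumed from the benchmark code):

for (int i = 0; i < buffer_num; ++i) {
    engine->unregisterLocalMemory(addr[i]);
#ifdef USE_MNNVL
    mooncake::NvlinkTransport::freePinnedLocalMemory(addr[i]);
#endif
}
#ifdef USE_MNNVL
// Shut the export server down exactly once, after all buffers are released.
xport->shutdownServer();
#endif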

close(client_sock);
return -1;
}
chmod(client_socket_path.c_str(), 0777);

critical

Setting the client socket file permission to 0777 is a significant security risk. It allows any user on the system to interact with the socket. The permissions should be restricted to the owner, for example, 0700.

                    chmod(client_socket_path.c_str(), 0700);


if (result == CUDA_SUCCESS){
// Blackwell GB200
memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));

critical

This memcpy writes to the buffer of an empty std::vector<uint8_t> shm_data. This will result in a buffer overflow and undefined behavior, likely crashing the application. You must resize the vector to the appropriate size before copying data into it.

            shm_data.resize(sizeof(fabric_handle));
            memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));

return;
}

chmod(path.c_str(), 0777);

critical

Setting the socket file permission to 0777 is a significant security risk, as it gives world-readable, -writable, and -executable permissions. This could allow any user on the system to interact with the socket, potentially leading to unauthorized access or denial of service. The permissions should be restricted to the owner, for example, 0700.

    chmod(path.c_str(), 0700);
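As a side note, calling chmod after bind leaves a brief window where the socket exists with default permissions. A common race-free pattern (a sketch, assuming a pathname Unix socket on Linux, where bind() creates the socket with mode 0777 & ~umask) is to tighten the umask around bind():

#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>

static int bind_private(int sock, const struct sockaddr_un &addr) {
    mode_t old_umask = umask(0077);  // socket is created mode 0700
    int rc = bind(sock, (const struct sockaddr *)&addr, sizeof(addr));
    umask(old_umask);                // restore the previous process umask
    return rc;
}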


desc.metadata["handle_type"] = "2"; // POSIX_FD
desc.metadata["export_pid"] = std::to_string(getpid());
desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd";

high

This line contains a garbled comment and what appears to be leftover debugging code (OG(INFO) << ...). This should be cleaned up.

            desc.metadata["socket_path"] = getSocketPath();  // e.g. /tmp/nvlink_export_12345

Comment on lines 587 to 589
LOG(INFO) << "register memory: addr " << addr << ", length " << length;
if (globalConfig().trace) {
LOG(INFO) << "register memory: addr " << addr << ", length " << length;

medium

The log message on line 587 is duplicated. The one on line 589 is conditional, while the one on 587 is not. The unconditional log should probably be removed to avoid redundant logging.

    if (globalConfig().trace) {
        LOG(INFO) << "register memory: addr " << addr << ", length " << length;
    }

prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
// prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

medium

This commented-out line of code should be removed to keep the codebase clean.


private:

void startExportServer();

medium

The function startExportServer() is declared here and defined in the .cpp file, but it is never called. If this function is not used, it should be removed to avoid dead code.

Comment on lines 27 to 33
struct ShareableHandle {
int type; // 2 = POSIX_FD
union {
int fd;
uint8_t fabric[32];
} value;
};

medium

The ShareableHandle struct appears to be unused throughout the codebase. If it is indeed dead code, it should be removed to improve code clarity and maintainability.

Comment on lines 911 to 937
result = cuMemAddressReserve((CUdeviceptr*)&mapped_addr, entry.length, 0, 0, 0);
if (result != CUDA_SUCCESS) goto fail;

result = cuMemMap((CUdeviceptr)mapped_addr, entry.length, 0, imported_fd, 0);
if (result != CUDA_SUCCESS) goto fail;

// Grant access
for (int i = 0; i < device_count; ++i) {
access_descs[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
access_descs[i].location.id = i;
access_descs[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}

result = cuMemSetAccess((CUdeviceptr)mapped_addr, entry.length,
access_descs.data(), device_count);
if (result != CUDA_SUCCESS) goto fail;

OpenedShmEntry shm_entry;
shm_entry.shm_addr = mapped_addr;
shm_entry.length = length;
remap_entries_[std::make_pair(target_id, entry.addr)] =
shm_entry;

dest_addr = dest_addr - entry.addr + (uint64_t)mapped_addr;
return 0;

fail:

medium

The use of goto fail for error handling is C-style and can make the code harder to follow and maintain in C++. Consider using RAII principles for resource management. For example, you could use std::unique_ptr with custom deleters or a scope guard object to ensure that cleanup logic (like cuMemUnmap, cuMemRelease, cleanupSocket) is automatically executed when the scope is exited, whether normally or due to an error. This would eliminate the need for goto and make the resource handling more robust.
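For example, a minimal scope guard (a sketch, not the PR's code; names are illustrative) lets each early return trigger the cleanup automatically:

#include <functional>
#include <utility>

class ScopeGuard {
  public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() {
        if (fn_) fn_();  // run cleanup unless dismissed
    }
    void dismiss() { fn_ = nullptr; }

  private:
    std::function<void()> fn_;
};

// Hypothetical usage in the import path:
//   ScopeGuard guard([&] { /* cuMemUnmap / cuMemRelease / cleanupSocket */ });
//   if (cuMemMap(...) != CUDA_SUCCESS) return -1;  // guard runs the cleanup
//   guard.dismiss();  // success: keep the mapping, skip the cleanup
//   return 0;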

@stmatengss (Collaborator) commented:

Please use clang-format or pre-commit for format checking.

Copilot AI left a comment:

Pull request overview

This PR adds intranode NVLINK transport support for Hopper GPUs by implementing a POSIX file descriptor-based memory handle sharing mechanism. The implementation creates a Unix domain socket server to export GPU memory handles between processes when the CUDA Fabric memory API is not available, enabling NVLINK communication on Hopper architecture GPUs that don't support the CU_MEM_HANDLE_TYPE_FABRIC handle type.

Key changes:

  • Implements fallback from FABRIC handles to POSIX_FD handles for Hopper GPU compatibility (see the detection sketch after this list)
  • Adds Unix domain socket-based IPC mechanism for sharing file descriptors across processes
  • Enhances metadata serialization to include handle type and socket path information
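As referenced above, here is a minimal sketch of the backend-detection idea. This is an assumed shape, not the PR's actual code: CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED requires a recent CUDA toolkit, and the PR itself appears to probe support by attempting a FABRIC allocation and falling back on failure, which is equally valid.

#include <cuda.h>

// Prefer fabric handles where the device supports exporting them; otherwise
// fall back to POSIX file descriptors shared over a Unix domain socket.
CUmemAllocationHandleType pickHandleType(CUdevice dev) {
    int fabric_supported = 0;
    cuDeviceGetAttribute(&fabric_supported,
                         CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED,
                         dev);
    return fabric_supported ? CU_MEM_HANDLE_TYPE_FABRIC
                            : CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
}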

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 36 comments.

Summary per file:

  • mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp: Core implementation of the POSIX FD export/import server, memory backend detection, and socket-based IPC communication
  • mooncake-transfer-engine/include/transport/nvlink_transport/nvlink_transport.h: Added MemoryBackend enum, ExportedBuffer struct, and server management methods
  • mooncake-transfer-engine/include/transfer_metadata.h: Added metadata map to BufferDesc for storing handle type and socket information
  • mooncake-transfer-engine/src/transfer_metadata.cpp: Implemented JSON serialization/deserialization for the new metadata field
  • mooncake-transfer-engine/src/transfer_metadata_plugin.cpp: Added debug logging for connection string parsing
  • mooncake-transfer-engine/example/transfer_engine_bench.cpp: Added shutdown call for NvlinkTransport server cleanup
  • mooncake-transfer-engine/src/multi_transport.cpp: Added commented debug logging
  • mooncake-transfer-engine/nvlink-allocator/nvlink_allocator.cpp: Added commented alternative handle type



desc.metadata["handle_type"] = "2"; // POSIX_FD
desc.metadata["export_pid"] = std::to_string(getpid());
desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd";
Copilot AI commented on Dec 12, 2025:

Comment contains Chinese characters and incomplete text. Use English and complete the comment properly.

Suggested change:
- desc.metadata["socket_path"] = getSocketPath(); // 如 /tmp/nvlink_export_12345OG(INFO) << "Directly send fd";
+ desc.metadata["socket_path"] = getSocketPath(); // e.g., /tmp/nvlink_export_12345; directly send fd

desc.shm_name = serializeBinaryData(&fd, sizeof(int));
return metadata_->addLocalMemoryBuffer(desc, true);
}
LOG(INFO) << "still use shm_data";
Copilot AI commented on Dec 12, 2025:

This LOG(INFO) statement appears to be leftover debug logging. Consider removing it for production code.

Suggested change:
- LOG(INFO) << "still use shm_data";

Comment on lines 376 to 379
server_running_ = true;
export_server_thread_ = std::thread(&NvlinkTransport::exportServerLoop, this);
LOG(INFO) << "NvlinkTransport: FD export server started at " << getSocketPath();
}
Copilot AI commented on Dec 12, 2025:

Race condition: The export server thread is started in the constructor before the object is fully constructed. If the server thread accesses member variables during initialization, it could lead to undefined behavior. Consider moving the thread start to a separate init() method that is called after construction is complete.

Suggested change:
- server_running_ = true;
- export_server_thread_ = std::thread(&NvlinkTransport::exportServerLoop, this);
- LOG(INFO) << "NvlinkTransport: FD export server started at " << getSocketPath();
- }
+ server_running_ = true;
+ // Thread is now started in init()
+ }
+
+ void NvlinkTransport::init() {
+     export_server_thread_ = std::thread(&NvlinkTransport::exportServerLoop, this);
+     LOG(INFO) << "NvlinkTransport: FD export server started at " << getSocketPath();
+ }

if (result == CUDA_SUCCESS){
// Blackwell GB200
memcpy(shm_data.data(), &fabric_handle, sizeof(fabric_handle));
handle_type = 1; // 标记为 FABRIC
Copilot AI commented on Dec 12, 2025:

Comment contains Chinese characters. Use English for consistency with the rest of the codebase.

Suggested change:
- handle_type = 1; // 标记为 FABRIC
+ handle_type = 1; // Mark as FABRIC

Comment on lines 726 to 728
LOG(INFO) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle);
LOG(INFO) << "use_fabric_mem_ " << use_fabric_mem_;
LOG(INFO) << "size of output buffer " << output_buffer.size();
Copilot AI commented on Dec 12, 2025:

These LOG(INFO) statements appear to be leftover debug logging. Consider removing them or making them conditional on a trace flag.

Suggested change:
- LOG(INFO) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle);
- LOG(INFO) << "use_fabric_mem_ " << use_fabric_mem_;
- LOG(INFO) << "size of output buffer " << output_buffer.size();
+ VLOG(1) << "sizeof(CUmemFabricHandle) " << sizeof(CUmemFabricHandle);
+ VLOG(1) << "use_fabric_mem_ " << use_fabric_mem_;
+ VLOG(1) << "size of output buffer " << output_buffer.size();

for (int i = 0; i < buffer_num; ++i) {
engine->unregisterLocalMemory(addr[i]);
#ifdef USE_MNNVL
LOG(INFO) << "Prepare to shutdown nvlink transport";
Copilot AI commented on Dec 12, 2025:

This LOG(INFO) statement appears to be leftover debug logging. Consider removing it for production code.

Suggested change:
- LOG(INFO) << "Prepare to shutdown nvlink transport";


void startExportServer();
void exportServerLoop();
void cleanupExportServer();
Copilot AI commented on Dec 12, 2025:

The cleanupExportServer method is declared but never defined or used in the implementation. Consider removing it or implementing it if it's needed.

Suggested change:
- void cleanupExportServer();

A Collaborator replied, quoting Copilot's comment above:

The cleanupExportServer method is declared but never defined or used in the implementation. Consider removing it or implementing it if it's needed.

It is useful.

Comment on lines 1015 to 1020
LOG(INFO) << "Using Fabric Memory backend";
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
break;

case MemoryBackend::IPC_POSIX_FD:
LOG(INFO) << "Using POSIX_FD IPC backend";
Copilot AI commented on Dec 12, 2025:

These LOG(INFO) statements appear to be leftover debug logging. Consider removing them or making them conditional on a trace flag to reduce noise in production.

Comment on lines 537 to 539
LOG(INFO) << "The value of parse conn string: " << parsed_conn_string.first;
#ifdef USE_ETCD
LOG(INFO) << "Inside USE_ETCD";
Copilot AI commented on Dec 12, 2025:

These LOG statements appear to be leftover debug logging. Consider removing them for production code.

Suggested change:
- LOG(INFO) << "The value of parse conn string: " << parsed_conn_string.first;
- #ifdef USE_ETCD
- LOG(INFO) << "Inside USE_ETCD";
+ #ifdef USE_ETCD

close(received_fd);
cleanupSocket(client_sock, client_socket_path);
return -1;
}
Copilot AI commented on Dec 12, 2025:

The received_fd is closed after a failed import, but it should also be closed in the success path after the handle is imported. File descriptors should be closed once they are no longer needed to avoid resource leaks.

Suggested change:
- }
+ }
+ close(received_fd);

@TTThanos requested review from YiXR and alogfans on December 15, 2025 at 12:46
@alogfans (Collaborator) commented:

No further technical problems from me, but I'd like the issues opened by Copilot resolved (mostly removing logging statements).

@TTThanos (Contributor, Author) replied:

  No further technical problems from me, but I'd like the issues opened by Copilot resolved (mostly removing logging statements).

Hi Feng, I've removed the redundant log prints and added an automatic detection method for nvlink_allocator. Could you please test the functionality of my patch when you have time? Any feedback or suggestions would be appreciated.

#include <string>
#include <vector>
#include <utility>
#include <cuda.h>
A Collaborator commented:

Now this xport will be used by NVIDIA, HIP, and other vendors, so consider renaming this header to "cuda_alike.h".

A Collaborator commented:

In the current implementation, we use the regular cudaIpcXxx APIs to share device memory if use_fabric_mem_ is true. So is the POSIX-based method necessary? The two seem to achieve the same goal (i.e., intra-node NVLink communication).

@TTThanos (Contributor, Author) replied on Dec 17, 2025:

It seems the cudaIpcXXX API is not feasible on 8 * H20; I have tested it and it cannot achieve intra-node NVLink transport. I will give it another try.
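For reference, the cudaIpcXxx path discussed above looks roughly like the sketch below. One limitation that may explain the result reported here: cudaIpcGetMemHandle only accepts memory allocated with cudaMalloc, so it cannot export allocations created through the cuMem* virtual memory management API.

#include <cuda_runtime.h>

// Exporting process: allocate with cudaMalloc and publish the opaque handle.
void exporter(cudaIpcMemHandle_t *handle_out, void **ptr_out) {
    cudaMalloc(ptr_out, 1 << 20);
    cudaIpcGetMemHandle(handle_out, *ptr_out);
    // ... transmit the opaque handle bytes to the peer process ...
}

// Importing process: map the exporter's allocation into this address space.
void importer(cudaIpcMemHandle_t handle) {
    void *mapped = nullptr;
    cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    // ... read/write through `mapped`, e.g. with cudaMemcpy ...
    cudaIpcCloseMemHandle(mapped);
}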

@stmatengss (Collaborator) commented:

I have no idea why posix_file_test failed. Did you test it locally? @TTThanos
