Skip to content

Conversation

@luketong777
Copy link
Collaborator

Description

This feature identifies and determines whether GPU devices and InfiniBand (IB) devices are located under the same PCIe Root Complex.

  1. PCIe Topology Discovery
    The isSamePcieRootComplex function determines whether two PCI devices are under the same root complex
    Uses getCommonParent and isPciRootComplex functions for path analysis
  2. Device Association Optimization
    In the discoverCudaTopology function, prioritizes IB devices that share the same PCIe root with the GPU
    Establishes mapping relationships from GPUs to optimal IB devices to optimize data transfer paths

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

I have tested the PCIe root identification functionality using simulated parameters, covering basic functional scenarios and edge cases. However, this feature still requires validation testing in a real hardware environment where GPU and IB devices are located under the same PCIe root complex.

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @luketong777, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature to intelligently pair GPU devices with InfiniBand devices by detecting their shared PCIe Root Complex. This optimization aims to improve data transfer efficiency by ensuring that GPUs communicate with IB devices that are physically closest in the system's PCIe topology, thereby reducing latency and enhancing overall performance for high-throughput applications.

Highlights

  • PCIe Root Complex Detection: Introduced new utility functions (getCommonParentLength, getCommonParent, isPciRootComplex, isSamePcieRootComplex) to accurately determine if two PCI devices share the same PCIe Root Complex.
  • Optimized Device Pairing Logic: Modified the discoverCudaTopology function to prioritize InfiniBand (IB) devices that reside under the same PCIe Root Complex as the GPU, before falling back to NUMA node proximity or general device availability.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@luketong777
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to detect if a GPU and an InfiniBand device share the same PCIe root complex, aiming to optimize device pairing for data transfers. The implementation involves traversing the sysfs file system to determine device topology. My review has identified two critical compilation errors due to typos that must be fixed. I've also provided suggestions to improve code readability and safety by refactoring a complex loop and replacing a C-style function call with a more modern C++ approach.

Comment on lines +171 to +187
static size_t getCommonParentLength(const char *path1, const char *path2) {
if (!path1 || !path2) {
return 0;
}

size_t offset = 0;
size_t parent_length = 0;

do {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
} while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0'));

return parent_length;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic in this do-while loop is complex and hard to follow due to the combined condition checking and post-increment of offset. While it appears to be correct, consider refactoring it for better readability and maintainability. A for loop that iterates through the paths and explicitly tracks the last common directory separator would be clearer.

static size_t getCommonParentLength(const char *path1, const char *path2) {
    if (!path1 || !path2) {
        return 0;
    }

    size_t parent_length = 0;
    for (size_t offset = 0; ; ++offset) {
        if ((path1[offset] == '/' || path1[offset] == '\0') &&
            (path2[offset] == '/' || path2[offset] == '\0')) {
            parent_length = offset;
        }

        if (path1[offset] != path2[offset] || path1[offset] == '\0') {
            break;
        }
    }

    return parent_length;
}

Comment on lines +200 to +201
sscanf(path, "/sys/devices/pci%*x:%*x%n", &count);
return count == strlen(path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using sscanf can be risky and is generally discouraged in modern C++ for safety reasons. While its use here for pattern matching on a sysfs path seems contained, consider replacing it with a more robust and C++-idiomatic approach. For example, you could use std::string manipulation functions or std::regex for a safer and more explicit pattern validation.

Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a useful optimization for selecting InfiniBand devices by considering PCIe topology. The new functions for identifying the PCIe root complex are a good addition. However, I've found a few critical issues, including typos that will cause compilation errors, and a logic issue in path manipulation that could lead to incorrect behavior. Please address these points to ensure the feature works as expected.

Comment on lines +171 to +187
static size_t getCommonParentLength(const char *path1, const char *path2) {
if (!path1 || !path2) {
return 0;
}

size_t offset = 0;
size_t parent_length = 0;

do {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
} while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0'));

return parent_length;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic in getCommonParentLength is complex and doesn't correctly handle all cases. For example, for paths "/a/b" and "/a/bc", the common parent is "/a", but the current implementation returns an empty path. This can lead to incorrect PCIe root complex detection. A clearer implementation using std::string_view would be more robust. You may need to add #include <string_view> if it's not already available.

static size_t getCommonParentLength(const char *path1, const char *path2) {
    if (!path1 || !path2) {
        return 0;
    }

    std::string_view p1(path1);
    std::string_view p2(path2);

    size_t common_len = 0;
    while (common_len < p1.length() && common_len < p2.length() && p1[common_len] == p2[common_len]) {
        common_len++;
    }

    // If one path is a prefix of the other, it's the common parent.
    if (common_len == p1.length() && (p2.length() == common_len || p2[common_len] == '/')) {
        return p1.length();
    }
    if (common_len == p2.length() && (p1.length() == common_len || p1[common_len] == '/')) {
        return p2.length();
    }

    // Backtrack to the last directory separator in the common part.
    auto common_prefix = p1.substr(0, common_len);
    auto last_slash = common_prefix.rfind('/');
    if (last_slash == std::string_view::npos) {
        return 0;
    }
    if (last_slash == 0) {
        return 1; // for root "/"
    }
    return last_slash;
}

Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
@luketong777 luketong777 reopened this Dec 24, 2025
Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
@alogfans
Copy link
Collaborator

What's the difference with getPciDistance-based resolution?

Comment on lines +268 to 271
if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) {
samePCIeRoot_hca.push_back(hca);
} else if (isSameNumaNode(hca.pci_bus_id.c_str(), pci_bus_id)) {
sameNuma_hca.push_back(hca);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should SameNumaNode be the first condition? Then it is checking the same PCIeRootComplex

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that SamePCIeRootComplex necessarily implies SameNumaNode. Therefore, if an if-else construct first checks for SameNumaNode, it would capture all devices residing on the same NUMA node, thereby preventing the more specific SamePCIeRootComplex check from ever being executed.

@alogfans
Copy link
Collaborator

What's the difference with getPciDistance-based resolution?

In most the cases, devices in the same PCIe root complex naturally has short distance. So I'm not sure whether there is a case that the result changes with and without this feature.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements PCIe Root Complex detection to optimize GPU-InfiniBand device pairing by identifying whether devices share the same PCIe root complex, enabling more efficient data transfer path selection.

Key changes:

  • Added helper functions to detect and compare PCIe root complex topology
  • Updated device discovery logic to prioritize HCAs under the same PCIe root complex as the GPU
  • Introduced a three-tier prioritization: same PCIe root > same NUMA node > all devices

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

int min_distance = INT_MAX;
std::vector<std::string> min_distance_hcas;

std::vector<InfinibandDevice> samePCIeRoot_hca;
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").

Copilot uses AI. Check for mistakes.
for (const auto &hca : all_hca) {
if (isSameNumaNode(hca.pci_bus_id.c_str(), pci_bus_id)) {
if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) {
samePCIeRoot_hca.push_back(hca);
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").

Copilot uses AI. Check for mistakes.
}
const auto &candidate_preferred_hca =
sameNuma_hca.empty() ? all_hca : sameNuma_hca;
!samePCIeRoot_hca.empty() ? samePCIeRoot_hca
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider this comment.

Comment on lines +171 to +241
static size_t getCommonParentLength(const char *path1, const char *path2) {
if (!path1 || !path2) {
return 0;
}

size_t offset = 0;
size_t parent_length = 0;

do {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
} while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0'));

return parent_length;
}

static std::string getCommonParent(const char *path1, const char *path2) {
size_t parent_length = getCommonParentLength(path1, path2);
return std::string(path1, parent_length);
}

static bool isPciRootComplex(const char *path) {
if (!path) {
return false;
}

int count = -1;
sscanf(path, "/sys/devices/pci%*x:%*x%n", &count);
return count == strlen(path);
}

static bool isSamePcieRootComplex(const char *bus1, const char *bus2) {
if (!bus1 || !bus2) {
return false;
}

char buf[PATH_MAX];
char resolved1[PATH_MAX];
char resolved2[PATH_MAX];

snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%s", bus1);
if (realpath(buf, resolved1) == NULL) {
return false;
}

snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%s", bus2);
if (realpath(buf, resolved2) == NULL) {
return false;
}

std::string common = getCommonParent(resolved1, resolved2);
if (common.empty()) {
return false;
}

while (!common.empty()) {
if (isPciRootComplex(common.c_str())) {
return true;
}

auto pos = common.find_last_of('/');
if (pos == std::string::npos) {
break;
}
common.resize(pos);
}

return false;
}
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new PCIe root complex detection functions (getCommonParentLength, getCommonParent, isPciRootComplex, and isSamePcieRootComplex) lack test coverage. Given that existing topology functions have comprehensive test coverage in topology_test.cpp, consider adding unit tests to validate these new functions, including edge cases such as: devices under the same root complex, devices under different root complexes, invalid PCI bus IDs, and various path patterns.

Copilot uses AI. Check for mistakes.
Comment on lines +199 to +201
int count = -1;
sscanf(path, "/sys/devices/pci%*x:%*x%n", &count);
return count == strlen(path);
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function isPciRootComplex could fail to correctly identify PCIe root complex paths due to potential misuse of sscanf. The function attempts to match the pattern "/sys/devices/pci%*x:%*x" but doesn't verify that the matched portion ends the string correctly. If the path is "/sys/devices/pci0000:00/extra", the sscanf would match up to "pci0000:00" and set count accordingly, but the comparison with strlen(path) would correctly fail. However, the logic is fragile. Consider using a more explicit check or documenting the expected path format.

Suggested change
int count = -1;
sscanf(path, "/sys/devices/pci%*x:%*x%n", &count);
return count == strlen(path);
int count = 0;
int rc = sscanf(path, "/sys/devices/pci%*x:%*x%n", &count);
return (rc == 1) && (static_cast<size_t>(count) == strlen(path));

Copilot uses AI. Check for mistakes.
Comment on lines +179 to +185
do {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
} while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0'));

Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop condition in getCommonParentLength uses post-increment (path1[offset++]) which increments offset AFTER comparing characters. This means when characters mismatch or when reaching the null terminator, offset has already been incremented. While this doesn't cause incorrect behavior in the current implementation, it makes the code harder to understand. Consider restructuring the loop to increment offset explicitly at the end of the loop body for better clarity.

Suggested change
do {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
} while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0'));
for (;;) {
if (((path1[offset] == '/') || (path1[offset] == '\0')) &&
((path2[offset] == '/') || (path2[offset] == '\0'))) {
parent_length = offset;
}
if ((path1[offset] != path2[offset]) || (path1[offset] == '\0')) {
break;
}
++offset;
}

Copilot uses AI. Check for mistakes.
Comment on lines +268 to +269
if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) {
samePCIeRoot_hca.push_back(hca);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the order? First, checking NUMA node, then PCIe

}
const auto &candidate_preferred_hca =
sameNuma_hca.empty() ? all_hca : sameNuma_hca;
!samePCIeRoot_hca.empty() ? samePCIeRoot_hca
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same

Copy link
Collaborator

@stmatengss stmatengss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some potential issues are still unresolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants