-
Notifications
You must be signed in to change notification settings - Fork 497
[TE/topology]Implement PCIe Root Complex detection for optimal GPU-InfiniBand pairing #1270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
Summary of ChangesHello @luketong777, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new feature to intelligently pair GPU devices with InfiniBand devices by detecting their shared PCIe Root Complex. This optimization aims to improve data transfer efficiency by ensuring that GPUs communicate with IB devices that are physically closest in the system's PCIe topology, thereby reducing latency and enhancing overall performance for high-throughput applications. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a mechanism to detect if a GPU and an InfiniBand device share the same PCIe root complex, aiming to optimize device pairing for data transfers. The implementation involves traversing the sysfs file system to determine device topology. My review has identified two critical compilation errors due to typos that must be fixed. I've also provided suggestions to improve code readability and safety by refactoring a complex loop and replacing a C-style function call with a more modern C++ approach.
| static size_t getCommonParentLength(const char *path1, const char *path2) { | ||
| if (!path1 || !path2) { | ||
| return 0; | ||
| } | ||
|
|
||
| size_t offset = 0; | ||
| size_t parent_length = 0; | ||
|
|
||
| do { | ||
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | ||
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | ||
| parent_length = offset; | ||
| } | ||
| } while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0')); | ||
|
|
||
| return parent_length; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this do-while loop is complex and hard to follow due to the combined condition checking and post-increment of offset. While it appears to be correct, consider refactoring it for better readability and maintainability. A for loop that iterates through the paths and explicitly tracks the last common directory separator would be clearer.
static size_t getCommonParentLength(const char *path1, const char *path2) {
if (!path1 || !path2) {
return 0;
}
size_t parent_length = 0;
for (size_t offset = 0; ; ++offset) {
if ((path1[offset] == '/' || path1[offset] == '\0') &&
(path2[offset] == '/' || path2[offset] == '\0')) {
parent_length = offset;
}
if (path1[offset] != path2[offset] || path1[offset] == '\0') {
break;
}
}
return parent_length;
}| sscanf(path, "/sys/devices/pci%*x:%*x%n", &count); | ||
| return count == strlen(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using sscanf can be risky and is generally discouraged in modern C++ for safety reasons. While its use here for pattern matching on a sysfs path seems contained, consider replacing it with a more robust and C++-idiomatic approach. For example, you could use std::string manipulation functions or std::regex for a safer and more explicit pattern validation.
Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a useful optimization for selecting InfiniBand devices by considering PCIe topology. The new functions for identifying the PCIe root complex are a good addition. However, I've found a few critical issues, including typos that will cause compilation errors, and a logic issue in path manipulation that could lead to incorrect behavior. Please address these points to ensure the feature works as expected.
| static size_t getCommonParentLength(const char *path1, const char *path2) { | ||
| if (!path1 || !path2) { | ||
| return 0; | ||
| } | ||
|
|
||
| size_t offset = 0; | ||
| size_t parent_length = 0; | ||
|
|
||
| do { | ||
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | ||
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | ||
| parent_length = offset; | ||
| } | ||
| } while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0')); | ||
|
|
||
| return parent_length; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in getCommonParentLength is complex and doesn't correctly handle all cases. For example, for paths "/a/b" and "/a/bc", the common parent is "/a", but the current implementation returns an empty path. This can lead to incorrect PCIe root complex detection. A clearer implementation using std::string_view would be more robust. You may need to add #include <string_view> if it's not already available.
static size_t getCommonParentLength(const char *path1, const char *path2) {
if (!path1 || !path2) {
return 0;
}
std::string_view p1(path1);
std::string_view p2(path2);
size_t common_len = 0;
while (common_len < p1.length() && common_len < p2.length() && p1[common_len] == p2[common_len]) {
common_len++;
}
// If one path is a prefix of the other, it's the common parent.
if (common_len == p1.length() && (p2.length() == common_len || p2[common_len] == '/')) {
return p1.length();
}
if (common_len == p2.length() && (p1.length() == common_len || p1[common_len] == '/')) {
return p2.length();
}
// Backtrack to the last directory separator in the common part.
auto common_prefix = p1.substr(0, common_len);
auto last_slash = common_prefix.rfind('/');
if (last_slash == std::string_view::npos) {
return 0;
}
if (last_slash == 0) {
return 1; // for root "/"
}
return last_slash;
}Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
Signed-off-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
|
What's the difference with |
| if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) { | ||
| samePCIeRoot_hca.push_back(hca); | ||
| } else if (isSameNumaNode(hca.pci_bus_id.c_str(), pci_bus_id)) { | ||
| sameNuma_hca.push_back(hca); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should SameNumaNode be the first condition? Then it is checking the same PCIeRootComplex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that SamePCIeRootComplex necessarily implies SameNumaNode. Therefore, if an if-else construct first checks for SameNumaNode, it would capture all devices residing on the same NUMA node, thereby preventing the more specific SamePCIeRootComplex check from ever being executed.
In most the cases, devices in the same PCIe root complex naturally has short distance. So I'm not sure whether there is a case that the result changes with and without this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements PCIe Root Complex detection to optimize GPU-InfiniBand device pairing by identifying whether devices share the same PCIe root complex, enabling more efficient data transfer path selection.
Key changes:
- Added helper functions to detect and compare PCIe root complex topology
- Updated device discovery logic to prioritize HCAs under the same PCIe root complex as the GPU
- Introduced a three-tier prioritization: same PCIe root > same NUMA node > all devices
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int min_distance = INT_MAX; | ||
| std::vector<std::string> min_distance_hcas; | ||
|
|
||
| std::vector<InfinibandDevice> samePCIeRoot_hca; |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").
| for (const auto &hca : all_hca) { | ||
| if (isSameNumaNode(hca.pci_bus_id.c_str(), pci_bus_id)) { | ||
| if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) { | ||
| samePCIeRoot_hca.push_back(hca); |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").
| } | ||
| const auto &candidate_preferred_hca = | ||
| sameNuma_hca.empty() ? all_hca : sameNuma_hca; | ||
| !samePCIeRoot_hca.empty() ? samePCIeRoot_hca |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name "samePCIeRoot_hca" uses inconsistent casing with "PCIe" having mixed case while the rest uses underscore_case. Consider renaming to "same_pcie_root_hca" to match the naming convention used elsewhere in the codebase (e.g., "sameNuma_hca", "min_distance_hcas").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider this comment.
| static size_t getCommonParentLength(const char *path1, const char *path2) { | ||
| if (!path1 || !path2) { | ||
| return 0; | ||
| } | ||
|
|
||
| size_t offset = 0; | ||
| size_t parent_length = 0; | ||
|
|
||
| do { | ||
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | ||
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | ||
| parent_length = offset; | ||
| } | ||
| } while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0')); | ||
|
|
||
| return parent_length; | ||
| } | ||
|
|
||
| static std::string getCommonParent(const char *path1, const char *path2) { | ||
| size_t parent_length = getCommonParentLength(path1, path2); | ||
| return std::string(path1, parent_length); | ||
| } | ||
|
|
||
| static bool isPciRootComplex(const char *path) { | ||
| if (!path) { | ||
| return false; | ||
| } | ||
|
|
||
| int count = -1; | ||
| sscanf(path, "/sys/devices/pci%*x:%*x%n", &count); | ||
| return count == strlen(path); | ||
| } | ||
|
|
||
| static bool isSamePcieRootComplex(const char *bus1, const char *bus2) { | ||
| if (!bus1 || !bus2) { | ||
| return false; | ||
| } | ||
|
|
||
| char buf[PATH_MAX]; | ||
| char resolved1[PATH_MAX]; | ||
| char resolved2[PATH_MAX]; | ||
|
|
||
| snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%s", bus1); | ||
| if (realpath(buf, resolved1) == NULL) { | ||
| return false; | ||
| } | ||
|
|
||
| snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%s", bus2); | ||
| if (realpath(buf, resolved2) == NULL) { | ||
| return false; | ||
| } | ||
|
|
||
| std::string common = getCommonParent(resolved1, resolved2); | ||
| if (common.empty()) { | ||
| return false; | ||
| } | ||
|
|
||
| while (!common.empty()) { | ||
| if (isPciRootComplex(common.c_str())) { | ||
| return true; | ||
| } | ||
|
|
||
| auto pos = common.find_last_of('/'); | ||
| if (pos == std::string::npos) { | ||
| break; | ||
| } | ||
| common.resize(pos); | ||
| } | ||
|
|
||
| return false; | ||
| } |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new PCIe root complex detection functions (getCommonParentLength, getCommonParent, isPciRootComplex, and isSamePcieRootComplex) lack test coverage. Given that existing topology functions have comprehensive test coverage in topology_test.cpp, consider adding unit tests to validate these new functions, including edge cases such as: devices under the same root complex, devices under different root complexes, invalid PCI bus IDs, and various path patterns.
| int count = -1; | ||
| sscanf(path, "/sys/devices/pci%*x:%*x%n", &count); | ||
| return count == strlen(path); |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function isPciRootComplex could fail to correctly identify PCIe root complex paths due to potential misuse of sscanf. The function attempts to match the pattern "/sys/devices/pci%*x:%*x" but doesn't verify that the matched portion ends the string correctly. If the path is "/sys/devices/pci0000:00/extra", the sscanf would match up to "pci0000:00" and set count accordingly, but the comparison with strlen(path) would correctly fail. However, the logic is fragile. Consider using a more explicit check or documenting the expected path format.
| int count = -1; | |
| sscanf(path, "/sys/devices/pci%*x:%*x%n", &count); | |
| return count == strlen(path); | |
| int count = 0; | |
| int rc = sscanf(path, "/sys/devices/pci%*x:%*x%n", &count); | |
| return (rc == 1) && (static_cast<size_t>(count) == strlen(path)); |
| do { | ||
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | ||
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | ||
| parent_length = offset; | ||
| } | ||
| } while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0')); | ||
|
|
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop condition in getCommonParentLength uses post-increment (path1[offset++]) which increments offset AFTER comparing characters. This means when characters mismatch or when reaching the null terminator, offset has already been incremented. While this doesn't cause incorrect behavior in the current implementation, it makes the code harder to understand. Consider restructuring the loop to increment offset explicitly at the end of the loop body for better clarity.
| do { | |
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | |
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | |
| parent_length = offset; | |
| } | |
| } while ((path1[offset] == path2[offset]) && (path1[offset++] != '\0')); | |
| for (;;) { | |
| if (((path1[offset] == '/') || (path1[offset] == '\0')) && | |
| ((path2[offset] == '/') || (path2[offset] == '\0'))) { | |
| parent_length = offset; | |
| } | |
| if ((path1[offset] != path2[offset]) || (path1[offset] == '\0')) { | |
| break; | |
| } | |
| ++offset; | |
| } |
| if (isSamePcieRootComplex(hca.pci_bus_id.c_str(), pci_bus_id)) { | ||
| samePCIeRoot_hca.push_back(hca); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the order? First, checking NUMA node, then PCIe
| } | ||
| const auto &candidate_preferred_hca = | ||
| sameNuma_hca.empty() ? all_hca : sameNuma_hca; | ||
| !samePCIeRoot_hca.empty() ? samePCIeRoot_hca |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same
stmatengss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some potential issues are still unresolved.
Description
This feature identifies and determines whether GPU devices and InfiniBand (IB) devices are located under the same PCIe Root Complex.
The isSamePcieRootComplex function determines whether two PCI devices are under the same root complex
Uses getCommonParent and isPciRootComplex functions for path analysis
In the discoverCudaTopology function, prioritizes IB devices that share the same PCIe root with the GPU
Establishes mapping relationships from GPUs to optimal IB devices to optimize data transfer paths
Type of Change
How Has This Been Tested?
I have tested the PCIe root identification functionality using simulated parameters, covering basic functional scenarios and edge cases. However, this feature still requires validation testing in a real hardware environment where GPU and IB devices are located under the same PCIe root complex.
Checklist