Problem statement
Current situation:
When querying NVLink topology, go-nvml provides:
device.GetNvLinkRemotePciInfo(linkID), which returns the PciInfo of the remote device. This is incomplete because:
- A GPU can have multiple links connecting to the same remote device (NVSwitch or peer GPU)
- Each link connects to a different port on that remote device
- The PCI address alone doesn't tell us which port
For instance:
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-81..bcd)
Link 0: Remote Device 00000008:00:00.0: Link 32
Link 1: Remote Device 00000008:00:00.0: Link 33
What's available:
remotePCI, _ := device.GetNvLinkRemotePciInfo(0) // Returns "00000008:00:00.0"
remotePCI, _ = device.GetNvLinkRemotePciInfo(1) // Returns "00000008:00:00.0" (same)
What's missing:
remoteLink, _ := device.GetNvLinkRemoteLinkId(0)
// Should return: 32 for link 0, 33 for link 1
Use case:
NVSwitch SXid error messages look like this:
nvidia-nvswitch3: SXid (0008:00:00.0: 20034, Fatal, Link 29 LTSSM Fault Up
Q: Which GPU is affected by this NVSwitch Link 29 error?
Required mapping: (NVSwitch_PCI, Remote_Link) -> (GPU_ID, Local_Link)
Example reverse lookup map:
topology["0008:00:00.0"][29] = {GPUID: 0, LocalLink: 0}
topology["0008:00:00.0"][28] = {GPUID: 0, LocalLink: 1}
topology["0008:00:00.0"][32] = {GPUID: 5, LocalLink: 0}
topology["0008:00:00.0"][33] = {GPUID: 5, LocalLink: 1}
Current workaround
Right now, we depend on nvidia-smi nvlink -R to get output like:
$ nvidia-smi nvlink -R
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)
Link 0: Remote Device 00000008:00:00.0: Link 29
Link 1: Remote Device 00000008:00:00.0: Link 28
Link 2: Remote Device 00000005:00:00.0: Link 34
Link 3: Remote Device 00000005:00:00.0: Link 35
Link 4: Remote Device 00000007:00:00.0: Link 34
Link 5: Remote Device 00000007:00:00.0: Link 35
Link 6: Remote Device 0000000A:00:00.0: Link 26
Link 7: Remote Device 0000000A:00:00.0: Link 27
Link 8: Remote Device 00000006:00:00.0: Link 8
Link 9: Remote Device 00000006:00:00.0: Link 9
Link 10: Remote Device 00000009:00:00.0: Link 12
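Link 11: Remote Device 00000009:00:00.0: Link 13
The problems with this approach are:
- It blocks pure-Go applications, since it requires shelling out to an external binary
- It requires parsing human-readable text output, which is fragile
- It is not portable, since it needs nvidia-smi in $PATH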
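For reference, the workaround above can be sketched roughly as follows. This is a minimal illustration, not the code we actually run: the type names, the function name `ParseNvlinkRemote`, and the two regular expressions are assumptions based only on the sample output shown above, and any change to that text format would break it. The map keys keep the PCI address exactly as nvidia-smi prints it (`00000008:00:00.0`); matching that against the shorter form in the SXid message (`0008:00:00.0`) needs an extra normalization step that is left out here.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// LinkEndpoint identifies the GPU side of one NVLink connection.
type LinkEndpoint struct {
	GPUID     int
	LocalLink int
}

// Topology maps remote PCI address -> remote link number -> GPU endpoint.
type Topology map[string]map[int]LinkEndpoint

// Patterns inferred from the sample `nvidia-smi nvlink -R` output above.
var (
	gpuLine  = regexp.MustCompile(`^GPU (\d+):`)
	linkLine = regexp.MustCompile(`^Link (\d+): Remote Device ([0-9A-Fa-f:.]+): Link (\d+)$`)
)

// ParseNvlinkRemote builds the reverse lookup map from the text printed by
// `nvidia-smi nvlink -R`. Lines matching neither pattern are skipped.
func ParseNvlinkRemote(output string) (Topology, error) {
	topo := Topology{}
	gpu := -1 // index from the most recent "GPU N:" header
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if m := gpuLine.FindStringSubmatch(line); m != nil {
			gpu, _ = strconv.Atoi(m[1]) // the regexp guarantees digits
			continue
		}
		m := linkLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		if gpu < 0 {
			return nil, fmt.Errorf("link line before any GPU header: %q", line)
		}
		local, _ := strconv.Atoi(m[1])
		remote, _ := strconv.Atoi(m[3])
		pci := strings.ToUpper(m[2])
		if topo[pci] == nil {
			topo[pci] = map[int]LinkEndpoint{}
		}
		topo[pci][remote] = LinkEndpoint{GPUID: gpu, LocalLink: local}
	}
	return topo, sc.Err()
}

func main() {
	sample := `GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)
Link 0: Remote Device 00000008:00:00.0: Link 29
Link 1: Remote Device 00000008:00:00.0: Link 28`
	topo, err := ParseNvlinkRemote(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Which GPU sits behind NVSwitch 00000008:00:00.0, link 29?
	fmt.Printf("%+v\n", topo["00000008:00:00.0"][29])
}
```

In practice the input string would come from running nvidia-smi as a subprocess, which is exactly the dependency the bullets above complain about.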
Potential solution
Add a new method to the Device interface:
type Device interface {
// Existing methods
GetNvLinkRemotePciInfo(int) (PciInfo, Return)
GetNvLinkRemoteDeviceType(int) (IntNvLinkDeviceType, Return)
GetNvLinkState(int) (EnableState, Return)
// NEW: Get the link/port number on the remote device
GetNvLinkRemoteLinkId(linkID int) (uint, Return)
}
Expected usage:
device, _ := nvml.DeviceGetHandleByIndex(0)
for localLink := 0; localLink < nvml.NVLINK_MAX_LINKS; localLink++ {
remotePCI, _ := device.GetNvLinkRemotePciInfo(localLink)
remoteLink, _ := device.GetNvLinkRemoteLinkId(localLink) // ← NEW API
// Build complete topology map
topology[remotePCI][remoteLink] = {GPUID: 0, LocalLink: localLink}
}
// Query: Which GPU is affected by NVSwitch 0008:00:00.0 Link 29 error?
affectedGPU := topology["0008:00:00.0"][29]
Since nvidia-smi already exposes this information, it is probably available in the underlying C library as well; most likely only the Go binding is missing.
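To make the expected usage concrete, here is a compilable sketch of the same loop. It does not import go-nvml: `remoteLinkQuerier`, `fakeGPU`, `gpuLink`, `linkMap`, and `addGPU` are hypothetical names made up for this illustration, the interface stands in for the existing PCI query plus the proposed `GetNvLinkRemoteLinkId`, and plain `error` values replace `nvml.Return` for brevity.

```go
package main

import "fmt"

// remoteLinkQuerier is a hypothetical stand-in for the two Device methods the
// loop needs. RemoteLinkID models the method proposed in this issue; it is an
// assumption and not an existing go-nvml call.
type remoteLinkQuerier interface {
	RemotePCI(link int) (string, error)
	RemoteLinkID(link int) (uint, error)
}

// gpuLink identifies the GPU side of one NVLink connection.
type gpuLink struct {
	GPUID     int
	LocalLink int
}

// linkMap maps remote PCI address -> remote link number -> GPU-side link.
type linkMap map[string]map[uint]gpuLink

// addGPU records every link of one GPU in the reverse lookup map.
// Links that return an error (for example, not present) are skipped.
func addGPU(m linkMap, gpuID int, dev remoteLinkQuerier, maxLinks int) {
	for link := 0; link < maxLinks; link++ {
		pci, err := dev.RemotePCI(link)
		if err != nil {
			continue
		}
		remote, err := dev.RemoteLinkID(link)
		if err != nil {
			continue
		}
		if m[pci] == nil {
			m[pci] = map[uint]gpuLink{}
		}
		m[pci][remote] = gpuLink{GPUID: gpuID, LocalLink: link}
	}
}

// remoteEnd and fakeGPU provide canned answers so the sketch runs without a GPU.
type remoteEnd struct {
	pci  string
	link uint
}

type fakeGPU []remoteEnd

func (f fakeGPU) RemotePCI(link int) (string, error) {
	if link >= len(f) {
		return "", fmt.Errorf("link %d not present", link)
	}
	return f[link].pci, nil
}

func (f fakeGPU) RemoteLinkID(link int) (uint, error) {
	if link >= len(f) {
		return 0, fmt.Errorf("link %d not present", link)
	}
	return f[link].link, nil
}

func main() {
	m := linkMap{}
	// Values taken from the example reverse lookup map earlier in this issue.
	addGPU(m, 0, fakeGPU{{"0008:00:00.0", 29}, {"0008:00:00.0", 28}}, 12)
	addGPU(m, 5, fakeGPU{{"0008:00:00.0", 32}, {"0008:00:00.0", 33}}, 12)

	// Which GPU is affected by an error on NVSwitch 0008:00:00.0, link 29?
	fmt.Printf("%+v\n", m["0008:00:00.0"][29])
}
```

With a real binding, the fake would be replaced by a thin adapter around `nvml.Device`, and the map key would be derived from the returned PciInfo.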