Feature request: Add API to Query NVLink Remote Link ID #166

@XRFXLP

Problem statement

Current situation:

When querying NVLink topology, go-nvml provides:

device.GetNvLinkRemotePciInfo(linkID) 

which returns the remote device's PciInfo, but that alone is incomplete because:

  • A GPU can have multiple links connecting to the same remote device (NVSwitch or peer GPU)
  • Each link connects to a different port on that remote device
  • The PCI address alone doesn't tell us which port

For instance:

GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-81..bcd)
	 Link 0: Remote Device 00000008:00:00.0: Link 32
	 Link 1: Remote Device 00000008:00:00.0: Link 33

What's available:

remotePCI0, _ := device.GetNvLinkRemotePciInfo(0) // Bus ID "00000008:00:00.0"
remotePCI1, _ := device.GetNvLinkRemotePciInfo(1) // Bus ID "00000008:00:00.0" (same)

What's missing:

remoteLink, _ := device.GetNvLinkRemoteLinkId(0) 
// Should return: 32 for link 0, 33 for link 1

Use Case:

NVSwitch SXid error logs contain lines like:

nvidia-nvswitch3: SXid (0008:00:00.0: 20034, Fatal, Link 29 LTSSM Fault Up

Q: Which GPU is affected by this NVSwitch Link 29 error?

Required mapping: (NVSwitch_PCI, Remote_Link) -> (GPU_ID, Local_Link)

Example reverse lookup map:

topology["0008:00:00.0"][29] = {GPUID: 0, LocalLink: 0}
topology["0008:00:00.0"][28] = {GPUID: 0, LocalLink: 1}
topology["0008:00:00.0"][32] = {GPUID: 5, LocalLink: 0}
topology["0008:00:00.0"][33] = {GPUID: 5, LocalLink: 1}
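
A minimal sketch of that reverse lookup map in Go, for illustration only. The `Endpoint` and `Topology` types and the `normalizeBusID` helper are hypothetical names, not part of go-nvml. The helper exists because the SXid log prints a 4-digit PCI domain ("0008:00:00.0") while nvidia-smi prints an 8-digit one ("00000008:00:00.0"); it assumes the dropped upper domain digits are always zero.

```go
package main

import (
	"fmt"
	"strings"
)

// Endpoint identifies the GPU side of an NVLink connection.
// Hypothetical type, not part of go-nvml.
type Endpoint struct {
	GPUID     int
	LocalLink int
}

// Topology maps remote PCI bus ID -> remote link number -> GPU endpoint.
type Topology map[string]map[int]Endpoint

// normalizeBusID lowercases the address and keeps only the last four hex
// digits of the domain, so "00000008:00:00.0" and "0008:00:00.0" compare
// equal. Assumption: the discarded upper domain digits are zero.
func normalizeBusID(busID string) string {
	busID = strings.ToLower(strings.TrimSpace(busID))
	parts := strings.SplitN(busID, ":", 2)
	if len(parts) == 2 && len(parts[0]) > 4 {
		parts[0] = parts[0][len(parts[0])-4:]
	}
	return strings.Join(parts, ":")
}

// Add records that remoteLink on the device at busID connects to ep.
func (t Topology) Add(busID string, remoteLink int, ep Endpoint) {
	key := normalizeBusID(busID)
	if t[key] == nil {
		t[key] = make(map[int]Endpoint)
	}
	t[key][remoteLink] = ep
}

// Lookup answers "which GPU link is attached to this remote link?".
func (t Topology) Lookup(busID string, remoteLink int) (Endpoint, bool) {
	ep, ok := t[normalizeBusID(busID)][remoteLink]
	return ep, ok
}

func main() {
	topo := Topology{}
	topo.Add("00000008:00:00.0", 29, Endpoint{GPUID: 0, LocalLink: 0})
	topo.Add("00000008:00:00.0", 28, Endpoint{GPUID: 0, LocalLink: 1})
	topo.Add("00000008:00:00.0", 32, Endpoint{GPUID: 5, LocalLink: 0})
	topo.Add("00000008:00:00.0", 33, Endpoint{GPUID: 5, LocalLink: 1})

	// Address as it appears in the SXid log line.
	if ep, ok := topo.Lookup("0008:00:00.0", 29); ok {
		fmt.Printf("GPU %d, local link %d\n", ep.GPUID, ep.LocalLink)
	}
}
```

Normalizing on both insert and lookup keeps the map usable whether the key comes from a driver log or from nvidia-smi style output.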

Current workaround

Right now, we depend on nvidia-smi nvlink -R to get output like:

$ nvidia-smi nvlink -R
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)
	 Link 0: Remote Device 00000008:00:00.0: Link 29
	 Link 1: Remote Device 00000008:00:00.0: Link 28
	 Link 2: Remote Device 00000005:00:00.0: Link 34
	 Link 3: Remote Device 00000005:00:00.0: Link 35
	 Link 4: Remote Device 00000007:00:00.0: Link 34
	 Link 5: Remote Device 00000007:00:00.0: Link 35
	 Link 6: Remote Device 0000000A:00:00.0: Link 26
	 Link 7: Remote Device 0000000A:00:00.0: Link 27
	 Link 8: Remote Device 00000006:00:00.0: Link 8
	 Link 9: Remote Device 00000006:00:00.0: Link 9
	 Link 10: Remote Device 00000009:00:00.0: Link 12
	 Link 11: Remote Device 00000009:00:00.0: Link 13

Problems with this approach:

  • Blocks pure-Go applications
  • Needs parsing => fragile
  • Not portable since it needs nvidia-smi in $PATH
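
For reference, the workaround looks roughly like the sketch below. It parses text in the format shown above; the `RemoteLink` type, the `parseNvlinkR` function, and the two regular expressions are our own illustrative names, and they assume the output layout stays exactly as in the sample, which is the fragility described above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// RemoteLink is one "Link N: Remote Device ...: Link M" line of output.
// Hypothetical type, used only for this sketch.
type RemoteLink struct {
	GPUID      int
	LocalLink  int
	RemoteBus  string
	RemoteLink int
}

var (
	// Matches "GPU 0: NVIDIA A100-SXM4-80GB (UUID: ...)".
	gpuRe = regexp.MustCompile(`^GPU (\d+):`)
	// Matches "  Link 0: Remote Device 00000008:00:00.0: Link 29".
	linkRe = regexp.MustCompile(`^\s*Link (\d+): Remote Device ([0-9A-Fa-f:.]+): Link (\d+)`)
)

// parseNvlinkR extracts link records from `nvidia-smi nvlink -R` style text.
// Lines that match neither pattern are skipped silently.
func parseNvlinkR(output string) ([]RemoteLink, error) {
	var links []RemoteLink
	gpu := -1
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if m := gpuRe.FindStringSubmatch(line); m != nil {
			gpu, _ = strconv.Atoi(m[1])
			continue
		}
		m := linkRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		if gpu < 0 {
			return nil, fmt.Errorf("link line before any GPU header: %q", line)
		}
		local, _ := strconv.Atoi(m[1])
		remote, _ := strconv.Atoi(m[3])
		links = append(links, RemoteLink{
			GPUID: gpu, LocalLink: local, RemoteBus: m[2], RemoteLink: remote,
		})
	}
	return links, sc.Err()
}

func main() {
	sample := "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)\n" +
		"\t Link 0: Remote Device 00000008:00:00.0: Link 29\n" +
		"\t Link 1: Remote Device 00000008:00:00.0: Link 28\n"
	links, err := parseNvlinkR(sample)
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Printf("%+v\n", l)
	}
}
```

In a real tool the input would come from running the binary, for example via `exec.Command("nvidia-smi", "nvlink", "-R").Output()`, which is where the `$PATH` dependency enters.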

Potential solution

Add a new method to the Device interface:

type Device interface {
    // Existing methods
    GetNvLinkRemotePciInfo(int) (PciInfo, Return)
    GetNvLinkRemoteDeviceType(int) (IntNvLinkDeviceType, Return)
    GetNvLinkState(int) (EnableState, Return)
    
    // NEW: Get the link/port number on the remote device
    GetNvLinkRemoteLinkId(linkID int) (uint, Return)
}

Expected usage:

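device, _ := nvml.DeviceGetHandleByIndex(0)

// Illustrative only: remote bus ID -> remote link -> local endpoint
type endpoint struct{ GPUID, LocalLink int }
topology := map[string]map[uint]endpoint{}

for localLink := 0; localLink < nvml.NVLINK_MAX_LINKS; localLink++ {
    remotePCI, ret := device.GetNvLinkRemotePciInfo(localLink)
    if ret != nvml.SUCCESS {
        continue // link absent or inactive
    }
    remoteLink, _ := device.GetNvLinkRemoteLinkId(localLink)  // ← NEW API

    // Build complete topology map
    busID := busIDString(remotePCI) // hypothetical helper: PciInfo -> "0008:00:00.0"
    if topology[busID] == nil {
        topology[busID] = map[uint]endpoint{}
    }
    topology[busID][remoteLink] = endpoint{GPUID: 0, LocalLink: localLink}
}

// Query: Which GPU is affected by NVSwitch 0008:00:00.0 Link 29 error?
affectedGPU := topology["0008:00:00.0"][29]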
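
To make the intended flow concrete without NVML or GPU hardware, here is a self-contained sketch that drives the proposed method through a fake device. Everything in it is an assumption for illustration: `Return`, `SUCCESS` and `FAILURE` are local stand-ins for go-nvml's return codes, `RemoteBusID` is a simplified stand-in for `GetNvLinkRemotePciInfo` that returns the bus ID as a string, and `GetNvLinkRemoteLinkId` is the method requested in this issue, which does not exist in go-nvml today.

```go
package main

import "fmt"

// Return is a local stand-in for go-nvml's nvml.Return so the sketch
// builds without the NVML library.
type Return int

const (
	SUCCESS Return = 0
	FAILURE Return = 1 // placeholder non-success value, not a real NVML code
)

// LinkDevice is the subset of device behaviour the sketch needs.
type LinkDevice interface {
	RemoteBusID(link int) (string, Return)         // stand-in for GetNvLinkRemotePciInfo
	GetNvLinkRemoteLinkId(link int) (uint, Return) // proposed in this issue
}

type remoteEnd struct {
	busID string
	link  uint
}

// fakeGPU holds one remoteEnd per local link, indexed by local link number.
type fakeGPU []remoteEnd

func (g fakeGPU) RemoteBusID(link int) (string, Return) {
	if link < 0 || link >= len(g) {
		return "", FAILURE
	}
	return g[link].busID, SUCCESS
}

func (g fakeGPU) GetNvLinkRemoteLinkId(link int) (uint, Return) {
	if link < 0 || link >= len(g) {
		return 0, FAILURE
	}
	return g[link].link, SUCCESS
}

// Endpoint is the GPU side of a link (hypothetical type).
type Endpoint struct{ GPUID, LocalLink int }

// BuildTopology walks every local link of every GPU and records
// (remote bus ID, remote link) -> (GPU ID, local link).
func BuildTopology(gpus map[int]LinkDevice, maxLinks int) map[string]map[uint]Endpoint {
	topo := make(map[string]map[uint]Endpoint)
	for id, dev := range gpus {
		for l := 0; l < maxLinks; l++ {
			busID, ret := dev.RemoteBusID(l)
			if ret != SUCCESS {
				continue // link absent or inactive
			}
			remoteLink, ret := dev.GetNvLinkRemoteLinkId(l)
			if ret != SUCCESS {
				continue
			}
			if topo[busID] == nil {
				topo[busID] = make(map[uint]Endpoint)
			}
			topo[busID][remoteLink] = Endpoint{GPUID: id, LocalLink: l}
		}
	}
	return topo
}

func main() {
	gpus := map[int]LinkDevice{
		0: fakeGPU{{"0008:00:00.0", 29}, {"0008:00:00.0", 28}},
		5: fakeGPU{{"0008:00:00.0", 32}, {"0008:00:00.0", 33}},
	}
	topo := BuildTopology(gpus, 12)
	// Which GPU is behind NVSwitch 0008:00:00.0, link 29?
	fmt.Printf("%+v\n", topo["0008:00:00.0"][29])
}
```

Hiding the device behind a small interface also lets the topology builder be unit tested with fakes like the one above, whatever shape the real binding ends up taking.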

Since nvidia-smi already exposes this information, the underlying C library (NVML) probably supports it, and most likely only the Go binding is missing.
