Description
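Platform: Dual-socket AMD EPYC 9554 + 8× MI210 (rocm-bandwidth-test enumerates 10 devices: 0–1 are the CPUs, 2–9 are the GPUs)
NUMA affinity: CPU0 is local to GPU devices 2–5; CPU1 is local to GPU devices 6–9 (rocm-bandwidth-test device indices)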
Tool: rocm-bandwidth-test v2.6.0
Observation
Inter-Device Numa Distance shows clear asymmetry (e.g., CPU0→GPU2–5 distance 20 vs. CPU0→GPU6–9 distance 52, with 72 between GPUs on opposite sockets; flipped for CPU1), confirming topology/affinity differences.
However, unidirectional and bidirectional H2D/D2H bandwidth is nearly identical from either NUMA node to any GPU: ~28 GB/s (uni) and ~45 GB/s (bi), which looks like PCIe Gen4 x16 saturation.
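As a rough sanity check on the "Gen4 x16 saturation" reading, the link ceiling can be worked out from the PCIe Gen4 signalling rate (16 GT/s per lane, 128b/130b encoding). The sketch below is only back-of-the-envelope arithmetic: it ignores TLP/DLLP protocol overhead, and the measured values are the approximate figures quoted above, not new data.

```python
# Back-of-the-envelope PCIe Gen4 x16 ceiling versus the measured ~28 GB/s.
GT_PER_S = 16.0        # PCIe Gen4 raw signalling rate per lane (GT/s)
LANES = 16             # x16 link
ENCODING = 128 / 130   # 128b/130b line-encoding efficiency


def pcie_peak_gbytes_per_s(gt_per_s=GT_PER_S, lanes=LANES, encoding=ENCODING):
    """Theoretical one-direction rate in GB/s, before packet overhead."""
    return gt_per_s * lanes * encoding / 8  # bits -> bytes


peak = pcie_peak_gbytes_per_s()
measured_uni = 28.0  # approximate unidirectional figure from this report
measured_bi = 45.0   # approximate bidirectional figure from this report

print(f"theoretical per direction: {peak:.2f} GB/s")
print(f"unidirectional efficiency: {measured_uni / peak:.0%}")
print(f"bidirectional vs 2x uni:   {measured_bi / (2 * measured_uni):.0%}")
```

The ceiling comes out at roughly 31.5 GB/s per direction, so ~28 GB/s is close to what a Gen4 x16 link can realistically deliver, while the bidirectional figure falls noticeably short of twice the unidirectional one.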
Questions
- Is this expected because large-block H2D/D2H tests are SDMA/PCIe limited, thus NUMA effects are masked by the PCIe link (Gen4 x16), yielding identical results regardless of CPU node?
- How does rocm-bandwidth-test allocate/register pinned memory? Is it bound to a specific NUMA node, and is there a way to force membind/first-touch to explicitly test cross-socket behavior?
- Are there recommended multi-stream/small-message/concurrent configurations to amplify NUMA differences (e.g., where host DRAM/UPI/XGMI starts to dominate) so we can observe measurable impact beyond PCIe saturation?
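To make the second question concrete, one experiment would be to wrap the tool in `numactl` so that both the CPU threads and the process memory policy are pinned to the remote socket. The sketch below only builds such a command line and does not run it. The helper `numa_bound_command` is hypothetical, the `-s`/`-d` source and destination flags are assumed from the tool's help text and should be checked with `-h`, and whether the tool's pinned host allocations honor the process memory policy at all is exactly what is being asked here.

```python
import shlex


def numa_bound_command(numa_node, src_device, dst_device,
                       binary="./rocm-bandwidth-test"):
    """Build (but do not run) a numactl-wrapped rocm-bandwidth-test command.

    Hypothetical helper: -s/-d are assumed to select the source and
    destination device indices; verify against the tool's -h output.
    """
    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # run threads on this NUMA node only
        f"--membind={numa_node}",      # allocate process memory on this node only
        binary,
        "-s", str(src_device),
        "-d", str(dst_device),
    ]


# CPU device 1 -> GPU device 2 is a cross-socket pair (NUMA distance 52).
cmd = numa_bound_command(numa_node=1, src_device=1, dst_device=2)
print(shlex.join(cmd))
```

Running the same pair with `numa_node=0` would then give the comparison point for the same GPU.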
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 15 15 15 72 72 72 72
GPU1 15 0 15 15 72 72 72 72
GPU2 15 15 0 15 72 72 72 72
GPU3 15 15 15 0 72 72 72 72
GPU4 72 72 72 72 0 15 15 15
GPU5 72 72 72 72 15 0 15 15
GPU6 72 72 72 72 15 15 0 15
GPU7 72 72 72 72 15 15 15 0
================================= Hops between two GPUs ==================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 1 1 1 3 3 3 3
GPU1 1 0 1 1 3 3 3 3
GPU2 1 1 0 1 3 3 3 3
GPU3 1 1 1 0 3 3 3 3
GPU4 3 3 3 3 0 1 1 1
GPU5 3 3 3 3 1 0 1 1
GPU6 3 3 3 3 1 1 0 1
GPU7 3 3 3 3 1 1 1 0
=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 XGMI XGMI XGMI PCIE PCIE PCIE PCIE
GPU1 XGMI 0 XGMI XGMI PCIE PCIE PCIE PCIE
GPU2 XGMI XGMI 0 XGMI PCIE PCIE PCIE PCIE
GPU3 XGMI XGMI XGMI 0 PCIE PCIE PCIE PCIE
GPU4 PCIE PCIE PCIE PCIE 0 XGMI XGMI XGMI
GPU5 PCIE PCIE PCIE PCIE XGMI 0 XGMI XGMI
GPU6 PCIE PCIE PCIE PCIE XGMI XGMI 0 XGMI
GPU7 PCIE PCIE PCIE PCIE XGMI XGMI XGMI 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 0
GPU[4] : (Topology) Numa Node: 1
GPU[4] : (Topology) Numa Affinity: 1
GPU[5] : (Topology) Numa Node: 1
GPU[5] : (Topology) Numa Affinity: 1
GPU[6] : (Topology) Numa Node: 1
GPU[6] : (Topology) Numa Affinity: 1
GPU[7] : (Topology) Numa Node: 1
GPU[7] : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
RocmBandwidthTest Version: 2.6.0
Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD EPYC 9554 64-Core Processor
Device: 1, AMD EPYC 9554 64-Core Processor
Device: 2, AMD Instinct MI210, GPU-06c9d21390bc2e70, 05:0.0
Device: 3, AMD Instinct MI210, GPU-8b193c69a6fe8e9b, 08:0.0
Device: 4, AMD Instinct MI210, GPU-bdfdc381baca4d61, 45:0.0
Device: 5, AMD Instinct MI210, GPU-4a4febed074f17f1, 48:0.0
Device: 6, AMD Instinct MI210, GPU-ea00f32afadbf90a, 87:0.0
Device: 7, AMD Instinct MI210, GPU-dda0ddc66290f236, 8a:0.0
Device: 8, AMD Instinct MI210, GPU-e0a81ba81d2a0689, c8:0.0
Device: 9, AMD Instinct MI210, GPU-42deef71a398b3a1, cb:0.0
Inter-Device Access
D/D 0 1 2 3 4 5 6 7 8 9
0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1 1 1
Inter-Device Numa Distance
D/D 0 1 2 3 4 5 6 7 8 9
0 0 32 20 20 20 20 52 52 52 52
1 32 0 52 52 52 52 20 20 20 20
2 20 52 0 15 15 15 72 72 72 72
3 20 52 15 0 15 15 72 72 72 72
4 20 52 15 15 0 15 72 72 72 72
5 20 52 15 15 15 0 72 72 72 72
6 52 20 72 72 72 72 0 15 15 15
7 52 20 72 72 72 72 15 0 15 15
8 52 20 72 72 72 72 15 15 0 15
9 52 20 72 72 72 72 15 15 15 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1 2 3 4 5 6 7 8 9
0 N/A N/A 28.080 28.083 28.055 28.065 28.031 28.007 28.054 28.033
1 N/A N/A 28.031 28.042 28.057 28.042 28.078 28.099 28.101 28.091
2 28.269 28.258 1026.750 39.934 39.926 39.919 28.273 28.294 28.282 28.271
3 28.260 28.292 39.957 1020.504 39.839 39.919 28.288 28.277 28.273 28.292
4 28.263 28.246 39.900 39.877 1021.747 39.915 28.258 28.267 28.279 28.263
5 28.292 28.284 39.953 39.941 39.957 1015.555 28.296 28.286 28.292 28.250
6 28.261 28.260 28.252 28.246 28.254 28.242 960.894 39.942 39.927 39.809
7 28.258 28.263 28.273 28.261 28.280 28.267 39.976 966.423 39.923 39.915
8 28.271 28.239 28.248 28.252 28.250 28.260 39.923 39.930 966.423 39.976
9 28.254 28.239 28.233 28.261 28.252 28.248 39.835 39.923 39.965 969.774
Bidirectional copy peak bandwidth GB/s
D/D 0 1 2 3 4 5 6 7 8 9
0 N/A N/A 45.410 45.506 45.190 45.353 45.063 45.165 45.063 45.073
1 N/A N/A 45.250 45.787 45.484 45.070 45.129 45.277 45.398 45.570
2 45.410 45.250 N/A 76.531 76.797 76.713 56.051 56.019 56.022 56.029
3 45.506 45.787 76.531 N/A 76.755 76.755 55.998 56.009 55.980 56.036
4 45.190 45.484 76.797 76.755 N/A 76.566 55.956 56.023 55.993 56.036
5 45.353 45.070 76.713 76.755 76.566 N/A 56.009 56.006 56.018 56.033
6 45.063 45.129 56.051 55.998 55.956 56.009 N/A 76.496 76.673 76.738
7 45.165 45.277 56.019 56.009 56.023 56.006 76.496 N/A 76.720 76.748
8 45.063 45.398 56.022 55.980 55.993 56.018 76.673 76.720 N/A 76.539
9 45.073 45.570 56.029 56.036 56.036 56.033 76.738 76.748 76.539 N/A
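To quantify how small the local-versus-remote difference is, the two CPU rows of the unidirectional table above can be averaged per socket. This is a minimal sketch using only the values copied from that table. The local/remote split follows the NUMA distance matrix (devices 2–5 local to CPU 0, devices 6–9 local to CPU 1), and the helper name `local_remote_gap` is illustrative.

```python
from statistics import mean

# Rows 0 and 1 (the two CPU devices) of the unidirectional table,
# columns for GPU devices 2..9, in GB/s.
h2d = {
    0: [28.080, 28.083, 28.055, 28.065, 28.031, 28.007, 28.054, 28.033],
    1: [28.031, 28.042, 28.057, 28.042, 28.078, 28.099, 28.101, 28.091],
}

# List positions 0-3 are devices 2-5 (local to CPU 0);
# positions 4-7 are devices 6-9 (local to CPU 1).
local_cols = {0: slice(0, 4), 1: slice(4, 8)}
remote_cols = {0: slice(4, 8), 1: slice(0, 4)}


def local_remote_gap(cpu):
    """Return (local mean, remote mean, relative gap) for one CPU device."""
    local = mean(h2d[cpu][local_cols[cpu]])
    remote = mean(h2d[cpu][remote_cols[cpu]])
    return local, remote, (local - remote) / local


for cpu in h2d:
    local, remote, gap = local_remote_gap(cpu)
    print(f"CPU {cpu}: local {local:.3f} GB/s, "
          f"remote {remote:.3f} GB/s, gap {gap:.2%}")
```

With these numbers the local mean is slightly higher for both CPUs, but the gap stays below 0.2%, which is within what run-to-run noise could plausibly explain.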