[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ffromani
this is getting weirder:
CI woes addressed in #70
Let's see how this could look. I'm using a tool I wrote, looking at a dual Xeon Gold 6230R machine I have access to. In the output we can see we detect 3 PCIe roots per NUMA node. Having more than one PCIe root per NUMA node is not surprising: this fact is the very reason we are moving to PCIe root locality as a better attribute to align resources. The representational problem, however, is that more than one PCIe root can take resources from the same pool, be they CPUs or NUMA nodes. I'm not sure we can correctly represent the HW topology with consumable capacity alone. I tend to believe we would need kubernetes/enhancements#5942 to be available before we can make progress in this area.
I'm thinking that we should treat the PCIe root not as a boundary of resource capacity, but as a local label, and standardize the calculation algorithm so it becomes a consensus across drivers. This ensures the correctness of the capacity representation (NUMA grouping, no overlap) and achieves PCIe alignment across drivers. However, this may require extensive discussion.
Did a quick test as well and, as you @ffromani already mentioned, a core is listed in more than one PCIe root:

2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:00" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:0c" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:60" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:b4" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:c9" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1

Since a CPU might have adjacency to more than one PCIe root, we can't currently avoid representing the same core under more than one PCIe root, as CPUs aren't part of the PCIe device tree. But as you have mentioned, I also believe that kubernetes/enhancements#5942 will help with that as it introduces
Import code at k/k@b96a4039358. We need the code merged in kubernetes/kubernetes#137220 and kubernetes/kubernetes#137524, but we can't wait for the rebase on top of kube 1.36.0. We will drop this carryover and just depend on the kube libs when we actually rebase. IMPORT NOTICE: trivial reformatting applied to comply with this project's rules. No functional changes performed. Mechanical change only: `gci write ./internal/deviceattribute` Signed-off-by: Francesco Romani <fromani@redhat.com>
The Linux kernel's sysfs reports which CPUs are local to which PCIe root. We can leverage this to group CPUs by PCIe root, which is already the standard attribute exposed by the DRA framework. This commit adds the scan logic which we will later use in the DRA layer. Signed-off-by: Francesco Romani <fromani@redhat.com>
Group devices by PCIe root, which is the attribute preferred by the DRA framework. Linux sysfs exposes CPU locality for each PCI root complex, so it's safe and convenient to expose this. Signed-off-by: Francesco Romani <fromani@redhat.com>
We can now align using standard Kubernetes attributes. Signed-off-by: Francesco Romani <fromani@redhat.com>
On hold until we rebase on top of kube 1.36 to consume the list attributes.
Group CPUs by their PCIe root locality. We gain compatibility with all Kubernetes-compliant drivers, while still allowing optimal resource allocation.
more context: kubernetes/kubernetes#132296 (comment)
WIP: needs tests, polishing, docs