
WIP: group CPUs by PCIE root #68

Open
ffromani wants to merge 4 commits into kubernetes-sigs:main from ffromani:group-by-pcieroot

Conversation

@ffromani
Contributor

@ffromani ffromani commented Feb 23, 2026

Group CPUs by their PCIE root locality. This keeps us compatible with all Kubernetes-compliant drivers while still allowing optimal resource allocation.

more context: kubernetes/kubernetes#132296 (comment)

WIP: needs tests, polishing, docs

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 23, 2026
@ffromani
Contributor Author

this is getting weirder:

fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ pre-commit run --all-files
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check json...........................................(no files to check)Skipped
check yaml...............................................................Passed
check for broken symlinks............................(no files to check)Skipped
check for added large files..............................................Passed
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
detect private key.......................................................Passed
mdformat.................................................................Passed
codespell................................................................Passed
shfmt....................................................................Passed
ShellCheck v0.10.0.......................................................Passed
go-fmt...................................................................Passed
go-mod-tidy..............................................................Passed
go-vet-mod...............................................................Passed
go-build-mod.............................................................Passed
Make Tests...............................................................Passed
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ make test-unit &> /dev/null 
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ echo $?
0
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ 

@ffromani ffromani force-pushed the group-by-pcieroot branch 9 times, most recently from 3ec0411 to 54b09cb Compare February 23, 2026 13:52
@ffromani
Contributor Author

CI woes addressed in #70

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 24, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 4 times, most recently from 751f4f3 to 275f4aa Compare February 25, 2026 11:37
@ffromani ffromani force-pushed the group-by-pcieroot branch from 275f4aa to 8046bdd Compare March 2, 2026 13:24
@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 2, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 2, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from d14ef64 to bc6b207 Compare March 3, 2026 14:42
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from 86e98ca to bf83b4e Compare March 9, 2026 11:10
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from 891e8f6 to f35b51f Compare March 12, 2026 10:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch from f35b51f to 8ac2492 Compare March 12, 2026 10:59
@ffromani
Contributor Author

ffromani commented Mar 17, 2026

Let's see how it could look. I'm using this tool I wrote, looking at a dual Xeon Gold 6230R machine I have access to.
YAML used:

apiVersion: v1
kind: Pod
metadata:
  generateName: chk-pod-
spec:
  containers:
  - name: ctrreschk
    image: quay.io/fromani/ctrreschk:v0.0.11
    imagePullPolicy: Always
    command: ["/ctrreschk", "-w", "align"]
    resources:
      limits:
        cpu: 1
        memory: 256Mi
      requests:
        cpu: 1
        memory: 256Mi

output:

kubectl exec -ti  chk-pod-v4t2z -- /ctrreschk pciescan
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:00" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:17" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:3a" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:85" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:ae" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:d7" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1

We can see we detect 3 PCIE roots per NUMA node. Having more than one PCIE root per NUMA node is not surprising: it is the very reason we are moving to PCIE root locality as a better attribute for aligning resources.

The representational problem, however, is that we can have more than one PCIE root drawing resources from the same pool, be they CPUs or NUMA nodes. I'm not sure we can correctly represent the HW topology with consumable capacity alone. I tend to believe we need kubernetes/enhancements#5942 to be available before we can make progress in this area.

@ffromani ffromani force-pushed the group-by-pcieroot branch from 8ac2492 to 8eb9757 Compare March 17, 2026 13:34
@AutuSnow
Contributor

I'm thinking we should treat the PCIe root not as a boundary of resource capacity but as a locality label, and standardize the calculation algorithm so it becomes a consensus across drivers. This keeps the capacity representation correct (NUMA grouping, no overlap) while achieving PCIe alignment across drivers. However, this may require extensive discussion.

@fmuyassarov
Member

> we can see we detect 3 PCIE roots per NUMA node. The fact we have more than 1 PCIE root per NUMA node is not surprising: this fact is the very reason why we are moving to use PCIE root locality as better attribute to align resources.
>
> The representational problem, however, is that we can have more than a PCIE root which takes resources from the same pool, be them CPUs or NUMA nodes. I'm not sure we can correctly represent the HW topology with consumable capacity alone. I tend to believe we would need kubernetes/enhancements#5942 to be available before we can make progress in this area.

Did a quick test as well and as you @ffromani already mentioned, a core is listed in more than one PCIe Root:

2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:00" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:0c" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:60" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:60" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:b4" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:c9" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1

Since a CPU can be adjacent to more than one PCIe root, and CPUs are not part of the PCIe device tree, we currently can't avoid representing the same core under more than one PCIe root. But as you mentioned, I also believe that kubernetes/enhancements#5942 will help here: it introduces list-of-* attributes, which should allow us to do better grouping.

Import code at k/k@b96a4039358

We need the code merged in kubernetes/kubernetes#137220
and kubernetes/kubernetes#137524
but we can't wait until we rebase on top of kube 1.36.0.
We will drop this carryover and just depend on the kube libs when we
actually rebase.

IMPORT NOTICE: trivial reformatting applied to comply with this project
rules. No functional changes performed.
Mechanical change only: `gci write ./internal/deviceattribute`

Signed-off-by: Francesco Romani <fromani@redhat.com>
Linux kernel's sysfs reports which CPUs are local
to which PCIE root. We can leverage this feature
to group CPUs by PCIE root, which is already the
standard attribute exposed by the DRA framework.

This commit adds the scan logic which we will later
use in the DRA layer.

Signed-off-by: Francesco Romani <fromani@redhat.com>
group devices by PCIE root, which is the attribute preferred
by the DRA framework. Linux sysfs exposes CPU locality
for each PCI root complex, so it's safe and convenient to
expose this.

Signed-off-by: Francesco Romani <fromani@redhat.com>
we can now align using standard kubernetes attributes.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani force-pushed the group-by-pcieroot branch from 8eb9757 to b20f018 Compare April 13, 2026 14:52
@ffromani
Copy link
Copy Markdown
Contributor Author

on hold till we rebase on top of kube 1.36 to consume the list attributes

