
WIP: group CPUs by PCIE root #68

Open
ffromani wants to merge 4 commits into kubernetes-sigs:main from ffromani:group-by-pcieroot

Conversation

@ffromani
Contributor

@ffromani ffromani commented Feb 23, 2026

Group CPUs by their PCIE root locality. This keeps us compatible with all Kubernetes-compliant drivers while still allowing optimal resource allocation.

more context: kubernetes/kubernetes#132296 (comment)

WIP: needs tests, polishing, docs

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 23, 2026
@ffromani
Contributor Author

this is getting weirder:

fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ pre-commit run --all-files
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check json...........................................(no files to check)Skipped
check yaml...............................................................Passed
check for broken symlinks............................(no files to check)Skipped
check for added large files..............................................Passed
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
detect private key.......................................................Passed
mdformat.................................................................Passed
codespell................................................................Passed
shfmt....................................................................Passed
ShellCheck v0.10.0.......................................................Passed
go-fmt...................................................................Passed
go-mod-tidy..............................................................Passed
go-vet-mod...............................................................Passed
go-build-mod.............................................................Passed
Make Tests...............................................................Passed
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ make test-unit &> /dev/null 
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ echo $?
0
fromani@laptop:~/go/src/sigs.k8s.io/dra-driver-cpu$ 

@ffromani ffromani force-pushed the group-by-pcieroot branch 9 times, most recently from 3ec0411 to 54b09cb Compare February 23, 2026 13:52
@ffromani
Contributor Author

CI woes addressed in #70

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 24, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 4 times, most recently from 751f4f3 to 275f4aa Compare February 25, 2026 11:37
@ffromani ffromani force-pushed the group-by-pcieroot branch from 275f4aa to 8046bdd Compare March 2, 2026 13:24
@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 2, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 2, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from d14ef64 to bc6b207 Compare March 3, 2026 14:42
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from 86e98ca to bf83b4e Compare March 9, 2026 11:10
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch 2 times, most recently from 891e8f6 to f35b51f Compare March 12, 2026 10:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2026
@ffromani ffromani force-pushed the group-by-pcieroot branch from f35b51f to 8ac2492 Compare March 12, 2026 10:59
@ffromani
Contributor Author

ffromani commented Mar 17, 2026

Let's see how it could look. I'm using this tool I wrote, looking at a dual Xeon Gold 6230R machine I have access to.
YAML used:

apiVersion: v1
kind: Pod
metadata:
  generateName: chk-pod-
spec:
  containers:
  - name: ctrreschk
    image: quay.io/fromani/ctrreschk:v0.0.11
    imagePullPolicy: Always
    command: ["/ctrreschk", "-w", "align"]
    resources:
      limits:
        cpu: 1
        memory: 256Mi
      requests:
        cpu: 1
        memory: 256Mi

output:

kubectl exec -ti  chk-pod-v4t2z -- /ctrreschk pciescan
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:00" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:17" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:3a" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102" "NUMANode"=0
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:85" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:ae" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1
2026/03/17 11:30:50 "level"=0 "msg"="PCIE domain" "root"="pci0000:d7" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103" "NUMANode"=1

We can see we detect 3 PCIE roots per NUMA node. Having more than one PCIE root per NUMA node is not surprising: it is the very reason we are moving to PCIE root locality as a better attribute for aligning resources.

The representational problem, however, is that we can have more than one PCIE root drawing resources from the same pool, be they CPUs or NUMA nodes. I'm not sure we can correctly represent the HW topology with consumable capacity alone. I tend to believe we need kubernetes/enhancements#5942 to be available before we can make progress in this area.

@ffromani ffromani force-pushed the group-by-pcieroot branch from 8ac2492 to 8eb9757 Compare March 17, 2026 13:34
@AutuSnow
Contributor

I'm thinking we should treat the PCIe root not as a boundary of resource capacity but as a locality label, and standardize the calculation algorithm so it becomes a consensus across drivers. This keeps the capacity representation correct (NUMA grouping, no overlap) while achieving PCIe alignment across drivers. However, this may require extensive discussion.

@fmuyassarov
Member

> we can see we detect 3 PCIE roots per NUMA node. The fact we have more than 1 PCIE root per NUMA node is not surprising: this fact is the very reason why we are moving to use PCIE root locality as better attribute to align resources.
>
> The representational problem, however, is that we can have more than a PCIE root which takes resources from the same pool, be them CPUs or NUMA nodes. I'm not sure we can correctly represent the HW topology with consumable capacity alone. I tend to believe we would need kubernetes/enhancements#5942 to be available before we can make progress in this area.

Did a quick test as well and as you @ffromani already mentioned, a core is listed in more than one PCIe Root:

2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:00" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:0c" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:60" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:60" "localCPUs"="0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126" "NUMANode"=0
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:b4" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1
2026/03/23 16:03:49 "level"=0 "msg"="PCIE domain" "root"="pci0000:c9" "localCPUs"="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" "NUMANode"=1

Since a CPU can be adjacent to more than one PCIe root, and CPUs are not part of the PCIe device tree, we currently can't avoid representing the same core under more than one PCIe root. But as you mentioned, I also believe that kubernetes/enhancements#5942 will help here: it introduces list-of-* attributes, which should allow us to do better grouping.

Import code at k/k@b96a4039358

We need the code merged in kubernetes/kubernetes#137220
and kubernetes/kubernetes#137524
but we can't wait until we rebase on top of kube 1.36.0.
We will drop this carryover and just depend on the kube libs when we
actually rebase.

IMPORT NOTICE: trivial reformatting applied to comply with this project
rules. No functional changes performed.
Mechanical change only: `gci write ./internal/deviceattribute`

Signed-off-by: Francesco Romani <fromani@redhat.com>
Linux kernel's sysfs reports which CPUs are local
to which PCIE root. We can leverage this feature
to group CPUs by PCIE root, which is already the
standard attribute exposed by the DRA framework.

This commit adds the scan logic which we will later
use in the DRA layer.

Signed-off-by: Francesco Romani <fromani@redhat.com>
group devices by PCIE root, which is the attribute preferred
by the DRA framework. Linux sysfs exposes CPU locality
for each PCI root complex, so it's safe and convenient to
expose this.

Signed-off-by: Francesco Romani <fromani@redhat.com>
we can now align using standard kubernetes attributes.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani force-pushed the group-by-pcieroot branch from 8eb9757 to b20f018 Compare April 13, 2026 14:52
@ffromani
Copy link
Copy Markdown
Contributor Author

on hold till we rebase on top of kube 1.36 to consume the list attributes

