Chore: QoL and Test fixes by catblade · Pull Request #93 · kubernetes-sigs/dra-driver-cpu

catblade · 2026-03-18T21:55:53Z

QoL updates and test fixes.

Fix dracputester CPU affinity to include all CPUs in mask

Test fixes:
- cpu_assignment:
  - track exclusive CPU allocations per node so the same CPU ID on different nodes is not reported as overlapping
  - derive expected shared pool from target node only
  - increase verifySharedPoolMatches timeout and improve failure msg
- e2e_suite:
  - fix BeFailedToCreate to log State.Waiting.Reason instead of State.Terminated.Reason when container is Waiting.
- pod:
  - on WaitToBeRunning failure, append pod events hint (type, reason, message, time) for debugging Pending pods.

Makefile:
- test-e2e: use single cluster (create if missing, run grouped then individual, delete if we created)
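
The two diagnostics fixes above (preferring the Waiting reason, and appending an events hint) can be illustrated with a minimal sketch. This is not the code from the PR: `ContainerState`, `PodEvent`, `failureReason`, and `eventsHint` are hypothetical stand-ins, and the real tests presumably read these fields from the Kubernetes API types instead.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// ContainerState is a simplified, hypothetical stand-in for the Waiting and
// Terminated parts of a Kubernetes container state.
type ContainerState struct {
	WaitingReason    string // empty when the container is not Waiting
	TerminatedReason string // empty when the container is not Terminated
}

// failureReason prefers the Waiting reason: a container that was never
// created is Waiting and has no Terminated state to read.
func failureReason(s ContainerState) string {
	if s.WaitingReason != "" {
		return s.WaitingReason
	}
	return s.TerminatedReason
}

// PodEvent is a hypothetical stand-in holding the four fields named in the
// PR description: type, reason, message, and time.
type PodEvent struct {
	Type, Reason, Message string
	Time                  time.Time
}

// eventsHint renders events one per line so they can be appended to the
// error returned when a pod fails to reach Running.
func eventsHint(events []PodEvent) string {
	if len(events) == 0 {
		return "no events recorded"
	}
	var b strings.Builder
	for _, e := range events {
		fmt.Fprintf(&b, "%s %s: %s (%s)\n",
			e.Type, e.Reason, e.Message, e.Time.UTC().Format(time.RFC3339))
	}
	return strings.TrimRight(b.String(), "\n")
}

func main() {
	// Illustrative values only; they are not taken from a real cluster.
	state := ContainerState{WaitingReason: "CreateContainerConfigError"}
	fmt.Println("reason:", failureReason(state))

	events := []PodEvent{{
		Type:    "Warning",
		Reason:  "FailedScheduling",
		Message: "no nodes available",
		Time:    time.Date(2026, 3, 18, 21, 55, 53, 0, time.UTC),
	}}
	fmt.Println(eventsHint(events))
}
```

Keeping the hint as a plain string makes it easy to append to an existing error with `fmt.Errorf("%w; events:\n%s", err, hint)`.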


k8s-ci-robot · 2026-03-18T21:56:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: catblade
Once this PR has been reviewed and has the lgtm label, please assign klueska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/cpuinfo/cpuinfo.go

pravk03 · 2026-03-18T22:50:44Z

test/e2e/cpu_assignment_test.go

+						if alloc.CPUAssigned.Size() != cpusPerClaim {
+							return fmt.Errorf("pod %d: got %d CPUs, want %d", i, alloc.CPUAssigned.Size(), cpusPerClaim)
+						}
+						if !alloc.CPUAssigned.IsSubsetOf(availableCPUs) {


This might fail since available CPUs are not tracked per node?

There could be a few other places in this test where we implicitly assume that all pods run on the same node.

For shared pods we explicitly set the node name (via `mustCreateBestEffortPod`), but it looks like we missed setting the node name for the exclusive-cpu pods (in `makeTesterPodWithExclusiveCPUClaim`). Currently, the test still passes consistently even with this bug because our CI creates a kind cluster with just 1 worker node (kind-ci.yaml).

We should probably just pin the exclusive-cpu pods to the target node as well for now, and keep this as a single-node test?

We can update, but I have this here because it was breaking for me in my little multi-node cluster. Happy to apply any updates, but given that we expect people to use this, we probably want multi-node tests.

Addressed. Looked for other locations as well.
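
For readers following this thread, per-node tracking of exclusive CPUs could look roughly like the sketch below. It is an illustration under assumptions, not the code in the PR: `cpuSet`, `tracker`, and `claim` are hypothetical names, and the node names are made up. The point is that both the subset check and the overlap check are scoped to the pod's own node, so CPU 2 on one node never conflicts with CPU 2 on another.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuSet is a minimal, hypothetical stand-in for a real CPU set type.
type cpuSet map[int]struct{}

func newCPUSet(ids ...int) cpuSet {
	s := cpuSet{}
	for _, id := range ids {
		s[id] = struct{}{}
	}
	return s
}

// sortedIDs returns the CPU IDs in ascending order so errors are deterministic.
func sortedIDs(s cpuSet) []int {
	ids := make([]int, 0, len(s))
	for id := range s {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

// tracker records exclusive CPU allocations separately for each node.
type tracker struct {
	available map[string]cpuSet // node -> CPUs that may be handed out
	allocated map[string]cpuSet // node -> CPUs already claimed exclusively
}

func newTracker(available map[string]cpuSet) *tracker {
	return &tracker{available: available, allocated: map[string]cpuSet{}}
}

// claim validates one pod's exclusive CPUs against its own node only and
// records them if every CPU is available and not yet allocated there.
func (t *tracker) claim(node string, cpus cpuSet) error {
	avail, ok := t.available[node]
	if !ok {
		return fmt.Errorf("node %q: no available CPU set known", node)
	}
	if t.allocated[node] == nil {
		t.allocated[node] = cpuSet{}
	}
	ids := sortedIDs(cpus)
	for _, id := range ids {
		if _, ok := avail[id]; !ok {
			return fmt.Errorf("node %q: CPU %d is not in the available set", node, id)
		}
		if _, taken := t.allocated[node][id]; taken {
			return fmt.Errorf("node %q: CPU %d allocated twice", node, id)
		}
	}
	for _, id := range ids {
		t.allocated[node][id] = struct{}{}
	}
	return nil
}

func main() {
	t := newTracker(map[string]cpuSet{
		"worker-a": newCPUSet(0, 1, 2, 3),
		"worker-b": newCPUSet(0, 1, 2, 3),
	})
	fmt.Println(t.claim("worker-a", newCPUSet(2, 3))) // accepted
	fmt.Println(t.claim("worker-b", newCPUSet(2, 3))) // same IDs, other node: accepted
	fmt.Println(t.claim("worker-a", newCPUSet(3)))    // rejected: overlap on worker-a
}
```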

The changes look good.

Non-blocking comment - I wonder if we gain any meaningful additional coverage with multi-node tests at the driver level, given that node placement is ultimately decided by the scheduler?

cc @ffromani

test/e2e/cpu_assignment_test.go

- get rid of magic numbers for availableCPUsByNode (still a default) and discover allAllocatedCPUsByNode
- Verify shared pool on every node that has exclusive pods. Use unique discovery pod names per node (discovery-pod-<nodeName>) to avoid name clashes on multi-node clusters.
- move code to the dracputester app. Add in associated test file (with tests)
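
As a rough illustration of the per-node shared pool check described in this commit message, the sketch below derives an expected shared pool for each node that hosts exclusive pods. It assumes the shared pool is simply the node's available CPUs minus its exclusively allocated CPUs, and that reserved CPUs are already excluded from the available set; the PR does not spell this out. `expectedSharedPool` and `discoveryPodName` are hypothetical helper names, and only the `discovery-pod-<nodeName>` pattern comes from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// discoveryPodName follows the discovery-pod-<nodeName> pattern so that
// discovery pods created on different nodes never share a name.
func discoveryPodName(node string) string {
	return "discovery-pod-" + node
}

// expectedSharedPool returns, in ascending order, the CPUs left on a node
// after removing the exclusively allocated ones.
// Assumption: available already excludes any reserved CPUs.
func expectedSharedPool(available, exclusive []int) []int {
	taken := make(map[int]bool, len(exclusive))
	for _, id := range exclusive {
		taken[id] = true
	}
	shared := []int{}
	for _, id := range available {
		if !taken[id] {
			shared = append(shared, id)
		}
	}
	sort.Ints(shared)
	return shared
}

func main() {
	// Illustrative data only; real values would be discovered from the cluster.
	availableByNode := map[string][]int{
		"worker-a": {0, 1, 2, 3},
		"worker-b": {0, 1, 2, 3},
	}
	exclusiveByNode := map[string][]int{
		"worker-a": {1, 2},
		"worker-b": {3},
	}

	// Visit only nodes that host exclusive pods, in a stable order.
	nodes := make([]string, 0, len(exclusiveByNode))
	for node := range exclusiveByNode {
		nodes = append(nodes, node)
	}
	sort.Strings(nodes)

	for _, node := range nodes {
		pool := expectedSharedPool(availableByNode[node], exclusiveByNode[node])
		fmt.Printf("%s: expect shared pool %v\n", discoveryPodName(node), pool)
	}
}
```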

pravk03 · 2026-03-19T18:33:34Z

Makefile

+	env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=grouped go test -v ./test/e2e/ --ginkgo.v
+
+test-e2e-individual-run: ## patch daemonset to individual and run e2e (requires kind-e2e-setup)
+	kubectl -n kube-system patch daemonset dracpu --type=json -p='$(call e2e_daemonset_patch,individual)'
+	kubectl -n kube-system rollout status daemonset/dracpu --timeout=120s
+	env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=individual go test -v ./test/e2e/ --ginkgo.v
+
+test-e2e: ## run e2e in both grouped and individual mode (one cluster: create if missing, run both, delete if we created)


The changes here overwrite the custom CI manifests deployed in CI workflows with make ci-kind-setup. Not sure if we would want that, @ffromani.

I suggest we move the Makefile improvements to a separate PR to give more time to iterate on them after the 0.1 release is cut. We can limit this PR to bug fixes and improvements in tests.

Okay. Will do tomorrow.

pravk03 · 2026-03-20T00:48:34Z

Extracting only the test fixes into #98 to have them merged before the 0.1 release. This gives more time to iterate on the improvements.

@catblade @ffromani

k8s-ci-robot · 2026-03-21T01:58:14Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

catblade added 3 commits March 17, 2026 11:02

Chore: Makefile changes for running tests (with-kind, skip image buil…

3bcae14

…d if exists, kind-load-test-image)

Fix dracputester CPU affinity to include all CPUs in mask

74daec7

k8s-ci-robot requested review from ffromani and pravk03 March 18, 2026 21:56

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 18, 2026

pravk03 reviewed Mar 18, 2026

catblade changed the title ~~QoL and Test fixes~~ Chore: QoL and Test fixes Mar 19, 2026

pravk03 reviewed Mar 19, 2026

pravk03 mentioned this pull request Mar 20, 2026

E2E test fixes #98

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2026

catblade wants to merge 4 commits into kubernetes-sigs:main from catblade:catblade/affinity_fix
