Skip to content

Chore: QoL and Test fixes#93

Open
catblade wants to merge 4 commits intokubernetes-sigs:mainfrom
catblade:catblade/affinity_fix
Open

Chore: QoL and Test fixes#93
catblade wants to merge 4 commits intokubernetes-sigs:mainfrom
catblade:catblade/affinity_fix

Conversation

@catblade
Copy link
Copy Markdown
Contributor

QoL updates and test fixes.

Fix dracputester CPU affinity to include all CPUs in mask

Test fixes:
- cpu_assignment:
track exclusive CPU allocations per node so same ID on different nodes are not reported as overlapping
derive expected shared pool from target node only
increase verifySharedPoolMatches timeout and improve failure msg
- e2e_suite:
fix BeFailedToCreate to log State.Waiting.Reason instead of State.Terminated.Reason when container is Waiting.
- pod:
on WaitToBeRunning failure, append pod events hint (type, reason, message, time) for debugging Pending pods.
Makefile:
- test-e2e: use single cluster (create if missing, run grouped then individual, delete if we created)

… with those edits

- cpu_assignment:
  track exclusive CPU allocations per node so same ID on different nodes are not reported as overlapping
  derive expected shared pool from target node only
  increase verifySharedPoolMatches timeout and improve failure msg
- e2e_suite:
  fix BeFailedToCreate to log State.Waiting.Reason instead of State.Terminated.Reason when container is Waiting.
- pod:
  on WaitToBeRunning failure, append pod events hint (type, reason, message, time) for debugging Pending pods.
Makefile:
- test-e2e: use single cluster (create if missing, run grouped then individual, delete if we created)
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: catblade
Once this PR has been reviewed and has the lgtm label, please assign klueska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 18, 2026
if alloc.CPUAssigned.Size() != cpusPerClaim {
return fmt.Errorf("pod %d: got %d CPUs, want %d", i, alloc.CPUAssigned.Size(), cpusPerClaim)
}
if !alloc.CPUAssigned.IsSubsetOf(availableCPUs) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might fail since available CPUs is not tracked per node ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be a few other places in this test where we implicitly assume that all pods run on the same node.

For shared pods we explicitly set the node name(viamustCreateBestEffortPod), but it looks like we missed setting the node name for the exclusive-cpu pods (in `makeTesterPodWithExclusiveCPUClaim). Currently, the test still passes consistently even with this bug because our CI creates a kind cluster with just 1 worker node - kind-ci.yaml

We should probably just pin the exclusive-cpu pods to the target node as well for now, and keep this as a single-node test ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can update, but I have this here because it was breaking for me in my little multi node cluster. Happy to apply any updates but given that we expect people to use this, probably want multi node tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Looked for other locations as well.

Copy link
Copy Markdown
Contributor

@pravk03 pravk03 Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changes look good.

Non-blocking comment - I wonder if we gain any meaningful additional coverage with multi-node tests at the driver level, given that node placement is ultimately decided by the scheduler?

cc @ffromani

@catblade catblade changed the title QoL and Test fixes Chore: QoL and Test fixes Mar 19, 2026
- get rid of magic numbers for availableCPUsByNode (still a default) and discover allAllocatedCPUsByNode
- Verify shared pool on every node that has exclusive pods. Use unique
  discovery pod names per node (discovery-pod-<nodeName>) to avoid
  name clashes on multi-node clusters.
- move code to the dracputester app. Add in associated test file (with tests)
Comment on lines +227 to +234
env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=grouped go test -v ./test/e2e/ --ginkgo.v

test-e2e-individual-run: ## patch daemonset to individual and run e2e (requires kind-e2e-setup)
kubectl -n kube-system patch daemonset dracpu --type=json -p='$(call e2e_daemonset_patch,individual)'
kubectl -n kube-system rollout status daemonset/dracpu --timeout=120s
env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=individual go test -v ./test/e2e/ --ginkgo.v

test-e2e: ## run e2e in both grouped and individual mode (one cluster: create if missing, run both, delete if we created)
Copy link
Copy Markdown
Contributor

@pravk03 pravk03 Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes here overwrites the custom CI manifests deployed in CI workflows with make ci-kind-setup. Not sure if we would want that @ffromani

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we move the MakeFile improvements to a separate PR to give more time to iterate on after 0.1 release is cut. We can limit this PR to bug fixes and improvements in tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Will do tomorrow.

@pravk03 pravk03 mentioned this pull request Mar 20, 2026
@pravk03
Copy link
Copy Markdown
Contributor

pravk03 commented Mar 20, 2026

Extracting only the test fixes into #98 to have it merged before 0.1 release. This gives more time to iterate on the improvements.

@catblade @ffromani

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants