Conversation
… with those edits:
- cpu_assignment:
  - track exclusive CPU allocations per node so the same ID on different nodes is not reported as overlapping
  - derive the expected shared pool from the target node only
  - increase verifySharedPoolMatches timeout and improve the failure message
- e2e_suite: fix BeFailedToCreate to log State.Waiting.Reason instead of State.Terminated.Reason when the container is Waiting.
- pod: on WaitToBeRunning failure, append a pod events hint (type, reason, message, time) for debugging Pending pods.
- Makefile:
  - test-e2e: use a single cluster (create if missing, run grouped then individual, delete if we created)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: catblade. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
test/e2e/cpu_assignment_test.go
```go
if alloc.CPUAssigned.Size() != cpusPerClaim {
	return fmt.Errorf("pod %d: got %d CPUs, want %d", i, alloc.CPUAssigned.Size(), cpusPerClaim)
}
if !alloc.CPUAssigned.IsSubsetOf(availableCPUs) {
```
This might fail since available CPUs are not tracked per node?
There could be a few other places in this test where we implicitly assume that all pods run on the same node.
For shared pods we explicitly set the node name (via `mustCreateBestEffortPod`), but it looks like we missed setting the node name for the exclusive-cpu pods (in `makeTesterPodWithExclusiveCPUClaim`). Currently, the test still passes consistently even with this bug because our CI creates a kind cluster with just 1 worker node (kind-ci.yaml).
We should probably just pin the exclusive-cpu pods to the target node as well for now, and keep this as a single-node test?
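A minimal sketch of the pinning idea, using stand-in types rather than the real `corev1.Pod` (the `pinToNode` helper name is illustrative, not from the repo): setting `Spec.NodeName` bypasses the scheduler and places the pod directly on the target node.

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types the test uses; the real test
// would set the field on a *corev1.Pod instead.
type PodSpec struct {
	NodeName string
}

type Pod struct {
	Name string
	Spec PodSpec
}

// pinToNode mirrors what the shared-pod helper already does: assigning
// Spec.NodeName skips scheduling and binds the pod to the target node.
func pinToNode(pod *Pod, node string) {
	pod.Spec.NodeName = node
}

func main() {
	pod := &Pod{Name: "exclusive-cpu-pod"}
	pinToNode(pod, "kind-worker")
	fmt.Println(pod.Spec.NodeName) // kind-worker
}
```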
We can update, but I have this here because it was breaking for me in my little multi-node cluster. Happy to apply any updates, but given that we expect people to use this, we probably want multi-node tests.
Addressed. Looked for other locations as well.
The changes look good.
Non-blocking comment - I wonder if we gain any meaningful additional coverage with multi-node tests at the driver level, given that node placement is ultimately decided by the scheduler?
cc @ffromani
- Get rid of magic numbers for availableCPUsByNode (still a default) and discover allAllocatedCPUsByNode.
- Verify the shared pool on every node that has exclusive pods. Use unique discovery pod names per node (discovery-pod-<nodeName>) to avoid name clashes on multi-node clusters.
- Move code to the dracputester app. Add the associated test file (with tests).
```make
	env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=grouped go test -v ./test/e2e/ --ginkgo.v

test-e2e-individual-run: ## patch daemonset to individual and run e2e (requires kind-e2e-setup)
	kubectl -n kube-system patch daemonset dracpu --type=json -p='$(call e2e_daemonset_patch,individual)'
	kubectl -n kube-system rollout status daemonset/dracpu --timeout=120s
	env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) DRACPU_E2E_CPU_DEVICE_MODE=individual go test -v ./test/e2e/ --ginkgo.v

test-e2e: ## run e2e in both grouped and individual mode (one cluster: create if missing, run both, delete if we created)
```
The changes here overwrite the custom CI manifests deployed in CI workflows with make ci-kind-setup. Not sure if we would want that @ffromani
I suggest we move the Makefile improvements to a separate PR to give more time to iterate on them after the 0.1 release is cut. We can limit this PR to bug fixes and improvements in tests.
Okay. Will do tomorrow.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
QoL updates and test fixes.
Test fixes:
- cpu_assignment:
  - track exclusive CPU allocations per node so the same ID on different nodes is not reported as overlapping
  - derive the expected shared pool from the target node only
  - increase verifySharedPoolMatches timeout and improve the failure message
- e2e_suite:
  - fix BeFailedToCreate to log State.Waiting.Reason instead of State.Terminated.Reason when the container is Waiting.
- pod:
  - on WaitToBeRunning failure, append a pod events hint (type, reason, message, time) for debugging Pending pods.

Makefile:
- test-e2e: use a single cluster (create if missing, run grouped then individual, delete if we created)