Conversation
running, including some code fixes.
- NRI parses CDI claim UIDs when DRA env is absent.
- Update test-e2e-all to force rebuild and use explicit reserved CPU wiring.
- Resolve claim UIDs from CDI devices in NRI sync/create paths with stronger ownership checks.
- Harden e2e failure detection and accept admission rejection as a valid negative claim-sharing outcome.
pravk03
left a comment
Leaving some initial comments; I will continue reviewing
```go
var errs []string
for _, container := range pod.Spec.Containers {
```
I am wondering if we should just enforce the restriction at the pod level instead of the container level:
pod-level requests >= sum of non-init container requests + claims.
Pros:
- Smoother migration to KEP-5517 (Introduce DRA for Native Resources, kubernetes/enhancements#5755).
- It allows claims to be shared between containers in the future, should we loosen the restrictions currently enforced by #6 (Shared ResourceClaim Causes Incorrect CPU Accounting).
Cons:
- This would require that CPU requests are explicitly defined at the pod level.
WDYT, @catblade @ffromani? Would this work for Slurm use cases?
I'm generally in favor, considering this is the latest trend in the community, but this would make us dependent on another feature gate, because IIRC pod-level-resources (PLR) is not yet beta.
I don't think this is necessarily a problem, just stating the fact.
Maybe we file an issue to update when the other FG goes in, and for now we do container-level?
(We can figure out the Slurm use cases regardless of the structure; we'll just map.)
Changing to pod-level validation.
Looks like the recent changes still sum the container requests and not Pod-Level Resources (pod.Spec.Resources)?
```go
// Prefer allocated device info when available.
if total, err := h.claimCPUCountFromSlices(claim); err != nil || total > 0 {
	return total, err
}
```
I think we can just reject when a pod references a claim that is already allocated.
+1, we don't support this scenario. But I'd still keep the validation also in the driver because it's the last line of defense.
Changing to validate in the admission controller that the claim is not already allocated.
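A minimal sketch of what that admission-time check might look like. All names here are illustrative stand-ins, not the PR's actual code: the real implementation would inspect `claim.Status.Allocation` and `claim.Status.ReservedFor` from `resource.k8s.io/v1`, which are reduced below to plain values.

```go
package main

import (
	"errors"
	"fmt"
)

// checkClaimUnallocated is a toy stand-in for the admission check agreed
// on above: reject a pod referencing a ResourceClaim that is already
// allocated or reserved for another consumer (sharing is unsupported).
func checkClaimUnallocated(allocated bool, reservedFor int) error {
	if allocated {
		return errors.New("claim is already allocated; sharing is not supported")
	}
	if reservedFor > 0 {
		return fmt.Errorf("claim is reserved for %d consumer(s)", reservedFor)
	}
	return nil
}

func main() {
	fmt.Println(checkClaimUnallocated(false, 0)) // fresh claim: admitted
	fmt.Println(checkClaimUnallocated(true, 0))  // already allocated: rejected
}
```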
I'll review ASAP, thanks for posting this.
```go
if !ok {
	return 0, false
}
value, ok := cpuQuantity.AsInt64()
```
Fractional requests are rounded off here. Should we use MilliValue()?
Elsewhere we have logic to ensure the request is integral (the correct thing to do!).
I think we should at least use the same rounding logic everywhere, or do as the scheduler does and carry auxiliary rounded, validated data alongside the original and base our logic on that.
Either works for me at this stage.
I intentionally forced integral. Thoughts?
ffromani
left a comment
Thanks for posting this! It is surely a good contribution. Validation helps, and this can be further extended later with another PR to make it mutating.
The biggest open question here is if we can assume that resource claims are available when a pod is being validated. I'm not sure this is the case. And if so, the webhooks would need to deal with resourceclaims possibly being created later on.
```go
/*
Copyright 2025 The Kubernetes Authors.
```
Design-wise: should this be a separate executable, or should it be merged with the main plugin? If a separate executable, I agree it should ship in the main (and only) container image.
I would really prefer to keep things as modular as possible. A user may want to do something specific with their admission controller, beyond (or less than) the capabilities this one provides.
The reasoning is that if we decompose at the package level and push all the logic into packages, both of which are desirable anyway, then the main.go code, or equivalent, becomes trivial orchestration and wiring, with perhaps some easy command-line flag processing.
If that holds, then there is an argument to merge into the main executable (the packages are shared anyway) for easier shipping and to actually reduce the trivial wiring.
That said, I'm fine keeping separate entry points and executables; we trend towards pushing logic into the packages anyway. If needed, we can always revisit the decision in the future.
```make
test-e2e: ## run e2e test against an existing configured cluster
	env DRACPU_E2E_TEST_IMAGE=$(IMAGE_TEST) DRACPU_E2E_RESERVED_CPUS=$(DRACPU_E2E_RESERVED_CPUS) go test -v ./test/e2e/ --ginkgo.v

test-e2e-kind: ci-kind-setup test-e2e ## run e2e test against a purpose-built kind cluster
```
The idea here was not to assume that e2e runs against the CI/development kind cluster; our e2e tests should run just fine against a real cluster. This reduces the chance of hidden/implicit assumptions sneaking into our own e2e tests, and also enables using the e2e tests as a validation tool for real deployments, even though likely not production deployments (but we're quite far from production-ready anyway... :\)
If possible, we should keep this distinction.
This is the last item; I'm going to think on the best way to do this. The rest should be addressed. I think Praveen had an outstanding item he mentioned to me in Slack, which hopefully I can handle early next week.
@ffromani I reworked all the e2e tests so that they run against a cluster that is already up if one exists, but if there is no cluster, they start one, run the tests, and then shut the cluster down. Is there a different flow you would like? I verified that I had this fully running with a different cluster.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: catblade
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing [...]
2ca2c64 to 0b97d3a
- Use same identifiers across the project
- Update message to use "DRA Prioritized lists is not supported"
- Move health check until after webhook is up
- Skip entries with ResourceClaimName == nil up front
- Add comments on error conditions
- Move pod validation into the admission pkg
- Fix up various small items found by running tests
- PinToNode handles cases where pinned pods may want to be scheduled on tainted nodes
- Added control-plane toleration for install-admission
- In cpu_assignment_test.go, pin exclusive-claim pods to the target node so they all run on the same node
- Move to pod rather than container totals
- Simplify some logic to better map to other parts of the project
- Requests have to match CPU requests
- Round up for fractional requests (maybe we should change this)
We can keep the admission a separate executable (therefore with a separate entry point). However, we should strive to push all the logic into packages, so the entrypoint (the code holding the [...]

I'll have a review pass as soon as you are happy with the upload @catblade
@ffromani I moved functionality into the packages. Please let me know if you want any more changes.
ffromani
left a comment
There was a problem hiding this comment.
Thanks for the updates. The biggest factor here is the claim wait logic. It's remarkably better than before, but I'm not sure yet; I need to check with DRA folks to double-check.
Please extract the many QoL changes you made from this PR into one or, better, more targeted PRs. I think we can merge them quicker.
Misc comments about code org; the biggest/only blocker is how we handle claims. Once we figure that out, I think we can quickly converge.
```go
healthzPath        string
claimGetRetryWait  time.Duration
claimGetRetryTotal time.Duration
healthzStatus      atomicBool
```
I still think we don't need this, or do we? We now set it asynchronously, but just once, and just to cover a (hopefully small) window at startup. I'd rearrange the code not to need it and just hardcode true in the healthz page; we don't seem to need anything more complex than this anytime soon.
Or, from the other angle: what's the reason to keep it? If we HAD to keep it, can we at least use the stdlib type atomic.Bool?
The healthzStatus flag is still needed, as /healthz returns 503 until the server is listening. Store is called after the server is up, so readiness probes fail until the webhook is actually ready. Will change the code to use atomic.Bool.
```go
type admissionHandler struct {
	driverName         string
	clientset          kubernetes.Interface
	claimGetRetryWait  time.Duration
	claimGetRetryTotal time.Duration
}
```
I think we should move this type and all its methods and helpers in pkg/admission (or another package) and do only setup and wiring here in main.
```go
// Fetch the ResourceClaim and sum CPU requests for this driver.
// Claims can be created asynchronously (for example from Pod claim templates),
// so retry briefly on NotFound before treating the claim as not yet available.
```
This is better, but I'm not sure it's robust enough, literally not sure. We should check in with other DRA developers. I'll ask in the Slack channel.
Added some more robustness... not sure what else we can add.
```go
claimGetRetryWait = durationFromEnv("DRACPU_ADMISSION_CLAIM_GET_RETRY_WAIT", claimGetRetryWait)
claimGetRetryTotal = durationFromEnv("DRACPU_ADMISSION_CLAIM_GET_RETRY_TOTAL", claimGetRetryTotal)
```
why do these and only these have an environment variable to override?
Removing for simplicity. I was doing some testing and missed cleaning them up.
```go
// TestAdmissionHandler_ValidatePodClaimsWiring ensures the handler implements
// ClaimCPUCountGetter and that admission.ValidatePodClaims works when called with it.
func TestAdmissionHandler_ValidatePodClaimsWiring(t *testing.T) {
```
likewise, all tests should be pushed alongside the code in pkg/admission
```yaml
apiVersion: resource.k8s.io/v1
```
Please embed using the embed package or move into testdata.
```yaml
apiVersion: v1
```
- Move code out of the cmd package into the pkg directory
- Use atomic.Bool instead of writing our own
- Move files into the testdata directory
- Harden claimCPUCount to handle other potential issues
- Remove unneeded environment variable overrides
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I'll have another look shortly. I'm inclined to move forward and to iterate in future PRs.
```go
func TestValidatePodClaims_CPURequestMatchesClaimCount(t *testing.T) {
	getter := fakeClaimCPUCountGetter{"default/claim-4": 4}
	pod := podWithClaims("default", "pod-ok", "claim-ref", "claim-4")
	pod.Spec.Containers[0].Resources.Requests = corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("4"),
	}

	errs := ValidatePodClaims(context.Background(), pod, DefaultDriverName, getter)
	if len(errs) != 0 {
		t.Fatalf("expected no errors, got %v", errs)
	}
}

// With pod-level validation, a pod with dra.cpu claims must have total CPU request equal to claim total.
func TestValidatePodClaims_NoCPURequestWithClaimRejected(t *testing.T) {
	getter := fakeClaimCPUCountGetter{"default/claim-2": 2}
	pod := podWithClaims("default", "pod-claim-only", "claim-ref", "claim-2")

	errs := ValidatePodClaims(context.Background(), pod, DefaultDriverName, getter)
	if len(errs) == 0 {
		t.Fatal("expected error (pod CPU requests 0 < claim total 2), got none")
	}
}
```
Could we convert this, and perhaps the other new test files, to table-driven tests?
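A table-driven version could follow the standard Go pattern. The self-contained sketch below uses a toy `validate` function standing in for the PR's `ValidatePodClaims`, so it runs on its own; in the real test each row would go through `t.Run` as a subtest:

```go
package main

import "fmt"

// validate is a toy stand-in for ValidatePodClaims: it flags a mismatch
// between the pod-level CPU request total and the claim CPU total.
func validate(podCPU, claimTotal int64) []string {
	var errs []string
	if podCPU != claimTotal {
		errs = append(errs, fmt.Sprintf("pod CPU requests %d != claim total %d", podCPU, claimTotal))
	}
	return errs
}

func main() {
	// Each row is one test case; in a *_test.go file this loop body
	// would be wrapped in t.Run(tc.name, func(t *testing.T) { ... }).
	cases := []struct {
		name       string
		podCPU     int64
		claimTotal int64
		wantErr    bool
	}{
		{"request matches claim count", 4, 4, false},
		{"no request with claim rejected", 0, 2, true},
	}
	for _, tc := range cases {
		errs := validate(tc.podCPU, tc.claimTotal)
		if (len(errs) > 0) != tc.wantErr {
			fmt.Printf("%s: unexpected result %v\n", tc.name, errs)
		} else {
			fmt.Printf("%s: ok\n", tc.name)
		}
	}
}
```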
```go
func CPURequestCount(cpuQuantity resource.Quantity) int64 {
	millis := cpuQuantity.MilliValue()
	if millis <= 0 {
		return 0
	}
	value := cpuQuantity.Value()
	if value*1000 == millis {
		return value
	}
	// Round up to whole cores: 400m -> 1, 500m -> 1, 1500m -> 2
	return (millis + 999) / 1000
}
```
We should not be rounding off standard requests. If a pod spec requests 800m and the claim requests 1 CPU, we currently consider this valid, which is incorrect.
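One way to tighten this, as a sketch: reject fractional requests outright instead of rounding up. `cpuRequestCountStrict` is hypothetical, not the PR's code; it takes milli-CPUs, as `resource.Quantity`'s `MilliValue()` would return:

```go
package main

import (
	"errors"
	"fmt"
)

// cpuRequestCountStrict rejects fractional CPU requests instead of
// rounding them, so an 800m request can never satisfy a 1-CPU claim.
func cpuRequestCountStrict(millis int64) (int64, error) {
	if millis <= 0 {
		return 0, nil
	}
	if millis%1000 != 0 {
		return 0, errors.New("CPU request must be a whole number of cores")
	}
	return millis / 1000, nil
}

func main() {
	for _, m := range []int64{4000, 800, 1500} {
		n, err := cpuRequestCountStrict(m)
		fmt.Println(m, n, err) // 800 and 1500 are rejected, 4000 -> 4
	}
}
```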
```go
// ValidatePodClaims enforces at pod level: when a pod has dra.cpu claims, the sum of
// non-init container CPU requests must equal the sum of CPUs from those claims.
// It returns a list of errors.
func ValidatePodClaims(ctx context.Context, pod *corev1.Pod, driverName string, getter ClaimCPUCountGetter) []string {
```
I propose an alternate logic for this function:

Case 1: pod.Spec.Resources not populated
- For individual containers with claims, standard request == claim request.
- If multiple containers reference the same claim, sum of standard requests from containers referencing the same claim == claim request.

Case 2: pod.Spec.Resources is specified
- Containers with claims should not specify standard CPU requests.
- pod.Spec.Resources >= sum of claim requests from pod.Spec.ResourceClaims + sum of standard requests from non-init containers.

Once we introduce a mutating webhook, we could always set pod.Spec.Resources, which would make migration to KEP-5517 seamless.
cc: @ffromani
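The two cases above could be sketched roughly like this. All types and names are toy stand-ins, not the PR's actual code, and case 1 is simplified to a per-container check:

```go
package main

import "fmt"

// container reduces a pod container to the two quantities the proposed
// validation cares about, in milli-CPUs.
type container struct {
	cpuMillis   int64 // standard CPU request
	claimMillis int64 // CPU from referenced dra.cpu claims (0 = none)
}

// validatePod applies the two-case logic: podLevelMillis == 0 means
// pod.Spec.Resources is not populated.
func validatePod(podLevelMillis int64, containers []container) []string {
	var errs []string
	if podLevelMillis == 0 {
		// Case 1: no pod-level resources; per-container check.
		for i, c := range containers {
			if c.claimMillis > 0 && c.cpuMillis != c.claimMillis {
				errs = append(errs, fmt.Sprintf("container %d: request %dm != claim %dm", i, c.cpuMillis, c.claimMillis))
			}
		}
		return errs
	}
	// Case 2: pod-level resources set; containers with claims must not
	// also set standard requests, and the pod total must cover everything.
	var claimSum, reqSum int64
	for i, c := range containers {
		if c.claimMillis > 0 && c.cpuMillis > 0 {
			errs = append(errs, fmt.Sprintf("container %d: has both claim and standard CPU request", i))
		}
		claimSum += c.claimMillis
		reqSum += c.cpuMillis
	}
	if podLevelMillis < claimSum+reqSum {
		errs = append(errs, fmt.Sprintf("pod-level %dm < claims %dm + requests %dm", podLevelMillis, claimSum, reqSum))
	}
	return errs
}

func main() {
	ok := validatePod(0, []container{{cpuMillis: 4000, claimMillis: 4000}})
	bad := validatePod(2000, []container{{claimMillis: 2000}, {cpuMillis: 1000}})
	fmt.Println(len(ok), len(bad)) // 0 1
}
```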
```go
if count < 1 {
	count = 1
}
if req.Capacity != nil && len(req.Capacity.Requests) > 0 {
```
I realized there is an edge case with cpu-device-mode=grouped. In grouped mode, a count of 1 without req.Capacity doesn't mean 1 CPU; it means 1 device (an entire NUMA node or socket, depending on --cpu-device-group-by) is allocated to the claim.
To address this, we could pass the --cpu-device-mode flag value to the webhook:
- If mode == individual, we can keep the logic as-is and just read the count.
- If mode == grouped, we require req.Capacity to also be set. We could temporarily throw a validation error if req.Capacity is nil, with a TODO to handle this later.
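A rough sketch of the mode-aware counting proposed above. `claimCPUs` and its arguments are illustrative; the real code would read the device request's count and `req.Capacity` from the claim spec:

```go
package main

import (
	"errors"
	"fmt"
)

// claimCPUs converts a device request into a CPU count depending on
// --cpu-device-mode. capacityCPUs is the per-device CPU capacity taken
// from req.Capacity (0 when req.Capacity is absent).
func claimCPUs(mode string, count, capacityCPUs int64) (int64, error) {
	if count < 1 {
		count = 1
	}
	switch mode {
	case "individual":
		// Each device is exactly one CPU; the count is the answer.
		return count, nil
	case "grouped":
		// A device is a whole group (NUMA node/socket), so a bare count
		// is ambiguous: require an explicit capacity request.
		if capacityCPUs <= 0 {
			return 0, errors.New("grouped mode requires req.Capacity with a CPU request")
		}
		return count * capacityCPUs, nil
	}
	return 0, fmt.Errorf("unknown cpu-device-mode %q", mode)
}

func main() {
	fmt.Println(claimCPUs("individual", 4, 0)) // 4 <nil>
	fmt.Println(claimCPUs("grouped", 2, 8))    // 16 <nil>
	_, err := claimCPUs("grouped", 1, 0)
	fmt.Println(err) // validation error: capacity missing
}
```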
Add a webhook admission controller to make sure that CPU counts are in line with each other.
Change the Makefile and expand it to make e2e testing easy.
Add easy install targets for individual/grouped modes in the Makefile.
Some changes to clean up NRI bugs discovered with testing.