Refactor cronjob controller #2

Open

alaypatel07 wants to merge 4 commits into master from refactor-cronjob-controller

Conversation

@alaypatel07
Owner

No description provided.

alaypatel07 pushed a commit that referenced this pull request Jun 22, 2020
alaypatel07 force-pushed the refactor-cronjob-controller branch from 886b563 to a20566d on June 22, 2020 03:30
alaypatel07 pushed a commit that referenced this pull request Jul 17, 2025
[Attempt #2] apiserver: Treat error decoding a mutating webhook patch as error calling the webhook
alaypatel07 pushed a commit that referenced this pull request Oct 2, 2025
Instead of creating a new test case, the permutation is passed down. This
enables adding the event numbers to the log output, which makes it easier to
see which output belongs to which input:

    === RUN   TestListPatchedResourceSlices/update-patch/2_3_0_1
    tracker.go:396: I0929 14:28:40.032318] event #1: ResourceSlice add slice="s1"
    tracker.go:581: I0929 14:28:40.032404] event #1: syncing ResourceSlice resourceslice="s1"
    tracker.go:659: I0929 14:28:40.032446] event #1: ResourceSlice synced resourceslice="s1" change="add"
    tracker.go:396: I0929 14:28:40.032502] event #2: ResourceSlice add slice="s2"
    tracker.go:581: I0929 14:28:40.032536] event #2: syncing ResourceSlice resourceslice="s2"
    tracker.go:659: I0929 14:28:40.032568] event #2: ResourceSlice synced resourceslice="s2" change="add"
    tracker.go:463: I0929 14:28:40.032609] event #0/#0: DeviceTaintRule add patch="rule"
    tracker.go:581: I0929 14:28:40.032639] event #0/#0: syncing ResourceSlice resourceslice="s1"
    tracker.go:703: I0929 14:28:40.032675] event #0/#0: processing DeviceTaintRule resourceslice="s1" deviceTaintRule="rule"
    tracker.go:807: I0929 14:28:40.032712] event #0/#0: applying matching DeviceTaintRule resourceslice="s1" deviceTaintRule="rule" device="driver1.example.com/pool-1/device-1"
    tracker.go:868: I0929 14:28:40.032780] event #0/#0: Assigned new taint ID, no matching taint resourceslice="s1" deviceTaintRule="rule" device="driver1.example.com/pool-1/device-1" taintID=0 taint="example.com/taint=tainted:NoExecute"
    tracker.go:654: I0929 14:28:40.033023] event #0/#0: ResourceSlice synced resourceslice="s1" change="update" diff=<
        	@@ -23,7 +23,32 @@
        	     "BindingConditions": null,
        	     "BindingFailureConditions": null,
        	     "AllowMultipleAllocations": null,
        	-    "Taints": null
        	+    "Taints": [
        	+     {
        	+      "Rule": {
        	+       "metadata": {
        	+        "name": "rule"
        	+       },
        	+       "spec": {
        	+        "deviceSelector": {
        	+         "pool": "pool-1"
        	+        },
        	+        "taint": {
        	+         "key": "example.com/taint",
        	+         "value": "tainted",
        	+         "effect": "NoExecute",
        	+         "timeAdded": "2006-01-02T15:04:05Z"
        	+        }
        	+       },
        	+       "status": {}
        	+      },
        	+      "ID": 1,
        	+      "key": "example.com/taint",
        	+      "value": "tainted",
        	+      "effect": "NoExecute",
        	+      "timeAdded": "2006-01-02T15:04:05Z"
        	+     }
        	+    ]
        	    }
        	   ],
        	   "Taints": null,
         >
    tracker.go:482: I0929 14:28:40.033224] event #0/#1: DeviceTaintRule update patch="rule" diff=<
        	@@ -4,7 +4,7 @@
        	  },
        	  "spec": {
        	   "deviceSelector": {
        	-   "pool": "pool-1"
        	+   "pool": "pool-2"
        	   },
        	   "taint": {
        	    "key": "example.com/taint",
         >
    tracker.go:581: I0929 14:28:40.033285] event #0/#1: syncing ResourceSlice resourceslice="s1"
    tracker.go:703: I0929 14:28:40.033319] event #0/#1: processing DeviceTaintRule resourceslice="s1" deviceTaintRule="rule"
    tracker.go:654: I0929 14:28:40.033478] event #0/#1: ResourceSlice synced resourceslice="s1" change="update" diff=<
        	@@ -23,32 +23,7 @@
        	     "BindingConditions": null,
        	     "BindingFailureConditions": null,
        	     "AllowMultipleAllocations": null,
        	-    "Taints": [
        	-     {
        	-      "Rule": {
        	-       "metadata": {
        	-        "name": "rule"
        	-       },
        	-       "spec": {
        	-        "deviceSelector": {
        	-         "pool": "pool-1"
        	-        },
        	-        "taint": {
        	-         "key": "example.com/taint",
        	-         "value": "tainted",
        	-         "effect": "NoExecute",
        	-         "timeAdded": "2006-01-02T15:04:05Z"
        	-        }
        	-       },
        	-       "status": {}
        	-      },
        	-      "ID": 1,
        	-      "key": "example.com/taint",
        	-      "value": "tainted",
        	-      "effect": "NoExecute",
        	-      "timeAdded": "2006-01-02T15:04:05Z"
        	-     }
        	-    ]
        	+    "Taints": null
        	    }
        	   ],
        	   "Taints": null,
         >
    tracker.go:581: I0929 14:28:40.033601] event #0/#1: syncing ResourceSlice resourceslice="s2"
    tracker.go:703: I0929 14:28:40.033633] event #0/#1: processing DeviceTaintRule resourceslice="s2" deviceTaintRule="rule"
    ...

Disabling event checking only worked when all sub-tests were actually run. When
selectively running only one permutation with -run, the boolean variable that
controls the check ended up with the wrong value:

    $ go test -run='.*/^update-patch$' ./staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/
    ok  	k8s.io/dynamic-resource-allocation/resourceslice/tracker

    $ go test -run='.*/^update-patch$/3_2_0_1' ./staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/
    --- FAIL: TestListPatchedResourceSlices (0.01s)
        --- FAIL: TestListPatchedResourceSlices/update-patch (0.00s)
            --- FAIL: TestListPatchedResourceSlices/update-patch/3_2_0_1 (0.00s)

                tracker_test.go:762:
                     	Error Trace:	/nvme/gopath/src/k8s.io/kubernetes/staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker_test.go:762
                     	            				/nvme/gopath/src/k8s.io/kubernetes/staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker_test.go:856
                    	Error:      	Not equal:
                    	            	expected: []tracker.handlerEvent{tracker.handlerEvent{event:"add", oldObj:(*api.ResourceSlice)(nil), newObj:(*api.ResourceSlice)(0xc000301d40)}, tracker.handlerEvent{event:"add", oldObj:(*api.ResourceSlice)(nil), newObj:(*api.ResourceSlice)(0xc000346000)}}
                    	            	actual  : []tracker.handlerEvent{tracker.handlerEvent{event:"add", oldObj:(*api.ResourceSlice)(nil), newObj:(*api.ResourceSlice)(0xc0001f9ba0)}, tracker.handlerEvent{event:"add", oldObj:(*api.ResourceSlice)(nil), newObj:(*api.ResourceSlice)(0xc000301d40)}, tracker.handlerEvent{event:"update", oldObj:(*api.ResourceSlice)(0xc000301d40), newObj:(*api.ResourceSlice)(0xc0003dba00)}, tracker.handlerEvent{event:"update", oldObj:(*api.ResourceSlice)(0xc0003dba00), newObj:(*api.ResourceSlice)(0xc000301d40)}, tracker.handlerEvent{event:"update", oldObj:(*api.ResourceSlice)(0xc0001f9ba0), newObj:(*api.ResourceSlice)(0xc0003dbba0)}}

Now permutations are detected automatically based on the indices.

While at it, some documentation is moved around to make test cases easier to
read without jumping to the implementation.
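
A minimal sketch of the idea, with hypothetical names rather than the actual
tracker test code: the permutation is passed into the sub-test, every log line
is prefixed with the index of the event being replayed, and whether the exact
handler event sequence can be checked is derived from the indices themselves
instead of a separately maintained boolean:

    // Hypothetical sketch; names and types do not come from the real tracker tests.
    package tracker

    import (
        "fmt"
        "testing"
    )

    // inputEvent stands in for one tracker input (add/update of a slice or rule).
    type inputEvent struct{ name string }

    // isIdentity reports whether perm replays the inputs in their original order.
    // Only in that case is the exact handler event sequence checked, so the
    // decision no longer depends on which sub-tests happen to run.
    func isIdentity(perm []int) bool {
        for i, v := range perm {
            if i != v {
                return false
            }
        }
        return true
    }

    // runPermutation replays the inputs in the order given by perm and prefixes
    // every log line with the index of the event, so output like "event #2: ..."
    // can be traced back to its input.
    func runPermutation(t *testing.T, events []inputEvent, perm []int) {
        for _, idx := range perm {
            t.Logf("event #%d: %s", idx, events[idx].name)
            // ... feed events[idx] into the tracker under test ...
        }
        if isIdentity(perm) {
            // ... compare recorded handler events against the expected sequence ...
        }
    }

    func TestPermutations(t *testing.T) {
        events := []inputEvent{{"add s1"}, {"add s2"}, {"add rule"}, {"update rule"}}
        for _, perm := range [][]int{{0, 1, 2, 3}, {2, 3, 0, 1}} {
            name := fmt.Sprintf("%d_%d_%d_%d", perm[0], perm[1], perm[2], perm[3])
            t.Run(name, func(t *testing.T) { runPermutation(t, events, perm) })
        }
    }
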
alaypatel07 pushed a commit that referenced this pull request Feb 17, 2026
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.

This log snippet shows how this went wrong during PreBind:

    dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
    dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
    dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
    ...
    allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"

- goroutine #1: UpdateStatus result delivered via informer.
  AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
  *Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
  Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.

Also, events are delivered out-of-order.

The fix is to make emitEvents, when called by Assume, wait for a potentially
running emitEvents in some other goroutine. This ensures that an event pulled
off the queue by that other goroutine has been delivered before Assume itself
checks the queue one more time and returns.
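
A minimal sketch of that waiting pattern, using a condition variable over an
"emitting" flag (hypothetical types, not the actual assume cache code):

    package main

    import (
        "fmt"
        "sync"
    )

    type eventQueue struct {
        mu       sync.Mutex
        cond     *sync.Cond
        emitting bool
        queue    []string
    }

    func newEventQueue() *eventQueue {
        q := &eventQueue{}
        q.cond = sync.NewCond(&q.mu)
        return q
    }

    func (q *eventQueue) push(ev string) {
        q.mu.Lock()
        q.queue = append(q.queue, ev)
        q.mu.Unlock()
    }

    // emitEvents drains the queue. With wait=true (the Assume path) it first
    // waits for a drain that is already running in another goroutine, so that an
    // event pulled off the queue there has been delivered before this call
    // checks the queue one more time and returns.
    func (q *eventQueue) emitEvents(wait bool) {
        q.mu.Lock()
        defer q.mu.Unlock()
        for q.emitting {
            if !wait {
                return // the running drain will pick up anything still queued
            }
            q.cond.Wait()
        }
        q.emitting = true
        for len(q.queue) > 0 {
            ev := q.queue[0]
            q.queue = q.queue[1:]
            q.mu.Unlock()
            fmt.Println("delivering", ev) // event handlers run without the lock
            q.mu.Lock()
        }
        q.emitting = false
        q.cond.Broadcast()
    }

    func main() {
        q := newEventQueue()
        q.push("A")
        go q.emitEvents(false) // informer goroutine
        q.push("B")
        q.emitEvents(true) // Assume path: returns only after A and B are delivered
    }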

The time window where things go wrong is small. An E2E test covering this only
flaked rarely, and only in CI. An integration test (separate commit) with a
higher number of pods finally made it possible to reproduce locally. It also
uncovered a second race (fix in a separate commit).

The unit test fails without the fix:

    === RUN   TestAssumeConcurrency
        assume_cache_test.go:311: FATAL ERROR:
            	Assume should have blocked and didn't.
    --- FAIL: TestAssumeConcurrency (0.00s)
alaypatel07 pushed a commit that referenced this pull request Feb 17, 2026
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).

The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume cache (which
feeds the allocated devices tracker) and of the in-flight claims can interleave
such that GatherAllocatedState does not see the device in that claim as allocated:

    dra_manager.go:263: I0115 15:11:04.407714      18778] scheduler: Starting GatherAllocatedState
    ...
    allocateddevices.go:189: I0115 15:11:04.407945      18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
    dynamicresources.go:1150: I0115 15:11:04.407981      89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
    dra_manager.go:201: I0115 15:11:04.408008      89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
    dynamicresources.go:1157: I0115 15:11:04.408044      89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
        	{
        	  "devices": {
        	    "results": [
        	      {
        	        "request": "req-1",
        	        "driver": "testdra-all-usesallresources-hvs5d.driver",
        	        "pool": "worker-5",
        	        "device": "worker-5-device-094"
        	      }
        	    ]
        	  },
        	  "nodeSelector": {
        	    "nodeSelectorTerms": [
        	      {
        	        "matchFields": [
        	          {
        	            "key": "metadata.name",
        	            "operator": "In",
        	            "values": [
        	              "worker-5"
        	            ]
        	          }
        	        ]
        	      }
        	    ]
        	  },
        	  "allocationTimestamp": "2026-01-15T14:11:04Z"
        	}
         >
    dra_manager.go:280: I0115 15:11:04.408085      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
    dra_manager.go:280: I0115 15:11:04.408137      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
    default_binder.go:69: I0115 15:11:04.408175      89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
    dra_manager.go:265: I0115 15:11:04.408264      18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {

Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight

=> device never seen as allocated
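
A minimal sketch of that interleaving (hypothetical types, not the scheduler
code): because the two sources are read in separate critical sections, a
device that move() shifts from the in-flight set into the cache between the
two reads is never reported by gather():

    package sketch

    import "sync"

    // state mimics the two sources GatherAllocatedState reads from.
    type state struct {
        mu       sync.Mutex
        cache    map[string]bool // devices known via the assume cache
        inFlight map[string]bool // devices of in-flight claims
    }

    // gather copies the cache first and checks the in-flight claims afterwards,
    // each under its own critical section (goroutine #1 above).
    func (s *state) gather() map[string]bool {
        allocated := map[string]bool{}

        s.mu.Lock()
        for d := range s.cache {
            allocated[d] = true
        }
        s.mu.Unlock()

        // If move() runs at this point, the device is already gone from
        // inFlight but was not yet in cache when the cache was copied:
        // it is never reported as allocated.

        s.mu.Lock()
        for d := range s.inFlight {
            allocated[d] = true
        }
        s.mu.Unlock()

        return allocated
    }

    // move mimics goroutine #2: the claim lands in the assume cache and is
    // removed from the in-flight claims.
    func (s *state) move(device string) {
        s.mu.Lock()
        s.cache[device] = true
        delete(s.inFlight, device)
        s.mu.Unlock()
    }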

This is the second reason for double allocation of the same device in two
different claims. The other was the timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
every time, but often enough that regressions should show up as flakes.