Conversation
Leave a TODO here; we should try to limit the amount of data we scrape. I'd start with a label selector, which should help with this.
The CronJob spec does not currently seem to have a selector field. Are you suggesting we add that to the spec and then use it to filter the jobs for the CronJob?
We could have a specific label such as apps.k8s.io=cronjob_ns/cronjob_name to filter out the jobs immediately, but I wonder whether that is reliable: a user can easily (sometimes unintentionally) change it, and the reconciliation loop would then skip that job (a side effect I'd prefer not to introduce).
Yeah, it doesn't have to be here, but it would be nice from a performance point of view.
scheduledTime will be a time in the past, if I'm not mistaken, so I'm not sure that's what you want to use in the next two conditions. You'll just want to re-queue it for the next execution time, and getRecentUnmetScheduleTimes doesn't return that; it only returns past times. You'll need some kind of getNextScheduleTime.
My bad, I misunderstood our last conversation. I'll revert that change to what I had before. Good catch, +1
@soltysh the way I am trying to think about the problem here is that, if the change in schedule results in the next requeue having to be sooner than it already was, it will be handled here by the queue. If the next requeue is further out than the previous schedule, the sync loop will essentially be a no-op for the already-queued key with the old schedule.
…ional-probe-protocol Support customize load balancer health probe protocol and request path
fix golint, fix gofmt
…ndency remove label dependency on k8s api in Azure
local-up-cluster.sh: Pass CLUSTER_CIDR to kube-proxy
Add --experimental-logging-sanitization flag to control plane components
…rs_without_vpc_native Forbid creating clusters with more than 100 nodes without vpc-native
Remove ready directory which is created in empty volumeMounter setUp func
…te-evt Ignore some update Pod events in scheduler
e2e test for PodFsGroupChangePolicy feature
Remove duplicate import
…er-version Add vCenter info metric
…err-conn-killed stop logging killing connection/stream because serving request timed out and response had been started
…ere/update_windows_container_resources Add WindowsContainerResources to UpdateContainerResourcesRequest
…y-beta Move fsGroupChangePolicy feature to beta
Fix command and arg in NPD e2e
…v2-docs cloud-provider: update docs and guidance for InstanceV2 and Zones
Remove Const IPVSProxyMode
…ymlink-fix fixing issue where SMB share paths cannot resolve with CRI-containerD on Windows
The code path for handling non-JSON output from etcd was broken:
- It did not skip over already parsed JSON output.
- It got stuck in the wrong for loop and repeatedly tried
parsing the same non-JSON output. This prevented test shutdown.
Not sure why yet, but in the branch with DRA v1 graduation the following error
started to show up for the first time (?!):
2025/07/18 19:24:48 WARNING: [core] [Server #3]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.
This log snippet shows how this went wrong during PreBind:
dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
...
allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"
- goroutine #1: UpdateStatus result delivered via informer.
AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
*Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.
Also, events are delivered out-of-order.
The fix is to make emitEvents, when called by Assume, wait for a potentially running emitEvents in some other goroutine, thus ensuring that an event pulled from the queue by that other goroutine has been delivered before Assume itself checks the queue one more time and returns.
The time window where things go wrong is small. An E2E test covering this flaked only rarely, and only in CI. An integration test (separate commit) with a higher number of pods finally made it possible to reproduce locally. It also uncovered a second race (fix in a separate commit).
The unit test fails without the fix:
=== RUN TestAssumeConcurrency
assume_cache_test.go:311: FATAL ERROR:
Assume should have blocked and didn't.
--- FAIL: TestAssumeConcurrency (0.00s)