
Add new cronjob controller #3 (Open)

alaypatel07 wants to merge 1042 commits into master from add-new-cronjob-controller

Conversation

@alaypatel07 (Owner)

No description provided.


Leave a TODO here; we should try to limit the amount of data we scrape. I'd start with a label selector, which should help us with this.

@alaypatel07 (Owner Author)

The cronjob spec does not currently seem to have a selector field. Are you suggesting we add one to the spec and then use it to filter the jobs for the CronJob?

We could use a specific label like apps.k8s.io=cronjob_ns/cronjob_name to filter out the jobs immediately, but I wonder whether that is reliable: a user can easily (sometimes unintentionally) change the label, and the job would then skip the reconciliation loop (a side effect I'd rather not introduce).
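One way to square the performance concern with the reliability concern is to use a label selector only to narrow the List call, while treating ownerReferences as the source of truth: a user can relabel a Job, but cannot repoint its owner UID. A minimal sketch with hypothetical, simplified types (the real controller would use metav1.ObjectMeta and metav1.OwnerReference):

```go
package main

import "fmt"

// OwnerRef and Job are simplified stand-ins for the Kubernetes API types,
// used here only to illustrate the filtering logic.
type OwnerRef struct {
	Kind string
	UID  string
}

type Job struct {
	Name   string
	Labels map[string]string
	Owners []OwnerRef
}

// jobsForCronJob keeps only the jobs whose ownerReference points back at the
// CronJob's UID. Unlike a label match, the owner UID cannot be edited to point
// at a different object, so a relabeled Job is not silently dropped from the
// reconciliation loop.
func jobsForCronJob(jobs []Job, cronJobUID string) []Job {
	var out []Job
	for _, j := range jobs {
		for _, o := range j.Owners {
			if o.Kind == "CronJob" && o.UID == cronJobUID {
				out = append(out, j)
				break
			}
		}
	}
	return out
}

func main() {
	jobs := []Job{
		{Name: "mine", Owners: []OwnerRef{{Kind: "CronJob", UID: "abc"}}},
		{Name: "other", Owners: []OwnerRef{{Kind: "CronJob", UID: "xyz"}}},
	}
	fmt.Println(len(jobsForCronJob(jobs, "abc"))) // prints 1
}
```

With this split, a mislabeled Job costs at most an extra item in the List response; it never changes which CronJob the controller considers it to belong to.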


Yeah, it doesn't have to be here, but would be nice from a performance pov.

klog.V(2).Infof

(The same suggestion was left on three call sites.)
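For context, klog.V(2).Infof gates the log line on the process's configured verbosity, so routine per-sync messages stay quiet by default. A stdlib-only stand-in (not klog itself) that shows the gating behavior:

```go
package main

import (
	"fmt"
	"os"
)

// verbosity mimics klog's -v flag; klog.V(2).Infof only emits when the
// configured verbosity is >= 2. This is an illustrative stand-in, not klog.
var verbosity = 2

type verbose bool

// V returns a guard: logging through it is a no-op below the threshold.
func V(level int) verbose { return verbose(level <= verbosity) }

// Infof prints only when the guard is enabled.
func (v verbose) Infof(format string, args ...interface{}) {
	if v {
		fmt.Fprintf(os.Stderr, format+"\n", args...)
	}
}

func main() {
	V(2).Infof("syncing cronjob %q", "default/hello") // emitted at -v=2
	V(4).Infof("detailed state dump")                 // suppressed at -v=2
}
```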


scheduledTime will be a time in the past, if I'm not mistaken, so I'm not sure it's what you want to use in the next two conditions. You'll just want to re-queue for the next execution time, and getRecentUnmetScheduleTimes doesn't return that; it only returns past times. You'll need some kind of getNextScheduleTime.

@alaypatel07 (Owner Author)

My bad, I misunderstood our last conversation; I'll revert to the change I had before. Good catch, +1.


This is still the biggest one 😄

@alaypatel07 force-pushed the add-new-cronjob-controller branch from 2689cd9 to b2d8826 on July 11, 2020 01:45
@alaypatel07 (Owner Author)

@soltysh the way I'm trying to think about the problem here is: if the change in schedule means the next requeue has to be sooner than it already was, the queue handles that here. If the next requeue is further out than the previous schedule, the sync loop is essentially a no-op for the already-queued key with the old schedule.

@alaypatel07 force-pushed the add-new-cronjob-controller branch 7 times, most recently from 971d9c7 to e74eea6 on July 26, 2020 21:53
@alaypatel07 force-pushed the add-new-cronjob-controller branch 3 times, most recently from 6ac7233 to 3ee24ab on August 27, 2020 04:42
@alaypatel07 force-pushed the add-new-cronjob-controller branch 7 times, most recently from bcd1019 to 868cb40 on October 6, 2020 04:34
@alaypatel07 force-pushed the add-new-cronjob-controller branch 10 times, most recently from 31e8b4e to 92e44ed on October 27, 2020 05:30
pigletfly and others added 3 commits November 10, 2020 19:11

…ional-probe-protocol: Support customize load balancer health probe protocol and request path

@alaypatel07 force-pushed the add-new-cronjob-controller branch from 7321af6 to 9e12624 on November 10, 2020 14:20

…ndency: remove label dependency on k8s api in Azure

@alaypatel07 force-pushed the add-new-cronjob-controller branch from 9e12624 to 881748d on November 10, 2020 15:09

k8s-ci-robot and others added 3 commits November 10, 2020 07:17

local-up-cluster.sh: Pass CLUSTER_CIDR to kube-proxy
Add --experimental-logging-sanitization flag to control plane components

@alaypatel07 force-pushed the add-new-cronjob-controller branch from bbcad5a to 43227e5 on November 10, 2020 16:16

…rs_without_vpc_native: Forbid creating clusters with more than 100 nodes without vpc-native
Remove ready directory which created in empty volumeMounter setUp func
…te-evt: Ignore some update Pod events in scheduler
e2e test for PodFsGroupChangePolicy feature
…err-conn-killed: stop logging killing connection/stream because serving request timed out and response had been started
…ere/update_windows_container_resources: Add WindowsContainerResources to UpdateContainerResourcesRequest
…y-beta: Move fsGroupChangePolicy feature to beta
…v2-docs: cloud-provider: update docs and guidance for InstanceV2 and Zones
…ymlink-fix: fixing issue where SMB share paths cannot resolve with CRI-containerD on Windows

@alaypatel07 force-pushed the add-new-cronjob-controller branch 2 times, most recently from 75e1832 to 7f669ff on November 10, 2020 22:16
@alaypatel07 force-pushed the add-new-cronjob-controller branch from 7f669ff to 38bb535 on November 10, 2020 22:32
alaypatel07 pushed a commit that referenced this pull request Jul 21, 2025
The code path for handling non-JSON output from etcd was broken:
- It did not skip over already parsed JSON output.
- It got stuck in the wrong for loop and repeatedly tried
  parsing the same non-JSON output. This prevented test shutdown.
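The fixed loop might be sketched like this (illustrative, not the test harness's actual code): the index advances on every iteration, so already-parsed JSON is skipped over and a non-JSON line can never be retried forever:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splitOutput separates process output into parsed JSON lines and raw
// non-JSON lines. The detail matching the fix: the loop index advances on
// every path, so the loop cannot get stuck retrying the same non-JSON line
// and stall shutdown.
func splitOutput(lines []string) (parsed []map[string]interface{}, raw []string) {
	for i := 0; i < len(lines); i++ { // i++ on every path: no stuck loop
		var m map[string]interface{}
		if err := json.Unmarshal([]byte(lines[i]), &m); err == nil {
			parsed = append(parsed, m) // already-parsed JSON is skipped over
			continue
		}
		raw = append(raw, lines[i]) // non-JSON is collected, not retried
	}
	return parsed, raw
}

func main() {
	p, r := splitOutput([]string{`{"level":"info"}`, "WARNING: grpc: transport is closing"})
	fmt.Println(len(p), len(r)) // 1 1
}
```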

Not sure why yet, but in the branch with DRA v1 graduation the following error
started to show up for the first time (?!):

     2025/07/18 19:24:48 WARNING: [core] [Server #3]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
alaypatel07 pushed a commit that referenced this pull request Feb 17, 2026
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.

This log snippet shows how this went wrong during PreBind:

    dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
    dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
    dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
    ...
    allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"

- goroutine #1: UpdateStatus result delivered via informer.
  AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
  *Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
  Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.

Also, events are delivered out-of-order.

The fix is to let emitEvents, when called by Assume, wait for a potentially
running emitEvents in some other goroutine, ensuring that an event pulled off
the queue by that other goroutine gets delivered before Assume itself checks
the queue one more time and returns.
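A simplified stand-in (not the assume cache's real code) for that waiting behavior, using a mutex and condition variable:

```go
package main

import (
	"fmt"
	"sync"
)

// eventQueue sketches the fix: emitEvents first waits for any emitEvents
// already running in another goroutine, so an event that goroutine has pulled
// off the queue is fully delivered before the caller drains the queue itself
// and returns. Names and structure are illustrative only.
type eventQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	emitting bool     // true while some goroutine is inside emitEvents
	queue    []string // pending events
	log      []string // stands in for events handed to the handlers
}

func newEventQueue() *eventQueue {
	q := &eventQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *eventQueue) push(ev string) {
	q.mu.Lock()
	q.queue = append(q.queue, ev)
	q.mu.Unlock()
}

func (q *eventQueue) emitEvents() {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.emitting {
		q.cond.Wait() // wait for a concurrent emitEvents to finish delivering
	}
	q.emitting = true
	for len(q.queue) > 0 {
		ev := q.queue[0]
		q.queue = q.queue[1:]
		q.log = append(q.log, ev) // stands in for invoking the event handlers
	}
	q.emitting = false
	q.cond.Broadcast()
}

// Assume returns only after every queued event, including one another
// goroutine was mid-delivery on, has been handed to the handlers.
func (q *eventQueue) Assume(ev string) {
	q.push(ev)
	q.emitEvents()
}

func main() {
	q := newEventQueue()
	q.push("A")
	q.Assume("B")
	fmt.Println(q.log) // [A B]
}
```

In the real fix the handlers run outside the lock, which is exactly why Assume must explicitly wait on the in-progress flag rather than relying on the mutex alone; delivery happens inside the lock here only to keep the sketch short.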

The time window where things go wrong is small. An E2E test covering this only
flaked rarely, and only in the CI. An integration test (separate commit) with a
higher number of pods finally made it possible to reproduce locally. It also
uncovered a second race (fix in separate commit).

The unit test fails without the fix:

    === RUN   TestAssumeConcurrency
        assume_cache_test.go:311: FATAL ERROR:
            	Assume should have blocked and didn't.
    --- FAIL: TestAssumeConcurrency (0.00s)