janitor resource cleanup draft #73
```sql
join clusters cl on cl.system_cluster_id = jj.job_cluster_id
where
    aj.agent_name is null
    and jj.job_status_id != 7 -- Not canceling. Jobs can be canceled before being assigned to an agent.
```
Jobs with the "NEW" status have not yet been assigned an agent, but they CAN still be canceled. This where clause prevents those canceled jobs from being picked up and sent to workers.
```diff
- system_job_id in ( {{ .Slice }} ) and
- job_status_id = 3 -- running
+ system_job_id in ({{ .Slice }})
+ and job_status_id in (2, 3) -- defensive check; job could complete before being marked as stale
```
There is a small chance that "ACCEPTED" and "RUNNING" stale jobs actually complete between the SELECT and the UPDATE calls. The defensive status check keeps the UPDATE from overwriting their status in that case.
"NEW" jobs cannot go stale because they have not been assigned an agent or heartbeat.
PR Details
Clickup Link -
Description
The purpose of this PR is to complete the cancellation and resource cleanup process within heimdall. Currently, cancellation puts jobs in a cancelling state, which terminates the handler; remote resources could still be running, however, so we want a mechanism that picks up stale or cancelling jobs and ensures that all of their resources have been properly terminated.
To do this, we pass the command handlers to the janitor, which actively queries the DB for any stale or cancelling jobs. When it finds any, it grabs the appropriate cluster and activates the cleanup handler associated with each plugin (using the same interface that the normal handler uses). Each plugin then needs to implement a cleanup() function that takes values from the job/command/cluster context and terminates the corresponding resources.
I have updated the ECS plugin as an example of how the above approach would be implemented.
This PR was tested by creating multiple scripts that launched dozens of async jobs using different command criteria. The jobs were canceled at different speeds/intervals, and the results were analyzed.
Some things to remember:
- Each plugin needs to implement its own cleanup() function.
Types of changes
Checklist