Conversation

@wlggraham wlggraham commented Dec 20, 2025

PR Details

Clickup Link -

Description

The purpose of this PR is to complete the cancellation and resource cleanup process within heimdall. Currently, cancellation puts jobs in a cancelling state, which terminates the handler. Remote resources could still be running at that point, however, so we want a mechanism that can pick up stale or cancelling jobs and ensure that all resources have been properly terminated.

To do this, we pass the command handlers to the janitor, which then actively queries the DB for any stale or cancelling jobs. When it finds any, it grabs the appropriate cluster and activates the cleanup handler associated with each plugin (using the same interface that the normal handler uses). Each plugin then needs to implement a cleanup() function that takes whatever values it needs from the job/command/cluster context to terminate its resources.

I have updated the ECS plugin as an example of how the above approach would be implemented.
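
For illustration only, here is a minimal sketch of what a plugin-side cleanup handler could look like, assuming a hypothetical `Cleanup(ctx, job)` method, a hypothetical `Job` struct carrying the ECS cluster name and task ARN, and the aws-sdk-go-v2 ECS client; heimdall's actual interface, struct, and field names may differ.

```go
package ecsplugin

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
)

// Job is a stand-in for whatever job/command/cluster context heimdall
// actually passes to the cleanup handler (field names are hypothetical).
type Job struct {
	ClusterName string // ECS cluster the task was launched on
	TaskARN     string // remote resource to terminate
}

// ECSCleanup sketches a plugin cleanup handler: it takes the values recorded
// on the job and stops the corresponding remote ECS task.
type ECSCleanup struct {
	Client *ecs.Client
}

func (c *ECSCleanup) Cleanup(ctx context.Context, job Job) error {
	if job.TaskARN == "" {
		return nil // nothing was launched, so there is nothing to clean up
	}
	_, err := c.Client.StopTask(ctx, &ecs.StopTaskInput{
		Cluster: aws.String(job.ClusterName),
		Task:    aws.String(job.TaskARN),
		Reason:  aws.String("job canceled or stale; cleaned up by janitor"),
	})
	if err != nil {
		return fmt.Errorf("stopping ECS task %s: %w", job.TaskARN, err)
	}
	return nil
}
```

The important part is that everything the cleanup needs is recoverable from the job/command/cluster context, so the janitor can run it long after the original handler has exited.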

This PR was tested by creating multiple scripts that launched dozens of async jobs using different command criteria. The jobs were canceled at different speeds/intervals, and the results were analyzed.

Some things to remember:

  1. When a job is cancelled via the API, the status is updated in the DB. A goroutine monitoring that status flag runs every 10 seconds and passes the cancellation signal to the routine running the job (see the status-monitor sketch after this list). This means that if the job completes (fails/succeeds) before the cancellation context is triggered, the final status will be recorded as the final state, not canceled. This is intentional.
  2. Each running Heimdall instance will kick off a janitor every minute. It starts by creating a canceling-job channel, queries any stale or canceling jobs, updates their statuses, then sends the jobs to the cancel channel (see the janitor sketch after this list). This means the statuses of the jobs are updated right before they are actually canceled, which avoids keeping a DB connection open while running each plugin's cleanup() function.
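
A minimal sketch of the monitoring pattern from point 1, assuming a hypothetical `fetchStatus` query function and status string; the real heimdall code and names will differ:

```go
package worker

import (
	"context"
	"time"
)

// watchForCancellation polls the job's status flag every 10 seconds and
// cancels the job's context once the DB says the job is canceling.
// fetchStatus is a hypothetical stand-in for the real status query.
func watchForCancellation(ctx context.Context, cancel context.CancelFunc, jobID int64,
	fetchStatus func(context.Context, int64) (string, error)) {

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// The job finished (or was already canceled) before the next poll.
			return
		case <-ticker.C:
			status, err := fetchStatus(ctx, jobID)
			if err != nil {
				continue // transient DB error; retry on the next tick
			}
			if status == "CANCELING" {
				cancel() // signal the routine running the job
				return
			}
		}
	}
}
```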

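And a minimal sketch of the janitor flow from point 2, again with hypothetical store and plugin hooks standing in for heimdall's real types:

```go
package janitor

import (
	"context"
	"time"
)

// Job and the store interface below are hypothetical stand-ins for heimdall's
// real types; the sketch only mirrors the flow described in point 2.
type Job struct{ ID int64 }

type store interface {
	FindStaleOrCancelingJobs(ctx context.Context) ([]Job, error)
	MarkCanceled(ctx context.Context, ids []int64) error
}

// Run kicks off the janitor every minute: query stale/canceling jobs, update
// their statuses, then hand them to a channel so the plugin cleanup runs
// without holding a DB connection open.
func Run(ctx context.Context, db store, cleanup func(context.Context, Job)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			jobs, err := db.FindStaleOrCancelingJobs(ctx)
			if err != nil || len(jobs) == 0 {
				continue
			}
			ids := make([]int64, 0, len(jobs))
			for _, j := range jobs {
				ids = append(ids, j.ID)
			}
			// Statuses are updated right before the jobs are actually canceled.
			if err := db.MarkCanceled(ctx, ids); err != nil {
				continue
			}
			ch := make(chan Job, len(jobs))
			for _, j := range jobs {
				ch <- j
			}
			close(ch)
			go func() {
				// Cleanup runs off the DB path; no connection is held here.
				for j := range ch {
					cleanup(ctx, j)
				}
			}()
		}
	}
}
```
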
Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@hladush hladush changed the title janitor resource cleanup draft [WIP]janitor resource cleanup draft Jan 7, 2026
@wlggraham wlggraham changed the title [WIP]janitor resource cleanup draft janitor resource cleanup draft Jan 9, 2026
join clusters cl on cl.system_cluster_id = jj.job_cluster_id
where
aj.agent_name is null
and jj.job_status_id != 7 -- Not canceling. Jobs can be canceled before being assigned to an agent.

@wlggraham (Contributor, Author) commented:

Jobs with a status of "NEW" have not yet been assigned an agent. These jobs CAN still be canceled, though. This WHERE clause prevents those canceled jobs from being picked up and sent to workers.

system_job_id in ( {{ .Slice }} ) and
job_status_id = 3 -- running
system_job_id in ({{ .Slice }})
and job_status_id in (2, 3) -- defensive check; job could complete before being marked as stale

@wlggraham (Contributor, Author) commented:

There is a small chance that "ACCEPTED" and "RUNNING" stale jobs actually complete between the SELECT and the UPDATE calls; this status check keeps the UPDATE from overwriting their final state.
"NEW" jobs cannot go stale because they have not been assigned an agent or a heartbeat.

@wlggraham wlggraham marked this pull request as ready for review January 9, 2026 23:38