Conversation

@wlggraham wlggraham commented Dec 20, 2025

PR Details

Clickup Link -

Description

The purpose of this PR is to complete the cancellation and resource cleanup process within heimdall. Currently, cancellation puts jobs in a cancelling state, which terminates the handler. Remote resources could still be running at that point, however, so we want a mechanism that can pick up stale or cancelling jobs and ensure that all resources have been properly terminated.

To do this, we pass the command handlers to the janitor, which then actively queries the DB for any stale or cancelling jobs. When it finds any, it grabs the appropriate cluster and activates the cleanup handler associated with each plugin (using the same interface that the normal handler uses). Each plugin then needs to implement a cleanup() function that takes whatever values it needs from the job/command/cluster context to terminate its resources.

I have updated the ECS plugin as an example of how the above approach would be implemented.
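
For illustration only, here is a minimal sketch of what a plugin-side cleanup handler could look like, assuming a hypothetical `Cleanup(ctx, job)` method, a hypothetical `Job` struct carrying the ECS cluster name and task ARN, and the aws-sdk-go-v2 ECS client; heimdall's actual interface, struct, and field names may differ.

```go
package ecsplugin

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
)

// Job is a stand-in for whatever job/command/cluster context heimdall
// actually passes to the cleanup handler (field names are hypothetical).
type Job struct {
	ClusterName string // ECS cluster the task was launched on
	TaskARN     string // remote resource to terminate
}

// ECSCleanup sketches a plugin cleanup handler: it takes the values recorded
// on the job and stops the corresponding remote ECS task.
type ECSCleanup struct {
	Client *ecs.Client
}

func (c *ECSCleanup) Cleanup(ctx context.Context, job Job) error {
	if job.TaskARN == "" {
		return nil // nothing was launched, so there is nothing to clean up
	}
	_, err := c.Client.StopTask(ctx, &ecs.StopTaskInput{
		Cluster: aws.String(job.ClusterName),
		Task:    aws.String(job.TaskARN),
		Reason:  aws.String("job canceled or stale; cleaned up by janitor"),
	})
	if err != nil {
		return fmt.Errorf("stopping ECS task %s: %w", job.TaskARN, err)
	}
	return nil
}
```

The important part is that everything the cleanup needs is recoverable from the job/command/cluster context, so the janitor can run it long after the original handler has exited.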

This PR was tested by creating multiple scripts that launched dozens of async jobs using different command criteria. The jobs were canceled at different speeds/intervals, and the results were analyzed.

Some things to remember:

  1. When a job is cancelled via the API, the status is updated in the DB. A goroutine monitoring that status flag runs every 10 seconds and passes the cancellation signal to the routine running the job (see the status-monitor sketch after this list). This means that if the job completes (fails/succeeds) before the cancellation context is triggered, the final status will be recorded as the final state, not canceled. This is intentional.
  2. Each running Heimdall instance will kick off a janitor every minute. It starts by creating a canceling-job channel, queries any stale or canceling jobs, updates their statuses, then sends the jobs to the cancel channel (see the janitor sketch after this list). This means the statuses of the jobs are updated right before they are actually canceled, which avoids keeping a DB connection open while running each plugin's cleanup() function.
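
A minimal sketch of the monitoring pattern from point 1, assuming a hypothetical `fetchStatus` query function and status string; the real heimdall code and names will differ:

```go
package worker

import (
	"context"
	"time"
)

// watchForCancellation polls the job's status flag every 10 seconds and
// cancels the job's context once the DB says the job is canceling.
// fetchStatus is a hypothetical stand-in for the real status query.
func watchForCancellation(ctx context.Context, cancel context.CancelFunc, jobID int64,
	fetchStatus func(context.Context, int64) (string, error)) {

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// The job finished (or was already canceled) before the next poll.
			return
		case <-ticker.C:
			status, err := fetchStatus(ctx, jobID)
			if err != nil {
				continue // transient DB error; retry on the next tick
			}
			if status == "CANCELING" {
				cancel() // signal the routine running the job
				return
			}
		}
	}
}
```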

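And a minimal sketch of the janitor flow from point 2, again with hypothetical store and plugin hooks standing in for heimdall's real types:

```go
package janitor

import (
	"context"
	"time"
)

// Job and the store interface below are hypothetical stand-ins for heimdall's
// real types; the sketch only mirrors the flow described in point 2.
type Job struct{ ID int64 }

type store interface {
	FindStaleOrCancelingJobs(ctx context.Context) ([]Job, error)
	MarkCanceled(ctx context.Context, ids []int64) error
}

// Run kicks off the janitor every minute: query stale/canceling jobs, update
// their statuses, then hand them to a channel so the plugin cleanup runs
// without holding a DB connection open.
func Run(ctx context.Context, db store, cleanup func(context.Context, Job)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			jobs, err := db.FindStaleOrCancelingJobs(ctx)
			if err != nil || len(jobs) == 0 {
				continue
			}
			ids := make([]int64, 0, len(jobs))
			for _, j := range jobs {
				ids = append(ids, j.ID)
			}
			// Statuses are updated right before the jobs are actually canceled.
			if err := db.MarkCanceled(ctx, ids); err != nil {
				continue
			}
			ch := make(chan Job, len(jobs))
			for _, j := range jobs {
				ch <- j
			}
			close(ch)
			go func() {
				// Cleanup runs off the DB path; no connection is held here.
				for j := range ch {
					cleanup(ctx, j)
				}
			}()
		}
	}
}
```
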
Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@hladush hladush changed the title janitor resource cleanup draft [WIP]janitor resource cleanup draft Jan 7, 2026
@wlggraham wlggraham changed the title [WIP]janitor resource cleanup draft janitor resource cleanup draft Jan 9, 2026
join clusters cl on cl.system_cluster_id = jj.job_cluster_id
where
aj.agent_name is null
and jj.job_status_id != 7 -- Not canceling. Jobs can be canceled before being assigned to an agent.

@wlggraham (Contributor, Author) commented:

Jobs with a status of "NEW" have not yet been assigned an agent. These jobs CAN still be canceled, though. This WHERE clause prevents those canceled jobs from being picked up and sent to workers.

system_job_id in ( {{ .Slice }} ) and
job_status_id = 3 -- running
system_job_id in ({{ .Slice }})
and job_status_id in (2, 3) -- defensive check; job could complete before being marked as stale

@wlggraham (Contributor, Author) commented:

There is a small chance that "ACCEPTED" and "RUNNING" stale jobs actually complete between the SELECT and the UPDATE calls; this status check keeps the UPDATE from overwriting their final state.
"NEW" jobs cannot go stale because they have not been assigned an agent or a heartbeat.

@wlggraham wlggraham marked this pull request as ready for review January 9, 2026 23:38