Add cleanup module to handle graceful shutdown, improve logging experience by juliusgeo · Pull Request #3260 · hatchet-dev/hatchet

juliusgeo · 2026-03-12T22:13:13Z

gregfurman · 2026-03-13T10:55:08Z

pkg/cleanup/cleanup.go

+}
+
+func (c *Cleanup) Run() error {
+	log.Printf("waiting for all other services to gracefully exit...")


Should we rather be passing that zerolog logger and using it for logging ~~since (afaik) this is logging to stdout~~?

hatchet/cmd/hatchet-engine/engine/run.go, line 95 in bf05890:

	var l = server.Logger

Fixed: it now uses warn-level log statements, and the warnings are only logged if cleanup exceeds its total time limit.

gregfurman · 2026-03-13T14:30:59Z

pkg/cleanup/cleanup.go

+		if err := fn.Fn(); err != nil {
+			return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
+		}


q: What happens if we early exit without shutting down all processes? Is there a chance shutdown could hang?

So for this change, I'm trying to preserve the behavior we had before: https://github.com/hatchet-dev/hatchet/pull/3260/changes/BASE..bf05890f0ba6f88de0c1e861a7ba0323ae7d2406#diff-a474308cfad07b710f145423c053e3cca49384ea95cb1144e3e18b4509808ba5L117
which also errors out when cleanup cannot be completed. I don't think it would hang, though; it would just shut down without cleaning up gracefully, which is already the case whenever there is an error.

gregfurman · 2026-03-13T14:54:02Z

cmd/hatchet-engine/engine/run.go

+	cleanup.Add(func() error {
+		return server.Disconnect()
+	}, "database")


nit: why not pass the function as a param instead?

Suggested change

-	cleanup.Add(func() error {
-		return server.Disconnect()
-	}, "database")
+	cleanup.Add(server.Disconnect, "database")

Good catch, fixed.
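The suggestion works because `server.Disconnect` is a method value: an ordinary `func() error` with its receiver already bound. A minimal sketch, assuming simplified stand-ins for the real types (`server`, `namedFn`, `cleanup`, and `disconnectedViaMethodValue` are all hypothetical here):

```go
package main

import "fmt"

// server is a hypothetical stand-in for the engine's server object.
type server struct{ connected bool }

// Disconnect has the signature func() error, matching what Add expects.
func (s *server) Disconnect() error {
	s.connected = false
	return nil
}

// namedFn and cleanup are simplified stand-ins for the types in pkg/cleanup.
type namedFn struct {
	name string
	fn   func() error
}

type cleanup struct{ fns []namedFn }

func (c *cleanup) Add(fn func() error, name string) {
	c.fns = append(c.fns, namedFn{name: name, fn: fn})
}

// disconnectedViaMethodValue registers s.Disconnect directly and reports
// whether running the registered function disconnected the same server.
func disconnectedViaMethodValue() bool {
	s := &server{connected: true}
	c := &cleanup{}
	c.Add(s.Disconnect, "database") // method value, bound to s at this point
	for _, f := range c.fns {
		if err := f.fn(); err != nil {
			return false
		}
	}
	return !s.connected
}

func main() {
	fmt.Println(disconnectedViaMethodValue())
}
```

One subtlety: a method value binds its receiver when the expression is evaluated, whereas the closure form reads the variable when it runs. The two only differ if the variable is reassigned between registration and cleanup.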

gregfurman · 2026-03-13T16:29:14Z

pkg/cleanup/cleanup.go

+
+func (c *Cleanup) Run(l *zerolog.Logger) error {
+	lines := []string{}
+	start := time.Now()


IMO if we're adding a time-limit (or deadline) we should probably consider using a context.WithTimeout.

Also, passing a ctx here gives us the opportunity to cancel early (if a force shutdown or whatever).

Could do something like:

func (c *Cleanup) Run(ctx context.Context) error {
	// Could either assume the caller has set a deadline or attach one ourselves, i.e.
	// ctx, cancel := context.WithTimeout(ctx, time.Second * 30)
	// defer cancel()

	// loop over all fns...
	select {
	case <-ctx.Done():
		// optionally, if you want the cleanup exceeded error, check whether the error is
		// context.DeadlineExceeded and return that formatted error.
		// (the deadline could be extracted via ctx.Deadline())
		// fmt.Errorf("cleanup exceeded time limit of %d seconds", ctx.)
		return ctx.Err()
	default:
		if err := fn.Fn(); err != nil {
			return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
		}
	}
}

I'm not super sure about this, because we don't want to cancel early. We just want to know when the cleanup took a long time to complete, so that we can look at the logs and figure out which service was taking the extra time. Using context.WithTimeout and returning an error before all the cleanups finish would lose the information about which cleanup was slow.
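To make the trade-off in this thread concrete, here is a runnable sketch of the context-based variant. It is an illustration under assumptions, not the PR's code: `step`, `run`, `timedOut`, and the "queue" step are hypothetical names. The context is checked only between steps, so a step that is already running is never interrupted, and the steps after the deadline are skipped, which is the loss of information the reply is worried about.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// step is a hypothetical named teardown function.
type step struct {
	name string
	fn   func() error
}

// run executes the steps in order, checking the context before each one.
// A step that is already running is not interrupted.
func run(ctx context.Context, steps []step) error {
	for _, s := range steps {
		select {
		case <-ctx.Done():
			return fmt.Errorf("cleanup stopped before %s: %w", s.name, ctx.Err())
		default:
		}
		if err := s.fn(); err != nil {
			return fmt.Errorf("could not teardown %s: %w", s.name, err)
		}
	}
	return nil
}

// timedOut runs two steps, the first sleeping for sleepMs milliseconds, under
// a timeout of timeoutMs milliseconds, and reports whether the deadline hit.
func timedOut(timeoutMs, sleepMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	steps := []step{
		{name: "database", fn: func() error {
			time.Sleep(time.Duration(sleepMs) * time.Millisecond)
			return nil
		}},
		{name: "queue", fn: func() error { return nil }},
	}
	return errors.Is(run(ctx, steps), context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(5, 50))   // deadline passes while "database" runs
	fmt.Println(timedOut(1000, 0)) // everything finishes in time
}
```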

gregfurman · 2026-03-13T16:34:53Z

pkg/cleanup/cleanup.go

+		if err := fn.Fn(); err != nil {
+			return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
+		}
+		lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns)))


This makes sense! However... I'm concerned that we actually could want to see these logs happening live on shutdown.

Like, what if a service is actually taking a while to shut down but doesn't cause an error? Could be useful to optionally see that info.

Perhaps one of the other devs can weigh in but IMO it's nicer for devex to optionally see what is happening on startup/shutdown.

So in terms of seeing the logs live on shutdown: the main reason I'm not just dumping them to INFO is that this is intended for cloud deployments where the log level is set to WARN. I could make it log to both INFO and WARN, but that would be confusing for someone who is self-hosting with the log level set to INFO, because they would get a bunch of repeated logs.

gregfurman · 2026-03-13T16:36:16Z

pkg/cleanup/cleanup.go

+	})
+}
+
+func (c *Cleanup) Run(l *zerolog.Logger) error {


nit: Thanks! Wonder if it could be more idiomatic to make this an attribute of the Cleanup struct that we set on New(l *zerolog.Logger)? 👀
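A rough sketch of that shape, with the logger injected once through the constructor instead of being passed to every `Run` call. To stay runnable with the standard library only, `*log.Logger` stands in for `*zerolog.Logger`; the field names, `namedFn`, and `demoOutput` are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// namedFn is a hypothetical named teardown function.
type namedFn struct {
	name string
	fn   func() error
}

// Cleanup keeps its logger as a field that is set once in New.
type Cleanup struct {
	logger *log.Logger // stand-in for *zerolog.Logger
	fns    []namedFn
}

func New(l *log.Logger) *Cleanup {
	return &Cleanup{logger: l}
}

func (c *Cleanup) Add(fn func() error, name string) {
	c.fns = append(c.fns, namedFn{name: name, fn: fn})
}

// Run no longer needs a logger parameter.
func (c *Cleanup) Run() error {
	for _, f := range c.fns {
		c.logger.Printf("tearing down %s", f.name)
		if err := f.fn(); err != nil {
			return fmt.Errorf("could not teardown %s: %w", f.name, err)
		}
	}
	return nil
}

// demoOutput runs one no-op teardown and returns what the logger wrote.
func demoOutput() string {
	var buf bytes.Buffer
	c := New(log.New(&buf, "cleanup: ", 0))
	c.Add(func() error { return nil }, "database")
	if err := c.Run(); err != nil {
		return err.Error()
	}
	return buf.String()
}

func main() {
	fmt.Print(demoOutput())
}
```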

abelanger5 · 2026-03-14T14:48:34Z

pkg/cleanup/cleanup.go

+		for _, line := range lines {
+			c.logger.Warn().Msg(line)
+		}
+		return fmt.Errorf("cleanup exceeded time limit of %d seconds", c.TimeLimit)


nit: since TimeLimit is a time.Duration, which has a String() method, I think this can just be:

return fmt.Errorf("cleanup exceeded time limit of %s", c.TimeLimit)
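The difference matters because `time.Duration` is an integer count of nanoseconds, so `%d` prints the raw number rather than seconds. A small sketch (`limitMessages` is a hypothetical helper used only for this comparison):

```go
package main

import (
	"fmt"
	"time"
)

// limitMessages formats the same limit with %d and with %s.
func limitMessages(seconds int) (withD, withS string) {
	limit := time.Duration(seconds) * time.Second
	// %d prints the underlying int64, which is a nanosecond count.
	withD = fmt.Sprintf("cleanup exceeded time limit of %d seconds", limit)
	// %s calls Duration.String(), which includes the unit.
	withS = fmt.Sprintf("cleanup exceeded time limit of %s", limit)
	return withD, withS
}

func main() {
	d, s := limitMessages(30)
	fmt.Println(d) // cleanup exceeded time limit of 30000000000 seconds
	fmt.Println(s) // cleanup exceeded time limit of 30s
}
```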

abelanger5 · 2026-03-14T14:51:00Z

pkg/cleanup/cleanup.go

+		lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns)))
+	}
+	lines = append(lines, "all services have successfully gracefully exited")
+	if time.Since(start) > c.TimeLimit {


As written, I don't think this gets us the behavior we need. The problem is that if we get a SIGTERM followed by a SIGKILL after 30 seconds, we're never going to reach this line, because we'll have been killed already. I think we need to check for the time limit being exceeded asynchronously, with time.After in a goroutine, and then safely print the log lines that we do have.

Ah that makes sense. I used @gregfurman's suggestion above to make it so that it uses a context with timeout that logs with Error when the timeout is exceeded. I also changed the timeout to 10 seconds because we wait 20 seconds in run.go for the server shutdown wait period. I also removed the error return because if the deadline is exceeded the process is already dead.

abelanger5 · 2026-03-14T14:51:24Z

pkg/cleanup/cleanup.go

+	lines = append(lines, "all services have successfully gracefully exited")
+	if time.Since(start) > c.TimeLimit {
+		for _, line := range lines {
+			c.logger.Warn().Msg(line)


I think this should be Error(), as we won't see warnings unless we tail directly in our setup.

github-actions · 2026-03-14T15:27:27Z

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
              │ /tmp/old.txt │         /tmp/new.txt          │
              │    sec/op    │    sec/op     vs base         │
RateLimiter-8   51.35µ ± 14%   53.05µ ± 18%  ~ (p=0.394 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.613 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (69951ac)

juliusgeo added 2 commits March 12, 2026 12:05

f597996 initial commit
bf05890 fix logs

gregfurman reviewed Mar 13, 2026

e92c6ad make warn, fix
ad3c0be add buffering

juliusgeo requested a review from abelanger5 March 13, 2026 16:05

gregfurman reviewed Mar 13, 2026

3d174fc make logger part of cleanup struct

abelanger5 reviewed Mar 14, 2026

53328cd change it so deadline collection starts in goroutine
