
Add cleanup module to handle graceful shutdown, improve logging experience#3260

Open
juliusgeo wants to merge 6 commits into main from cleanup_module

Conversation

@juliusgeo
Contributor

Description

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation change (pure documentation change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking changes to code which doesn't change any behaviour)
  • CI (any automation pipeline changes)
  • Chore (changes which are not directly related to any business logic)
  • Test changes (add, refactor, improve or change a test)
  • This change requires a documentation update

What's Changed

  • Add a list of tasks or features here...

@vercel

vercel bot commented Mar 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: hatchet-docs · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Mar 14, 2026 3:29pm


}

func (c *Cleanup) Run() error {
log.Printf("waiting for all other services to gracefully exit...")
Collaborator

@gregfurman gregfurman Mar 13, 2026


Should we rather be passing that zerolog logger and using it for logging, since (afaik) this is logging to stdout?

var l = server.Logger

Contributor Author

@juliusgeo juliusgeo Mar 13, 2026


Fixed it so that it uses warn-level log statements, and made it so that we only log the warnings if the total time limit for cleanup is exceeded.

Comment on lines +36 to +38
if err := fn.Fn(); err != nil {
return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
}
Collaborator


q: What happens if we early exit without shutting down all processes? Is there a chance shutdown could hang?

Contributor Author


So for this change, I'm trying to preserve the same behavior we had before: https://github.com/hatchet-dev/hatchet/pull/3260/changes/BASE..bf05890f0ba6f88de0c1e861a7ba0323ae7d2406#diff-a474308cfad07b710f145423c053e3cca49384ea95cb1144e3e18b4509808ba5L117
which also errors out when cleanup could not be completed. I don't think it would hang, though; it would just shut down without gracefully cleaning up. Which, if there was an error, is already the case.

Comment on lines +99 to +101
cleanup.Add(func() error {
return server.Disconnect()
}, "database")
Collaborator


nit: why not pass the function as a param instead?

Suggested change
cleanup.Add(func() error {
return server.Disconnect()
}, "database")
cleanup.Add(server.Disconnect, "database")

Contributor Author


Good catch, fixed.
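For anyone following along: the suggested change works because a method value like `server.Disconnect` captures its receiver and already has type `func() error`, so no wrapping closure is needed. A small standalone sketch (the `Server` and `Add` here are illustrative, not the PR's real types):

```go
package main

import "fmt"

type Server struct{ connected bool }

func (s *Server) Disconnect() error {
	s.connected = false
	return nil
}

// Add accepts any func() error, so a method value bound to a receiver
// can be passed directly instead of wrapping it in a closure.
func Add(fn func() error, name string) error {
	fmt.Println("registered cleanup for", name)
	return fn()
}

func main() {
	s := &Server{connected: true}
	// Equivalent to: Add(func() error { return s.Disconnect() }, "database")
	_ = Add(s.Disconnect, "database")
	fmt.Println(s.connected) // false: the bound receiver was captured
}
```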


func (c *Cleanup) Run(l *zerolog.Logger) error {
lines := []string{}
start := time.Now()
Collaborator


IMO if we're adding a time-limit (or deadline) we should probably consider using a context.WithTimeout.

Also, passing a ctx here gives us the opportunity to cancel early (if a force shutdown or whatever).

Could do something like:

func (c *Cleanup) Run(ctx context.Context) error {

	// Could either assume the caller has set a deadline or attach one ourselves, i.e.
	// ctx, cancel := context.WithTimeout(ctx, time.Second*30)
	// defer cancel()

	// loop over all fns...

	select {
	case <-ctx.Done():
		// Optionally, if you want the "cleanup exceeded" error, check whether the
		// error is context.DeadlineExceeded and return that formatted error
		// (the deadline could be extracted via ctx.Deadline()).
		return ctx.Err()
	default:
		if err := fn.Fn(); err != nil {
			return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
		}
	}

}

Contributor Author


So I'm not super sure about this, because we don't want to cancel early; we just want to know when the cleanup took a long time to complete, so we can look at the logs and figure out which service was taking the extra time. Using context.WithTimeout and returning an error before all the cleanups finish would lose the information about which cleanup was taking that extra time.

if err := fn.Fn(); err != nil {
return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
}
lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns)))
Collaborator


This makes sense! However... I'm concerned that we actually could want to see these logs happening live on shutdown.

Like, what if a service is actually taking a while to shutdown but doesn't cause an error? Could be useful to optionally see that info.

Perhaps one of the other devs can weigh in but IMO it's nicer for devex to optionally see what is happening on startup/shutdown.

Contributor Author


So, in terms of seeing the logs live on shutdown: the main reason I'm not just dumping them to INFO is that this is intended for cloud deployments where the log level is set to WARN. I could make it log to both info and warn, but that would be confusing if someone is self-hosting with the log level set to INFO; they would get a bunch of repeated logs.

})
}

func (c *Cleanup) Run(l *zerolog.Logger) error {
Collaborator


nit: Thanks! Wonder if it could be more idiomatic to make this an attribute of the Cleanup struct that we set in New(l *zerolog.Logger)? 👀

for _, line := range lines {
c.logger.Warn().Msg(line)
}
return fmt.Errorf("cleanup exceeded time limit of %d seconds", c.TimeLimit)
Contributor


nit: since TimeLimit is a duration with a String() method, I think this can just be:

return fmt.Errorf("cleanup exceeded time limit of %s", c.TimeLimit)

lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns)))
}
lines = append(lines, "all services have successfully gracefully exited")
if time.Since(start) > c.TimeLimit {
Contributor


As-written, I don't think this gets us the behavior we need. The problem is that if we get a SIGTERM followed by a SIGKILL after 30 seconds, we're never going to reach this line; we'll have been killed already. I think we need to check for the time limit being exceeded asynchronously, with time.After in a goroutine, and then safely print the log lines that we do have.

Contributor Author

@juliusgeo juliusgeo Mar 14, 2026


Ah that makes sense. I used @gregfurman's suggestion above to make it so that it uses a context with timeout that logs with Error when the timeout is exceeded. I also changed the timeout to 10 seconds because we wait 20 seconds in run.go for the server shutdown wait period. I also removed the error return because if the deadline is exceeded the process is already dead.

lines = append(lines, "all services have successfully gracefully exited")
if time.Since(start) > c.TimeLimit {
for _, line := range lines {
c.logger.Warn().Msg(line)
Contributor


I think this should be Error() as we won't see warnings unless we tail directly in our setup

Contributor Author


Fixed.

@github-actions
Contributor

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
              │ /tmp/old.txt │         /tmp/new.txt          │
              │    sec/op    │    sec/op     vs base         │
RateLimiter-8   51.35µ ± 14%   53.05µ ± 18%  ~ (p=0.394 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.613 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (69951ac)
