Add cleanup module to handle graceful shutdown, improve logging experience#3260
Add cleanup module to handle graceful shutdown, improve logging experience#3260
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
pkg/cleanup/cleanup.go
Outdated
| } | ||
|
|
||
| func (c *Cleanup) Run() error { | ||
| log.Printf("waiting for all other services to gracefully exit...") |
There was a problem hiding this comment.
Should we rather be passing that zerologger and using it for logging since (afaik) this is logging to stdout?
hatchet/cmd/hatchet-engine/engine/run.go
Line 95 in bf05890
There was a problem hiding this comment.
Fixed it so that it uses warn level log statements. Made it so that we only actually log the warnings if the total time limit is exceeded for cleanup.
| if err := fn.Fn(); err != nil { | ||
| return fmt.Errorf("could not teardown %s: %w", fn.Name, err) | ||
| } |
There was a problem hiding this comment.
q: What happens if we early exit without shutting down all processes? Is there a chance shutdown could hang?
There was a problem hiding this comment.
So for this change, I'm trying to preserve the same behavior we had before: https://github.com/hatchet-dev/hatchet/pull/3260/changes/BASE..bf05890f0ba6f88de0c1e861a7ba0323ae7d2406#diff-a474308cfad07b710f145423c053e3cca49384ea95cb1144e3e18b4509808ba5L117
which also errors out when cleanup could not be completed. I don't think it would hang, though, it would just shutdown without gracefully cleaning up. Which if there was an error, is already the case.
cmd/hatchet-engine/engine/run.go
Outdated
| cleanup.Add(func() error { | ||
| return server.Disconnect() | ||
| }, "database") |
There was a problem hiding this comment.
nit: why not pass the function as a param instead?
| cleanup.Add(func() error { | |
| return server.Disconnect() | |
| }, "database") | |
| cleanup.Add(server.Disconnect, "database") |
There was a problem hiding this comment.
Good catch, fixed.
pkg/cleanup/cleanup.go
Outdated
|
|
||
| func (c *Cleanup) Run(l *zerolog.Logger) error { | ||
| lines := []string{} | ||
| start := time.Now() |
There was a problem hiding this comment.
IMO if we're adding a time-limit (or deadline) we should probably consider using a context.WithTimeout.
Also, passing a ctx here gives us the opportunity to cancel early (if a force shutdown or whatever).
Could do something like:
func (c *Cleanup) Run(ctx context.Context) error {
// Could either assume the caller has set a deadline or attach one ourselves i.e
// ctx, cancel = context.WithTimeout(ctx, time.Second * 30)
// defer cancel()
// loop over all fns...
select {
case <-ctx.Done:
// optionally, if you want the cleanup exceeded error, check whether the error is context.DeadlineExceeded and return that formatted error.
// (the deadline could be extracted via ctx.Deadline())
// fmt.Errorf("cleanup exceeded time limit of %d seconds", ctx.)
return ctx.Err()
default:
if err := fn.Fn(); err != nil {
return fmt.Errorf("could not teardown %s: %w", fn.Name, err)
}
}
}There was a problem hiding this comment.
So I'm not super sure about this because we don't want to cancel early, we just want to know when the cleanup took a long time to complete, so then we can look at the logs to figure out what service was taking the extra time. So using context.WithTimeout and then returning an error prior to all the cleanups finishing would lose info about which cleanup was taking that extra time.
| if err := fn.Fn(); err != nil { | ||
| return fmt.Errorf("could not teardown %s: %w", fn.Name, err) | ||
| } | ||
| lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns))) |
There was a problem hiding this comment.
This makes sense! However... I'm concerned that we actually could want to see these logs happening live on shutdown.
Like, what if a service is actually taking a while to shutdown but doesn't cause an error? Could be useful to optionally see that info.
Perhaps one of the other devs can weigh in but IMO it's nicer for devex to optionally see what is happening on startup/shutdown.
There was a problem hiding this comment.
So in terms of seeing the logs live on shutdown--the main reason I'm not just dumping them to INFO is that this is intended for cloud deployments where the log level is set to WARN. I could make it so that it both logs to info and warn, but that would be confusing in the case someone is self hosting and has the log level set to INFO--they would get a bunch of repeated logs.
pkg/cleanup/cleanup.go
Outdated
| }) | ||
| } | ||
|
|
||
| func (c *Cleanup) Run(l *zerolog.Logger) error { |
There was a problem hiding this comment.
nit: Thanks! Wonder if it could it be more idiomatic to make this an attribute of the Cleanup struct that we set on New(l *zerolog.Logger)? 👀
pkg/cleanup/cleanup.go
Outdated
| for _, line := range lines { | ||
| c.logger.Warn().Msg(line) | ||
| } | ||
| return fmt.Errorf("cleanup exceeded time limit of %d seconds", c.TimeLimit) |
There was a problem hiding this comment.
nit: since TimeLimit is a duration with a String() arg I think this can just be:
return fmt.Errorf("cleanup exceeded time limit of %s", c.TimeLimit)
pkg/cleanup/cleanup.go
Outdated
| lines = append(lines, fmt.Sprintf("successfully shutdown %s in %s (%d/%d)\n", fn.Name, time.Since(before), i+1, len(c.Fns))) | ||
| } | ||
| lines = append(lines, "all services have successfully gracefully exited") | ||
| if time.Since(start) > c.TimeLimit { |
There was a problem hiding this comment.
As-written, I don't think this gets us the behavior we need? The problem is that if we get a SIGTERM followed by a SIGKILL after 30 seconds, we're never going to reach this line, we'll have been killed already. I think we need to be checking the time limit exceeding async with time.After in a goroutine, and then safely printing the log lines that we do have
There was a problem hiding this comment.
Ah that makes sense. I used @gregfurman's suggestion above to make it so that it uses a context with timeout that logs with Error when the timeout is exceeded. I also changed the timeout to 10 seconds because we wait 20 seconds in run.go for the server shutdown wait period. I also removed the error return because if the deadline is exceeded the process is already dead.
pkg/cleanup/cleanup.go
Outdated
| lines = append(lines, "all services have successfully gracefully exited") | ||
| if time.Since(start) > c.TimeLimit { | ||
| for _, line := range lines { | ||
| c.logger.Warn().Msg(line) |
There was a problem hiding this comment.
I think this should be Error() as we won't see warnings unless we tail directly in our setup
Benchmark resultsCompared against |
Description
Fixes # (issue)
Type of change
What's Changed