Add worker heartbeat support#2148

Closed
yuandrew wants to merge 23 commits into temporalio:master from yuandrew:worker-heartbeat

Conversation

Contributor

@yuandrew yuandrew commented Jan 14, 2026

What was changed

Key changes:

  • Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
  • Add internal_worker_heartbeat.go for heartbeat worker management
  • 💥 Removed the contrib/resourcetuner lib. The resource tuner portion has moved into the SDK and is exposed through the worker package, and a new contrib/sysinfo package was created to handle system resource measurement
    • We never released a v1.0.0 of contrib/resourcetuner, so this should be okay.
  • Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
  • Always send the ShutdownWorker RPC to give a final heartbeat, even if not using sticky queues
  • Move the verifyNamespaceExist calls out of each specific worker; call it once at the AggregatedWorker level instead and cache the result for worker heartbeating

Why?

New feature!

Checklist

  1. Closes Worker Heartbeating #2094

  2. How was this tested:

Added a bunch of tests to worker_heartbeat_test.go

  • Add integration tests for worker heartbeat functionality
  3. Any docs updates needed?

Note

High Risk
Touches core worker lifecycle, RPCs, and metrics plumbing, adding new background goroutines and shutdown behavior that could affect worker stability and server load if misconfigured.

Overview
Adds worker heartbeat support: the SDK now periodically reports per-worker slot/poller stats, sticky cache counters, plugin names, and host CPU/memory usage to the server via RecordWorkerHeartbeat, and sends a final ShutdownWorker RPC (with optional sticky queue) on worker stop.

Introduces a shared per-namespace heartbeat manager and in-memory metric capture (heartbeatMetricsHandler), plumbs poll-success timestamps from workflow/activity/nexus pollers, and adds client.Options.WorkerHeartbeatInterval with capability gating via DescribeNamespace.

Refactors/migrates resource tuning: removes contrib/resourcetuner by moving the resource-based tuner/controller and PID logic into internal and re-exporting from worker, requiring a SysInfoProvider; adds new contrib/sysinfo (gopsutil + Linux cgroups) as the recommended provider and updates integration/tests accordingly (including new worker_heartbeat_test.go).

Written by Cursor Bugbot for commit da40521.

Implement worker heartbeats that report worker status, slot usage, and metrics
to the server via the WorkerHeartbeat RPC.

Key changes:
- Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
- Add internal_worker_heartbeat.go for heartbeat worker management
- Add hostmetrics package for CPU/memory usage reporting
- Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
- Add integration tests for worker heartbeat functionality
- Fix nil pointer in shutdownWorker when heartbeating is disabled
@yuandrew yuandrew requested a review from a team as a code owner January 14, 2026 19:54
s.service.EXPECT().DescribeNamespace(gomock.Any(), gomock.Any(), gomock.Any()).Return(nil, nil)
s.service.EXPECT().PollWorkflowTaskQueue(gomock.Any(), gomock.Any(), gomock.Any()).Return(&workflowservice.PollWorkflowTaskQueueResponse{}, nil).AnyTimes()
s.service.EXPECT().RespondWorkflowTaskCompleted(gomock.Any(), gomock.Any(), gomock.Any()).Return(nil, nil).AnyTimes()
s.service.EXPECT().ShutdownWorker(gomock.Any(), gomock.Any(), gomock.Any()).Return(&workflowservice.ShutdownWorkerResponse{}, nil).Times(1)
Member

These ones that went away seem wrong. We should still be calling shutdown worker in cases where we were previously

Contributor Author

The level at which ShutdownWorker is called went up a level, so a test that only shuts down a workflow worker no longer makes the gRPC request; it's now made at the AggregatedWorker level

Contributor

@cretz cretz left a comment

Did an early pass on general structure

// RecordPollSuccess records a successful poll time if the handler supports it.
// pollerType should be one of PollerTypeWorkflowTask, PollerTypeWorkflowStickyTask,
// PollerTypeActivityTask, or PollerTypeNexusTask.
func RecordPollSuccess(h Handler, pollerType string) {
Contributor

I wonder if this should be divorced from metrics handler. Specifically, I wonder if the heartbeat metrics stuff has a handler for some metrics, and then *HeartbeatMetricsHandler is set on pollers specifically just for recording poll success. If you must keep this, why even have the interface instead of just type asserting that it is a *HeartbeatMetricsHandler?

Contributor Author

Is there a specific concern with divorcing from metrics handler? Today, heartbeat metrics should be a transparent layer, so passing in a specific *HeartbeatMetricsHandler doesn't feel too different from just using the metrics handler itself, except maybe for a little clarity, at the cost of passing an object (*HeartbeatMetricsHandler) that we already have access to from wtp.metricsHandler.

Thanks for the callout on the interface being unnecessary

Contributor

The main struggle for me is that if you hide everything behind metrics handler interface as if you were a user, it's hard for code readers in the SDK that may make future changes to recognize metrics handler is not just about user metrics, it's leveraged for internal utilities too. But not a big deal to leave as is.

Contributor Author

I still lean towards having this method, primarily for code simplicity's sake. I've renamed it to recordPollSuccessIfHeartbeat; does that help make it clearer that we're doing this for heartbeat, not user metrics?

Contributor

@cretz cretz Jan 29, 2026

I think the name change goes wrong in the other direction. We should be recording poll successes regardless of metrics handler types. What I was trying to say is that heartbeats don't need to use metrics handlers for this data since workers already control where it is called. Extracting poll data out of a metrics handler adds a bunch of confusing logic when you could just call something here that records it more specifically for that use case (and leave metrics handler alone to record it as it always has, no need to intercept).

But if you still want to extract poll metrics from the metrics handler abstraction, ok, but definitely no need to change the name or only do it if it's a certain metrics handler type.

Contributor Author

Created a separate pollTimeTracker and plumbed through separately
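
For readers following the thread, here is a minimal sketch of what a tracker kept separate from the metrics handler might look like. Only the name pollTimeTracker comes from the comment above; the fields, the method names, and the nil-receiver behavior are assumptions made for illustration, not the SDK's actual code.

```go
package main

import (
	"sync"
	"time"
)

// pollTimeTracker records the most recent successful poll per poller type.
// Hypothetical shape: only the type name appears in the PR discussion.
type pollTimeTracker struct {
	mu       sync.RWMutex
	lastPoll map[string]time.Time
}

func newPollTimeTracker() *pollTimeTracker {
	return &pollTimeTracker{lastPoll: make(map[string]time.Time)}
}

// recordPollSuccess stores the current time for pollerType. It is a no-op on
// a nil tracker, so pollers can call it whether or not heartbeating is on.
func (t *pollTimeTracker) recordPollSuccess(pollerType string) {
	if t == nil {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastPoll[pollerType] = time.Now()
}

// lastSuccess returns the last recorded poll time for pollerType and whether
// any poll has been recorded for it yet.
func (t *pollTimeTracker) lastSuccess(pollerType string) (time.Time, bool) {
	if t == nil {
		return time.Time{}, false
	}
	t.mu.RLock()
	defer t.mu.RUnlock()
	at, ok := t.lastPoll[pollerType]
	return at, ok
}
```

With this shape the poller calls recordPollSuccess directly at the point where a poll returns a task, and the user-facing metrics handler keeps recording metrics exactly as before.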

Comment on lines +98 to +108
// Track the worker type if present in tags
workerType := h.workerType
if wt, ok := tags[WorkerTypeTagName]; ok {
workerType = wt
}

// Track the poller type if present in tags
pollerType := h.pollerType
if pt, ok := tags[PollerTypeTagName]; ok {
pollerType = pt
}
Contributor

I wonder if these two values should not be extracted from WithTags but be done more explicitly when they are set which is only once per worker/poller. Not a big deal, but it may clean up some of this abstraction if the worker/poller type specific metrics were in their own struct maybe? And/or a worker or poller type was required to get a handler out here? I can fashion some ideas if needed...

Contributor Author

Added an explicit forWorker and forPoller and made WithTags a passthrough

Contributor

I didn't mean double up calls on the caller side, I meant you can do what forWorker and forPoller do inside of WithTags (i.e. store more explicit state) and/or make these forWorker and forPoller calls do the WithTags themselves. It doesn't make sense to have two always-consecutive calls both independently copy this thing.

Contributor Author

Oops, I had an additional WithTags call for forWorker. I feel better about this version, where we always call WithTags and, if heartbeating, we additionally call forWorker/forPoller beforehand.

I don't want to change the function signature of WithTags, and I'm not sure how we can be more explicit with WithTags without doing so. And I feel like putting the WithTags call inside of forWorker makes it less clear to the caller that it's still calling WithTags

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Contributor

@cretz cretz left a comment

Looks great, almost all of my stuff now is pedantic

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

err := p.updateCGroupStats()
// Stop updates if not in a container. No need to return the error and log it.
-if !errors.Is(err, fs.ErrNotExist) {
+if errors.Is(err, fs.ErrNotExist) {
Contributor Author

This check seems inverted, I switched it around

Contributor

@cretz cretz Feb 4, 2026

That this wasn't caught before makes me concerned this logic isn't being tested properly (but doesn't have to be part of this PR)

Contributor Author

Let's do this in a separate PR, left myself a note to add a test for this
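
As a rough idea of what that follow-up test could cover, the sketch below isolates the corrected condition in a small predicate and checks it table-style. The function names stopCGroupUpdates and checkStopCGroupUpdates, and the cgroup path in the test data, are hypothetical; only the errors.Is(err, fs.ErrNotExist) check itself comes from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// stopCGroupUpdates reports whether cgroup polling should stop because the
// cgroup files are missing, which is what happens outside a container.
// Hypothetical helper wrapping the condition from the diff.
func stopCGroupUpdates(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

// checkStopCGroupUpdates runs table-driven cases; in a real change these
// would live in a _test.go file and use the testing package.
func checkStopCGroupUpdates() error {
	cases := []struct {
		name string
		err  error
		want bool
	}{
		{"nil error keeps updating", nil, false},
		{"missing cgroup file stops updates",
			&fs.PathError{Op: "open", Path: "/sys/fs/cgroup/memory.max", Err: fs.ErrNotExist}, true},
		{"wrapped missing file stops updates",
			fmt.Errorf("read cgroup stats: %w", fs.ErrNotExist), true},
		{"permission error keeps updating", fs.ErrPermission, false},
	}
	for _, c := range cases {
		if got := stopCGroupUpdates(c.err); got != c.want {
			return fmt.Errorf("%s: got %v, want %v", c.name, got, c.want)
		}
	}
	return nil
}
```

The inverted form (with the leading `!`) would fail the nil and permission cases as well as both missing-file cases, so even this small table would have caught the original bug.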

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Contributor

@cretz cretz left a comment

Sorry, few more things

// WorkerHeartbeatInterval is the interval at which the worker will send heartbeats to the server.
// Interval must be between 1s and 60s, inclusive.
//
// default: 60s. To disable, set to 0.
Contributor

Pedantic, but would maybe recommend "negative to disable" and remove the pointer and assume 0/unset means use default. Not a big deal though.

Contributor Author

I remember consulting Claude on this design decision, and it seemed like this was the more idiomatic choice for Go:

  1. HashiCorp consul-template is the most extensive example. Nearly every duration config is *time.Duration with a Finalize() method that fills defaults for nil. They even have a TimeDurationPresent() helper that returns true only if non-nil AND non-zero, explicitly treating zero as a distinct state from nil.

https://pkg.go.dev/github.com/hashicorp/consul-template/config
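
To make the nil-versus-zero distinction concrete, here is a minimal sketch of how a pointer-typed interval could be resolved using the rules in the doc comment (unset means the 60s default, 0 disables, otherwise 1s to 60s inclusive). The function resolveHeartbeatInterval, the helper durationPtr, and the constant names are hypothetical, and returning an error for out-of-range values (rather than clamping) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultHeartbeatInterval = 60 * time.Second
	minHeartbeatInterval     = 1 * time.Second
	maxHeartbeatInterval     = 60 * time.Second
)

// durationPtr is a small helper for building option values.
func durationPtr(d time.Duration) *time.Duration { return &d }

// resolveHeartbeatInterval maps the user-supplied option to an effective
// interval. nil means "not set" (use the default), zero means "disabled",
// and any other value must lie within [1s, 60s].
func resolveHeartbeatInterval(configured *time.Duration) (time.Duration, bool, error) {
	if configured == nil {
		return defaultHeartbeatInterval, true, nil
	}
	if *configured == 0 {
		return 0, false, nil
	}
	if *configured < minHeartbeatInterval || *configured > maxHeartbeatInterval {
		return 0, false, fmt.Errorf("heartbeat interval %v outside [%v, %v]",
			*configured, minHeartbeatInterval, maxHeartbeatInterval)
	}
	return *configured, true, nil
}
```

The alternative suggested in the review (a plain time.Duration where 0 means default and a negative value disables) avoids the pointer but gives up a zero value that literally means "off".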

options.ConnectionOptions.GetSystemInfoTimeout = defaultGetSystemInfoTimeout
}

if options.Logger == nil {
Contributor

Can you help me understand this change? All user-facing code that calls NewServiceClient should already be setting this default. Is there a new code path that calls NewServiceClient?

Contributor Author

No new code path, but sessionEnvironmentImpl.SignalCreationResponse calls GetClient without a logger. This seems like it would fix that missing scenario, but also seems like good practice to have in general?

Contributor

@cretz cretz Feb 6, 2026

> this seems like it would fix that missing scenario, but also seems like good practice to have in general?

They may have left logger off intentionally, unsure, but would take some more digging. It's possible/likely you're right, but it seems unrelated to this project and may deserve a separate issue.

Contributor Author

yeah that's fair, I'll remove this for now

Contributor Author

turns out this was due to newHeartbeatManager using the logger, but I added the check into that function itself, so SignalCreationResponse is unaffected

@@ -1468,9 +1479,67 @@ func (aw *AggregatedWorker) Stop() {
WorkerInstanceKey: aw.workerInstanceKey,
Contributor

Arguably we should go ahead and update API dependency and set this on the poll and shutdown calls, but we can make an issue to do that in successive PR if we want

Contributor Author

would prefer to do this in a separate PR

Comment on lines +73 to +76
var workflowClient *WorkflowClient
if wc, ok := opts.client.(*WorkflowClient); ok {
workflowClient = wc
}
Contributor

Should we ever accept a situation where client is not this? But in general, why does nexusWorker need a client? Loading capabilities seems like something the outer/aggregate worker would do.

Contributor Author

Agreed, we don't want this requirement on WorkerClient. I removed this from all of the individual workers; AggregatedWorker makes this call instead

// SystemInfoContext provides context for SystemInfoSupplier calls.
//
// Exposed as: [go.temporal.io/sdk/worker.SystemInfoContext]
type SystemInfoContext struct {
Contributor

Consider embedding context.Context in here even if just context.Background() (but not that important)

Contributor Author

keeping out for now, localActivityContext, NexusOperationContext and testContext are all *Context structs that are missing context.Context, if we want to add it in later to use, we can

Contributor

@cretz cretz Feb 6, 2026

Then IMO we should consider passing a Go context into the calls alongside where this is passed (even if it's not used for anything yet)

Contributor Author

I'm planning on leaving it out for now, and adding in the context into the SysInfoContext struct when we need the context. It seems a little redundant to pass in a Context and SysInfoContext together, any reason we'd need to add it into the calls alongside right now?

…eTracker more idiomatic, move namespace capabilities to AW, remove Get from func names
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

hw.callbacksMutex.Unlock()

return nil
}

Race condition between register and unregister in heartbeat manager

Medium Severity

registerWorker releases workersMutex (after getOrCreateSharedNamespaceWorker returns) before adding the callback to the sharedNamespaceWorker. A concurrent unregisterWorker call for a different worker on the same namespace can acquire workersMutex, find zero remaining callbacks, stop the sharedNamespaceWorker, and remove it from the map — all between the mutex release and the callback addition. The registering worker's callback is then silently added to a stopped worker whose goroutine has already exited, so its heartbeats are never sent.
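
One way to close a window like this is to keep the manager's mutex held across both the lookup-or-create step and the callback insertion, and to take the same mutex in unregister. The sketch below is a simplified model under that assumption: the type and method names echo the finding, but the fields, the single-mutex design, and the isHeartbeating helper are hypothetical and omit the real goroutine and RPC handling.

```go
package main

import "sync"

// sharedNamespaceWorker stands in for the per-namespace heartbeat sender.
type sharedNamespaceWorker struct {
	callbacks map[string]func()
	stopped   bool
}

// heartbeatManager guards the namespace map and every callback set with one
// mutex, so a worker is never stopped between lookup and registration.
type heartbeatManager struct {
	mu      sync.Mutex
	workers map[string]*sharedNamespaceWorker
}

func newHeartbeatManager() *heartbeatManager {
	return &heartbeatManager{workers: make(map[string]*sharedNamespaceWorker)}
}

// registerWorker finds or creates the shared worker and adds the callback
// without releasing the lock in between.
func (m *heartbeatManager) registerWorker(namespace, workerKey string, cb func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	w, ok := m.workers[namespace]
	if !ok {
		w = &sharedNamespaceWorker{callbacks: make(map[string]func())}
		m.workers[namespace] = w
	}
	w.callbacks[workerKey] = cb
}

// unregisterWorker removes the callback and retires the shared worker only
// when no callbacks remain, all under the same lock.
func (m *heartbeatManager) unregisterWorker(namespace, workerKey string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	w, ok := m.workers[namespace]
	if !ok {
		return
	}
	delete(w.callbacks, workerKey)
	if len(w.callbacks) == 0 {
		w.stopped = true // a real implementation would also stop the goroutine
		delete(m.workers, namespace)
	}
}

// isHeartbeating reports whether workerKey is registered on a live worker.
func (m *heartbeatManager) isHeartbeating(namespace, workerKey string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	w, ok := m.workers[namespace]
	if !ok || w.stopped {
		return false
	}
	_, registered := w.callbacks[workerKey]
	return registered
}
```

The trade-off is coarser locking: registration and unregistration for different namespaces serialize on one mutex, which is usually acceptable because these calls happen only at worker start and stop.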

TotalFailedTasks: int32(totalFailed),
LastIntervalProcessedTasks: int32(intervalProcessed),
LastIntervalFailureTasks: int32(intervalFailed),
}

Unchecked int64-to-int32 truncation may wrap task counters

Low Severity

buildSlotsInfo casts totalProcessed and totalFailed (int64 values from atomic.Int64) directly to int32 without bounds checking. A long-running worker that processes more than ~2.1 billion tasks will silently overflow, producing incorrect negative values in TotalProcessedTasks, TotalFailedTasks, and the LastInterval* fields. Clamping to math.MaxInt32 before the cast would prevent this.
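
The clamping the finding suggests could look roughly like the helper below. The name clampToInt32 is hypothetical, and handling the negative side as well is an extra precaution, since the counters described here should never go below zero.

```go
package main

import "math"

// clampToInt32 converts an int64 counter to int32, saturating at the int32
// bounds instead of wrapping around on overflow.
func clampToInt32(v int64) int32 {
	if v > math.MaxInt32 {
		return math.MaxInt32
	}
	if v < math.MinInt32 {
		return math.MinInt32
	}
	return int32(v)
}
```

A call site would then read `TotalFailedTasks: clampToInt32(totalFailed)`, so a long-running worker reports a saturated maximum rather than a negative count.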

Development

Successfully merging this pull request may close these issues.

Worker Heartbeating

3 participants