Add worker heartbeat support by yuandrew · Pull Request #2148 · temporalio/sdk-go

yuandrew · 2026-01-14T19:54:58Z

What was changed

Key changes:

- Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
- Add internal_worker_heartbeat.go for heartbeat worker management
- 💥 Removed the contrib/resourcetuner library. The resource tuner itself has moved into the SDK and is exposed through the worker package, and a new contrib/sysinfo package now handles system resource measurement.
  - We never released a v1.0.0 of contrib/resourcetuner, so this should be okay.
- Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
- Always send the ShutdownWorker RPC to deliver a final heartbeat, even when sticky queues are not in use
- Move the verifyNamespaceExist calls out of each specific worker; call it once at the AggregatedWorker level instead and cache the result for worker heartbeating

Why?

New feature!

Checklist

Closes Worker Heartbeating #2094
How was this tested:

Added a bunch of tests to worker_heartbeat_test.go

Add integration tests for worker heartbeat functionality

Any docs updates needed?

Note

High Risk
Touches core worker lifecycle, RPCs, and metrics plumbing, adding new background goroutines and shutdown behavior that could affect worker stability and server load if misconfigured.

Overview
Adds worker heartbeat support: the SDK now periodically reports per-worker slot/poller stats, sticky cache counters, plugin names, and host CPU/memory usage to the server via RecordWorkerHeartbeat, and sends a final ShutdownWorker RPC (with optional sticky queue) on worker stop.

Introduces a shared per-namespace heartbeat manager and in-memory metric capture (heartbeatMetricsHandler), plumbs poll-success timestamps from workflow/activity/nexus pollers, and adds client.Options.WorkerHeartbeatInterval with capability gating via DescribeNamespace.

Refactors/migrates resource tuning: removes contrib/resourcetuner by moving the resource-based tuner/controller and PID logic into internal and re-exporting from worker, requiring a SysInfoProvider; adds new contrib/sysinfo (gopsutil + Linux cgroups) as the recommended provider and updates integration/tests accordingly (including new worker_heartbeat_test.go).

Written by Cursor Bugbot for commit da40521. This will update automatically on new commits.

Implement worker heartbeats that report worker status, slot usage, and metrics to the server via the WorkerHeartbeat RPC.

Key changes:

- Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
- Add internal_worker_heartbeat.go for heartbeat worker management
- Add hostmetrics package for CPU/memory usage reporting
- Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
- Add integration tests for worker heartbeat functionality
- Fix nil pointer in shutdownWorker when heartbeating is disabled

internal/internal_worker.go

worker/hostmetrics/hostmetrics.go

internal/internal_worker_heartbeat.go

contrib/resourcetuner/go.mod

internal/common/metrics/heartbeat_handler.go

internal/internal_worker.go

Sushisource · 2026-01-15T17:50:46Z

internal/internal_workers_test.go

 	s.service.EXPECT().DescribeNamespace(gomock.Any(), gomock.Any(), gomock.Any()).Return(nil, nil)
 	s.service.EXPECT().PollWorkflowTaskQueue(gomock.Any(), gomock.Any(), gomock.Any()).Return(&workflowservice.PollWorkflowTaskQueueResponse{}, nil).AnyTimes()
 	s.service.EXPECT().RespondWorkflowTaskCompleted(gomock.Any(), gomock.Any(), gomock.Any()).Return(nil, nil).AnyTimes()
-	s.service.EXPECT().ShutdownWorker(gomock.Any(), gomock.Any(), gomock.Any()).Return(&workflowservice.ShutdownWorkerResponse{}, nil).Times(1)


These ones that went away seem wrong. We should still be calling shutdown worker in cases where we were previously.

The ShutdownWorker call moved up a level. This test shuts down a workflow worker, which no longer makes the gRPC request itself; the request is now made at the AggregatedWorker level.

internal/internal_worker.go

test/worker_heartbeat_test.go

cretz

Did an early pass on general structure

worker/hostmetrics/hostmetrics.go

internal/common/metrics/heartbeat_handler.go

cretz · 2026-01-15T00:02:07Z

internal/common/metrics/heartbeat_handler.go

+// RecordPollSuccess records a successful poll time if the handler supports it.
+// pollerType should be one of PollerTypeWorkflowTask, PollerTypeWorkflowStickyTask,
+// PollerTypeActivityTask, or PollerTypeNexusTask.
+func RecordPollSuccess(h Handler, pollerType string) {


I wonder if this should be divorced from metrics handler. Specifically, I wonder if the heartbeat metrics stuff has a handler for some metrics, and then *HeartbeatMetricsHandler is set on pollers specifically just for recording poll success. If you must keep this, why even have the interface instead of just type asserting that it is a *HeartbeatMetricsHandler?

Is there a specific concern with divorcing from metrics handler? Today, heartbeat metrics should be a transparent layer, so passing in a specific *HeartbeatMetricsHandler doesn't feel too different from just using the metrics handler itself, except maybe for a little more clarity, at the cost of passing an object (*HeartbeatMetricsHandler) that we already have access to from wtp.metricsHandler.

Thanks for the callout on the interface being unnecessary.

The main struggle for me is that if you hide everything behind metrics handler interface as if you were a user, it's hard for code readers in the SDK that may make future changes to recognize metrics handler is not just about user metrics, it's leveraged for internal utilities too. But not a big deal to leave as is.

I still lean towards having this method, primarily for the sake of code simplicity. I've renamed it to recordPollSuccessIfHeartbeat; does that make it clearer that we're doing this for heartbeats, not user metrics?

I think the name change goes wrong in the other direction. We should be recording poll successes regardless of metrics handler types. What I was trying to say is that heartbeats don't need to use metrics handlers for this data since workers already control where it is called. Extracting poll data out of a metrics handler adds a bunch of confusing logic when you could just call something here that records it more specifically for that use case (and leave the metrics handler alone to record it as it always has, no need to intercept).

But if you still want to extract poll metrics from the metrics handler abstraction, ok, but definitely no need to change the name or only do it if it's a certain metrics handler type.

Created a separate pollTimeTracker and plumbed it through separately.
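For readers following the thread, here is a minimal sketch of what a poll-time tracker kept separate from the metrics handler could look like. Only the name pollTimeTracker comes from the comment above; the fields, method names, and poller type strings are hypothetical and are not taken from the SDK.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pollTimeTracker records the most recent successful poll per poller type.
// Hypothetical shape for illustration only; the SDK's real type may differ.
type pollTimeTracker struct {
	mu    sync.RWMutex
	times map[string]time.Time
}

func newPollTimeTracker() *pollTimeTracker {
	return &pollTimeTracker{times: make(map[string]time.Time)}
}

// recordPollSuccess stores the current time as the last successful poll
// for the given poller type.
func (t *pollTimeTracker) recordPollSuccess(pollerType string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.times[pollerType] = time.Now()
}

// lastPollSuccess returns the last recorded poll time, or the zero time
// if no poll has succeeded yet for that poller type.
func (t *pollTimeTracker) lastPollSuccess(pollerType string) time.Time {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.times[pollerType]
}

func main() {
	tracker := newPollTimeTracker()
	tracker.recordPollSuccess("workflow_task")

	fmt.Println(tracker.lastPollSuccess("workflow_task").IsZero()) // false
	fmt.Println(tracker.lastPollSuccess("nexus_task").IsZero())    // true
}
```

With this shape the pollers call the tracker directly, and the metrics handler keeps recording user-facing metrics without any interception, which is the separation the review asked for.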

cretz · 2026-01-15T00:08:11Z

internal/common/metrics/heartbeat_handler.go

+	// Track the worker type if present in tags
+	workerType := h.workerType
+	if wt, ok := tags[WorkerTypeTagName]; ok {
+		workerType = wt
+	}
+
+	// Track the poller type if present in tags
+	pollerType := h.pollerType
+	if pt, ok := tags[PollerTypeTagName]; ok {
+		pollerType = pt
+	}


I wonder if these two values should not be extracted from WithTags but be done more explicitly when they are set which is only once per worker/poller. Not a big deal, but it may clean up some of this abstraction if the worker/poller type specific metrics were in their own struct maybe? And/or a worker or poller type was required to get a handler out here? I can fashion some ideas if needed...

Added an explicit forWorker and forPoller and made WithTags a passthrough

I didn't mean double up calls on the caller side, I meant you can do what forWorker and forPoller do inside of WithTags (i.e. store more explicit state) and/or make these forWorker and forPoller calls do the WithTags themselves. It doesn't make sense to have two always-consecutive calls both independently copy this thing.

Oops, I had an additional WithTags call for forWorker. I feel better about this version, where we always call WithTags and, if heartbeating, additionally call forWorker/forPoller beforehand.

I don't want to change the function signature of WithTags, and I'm not sure how we can be more explicit with WithTags without doing so. I also feel that putting the WithTags call inside forWorker makes it less clear to the caller that it's still calling WithTags.

internal/internal_workflow_client.go

internal/tuning.go

worker/worker.go

internal/internal_worker.go

internal/internal_workflow_client.go

go.mod

contrib/resourcetuner/go.mod

internal/tuning.go

internal/internal_worker_heartbeat.go

internal/common/metrics/heartbeat_handler.go

internal/internal_worker_heartbeat.go

internal/internal_workflow_client.go

internal/internal_workflow_client.go

contrib/sysinfo/cgroups.go

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

internal/internal_worker_heartbeat.go

test/worker_tuner_test.go

internal/internal_worker.go

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

test/worker_heartbeat_test.go

internal/internal_worker_heartbeat.go

internal/client.go

cretz

Looks great, almost all of my stuff now is pedantic

contrib/sysinfo/sysinfo.go

contrib/hostinfo/hostinfo.go

internal/internal_worker_heartbeat.go

internal/client.go

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

internal/client.go

contrib/sysinfo/go.mod

internal/internal_worker.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

test/worker_heartbeat_test.go

yuandrew · 2026-02-03T19:03:40Z

contrib/sysinfo/cgroups.go

 	err := p.updateCGroupStats()
 	// Stop updates if not in a container. No need to return the error and log it.
-	if !errors.Is(err, fs.ErrNotExist) {
+	if errors.Is(err, fs.ErrNotExist) {


This check seems inverted, I switched it around

That this wasn't caught before makes me concerned this logic isn't being tested properly (but doesn't have to be part of this PR)

Let's do this in a separate PR, left myself a note to add a test for this
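A small sketch of the corrected check, for illustration. The helper name and the sample errors are hypothetical; only the errors.Is(err, fs.ErrNotExist) comparison comes from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// Sample errors used for the demonstration below.
var (
	errMissing = fmt.Errorf("read memory.stat: %w", fs.ErrNotExist)
	errOther   = errors.New("permission denied")
)

// shouldStopCGroupUpdates reports whether err indicates that the cgroup
// files do not exist, which suggests the process is not in a container.
// Hypothetical helper; not the SDK's actual function.
func shouldStopCGroupUpdates(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	fmt.Println(shouldStopCGroupUpdates(errMissing)) // true
	fmt.Println(shouldStopCGroupUpdates(errOther))   // false
	fmt.Println(shouldStopCGroupUpdates(nil))        // false
}
```

Note that errors.Is returns false for a nil error, so the inverted form (!errors.Is(...)) would also have been true when the stats update succeeded.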

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

test/worker_heartbeat_test.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

internal/internal_workflow_client.go

cretz

Sorry, few more things

contrib/sysinfo/sysinfo.go

worker/tuning.go

cretz · 2026-02-04T15:27:26Z

internal/client.go

+		// WorkerHeartbeatInterval is the interval at which the worker will send heartbeats to the server.
+		// Interval must be between 1s and 60s, inclusive.
+		//
+		// default: 60s. To disable, set to 0.


Pedantic, but would maybe recommend "negative to disable" and remove the pointer and assume 0/unset means use default. Not a big deal though.

I remember consulting Claude on this design decision, and it seemed like this was the more idiomatic choice for Go:

HashiCorp consul-template — the most extensive example. Nearly every duration config is
*time.Duration with a Finalize() method that fills defaults for nil. They even have a
TimeDurationPresent() helper that returns true only if non-nil AND non-zero, explicitly treating zero
as a distinct state from nil.

https://pkg.go.dev/github.com/hashicorp/consul-template/config
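To make the trade-off concrete, here is a rough sketch of the pointer-based semantics described in the doc comment: nil means "use the default", zero means "disabled", and anything else must fall within 1s to 60s. The function and helper names are hypothetical, and returning an error for out-of-range values is an assumption; the SDK may handle that case differently.

```go
package main

import (
	"fmt"
	"time"
)

const defaultHeartbeatInterval = 60 * time.Second

// seconds is a small helper that returns a pointer to an n-second duration.
func seconds(n int) *time.Duration {
	d := time.Duration(n) * time.Second
	return &d
}

// resolveHeartbeatInterval interprets a *time.Duration option:
// nil selects the default, 0 disables heartbeating, and any other value
// must be within [1s, 60s]. Hypothetical helper for illustration.
func resolveHeartbeatInterval(configured *time.Duration) (time.Duration, bool, error) {
	if configured == nil {
		return defaultHeartbeatInterval, true, nil
	}
	if *configured == 0 {
		return 0, false, nil
	}
	if *configured < time.Second || *configured > 60*time.Second {
		return 0, false, fmt.Errorf("heartbeat interval %v is outside [1s, 60s]", *configured)
	}
	return *configured, true, nil
}

func main() {
	for _, in := range []*time.Duration{nil, seconds(0), seconds(5), seconds(120)} {
		interval, enabled, err := resolveHeartbeatInterval(in)
		fmt.Println(interval, enabled, err)
	}
}
```

The alternative suggested in the review (a plain time.Duration where 0 means default and a negative value disables) avoids the pointer but gives up the distinction between unset and zero.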

cretz · 2026-02-04T15:30:05Z

internal/client.go

 		options.ConnectionOptions.GetSystemInfoTimeout = defaultGetSystemInfoTimeout
 	}

+	if options.Logger == nil {


Can you help me understand this change? All user-facing code that calls NewServiceClient should already be setting this default. Is there a new code path that calls NewServiceClient?

No new code path, but sessionEnvironmentImpl.SignalCreationResponse calls GetClient without a logger, this seems like it would fix that missing scenario, but also seems like good practice to have in general?

this seems like it would fix that missing scenario, but also seems like good practice to have in general?

They may have left logger off intentionally, unsure, but would take some more digging. It's possible/likely you're right, but it seems unrelated to this project and may deserve a separate issue.

yeah that's fair, I'll remove this for now

turns out this was due to newHeartbeatManager using the logger, but I added the check into that function itself, so SignalCreationResponse is unaffected

cretz · 2026-02-04T15:32:29Z

internal/internal_worker.go

@@ -1468,9 +1479,67 @@ func (aw *AggregatedWorker) Stop() {
 		WorkerInstanceKey: aw.workerInstanceKey,


Arguably we should go ahead and update the API dependency and set this on the poll and shutdown calls, but we can make an issue to do that in a subsequent PR if we want.

would prefer to do this in a separate PR

internal/internal_worker_heartbeat.go

cretz · 2026-02-04T15:45:08Z

internal/internal_nexus_worker.go

+	var workflowClient *WorkflowClient
+	if wc, ok := opts.client.(*WorkflowClient); ok {
+		workflowClient = wc
+	}


Should we ever accept a situation where client is not this? But in general, why does nexusWorker need a client? Loading capabilities seems like something the outer/aggregate worker would do.

Agreed, we don't want this requirement on WorkerClient. I removed this from all of the individual workers; AggregatedWorker makes this call instead.

internal/internal_worker_heartbeat.go

internal/resource_tuner.go

cretz · 2026-02-04T16:12:58Z

internal/resource_tuner.go

+// SystemInfoContext provides context for SystemInfoSupplier calls.
+//
+// Exposed as: [go.temporal.io/sdk/worker.SystemInfoContext]
+type SystemInfoContext struct {


Consider embedding context.Context in here even if just context.Background() (but not that important)

keeping out for now, localActivityContext, NexusOperationContext and testContext are all *Context structs that are missing context.Context, if we want to add it in later to use, we can

Then IMO we should consider passing a Go context into the calls alongside where this is passed (even if it's not used for anything yet)

I'm planning on leaving it out for now, and adding in the context into the SysInfoContext struct when we need the context. It seems a little redundant to pass in a Context and SysInfoContext together, any reason we'd need to add it into the calls alongside right now?

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

internal/internal_worker_heartbeat.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

internal/internal_worker_heartbeat.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

internal/internal_worker.go

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

cursor · 2026-02-10T01:46:24Z

internal/internal_worker_heartbeat.go

+	hw.callbacksMutex.Unlock()
+
+	return nil
+}


Race condition between register and unregister in heartbeat manager

Medium Severity

registerWorker releases workersMutex (after getOrCreateSharedNamespaceWorker returns) before adding the callback to the sharedNamespaceWorker. A concurrent unregisterWorker call for a different worker on the same namespace can acquire workersMutex, find zero remaining callbacks, stop the sharedNamespaceWorker, and remove it from the map — all between the mutex release and the callback addition. The registering worker's callback is then silently added to a stopped worker whose goroutine has already exited, so its heartbeats are never sent.

Additional Locations (1)

internal/internal_worker_heartbeat.go#L59-L81
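One way to close a window like this is to hold the manager-level mutex across both the lookup/creation of the shared worker and the callback insertion, so an unregister can never observe an empty callback set in between. The sketch below illustrates that idea only. All type, field, and method names are simplified stand-ins rather than the SDK's code, and the heartbeat goroutine is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedNamespaceWorker holds the heartbeat callbacks for one namespace.
type sharedNamespaceWorker struct {
	callbacks map[string]func()
}

// heartbeatManager owns one shared worker per namespace.
type heartbeatManager struct {
	mu      sync.Mutex
	workers map[string]*sharedNamespaceWorker
}

func newHeartbeatManager() *heartbeatManager {
	return &heartbeatManager{workers: make(map[string]*sharedNamespaceWorker)}
}

// registerWorker looks up (or creates) the shared worker and adds the
// callback while still holding mu, so no unregister can run in between.
func (m *heartbeatManager) registerWorker(namespace, key string, cb func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	w, ok := m.workers[namespace]
	if !ok {
		w = &sharedNamespaceWorker{callbacks: make(map[string]func())}
		m.workers[namespace] = w
	}
	w.callbacks[key] = cb
}

// unregisterWorker removes the callback and drops the shared worker once
// no callbacks remain, all under the same mutex.
func (m *heartbeatManager) unregisterWorker(namespace, key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	w, ok := m.workers[namespace]
	if !ok {
		return
	}
	delete(w.callbacks, key)
	if len(w.callbacks) == 0 {
		delete(m.workers, namespace)
	}
}

// activeCallbacks returns how many callbacks are attached to a live
// shared worker for the namespace.
func (m *heartbeatManager) activeCallbacks(namespace string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	if w, ok := m.workers[namespace]; ok {
		return len(w.callbacks)
	}
	return 0
}

func main() {
	m := newHeartbeatManager()
	m.registerWorker("default", "worker-0", func() {})

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); m.registerWorker("default", "worker-a", func() {}) }()
		go func() { defer wg.Done(); m.unregisterWorker("default", "worker-0") }()
	}
	wg.Wait()

	// worker-a is never unregistered, so it is always attached to a live worker.
	fmt.Println(m.activeCallbacks("default")) // 1
}
```

The cost is a coarser lock: registration and unregistration for all namespaces serialize on one mutex, which is usually acceptable because both are rare compared with heartbeat ticks.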

cursor · 2026-02-10T01:46:24Z

internal/internal_worker_heartbeat_metrics.go

+		TotalFailedTasks:           int32(totalFailed),
+		LastIntervalProcessedTasks: int32(intervalProcessed),
+		LastIntervalFailureTasks:   int32(intervalFailed),
+	}


Unchecked int64-to-int32 truncation may wrap task counters

Low Severity

buildSlotsInfo casts totalProcessed and totalFailed (int64 values from atomic.Int64) directly to int32 without bounds checking. A long-running worker that processes more than ~2.1 billion tasks will silently overflow, producing incorrect negative values in TotalProcessedTasks, TotalFailedTasks, and the LastInterval* fields. Clamping to math.MaxInt32 before the cast would prevent this.
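The clamping suggested above could be sketched as follows. The helper name clampToInt32 is hypothetical; the example only demonstrates the difference between a plain conversion and a saturating one.

```go
package main

import (
	"fmt"
	"math"
)

// clampToInt32 converts v to int32, saturating at the int32 bounds
// instead of wrapping around. Hypothetical helper for illustration.
func clampToInt32(v int64) int32 {
	if v > math.MaxInt32 {
		return math.MaxInt32
	}
	if v < math.MinInt32 {
		return math.MinInt32
	}
	return int32(v)
}

func main() {
	total := int64(math.MaxInt32) + 1

	fmt.Println(int32(total))        // -2147483648 (plain conversion wraps)
	fmt.Println(clampToInt32(total)) // 2147483647
}
```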

yuandrew requested a review from a team as a code owner January 14, 2026 19:54

cursor bot reviewed Jan 14, 2026

internal/internal_worker.go (outdated, resolved)

worker/hostmetrics/hostmetrics.go (outdated, resolved)

internal/internal_worker_heartbeat.go (outdated, resolved)

internal/internal_worker_heartbeat.go (outdated, resolved)

Sushisource reviewed Jan 15, 2026

cretz reviewed Jan 15, 2026

worker/worker.go (outdated, resolved)

yuandrew added 2 commits January 22, 2026 16:54

vendor gopsutil

6af4db8

PR feedback

04ff20f

cursor bot reviewed Jan 26, 2026

internal/internal_worker.go (resolved)

internal/internal_workflow_client.go (outdated, resolved)

cretz reviewed Jan 27, 2026

Sort plugin names

819080b

yuandrew commented Jan 28, 2026

contrib/sysinfo/cgroups.go (resolved)

yuandrew added 7 commits January 27, 2026 22:21

Create new hostinfo package

c156e0d

make methods/structs private, remove aw.workerHeartbeatManager

ebf1064

tighten lock, consolidate describeNamespace calls to a single call in AggregatedWorker.start()

b8893b9

simplify heartbeat metrics, decouple poller/worker type from WithTags()

a6135de

remove unused nexus worker, tighten heartbeat callback and make concurrent-safe

4f43e75

Merge branch 'master' into worker-heartbeat

54ddf1f

Fix tests

e7fbc03

cursor bot reviewed Jan 29, 2026

internal/internal_worker_heartbeat.go (outdated, resolved)

internal/internal_worker_heartbeat.go (resolved)

test/worker_tuner_test.go (resolved)

internal/internal_worker.go (outdated, resolved)

Fix cursor discovered bugs, fix integ tests

edf6e11

cursor bot reviewed Jan 29, 2026

test/worker_heartbeat_test.go (resolved)

internal/internal_worker_heartbeat.go (resolved)

internal/client.go (outdated, resolved)

cretz reviewed Jan 29, 2026

Rename hostinfo to sysinfo, add interval enforcement, rename mutexes, clarify code

73f4a10

cursor bot reviewed Feb 2, 2026

internal/client.go (resolved)

contrib/sysinfo/go.mod (outdated, resolved)

internal/internal_worker.go (resolved)

fix bugs cursor found, sync.oncevalue, separate poll time tracking out of metrics

972555a

cursor bot reviewed Feb 2, 2026

test/worker_heartbeat_test.go (resolved)

yuandrew commented Feb 3, 2026

Add back resource tuner tests that got dropped

f952732

cursor bot reviewed Feb 3, 2026

test/worker_heartbeat_test.go (resolved)

yuandrew added 2 commits February 3, 2026 13:14

Fix tests

a25d85d

Fix tests, disable heartbeating for normal tests, bump dev server version

53da340

cursor bot reviewed Feb 3, 2026

internal/internal_workflow_client.go (outdated, resolved)

cretz reviewed Feb 4, 2026

Finish renames of sysInfoProvider, handle Time.IsZero(), make pollTimeTracker more idiomatic, move namespace capabilities to AW, remove Get from func names

2155206

cursor bot reviewed Feb 6, 2026

internal/internal_worker_heartbeat.go (outdated, resolved)

yuandrew mentioned this pull request Feb 6, 2026

Attach worker_instance_key to all poll calls #2178

Closed

Fix tests

dd02159

cursor bot reviewed Feb 6, 2026

internal/internal_worker_heartbeat.go (outdated, resolved)

yuandrew added 3 commits February 9, 2026 11:11

remove extra default logger addition, remove dead code

bb556cb

Merge branch 'master' into worker-heartbeat

136d311

# Conflicts:
#	contrib/sysinfo/go.sum
#	internal/cmd/build/main.go

forgot a change..

da40521

cursor bot reviewed Feb 10, 2026

internal/internal_worker.go (outdated, resolved)

fix unit tests

04f5d4d

cursor bot reviewed Feb 10, 2026

yuandrew mentioned this pull request Feb 10, 2026

💥Add worker heartbeat support #2186

Merged

yuandrew closed this Feb 11, 2026
