💥 Add worker heartbeat support #2186

Merged
yuandrew merged 35 commits into temporalio:master from yuandrew:worker-heartbeat1
Feb 14, 2026

@yuandrew yuandrew commented Feb 10, 2026

NOTE: This is a re-creation of #2148, making a newer, clean PR as the resolved comment count was getting very high and making it hard to navigate.

What was changed

Key changes:

  • Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
  • Add internal_worker_heartbeat.go for heartbeat worker management
  • 💥 Removed the contrib/resourcetuner library: the resource tuner has moved into the SDK and is exposed through the worker package, and a new contrib/sysinfo package handles system resource measurement
    • We never released a v1.0.0 of contrib/resourcetuner, so this should be okay.
  • Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
  • Always send the ShutdownWorker RPC to deliver a final heartbeat, even when sticky queues are not in use
  • Move the verifyNamespaceExist calls out of each specific worker and call it once at the AggregatedWorker level, caching the result for worker heartbeating
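
The last bullet above (verifying the namespace once at the aggregated-worker level and caching the result) can be pictured roughly as follows. This is a minimal sketch, not the SDK's code: aggregatedWorkerSketch, namespaceInfo, and the load callback are hypothetical names standing in for the real worker and the namespace RPC, and sync.Once is only one possible way to get call-once caching.

```go
package main

import (
	"fmt"
	"sync"
)

// namespaceInfo is a hypothetical result of the namespace lookup.
type namespaceInfo struct {
	heartbeatsEnabled bool
}

// aggregatedWorkerSketch is a hypothetical stand-in for the aggregated worker
// that owns the workflow, activity, and nexus workers.
type aggregatedWorkerSketch struct {
	once sync.Once
	info namespaceInfo
	err  error
	load func() (namespaceInfo, error) // stands in for the namespace RPC
}

// namespace runs load at most once and returns the cached result afterwards.
// A failed first call is cached too; a real implementation may prefer to retry.
func (aw *aggregatedWorkerSketch) namespace() (namespaceInfo, error) {
	aw.once.Do(func() { aw.info, aw.err = aw.load() })
	return aw.info, aw.err
}

func main() {
	calls := 0
	aw := &aggregatedWorkerSketch{load: func() (namespaceInfo, error) {
		calls++
		return namespaceInfo{heartbeatsEnabled: true}, nil
	}}
	// Each sub-worker and the heartbeat path share the same cached lookup.
	for _, who := range []string{"workflow", "activity", "heartbeat"} {
		info, err := aw.namespace()
		fmt.Println(who, info.heartbeatsEnabled, err)
	}
	fmt.Println("namespace lookups:", calls)
}
```

The design point is that sub-workers no longer each issue their own lookup; they read one shared result owned by the aggregate.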

Why?

New feature!

Checklist

  1. Closes Worker Heartbeating #2094

  2. How was this tested:

Added a bunch of tests to worker_heartbeat_test.go

  • Add integration tests for worker heartbeat functionality

  3. Any docs updates needed?

Note

Medium Risk
Touches core worker start/stop and polling paths and adds new background goroutines/RPCs for heartbeating; failures could impact worker shutdown behavior or add load if misconfigured, but the feature is gated by an experimental interval option and server capabilities.

Overview
Adds experimental worker heartbeats: clients can now be configured with WorkerHeartbeatInterval to periodically call RecordWorkerHeartbeat and to send a final heartbeat via ShutdownWorker on stop, including worker/poller/slot metrics, plugin names, deployment version, and optional CPU/memory usage.

Refactors tuning/sysinfo: moves the resource-based tuner into the SDK worker API (requiring an injected SysInfoProvider), introduces new contrib/sysinfo (gopsutil + cgroup-aware) implementation, and wires poll-success tracking plus heartbeat-specific metrics capturing across workflow/activity/nexus pollers.

Updates dev/test configs to enable WorkerHeartbeatsEnabled/ListWorkersEnabled, adjusts integration tests accordingly, and adds a large new worker_heartbeat_test.go coverage suite.

Written by Cursor Bugbot for commit 1cb7c90. This will update automatically on new commits.
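
The heartbeat lifecycle described in the overview (periodic heartbeats on an interval, plus one final heartbeat when the worker stops) can be sketched as below. Everything here is illustrative: heartbeatLoop and startHeartbeatLoop are hypothetical names, and the send callback stands in for the RecordWorkerHeartbeat and ShutdownWorker calls rather than reproducing them.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatLoop is a hypothetical stand-in for the SDK's heartbeat worker.
type heartbeatLoop struct {
	stop chan struct{}
	done chan struct{}
}

// startHeartbeatLoop calls send(false) for every value received on ticks and
// send(true) exactly once when Stop is called. The tick channel is injected
// (and generic over its element type) so callers control the cadence; a
// time.Ticker's C channel is the typical choice.
func startHeartbeatLoop[T any](ticks <-chan T, send func(final bool)) *heartbeatLoop {
	h := &heartbeatLoop{stop: make(chan struct{}), done: make(chan struct{})}
	go func() {
		defer close(h.done)
		for {
			select {
			case <-ticks:
				send(false) // periodic heartbeat
			case <-h.stop:
				send(true) // final heartbeat, sent as part of shutdown
				return
			}
		}
	}()
	return h
}

// Stop requests the final heartbeat and waits for the goroutine to exit.
// It must be called at most once.
func (h *heartbeatLoop) Stop() {
	close(h.stop)
	<-h.done
}

func main() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	h := startHeartbeatLoop(ticker.C, func(final bool) {
		fmt.Println("heartbeat sent, final =", final)
	})
	time.Sleep(35 * time.Millisecond)
	h.Stop()
}
```

Waiting on done in Stop is what lets the caller know the final heartbeat has been attempted before shutdown continues.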

Implement worker heartbeats that report worker status, slot usage, and metrics
to the server via the WorkerHeartbeat RPC.

Key changes:
- Add HeartbeatMetricsHandler to capture metrics needed for heartbeats
- Add internal_worker_heartbeat.go for heartbeat worker management
- Add hostmetrics package for CPU/memory usage reporting
- Plumb heartbeat data through workers (workflow, activity, nexus, local activity)
- Add integration tests for worker heartbeat functionality
- Fix nil pointer in shutdownWorker when heartbeating is disabled
…eTracker more idiomatic, move namespace capabilities to AW, remove Get from func names
# Conflicts:
#	contrib/sysinfo/go.sum
#	internal/cmd/build/main.go
…tbeat worker creation and callback registration
// SysInfoContext provides context for SysInfoProvider calls.
//
// Exposed as: [go.temporal.io/sdk/worker.SysInfoContext]
type SysInfoContext struct {
Contributor Author

#2148 (comment)

cretz
Consider embedding context.Context in here even if just context.Background() (but not that important)

yuandrew
keeping out for now, localActivityContext, NexusOperationContext and testContext are all *Context structs that are missing context.Context, if we want to add it in later to use, we can

cretz
Then IMO we should consider passing a Go context into the calls alongside where this is passed (even if it's not used for anything yet)

yuandrew
I'm planning on leaving it out for now, and adding in the context into the SysInfoContext struct when we need the context. It seems a little redundant to pass in a Context and SysInfoContext together, any reason we'd need to add it into the calls alongside right now?

Contributor

Go APIs that the SDK invokes and that others implement basically always need a context.Context, even if you don't do anything with it yet. At this point it might as well be part of the Go API, heh. You can wait and embed it later if you want, though it's harmless to do now.
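
To make the suggestion in this thread concrete, here is a minimal sketch of a provider interface that takes a context.Context alongside the SDK-specific context struct. The interface, method name, and types are assumptions for illustration only and are not the SDK's actual SysInfoProvider API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sysInfoContext is a hypothetical stand-in for worker.SysInfoContext.
type sysInfoContext struct{}

// sysInfoProvider is a hypothetical provider interface; the method name and
// signature are assumptions for illustration, not the SDK's API.
type sysInfoProvider interface {
	CPUUsage(ctx context.Context, info *sysInfoContext) (float64, error)
}

// fixedProvider returns a constant reading but still honors cancellation.
type fixedProvider struct{ usage float64 }

func (p fixedProvider) CPUUsage(ctx context.Context, _ *sysInfoContext) (float64, error) {
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	return p.usage, nil
}

// sampleWithTimeout shows what the caller gains: it can bound a slow provider
// without the provider interface having to change later.
func sampleWithTimeout(p sysInfoProvider, timeout time.Duration) (float64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return p.CPUUsage(ctx, &sysInfoContext{})
}

func main() {
	p := fixedProvider{usage: 0.25}
	fmt.Println(sampleWithTimeout(p, time.Second))
	fmt.Println(sampleWithTimeout(p, 0)) // deadline has already expired
}
```

The trade-off discussed above is visible here: adding the context parameter up front costs implementers one unused argument today, while adding it later would change the interface that third parties implement.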

@yuandrew yuandrew marked this pull request as ready for review February 11, 2026 17:24
@yuandrew yuandrew requested a review from a team as a code owner February 11, 2026 17:24
}
proto.Merge(aw.capabilities, capabilities)

if _, err := aw.client.loadNamespaceCapabilities(aw.executionParams.MetricsHandler); err != nil {
Contributor Author

@jmaeagle99 this is similar to your change in external storage, we can coordinate

Contributor

@cretz cretz left a comment

Nothing blocking, just a few minor comments, though worth looking at them and seeing if we want to address

NumNexusSlots: defaultMaxConcurrentTaskExecutionSize,
})
}
if params.pollTimeTracker == nil {
Contributor

What is the situation where this is not always nil? Should this be set where params is created and not here?

Contributor Author

It's always nil, but this is the most common place to nil-check, mainly for the sake of unit tests. I had this set where params are created in NewAggregatedWorker, but I'd still like to keep this check here for tests.

stickyTaskQueue = getWorkerTaskQueue(aw.workflowWorker.stickyUUID)
}

_, err := aw.client.workflowService.ShutdownWorker(grpcCtx, &workflowservice.ShutdownWorkerRequest{
Contributor

Can we pass the worker instance key here and in polls? Or, if these should be done in a separate issue, can we create it and put it up for triage?

Contributor Author

yes, added

Member

It looks like it's still not on shutdown here, but is on polls?

}

if err != nil {
aw.logger.Debug("ShutdownWorker rpc errored during worker shutdown.", tagError, err)
Contributor

I know this was debug in old code, but we should consider making this a warning, but can be done in a separate issue. This failing has negative user effects, so effectively swallowing this error and not telling user can be rough. But we have to think about situations where this is just noise.

Contributor Author

will address separately

…w/activity poll requests, make WorkerHeartbeatInterval not a pointer
TotalFailedTasks: int32(totalFailed),
LastIntervalProcessedTasks: int32(intervalProcessed),
LastIntervalFailureTasks: int32(intervalFailed),
}

Silent int64-to-int32 truncation corrupts heartbeat task counters

Medium Severity

buildSlotsInfo truncates int64 counter values to int32 via int32(totalProcessed) and int32(totalFailed) when populating the WorkerSlotsInfo proto. The internal tracking uses atomic.Int64, which can grow beyond math.MaxInt32. For a worker processing ~1000 tasks/second, this overflows in ~25 days, silently producing negative or wrapped values for TotalProcessedTasks and TotalFailedTasks. The same applies to TotalStickyCacheHit/TotalStickyCacheMiss in PopulateHeartbeat. Clamping to MaxInt32 would prevent silent corruption.

Member

Alternatively these should just be stored as int32 since that's what we have to send them as eventually anyway
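
For reference, the clamping option mentioned in the bot's report could look like the sketch below. clampToInt32 is a hypothetical helper name, not something taken from the PR. The "~25 days" figure is consistent with the arithmetic: 2,147,483,647 tasks at 1,000 tasks per second is about 2,147,484 seconds, or roughly 24.9 days.

```go
package main

import (
	"fmt"
	"math"
)

// clampToInt32 converts v to int32, saturating at the int32 bounds instead of
// wrapping around the way a plain int32(v) conversion does.
func clampToInt32(v int64) int32 {
	switch {
	case v > math.MaxInt32:
		return math.MaxInt32
	case v < math.MinInt32:
		return math.MinInt32
	default:
		return int32(v)
	}
}

func main() {
	total := int64(math.MaxInt32) + 1
	fmt.Println("plain conversion:", int32(total))
	fmt.Println("clamped:", clampToInt32(total))
}
```

Storing the counters as int32 from the start, as suggested in the reply above, avoids the conversion entirely but moves the overflow question to the increment site instead.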

Member

@Sushisource Sushisource left a comment

Overall makes sense to me. I didn't do a super detailed review but the high level and tests look reasonable. A few small things.

} else {
wtp.pollTimeTracker.recordPollSuccess(metrics.PollerTypeWorkflowTask)
}

Idle pollers report stale heartbeat timestamps

Medium Severity

pollTimeTracker.recordPollSuccess is only called when a poll returns a task token. Successful empty polls are skipped, so LastSuccessfulPollTime in worker heartbeats can remain stale or zero while pollers are healthy and actively polling. This makes heartbeat poller health data inaccurate during idle periods.
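
One way to read the report is that a poll returning without error should count as successful whether or not it carried a task. The sketch below illustrates that reading only; pollTracker, handlePollResult, and errPollFailed are hypothetical, simplified stand-ins, not the SDK's pollTimeTracker or poller code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errPollFailed is a hypothetical error used to simulate a failed poll.
var errPollFailed = errors.New("poll failed")

// pollTracker is a simplified, hypothetical stand-in for a poll time tracker.
type pollTracker struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newPollTracker() *pollTracker {
	return &pollTracker{last: make(map[string]time.Time)}
}

// recordPollSuccess stores the time of the latest successful poll per poller type.
func (t *pollTracker) recordPollSuccess(pollerType string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.last[pollerType] = time.Now()
}

// lastSuccess reports the latest recorded poll time, if any.
func (t *pollTracker) lastSuccess(pollerType string) (time.Time, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ts, ok := t.last[pollerType]
	return ts, ok
}

// handlePollResult records every error-free poll, including empty ones, and
// reports whether the poll actually carried a task.
func handlePollResult(t *pollTracker, pollerType string, taskToken []byte, err error) bool {
	if err != nil {
		return false
	}
	t.recordPollSuccess(pollerType)
	return len(taskToken) > 0
}

func main() {
	tr := newPollTracker()
	hasTask := handlePollResult(tr, "workflow", nil, nil) // empty but healthy poll
	_, recorded := tr.lastSuccess("workflow")
	fmt.Println("has task:", hasTask, "recorded:", recorded)
}
```

Whether empty polls should count is a product decision; the sketch only shows that the two signals (poll succeeded, task received) can be tracked separately.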

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

}

aw.shutdownWorker()

Shutdown sent for never-started workers

Medium Severity

AggregatedWorker.Stop() now always calls shutdownWorker() even when the worker was never started. This can send a ShutdownWorker RPC with a synthesized heartbeat for a worker that never polled, creating misleading worker state and adding avoidable network/retry delay during startup-failure cleanup.
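
A guard of the kind this report implies might look like the following sketch, which skips the shutdown call unless the worker was actually started. workerSketch and its fields are hypothetical; the counter merely stands in for issuing the ShutdownWorker RPC, and the PR itself may handle this differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// workerSketch is a hypothetical stand-in for an aggregated worker.
type workerSketch struct {
	started      atomic.Bool
	shutdownRPCs int // counts simulated ShutdownWorker calls
}

// Start marks the worker as running.
func (w *workerSketch) Start() {
	w.started.Store(true)
}

// Stop sends the shutdown call only if the worker was started, and only once.
func (w *workerSketch) Stop() {
	if !w.started.CompareAndSwap(true, false) {
		return // never started, or already stopped
	}
	w.shutdownWorker()
}

// shutdownWorker stands in for the final heartbeat / ShutdownWorker RPC.
func (w *workerSketch) shutdownWorker() {
	w.shutdownRPCs++
}

func main() {
	neverStarted := &workerSketch{}
	neverStarted.Stop()
	fmt.Println("never started, shutdown calls:", neverStarted.shutdownRPCs)

	running := &workerSketch{}
	running.Start()
	running.Stop()
	fmt.Println("started then stopped, shutdown calls:", running.shutdownRPCs)
}
```

Using CompareAndSwap also makes a repeated Stop a no-op, which avoids a second shutdown call during error-path cleanup.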

@yuandrew yuandrew merged commit 3f64d34 into temporalio:master Feb 14, 2026
31 of 34 checks passed
@yuandrew yuandrew deleted the worker-heartbeat1 branch February 14, 2026 02:17
@yuandrew yuandrew changed the title from "Add worker heartbeat support" to "💥Add worker heartbeat support" Mar 5, 2026