agent: emit TLS certificate expiry metrics. #27538

Open
jrasell wants to merge 2 commits into main from f-NMD-1055

@jrasell jrasell commented Feb 19, 2026

Periodically emit gauge metrics tracking the TTL of the agent's TLS certificate and CA certificate. The metrics are refreshed on reload and stopped when TLS is removed or the agent shuts down.

The metrics are emitted from the agent because the provided TLS certificates apply to both the server and client agent modes. This felt like the best implementation approach, as it is easy to reason about and understand.

AI Disclosure: I attempted to use Claude to generate some of the TLS metric handle boilerplate. Almost all of the generated code was deleted or reworked. I did not attempt to use any GenAI for the agent code.

Docs: I'll add as a follow up in the unified repo.

Testing & Reproduction steps

When running an agent in dev mode with TLS enabled, the new metrics can be reviewed using the following commands (assuming jq is available):

curl -sk https://localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name=="nomad.agent.tls.cert.expiration_seconds")'
curl -sk https://localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name=="nomad.agent.tls.ca.expiration_seconds")'
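
The jq filters above select entries from the `Gauges` array by `Name`. As a rough sketch of the same lookup in Go, the snippet below decodes a metrics payload and returns the matching gauge. The `Gauges` and `Name` fields come from the jq filter; the `Value` field, the helper name `findGauge`, and the sample payload are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gauge mirrors only the fields used in this sketch. Name comes from the
// jq filter; Value is an assumption about the response shape.
type gauge struct {
	Name  string
	Value float64
}

// metricsSummary holds the part of the metrics response that is needed here.
type metricsSummary struct {
	Gauges []gauge
}

// findGauge decodes a JSON metrics payload and returns the first gauge
// with the given name. The boolean reports whether a match was found.
func findGauge(payload []byte, name string) (gauge, bool, error) {
	var s metricsSummary
	if err := json.Unmarshal(payload, &s); err != nil {
		return gauge{}, false, err
	}
	for _, g := range s.Gauges {
		if g.Name == name {
			return g, true, nil
		}
	}
	return gauge{}, false, nil
}

func main() {
	// Sample payload invented for illustration; a real response carries
	// more fields and more gauges.
	payload := []byte(`{"Gauges":[{"Name":"nomad.agent.tls.cert.expiration_seconds","Value":86400}]}`)

	g, ok, err := findGauge(payload, "nomad.agent.tls.cert.expiration_seconds")
	if err != nil {
		panic(err)
	}
	fmt.Println(g.Name, g.Value, ok)
}
```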

Links

Jira: https://hashicorp.atlassian.net/browse/NMD-1055
Closes: #26997

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade
    guide. If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Periodically emit gauge metrics tracking the TTL of the agent's TLS
certificate and CA certificate. The metrics are refreshed on reload
and stopped when TLS is removed or the agent shuts down.
@jrasell jrasell added the backport/1.11.x backport to 1.11.x release line label Feb 20, 2026
@jrasell jrasell marked this pull request as ready for review February 20, 2026 10:32
@jrasell jrasell requested review from a team as code owners February 20, 2026 10:32

@pkazmierczak pkazmierczak left a comment

In general looks really good, but I left some comments that I'd like clarified.

// We have successfully initialized the new TLS metrics emitter, so
// we can stop the old one (if it exists) and start the new one.
if a.tlsMetrics != nil {
	a.tlsMetrics.stop()

Contributor

This closes the channel but doesn't really handle stopping the goroutine; see your comment above. Doesn't a reload lead to a goroutine leak in this case?

Member Author

The tlsMetrics.emitLoop, which runs as a goroutine, has a select on the channel. That case is triggered when the channel is closed and uses a return to make the goroutine exit.

I've tested this locally using the following workflow:

  • start a Nomad agent with debug and TLS enabled
  • regenerate the TLS certificate and send SIGHUP to the agent
  • navigate to the pprof endpoint
  • ensure there is only one tlsMetrics goroutine running, probably in select as shown below:

goroutine 1135 [select]:
github.com/hashicorp/nomad/command/agent.(*tlsMetrics).emitLoop(0x1400135bb60, 0x3b9aca00)
	github.com/hashicorp/nomad/command/agent/tls_metrics.go:98 +0x8c
created by github.com/hashicorp/nomad/command/agent.(*tlsMetrics).start in goroutine 1
	github.com/hashicorp/nomad/command/agent/tls_metrics.go:79 +0x8c
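
The manual check above could also be scripted. The following is a rough sketch that counts goroutine records mentioning a given function in a text dump. It assumes the dump separates records with blank lines, as in the trace shown here; the helper name `countGoroutines` and the second record in the sample dump are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// countGoroutines counts the goroutine records in a text dump whose stack
// mentions fn. Records are assumed to be separated by blank lines.
func countGoroutines(dump, fn string) int {
	n := 0
	for _, rec := range strings.Split(dump, "\n\n") {
		rec = strings.TrimSpace(rec)
		if strings.HasPrefix(rec, "goroutine ") && strings.Contains(rec, fn) {
			n++
		}
	}
	return n
}

func main() {
	// The first record is the trace from the comment above; the second is
	// a made-up unrelated goroutine.
	dump := "goroutine 1135 [select]:\n" +
		"github.com/hashicorp/nomad/command/agent.(*tlsMetrics).emitLoop(0x1400135bb60, 0x3b9aca00)\n" +
		"\tgithub.com/hashicorp/nomad/command/agent/tls_metrics.go:98 +0x8c\n" +
		"created by github.com/hashicorp/nomad/command/agent.(*tlsMetrics).start in goroutine 1\n" +
		"\tgithub.com/hashicorp/nomad/command/agent/tls_metrics.go:79 +0x8c\n" +
		"\n" +
		"goroutine 7 [chan receive]:\n" +
		"main.worker()\n" +
		"\tmain.go:10 +0x20\n"

	fmt.Println(countGoroutines(dump, "(*tlsMetrics).emitLoop"))
}
```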

Is there something else I am missing here?

Contributor

All good, I might've gotten confused by the emitLoop.

@pkazmierczak pkazmierczak left a comment

:shipit:

Development

Successfully merging this pull request may close these issues.

metric: Add new agent metric to monitor TLS certificate expiry