Report exemplars for histogram and timer metrics #215

Open
sfackler wants to merge 3 commits into develop from exemplars

sfackler commented Jan 6, 2025

Before this PR

We didn't report any exemplars for metrics, which made it harder to investigate slow requests or other anomalous measurements.

After this PR

==COMMIT_MSG==
Histogram and timer metrics now report exemplars.
==COMMIT_MSG==

The behavior here matches WC-Java: we report at most one exemplar per metric. That exemplar is the highest-valued measurement (i.e. the slowest, for a timer metric) recorded from a sampled trace within the last reporting window.
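
The selection rule described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `Exemplar`, `ExemplarSlot`, `record`, and `take_for_report` are hypothetical names, and the sketch assumes the caller passes a trace ID only when the current trace is sampled.

```rust
/// A single reported sample: the measured value and the trace it came from.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct Exemplar {
    value: f64,
    trace_id: String,
}

/// Holds at most one exemplar for a metric during a reporting window.
#[derive(Debug, Default)]
struct ExemplarSlot {
    current: Option<Exemplar>,
}

impl ExemplarSlot {
    /// Records a measurement. `sampled_trace_id` is assumed to be `Some`
    /// only when the measurement was made inside a sampled trace.
    fn record(&mut self, value: f64, sampled_trace_id: Option<&str>) {
        if let Some(trace_id) = sampled_trace_id {
            // Keep the measurement only if it beats the current maximum.
            let is_new_max = match &self.current {
                Some(existing) => value > existing.value,
                None => true,
            };
            if is_new_max {
                self.current = Some(Exemplar {
                    value,
                    trace_id: trace_id.to_string(),
                });
            }
        }
    }

    /// Returns the exemplar for this window and resets the slot so the
    /// next window starts empty.
    fn take_for_report(&mut self) -> Option<Exemplar> {
        self.current.take()
    }
}
```

In this sketch, measurements from unsampled traces never become exemplars even if they are the largest in the window, and a window with no sampled measurements reports no exemplar at all.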

Depends on palantir/witchcraft-rust-logging#40.

Metric output with an exemplar:

{"type":"metric.1","time":"2025-11-30T22:06:08.389198Z","metricName":"server.response","metricType":"timer","values":{"1m":0.024078182461403894,"count":2,"max":270.667,"p95":270.667,"p99":270.667,"p999":270.667},"samples":[{"value":270.667,"time":"2025-11-30T22:05:49.878164Z","traceId":"aa9ab66b2e728b7a"}],"tags":{"endpoint":"foo","service-name":"TestResource"}}

@sfackler sfackler requested a review from a team January 6, 2025 15:01

changelog-app bot commented Jan 6, 2025

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Histogram and timer metrics now report exemplars.

Check the box to generate changelog(s)

  • Generate changelog entry


stale bot commented Jun 27, 2025

This PR has been automatically marked as stale because it has not been touched in the last 14 days. If you'd like to keep it open, please leave a comment or add the 'long-lived' label, otherwise it'll be closed in 7 days.

@stale stale bot added the stale label Jun 27, 2025
@sfackler sfackler removed the stale label Jun 27, 2025

stale bot commented Oct 18, 2025

This PR has been automatically marked as stale because it has not been touched in the last 14 days. If you'd like to keep it open, please leave a comment or add the 'long-lived' label, otherwise it'll be closed in 7 days.

@stale stale bot added the stale label Oct 18, 2025
@sfackler sfackler removed the stale label Oct 20, 2025
}

#[pinned_drop]
impl<F> PinnedDrop for TracePropagationFuture<F> {

These impls are required to ensure that the zipkin thread-local state is set when the endpoint metric layer handles its timer updates.
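
As a rough, std-only illustration of the idea (not the PR's `pin_project`-based code), the sketch below re-establishes a thread-local trace ID while a wrapped value is dropped, so that anything the inner value does in its own `Drop` can see it. All names here (`CURRENT_TRACE`, `OBSERVED`, `ContextGuard`, `WithTraceOnDrop`, `TimerOnDrop`) are hypothetical stand-ins, and a plain `String` stands in for the real trace context.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for the tracing library's thread-local context.
    static CURRENT_TRACE: RefCell<Option<String>> = RefCell::new(None);
    // What each dropped "timer" saw, so the effect can be inspected.
    static OBSERVED: RefCell<Vec<Option<String>>> = RefCell::new(Vec::new());
}

fn current_trace() -> Option<String> {
    CURRENT_TRACE.with(|c| c.borrow().clone())
}

fn observed() -> Vec<Option<String>> {
    OBSERVED.with(|o| o.borrow().clone())
}

/// Restores the previous thread-local value when dropped.
struct ContextGuard {
    previous: Option<String>,
}

fn enter(trace_id: Option<String>) -> ContextGuard {
    let previous = CURRENT_TRACE.with(|c| c.replace(trace_id));
    ContextGuard { previous }
}

impl Drop for ContextGuard {
    fn drop(&mut self) {
        let previous = self.previous.take();
        CURRENT_TRACE.with(|c| *c.borrow_mut() = previous);
    }
}

/// Stand-in for a metrics timer that records its measurement on drop.
struct TimerOnDrop;

impl Drop for TimerOnDrop {
    fn drop(&mut self) {
        OBSERVED.with(|o| o.borrow_mut().push(current_trace()));
    }
}

/// Wraps a value and makes `trace_id` current while that value is dropped.
struct WithTraceOnDrop<T> {
    inner: Option<T>,
    trace_id: String,
}

impl<T> WithTraceOnDrop<T> {
    fn new(inner: T, trace_id: &str) -> Self {
        WithTraceOnDrop { inner: Some(inner), trace_id: trace_id.to_string() }
    }
}

impl<T> Drop for WithTraceOnDrop<T> {
    fn drop(&mut self) {
        // Set the context first, then drop the inner value inside it.
        let _guard = enter(Some(self.trace_id.clone()));
        drop(self.inner.take());
        // `_guard` is dropped last and restores the previous context.
    }
}
```

Without the custom `Drop` on the wrapper, the inner value would be dropped with whatever context happened to be current on the thread, which in this sketch is `None`.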

.insert_values("p95", snapshot.value(0.95) / NANOS_PER_MICRO_F64)
.insert_values("p99", snapshot.value(0.99) / NANOS_PER_MICRO_F64)
.insert_values("p999", snapshot.value(0.999) / NANOS_PER_MICRO_F64)
.insert_values("max", (snapshot.max() as f64) / NANOS_PER_MICRO)

Drive-by fix: previously the max was rounded down to the nearest whole microsecond, while the percentiles were not.
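
The rounding difference can be illustrated with a small sketch. The previous code is not shown in this conversation, so the truncating version below is an assumption about what integer division before conversion would look like; the constant's type and both function names are hypothetical.

```rust
// Assumed integer constant; the real constants' types are not shown here.
const NANOS_PER_MICRO: i64 = 1_000;

/// Integer division first: the fractional microseconds are discarded.
fn max_micros_truncated(max_nanos: i64) -> f64 {
    (max_nanos / NANOS_PER_MICRO) as f64
}

/// Convert to f64 first: the fractional microseconds are preserved.
fn max_micros_fractional(max_nanos: i64) -> f64 {
    max_nanos as f64 / NANOS_PER_MICRO as f64
}
```

For a maximum of 270,667 ns, the truncating version yields 270.0 while the fractional version yields roughly 270.667, in line with the "max" value in the sample output in the PR description.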
