7986: metrics: add task schedule latency metric#88
7986: metrics: add task schedule latency metric#88martin-augment wants to merge 8 commits intomasterfrom
Conversation
This new metric tracks the amount of time between when a task is scheduled and when it is polled, also known as queue delay. This duration is recorded in a histogram, just like the poll time metric. This metric is useful for implementing queue management algorithms in systems using tokio. For example, it could be used to implement a generic http load shedder using the CoDel algorithm.
WalkthroughSchedule-latency histogram metrics support is added to Tokio's runtime. The implementation introduces configuration options in the runtime builder, tracking infrastructure for per-task scheduling timestamps, histogram batch accumulation in metrics collection, and new query APIs in RuntimeMetrics. Changes span the builder, task core structures, scheduler implementations (current-thread and multi-thread), metrics collection paths, and includes a validation test. The feature is gated behind unstable and 64-bit target width conditions. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces task schedule latency tracking as an unstable metric for both current-thread and multi-thread runtimes. It adds configuration methods to the runtime builder, updates task headers to store scheduling timestamps on 64-bit targets, and provides new public APIs in RuntimeMetrics to access histogram data. A review comment correctly identifies that the Notified struct is missing a get_scheduled_at method, which is required for the multi-threaded scheduler to record these metrics.
| pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> { | ||
| self.0.task_meta() | ||
| } | ||
|
|
||
| #[cfg(all(tokio_unstable, target_pointer_width = "64"))] | ||
| pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) { | ||
| // SAFETY: There are no concurrent writes because there is only ever one `Notified` | ||
| // reference per task. There are no concurrent reads because this field is only read | ||
| // when polling the task, which can only happen after it's scheduled. | ||
| unsafe { | ||
| self.0.header().set_scheduled_at(nanos); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The Notified struct is missing the get_scheduled_at method, which is required by the multi-threaded scheduler to record schedule latency metrics. While LocalNotified has this method added, Notified (used in the multi-threaded runtime) also needs it to avoid compilation errors when latency tracking is enabled.
| pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> { | |
| self.0.task_meta() | |
| } | |
| #[cfg(all(tokio_unstable, target_pointer_width = "64"))] | |
| pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) { | |
| // SAFETY: There are no concurrent writes because there is only ever one `Notified` | |
| // reference per task. There are no concurrent reads because this field is only read | |
| // when polling the task, which can only happen after it's scheduled. | |
| unsafe { | |
| self.0.header().set_scheduled_at(nanos); | |
| } | |
| } | |
| } | |
| pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> { | |
| self.0.task_meta() | |
| } | |
| #[cfg(all(tokio_unstable, target_pointer_width = "64"))] | |
| pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) { | |
| // SAFETY: There are no concurrent writes because there is only ever one `Notified` | |
| // reference per task. There are no concurrent reads because this field is only read | |
| // when polling the task, which can only happen after it's scheduled. | |
| unsafe { | |
| self.0.header().set_scheduled_at(nanos); | |
| } | |
| } | |
| #[cfg(all(tokio_unstable, target_pointer_width = "64"))] | |
| pub(crate) fn get_scheduled_at(&self) -> Option<NonZeroU64> { | |
| self.0.header().get_scheduled_at() | |
| } | |
| } |
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Notifier has a setter and LocalNotifier has a getter. They both should have both setters and getters.
🤖 Augment PR SummarySummary: Adds an (unstable) runtime metric to track task schedule latency as a histogram. Changes:
Technical Notes: Instrumentation uses 🤖 Was this summary useful? React with 👍 or 👎 |
| } | ||
|
|
||
| fn metrics_schedule_latency_histogram_builder(&self) -> Option<HistogramBuilder> { | ||
| if self.metrics_schedule_latency_histogram_enable { |
There was a problem hiding this comment.
tokio/src/runtime/builder.rs:1789: metrics_schedule_latency_histogram_builder() returns Some whenever enabled, even on targets where schedule-latency instrumentation is cfg’d out (e.g., 32-bit), which can lead to the metric being reported as enabled but never producing counts. Consider aligning the cfg/behavior with the docs stating 64-bit-only support.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Half of the related implementation is gated behind a #[cfg(target_pointer_width = "64")] but the rest does not use this gate. This makes the second half visible for usage in 32bit systems but it won'd do anything. Prevents leaking implementation details in 32 bits builds.
| /// [`enable_metrics_schedule_latency_histogram()`]: crate::runtime::Builder::enable_metrics_schedule_latency_histogram | ||
| /// [`Instant::now()`]: std::time::Instant::now | ||
| pub fn schedule_latency_histogram_enabled(&self) -> bool { | ||
| self.handle.inner.worker_metrics(0).schedule_latency_histogram.is_some() |
There was a problem hiding this comment.
tokio/src/runtime/metrics/runtime.rs:605: schedule_latency_histogram_enabled() only checks whether the histogram is configured, but on 32-bit targets the per-task scheduled_at plumbing and schedule_latency_histogram_bucket_count() API are cfg’d out, so this may return true while the metric can’t be observed. Consider gating these accessors consistently (or clarifying the documented support matrix).
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Half of the related implementation is gated behind a #[cfg(target_pointer_width = "64")] but the rest does not use this gate. This makes the second half visible for usage in 32bit systems but it won'd do anything. Prevents leaking implementation details in 32 bits builds.
| pub(super) poll_count_histogram: Option<Histogram>, | ||
|
|
||
| #[cfg(tokio_unstable)] | ||
| /// If `Some`, tracks the number of times tasks were scheduled by duration range. |
There was a problem hiding this comment.
tokio/src/runtime/metrics/worker.rs:70: The doc comment on schedule_latency_histogram says it tracks “times tasks were scheduled by duration range”, but the implementation increments buckets based on schedule-to-poll latency observed at poll time. This mismatch could mislead users interpreting what the histogram counts represent.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The documentation is a bit unclear what is meant by "duration range". It should be improved by explaining that these are the poll latency durations.
There was a problem hiding this comment.
🧹 Nitpick comments (3)
CLAUDE.md (1)
1-1: Optional wording polish for clarity and branding consistency.Consider changing “Github” → “GitHub” and tightening phrasing for readability (e.g., “Never link to issues or pull requests in any GitHub repository.”).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CLAUDE.md` at line 1, Update the sentence "Never link to any issue or a pull request in any Github repository" in CLAUDE.md to improve branding and clarity: change "Github" to "GitHub" and tighten phrasing to something like "Never link to issues or pull requests in any GitHub repository." Replace the original string exactly and ensure punctuation/capitalization matches the rest of the document.tokio/src/runtime/scheduler/multi_thread/worker.rs (1)
1289-1304: Verify safety ofNonZeroU64::new_uncheckedusage.The safety invariant relies on
.max(1)ensuring the value is never zero. While theas u64cast fromu128can theoretically truncate, this only occurs after ~584 years of uptime, which is acceptable.Consider adding a brief comment noting the truncation behavior for future readers:
📝 Optional: Add truncation note
// SAFETY: `.max(1)` ensures the value can never be 0. + // Note: truncation from u128 to u64 only occurs after ~584 years of uptime. let scheduled_at = unsafe {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tokio/src/runtime/scheduler/multi_thread/worker.rs` around lines 1289 - 1304, The call to NonZeroU64::new_unchecked in schedule_task relies on .max(1) to avoid zero, but the u128->u64 cast can truncate after extremely long uptime; update the schedule_task block (around the scheduled_at calculation and task.set_scheduled_at call) to preserve the existing logic and add a concise comment documenting that the as u64 cast may truncate the u128 nanoseconds after ~584 years and that this truncation is acceptable for metrics, and keep the .max(1) safety guarantee in place so new_unchecked is never passed zero.tokio/src/runtime/builder.rs (1)
1567-1611: Consider clarifying 32-bit platform behavior.The documentation correctly states "This feature is only supported on 64-bit targets" (line 1572), but the method itself has no compile-time guard. On 32-bit targets, calling this method will allocate the histogram but no samples will ever be recorded since measurement is gated behind
target_pointer_width = "64".This is consistent with similar patterns in Tokio where configuration is portable but behavior is platform-dependent. However, users on 32-bit platforms may be confused when the histogram remains empty.
Consider adding a note like: "On 32-bit targets, the histogram will be created but will remain empty as measurement is not supported."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tokio/src/runtime/builder.rs` around lines 1567 - 1611, The docs say schedule-latency histograms are 64-bit only but the builder method enable_metrics_schedule_latency_histogram() currently has no note about 32-bit behavior; add a brief sentence to the method's doc comment stating that on 32-bit targets the histogram will still be created (since metrics_schedule_latency_histogram_enable is set) but will remain empty because measurements are gated on target_pointer_width = "64". Reference the Builder::enable_metrics_schedule_latency_histogram method and the metrics_schedule_latency_histogram_enable field in the comment so callers on 32-bit platforms aren't confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@CLAUDE.md`:
- Line 1: Update the sentence "Never link to any issue or a pull request in any
Github repository" in CLAUDE.md to improve branding and clarity: change "Github"
to "GitHub" and tighten phrasing to something like "Never link to issues or pull
requests in any GitHub repository." Replace the original string exactly and
ensure punctuation/capitalization matches the rest of the document.
In `@tokio/src/runtime/builder.rs`:
- Around line 1567-1611: The docs say schedule-latency histograms are 64-bit
only but the builder method enable_metrics_schedule_latency_histogram()
currently has no note about 32-bit behavior; add a brief sentence to the
method's doc comment stating that on 32-bit targets the histogram will still be
created (since metrics_schedule_latency_histogram_enable is set) but will remain
empty because measurements are gated on target_pointer_width = "64". Reference
the Builder::enable_metrics_schedule_latency_histogram method and the
metrics_schedule_latency_histogram_enable field in the comment so callers on
32-bit platforms aren't confused.
In `@tokio/src/runtime/scheduler/multi_thread/worker.rs`:
- Around line 1289-1304: The call to NonZeroU64::new_unchecked in schedule_task
relies on .max(1) to avoid zero, but the u128->u64 cast can truncate after
extremely long uptime; update the schedule_task block (around the scheduled_at
calculation and task.set_scheduled_at call) to preserve the existing logic and
add a concise comment documenting that the as u64 cast may truncate the u128
nanoseconds after ~584 years and that this truncation is acceptable for metrics,
and keep the .max(1) safety guarantee in place so new_unchecked is never passed
zero.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: edccf019-e00a-46ac-aa78-20896fe7ce6c
📒 Files selected for processing (12)
CLAUDE.mdtokio/src/runtime/builder.rstokio/src/runtime/config.rstokio/src/runtime/metrics/batch.rstokio/src/runtime/metrics/runtime.rstokio/src/runtime/metrics/worker.rstokio/src/runtime/scheduler/current_thread/mod.rstokio/src/runtime/scheduler/multi_thread/stats.rstokio/src/runtime/scheduler/multi_thread/worker.rstokio/src/runtime/task/core.rstokio/src/runtime/task/mod.rstokio/tests/rt_unstable_metrics.rs
PR Review: Add Task Schedule Latency MetricThis PR adds a useful new metric for tracking task schedule latency (the time between when a task is scheduled and when it is polled). The overall design is sound and mirrors the existing poll count histogram implementation well. Below are my findings. Bugs / Correctness Issues1. Missing The docs state the method panics on an invalid bucket index, but unlike 2. Inconsistent silent-return when histogram is disabled
Safety / Soundness Concern3. The This is likely correct in practice because the queue operations include Additionally, the safety comment on Test Coverage4. Test only verifies total count, not latency distribution The
Minor / Style5. Long lines in The new 6. Doc comment line length Several added doc comment lines (e.g., the default bucket description in 7. Behavior on 32-bit targets
Summary
Overall the implementation is well-structured and correctly follows the existing histogram metric patterns. Addressing the |
value:useful; category:bug; feedback: The Claude AI reviewer is correct! In case of a panic the stack trace will show the file/line/position of the caller instead of the metrics collector function. This would be useful if unwinding is disabled, i.e. panic=abort. |
7986: To review by AI