7986: metrics: add task schedule latency metric #84
martin-augment wants to merge 1 commit into master from
Conversation
This new metric tracks the amount of time between when a task is scheduled and when it is polled, also known as queue delay. This duration is recorded in a histogram, just like the poll time metric. This metric is useful for implementing queue management algorithms in systems using tokio. For example, it could be used to implement a generic http load shedder using the CoDel algorithm.
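As an illustration of the CoDel use case mentioned above, here is a minimal std-only sketch of a CoDel-style shedding decision driven by observed queue delay. This is not tokio API: `CoDelShedder`, `target`, and `interval` are hypothetical names, and a real load shedder would feed in delays observed via the histogram this PR adds.

```rust
use std::time::Duration;

// CoDel-style decision: shed work only when the queue delay has stayed
// above `target` for at least one full `interval`.
struct CoDelShedder {
    target: Duration,              // acceptable standing queue delay
    interval: Duration,            // how long delay must stay above target
    above_since: Option<Duration>, // logical time when delay first exceeded target
}

impl CoDelShedder {
    fn new(target: Duration, interval: Duration) -> Self {
        Self { target, interval, above_since: None }
    }

    // `now` is elapsed time since an arbitrary fixed origin; returns true
    // when the incoming request should be shed.
    fn should_shed(&mut self, now: Duration, queue_delay: Duration) -> bool {
        if queue_delay < self.target {
            // Delay recovered: reset the tracking window.
            self.above_since = None;
            return false;
        }
        match self.above_since {
            None => {
                self.above_since = Some(now);
                false
            }
            // Shed once delay has stayed above target for a full interval.
            Some(since) => now - since >= self.interval,
        }
    }
}
```

A real implementation would also back off its shedding rate after each drop, but the sketch shows how the schedule-latency signal plugs into the decision.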
Walkthrough

This pull request adds optional task schedule latency histogram instrumentation to the Tokio runtime. New builder methods enable and configure schedule latency tracking. The feature records the elapsed time between when a task is scheduled and when it begins polling. Per-task timestamps are captured in the scheduler when tasks are enqueued. The instrumentation propagates through runtime configuration, worker metrics, and batch processing. Both current-thread and multi-thread runtimes now support this feature behind `tokio_unstable`.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances Tokio's runtime metrics by introducing the ability to measure and report task schedule latency. It provides a granular view into how long tasks wait in the scheduler queue before execution, offering valuable data for performance tuning and bottleneck identification. The changes include new configuration options for runtime builders and an expanded metrics API to access this new data, all while ensuring that the overhead is opt-in.
Code Review
This pull request introduces a new unstable metrics feature to the Tokio runtime, enabling the tracking of task schedule latencies. This involves adding configuration options to the Builder, updating internal metric structures (Config, MetricsBatch, RuntimeMetrics, WorkerMetrics), and modifying the schedulers and task core to record and utilize task scheduling timestamps for latency calculation. A new test case has been added to verify this functionality. The review comments suggest improving the documentation for this new feature by clarifying the phrasing regarding when Instant::now() is called for latency measurement, specifically that it occurs once at scheduling and once at the start of polling, rather than "twice per task poll."
```rust
/// execution and when it is polled.
///
/// Task schedule latencies are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll, which
```
The phrasing 'twice per task poll' is a bit misleading. This feature calls Instant::now() once when a task is scheduled, and once when it begins to be polled. For a task that is polled multiple times after being scheduled once, this is not 'twice per poll'. A more accurate description would be clearer for users trying to understand the performance overhead.
```diff
-/// so requires calling [`Instant::now()`] twice per task poll, which
+/// so requires calling [`Instant::now()`] when a task is scheduled and when it is polled, which
```
value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.
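To make the overhead being debated concrete, here is a minimal std-only sketch of the two clock reads in question: one when the task is scheduled, one when polling begins. The names are illustrative, not tokio internals.

```rust
use std::time::{Duration, Instant};

// Schedule latency = time between enqueue and the start of the next poll.
// Only two Instant::now() calls are involved per schedule/poll pair;
// repeated polls of an already-running task do not re-read the clock.
fn measure_schedule_latency() -> Duration {
    let scheduled_at = Instant::now();    // read #1: task is scheduled
    // ... the task would sit in the run queue here ...
    let poll_started_at = Instant::now(); // read #2: poll begins
    poll_started_at - scheduled_at
}
```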
```rust
/// schedule latencies.
///
/// Task schedule latencies times are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll. The feature
```
The phrasing 'twice per task poll' is a bit misleading. This feature calls Instant::now() once when a task is scheduled, and once when it begins to be polled. For a task that is polled multiple times after being scheduled once, this is not 'twice per poll'. A more accurate description would be clearer for users trying to understand the performance overhead.
```diff
-/// so requires calling [`Instant::now()`] twice per task poll. The feature
+/// so requires calling [`Instant::now()`] when a task is scheduled and when it is polled. The feature
```
value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.
🤖 Augment PR Summary

Summary: Adds an (unstable) runtime metric for task schedule latency: the time between a task being scheduled and when it is next polled.

Changes:
Technical Notes: The default histogram uses a low-footprint linear configuration (10 buckets × 100µs), with optional log histograms for higher granularity.
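The default linear configuration described above can be illustrated with a tiny bucketing sketch. The function below is illustrative only, not tokio's internal implementation of `HistogramConfiguration::linear`.

```rust
use std::time::Duration;

// Map a latency to a bucket index in a linear histogram: buckets of equal
// width, with the last bucket catching everything beyond the range.
// Assumes bucket_width > 0 and num_buckets > 0.
fn linear_bucket(latency: Duration, bucket_width: Duration, num_buckets: usize) -> usize {
    let idx = (latency.as_nanos() / bucket_width.as_nanos()) as usize;
    idx.min(num_buckets - 1)
}
```

With 10 buckets of 100µs, a 250µs latency lands in bucket 2 and anything above 1ms saturates into bucket 9, which keeps the memory footprint fixed.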
```rust
// is only read when polling the task, which can only happen after it's scheduled.
#[cfg(tokio_unstable)]
unsafe {
    task.set_scheduled_at(std::time::Instant::now());
```
task.set_scheduled_at(Instant::now()) runs for every schedule under cfg(tokio_unstable), even if the schedule-latency histogram feature wasn’t enabled; this seems to undermine the “not instrumented by default” contract (and can re-introduce time calls on targets where time metrics are otherwise gated). Consider guarding the timestamp write on whether a schedule-latency histogram is actually configured.
Severity: medium
Other Locations
tokio/src/runtime/scheduler/current_thread/mod.rs:683
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.
```rust
        poll_started_at: now,
    })
});
let schedule_latencies = worker_metrics
```
schedule_latencies is initialized even when maybe_now is None, but start_poll later falls back to Instant::now() when poll_timer is None, which appears to bypass the existing wasm32-unknown-unknown time-metrics gating. Consider gating schedule-latency histogram collection similarly (e.g., only enabling it when maybe_now is Some).
Severity: medium
Other Locations
tokio/src/runtime/metrics/batch.rs:233
value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no point in initializing schedule_latencies if maybe_now is None. Prevents using memory for something that won't be used later.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```rust
#[cfg(tokio_unstable)]
unsafe {
    task.set_scheduled_at(std::time::Instant::now());
}
```
Schedule timestamping always incurs now() overhead
Medium Severity
set_scheduled_at(Instant::now()) runs on every schedule in tokio_unstable, even when schedule-latency metrics are disabled. This adds avoidable timing overhead globally instead of only when enable_metrics_schedule_latency_histogram() is enabled.
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.
```rust
#[cfg(tokio_unstable)]
core.stats.start_poll(task.get_scheduled_at());
#[cfg(not(tokio_unstable))]
core.stats.start_poll(None);
```
Actionable comments posted: 4
🧹 Nitpick comments (1)
tokio/tests/rt_unstable_metrics.rs (1)
816-824: Pin the multi-thread case to 2 workers.

Leaving `new_multi_thread()` at the default can collapse to a single worker on constrained CI, which means this test stops exercising the threaded scheduling path. Setting `.worker_threads(2)` keeps the coverage deterministic.

🧪 Suggested tweak

```diff
 tokio::runtime::Builder::new_multi_thread()
+    .worker_threads(2)
     .enable_all()
     .enable_metrics_schedule_latency_histogram()
     .metrics_schedule_latency_histogram_configuration(HistogramConfiguration::linear(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tokio/tests/rt_unstable_metrics.rs` around lines 816 - 824, The multi-thread runtime in the test uses tokio::runtime::Builder::new_multi_thread() without fixing the worker count, which can collapse to one worker on constrained CI; update the Builder chain used in the test (tokio::runtime::Builder::new_multi_thread()) to call .worker_threads(2) so the multi-thread scheduling path is exercised deterministically alongside .enable_all() and the existing metrics configuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tokio/src/runtime/metrics/batch.rs`:
- Around line 231-235: The fallback uses Instant::now() directly which bypasses
this module's wasm-safe time gate; change the unwrap_or_else(Instant::now) call
so it uses the module's wasm-safe now() (the same helper used elsewhere in this
file) instead, i.e., replace Instant::now with the local/same-module now
function when computing let now = self.poll_timer.as_ref().map(|p|
p.poll_started_at).unwrap_or_else(...), so duration_as_u64(now -
task_scheduled_at) and schedule_latencies.measure remain wasm-safe.
In `@tokio/src/runtime/metrics/runtime.rs`:
- Around line 578-579: Fix the small doc typo in the runtime metrics comment:
change the phrase "Task schedule latencies times" to "Task schedule latencies"
in the comment that also references Instant::now(), ensuring the rest of the
sentence remains unchanged.
In `@tokio/src/runtime/scheduler/multi_thread/worker.rs`:
- Around line 632-635: The LIFO-drained tasks in run_task (the loop that drains
core.lifo_slot) are missing schedule-latency samples; add a call that records
schedule latency for each task popped from core.lifo_slot before it is
polled—mirror the behavior of core.stats.start_poll(task.get_scheduled_at())
used for the first task by invoking the same sampling path (or a dedicated
record_schedule_latency(task.get_scheduled_at()) helper) for every lifo_slot
task; ensure this is done under the same cfg branches (tokio_unstable vs not)
and reference core.stats.start_poll / record_schedule_latency and
task.get_scheduled_at so the histogram receives samples for the fast-path tasks.
- Around line 1277-1283: The code unconditionally calls Instant::now() in the
#[cfg(tokio_unstable)] unsafe block (task.set_scheduled_at(...)), which runs on
every multi-thread schedule; wrap that write in a check of the histogram being
enabled (use self.shared.config.metrics_schedule_latency_histogram.is_some()) so
the timestamp is only captured when the histogram is configured, and apply the
same guard around the analogous timestamp write in current_thread::mod.rs (lines
referenced in the review) to avoid the extra work on the hot path.
---
Nitpick comments:
In `@tokio/tests/rt_unstable_metrics.rs`:
- Around line 816-824: The multi-thread runtime in the test uses
tokio::runtime::Builder::new_multi_thread() without fixing the worker count,
which can collapse to one worker on constrained CI; update the Builder chain
used in the test (tokio::runtime::Builder::new_multi_thread()) to call
.worker_threads(2) so the multi-thread scheduling path is exercised
deterministically alongside .enable_all() and the existing metrics
configuration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d403ad45-2126-4834-8abc-7e6016c5d878
📒 Files selected for processing (11)
- tokio/src/runtime/builder.rs
- tokio/src/runtime/config.rs
- tokio/src/runtime/metrics/batch.rs
- tokio/src/runtime/metrics/runtime.rs
- tokio/src/runtime/metrics/worker.rs
- tokio/src/runtime/scheduler/current_thread/mod.rs
- tokio/src/runtime/scheduler/multi_thread/stats.rs
- tokio/src/runtime/scheduler/multi_thread/worker.rs
- tokio/src/runtime/task/core.rs
- tokio/src/runtime/task/mod.rs
- tokio/tests/rt_unstable_metrics.rs
```rust
if let Some(task_scheduled_at) = task_scheduled_at {
    if let Some(schedule_latencies) = &mut self.schedule_latencies {
        let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now);
        let elapsed = duration_as_u64(now - task_scheduled_at);
        schedule_latencies.measure(elapsed, 1);
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify that `start_poll` currently bypasses the wasm-safe `now()` helper.
rg -n -C3 'fn now\(|pub\(crate\) fn start_poll\(|schedule_latencies|Instant::now\(' tokio/src/runtime/metrics/batch.rs
```

Repository: martin-augment/tokio
Length of output: 2620

🏁 Script executed:

```shell
# Check all direct Instant::now() usages in the file to understand scope
rg -n 'Instant::now' tokio/src/runtime/metrics/batch.rs
```

Repository: martin-augment/tokio
Length of output: 285

🏁 Script executed:

```shell
# Verify the now() function signature and check how/where it's used correctly
rg -n 'self\.now\(\)|\.now\(\)|\bor_else\(now\)' tokio/src/runtime/metrics/batch.rs
```

Repository: martin-augment/tokio
Length of output: 46

🏁 Script executed:

```shell
# Verify poll_timer can be None at line 233 and understand initialization
rg -n -B10 -A2 'pub\(crate\) fn start_poll' tokio/src/runtime/metrics/batch.rs | head -40
```

Repository: martin-augment/tokio
Length of output: 817

🏁 Script executed:

```shell
# Check how poll_timer field is declared and when it can be Some/None
rg -n -B5 'poll_timer:' tokio/src/runtime/metrics/batch.rs | head -30
```

Repository: martin-augment/tokio
Length of output: 235
Use the wasm-safe time gate in schedule-latency measurement.
Line 233 falls back to Instant::now() directly, which bypasses this module's now() guard for unsupported wasm32-unknown-unknown time metrics. This can break the target-specific safety contract and should be gated the same way as other timing paths.
🔧 Proposed fix

```diff
-if let Some(task_scheduled_at) = task_scheduled_at {
-    if let Some(schedule_latencies) = &mut self.schedule_latencies {
-        let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now);
-        let elapsed = duration_as_u64(now - task_scheduled_at);
-        schedule_latencies.measure(elapsed, 1);
-    }
-}
+if let Some(task_scheduled_at) = task_scheduled_at {
+    if let Some(schedule_latencies) = &mut self.schedule_latencies {
+        if let Some(now) = self
+            .poll_timer
+            .as_ref()
+            .map(|p| p.poll_started_at)
+            .or_else(now)
+        {
+            let elapsed = duration_as_u64(now.saturating_duration_since(task_scheduled_at));
+            schedule_latencies.measure(elapsed, 1);
+        }
+    }
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tokio/src/runtime/metrics/batch.rs` around lines 231 - 235, The fallback uses
Instant::now() directly which bypasses this module's wasm-safe time gate; change
the unwrap_or_else(Instant::now) call so it uses the module's wasm-safe now()
(the same helper used elsewhere in this file) instead, i.e., replace
Instant::now with the local/same-module now function when computing let now =
self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(...), so
duration_as_u64(now - task_scheduled_at) and schedule_latencies.measure remain
wasm-safe.
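For illustration, a hypothetical version of such a wasm-safe time helper could look like the sketch below. The name `now` and the exact cfg condition are assumptions for the example, not tokio's actual code; the point is that callers get an `Option` and must skip the measurement instead of calling `Instant::now()` unconditionally.

```rust
use std::time::Instant;

// Returns None on targets where reading the monotonic clock is not
// supported for metrics (e.g. wasm32-unknown-unknown), forcing callers
// to gate their timing paths on the Option instead of panicking.
fn now() -> Option<Instant> {
    if cfg!(all(target_arch = "wasm32", target_os = "unknown")) {
        None
    } else {
        Some(Instant::now())
    }
}
```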
```rust
/// Task schedule latencies times are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll. The feature
```
Fix small documentation typo.
Line 578 says “Task schedule latencies times”; remove “times” for clarity.
✏️ Suggested doc fix

```diff
-/// Task schedule latencies times are not instrumented by default as doing
+/// Task schedule latencies are not instrumented by default as doing
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
/// Task schedule latencies are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll. The feature
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tokio/src/runtime/metrics/runtime.rs` around lines 578 - 579, Fix the small
doc typo in the runtime metrics comment: change the phrase "Task schedule
latencies times" to "Task schedule latencies" in the comment that also
references Instant::now(), ensuring the rest of the sentence remains unchanged.
value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! "Latencies" implies "times", so there is no need to use them both.
```rust
#[cfg(tokio_unstable)]
core.stats.start_poll(task.get_scheduled_at());
#[cfg(not(tokio_unstable))]
core.stats.start_poll(None);
```
LIFO-slot polls are missing schedule-latency samples.
Only the first task in run_task goes through core.stats.start_poll(...). Tasks drained from core.lifo_slot at Lines 672-731 are polled without another schedule-latency record, so the new histogram undercounts exactly the fast-path tasks this metric is supposed to observe. A separate record_schedule_latency(task.get_scheduled_at()) path would preserve the existing poll-count/EWMA behavior without dropping those samples.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tokio/src/runtime/scheduler/multi_thread/worker.rs` around lines 632 - 635,
The LIFO-drained tasks in run_task (the loop that drains core.lifo_slot) are
missing schedule-latency samples; add a call that records schedule latency for
each task popped from core.lifo_slot before it is polled—mirror the behavior of
core.stats.start_poll(task.get_scheduled_at()) used for the first task by
invoking the same sampling path (or a dedicated
record_schedule_latency(task.get_scheduled_at()) helper) for every lifo_slot
task; ensure this is done under the same cfg branches (tokio_unstable vs not)
and reference core.stats.start_poll / record_schedule_latency and
task.get_scheduled_at so the histogram receives samples for the fast-path tasks.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The fast-path tasks (tasks scheduled by a currently executing task that should run on the same CPU code) are not measured when polled. Prevents missing/skipping the measurement of tasks.
```rust
// SAFETY: There are no concurrent writes because tasks cannot be scheduled in
// multiple places concurrently. There are no concurrent reads because this field
// is only read when polling the task, which can only happen after it's scheduled.
#[cfg(tokio_unstable)]
unsafe {
    task.set_scheduled_at(std::time::Instant::now());
}
```
Gate timestamp capture on the histogram actually being enabled.
Instant::now() now runs on every multi-thread schedule in tokio_unstable, even when enable_metrics_schedule_latency_histogram() was never configured. That makes the feature non-optional in practice and adds avoidable work to a hot path. Please guard this write on self.shared.config.metrics_schedule_latency_histogram.is_some() and apply the same guard at Lines 678-684 in tokio/src/runtime/scheduler/current_thread/mod.rs.
⚙️ Suggested change

```diff
-#[cfg(tokio_unstable)]
-unsafe {
-    task.set_scheduled_at(std::time::Instant::now());
-}
+#[cfg(tokio_unstable)]
+if self.shared.config.metrics_schedule_latency_histogram.is_some() {
+    unsafe {
+        task.set_scheduled_at(std::time::Instant::now());
+    }
+}
```

🤖 Prompt for AI Agents
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tokio/src/runtime/scheduler/multi_thread/worker.rs` around lines 1277 - 1283,
The code unconditionally calls Instant::now() in the #[cfg(tokio_unstable)]
unsafe block (task.set_scheduled_at(...)), which runs on every multi-thread
schedule; wrap that write in a check of the histogram being enabled (use
self.shared.config.metrics_schedule_latency_histogram.is_some()) so the
timestamp is only captured when the histogram is configured, and apply the same
guard around the analogous timestamp write in current_thread::mod.rs (lines
referenced in the review) to avoid the extra work on the hot path.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.
PR Review for "metrics: add task schedule latency metric"

This PR adds a schedule latency histogram metric, closely mirroring the existing poll_count_histogram pattern. Overall the implementation is solid and well-documented. Here are my findings:

Potential Bugs
The line:

```rust
let elapsed = duration_as_u64(now - task_scheduled_at);
```

`Instant` subtraction panics in debug builds if `now < task_scheduled_at`. While `Instant` is monotonic on most platforms, certain platform quirks can cause this. The existing busy_duration calculation uses `.elapsed()`, which is safe. Consider:

```rust
let elapsed = duration_as_u64(now.checked_duration_since(task_scheduled_at).unwrap_or_default());
```

This matches the defensive style expected in a foundational runtime library.

Design / Correctness Observations
set_scheduled_at is called unconditionally in schedule_task() regardless of is_yield. This means that for tasks that call yield_now(), the metric measures the time between the yield and the next poll, not the initial schedule-to-first-poll latency. The documentation only says "time between when a task is scheduled for execution and when it is polled"; it should clarify that this includes re-scheduling after yield_now().
Checking worker_metrics(0) is an indirect proxy for "is the histogram configured". The existing poll_count_histogram_enabled has the same pattern so this is consistent, and it is safe because both runtime flavors always have at least one worker.
Calling metrics_schedule_latency_histogram_configuration() without first calling enable_metrics_schedule_latency_histogram() silently discards the configuration. The docs do not mention this requirement. Consider adding a note: This method has no effect unless enable_metrics_schedule_latency_histogram() is also called.
The docs say the feature requires calling Instant::now() "twice per task poll". In practice, when both the poll-time histogram and the schedule-latency histogram are enabled, the existing poll_started_at timestamp is reused. The "twice" wording should be corrected.

Code Quality
The existing poll_count_histogram field has a doc comment. The new schedule_latency_histogram field is missing one.
The line `let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now);` should be split across lines for readability.

Test Coverage

The test covers both current_thread and multi_thread and validates that total counts equal N, which is a good baseline. Suggestions:
Summary

The implementation is well-structured, follows existing patterns well, has comprehensive API docs, and the unsafe usage is sound. The main actionable item is the checked_duration_since fix to avoid a potential panic.
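The checked_duration_since fix recommended in the review can be sketched as a small std-only helper (the name `safe_elapsed` is hypothetical, introduced only for this example):

```rust
use std::time::{Duration, Instant};

// Returns the elapsed time since `scheduled_at`, clamped to zero instead of
// panicking when `now` is (apparently) earlier than `scheduled_at`.
fn safe_elapsed(now: Instant, scheduled_at: Instant) -> Duration {
    now.checked_duration_since(scheduled_at).unwrap_or_default()
}
```

Plain `now - scheduled_at` would panic in debug builds in the reversed case; the clamped version simply records a zero latency, which is the desired behavior for a metric.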
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The new test won't make much sense if the machine where it is executed has only one CPU core. By using two worker threads it will always serve its purpose, no matter the hardware it is running on.


7986: To review by AI