7986: metrics: add task schedule latency metric#84

Open
martin-augment wants to merge 1 commit into master from pr-7986-2026-03-25-06-26-12

Conversation

@martin-augment
Owner

7986: To review by AI

This new metric tracks the amount of time between when a task is
scheduled and when it is polled, also known as queue delay. This
duration is recorded in a histogram, just like the poll time metric.

This metric is useful for implementing queue management algorithms in
systems using tokio. For example, it could be used to implement a
generic http load shedder using the CoDel algorithm.
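As a rough sketch of how a queue-delay signal could drive a CoDel-style shedder: the struct, names, and thresholds below are hypothetical illustration (not tokio API or PR code), assuming the caller periodically samples a queue delay such as the one this metric exposes.

```rust
use std::time::{Duration, Instant};

/// Minimal CoDel-style shedding decision. All names here are hypothetical;
/// this is only an illustration of consuming a queue-delay measurement.
struct CoDel {
    target: Duration,             // acceptable standing queue delay
    interval: Duration,           // how long delay must stay above target
    first_above: Option<Instant>, // when delay first exceeded target
}

impl CoDel {
    fn new(target: Duration, interval: Duration) -> Self {
        Self { target, interval, first_above: None }
    }

    /// Returns true when the latest observed queue delay has stayed above
    /// `target` for at least one full `interval`.
    fn should_shed(&mut self, now: Instant, queue_delay: Duration) -> bool {
        if queue_delay < self.target {
            // Delay recovered; reset the tracking window.
            self.first_above = None;
            return false;
        }
        match self.first_above {
            None => {
                // Start the window; do not shed on the first bad sample.
                self.first_above = Some(now);
                false
            }
            Some(t0) => now.duration_since(t0) >= self.interval,
        }
    }
}
```

Real CoDel also tightens its shedding rate while the delay stays high; this sketch keeps only the target/interval decision.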

coderabbitai bot commented Mar 25, 2026

Walkthrough

This pull request adds optional task schedule latency histogram instrumentation to the Tokio runtime. New builder methods enable and configure schedule latency tracking. The feature records the elapsed time between when a task is scheduled and when it begins polling. Per-task timestamps are captured in the scheduler when tasks are enqueued. The instrumentation propagates through runtime configuration, worker metrics, and batch processing. Both current-thread and multi-thread runtimes now support this feature behind tokio_unstable. New public accessor methods expose schedule latency histogram data, and a test validates the metrics collection.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances Tokio's runtime metrics by introducing the ability to measure and report task schedule latency. It provides a granular view into how long tasks wait in the scheduler queue before execution, offering valuable data for performance tuning and bottleneck identification. The changes include new configuration options for runtime builders and an expanded metrics API to access this new data, all while ensuring that the overhead is opt-in.

Highlights

  • New Metric: Task Schedule Latency: Introduced a new unstable metric to track the time elapsed between a task being scheduled for execution and when it is actually polled by a worker. This metric helps in understanding scheduling overheads.
  • Configurable Histograms: Users can now enable and configure histograms for task schedule latencies using tokio::runtime::Builder. This includes options for linear or logarithmic histograms with customizable bucket configurations.
  • Runtime Metrics API: Added new methods to RuntimeMetrics to query the status, number of buckets, bucket ranges, and counts for the task schedule latency histograms, providing detailed insights into task scheduling performance.
  • Performance Consideration: The task schedule latency metric is not enabled by default due to the overhead of calling Instant::now() twice per task poll, allowing users to opt-in when detailed performance analysis is required.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new unstable metrics feature to the Tokio runtime, enabling the tracking of task schedule latencies. This involves adding configuration options to the Builder, updating internal metric structures (Config, MetricsBatch, RuntimeMetrics, WorkerMetrics), and modifying the schedulers and task core to record and utilize task scheduling timestamps for latency calculation. A new test case has been added to verify this functionality. The review comments suggest improving the documentation for this new feature by clarifying the phrasing regarding when Instant::now() is called for latency measurement, specifically that it occurs once at scheduling and once at the start of polling, rather than "twice per task poll."

/// execution and when it is polled.
///
/// Task schedule latencies are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll, which


medium

The phrasing 'twice per task poll' is a bit misleading. This feature calls Instant::now() once when a task is scheduled, and once when it begins to be polled. For a task that is polled multiple times after being scheduled once, this is not 'twice per poll'. A more accurate description would be clearer for users trying to understand the performance overhead.

Suggested change
/// so requires calling [`Instant::now()`] twice per task poll, which
/// so requires calling [`Instant::now()`] when a task is scheduled and when it is polled, which

Owner Author

@martin-augment martin-augment Mar 25, 2026


value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.

/// schedule latencies.
///
/// Task schedule latencies times are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll. The feature


medium

The phrasing 'twice per task poll' is a bit misleading. This feature calls Instant::now() once when a task is scheduled, and once when it begins to be polled. For a task that is polled multiple times after being scheduled once, this is not 'twice per poll'. A more accurate description would be clearer for users trying to understand the performance overhead.

Suggested change
/// so requires calling [`Instant::now()`] twice per task poll. The feature
/// so requires calling [`Instant::now()`] when a task is scheduled and when it is polled. The feature

Owner Author

@martin-augment martin-augment Mar 25, 2026


value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.

@augmentcode

augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: Adds an (unstable) runtime metric for task schedule latency: the time between a task being scheduled and when it is next polled.

Changes:

  • Extends runtime::Builder with an opt-in enable_metrics_schedule_latency_histogram() flag and a histogram configuration API.
  • Plumbs an optional schedule-latency HistogramBuilder through runtime Config and into per-worker WorkerMetrics.
  • Enhances MetricsBatch to accumulate and submit schedule-latency histogram observations.
  • Instruments both current-thread and multi-thread schedulers to record a per-task “scheduled at” timestamp and pass it into poll-start accounting.
  • Adds new RuntimeMetrics accessors to query enablement, bucket configuration, and per-bucket counts.
  • Introduces an unstable-metrics test that spawns tasks on both runtime flavors and validates the histogram counts sum to the number of tasks.

Technical Notes: The default histogram uses a low-footprint linear configuration (10 buckets × 100µs), with optional log histograms for higher granularity.
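To make the default layout concrete — 10 linear buckets of 100µs each — the implied bucket boundaries can be sketched as below. The function is a hypothetical illustration of a linear histogram layout, not tokio's internal representation; in particular, the assumption that the last bucket is open-ended is common for such histograms but should be checked against the actual API.

```rust
use std::time::Duration;

// Sketch of the ranges implied by a linear histogram configuration:
// bucket i covers [i * width, (i + 1) * width), with the final bucket
// assumed open-ended to absorb all larger observations.
fn linear_bucket_range(
    bucket: usize,
    width: Duration,
    num_buckets: usize,
) -> (Duration, Option<Duration>) {
    let lo = width * bucket as u32;
    let hi = if bucket + 1 == num_buckets {
        None // last bucket: no upper bound
    } else {
        Some(width * (bucket as u32 + 1))
    };
    (lo, hi)
}
```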



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


// is only read when polling the task, which can only happen after it's scheduled.
#[cfg(tokio_unstable)]
unsafe {
task.set_scheduled_at(std::time::Instant::now());


task.set_scheduled_at(Instant::now()) runs for every schedule under cfg(tokio_unstable), even if the schedule-latency histogram feature wasn’t enabled; this seems to undermine the “not instrumented by default” contract (and can re-introduce time calls on targets where time metrics are otherwise gated). Consider guarding the timestamp write on whether a schedule-latency histogram is actually configured.

Severity: medium

Other Locations
  • tokio/src/runtime/scheduler/current_thread/mod.rs:683


Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.

poll_started_at: now,
})
});
let schedule_latencies = worker_metrics


schedule_latencies is initialized even when maybe_now is None, but start_poll later falls back to Instant::now() when poll_timer is None, which appears to bypass the existing wasm32-unknown-unknown time-metrics gating. Consider gating schedule-latency histogram collection similarly (e.g., only enabling it when maybe_now is Some).

Severity: medium

Other Locations
  • tokio/src/runtime/metrics/batch.rs:233


Owner Author


value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no point in initializing schedule_latencies if maybe_now is None. Prevents using memory for something that won't be used later.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


#[cfg(tokio_unstable)]
unsafe {
task.set_scheduled_at(std::time::Instant::now());
}


Schedule timestamping always incurs now() overhead

Medium Severity

set_scheduled_at(Instant::now()) runs on every schedule in tokio_unstable, even when schedule-latency metrics are disabled. This adds avoidable timing overhead globally instead of only when enable_metrics_schedule_latency_histogram() is enabled.


Owner Author


value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.

#[cfg(tokio_unstable)]
core.stats.start_poll(task.get_scheduled_at());
#[cfg(not(tokio_unstable))]
core.stats.start_poll(None);


LIFO-polled tasks skip latency histogram measurement

Medium Severity

Latency measurement is started only for the initial task entering run_task. Tasks later polled from the LIFO slot in the same loop do not call start_poll with their own scheduled_at, so their schedule latency is never recorded.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tokio/tests/rt_unstable_metrics.rs (1)

816-824: Pin the multi-thread case to 2 workers.

Leaving new_multi_thread() at the default can collapse to a single worker on constrained CI, which means this test stops exercising the threaded scheduling path. Setting .worker_threads(2) keeps the coverage deterministic.

🧪 Suggested tweak
         tokio::runtime::Builder::new_multi_thread()
+            .worker_threads(2)
             .enable_all()
             .enable_metrics_schedule_latency_histogram()
             .metrics_schedule_latency_histogram_configuration(HistogramConfiguration::linear(


📥 Commits

Reviewing files that changed from the base of the PR and between c65333e and 096b3a4.

📒 Files selected for processing (11)
  • tokio/src/runtime/builder.rs
  • tokio/src/runtime/config.rs
  • tokio/src/runtime/metrics/batch.rs
  • tokio/src/runtime/metrics/runtime.rs
  • tokio/src/runtime/metrics/worker.rs
  • tokio/src/runtime/scheduler/current_thread/mod.rs
  • tokio/src/runtime/scheduler/multi_thread/stats.rs
  • tokio/src/runtime/scheduler/multi_thread/worker.rs
  • tokio/src/runtime/task/core.rs
  • tokio/src/runtime/task/mod.rs
  • tokio/tests/rt_unstable_metrics.rs

Comment on lines +231 to +235
if let Some(task_scheduled_at) = task_scheduled_at {
if let Some(schedule_latencies) = &mut self.schedule_latencies {
let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now);
let elapsed = duration_as_u64(now - task_scheduled_at);
schedule_latencies.measure(elapsed, 1);


⚠️ Potential issue | 🟠 Major


Use the wasm-safe time gate in schedule-latency measurement.

Line 233 falls back to Instant::now() directly, which bypasses this module's now() guard for unsupported wasm32-unknown-unknown time metrics. This can break the target-specific safety contract and should be gated the same way as other timing paths.

🔧 Proposed fix
-                if let Some(task_scheduled_at) = task_scheduled_at {
-                    if let Some(schedule_latencies) = &mut self.schedule_latencies {
-                        let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now);
-                        let elapsed = duration_as_u64(now - task_scheduled_at);
-                        schedule_latencies.measure(elapsed, 1);
-                    }
-                }
+                if let Some(task_scheduled_at) = task_scheduled_at {
+                    if let Some(schedule_latencies) = &mut self.schedule_latencies {
+                        if let Some(now) = self
+                            .poll_timer
+                            .as_ref()
+                            .map(|p| p.poll_started_at)
+                            .or_else(now)
+                        {
+                            let elapsed = duration_as_u64(now.saturating_duration_since(task_scheduled_at));
+                            schedule_latencies.measure(elapsed, 1);
+                        }
+                    }
+                }

Comment on lines +578 to +579
/// Task schedule latencies times are not instrumented by default as doing
/// so requires calling [`Instant::now()`] twice per task poll. The feature


⚠️ Potential issue | 🟡 Minor

Fix small documentation typo.

Line 578 says “Task schedule latencies times”; remove “times” for clarity.

✏️ Suggested doc fix
-        /// Task schedule latencies times are not instrumented by default as doing
+        /// Task schedule latencies are not instrumented by default as doing

Owner Author

value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! "Latencies" already implies "times", so there is no need to use both.

Comment on lines +632 to +635
#[cfg(tokio_unstable)]
core.stats.start_poll(task.get_scheduled_at());
#[cfg(not(tokio_unstable))]
core.stats.start_poll(None);


⚠️ Potential issue | 🟠 Major

LIFO-slot polls are missing schedule-latency samples.

Only the first task in run_task goes through core.stats.start_poll(...). Tasks drained from core.lifo_slot at Lines 672-731 are polled without another schedule-latency record, so the new histogram undercounts exactly the fast-path tasks this metric is supposed to observe. A separate record_schedule_latency(task.get_scheduled_at()) path would preserve the existing poll-count/EWMA behavior without dropping those samples.


Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The fast-path tasks (tasks scheduled by a currently executing task that should run on the same CPU core) are not measured when polled. Prevents missing the measurement of those tasks.

Comment on lines +1277 to +1283
// SAFETY: There are no concurrent writes because tasks cannot be scheduled in
// multiple places concurrently. There are no concurrent reads because this field
// is only read when polling the task, which can only happen after it's scheduled.
#[cfg(tokio_unstable)]
unsafe {
task.set_scheduled_at(std::time::Instant::now());
}


⚠️ Potential issue | 🟠 Major

Gate timestamp capture on the histogram actually being enabled.

Instant::now() now runs on every multi-thread schedule in tokio_unstable, even when enable_metrics_schedule_latency_histogram() was never configured. That makes the feature non-optional in practice and adds avoidable work to a hot path. Please guard this write on self.shared.config.metrics_schedule_latency_histogram.is_some() and apply the same guard at Lines 678-684 in tokio/src/runtime/scheduler/current_thread/mod.rs.

⚙️ Suggested change
-        #[cfg(tokio_unstable)]
-        unsafe {
-            task.set_scheduled_at(std::time::Instant::now());
-        }
+        #[cfg(tokio_unstable)]
+        if self.shared.config.metrics_schedule_latency_histogram.is_some() {
+            unsafe {
+                task.set_scheduled_at(std::time::Instant::now());
+            }
+        }

Owner Author


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The scheduled time is set for all unstable builds, no matter whether the metric is enabled or not. Prevents extra CPU work and memory usage when the metric is not enabled.


claude bot commented Mar 25, 2026

PR Review for metrics: add task schedule latency metric

This PR adds a schedule latency histogram metric, closely mirroring the existing poll_count_histogram pattern. Overall the implementation is solid and well-documented. Here are my findings:


Potential Bugs

  1. Panic on Instant subtraction (batch.rs:234)

The line: let elapsed = duration_as_u64(now - task_scheduled_at);

Instant subtraction panics in debug builds if now < task_scheduled_at. While Instant is monotonic on most platforms, certain platform quirks can cause this. The existing busy_duration calculation uses .elapsed() which is safe. Consider: let elapsed = duration_as_u64(now.checked_duration_since(task_scheduled_at).unwrap_or_default());

This matches the defensive style expected in a foundational runtime library.
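The defensive pattern suggested above can be seen in isolation with plain std types. One caveat on the reviewer's claim: on recent Rust versions, `Instant` subtraction saturates to zero rather than panicking, but `checked_duration_since` makes the zero fallback explicit and portable across toolchains.

```rust
use std::time::{Duration, Instant};

// Unlike `now - scheduled_at` (which panicked on out-of-order instants in
// older Rust versions), `checked_duration_since` returns None when
// `scheduled_at` is later than `now`; we map that to a zero duration.
fn elapsed_or_zero(now: Instant, scheduled_at: Instant) -> Duration {
    now.checked_duration_since(scheduled_at).unwrap_or_default()
}
```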


Design / Correctness Observations

  1. Yielded tasks included in the metric - should be documented

set_scheduled_at is called unconditionally in schedule_task() regardless of is_yield. This means for tasks that call yield_now(), the metric measures time between the yield and the next poll, not the initial schedule-to-first-poll latency. The documentation only says time between when a task is scheduled for execution and when it is polled - it should clarify that this includes re-scheduling after yield_now().

  2. schedule_latency_histogram_enabled() implementation detail

Checking worker_metrics(0) is an indirect proxy for is the histogram configured. The existing poll_count_histogram_enabled has the same pattern so this is consistent, and it is safe because both runtime flavors always have at least one worker.

  3. Silent no-op when configuration is set without enabling

Calling metrics_schedule_latency_histogram_configuration() without first calling enable_metrics_schedule_latency_histogram() silently discards the configuration. The docs do not mention this requirement. Consider adding a note: This method has no effect unless enable_metrics_schedule_latency_histogram() is also called.

  4. Misleading overhead description in builder.rs docs

The docs say the feature requires calling Instant::now() twice per task poll. In practice, when both poll-time histogram and schedule-latency histogram are enabled, the existing poll_started_at timestamp is reused. The twice wording should be corrected.


Code Quality

  1. Missing doc comment on WorkerMetrics::schedule_latency_histogram

The existing poll_count_histogram field has a doc comment. The new schedule_latency_histogram field is missing one.

  2. Long line in batch.rs:233

let now = self.poll_timer.as_ref().map(|p| p.poll_started_at).unwrap_or_else(Instant::now); should be split across lines for readability.


Test Coverage

The test covers both current_thread and multi_thread and validates total counts equal N - good baseline. Suggestions:

  • Add a test exercising latency bucketing (sleep > 50ms, assert counts in bucket >= 1).
  • Add a test for the disabled case: verify schedule_latency_histogram_enabled() returns false without enabling.
  • Add a test that configuration without enabling does not panic.

Summary

Category | Severity | Item
Bug | Medium | Instant subtraction can panic - use checked_duration_since
Docs | Minor | is_yield re-scheduling not mentioned in metric description
Docs | Minor | metrics_schedule_latency_histogram_configuration should note dependency on enable_
Docs | Minor | Overhead description says twice but is actually once in most configurations
Code style | Minor | Missing doc comment on schedule_latency_histogram in WorkerMetrics
Code style | Minor | Long line in batch.rs:233
Tests | Suggestion | Add bucket precision test and disabled-histogram test

The implementation is well-structured, follows existing patterns well, has comprehensive API docs, and the unsafe usage is sound. The main actionable item is the checked_duration_since fix to avoid a potential panic.

@martin-augment
Owner Author

martin-augment commented Mar 25, 2026

9. Misleading overhead description in builder.rs docs

The docs say the feature requires calling Instant::now() twice per task poll. In practice, when both poll-time histogram and schedule-latency histogram are enabled, the existing poll_started_at timestamp is reused. The twice wording should be corrected.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The proposed wording of the docstring suggests that Instant::now() is called twice on each poll. This is not correct and it should be fixed to explain that the first call to Instant::now() is when the task is scheduled.

@martin-augment
Owner Author

816-824: Pin the multi-thread case to 2 workers.

Leaving new_multi_thread() at the default can collapse to a single worker on constrained CI, which means this test stops exercising the threaded scheduling path. Setting .worker_threads(2) keeps the coverage deterministic.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The new test won't make much sense if the machine where it is executed has only one CPU core. By using two working threads it will always serve its purpose, no matter the hardware it is running on.
