Skip to content

7986: metrics: add task schedule latency metric#88

Open
martin-augment wants to merge 8 commits intomasterfrom
pr-7986-2026-03-31-09-02-46
Open

7986: metrics: add task schedule latency metric#88
martin-augment wants to merge 8 commits intomasterfrom
pr-7986-2026-03-31-09-02-46

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

7986: To review by AI

robholt and others added 8 commits March 23, 2026 14:32
This new metric tracks the amount of time between when a task is
scheduled and when it is polled, also known as queue delay. This
duration is recorded in a histogram, just like the poll time metric.

This metric is useful for implementing queue management algorithms in
systems using tokio. For example, it could be used to implement a
generic http load shedder using the CoDel algorithm.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 31, 2026

Walkthrough

Schedule-latency histogram metrics support is added to Tokio's runtime. The implementation introduces configuration options in the runtime builder, tracking infrastructure for per-task scheduling timestamps, histogram batch accumulation in metrics collection, and new query APIs in RuntimeMetrics. Changes span the builder, task core structures, scheduler implementations (current-thread and multi-thread), metrics collection paths, and includes a validation test. The feature is gated behind unstable and 64-bit target width conditions.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-7986-2026-03-31-09-02-46

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces task schedule latency tracking as an unstable metric for both current-thread and multi-thread runtimes. It adds configuration methods to the runtime builder, updates task headers to store scheduling timestamps on 64-bit targets, and provides new public APIs in RuntimeMetrics to access histogram data. A review comment correctly identifies that the Notified struct is missing a get_scheduled_at method, which is required for the multi-threaded scheduler to record these metrics.

Comment on lines 249 to 262
pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> {
self.0.task_meta()
}

#[cfg(all(tokio_unstable, target_pointer_width = "64"))]
pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) {
// SAFETY: There are no concurrent writes because there is only ever one `Notified`
// reference per task. There are no concurrent reads because this field is only read
// when polling the task, which can only happen after it's scheduled.
unsafe {
self.0.header().set_scheduled_at(nanos);
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Notified struct is missing the get_scheduled_at method, which is required by the multi-threaded scheduler to record schedule latency metrics. While LocalNotified has this method added, Notified (used in the multi-threaded runtime) also needs it to avoid compilation errors when latency tracking is enabled.

Suggested change
pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> {
self.0.task_meta()
}
#[cfg(all(tokio_unstable, target_pointer_width = "64"))]
pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) {
// SAFETY: There are no concurrent writes because there is only ever one `Notified`
// reference per task. There are no concurrent reads because this field is only read
// when polling the task, which can only happen after it's scheduled.
unsafe {
self.0.header().set_scheduled_at(nanos);
}
}
}
pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> {
self.0.task_meta()
}
#[cfg(all(tokio_unstable, target_pointer_width = "64"))]
pub(crate) fn set_scheduled_at(&self, nanos: NonZeroU64) {
// SAFETY: There are no concurrent writes because there is only ever one `Notified`
// reference per task. There are no concurrent reads because this field is only read
// when polling the task, which can only happen after it's scheduled.
unsafe {
self.0.header().set_scheduled_at(nanos);
}
}
#[cfg(all(tokio_unstable, target_pointer_width = "64"))]
pub(crate) fn get_scheduled_at(&self) -> Option<NonZeroU64> {
self.0.header().get_scheduled_at()
}
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Notifier has a setter and LocalNotifier has a getter. They both should have both setters and getters.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Mar 31, 2026

🤖 Augment PR Summary

Summary: Adds an (unstable) runtime metric to track task schedule latency as a histogram.

Changes:

  • Extends runtime::Builder with APIs to enable and configure a schedule-latency histogram.
  • Plumbs a new optional metrics_schedule_latency_histogram through runtime Config into per-worker metrics.
  • Records a per-task scheduled_at timestamp (nanoseconds since scheduler startup) when a task is scheduled.
  • Measures schedule-to-poll latency at poll start and aggregates it into a per-worker histogram batch.
  • Exposes RuntimeMetrics accessors for schedule-latency histogram enablement, bucket ranges, and per-bucket counts (when supported).
  • Adds an unstable-metrics regression test asserting total bucket counts match the number of spawned tasks.

Technical Notes: Instrumentation uses Instant::now() on schedule and poll paths and is intended for 64-bit targets (to avoid task header size increases on 32-bit).

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 3 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

}

fn metrics_schedule_latency_histogram_builder(&self) -> Option<HistogramBuilder> {
if self.metrics_schedule_latency_histogram_enable {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio/src/runtime/builder.rs:1789: metrics_schedule_latency_histogram_builder() returns Some whenever enabled, even on targets where schedule-latency instrumentation is cfg’d out (e.g., 32-bit), which can lead to the metric being reported as enabled but never producing counts. Consider aligning the cfg/behavior with the docs stating 64-bit-only support.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! Half of the related implementation is gated behind a #[cfg(target_pointer_width = "64")] but the rest does not use this gate. This makes the second half visible for usage in 32bit systems but it won'd do anything. Prevents leaking implementation details in 32 bits builds.

/// [`enable_metrics_schedule_latency_histogram()`]: crate::runtime::Builder::enable_metrics_schedule_latency_histogram
/// [`Instant::now()`]: std::time::Instant::now
pub fn schedule_latency_histogram_enabled(&self) -> bool {
self.handle.inner.worker_metrics(0).schedule_latency_histogram.is_some()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio/src/runtime/metrics/runtime.rs:605: schedule_latency_histogram_enabled() only checks whether the histogram is configured, but on 32-bit targets the per-task scheduled_at plumbing and schedule_latency_histogram_bucket_count() API are cfg’d out, so this may return true while the metric can’t be observed. Consider gating these accessors consistently (or clarifying the documented support matrix).

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! Half of the related implementation is gated behind a #[cfg(target_pointer_width = "64")] but the rest does not use this gate. This makes the second half visible for usage in 32bit systems but it won'd do anything. Prevents leaking implementation details in 32 bits builds.

pub(super) poll_count_histogram: Option<Histogram>,

#[cfg(tokio_unstable)]
/// If `Some`, tracks the number of times tasks were scheduled by duration range.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio/src/runtime/metrics/worker.rs:70: The doc comment on schedule_latency_histogram says it tracks “times tasks were scheduled by duration range”, but the implementation increments buckets based on schedule-to-poll latency observed at poll time. This mismatch could mislead users interpreting what the histogram counts represent.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The documentation is a bit unclear what is meant by "duration range". It should be improved by explaining that these are the poll latency durations.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
CLAUDE.md (1)

1-1: Optional wording polish for clarity and branding consistency.

Consider changing “Github” → “GitHub” and tightening phrasing for readability (e.g., “Never link to issues or pull requests in any GitHub repository.”).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` at line 1, Update the sentence "Never link to any issue or a pull
request in any Github repository" in CLAUDE.md to improve branding and clarity:
change "Github" to "GitHub" and tighten phrasing to something like "Never link
to issues or pull requests in any GitHub repository." Replace the original
string exactly and ensure punctuation/capitalization matches the rest of the
document.
tokio/src/runtime/scheduler/multi_thread/worker.rs (1)

1289-1304: Verify safety of NonZeroU64::new_unchecked usage.

The safety invariant relies on .max(1) ensuring the value is never zero. While the as u64 cast from u128 can theoretically truncate, this only occurs after ~584 years of uptime, which is acceptable.

Consider adding a brief comment noting the truncation behavior for future readers:

📝 Optional: Add truncation note
             // SAFETY: `.max(1)` ensures the value can never be 0.
+            // Note: truncation from u128 to u64 only occurs after ~584 years of uptime.
             let scheduled_at = unsafe {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tokio/src/runtime/scheduler/multi_thread/worker.rs` around lines 1289 - 1304,
The call to NonZeroU64::new_unchecked in schedule_task relies on .max(1) to
avoid zero, but the u128->u64 cast can truncate after extremely long uptime;
update the schedule_task block (around the scheduled_at calculation and
task.set_scheduled_at call) to preserve the existing logic and add a concise
comment documenting that the as u64 cast may truncate the u128 nanoseconds after
~584 years and that this truncation is acceptable for metrics, and keep the
.max(1) safety guarantee in place so new_unchecked is never passed zero.
tokio/src/runtime/builder.rs (1)

1567-1611: Consider clarifying 32-bit platform behavior.

The documentation correctly states "This feature is only supported on 64-bit targets" (line 1572), but the method itself has no compile-time guard. On 32-bit targets, calling this method will allocate the histogram but no samples will ever be recorded since measurement is gated behind target_pointer_width = "64".

This is consistent with similar patterns in Tokio where configuration is portable but behavior is platform-dependent. However, users on 32-bit platforms may be confused when the histogram remains empty.

Consider adding a note like: "On 32-bit targets, the histogram will be created but will remain empty as measurement is not supported."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tokio/src/runtime/builder.rs` around lines 1567 - 1611, The docs say
schedule-latency histograms are 64-bit only but the builder method
enable_metrics_schedule_latency_histogram() currently has no note about 32-bit
behavior; add a brief sentence to the method's doc comment stating that on
32-bit targets the histogram will still be created (since
metrics_schedule_latency_histogram_enable is set) but will remain empty because
measurements are gated on target_pointer_width = "64". Reference the
Builder::enable_metrics_schedule_latency_histogram method and the
metrics_schedule_latency_histogram_enable field in the comment so callers on
32-bit platforms aren't confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@CLAUDE.md`:
- Line 1: Update the sentence "Never link to any issue or a pull request in any
Github repository" in CLAUDE.md to improve branding and clarity: change "Github"
to "GitHub" and tighten phrasing to something like "Never link to issues or pull
requests in any GitHub repository." Replace the original string exactly and
ensure punctuation/capitalization matches the rest of the document.

In `@tokio/src/runtime/builder.rs`:
- Around line 1567-1611: The docs say schedule-latency histograms are 64-bit
only but the builder method enable_metrics_schedule_latency_histogram()
currently has no note about 32-bit behavior; add a brief sentence to the
method's doc comment stating that on 32-bit targets the histogram will still be
created (since metrics_schedule_latency_histogram_enable is set) but will remain
empty because measurements are gated on target_pointer_width = "64". Reference
the Builder::enable_metrics_schedule_latency_histogram method and the
metrics_schedule_latency_histogram_enable field in the comment so callers on
32-bit platforms aren't confused.

In `@tokio/src/runtime/scheduler/multi_thread/worker.rs`:
- Around line 1289-1304: The call to NonZeroU64::new_unchecked in schedule_task
relies on .max(1) to avoid zero, but the u128->u64 cast can truncate after
extremely long uptime; update the schedule_task block (around the scheduled_at
calculation and task.set_scheduled_at call) to preserve the existing logic and
add a concise comment documenting that the as u64 cast may truncate the u128
nanoseconds after ~584 years and that this truncation is acceptable for metrics,
and keep the .max(1) safety guarantee in place so new_unchecked is never passed
zero.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: edccf019-e00a-46ac-aa78-20896fe7ce6c

📥 Commits

Reviewing files that changed from the base of the PR and between c49598d and 5b52116.

📒 Files selected for processing (12)
  • CLAUDE.md
  • tokio/src/runtime/builder.rs
  • tokio/src/runtime/config.rs
  • tokio/src/runtime/metrics/batch.rs
  • tokio/src/runtime/metrics/runtime.rs
  • tokio/src/runtime/metrics/worker.rs
  • tokio/src/runtime/scheduler/current_thread/mod.rs
  • tokio/src/runtime/scheduler/multi_thread/stats.rs
  • tokio/src/runtime/scheduler/multi_thread/worker.rs
  • tokio/src/runtime/task/core.rs
  • tokio/src/runtime/task/mod.rs
  • tokio/tests/rt_unstable_metrics.rs

@claude
Copy link
Copy Markdown

claude bot commented Mar 31, 2026

PR Review: Add Task Schedule Latency Metric

This PR adds a useful new metric for tracking task schedule latency (the time between when a task is scheduled and when it is polled). The overall design is sound and mirrors the existing poll count histogram implementation well. Below are my findings.


Bugs / Correctness Issues

1. Missing #[track_caller] on schedule_latency_histogram_bucket_range

The docs state the method panics on an invalid bucket index, but unlike schedule_latency_histogram_bucket_count (which has #[track_caller]), this method is missing the attribute. The panic inside histogram.bucket_range(bucket) will be attributed to the method body rather than the caller's call site, making debugging harder.

2. Inconsistent silent-return when histogram is disabled

schedule_latency_histogram_bucket_range returns Range::default() (i.e., 0..0) silently when the histogram is not enabled. This is fine, but the # Panics section only mentions the invalid-bucket case and does not clarify behavior when the histogram is disabled. Adding a note would prevent user confusion.


Safety / Soundness Concern

3. UnsafeCell for cross-thread field without explicit atomic operations

The scheduled_at field in Header is UnsafeCell<Option<NonZeroU64>>. The safety argument in set_scheduled_at / get_scheduled_at relies on the task queue's push/pop operations providing the necessary happens-before relationship between the writer (scheduling thread) and the reader (polling thread).

This is likely correct in practice because the queue operations include AcqRel/SeqCst fences, but the guarantee is implicit. The rest of the metrics code (e.g., Histogram) uses AtomicU64 for cross-thread access. An AtomicU64 with 0 meaning "not set" and nonzero storing the nanosecond value would make the ordering explicit and match established patterns in this codebase.

Additionally, the safety comment on Notified::set_scheduled_at should note that in the multi-thread scheduler, set_scheduled_at can be called during a task's poll (when a waker fires), but this is safe because get_scheduled_at is only read in start_poll before polling begins — this sequential relationship should be explicitly documented.


Test Coverage

4. Test only verifies total count, not latency distribution

The schedule_latency_counts test confirms that the total count across all workers and buckets equals N spawned tasks. This is a good smoke test, but it does not verify:

  • That measured latencies fall within plausible ranges (e.g., most tasks completing quickly and landing in bucket 0).
  • That requesting an out-of-range bucket panics as documented.
  • That using enable_metrics_schedule_latency_histogram() without explicit configuration produces the documented default (10 buckets, 100us each).

Minor / Style

5. Long lines in builder.rs

The new metrics_schedule_latency_histogram_configuration method signature is very long. The rest of the file wraps at ~100 chars; consider splitting the parameter onto a new line to match the style of nearby methods.

6. Doc comment line length

Several added doc comment lines (e.g., the default bucket description in enable_metrics_schedule_latency_histogram) exceed 100 characters, which is inconsistent with the surrounding code.

7. Behavior on 32-bit targets

schedule_latency_histogram_enabled() is callable on 32-bit targets but will always return false there since enable_metrics_schedule_latency_histogram() is only available on 64-bit. A note in the method docs explaining this would prevent user confusion.


Summary

Severity Issue
Bug Missing #[track_caller] on schedule_latency_histogram_bucket_range
Soundness UnsafeCell cross-thread access relies on implicit queue memory ordering; consider AtomicU64
Test Smoke-tests count totals only; missing panic and default-config tests
Minor Long lines, 32-bit behavior not documented

Overall the implementation is well-structured and correctly follows the existing histogram metric patterns. Addressing the #[track_caller] gap, adding clearer safety documentation, and expanding the test suite would bring this to production quality.

@martin-augment
Copy link
Copy Markdown
Owner Author

1. Missing #[track_caller] on schedule_latency_histogram_bucket_range

The docs state the method panics on an invalid bucket index, but unlike schedule_latency_histogram_bucket_count (which has #[track_caller]), this method is missing the attribute. The panic inside histogram.bucket_range(bucket) will be attributed to the method body rather than the caller's call site, making debugging harder.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! In case of a panic the stack trace will show the file/line/position of the caller instead of the metrics collector function. This would be useful if unwinding is disabled, i.e. panic=abort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants