
8010: runtime: optional eager I/O driver/timer handoff when polling tasks#92

Open
martin-augment wants to merge 12 commits into master from pr-8010-2026-04-04-11-42-40

Conversation

@martin-augment
Owner

8010: To review by AI

@coderabbitai

coderabbitai bot commented Apr 4, 2026

Walkthrough

This change introduces eager driver handoff functionality to Tokio's multi-threaded runtime. A new enable_eager_driver_handoff configuration flag is added to the Builder and Config structures, exposed via an unstable builder method. The parking mechanism is refactored to track whether a worker parked on the I/O driver via a new HadDriver enum. Worker scheduling logic is updated to record driver participation state and, when the feature is enabled, attempt to notify parked local workers before polling tasks. A new test file demonstrates the feature's ability to prevent runtime deadlocks by changing worker notification behavior during driver-related transitions.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an experimental 'eager driver handoff' feature for multi-threaded runtimes, accessible via the tokio_unstable flag. The feature allows a worker thread that has been polling the I/O or time driver to notify another worker to take over driver duties before it begins polling a task, thereby reducing latency and mitigating potential deadlocks. The implementation involves tracking driver ownership during parking and updating the worker's execution loop to trigger notifications. Review feedback suggests a logic refinement in the worker loop to ensure the driver ownership state is reset even when a notification was already sent during a searching transition, preventing redundant wakeups for subsequent tasks.

Comment on lines +645 to +653
if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
    && !notified_parked_worker
// don't do it a second time
{
    core.had_driver = park::HadDriver::No;
    self.worker.handle.notify_parked_local();
}

medium

The had_driver state should be reset to No even if notified_parked_worker is true. If a worker transitions from searching and notifies another worker, it has already fulfilled the "handoff" requirement. If had_driver is not reset in this case, the worker might trigger a redundant notification when polling the next task in its queue, as notified_parked_worker will be false for subsequent tasks while had_driver remains Yes.

Suggested change

if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
    && !notified_parked_worker
// don't do it a second time
{
    core.had_driver = park::HadDriver::No;
    self.worker.handle.notify_parked_local();
}

if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
{
    core.had_driver = park::HadDriver::No;
    if !notified_parked_worker {
        self.worker.handle.notify_parked_local();
    }
}

@claude

claude bot commented Apr 4, 2026

PR Review: runtime: optional eager I/O driver/timer handoff when polling tasks

This is a well-motivated feature that addresses a real class of deadlocks (#4730). The overall design is sound — deferring the handoff until task-poll time (rather than on unpark) avoids spurious wakeups in the common case. Below are findings ranging from a correctness bug to minor nits.


Bug: had_driver not reset when notified_parked_worker == true

File: tokio/src/runtime/scheduler/multi_thread/worker.rs, run_task (~line 645)

The current logic:

let notified_parked_worker = core.transition_from_searching(&self.worker);

if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
    && !notified_parked_worker  // don't do it a second time
{
    core.had_driver = park::HadDriver::No;
    self.worker.handle.notify_parked_local();
}

core.had_driver is only reset inside the if block. When notified_parked_worker == true (i.e. transition_from_searching already woke a worker), the eager handoff is correctly skipped — but had_driver is left as Yes.

On the next call to run_task (e.g. the worker drains its local queue), is_searching is already false, so transition_from_searching returns false immediately, making notified_parked_worker = false. Now all conditions are true and the eager handoff fires spuriously, potentially waking an idle worker for no reason. This can cascade on queues with many tasks.

Suggested fix — always reset had_driver, and only conditionally notify:

let notified_parked_worker = core.transition_from_searching(&self.worker);

if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
{
    core.had_driver = park::HadDriver::No; // reset unconditionally
    if !notified_parked_worker {
        self.worker.handle.notify_parked_local();
    }
}

Inconsistency: enable_eager_driver_handoff not gated with #[cfg(tokio_unstable)]

Files: tokio/src/runtime/config.rs and worker.rs (Core struct)

Every other unstable field in Config uses #[cfg(tokio_unstable)]:

#[cfg(tokio_unstable)]
pub(crate) unhandled_panic: crate::runtime::UnhandledPanic,

But enable_eager_driver_handoff is unconditionally present in both Config and Core. The cfg!(tokio_unstable) macro check in run_task does compile away the logic, but the fields themselves remain in stable builds — wasting space and creating inconsistency.

The comment in worker.rs acknowledges the intent ("should result in this whole expression being eliminated at compile-time"), but the fields behind it are still allocated. Wrapping them in #[cfg(tokio_unstable)] would be cleaner and consistent with the rest of the codebase.
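To illustrate the gating pattern (a standalone sketch with simplified struct shapes, not the real tokio internals; field names follow this PR):

```rust
// Sketch: every unstable field is wrapped in `#[cfg(tokio_unstable)]`,
// so stable builds do not carry the field at all.
pub struct Config {
    pub worker_threads: usize,
    #[cfg(tokio_unstable)]
    pub enable_eager_driver_handoff: bool,
}

pub fn make_config() -> Config {
    Config {
        worker_threads: 4,
        // The initializer is gated the same way; without
        // `--cfg tokio_unstable` this line is compiled out.
        #[cfg(tokio_unstable)]
        enable_eager_driver_handoff: false,
    }
}

fn main() {
    // In a stable build the struct only carries the stable field.
    assert_eq!(make_config().worker_threads, 4);
}
```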


Test reliability: deadlocks_consistently may be flaky

File: tokio/tests/rt_unstable_eager_driver_handoff.rs

This test asserts that the deadlock scenario reliably reproduces within 10 seconds. The 100 ms sleep (tokio::time::sleep(Duration::from_millis(100)).await) is intended to ensure bad_task is scheduled on the driver-holding worker before pipe1 is written to, but this is a timing assumption.

On a heavily loaded CI machine or a fast bare-metal machine, the scheduling order may differ. If the test ever fails (i.e. the runtime does NOT deadlock without the fix), CI will incorrectly flag that as a failure of the guard test. Consider adding a comment acknowledging this inherent raciness and/or running the scenario multiple times to increase confidence.
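One hedged way to reduce that raciness (illustrative only; `scenario` below is a stand-in for the real pipe-based reproducer, which is not shown here) is to run the scenario several times and require consistent results across attempts:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Returns true if `scenario` fails to complete within `timeout`,
// i.e. it is presumed deadlocked.
fn deadlocks(scenario: fn(), timeout: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        scenario();
        let _ = done_tx.send(());
    });
    done_rx.recv_timeout(timeout).is_err()
}

fn main() {
    // Placeholder scenario that completes immediately; in the real
    // test this would be the driver/task deadlock reproducer.
    let attempts = 3;
    let hung = (0..attempts)
        .filter(|_| deadlocks(|| {}, Duration::from_millis(200)))
        .count();
    // Requiring all attempts to agree makes a single lucky
    // scheduling order much less likely to flip the verdict.
    assert_eq!(hung, 0);
}
```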


Typo in test comment

File: tokio/tests/rt_unstable_eager_driver_handoff.rs, in do_test

// thread that polled the I/0 driver. Now, block this

I/0 (zero) should be I/O (letter O).


Minor: HadDriver should derive Debug

File: tokio/src/runtime/scheduler/multi_thread/park.rs

#[derive(Copy, Clone, Eq, PartialEq)]
pub(crate) enum HadDriver { Yes, No }

Adding Debug would make it much easier to inspect Core state during debugging or tracing, and is essentially free.
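For illustration, the derive with `Debug` added (a standalone sketch, not the actual tokio module):

```rust
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
pub(crate) enum HadDriver {
    Yes,
    No,
}

fn main() {
    // With Debug derived, the state can be printed directly when
    // inspecting `Core` during debugging or tracing.
    let state = HadDriver::Yes;
    assert_eq!(format!("{state:?}"), "Yes");
}
```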


Nit: notify_parked_local() return value discarded in eager handoff path

In run_task (line 652), after the eager handoff fires:

self.worker.handle.notify_parked_local();

The bool return value is discarded. That's intentional here (if there's no idle worker to notify, that's fine). A let _ = or a comment explaining the discard would make it clear this isn't an oversight, especially since the function was specifically changed in this PR to return bool.
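A sketch of the explicit-discard idiom (the stand-in function below is hypothetical; the real method lives on the worker handle):

```rust
// Stand-in for `Handle::notify_parked_local`, which this PR changed
// to return whether a parked worker was actually notified.
fn notify_parked_local() -> bool {
    false // pretend no idle worker was available
}

fn main() {
    // Explicit discard: if there is no idle worker to notify,
    // that is fine for the eager-handoff path, and `let _ =`
    // signals the return value is ignored on purpose.
    let _ = notify_parked_local();
}
```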


Summary

Severity Issue
Bug had_driver not reset when handoff skipped due to notified_parked_worker == true → spurious extra wakeups on subsequent task polls
Inconsistency enable_eager_driver_handoff / had_driver fields not #[cfg(tokio_unstable)]-gated unlike all other unstable fields
Test concern deadlocks_consistently test depends on scheduler timing and may be flaky
Minor Typo: I/0 → I/O
Minor HadDriver missing Debug derive
Nit Discarded bool from notify_parked_local() should be let _ = or explained

The core mechanism and API design look good. The primary thing to fix before merging is the had_driver reset bug.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e277211.

{
core.had_driver = park::HadDriver::No;
self.worker.handle.notify_parked_local();
}

had_driver Not Reset When Searching Transition Notifies

Medium Severity

When transition_from_searching already notified a parked worker (notified_parked_worker == true), the eager-handoff block is skipped — but core.had_driver is never reset to HadDriver::No. On the very next call to run_task for the following task, is_searching is already false, so transition_from_searching returns false, making !notified_parked_worker true, and the condition fires an extra spurious notify_parked_local(). The had_driver reset should happen unconditionally whenever had_driver == HadDriver::Yes, with the additional notify_parked_local() call guarded only by !notified_parked_worker.
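A minimal model of why the unconditional reset matters (a standalone sketch; the real logic lives in `run_task` and touches `Core` state not shown here):

```rust
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
enum HadDriver {
    Yes,
    No,
}

// Simplified per-task handoff decision. Returns true when an eager
// `notify_parked_local` would fire.
fn eager_handoff(had_driver: &mut HadDriver, notified_parked_worker: bool) -> bool {
    if *had_driver == HadDriver::Yes {
        // Reset unconditionally: the post-driver state is consumed
        // by the first task polled after the park, either way.
        *had_driver = HadDriver::No;
        if !notified_parked_worker {
            return true;
        }
    }
    false
}

fn main() {
    let mut state = HadDriver::Yes;
    // First task: the searching transition already woke a worker,
    // so no extra notify fires — but the flag is still cleared.
    assert!(!eager_handoff(&mut state, true));
    // Next task from the same park cycle: no spurious notify.
    assert!(!eager_handoff(&mut state, false));
}
```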



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


📥 Commits

Reviewing files that changed from the base of the PR and between 80a0b97 and e277211.

📒 Files selected for processing (5)
  • tokio/src/runtime/builder.rs
  • tokio/src/runtime/config.rs
  • tokio/src/runtime/scheduler/multi_thread/park.rs
  • tokio/src/runtime/scheduler/multi_thread/worker.rs
  • tokio/tests/rt_unstable_eager_driver_handoff.rs

Comment on lines +630 to +653
let notified_parked_worker = core.transition_from_searching(&self.worker);

// If the setting to wake eagerly when releasing the I/O driver is
// enabled, and this worker had the driver, wake a parked worker to come
// grab it from us.
//
// Note that this is only done when we are *actually* about to poll a
// task, rather than whenever the worker has unparked. When the worker
// has been unparked, it may not actually have any tasks to poll, and if
// it's still holding the I/O driver, it should just go back to polling
// the driver again, rather than trying to wake someone else spuriously.
//
// Note that this explicitly checks `cfg!(tokio_unstable)` in addition,
// as that should result in this whole expression being eliminated at
// compile-time when unstable features are disabled.
if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
    && !notified_parked_worker
// don't do it a second time
{
    core.had_driver = park::HadDriver::No;
    self.worker.handle.notify_parked_local();
}

⚠️ Potential issue | 🟠 Major

Reset had_driver on the first post-park task, even when searching already woke someone.

If transition_from_searching() returns true, this first task already consumed the post-driver state, but core.had_driver stays Yes. The next run_task can then call notify_parked_local() again from the same park cycle, which defeats the intended one-shot handoff and adds extra wakeups.

💡 Suggested fix
         let notified_parked_worker = core.transition_from_searching(&self.worker);
+        let had_driver = core.had_driver == park::HadDriver::Yes;
+        core.had_driver = park::HadDriver::No;
 
         // If the setting to wake eagerly when releasing the I/O driver is
         // enabled, and this worker had the driver, wake a parked worker to come
         // grab it from us.
@@
         if cfg!(tokio_unstable)
             && core.enable_eager_driver_handoff
-            && core.had_driver == park::HadDriver::Yes
+            && had_driver
             && !notified_parked_worker
-        // don't do it a second time
         {
-            core.had_driver = park::HadDriver::No;
             self.worker.handle.notify_parked_local();
         }
📝 Committable suggestion


Suggested change
let notified_parked_worker = core.transition_from_searching(&self.worker);

// If the setting to wake eagerly when releasing the I/O driver is
// enabled, and this worker had the driver, wake a parked worker to come
// grab it from us.
//
// Note that this is only done when we are *actually* about to poll a
// task, rather than whenever the worker has unparked. When the worker
// has been unparked, it may not actually have any tasks to poll, and if
// it's still holding the I/O driver, it should just go back to polling
// the driver again, rather than trying to wake someone else spuriously.
//
// Note that this explicitly checks `cfg!(tokio_unstable)` in addition,
// as that should result in this whole expression being eliminated at
// compile-time when unstable features are disabled.
if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && core.had_driver == park::HadDriver::Yes
    && !notified_parked_worker
// don't do it a second time
{
    core.had_driver = park::HadDriver::No;
    self.worker.handle.notify_parked_local();
}

let notified_parked_worker = core.transition_from_searching(&self.worker);
let had_driver = core.had_driver == park::HadDriver::Yes;
core.had_driver = park::HadDriver::No;

// If the setting to wake eagerly when releasing the I/O driver is
// enabled, and this worker had the driver, wake a parked worker to come
// grab it from us.
//
// Note that this is only done when we are *actually* about to poll a
// task, rather than whenever the worker has unparked. When the worker
// has been unparked, it may not actually have any tasks to poll, and if
// it's still holding the I/O driver, it should just go back to polling
// the driver again, rather than trying to wake someone else spuriously.
//
// Note that this explicitly checks `cfg!(tokio_unstable)` in addition,
// as that should result in this whole expression being eliminated at
// compile-time when unstable features are disabled.
if cfg!(tokio_unstable)
    && core.enable_eager_driver_handoff
    && had_driver
    && !notified_parked_worker
{
    self.worker.handle.notify_parked_local();
}

@augmentcode

augmentcode bot commented Apr 6, 2026

🤖 Augment PR Summary

Summary: This PR adds an (unstable) option to eagerly hand off polling of the I/O/time driver when a worker transitions from driver polling back to task polling.

Changes:

  • Added a new builder/config flag enable_eager_driver_handoff (default off) behind tokio_unstable + rt-multi-thread.
  • Plumbed the flag through runtime Config and into the multi-thread scheduler worker core.
  • Extended the multi-thread parker to report whether the last park used the shared driver vs a condvar (HadDriver).
  • When enabled, a worker that just returned from a driver park may unpark another idle worker right before polling its next task, so the other worker can start polling the driver.
  • Adjusted searching/notification helpers to return whether a parked worker was actually notified, to avoid redundant wakeups.
  • Added a Unix-only regression test that reproduces a driver/task deadlock without the flag and verifies it is avoided when the flag is enabled.

Technical Notes: The eager handoff logic is guarded so it compiles away when unstable features are disabled, and it only affects multi-thread runtimes.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


// whether or not the runtime has deadlocked.
let (done_tx, done_rx) = std::sync::mpsc::channel();

std::thread::spawn(move || {

tokio/tests/rt_unstable_eager_driver_handoff.rs:94: In the expected-deadlock case, this spawns a thread that never terminates (it remains blocked in rt.block_on), which can leak an OS thread for the remainder of the test binary and potentially interfere with other tests running in the same process.

Severity: medium


done_tx.send(()).unwrap();
});

done_rx.recv_timeout(Duration::from_secs(10))

tokio/tests/rt_unstable_eager_driver_handoff.rs:138: deadlocks_consistently appears to pass only after waiting the full 10s recv_timeout, which will reliably add ~10s to the test suite on supported platforms; is that duration intentional for CI?

Severity: low

