7973: rt: shard the multi-thread inject queue to reduce remote spawn contention #93
martin-augment wants to merge 5 commits into master from
Conversation
The multi-threaded scheduler's inject queue was protected by a single global mutex (shared with idle coordination state). Every remote task spawn (any spawn from outside a worker thread) acquired this lock, serializing concurrent spawners and limiting throughput. This change introduces `inject::Sharded`, which splits the inject queue into up to 8 independent shards, each an existing `Shared`/`Synced` pair with its own mutex and cache-line padding.

Design:
- Push: each thread is assigned a home shard on first push (via a global counter) and sticks with it. This keeps consecutive pushes from one thread cache-local while spreading distinct threads across distinct locks.
- Pop: workers rotate through shards starting at their own index, skipping empty shards via a per-shard atomic length. `pop_n` drains from one shard at a time to keep critical sections bounded.
- Shard count: capped at 8 (and 1 under loom). Contention drops off steeply past a handful of shards, and `is_empty()`/`len()` scan all shards in the worker hot loop.
- `is_closed`: a single Release atomic set after all shards are closed, so the shutdown check stays lock-free.

Random shard selection via `context::thread_rng_n` (as used in tokio-rs#7757 for the blocking pool) was measured and found to be 20-33% slower on `remote_spawn` at 8+ threads. The inject workload is a tight loop of trivial pushes where producer-side cache locality dominates: with RNG, a hot thread bounces between shard cache lines on every push; with sticky assignment it stays hot on one mutex and list tail. RNG did win slightly (5-9%) on single-producer benchmarks where spreading tasks lets workers pop in parallel, but not enough to offset the regression at scale.

The inject state is removed from the global `Synced` mutex, which now only guards idle coordination. This also helps the single-threaded path since remote pushes no longer contend with worker park/unpark.
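The sticky-push / rotating-pop scheme described above can be sketched as a standalone toy. Names such as `MY_SHARD` and the `Mutex<VecDeque>` shards are illustrative stand-ins, not Tokio's actual `Shared`/`Synced` types:

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const MAX_SHARDS: usize = 8;
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // usize::MAX is the "unassigned" sentinel, as in the PR.
    static MY_SHARD: Cell<usize> = Cell::new(usize::MAX);
}

pub struct Sharded<T> {
    shards: Vec<Mutex<VecDeque<T>>>,
}

impl<T> Sharded<T> {
    pub fn new(hint: usize) -> Self {
        // Round the hint up to a power of two, capped at MAX_SHARDS.
        let n = hint.next_power_of_two().min(MAX_SHARDS);
        Sharded {
            shards: (0..n).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    /// Sticky push: a thread claims a home shard on first use and keeps it,
    /// so consecutive pushes stay hot on one mutex and list tail.
    pub fn push(&self, value: T) {
        let idx = MY_SHARD.with(|c| {
            if c.get() == usize::MAX {
                c.set(NEXT_SHARD.fetch_add(1, Ordering::Relaxed));
            }
            c.get() & (self.shards.len() - 1) // power-of-two mask
        });
        self.shards[idx].lock().unwrap().push_back(value);
    }

    /// Pop rotates through shards starting at `start`, so workers with
    /// different indices tend to hit different locks first.
    pub fn pop(&self, start: usize) -> Option<T> {
        let n = self.shards.len();
        (0..n).find_map(|i| self.shards[(start + i) & (n - 1)].lock().unwrap().pop_front())
    }
}
```

The real `inject::Sharded` reuses the existing intrusive linked list per shard; the `VecDeque` here only stands in for that structure.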
Results on the `remote_spawn` benchmark (12,800 no-op tasks, N spawner threads, 64-core box):

| threads | before   | after   | improvement |
|---------|----------|---------|-------------|
| 1       | 9.38 ms  | 7.33 ms | -22%        |
| 2       | 14.94 ms | 6.64 ms | -56%        |
| 4       | 23.69 ms | 5.34 ms | -77%        |
| 8       | 34.81 ms | 4.69 ms | -87%        |
| 16      | 32.33 ms | 4.54 ms | -86%        |
| 32      | 30.37 ms | 4.73 ms | -84%        |
| 64      | 26.59 ms | 5.34 ms | -80%        |

rt_multi_threaded benchmarks: spawn_many_local -8%, spawn_many_remote_idle -7%, yield_many -1%, rest neutral.

Developed in conjunction with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`Sharded::pop` already checks each shard's emptiness as a fast path before locking, so the outer `is_empty` scan was iterating all shards twice for no benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
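The per-shard fast path this commit relies on can be illustrated with a simplified model (not the actual `Shared` implementation): each shard publishes its length in an atomic, so `pop` can skip the lock when the shard looks empty, making an outer `is_empty` scan redundant.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct Shard<T> {
    len: AtomicUsize,
    queue: Mutex<VecDeque<T>>,
}

impl<T> Shard<T> {
    fn new() -> Self {
        Shard {
            len: AtomicUsize::new(0),
            queue: Mutex::new(VecDeque::new()),
        }
    }

    fn push(&self, value: T) {
        let mut q = self.queue.lock().unwrap();
        q.push_back(value);
        // Publish the new length while still holding the lock.
        self.len.store(q.len(), Ordering::Release);
    }

    fn pop(&self) -> Option<T> {
        // Fast path: skip the lock entirely when the shard looks empty.
        // A caller that scans is_empty() before calling pop would just
        // repeat this check.
        if self.len.load(Ordering::Acquire) == 0 {
            return None;
        }
        let mut q = self.queue.lock().unwrap();
        let v = q.pop_front();
        self.len.store(q.len(), Ordering::Release);
        v
    }
}
```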
Walkthrough: The changes introduce a sharded injection queue mechanism for the Tokio multi-threaded runtime scheduler.
Code Review: Sharded Inject Queue

This is a well-motivated change with a clear architecture. Sharding the inject queue behind per-shard mutexes is the right approach for reducing remote-spawn contention, and the implementation is generally clean. Here are my findings, from most to least significant.

Correctness Issues

1. Misleading safety comment in … . The doc comment on … . This invariant only holds after … .

2. If two threads call … .

3. Shard count computation:

```rust
let num_shards = shard_hint.clamp(1, MAX_SHARDS).next_power_of_two();
```

Because `next_power_of_two` runs after the clamp, the result stays within bounds only while `MAX_SHARDS` itself is a power of two, which the code asserts:

```rust
debug_assert!(MAX_SHARDS.is_power_of_two());
```

A form that holds without relying on that assertion:

```rust
let num_shards = shard_hint.next_power_of_two().min(MAX_SHARDS);
```

Design / Documentation Issues

4. Shard lock is held for the entire closure invocation. In the pop path:

```rust
let mut synced = shard.synced.lock();
// ...
let pop = unsafe { shard.shared.pop_n(&mut synced, n) };
return Some(f(pop)); // synced (MutexGuard) still alive here
```

The lock drops only when … .

5. Per-thread … .

Minor / Nits

6. A release-build call with … .

7. Drain start index:

```rust
while let Some(task) = self.next_remote_task(0) {
```

Draining always begins at shard 0, changing the order relative to the original FIFO inject queue. This is fine for correctness, but worth a comment noting that drain ordering across shards is not guaranteed.

Positive Highlights
Code Review
This pull request introduces a sharded inject queue for the multi-threaded scheduler to mitigate contention on the global mutex. By partitioning the queue into multiple shards and using thread-local indices for shard assignment, the implementation improves scalability and cache locality. The review feedback suggests optimizing the `push` and `push_batch` methods by adding lock-free checks for the closed state to avoid unnecessary mutex acquisitions during shutdown.
```rust
pub(crate) fn push(&self, task: task::Notified<T>) {
    let idx = self.next_push_shard();
    let shard = &*self.shards[idx];

    let mut synced = shard.synced.lock();
    // safety: `synced` belongs to `shard.shared`
    unsafe { shard.shared.push(&mut synced, task) };
}
```
While `Shared::push` performs an internal check on the closed state under the lock, adding a lock-free check of `self.is_closed()` at the start of `Sharded::push` can avoid unnecessary mutex acquisition during runtime shutdown. This is a minor optimization for the shutdown path.
```rust
pub(crate) fn push(&self, task: task::Notified<T>) {
    if self.is_closed() {
        return;
    }
    let idx = self.next_push_shard();
    let shard = &*self.shards[idx];
    let mut synced = shard.synced.lock();
    // safety: `synced` belongs to `shard.shared`
    unsafe { shard.shared.push(&mut synced, task) };
}
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Checking whether the queue is already closed would avoid acquiring the lock and executing the unsafe code entirely.
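The lock-free check both reviewers ask for matches the flag the PR description mentions ("a single Release atomic set after all shards are closed"). A minimal sketch of that pattern, with illustrative names:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct CloseFlag {
    is_closed: AtomicBool,
}

impl CloseFlag {
    const fn new() -> Self {
        CloseFlag {
            is_closed: AtomicBool::new(false),
        }
    }

    /// Called once, after every shard has been closed under its own lock.
    fn close(&self) {
        self.is_closed.store(true, Ordering::Release);
    }

    /// Lock-free shutdown check: pairs with the Release store above, so a
    /// thread that observes `true` also observes the shards' closed state.
    fn is_closed(&self) -> bool {
        self.is_closed.load(Ordering::Acquire)
    }
}
```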
```rust
pub(crate) fn push_batch<I>(&self, iter: I)
where
    I: Iterator<Item = task::Notified<T>>,
{
    let idx = self.next_push_shard();
    let shard = &*self.shards[idx];

    // safety: `&shard.synced` yields `&mut Synced` for the same
    // `Shared` instance that `push_batch` operates on. The underlying
    // implementation links the batch outside the lock and only
    // acquires it for the list splice.
    unsafe { shard.shared.push_batch(&shard.synced, iter) };
}
```
Similar to `push`, adding a lock-free check for `self.is_closed()` here can prevent taking a lock when the queue is already closed during shutdown.
```rust
pub(crate) fn push_batch<I>(&self, iter: I)
where
    I: Iterator<Item = task::Notified<T>>,
{
    if self.is_closed() {
        return;
    }
    let idx = self.next_push_shard();
    let shard = &*self.shards[idx];
    // safety: `&shard.synced` yields `&mut Synced` for the same
    // `Shared` instance that `push_batch` operates on. The underlying
    // implementation links the batch outside the lock and only
    // acquires it for the list splice.
    unsafe { shard.shared.push_batch(&shard.synced, iter) };
}
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Checking whether the queue is already closed would avoid acquiring the lock and executing the unsafe code entirely.
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)

tokio/src/runtime/task/trace/mod.rs (1)

Lines 358-377: ⚠️ Potential issue | 🟠 Major: Taskdump no longer quiesces remote inject producers.

This loop now drains the sharded inject queue one shard lock at a time, but `Handle::push_remote_task` concurrently enqueues straight into `self.shared.inject.push(task)`. That means remote spawns can keep arriving while the dump is in progress, so the "local and injection queues are drained" precondition below is no longer guaranteed. In practice this can make taskdumps miss freshly injected notified tasks, or spin indefinitely if remote spawn pressure stays high. Consider adding a tracing-only producer gate or taking all shard locks before draining.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tokio/src/runtime/task/trace/mod.rs` around lines 358 - 377, trace_multi_thread assumes the injection queue is quiesced but injection.pop(0) only drains one shard at a time while Handle::push_remote_task can concurrently push into other shards; fix by quiescing remote producers before draining the sharded injection: either acquire all shard locks for the Sharded<Arc<multi_thread::Handle>> (i.e., lock each shard/mutex/slot and then pop from each while holding the locks) before the "clear the injection queue" loop, or add a tracing-only producer gate that prevents Handle::push_remote_task from enqueuing during the dump; after ensuring no concurrent pushes, continue to collect dequeued and call trace_owned as before.
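One way to implement the "take all shard locks before draining" option, sketched over plain `Mutex<VecDeque>` shards rather than Tokio's internal types:

```rust
use std::collections::VecDeque;
use std::sync::{Mutex, MutexGuard};

/// Acquire every shard lock first, then drain. A producer blocked on any
/// shard cannot slip a task into an already-drained shard, so the queue is
/// genuinely quiesced when this returns.
fn drain_all<T>(shards: &[Mutex<VecDeque<T>>]) -> Vec<T> {
    let mut guards: Vec<MutexGuard<'_, VecDeque<T>>> =
        shards.iter().map(|s| s.lock().unwrap()).collect();
    let mut out = Vec::new();
    for g in guards.iter_mut() {
        out.extend(g.drain(..));
    }
    out
} // all guards dropped here, in one batch
```

Holding all shard locks at once widens the critical section, which is why this would be acceptable on the tracing-only taskdump path but not on the hot pop path.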
📒 Files selected for processing (9)

- spellcheck.dic
- tokio/src/runtime/context.rs
- tokio/src/runtime/scheduler/inject.rs
- tokio/src/runtime/scheduler/inject/sharded.rs
- tokio/src/runtime/scheduler/inject/shared.rs
- tokio/src/runtime/scheduler/multi_thread/mod.rs
- tokio/src/runtime/scheduler/multi_thread/worker.rs
- tokio/src/runtime/scheduler/multi_thread/worker/taskdump.rs
- tokio/src/runtime/task/trace/mod.rs
💤 Files with no reviewable changes (1)
- tokio/src/runtime/scheduler/multi_thread/mod.rs
🤖 Augment PR Summary

Summary: This PR reduces contention when many external threads spawn into the multi-thread Tokio runtime by sharding the global inject queue.

Changes:
Technical Notes: Shard count is capped (1 under loom, 8 otherwise) and rounded to a power of two to allow fast masking for shard selection.
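The fast-masking trick this note refers to works because for a power-of-two `n`, `x % n == x & (n - 1)`. A tiny illustration (`shard_index` is a hypothetical helper name):

```rust
/// Map an unbounded counter value onto a power-of-two shard count without
/// a division: `counter & (n - 1)` equals `counter % n` when n is 2^k.
fn shard_index(counter: usize, num_shards: usize) -> usize {
    debug_assert!(num_shards.is_power_of_two());
    counter & (num_shards - 1)
}
```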
```rust
cfg_rt_multi_thread! {
    /// Sentinel indicating the per-thread inject push shard has not been assigned.
    const INJECT_SHARD_UNASSIGNED: usize = usize::MAX;
```
tokio/src/runtime/context.rs:218: Using `usize::MAX` as `INJECT_SHARD_UNASSIGNED` can theoretically collide if the `NEXT_SHARD` counter ever wraps to `usize::MAX`, causing a thread to keep reinitializing its shard assignment. Consider ensuring the stored value range can't overlap the sentinel (e.g., store `idx + 1`).
Severity: low
value:good-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct! Hitting the sentinel would require the counter to advance through usize::MAX shard assignments, which is vanishingly unlikely in practice. Any hardening can be left for a follow-up if someone ever needs it.
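The reviewer's `idx + 1` suggestion would look roughly like this (a sketch; `SHARD_PLUS_ONE` is a made-up name, and note that a wrapping counter is only pushed one value further from the sentinel, not eliminated entirely):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // 0 means "unassigned"; an assigned index i is stored as i + 1, so the
    // counter's starting values can no longer collide with the sentinel.
    static SHARD_PLUS_ONE: Cell<usize> = Cell::new(0);
}

fn my_shard() -> usize {
    SHARD_PLUS_ONE.with(|c| {
        if c.get() == 0 {
            let idx = NEXT_SHARD.fetch_add(1, Ordering::Relaxed);
            c.set(idx.wrapping_add(1));
        }
        c.get() - 1
    })
}
```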
```rust
impl<T: 'static> Sharded<T> {
    /// Creates a new sharded inject queue with a shard count derived
    /// from the requested hint (rounded up to a power of two).
    pub(crate) fn new(shard_hint: usize) -> Sharded<T> {
```
tokio/src/runtime/scheduler/inject/sharded.rs:69: There are unit tests for `inject::Shared` in tokio/src/runtime/tests/inject.rs, but none exercising `inject::Sharded` (multi-shard push/pop rotation, `pop_n` behavior, and close handling). Adding coverage here would help catch regressions in the new sharding logic.
Severity: low
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! There are no tests for `pop_n()`, and it would be good to add some to prevent regressions in the future.
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The docstring is not correct while the `close()` method is still running in another thread: in that case some shards will be closed, but others may still be accepting remote tasks.
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The docstring of …