7944: bench: add remote_spawn benchmark for inject queue contention (#69)
martin-augment wants to merge 1 commit into master
Conversation
Add a benchmark that measures contention on the scheduler's inject queue mutex (push_remote_task) when multiple external threads spawn tasks into the runtime simultaneously. Every rt.spawn() from a non-worker thread unconditionally goes through push_remote_task, so this directly measures the scalability of the inject queue.

Results on an M1 Max MacBook Pro (10 cores), spawning 12,800 total tasks:

- threads/1: 3.39 ms (265 ns/task, 1.00x)
- threads/2: 4.74 ms (370 ns/task, 1.40x)
- threads/4: 5.89 ms (460 ns/task, 1.74x)
- threads/8: 8.10 ms (633 ns/task, 2.39x)

Wall-clock time increases with more threads despite constant total work, confirming the single mutex serializes producers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
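As a quick sanity check, the per-task cost and scaling factor follow directly from the wall-clock times and the fixed 12,800-task workload. A standalone sketch of that arithmetic (not part of the benchmark itself):

```rust
fn main() {
    let total_tasks = 12_800.0_f64;
    // (thread count, wall-clock ms) as reported above
    let results = [(1, 3.39_f64), (2, 4.74), (4, 5.89), (8, 8.10)];
    let baseline_ms = results[0].1;
    for (threads, ms) in results {
        let ns_per_task = ms * 1e6 / total_tasks; // ms -> ns, divided by task count
        let scaling = ms / baseline_ms; // slowdown relative to the 1-thread run
        println!("threads/{threads}: {ns_per_task:.0} ns/task, {scaling:.2}x");
    }
}
```

Recomputing the table this way reproduces the reported 265/370/460/633 ns/task figures and the 1.00x-2.39x scaling factors.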
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the project's benchmarking suite by adding a new benchmark focused on remote task spawning within the Tokio runtime. The primary goal is to assess the performance implications of contention on the scheduler's inject queue when tasks are submitted from multiple external threads, which is crucial for identifying and optimizing potential bottlenecks in highly concurrent applications.
Walkthrough: This pull request introduces a new benchmark for measuring task spawning contention. The benches/Cargo.toml configuration is updated to register a new benchmark target named "remote_spawn". A corresponding benchmark file, benches/remote_spawn.rs, is created to measure the overhead of spawning tasks remotely from external threads into a Tokio multi-threaded runtime. The benchmark iterates across various thread counts (1, 2, 4, 8, 16, 32, 64, up to available parallelism), spawning trivial async tasks and synchronizing threads to maximize contention during measurement.
Code Review
This pull request introduces a new benchmark, remote_spawn, to measure contention on the scheduler's inject queue. The implementation is sound, but I've identified an opportunity to improve the accuracy of the benchmark measurement. My feedback focuses on isolating the task spawning time from the task execution time to better align the benchmark with its stated goal.
```rust
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});
```
The current implementation with b.iter measures the time for both spawning tasks and running them to completion. To more accurately measure the inject queue contention during spawning, it's better to isolate the spawning phase from the task execution/cleanup phase. Using b.iter_custom allows for manual timer control, which will exclude the cleanup phase from the measurement and provide a more precise result for the spawning overhead.
```rust
b.iter_custom(|iters| {
    let mut total_duration = std::time::Duration::ZERO;
    for _ in 0..iters {
        let barrier = Barrier::new(num_threads);
        let start = std::time::Instant::now();
        let all_handles = std::thread::scope(|s| {
            let handles: Vec<_> = (0..num_threads)
                .map(|_| {
                    let barrier = &barrier;
                    let rt = &rt;
                    s.spawn(move || {
                        let mut join_handles =
                            Vec::with_capacity(tasks_per_thread);
                        barrier.wait();
                        for _ in 0..tasks_per_thread {
                            join_handles.push(rt.spawn(async {}));
                        }
                        join_handles
                    })
                })
                .collect();
            handles
                .into_iter()
                .flat_map(|h| h.join().unwrap())
                .collect::<Vec<_>>()
        });
        total_duration += start.elapsed();
        // Cleanup: wait for all tasks to complete before the next iteration.
        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    }
    total_duration
});
```
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).
PR Review: bench: add remote_spawn benchmark for inject queue contention

Overall this is a well-motivated and clearly documented benchmark. The goal of measuring inject-queue mutex contention at varying concurrency levels is distinct enough from the existing benchmarks. Issues:
| Category | Assessment |
|---|---|
| Concept | Good -- fills a real gap in existing benchmarks |
| Documentation | Good |
| Correctness | Correct; barrier synchronisation logic is sound |
| Measurement accuracy | Needs improvement -- Barrier in hot loop, variable runtime config |
| Style | Consistent with repo conventions |
Addressing points 1-3 (move Barrier outside iter, fix worker count, drop enable_all) would make this a solid addition.
🤖 Augment PR Summary: Adds a new Criterion benchmark to study contention when spawning tasks into Tokio from non-worker threads.
```rust
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
```
b.iter currently includes creating/joining num_threads OS threads each iteration (std::thread::scope), which can dominate timings and obscure inject-queue contention. Consider separating thread setup from the timed region if the goal is to measure push_remote_task overhead.
Severity: medium
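A minimal illustration of that suggestion, using only the standard library: threads are created once, released together through a Barrier, and only the contended phase falls inside the timed region. The shared atomic counter below is a hypothetical stand-in for the contended inject queue, not Tokio's actual rt.spawn path:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;
use std::time::{Duration, Instant};

// Returns (final counter value, time spent in the contended phase only).
fn timed_run(num_threads: usize, ops_per_thread: u64) -> (u64, Duration) {
    // Stand-in for the contended inject queue.
    let counter = Arc::new(AtomicU64::new(0));
    // `+ 1` so the measuring thread participates in both barriers.
    let start = Arc::new(Barrier::new(num_threads + 1));
    let done = Arc::new(Barrier::new(num_threads + 1));

    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let (counter, start, done) = (counter.clone(), start.clone(), done.clone());
            thread::spawn(move || {
                start.wait(); // thread creation happens before timing begins
                for _ in 0..ops_per_thread {
                    counter.fetch_add(1, Ordering::Relaxed); // contended operation
                }
                done.wait();
            })
        })
        .collect();

    let t0 = Instant::now(); // timer starts only after all threads exist
    start.wait();
    done.wait();
    let elapsed = t0.elapsed();

    for h in handles {
        h.join().unwrap();
    }
    (counter.load(Ordering::Relaxed), elapsed)
}

fn main() {
    let (ops, elapsed) = timed_run(4, 10_000);
    println!("{ops} ops in {elapsed:?}");
}
```

The same shape carries over to the benchmark: keep the thread pool alive across iterations, and use the release barrier as the start of the measured window.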
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).
```rust
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
```
The timed region also includes collecting all JoinHandles and awaiting them via rt.block_on, so results reflect more than just remote-spawn/inject-queue work. If this benchmark is intended to isolate inject-queue contention, clarifying/controlling this extra work would help interpretation.
Severity: medium
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).
Actionable comments posted: 1
🧹 Nitpick comments (1)
benches/remote_spawn.rs (1)

34-35: Add an explicit divisibility invariant for TOTAL_TASKS. This avoids silent workload drift if either constant changes later.
Suggested patch

```diff
 let rt = rt();
 let tasks_per_thread = TOTAL_TASKS / num_threads;
+assert_eq!(
+    TOTAL_TASKS % num_threads,
+    0,
+    "TOTAL_TASKS must be divisible by num_threads"
+);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benches/remote_spawn.rs` around lines 34 - 35, Add an explicit runtime (or debug) assertion that TOTAL_TASKS is divisible by num_threads immediately before computing tasks_per_thread to prevent silent workload drift; e.g., in benches/remote_spawn.rs near the tasks_per_thread calculation, add an assert!(TOTAL_TASKS % num_threads == 0, "TOTAL_TASKS must be divisible by num_threads") (or debug_assert! if you prefer non-production checks) so the invariant involving TOTAL_TASKS and num_threads is enforced before let tasks_per_thread = TOTAL_TASKS / num_threads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benches/remote_spawn.rs`:
- Around line 36-67: The benchmark currently recreates OS threads inside
b.iter(), conflating thread lifecycle with spawn contention; refactor so thread
creation (std::thread::scope / s.spawn / join) happens once outside b.iter() and
each spawned thread runs a per-iteration loop that waits on a shared Barrier and
performs the rt.spawn(...) tasks, sending join handles back to the main bench
loop; then inside b.iter() only trigger the Barrier to start the iteration,
collect the handles produced by worker threads, and use rt.block_on to await
them—this isolates the inject-queue contention measurement while still using the
existing Barrier, rt.spawn, and rt.block_on logic.
---
Nitpick comments:
In `@benches/remote_spawn.rs`:
- Around line 34-35: Add an explicit runtime (or debug) assertion that
TOTAL_TASKS is divisible by num_threads immediately before computing
tasks_per_thread to prevent silent workload drift; e.g., in
benches/remote_spawn.rs near the tasks_per_thread calculation, add an
assert!(TOTAL_TASKS % num_threads == 0, "TOTAL_TASKS must be divisible by
num_threads") (or debug_assert! if you prefer non-production checks) so the
invariant involving TOTAL_TASKS and num_threads is enforced before let
tasks_per_thread = TOTAL_TASKS / num_threads.
```rust
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's see if the file exists and read the relevant section
cat -n benches/remote_spawn.rs | head -80
```

Repository: martin-augment/tokio

Length of output: 3784
Measured region currently mixes thread lifecycle cost with spawn contention.
The code structure confirms the issue: std::thread::scope, s.spawn(...), and join() (lines 39–58) are all inside the b.iter() closure, so each sample includes OS thread creation and teardown overhead. Per the benchmark's stated goal (measuring inject-queue mutex contention), this conflates orthogonal costs. Consider moving thread creation outside b.iter() and measuring only the task enqueue and await cycle to isolate inject-queue contention.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benches/remote_spawn.rs` around lines 36 - 67, The benchmark currently
recreates OS threads inside b.iter(), conflating thread lifecycle with spawn
contention; refactor so thread creation (std::thread::scope / s.spawn / join)
happens once outside b.iter() and each spawned thread runs a per-iteration loop
that waits on a shared Barrier and performs the rt.spawn(...) tasks, sending
join handles back to the main bench loop; then inside b.iter() only trigger the
Barrier to start the iteration, collect the handles produced by worker threads,
and use rt.block_on to await them—this isolates the inject-queue contention
measurement while still using the existing Barrier, rt.spawn, and rt.block_on
logic.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The Barrier construction could be moved out of the loop to avoid measuring it. It won't change the execution time in a noticeable way, but it will make clear that it is not part of the measured logic.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! There is no need to enable functionality that is not needed by the benchmark. This prevents introducing random noise into the benchmark from unneeded background functionality.