
7944: bench: add remote_spawn benchmark for inject queue contention#69

Open
martin-augment wants to merge 1 commit into master from pr-7944-2026-03-03-12-36-29

Conversation

@martin-augment
Owner

7944: To review by AI

Add a benchmark that measures contention on the scheduler's inject queue
mutex (push_remote_task) when multiple external threads spawn tasks into
the runtime simultaneously. Every rt.spawn() from a non-worker thread
unconditionally goes through push_remote_task, so this directly measures
the scalability of the inject queue.

Results on an M1 Max MacBook Pro (10 cores), spawning 12,800 total tasks:

  threads/1:  3.39 ms  (265 ns/task, 1.00x)
  threads/2:  4.74 ms  (370 ns/task, 1.40x)
  threads/4:  5.89 ms  (460 ns/task, 1.74x)
  threads/8:  8.10 ms  (633 ns/task, 2.39x)

Wall-clock time increases with more threads despite constant total work,
confirming the single mutex serializes producers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the project's benchmarking suite by adding a new benchmark focused on remote task spawning within the Tokio runtime. The primary goal is to assess and understand the performance implications of contention on the scheduler's inject queue when tasks are submitted from multiple external threads, which is crucial for identifying and optimizing potential bottlenecks in highly concurrent applications.

Highlights

  • New Benchmark Added: A new benchmark named remote_spawn has been introduced to measure contention on the Tokio runtime's inject queue mutex. This benchmark specifically targets scenarios where multiple external (non-worker) threads concurrently spawn tasks into the runtime.
  • Contention Measurement: The benchmark is designed to isolate and quantify the overhead of push_remote_task by using trivial no-op tasks and synchronizing external threads with a barrier to maximize contention at various parallelism levels (1, 2, 4, 8, 16, 32, 64 threads).
Changelog
  • benches/Cargo.toml
    • Added a new [[bench]] entry to register the remote_spawn benchmark.
  • benches/remote_spawn.rs
    • Added a new benchmark file remote_spawn.rs containing the implementation for measuring remote task spawning contention.
Activity
  • The pull request was created and is awaiting initial review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Mar 3, 2026

Walkthrough

This pull request introduces a new benchmark for measuring task spawning contention. The benches/Cargo.toml configuration is updated to register a new benchmark target named "remote_spawn". A corresponding benchmark file, benches/remote_spawn.rs, is created to measure the overhead of spawning tasks remotely from external threads into a Tokio multi-threaded runtime. The benchmark iterates across various thread counts (1, 2, 4, 8, 16, 32, 64 up to available parallelism), spawning trivial async tasks and synchronizing threads to maximize contention during measurement.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmark, remote_spawn, to measure contention on the scheduler's inject queue. The implementation is sound, but I've identified an opportunity to improve the accuracy of the benchmark measurement. My feedback focuses on isolating the task spawning time from the task execution time to better align the benchmark with its stated goal.

Comment on lines +36 to +67
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});

Severity: medium

The current implementation with b.iter measures the time for both spawning tasks and running them to completion. To more accurately measure the inject queue contention during spawning, it's better to isolate the spawning phase from the task execution/cleanup phase. Using b.iter_custom allows for manual timer control, which will exclude the cleanup phase from the measurement and provide a more precise result for the spawning overhead.

                b.iter_custom(|iters| {
                    let mut total_duration = std::time::Duration::ZERO;
                    for _ in 0..iters {
                        let barrier = Barrier::new(num_threads);

                        let start = std::time::Instant::now();

                        let all_handles = std::thread::scope(|s| {
                            let handles: Vec<_> = (0..num_threads)
                                .map(|_| {
                                    let barrier = &barrier;
                                    let rt = &rt;
                                    s.spawn(move || {
                                        let mut join_handles =
                                            Vec::with_capacity(tasks_per_thread);
                                        barrier.wait();

                                        for _ in 0..tasks_per_thread {
                                            join_handles.push(rt.spawn(async {}));
                                        }
                                        join_handles
                                    })
                                })
                                .collect();

                            handles
                                .into_iter()
                                .flat_map(|h| h.join().unwrap())
                                .collect::<Vec<_>>()
                        });

                        total_duration += start.elapsed();

                        // Cleanup: wait for all tasks to complete before the next iteration.
                        rt.block_on(async {
                            for h in all_handles {
                                h.await.unwrap();
                            }
                        });
                    }
                    total_duration
                });

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

@claude

claude bot commented Mar 3, 2026

PR Review: bench: add remote_spawn benchmark for inject queue contention

Overall this is a well-motivated and clearly documented benchmark. The goal of measuring inject-queue mutex contention at varying concurrency levels is distinct enough from the existing spawn_many_remote_idle benchmark to be worth adding. The code is readable and follows the repo's existing Criterion patterns. A few issues are worth addressing before merging.


Issues

1. Barrier allocated inside b.iter() adds noise to measured time

b.iter(|| {
    let barrier = Barrier::new(num_threads);  // allocation every iteration
    ...
});

std::sync::Barrier resets automatically after all threads pass through, so it can be created once outside b.iter() and reused across iterations. Allocating it inside the hot loop adds overhead that is not part of what is being measured.

Suggestion: Move it outside b.iter(). Since std::thread::scope borrows by reference, this works cleanly:

let barrier = Barrier::new(num_threads);
b.iter(|| {
    std::thread::scope(|s| {
        // borrow &barrier as before
    });
});

2. No fixed worker_threads count makes results machine-dependent

fn rt() -> Runtime {
    runtime::Builder::new_multi_thread()
        .enable_all()   // worker_threads not set -- uses system default
        .build()
        .unwrap()
}

Without a fixed worker_threads count, results vary by machine. Other benchmarks in this directory (e.g., rt_multi_threaded.rs) pin NUM_WORKERS = 4. For a contention benchmark, the ratio of spawning threads to worker threads matters significantly -- leaving it as the system default makes cross-machine comparisons unreliable.

Suggestion: Either pin a specific worker count (e.g., 4) or document explicitly that the variable count is intentional and why.

3. enable_all() starts unneeded IO/timer drivers

For no-op task spawning that only exercises the inject queue, enable_all() brings in IO and timer drivers that are never used. Plain .build() (without enable_all()) is sufficient and reduces setup noise. Compare with spawn.rs which uses new_multi_thread().worker_threads(1).build() for a similar purpose.
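Both points 2 and 3 are one-line builder changes. A possible `rt()` is sketched below; the worker count of 4 is an assumption mirroring the `NUM_WORKERS` convention the review mentions from rt_multi_threaded.rs, not a value chosen in this PR:

```rust
use tokio::runtime::{self, Runtime};

/// Build the benchmark runtime with a pinned worker count and no
/// IO/timer drivers, so only the scheduler paths under test are active.
fn rt() -> Runtime {
    runtime::Builder::new_multi_thread()
        .worker_threads(4) // pinned for cross-machine comparability (assumed value)
        // no .enable_all(): the no-op tasks never touch IO or timers
        .build()
        .unwrap()
}
```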

4. Runtime recreated per parallelism variant (minor)

The rt() call is inside bench_with_input but outside b.iter(), so a new runtime is created for each parallelism level. While this does not pollute the measured time, a single runtime shared across all variants would be more efficient. The existing rt_multi_spawn_many_remote_idle creates the runtime once for the whole function.


Minor Notes

  • Compile-time assertion for TOTAL_TASKS: The comment "Must be divisible by the largest parallelism level (64)" is helpful, but a compile-time assertion would enforce this automatically:
    const _: () = assert!(TOTAL_TASKS % 64 == 0, "TOTAL_TASKS must be divisible by 64");
  • Using std::thread::scope to borrow barrier and rt by reference (avoiding Arc) is clean and idiomatic.
  • The module-level doc comment is thorough and clear -- good practice for benchmarks.

Summary

| Category | Assessment |
| --- | --- |
| Concept | Good -- fills a real gap in existing benchmarks |
| Documentation | Good |
| Correctness | Correct; barrier synchronisation logic is sound |
| Measurement accuracy | Needs improvement -- Barrier in hot loop, variable runtime config |
| Style | Consistent with repo conventions |

Addressing points 1-3 (move Barrier outside iter, fix worker count, drop enable_all) would make this a solid addition.

@augmentcode

augmentcode bot commented Mar 3, 2026

🤖 Augment PR Summary

Summary: Adds a new Criterion benchmark to study contention when spawning tasks into Tokio from non-worker threads.

Changes: Registers a new remote_spawn bench target and introduces benches/remote_spawn.rs.

The benchmark builds a multi-thread runtime, starts N external OS threads (1–64, capped by available_parallelism), and synchronizes them with a barrier.

Each thread spawns TOTAL_TASKS / N no-op tasks via rt.spawn to exercise push_remote_task / the inject queue.

After spawning, the benchmark awaits all task JoinHandles to keep iterations self-contained.

Goal: Provide a micro-benchmark for inject-queue mutex contention under concurrent remote spawns.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


Comment augment review to trigger a new review at any time.

b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {


b.iter currently includes creating/joining num_threads OS threads each iteration (std::thread::scope), which can dominate timings and obscure inject-queue contention. Consider separating thread setup from the timed region if the goal is to measure push_remote_task overhead.

Severity: medium

Fix This in Augment


Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {


The timed region also includes collecting all JoinHandles and awaiting them via rt.block_on, so results reflect more than just remote-spawn/inject-queue work. If this benchmark is intended to isolate inject-queue contention, clarifying/controlling this extra work would help interpretation.

Severity: medium



Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
benches/remote_spawn.rs (1)

34-35: Add an explicit divisibility invariant for TOTAL_TASKS.

This avoids silent workload drift if either constant changes later.

Suggested patch
                 let rt = rt();
                 let tasks_per_thread = TOTAL_TASKS / num_threads;
+                assert_eq!(
+                    TOTAL_TASKS % num_threads,
+                    0,
+                    "TOTAL_TASKS must be divisible by num_threads"
+                );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benches/remote_spawn.rs` around lines 34 - 35, Add an explicit runtime (or
debug) assertion that TOTAL_TASKS is divisible by num_threads immediately before
computing tasks_per_thread to prevent silent workload drift; e.g., in
benches/remote_spawn.rs near the tasks_per_thread calculation, add an
assert!(TOTAL_TASKS % num_threads == 0, "TOTAL_TASKS must be divisible by
num_threads") (or debug_assert! if you prefer non-production checks) so the
invariant involving TOTAL_TASKS and num_threads is enforced before let
tasks_per_thread = TOTAL_TASKS / num_threads.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f1a84 and ce47f1a.

📒 Files selected for processing (2)
  • benches/Cargo.toml
  • benches/remote_spawn.rs

Comment on lines +36 to +67
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see if the file exists and read the relevant section
cat -n benches/remote_spawn.rs | head -80

Repository: martin-augment/tokio

Length of output: 3784


Measured region currently mixes thread lifecycle cost with spawn contention.

The code structure confirms the issue: std::thread::scope, s.spawn(...), and join() (lines 39–58) are all inside the b.iter() closure, so each sample includes OS thread creation and teardown overhead. Per the benchmark's stated goal (measuring inject-queue mutex contention), this conflates orthogonal costs. Consider moving thread creation outside b.iter() and measuring only the task enqueue and await cycle to isolate inject-queue contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benches/remote_spawn.rs` around lines 36 - 67, The benchmark currently
recreates OS threads inside b.iter(), conflating thread lifecycle with spawn
contention; refactor so thread creation (std::thread::scope / s.spawn / join)
happens once outside b.iter() and each spawned thread runs a per-iteration loop
that waits on a shared Barrier and performs the rt.spawn(...) tasks, sending
join handles back to the main bench loop; then inside b.iter() only trigger the
Barrier to start the iteration, collect the handles produced by worker threads,
and use rt.block_on to await them—this isolates the inject-queue contention
measurement while still using the existing Barrier, rt.spawn, and rt.block_on
logic.
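The restructuring described above can be sketched with plain std primitives, leaving Tokio out so the shape is easy to see. In this illustrative model (the `run` helper and the summing "work" are invented stand-ins for the per-iteration `rt.spawn` loop), producer threads are created once, each timed iteration is started by a shared barrier, and results flow back over a channel:

```rust
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;

/// Persistent producer threads: created once, reused for every timed
/// iteration. Each iteration starts when all parties reach the barrier
/// and ends once every producer has reported on the channel.
fn run(num_threads: usize, iterations: usize) -> usize {
    // +1 so the measuring thread participates in the start barrier.
    let barrier = Arc::new(Barrier::new(num_threads + 1));
    let (tx, rx) = mpsc::channel::<usize>();

    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let barrier = Arc::clone(&barrier);
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..iterations {
                    barrier.wait(); // timed region would start here
                    // Stand-in for the per-iteration spawn loop.
                    let work: usize = (0..100).sum();
                    tx.send(work).unwrap();
                }
            })
        })
        .collect();
    drop(tx);

    let mut grand_total = 0;
    for _ in 0..iterations {
        barrier.wait(); // release all producers at once
        for _ in 0..num_threads {
            grand_total += rx.recv().unwrap();
        }
    }
    for w in workers {
        w.join().unwrap();
    }
    grand_total
}

fn main() {
    // Each worker contributes 0 + 1 + ... + 99 = 4950 per iteration.
    assert_eq!(run(4, 3), 4 * 3 * 4950);
    println!("ok");
}
```

In the real benchmark, the barrier release and the final `recv` would bracket the timed region, keeping thread creation and teardown out of every sample.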

Owner Author


value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

@martin-augment
Owner Author

1. Barrier allocated inside b.iter() adds noise to measured time

b.iter(|| {
    let barrier = Barrier::new(num_threads);  // allocation every iteration
    ...
});

std::sync::Barrier resets automatically after all threads pass through, so it can be created once outside b.iter() and reused across iterations. Allocating it inside the hot loop adds overhead that is not part of what is being measured.

Suggestion: Move it outside b.iter(). Since std::thread::scope borrows by reference, this works cleanly:

let barrier = Barrier::new(num_threads);
b.iter(|| {
    std::thread::scope(|s| {
        // borrow &barrier as before
    });
});

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The Barrier construction could be moved out of the loop to avoid measuring it. It won't change the execution time in a noticeable way but it will make it clear that it is not part of the measured logic.

@martin-augment
Owner Author

3. enable_all() starts unneeded IO/timer drivers

For no-op task spawning that only exercises the inject queue, enable_all() brings in IO and timer drivers that are never used. Plain .build() (without enable_all()) is sufficient and reduces setup noise. Compare with spawn.rs which uses new_multi_thread().worker_threads(1).build() for a similar purpose.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! There is no need to enable functionalities which are not really needed by the benchmark test. Prevents introducing random noise to the benchmark by background functionalities which are not needed.
