
7944: bench: add remote_spawn benchmark for inject queue contention#69

Open
martin-augment wants to merge 1 commit into master from pr-7944-2026-03-03-12-36-29

Conversation

@martin-augment
Owner

7944: To review by AI

Add a benchmark that measures contention on the scheduler's inject queue
mutex (push_remote_task) when multiple external threads spawn tasks into
the runtime simultaneously. Every rt.spawn() from a non-worker thread
unconditionally goes through push_remote_task, so this directly measures
the scalability of the inject queue.

Results on an M1 Max MacBook Pro (10 cores), spawning 12,800 total tasks:

  threads/1:  3.39 ms  (265 ns/task, 1.00x)
  threads/2:  4.74 ms  (370 ns/task, 1.40x)
  threads/4:  5.89 ms  (460 ns/task, 1.74x)
  threads/8:  8.10 ms  (633 ns/task, 2.39x)

Wall-clock time increases with more threads despite constant total work,
confirming the single mutex serializes producers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the project's benchmarking suite by adding a new benchmark focused on remote task spawning within the Tokio runtime. The primary goal is to assess and understand the performance implications of contention on the scheduler's inject queue when tasks are submitted from multiple external threads, which is crucial for identifying and optimizing potential bottlenecks in highly concurrent applications.

Highlights

  • New Benchmark Added: A new benchmark named remote_spawn has been introduced to measure contention on the Tokio runtime's inject queue mutex. This benchmark specifically targets scenarios where multiple external (non-worker) threads concurrently spawn tasks into the runtime.
  • Contention Measurement: The benchmark is designed to isolate and quantify the overhead of push_remote_task by using trivial no-op tasks and synchronizing external threads with a barrier to maximize contention at various parallelism levels (1, 2, 4, 8, 16, 32, 64 threads).
Changelog
  • benches/Cargo.toml
    • Added a new [[bench]] entry to register the remote_spawn benchmark.
  • benches/remote_spawn.rs
    • Added a new benchmark file remote_spawn.rs containing the implementation for measuring remote task spawning contention.
Activity
  • The pull request was created and is awaiting initial review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Mar 3, 2026

Walkthrough

This pull request introduces a new benchmark for measuring task spawning contention. The benches/Cargo.toml configuration is updated to register a new benchmark target named "remote_spawn". A corresponding benchmark file, benches/remote_spawn.rs, is created to measure the overhead of spawning tasks remotely from external threads into a Tokio multi-threaded runtime. The benchmark iterates across various thread counts (1, 2, 4, 8, 16, 32, 64 up to available parallelism), spawning trivial async tasks and synchronizing threads to maximize contention during measurement.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmark, remote_spawn, to measure contention on the scheduler's inject queue. The implementation is sound, but I've identified an opportunity to improve the accuracy of the benchmark measurement. My feedback focuses on isolating the task spawning time from the task execution time to better align the benchmark with its stated goal.

Comment on lines +36 to +67
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});

Severity: medium

The current implementation with b.iter measures the time for both spawning tasks and running them to completion. To more accurately measure the inject queue contention during spawning, it's better to isolate the spawning phase from the task execution/cleanup phase. Using b.iter_custom allows for manual timer control, which will exclude the cleanup phase from the measurement and provide a more precise result for the spawning overhead.

                b.iter_custom(|iters| {
                    let mut total_duration = std::time::Duration::ZERO;
                    for _ in 0..iters {
                        let barrier = Barrier::new(num_threads);

                        let start = std::time::Instant::now();

                        let all_handles = std::thread::scope(|s| {
                            let handles: Vec<_> = (0..num_threads)
                                .map(|_| {
                                    let barrier = &barrier;
                                    let rt = &rt;
                                    s.spawn(move || {
                                        let mut join_handles =
                                            Vec::with_capacity(tasks_per_thread);
                                        barrier.wait();

                                        for _ in 0..tasks_per_thread {
                                            join_handles.push(rt.spawn(async {}));
                                        }
                                        join_handles
                                    })
                                })
                                .collect();

                            handles
                                .into_iter()
                                .flat_map(|h| h.join().unwrap())
                                .collect::<Vec<_>>()
                        });

                        total_duration += start.elapsed();

                        // Cleanup: wait for all tasks to complete before the next iteration.
                        rt.block_on(async {
                            for h in all_handles {
                                h.await.unwrap();
                            }
                        });
                    }
                    total_duration
                });

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

@claude

claude bot commented Mar 3, 2026

PR Review: bench: add remote_spawn benchmark for inject queue contention

Overall this is a well-motivated and clearly documented benchmark. The goal of measuring inject-queue mutex contention at varying concurrency levels is distinct enough from the existing spawn_many_remote_idle benchmark to be worth adding. The code is readable and follows the repo's existing Criterion patterns. A few issues are worth addressing before merging.


Issues

1. Barrier allocated inside b.iter() adds noise to measured time

b.iter(|| {
    let barrier = Barrier::new(num_threads);  // allocation every iteration
    ...
});

std::sync::Barrier resets automatically after all threads pass through, so it can be created once outside b.iter() and reused across iterations. Allocating it inside the hot loop adds overhead that is not part of what is being measured.

Suggestion: Move it outside b.iter(). Since std::thread::scope borrows by reference, this works cleanly:

let barrier = Barrier::new(num_threads);
b.iter(|| {
    std::thread::scope(|s| {
        // borrow &barrier as before
    });
});

2. No fixed worker_threads count makes results machine-dependent

fn rt() -> Runtime {
    runtime::Builder::new_multi_thread()
        .enable_all()   // worker_threads not set -- uses system default
        .build()
        .unwrap()
}

Without a fixed worker_threads count, results vary by machine. Other benchmarks in this directory (e.g., rt_multi_threaded.rs) pin NUM_WORKERS = 4. For a contention benchmark, the ratio of spawning threads to worker threads matters significantly -- leaving it as the system default makes cross-machine comparisons unreliable.

Suggestion: Either pin a specific worker count (e.g., 4) or document explicitly that the variable count is intentional and why.

3. enable_all() starts unneeded IO/timer drivers

For no-op task spawning that only exercises the inject queue, enable_all() brings in IO and timer drivers that are never used. Plain .build() (without enable_all()) is sufficient and reduces setup noise. Compare with spawn.rs which uses new_multi_thread().worker_threads(1).build() for a similar purpose.
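Both points 2 and 3 are one-line builder changes. A possible `rt()` is sketched below; the worker count of 4 is an assumption mirroring the `NUM_WORKERS` convention the review mentions from rt_multi_threaded.rs, not a value chosen in this PR:

```rust
use tokio::runtime::{self, Runtime};

/// Build the benchmark runtime with a pinned worker count and no
/// IO/timer drivers, so only the scheduler paths under test are active.
fn rt() -> Runtime {
    runtime::Builder::new_multi_thread()
        .worker_threads(4) // pinned for cross-machine comparability (assumed value)
        // no .enable_all(): the no-op tasks never touch IO or timers
        .build()
        .unwrap()
}
```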

4. Runtime recreated per parallelism variant (minor)

The rt() call is inside bench_with_input but outside b.iter(), so a new runtime is created for each parallelism level. While this does not pollute the measured time, a single runtime shared across all variants would be more efficient. The existing rt_multi_spawn_many_remote_idle creates the runtime once for the whole function.


Minor Notes

  • Compile-time assertion for TOTAL_TASKS: The comment "Must be divisible by the largest parallelism level (64)" is helpful, but a compile-time assertion would enforce this automatically:
    const _: () = assert!(TOTAL_TASKS % 64 == 0, "TOTAL_TASKS must be divisible by 64");
  • Using std::thread::scope to borrow barrier and rt by reference (avoiding Arc) is clean and idiomatic.
  • The module-level doc comment is thorough and clear -- good practice for benchmarks.

Summary

| Category | Assessment |
| --- | --- |
| Concept | Good -- fills a real gap in existing benchmarks |
| Documentation | Good |
| Correctness | Correct; barrier synchronisation logic is sound |
| Measurement accuracy | Needs improvement -- Barrier in hot loop, variable runtime config |
| Style | Consistent with repo conventions |

Addressing points 1-3 (move Barrier outside iter, fix worker count, drop enable_all) would make this a solid addition.

@augmentcode

augmentcode bot commented Mar 3, 2026

🤖 Augment PR Summary

Summary: Adds a new Criterion benchmark to study contention when spawning tasks into Tokio from non-worker threads.

Changes: Registers a new remote_spawn bench target and introduces benches/remote_spawn.rs.

The benchmark builds a multi-thread runtime, starts N external OS threads (1–64, capped by available_parallelism), and synchronizes them with a barrier.

Each thread spawns TOTAL_TASKS / N no-op tasks via rt.spawn to exercise push_remote_task / the inject queue.

After spawning, the benchmark awaits all task JoinHandles to keep iterations self-contained.

Goal: Provide a micro-benchmark for inject-queue mutex contention under concurrent remote spawns.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


Comment augment review to trigger a new review at any time.

b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {


b.iter currently includes creating/joining num_threads OS threads each iteration (std::thread::scope), which can dominate timings and obscure inject-queue contention. Consider separating thread setup from the timed region if the goal is to measure push_remote_task overhead.

Severity: medium

Fix This in Augment


Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {


The timed region also includes collecting all JoinHandles and awaiting them via rt.block_on, so results reflect more than just remote-spawn/inject-queue work. If this benchmark is intended to isolate inject-queue contention, clarifying/controlling this extra work would help interpretation.

Severity: medium



Owner Author


value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
benches/remote_spawn.rs (1)

34-35: Add an explicit divisibility invariant for TOTAL_TASKS.

This avoids silent workload drift if either constant changes later.

Suggested patch
                 let rt = rt();
                 let tasks_per_thread = TOTAL_TASKS / num_threads;
+                assert_eq!(
+                    TOTAL_TASKS % num_threads,
+                    0,
+                    "TOTAL_TASKS must be divisible by num_threads"
+                );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benches/remote_spawn.rs` around lines 34 - 35, Add an explicit runtime (or
debug) assertion that TOTAL_TASKS is divisible by num_threads immediately before
computing tasks_per_thread to prevent silent workload drift; e.g., in
benches/remote_spawn.rs near the tasks_per_thread calculation, add an
assert!(TOTAL_TASKS % num_threads == 0, "TOTAL_TASKS must be divisible by
num_threads") (or debug_assert! if you prefer non-production checks) so the
invariant involving TOTAL_TASKS and num_threads is enforced before let
tasks_per_thread = TOTAL_TASKS / num_threads.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f1a84 and ce47f1a.

📒 Files selected for processing (2)
  • benches/Cargo.toml
  • benches/remote_spawn.rs

Comment on lines +36 to +67
b.iter(|| {
    let barrier = Barrier::new(num_threads);

    std::thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                let barrier = &barrier;
                let rt = &rt;
                s.spawn(move || {
                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
                    barrier.wait();

                    for _ in 0..tasks_per_thread {
                        join_handles.push(rt.spawn(async {}));
                    }
                    join_handles
                })
            })
            .collect();

        let all_handles: Vec<_> = handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        rt.block_on(async {
            for h in all_handles {
                h.await.unwrap();
            }
        });
    });
});


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see if the file exists and read the relevant section
cat -n benches/remote_spawn.rs | head -80

Repository: martin-augment/tokio

Length of output: 3784


Measured region currently mixes thread lifecycle cost with spawn contention.

The code structure confirms the issue: std::thread::scope, s.spawn(...), and join() (lines 39–58) are all inside the b.iter() closure, so each sample includes OS thread creation and teardown overhead. Per the benchmark's stated goal (measuring inject-queue mutex contention), this conflates orthogonal costs. Consider moving thread creation outside b.iter() and measuring only the task enqueue and await cycle to isolate inject-queue contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benches/remote_spawn.rs` around lines 36 - 67, The benchmark currently
recreates OS threads inside b.iter(), conflating thread lifecycle with spawn
contention; refactor so thread creation (std::thread::scope / s.spawn / join)
happens once outside b.iter() and each spawned thread runs a per-iteration loop
that waits on a shared Barrier and performs the rt.spawn(...) tasks, sending
join handles back to the main bench loop; then inside b.iter() only trigger the
Barrier to start the iteration, collect the handles produced by worker threads,
and use rt.block_on to await them—this isolates the inject-queue contention
measurement while still using the existing Barrier, rt.spawn, and rt.block_on
logic.
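The restructuring described above can be sketched with plain std primitives, leaving Tokio out so the shape is easy to see. In this illustrative model (the `run` helper and the summing "work" are invented stand-ins for the per-iteration `rt.spawn` loop), producer threads are created once, each timed iteration is started by a shared barrier, and results flow back over a channel:

```rust
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;

/// Persistent producer threads: created once, reused for every timed
/// iteration. Each iteration starts when all parties reach the barrier
/// and ends once every producer has reported on the channel.
fn run(num_threads: usize, iterations: usize) -> usize {
    // +1 so the measuring thread participates in the start barrier.
    let barrier = Arc::new(Barrier::new(num_threads + 1));
    let (tx, rx) = mpsc::channel::<usize>();

    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let barrier = Arc::clone(&barrier);
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..iterations {
                    barrier.wait(); // timed region would start here
                    // Stand-in for the per-iteration spawn loop.
                    let work: usize = (0..100).sum();
                    tx.send(work).unwrap();
                }
            })
        })
        .collect();
    drop(tx);

    let mut grand_total = 0;
    for _ in 0..iterations {
        barrier.wait(); // release all producers at once
        for _ in 0..num_threads {
            grand_total += rx.recv().unwrap();
        }
    }
    for w in workers {
        w.join().unwrap();
    }
    grand_total
}

fn main() {
    // Each worker contributes 0 + 1 + ... + 99 = 4950 per iteration.
    assert_eq!(run(4, 3), 4 * 3 * 4950);
    println!("ok");
}
```

In the real benchmark, the barrier release and the final `recv` would bracket the timed region, keeping thread creation and teardown out of every sample.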

Owner Author


value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

@martin-augment
Owner Author

1. Barrier allocated inside b.iter() adds noise to measured time

b.iter(|| {
    let barrier = Barrier::new(num_threads);  // allocation every iteration
    ...
});

std::sync::Barrier resets automatically after all threads pass through, so it can be created once outside b.iter() and reused across iterations. Allocating it inside the hot loop adds overhead that is not part of what is being measured.

Suggestion: Move it outside b.iter(). Since std::thread::scope borrows by reference, this works cleanly:

let barrier = Barrier::new(num_threads);
b.iter(|| {
    std::thread::scope(|s| {
        // borrow &barrier as before
    });
});

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The Barrier construction could be moved out of the loop to avoid measuring it. It won't change the execution time in a noticeable way but it will make it clear that it is not part of the measured logic.

@martin-augment
Owner Author

3. enable_all() starts unneeded IO/timer drivers

For no-op task spawning that only exercises the inject queue, enable_all() brings in IO and timer drivers that are never used. Plain .build() (without enable_all()) is sufficient and reduces setup noise. Compare with spawn.rs which uses new_multi_thread().worker_threads(1).build() for a similar purpose.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! There is no need to enable functionalities which are not really needed by the benchmark test. Prevents introducing random noise to the benchmark by background functionalities which are not needed.
