5 changes: 5 additions & 0 deletions benches/Cargo.toml
@@ -101,5 +101,10 @@ name = "spawn_blocking"
path = "spawn_blocking.rs"
harness = false

[[bench]]
name = "remote_spawn"
path = "remote_spawn.rs"
harness = false

[lints]
workspace = true
95 changes: 95 additions & 0 deletions benches/remote_spawn.rs
@@ -0,0 +1,95 @@
//! Benchmark remote task spawning (push_remote_task) at different concurrency
//! levels on the multi-threaded scheduler.
//!
//! This measures contention on the scheduler's inject queue mutex when multiple
//! external (non-worker) threads spawn tasks into the tokio runtime simultaneously.
//! Every rt.spawn() from an external thread unconditionally goes through
//! push_remote_task, making this a direct measurement of inject queue contention.
//!
//! For each parallelism level N (1, 2, 4, 8, 16, 32, 64, capped at available parallelism):
//! - Spawns N std::threads (external to the runtime)
//! - Each thread spawns TOTAL_TASKS / N tasks into the runtime via rt.spawn()
//! - All threads are synchronized with a barrier to maximize contention
//! - Tasks are trivial no-ops to isolate the push overhead

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use std::sync::Barrier;
use tokio::runtime::{self, Runtime};

/// Total number of tasks spawned across all threads per iteration.
/// Must be divisible by the largest parallelism level (64).
const TOTAL_TASKS: usize = 12_800;

fn remote_spawn_contention(c: &mut Criterion) {
let parallelism_levels = parallelism_levels();
let mut group = c.benchmark_group("remote_spawn");

for num_threads in &parallelism_levels {
let num_threads = *num_threads;
group.bench_with_input(
BenchmarkId::new("threads", num_threads),
&num_threads,
|b, &num_threads| {
let rt = rt();
let tasks_per_thread = TOTAL_TASKS / num_threads;

b.iter(|| {
let barrier = Barrier::new(num_threads);

std::thread::scope(|s| {
b.iter currently includes creating/joining num_threads OS threads each iteration (std::thread::scope), which can dominate timings and obscure inject-queue contention. Consider separating thread setup from the timed region if the goal is to measure push_remote_task overhead.

Severity: medium


Owner Author
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

let handles: Vec<_> = (0..num_threads)
.map(|_| {
let barrier = &barrier;
let rt = &rt;
s.spawn(move || {
let mut join_handles = Vec::with_capacity(tasks_per_thread);
barrier.wait();

for _ in 0..tasks_per_thread {
join_handles.push(rt.spawn(async {}));
}
join_handles
})
})
.collect();

let all_handles: Vec<_> = handles
.into_iter()
.flat_map(|h| h.join().unwrap())
.collect();

rt.block_on(async {
The timed region also includes collecting all JoinHandles and awaiting them via rt.block_on, so results reflect more than just remote-spawn/inject-queue work. If this benchmark is intended to isolate inject-queue contention, clarifying/controlling this extra work would help interpretation.

Severity: medium


Owner Author
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

for h in all_handles {
h.await.unwrap();
}
});
});
});
Comment on lines +36 to +67

medium

The current implementation with b.iter measures the time for both spawning tasks and running them to completion. To more accurately measure the inject queue contention during spawning, it's better to isolate the spawning phase from the task execution/cleanup phase. Using b.iter_custom allows for manual timer control, which will exclude the cleanup phase from the measurement and provide a more precise result for the spawning overhead.

                b.iter_custom(|iters| {
                    let mut total_duration = std::time::Duration::ZERO;
                    for _ in 0..iters {
                        let barrier = Barrier::new(num_threads);

                        let start = std::time::Instant::now();

                        let all_handles = std::thread::scope(|s| {
                            let handles: Vec<_> = (0..num_threads)
                                .map(|_| {
                                    let barrier = &barrier;
                                    let rt = &rt;
                                    s.spawn(move || {
                                        let mut join_handles =
                                            Vec::with_capacity(tasks_per_thread);
                                        barrier.wait();

                                        for _ in 0..tasks_per_thread {
                                            join_handles.push(rt.spawn(async {}));
                                        }
                                        join_handles
                                    })
                                })
                                .collect();

                            handles
                                .into_iter()
                                .flat_map(|h| h.join().unwrap())
                                .collect::<Vec<_>>()
                        });

                        total_duration += start.elapsed();

                        // Cleanup: wait for all tasks to complete before the next iteration.
                        rt.block_on(async {
                            for h in all_handles {
                                h.await.unwrap();
                            }
                        });
                    }
                    total_duration
                });

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).

Comment on lines +36 to +67

⚠️ Potential issue | 🟠 Major

Repository: martin-augment/tokio


Measured region currently mixes thread lifecycle cost with spawn contention.

The code structure confirms the issue: std::thread::scope, s.spawn(...), and join() (lines 39–58) are all inside the b.iter() closure, so each sample includes OS thread creation and teardown overhead. Per the benchmark's stated goal (measuring inject-queue mutex contention), this conflates orthogonal costs. Consider moving thread creation outside b.iter() and measuring only the task enqueue and await cycle to isolate inject-queue contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benches/remote_spawn.rs` around lines 36 - 67, The benchmark currently
recreates OS threads inside b.iter(), conflating thread lifecycle with spawn
contention; refactor so thread creation (std::thread::scope / s.spawn / join)
happens once outside b.iter() and each spawned thread runs a per-iteration loop
that waits on a shared Barrier and performs the rt.spawn(...) tasks, sending
join handles back to the main bench loop; then inside b.iter() only trigger the
Barrier to start the iteration, collect the handles produced by worker threads,
and use rt.block_on to await them—this isolates the inject-queue contention
measurement while still using the existing Barrier, rt.spawn, and rt.block_on
logic.

Owner Author

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! By using Bencher::iter_custom() the measurement could be isolated to include only the spawning and exclude the setup and teardown (awaiting).
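All three reviewers converge on the same fix: keep thread creation, joining, and handle-awaiting outside the timed window. The persistent-worker variant CodeRabbit describes can be sketched with std-only primitives. This is an illustrative sketch, not code from the PR: `measured_push_round` is a hypothetical name, and the `Mutex<Vec<usize>>` merely stands in for Tokio's mutex-guarded inject queue so the example runs without the tokio or criterion crates.

```rust
use std::sync::{Barrier, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Run one barrier-synchronized push round with `num_threads` worker threads
/// and return (total items pushed, elapsed time of the timed window only).
/// The `Mutex<Vec<usize>>` is a stand-in for the runtime's inject queue.
fn measured_push_round(num_threads: usize, tasks_per_thread: usize) -> (usize, Duration) {
    let queue = Mutex::new(Vec::<usize>::new());
    // All barriers have num_threads + 1 participants: the workers plus the
    // timing thread.
    let ready = Barrier::new(num_threads + 1);
    let start = Barrier::new(num_threads + 1);
    let done = Barrier::new(num_threads + 1);

    thread::scope(|s| {
        for _ in 0..num_threads {
            s.spawn(|| {
                ready.wait(); // signal that this worker thread exists
                start.wait(); // timed window opens
                for i in 0..tasks_per_thread {
                    queue.lock().unwrap().push(i); // the contended push
                }
                done.wait(); // timed window closes
            });
        }

        // Wait until every worker is up, so OS thread creation is excluded
        // from the measurement (the point raised by the reviewers).
        ready.wait();

        let t = Instant::now();
        start.wait(); // release all workers simultaneously
        done.wait(); // all workers have finished pushing
        let elapsed = t.elapsed();

        // Untimed verification/teardown; in the real benchmark this is where
        // iter_custom would do the rt.block_on(...) cleanup.
        (queue.lock().unwrap().len(), elapsed)
    })
}

fn main() {
    let (pushed, elapsed) = measured_push_round(4, 100);
    println!("pushed {pushed} items in {elapsed:?}");
}
```

In a criterion `iter_custom` version of this, the elapsed durations would be summed across `iters` rounds and returned, so setup and the awaiting of join handles never enter the reported samples.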

},
);
}

group.finish();
}

fn parallelism_levels() -> Vec<usize> {
let max_parallelism = std::thread::available_parallelism()
.map(|p| p.get())
.unwrap_or(1);

[1, 2, 4, 8, 16, 32, 64]
.into_iter()
.filter(|&n| n <= max_parallelism)
.collect()
}

fn rt() -> Runtime {
runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap()
}

criterion_group!(remote_spawn_benches, remote_spawn_contention);

criterion_main!(remote_spawn_benches);