7983: rt: (io-uring) ensure fds are closed when invoking cancel op #80
martin-augment wants to merge 9 commits into master from
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical resource management issue within the io_uring driver, where cancelling an in-flight open operation could leak the returned file descriptor.
Highlights
Walkthrough
The changes modify the io_uring driver's cancellation handling and add a test to verify proper file descriptor cleanup. In the driver code, when a cancelled operation is detected with a successful open result, the returned file descriptor is closed before the slab entry is removed.
Code Review
This pull request aims to fix a file descriptor leak when an io_uring open operation is cancelled. It does so by checking for cancelled Open operations and explicitly closing the file descriptor if the operation completed successfully before cancellation was processed. A new test is added to verify this behavior.
While the intention is correct, the implementation introduces a regression. The change to specifically handle CancelData::Open in a match arm will cause other cancelled operations (like Read or Write) to hit a panic!, as they will fall into the Some(other) catch-all arm. I've provided a suggestion to fix this by handling all Cancelled operations in a single arm and then checking for the Open variant inside it. I've also included a small stylistic improvement to make the intentional drop of the OwnedFd more explicit.
Some(Lifecycle::Cancelled(CancelData::Open(_))) => {
    if let Ok(fd) = CqeResult::from(cqe).result {
        // SAFETY: the successful CQE result provides
        // a non-negative integer, and the event is
        // related to an open operation.
        unsafe { OwnedFd::from_raw_fd(fd as i32) };
    }
    // Op future was cancelled, so we discard the result.
    // We just remove the entry from the slab.
    ops.remove(idx);
}
This change correctly handles closing file descriptors for cancelled open operations, but it introduces a regression. By matching only on Lifecycle::Cancelled(CancelData::Open(_)), any other cancelled operation (e.g., Read, Write) will now fall through to the Some(other) case and cause a panic. The previous implementation correctly handled all Cancelled variants.
To fix this, you should handle all Lifecycle::Cancelled variants in one arm and then check if it's an Open operation inside.
Additionally, using let _ = ... makes it more explicit that you are intentionally creating and dropping the OwnedFd to close the file descriptor.
Some(Lifecycle::Cancelled(cancel_data)) => {
    if matches!(cancel_data, CancelData::Open(_)) {
        if let Ok(fd) = CqeResult::from(cqe).result {
            // SAFETY: the successful CQE result provides
            // a non-negative integer, and the event is
            // related to an open operation.
            let _ = unsafe { OwnedFd::from_raw_fd(fd as i32) };
        }
    }
    // Op future was cancelled, so we discard the result.
    ops.remove(idx);
}
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Before the proposed changes, any cancellation led to the entry being removed from the list of operations; with the proposed changes, only cancelled Open operations are handled. The suggested fix prevents leaking operation entries for all but the Open operations.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
        // a non-negative integer, and the event is
        // related to an open operation.
        unsafe { OwnedFd::from_raw_fd(fd as i32) };
    }
Cancelled Read/Write operations cause runtime panic
High Severity
The previous code matched all Cancelled(_) variants in dispatch_completions, gracefully removing them from the slab. The new code only matches Cancelled(CancelData::Open(_)), so Cancelled(CancelData::Read(_)) and Cancelled(CancelData::Write(_)) now fall through to the Some(other) arm which panics. Since Op::drop calls cancel_op for any polled Read or Write operation, this will crash the runtime when the kernel returns a CQE for a cancelled read or write.
Additional Locations (1)
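The fall-through described above can be sketched with a minimal, hypothetical mirror of the driver's state machine. The real `Lifecycle` and `CancelData` types live in tokio's uring driver; these mocks only reproduce the match-arm shapes so the regression and the fix can be compared side by side.

```rust
/// Hypothetical stand-ins for the driver's types; payloads elided.
#[derive(Debug)]
enum CancelData {
    Open(()),
    Read(()),
    Write(()),
}

#[derive(Debug)]
enum Lifecycle {
    Cancelled(CancelData),
}

/// Mirrors the narrowed match arm: only Cancelled(Open) is handled,
/// everything else falls through to a panic.
fn dispatch_narrowed(state: Lifecycle) -> bool {
    match state {
        Lifecycle::Cancelled(CancelData::Open(_)) => true, // fd cleanup, then discard
        other => panic!("unexpected lifecycle: {:?}", other),
    }
}

/// Mirrors the fix: match every Cancelled variant, branch on Open inside,
/// and discard the entry for every cancelled op.
fn dispatch_fixed(state: Lifecycle) -> bool {
    match state {
        Lifecycle::Cancelled(cancel_data) => {
            if matches!(cancel_data, CancelData::Open(_)) {
                // close the otherwise-leaked fd here
            }
            true // entry removed from the slab in all cancelled cases
        }
    }
}
```

With the narrowed arm, a cancelled read completion panics; with the fixed arm, every cancelled variant is discarded gracefully.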
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! Before the proposed changes, any cancellation led to the entry being removed from the list of operations; with the proposed changes, only cancelled Open operations are handled. The suggested fix prevents leaking operation entries for all but the Open operations.
🤖 Augment PR Summary
Summary: This PR tightens cancellation handling for io-uring “open” operations to avoid leaking file descriptors when an in-flight open is cancelled. Changes:
Technical Notes: The test is motivated by tokio issue tokio-rs#7979 and aims to ensure cancelled open CQEs don’t leave unowned fds behind.
      *ops.get_mut(idx).unwrap() = Lifecycle::Completed(cqe);
  }
- Some(Lifecycle::Cancelled(_)) => {
+ Some(Lifecycle::Cancelled(CancelData::Open(_))) => {
CancelData also has Read/Write variants; with this match only handling Cancelled(Open(_)), any cancelled read/write that later completes will now hit the Some(other) arm and panic the runtime. That seems like a functional regression from the previous Cancelled(_) handling (remove-and-discard).
Severity: high
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Before the proposed changes, any cancellation led to the entry being removed from the list of operations; with the proposed changes, only cancelled Open operations are handled. The suggested fix prevents leaking operation entries for all but the Open operations.
// If io_uring is enabled (and not falling back to the thread pool),
// the first poll should return Pending.
let _pending = Box::pin(fut).poll_unpin(cx);
This stores the first poll result in _pending but never asserts it is actually Poll::Pending; if the runtime falls back (or the op completes synchronously), the test can pass without exercising the cancelled in-flight io_uring open case it’s meant to cover.
Severity: medium
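One way to make that assertion concrete, as a standard-library-only sketch: the no-op waker harness below is hypothetical and stands in for the test's `Context` (tokio's actual test would assert on the `cx` passed to `poll_fn` instead of building its own).

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Build a waker that does nothing, so a future can be polled once
/// outside a runtime.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    RawWaker::new(
        std::ptr::null(),
        &RawWakerVTable::new(clone, noop, noop, noop),
    )
}

/// Poll `fut` exactly once and report whether the first poll was Pending,
/// mirroring the assertion the test should make instead of discarding
/// the result into `_pending`.
fn first_poll_is_pending<F: Future>(fut: F) -> bool {
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    matches!(fut.as_mut().poll(&mut cx), Poll::Pending)
}
```

In the actual test this would become `assert!(Box::pin(fut).poll_unpin(cx).is_pending());` before sending on `tx`, so a synchronous Ready fails loudly instead of silently skipping the cancellation path.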
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The result of poll_unpin() is ignored on the assumption that it is Pending, but that is never asserted. Asserting it prevents the test from silently passing even when the result is not Pending.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tokio/src/runtime/io/driver/uring.rs`:
- Around line 82-88: The code only drops the fd when the slot is already
Lifecycle::Cancelled(CancelData::Open(_)) at CQE handling; you must also perform
the same cleanup when cancel_op removes a slot that is already
Lifecycle::Completed with an Open result. Update the cancel_op (and/or the
branch that handles Lifecycle::Completed(_)) to detect
Completed(CancelData::Open(_)) and, if CqeResult::from(cqe).result is Ok(fd),
call the same unsafe cleanup (OwnedFd::from_raw_fd(fd as i32)) to consume and
drop the fd; reuse the same logic used in the existing
Some(Lifecycle::Cancelled(CancelData::Open(_))) branch so successful open
results cannot leak.
- Around line 82-91: The match arm currently only handles
Some(Lifecycle::Cancelled(CancelData::Open(_))) which lets CancelData::Read and
CancelData::Write fall through and panic; change the arm to handle any
Lifecycle::Cancelled(_) so cancelled read/write ops are discarded instead of
panicking. Specifically, update the match from
Some(Lifecycle::Cancelled(CancelData::Open(_))) to
Some(Lifecycle::Cancelled(cancel_data)) (or use a wildcard
Some(Lifecycle::Cancelled(_))), then inside the block only call
CqeResult::from(cqe).result -> if it’s Ok(fd) and the cancel_data is
CancelData::Open(_) then SAFELY convert to OwnedFd::from_raw_fd(fd as i32); in
all cases call ops.remove(idx) to discard the cancelled operation.
In `@tokio/tests/fs_uring.rs`:
- Around line 154-193: The test currently ignores the result of
Box::pin(fut).poll_unpin(cx) so synchronous Ready cases bypass the intended
Pending path; change the poll site in the spawned task (the poll_fn that calls
OpenOptions::open / the Box::pin(fut).poll_unpin(cx)) to capture the Poll result
and assert it is Pending before sending on tx, ensuring the first poll is
actually Pending; after aborting the handle and awaiting cancellation, poll /
read the fd count in a short loop with a small sleep (or a bounded retry) until
it settles back to the baseline (or a documented tight bound) and only then
compute leaked and assert leaked <= 64.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ad86eeab-261b-4c1b-8497-1d2745719228
📒 Files selected for processing (2)
tokio/src/runtime/io/driver/uring.rs
tokio/tests/fs_uring.rs
Some(Lifecycle::Cancelled(CancelData::Open(_))) => {
    if let Ok(fd) = CqeResult::from(cqe).result {
        // SAFETY: the successful CQE result provides
        // a non-negative integer, and the event is
        // related to an open operation.
        unsafe { OwnedFd::from_raw_fd(fd as i32) };
    }
This still misses the post-completion cancel path.
This only handles the case where the slot is already Cancelled(Open) when the CQE is dispatched. If the CQE lands first, cancel_op later sees Lifecycle::Completed(_) and removes the slot without consuming a successful open result, so the fd can still leak. Please reuse the same cleanup when discarding a completed Open CQE.
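A hedged sketch of the cleanup both paths could share, assuming a helper like the hypothetical `consume_open_result` below; the name and signature are illustrative, not tokio's actual API.

```rust
use std::fs::File;
use std::os::fd::{FromRawFd, IntoRawFd, OwnedFd, RawFd};

/// Hypothetical helper mirroring the cleanup both code paths need:
/// adopt a raw fd returned by a successful open CQE and drop it
/// immediately so the descriptor is closed rather than leaked.
///
/// SAFETY: `fd` must be a valid descriptor owned by the caller that
/// nothing else will close.
unsafe fn consume_open_result(fd: RawFd) {
    // Adopting the fd into OwnedFd and dropping it closes the descriptor.
    drop(unsafe { OwnedFd::from_raw_fd(fd) });
}

/// Illustration only: simulate a "successful open CQE" by deliberately
/// leaking a real descriptor, then reclaiming it with the helper.
fn simulate_cancelled_open() -> RawFd {
    let raw = File::open("/dev/null").unwrap().into_raw_fd();
    unsafe { consume_open_result(raw) };
    raw
}
```

Calling this same helper from both the `Cancelled(Open)` dispatch arm and the `Completed`-then-cancelled path in `cancel_op` would close the fd regardless of which event arrives first.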
Some(Lifecycle::Cancelled(CancelData::Open(_))) => {
    if let Ok(fd) = CqeResult::from(cqe).result {
        // SAFETY: the successful CQE result provides
        // a non-negative integer, and the event is
        // related to an open operation.
        unsafe { OwnedFd::from_raw_fd(fd as i32) };
    }
    // Op future was cancelled, so we discard the result.
    // We just remove the entry from the slab.
    ops.remove(idx);
}
Don't regress cancelled read/write ops into a panic.
tokio/src/runtime/driver/op.rs still defines CancelData::Read and CancelData::Write, so narrowing this arm to Cancelled(Open) makes those completions fall through to Some(other) and panic instead of being discarded.
Suggested fix
- Some(Lifecycle::Cancelled(CancelData::Open(_))) => {
-     if let Ok(fd) = CqeResult::from(cqe).result {
-         // SAFETY: the successful CQE result provides
-         // a non-negative integer, and the event is
-         // related to an open operation.
-         unsafe { OwnedFd::from_raw_fd(fd as i32) };
-     }
+ Some(Lifecycle::Cancelled(cancel_data)) => {
+     let is_open = matches!(cancel_data, CancelData::Open(_));
+     if is_open {
+         if let Ok(fd) = CqeResult::from(cqe).result {
+             // SAFETY: the successful CQE result provides
+             // a non-negative integer, and the event is
+             // related to an open operation.
+             drop(unsafe { OwnedFd::from_raw_fd(fd as i32) });
+         }
+     }
      // Op future was cancelled, so we discard the result.
      ops.remove(idx);
  }
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! Before the proposed changes, any cancellation led to the entry being removed from the list of operations; with the proposed changes, only cancelled Open operations are handled. The suggested fix prevents leaking operation entries for all but the Open operations.
let fd_count_before_access = fs::read_dir("/proc/self/fd").unwrap().count();

for _ in 0..128 {
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();

    let path = path.clone();
    let handle = tokio::spawn(async move {
        poll_fn(|cx| {
            let opt = {
                let mut opt = tokio::fs::OpenOptions::new();
                opt.read(true);
                opt
            };

            let fut = opt.open(&path[0]);

            // If io_uring is enabled (and not falling back to the thread pool),
            // the first poll should return Pending.
            let _pending = Box::pin(fut).poll_unpin(cx);

            tx.send(()).unwrap();

            Poll::<()>::Pending
        })
        .await;
    });

    // Wait for the first poll
    rx.recv().await.unwrap();

    handle.abort();

    let res = handle.await.unwrap_err();
    assert!(res.is_cancelled());
}

let fd_count_after_cancel = fs::read_dir("/proc/self/fd").unwrap().count();
let leaked = fd_count_after_cancel.saturating_sub(fd_count_before_access);

assert!(leaked <= 64);
This regression can pass without ever exercising the bug.
poll_unpin is ignored, so a synchronous Ready (fallback / non-io_uring path) turns the loop into a no-op, and the final leaked <= 64 check runs before completions have necessarily drained. That both weakens coverage and makes the bound noisy. Assert that the first poll is Pending, then wait for the fd count to settle back to the baseline (or a tight, explained bound) before asserting.
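The "wait for the fd count to settle" suggestion could look roughly like this Linux-only sketch; the helper names are hypothetical and the retry bounds are placeholders the test would tune.

```rust
use std::fs;
use std::thread;
use std::time::Duration;

/// Count this process's open file descriptors via /proc (Linux-only).
fn open_fd_count() -> usize {
    fs::read_dir("/proc/self/fd").unwrap().count()
}

/// Bounded retry: give the runtime time to drain in-flight completions
/// before asserting on leaks, instead of sampling the fd table once.
/// Returns the last observed count.
fn wait_for_fd_settle(baseline: usize, max_retries: u32) -> usize {
    for _ in 0..max_retries {
        let current = open_fd_count();
        if current <= baseline {
            return current;
        }
        thread::sleep(Duration::from_millis(50));
    }
    open_fd_count()
}
```

The test would record `baseline` before the loop, then assert `wait_for_fd_settle(baseline, retries) <= baseline` (or a small, documented bound) after the last cancellation, which removes both the noise and the arbitrariness of the fixed `leaked <= 64` threshold.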
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The result of poll_unpin() is ignored on the assumption that it is Pending, but that is never asserted. Asserting it prevents the test from silently passing even when the result is not Pending.


7983: To review by AI