
fix(runtime/smol): eliminate lost-wakeup race and mutex-poison panic #790

Closed
nightness wants to merge 11 commits into webrtc-rs:master from Brainwires:fix/smol-runtime-correctness

Conversation

@nightness

Summary

Two correctness bugs in the smol runtime backend:

1. SmolJoinHandle — mutex poison panic

detach() and abort() called .lock().unwrap() on a std::sync::Mutex. If another thread had panicked while holding the lock, the unwrap() would itself panic on the poison error; in a cleanup path that runs during unwinding, that second panic aborts the process. Fixed with .lock().unwrap_or_else(|e| e.into_inner()), which recovers the guard even from a poisoned mutex.
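A minimal sketch of the recovery pattern (the `Inner` struct and its field are illustrative stand-ins, not the actual SmolJoinHandle internals):

```rust
use std::sync::Mutex;

struct Inner {
    task: Option<String>, // stand-in for the stored smol task handle
}

fn detach(inner: &Mutex<Inner>) {
    // `lock()` returns Err only if another thread panicked while holding the
    // lock. `into_inner()` recovers the guard from the poison error, so this
    // cleanup path proceeds instead of panicking again and aborting.
    let mut guard = inner.lock().unwrap_or_else(|e| e.into_inner());
    guard.task.take(); // the real code would detach/abort the task here
}
```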

2. SmolNotify — lost-wakeup race (permanent task hang)

The previous implementation used try_lock() in notify_one() / notify_waiters(). If the lock was held by a concurrent notified() call registering itself as a waiter, the notification was silently discarded — causing any task waiting on notified() to hang permanently.

New design adds an AtomicBool "pending" flag:

  • notify_* sets the flag (Release) before attempting to acquire the waiter list — so even if the lock is contended, a concurrent notified() will see the flag.
  • notified() checks the flag (AcqRel swap) before and after acquiring the lock, so it cannot miss a notification that arrived between the two checks.
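The two bullets above can be sketched as follows; the struct and method names are illustrative, not the exact smol backend internals:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::task::Waker;

struct Notify {
    pending: AtomicBool,
    waiters: Mutex<Vec<Waker>>,
}

impl Notify {
    fn notify_one(&self) {
        // Set the flag with Release BEFORE touching the waiter list: even if
        // a concurrent notified() holds the lock, it will observe the flag.
        self.pending.store(true, Ordering::Release);
        if let Ok(mut waiters) = self.waiters.try_lock() {
            if let Some(w) = waiters.pop() {
                w.wake();
            }
        }
        // If try_lock failed, the registering task re-checks the flag after
        // registering, so the notification is not lost.
    }

    // Consume a pending notification exactly once.
    fn try_consume(&self) -> bool {
        self.pending.swap(false, Ordering::AcqRel)
    }

    // One poll of notified(): double-checked flag around waker registration.
    fn poll_notified(&self, waker: &Waker) -> bool {
        if self.try_consume() {
            return true; // fast path: a notification was already pending
        }
        self.waiters
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .push(waker.clone());
        // Re-check: a notify_* that lost the try_lock race above has still
        // set the flag, so it cannot be missed here.
        self.try_consume()
    }
}
```

The key invariant: the flag is always set before the lock is contended, and always re-checked after the lock is released, so no interleaving can drop a notification.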

Also eliminates the duplicated AsyncNotify impl body by delegating to the SmolNotify inherent methods.

Test Plan

  • cargo build (tokio, default) passes
  • cargo build --no-default-features --features runtime-smol passes

🤖 Generated with Claude Code

nightness and others added 11 commits April 1, 2026 04:15
Without a timeout, `resolve_host` in `gather_from_stun_server` could block
indefinitely on a misconfigured or unreachable DNS server, hanging ICE
gathering. STUN response timeouts are already handled by the Sans-IO
StunClient retry loop (7 attempts × 300ms RTO), so no separate fix is
needed for webrtc-rs#778.

Closes webrtc-rs#774

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
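The bounding idea can be sketched with only std; the real resolve_host is async, and the helper name and STUN default port 3478 here are assumptions for illustration, not the actual change:

```rust
use std::net::{SocketAddr, ToSocketAddrs};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run the (potentially blocking) lookup on a helper thread and give up
// after `timeout`, so an unreachable DNS server cannot hang gathering.
fn resolve_with_timeout(host: String, timeout: Duration) -> Option<SocketAddr> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let addr = (host.as_str(), 3478u16)
            .to_socket_addrs()
            .ok()
            .and_then(|mut addrs| addrs.next());
        let _ = tx.send(addr);
    });
    // recv_timeout returns Err on timeout; the lookup thread is abandoned.
    rx.recv_timeout(timeout).ok().flatten()
}
```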
…#784)

Adds tests/datachannel_video_interop.rs verifying that an RTCPeerConnection
with both a video transceiver and a data channel negotiates successfully:
  - Offer SDP contains both m=video (active, with codecs) and m=application
  - Answer SDP contains both m-lines
  - ICE + DTLS + SCTP establish and data channel message is exchanged

Key finding: the feature works correctly in master when default codecs are
registered on both sides. Users must call register_default_codecs() on the
MediaEngine for video m-lines to have non-rejected codecs in the offer.

Closes webrtc-rs#784

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shows that a DataChannel and a video transceiver coexist on the same
PeerConnection. The key callout — prominently documented in the module
doc — is that MediaEngine::register_default_codecs() must be called;
without it the offer emits m=video 0 (rejected), which misleads users
into thinking mixed DC+video is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Re-export MulticastDnsMode from src/peer_connection/mod.rs
- Add PeerConnectionBuilder::with_mdns_mode() so callers explicitly
  enable the multicast socket alongside with_setting_engine(); no
  submodule changes required
- When mode != Disabled, PeerConnectionImpl::new() creates a multicast
  UDP socket via rtc_mdns::MulticastSocket and adds it to the driver's
  socket map; existing non-ICE routing forwards mDNS packets to the core
- Add examples/mdns-local-peers/ demonstrating two in-process peers
  using QueryAndGather mode without a STUN server

Closes webrtc-rs#782

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-rs#781)

- Add AsyncTcpListener + AsyncTcpStream traits to Runtime abstraction;
  implement for both Tokio and smol runtimes
- Add tokio io-util feature for AsyncReadExt/AsyncWriteExt
- Add rtc-shared direct dependency for tcp_framing utilities
- PeerConnectionImpl::new() binds TCP passive listeners from with_tcp_addrs()
  and passes them to the driver
- RTCIceGatherer emits tcptype passive host candidates for each TCP listener
  alongside existing UDP host candidates
- PeerConnectionDriver:
  - Spawns an accept loop per TCP listener; registers accepted connections
  - Per connection: dedicated read task (RFC 4571 decode → tcp_inbound_rx)
    and write task (tcp_write_txs channel → RFC 4571 encode)
  - select! loop handles tcp_new_conn_rx and tcp_inbound_rx alongside UDP
  - handle_write() routes TCP packets to write channels; dials out via
    runtime.connect_tcp() on first packet to an unknown peer (active mode)
- Add examples/ice-tcp/ demonstrating two in-process peers over TCP only

Closes webrtc-rs#781

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
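The RFC 4571 framing the per-connection read and write tasks perform reduces to a 2-byte big-endian length prefix. A minimal std-only sketch of the scheme (the real helpers are the tcp_framing utilities in rtc-shared; these function names are illustrative):

```rust
// Frame = 16-bit big-endian LENGTH followed by LENGTH payload bytes.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let len = payload.len() as u16;
    let mut out = Vec::with_capacity(2 + payload.len());
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(payload);
    out
}

// Returns (payload, remaining bytes), or None if the buffer does not yet
// hold a complete frame (the read task would wait for more TCP data).
fn decode_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 2 {
        return None;
    }
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    if buf.len() < 2 + len {
        return None;
    }
    Some((&buf[2..2 + len], &buf[2 + len..]))
}
```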
SmolJoinHandle: replace .lock().unwrap() with .lock()
.unwrap_or_else(|e| e.into_inner()) in both detach() and abort().
Recovering from a poisoned mutex is safe in cleanup paths; panicking
would cause a double-panic and abort the process.

SmolNotify: rewrite to eliminate the lost-wakeup race introduced by
using try_lock() in notify_one() / notify_waiters().

Previous design: try_lock() would silently discard a notification if
the lock was held by a concurrent notified() call, permanently hanging
any task that was registering itself as a waiter at that instant.

New design uses an AtomicBool "pending" flag:
- notify_*: sets the flag (Release) BEFORE attempting to acquire the
  waiter list, so any concurrent notified() will see the flag even if
  the lock is still held.
- notified(): checks the flag (AcqRel swap) before AND after acquiring
  the lock, so it cannot miss a notification that arrived between the
  two checks.

Also eliminates the duplicated AsyncNotify impl body by delegating to
the SmolNotify inherent methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nightness
Author

Superseded by #795 which consolidates all driver robustness fixes into a single PR.

@nightness
Author

Superseded by #795 which is a properly scoped PR for these driver/runtime fixes.

@nightness nightness closed this Apr 1, 2026
