Hello!
First of all, thank you so much for developing this. I really appreciate this work.
Describe the bug
All communication between a peer and the nexus router becomes blocked if the peer exposes an RPC endpoint and there exists a caller that becomes “unhealthy” (e.g., unstable/super slow connection) while invoking that RPC.
The issue occurs in the router implementation, including nexusd. I could reproduce it with various WAMP client implementations.
The issue happens if:
1. There is a peer that exposes an RPC endpoint (CALLEE).
2. There is a peer that invokes the RPC (CALLER) and waits for a response.
3. For some reason, the CALLER becomes unresponsive to nexusd (e.g., due to a bad internet connection or a bug), and nexusd starts logging messages like:
   !!! Dropped EVENT to session {id}: blocked
4. The CALLEE subsequently returns a response (YIELD) to the CALLER.
At this point, the goroutine assigned to the CALLEE, which is responsible for sending (queueing) the YIELD response to the CALLER's message queue, blocks. As a result, all communication of the CALLEE is stalled because of an unhealthy CALLER.
Relevant code:
nexus/router/dealer.go
Lines 328 to 360 in 5cfa511
nexus/router/realm.go
Lines 475 to 476 in 5cfa511
To Reproduce
go run ./server.go
bad_caller.py:
nexusd.mp4
Here, I use nexus for the callee and xconn-python for the caller, but it can be anything.
server.go
bad_caller.py
Ideally, the communication of a callee should not be affected by the state of a caller, especially in an asynchronous RPC context.
Environment (please complete the following information):
Additional context
nexusd2.mp4
Increasing OutQueueSize may hide the issue, but the value needs to be very large if nexusd is used for high-frequency communication.