fix(stream): avoid removing peer when other connections remain#538
Conversation
|
cc: @marcus-pousette when you get a chance to review 🙏 |
|
Hey! Sorry for a late reply. I have been quite busy over past weeks. Super happy you found this. Seems to be a real bug, but I have been struggling a bit back and forth with libp2p and how it managed connections, and it has been quite a journey to get it working well. From what it looks like you change make connection handling better, but let's see also if this in the future would have other side-effects the testing suite does not cover. I will merge this and do a release! |
marcus-pousette
left a comment
There was a problem hiding this comment.
Thank you, and I ran the tests and it worked
|
bootstrap servers might still be affected by this bug, will update them as soon as possible. If you are running your own relay server you can try it out immedaitely. |
I was having massive amounts of trouble getting non-flaky (consistent) WebRTC only connections browser-to-browser with Peerbit. I noticed that about 20-50% of the time replication would work flawlessly (aka I would connect to Peer A via Peer B Browser and see the data reflected in Peer B. If I added/modified data in Peer A it would show up immediately in Peer B). However most of the other times it would initially load the data and then...never load any updates again.
I created an anchoring test and let an AI Agent add instrumentation to every single connection and log within the stack to identify where the data was being dropped and why replication happened sometimes and not others and it arrived on the solution below which now makes all my tests pass and my app now performs flawlessly.
tl;dr
What fixed the flake (core change)
Summary
Problem
When a peer has multiple connections (e.g. relay + direct), a single connection close can trigger DirectStream peer removal. This cascades into pubsub unsubscribe and stops replication even though another connection is still alive.
Fix
If the connection manager still has other connections for the peer (or the disconnect event cannot be matched to a tracked connection), skip removing the peer. Only remove when there is a single tracked connection and it matches the disconnect.
Notes
Testing
Edit: I am happy to share the code I've been working on to demonstrate that it does work. You might know the codebase better for this test in any case so I of course will allow edits by maintainers on this PR :)