fix(replay, rpc): make SlotTracker & EpockTracker thread safe #1254
Open
fix(replay, rpc): make SlotTracker & EpockTracker thread safe #1254
Conversation
14ec9a9 to
2502875
Compare
5f97a25 to
b8b2c68
Compare
b8b2c68 to
f4762ca
Compare
prestonsn
commented
Feb 26, 2026
avoid potential leak
5f14743 to
35419f7
Compare
…ix UAF in getSlotAncestorsPtr UnrootedEpochBuffer.get() was returning a raw pointer without incrementing the reference count, unlike RootedEpochBuffer.get(). This meant callers going through getEpochInfoNoOffset() on the unrooted path would release a ref that was never acquired. Also fix a use-after-free in vote_listener's getSlotAncestorsPtr where a SlotTracker.Reference was released (via defer) before the caller used the returned ancestors pointer. Changed to return the Reference directly so the caller controls the lifetime. Renamed get() to getEpochInfoRef() on both buffers to make the RC contract explicit, and added matching release() calls in tests.
35419f7 to
e8ad904
Compare
kprotty
reviewed
Feb 27, 2026
d2b4625 to
deab244
Compare
kprotty
approved these changes
Feb 27, 2026
yewman
reviewed
Mar 2, 2026
| root: Atomic(Epoch) = .init(0), | ||
|
|
||
| fn deinit(self: *const RootedEpochBuffer, allocator: Allocator) void { | ||
| fn deinit(self: *const RootedEpochBuffer, _: Allocator) void { |
Contributor
There was a problem hiding this comment.
If no longer using the allocator, remove the arg
| pub const MAX_FORKS = 4; | ||
|
|
||
| fn deinit(self: *const UnrootedEpochBuffer, allocator: Allocator) void { | ||
| fn deinit(self: *const UnrootedEpochBuffer, _: Allocator) void { |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds
RwMuxand reference counting toSlotTrackerandEpochTrackerto allow for safe concurrent access from the RPC thread. This fixes data races currently present due to the addition of the RPC server thread now accessing data. One should note that the data races are very rare. SlotTracker prunes unrooted slots (which the RPC thread never has an interest in), and the EpochTracker only deinits on epoch boundaries (likelihood of this causing a data race with an RPC request is hence very low).Reference counting was necessary for correctness with
SlotTrackersince methods on the tracker would hand out references intoElement(s) by visiting the top-levelslotshashmap. There is a code-path where replay callspruneNonRooted()while the RPC thread could be accessing that prunedElement, invalidating its pointers. This should be quite rare (only pruning non-rooted combined with RPC not being interested in these slots, but it's still a potential foot gun).This PR adds an
RwMuxaround that map access, and adds reference counting to theReferencewrapper type that is handed out. Consumers call.release()when they're done with their borrow. This does mean thatElementnow holds onto theAllocatorso that it can be destroyed on either the replay or RPC thread safely.For
EpochTracker,, the same pattern applies torooted_epochs. The replay thread writes newEpochInfoentries when epochs are rooted (once every ~2 days on mainnet), while the RPC thread reads from it viagetVoteAccountsviagetEpochInfo(). There is a race where replay'sinsert()overwrites a buffer slot (callingdeinit+destroyon the oldEpochInfo) while the RPC thread holds a pointer into thatEpochInfo's stakes data.This PR wraps
rooted_epochsin anRwMuxand adds aReferenceCounterplus a stored Allocator toEpochInfo.get()increments the RC under the read lock, callers use the pointer freely without holding any lock, and callrelease()when done-if the RC hits zero, theEpochInfodeinits and frees itself.insert()overwriting an old entry calls release() instead of directly destroying it, so the entry stays alive if another thread still holds a reference.getLeaderSchedules()returns a newLeaderSchedulesWithEpochInfoswrapper that bundles the schedule with its RC'dEpochInforeferences, requiring callers to callrelease()when done.Minor potential leak site fix
In
SlotTracker.put(), there's a potential leak due to not properly de allocating clobberedElements ifput()is called with a slot that already has anElementin theslotsmap. This isn't possible given the access pattern, but is still a potential leak site if things change in the future. Fixed in this PR.Testing
Ran sig on testnet a few times, re-checked available RPCs. Did not see sig fall behind the network. slot replay timings seem around the same as before from the logs (likely since most of these accesses through the
RwMuxare uncontended 99.99% of the time).