fix(replay, rpc): make SlotTracker & EpochTracker thread safe #1254

Open
prestonsn wants to merge 8 commits into main from prestonsn/fix-make-trackers-thread-safe-v2

@prestonsn prestonsn commented Feb 24, 2026

Adds an RwMux and reference counting to SlotTracker and EpochTracker to allow safe concurrent access from the RPC thread. This fixes data races that were introduced when the RPC server thread started reading this data. Note that the races are very rare: SlotTracker prunes only unrooted slots (which the RPC thread never has an interest in), and EpochTracker only deinits on epoch boundaries, so the likelihood of a race with an RPC request is very low.

Reference counting was necessary for correctness with SlotTracker because its methods hand out references into Element(s) by looking them up in the top-level slots hashmap. There is a code path where replay calls pruneNonRooted() while the RPC thread could be accessing the pruned Element, invalidating its pointers. This should be quite rare (only non-rooted slots are pruned, and RPC is not interested in those slots), but it's still a potential footgun.

This PR adds an RwMux around that map access, and adds reference counting to the Reference wrapper type that is handed out. Consumers call .release() when they're done with their borrow. This does mean that Element now holds onto the Allocator so that it can be destroyed on either the replay or RPC thread safely.
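
A minimal sketch of the borrow/release pattern described above, not the code in this PR. It assumes std.Thread.RwLock as a stand-in for sig's RwMux, and the Element and Tracker types, their fields, and prune() are simplified, hypothetical versions written only for illustration.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical stand-in for a tracked slot entry (not the PR's real Element).
const Element = struct {
    /// Stored so the element can be freed from whichever thread drops it last.
    allocator: Allocator,
    /// Starts at 1: the map itself holds one reference.
    refs: std.atomic.Value(usize),
    value: u64,

    /// Drops one reference; the last holder frees the element.
    fn release(self: *Element) void {
        if (self.refs.fetchSub(1, .acq_rel) == 1) self.allocator.destroy(self);
    }
};

/// Hypothetical tracker; std.Thread.RwLock stands in for sig's RwMux.
const Tracker = struct {
    lock: std.Thread.RwLock = .{},
    slots: std.AutoHashMap(u64, *Element),

    fn init(allocator: Allocator) Tracker {
        return .{ .slots = std.AutoHashMap(u64, *Element).init(allocator) };
    }

    /// Inserts a new element; this sketch assumes the slot is not present yet.
    fn put(self: *Tracker, slot: u64, value: u64) !void {
        const allocator = self.slots.allocator;
        const elem = try allocator.create(Element);
        errdefer allocator.destroy(elem);
        elem.* = .{ .allocator = allocator, .refs = std.atomic.Value(usize).init(1), .value = value };
        self.lock.lock();
        defer self.lock.unlock();
        try self.slots.putNoClobber(slot, elem);
    }

    /// Takes a reference under the shared lock; the caller must call release().
    fn get(self: *Tracker, slot: u64) ?*Element {
        self.lock.lockShared();
        defer self.lock.unlockShared();
        const elem = self.slots.get(slot) orelse return null;
        _ = elem.refs.fetchAdd(1, .monotonic);
        return elem;
    }

    /// Removes the entry and drops only the map's own reference.
    fn prune(self: *Tracker, slot: u64) void {
        self.lock.lock();
        defer self.lock.unlock();
        if (self.slots.fetchRemove(slot)) |kv| kv.value.release();
    }

    fn deinit(self: *Tracker) void {
        var it = self.slots.valueIterator();
        while (it.next()) |elem| elem.*.release();
        self.slots.deinit();
    }
};

test "a borrowed element outlives pruning" {
    var tracker = Tracker.init(std.testing.allocator);
    defer tracker.deinit();

    try tracker.put(7, 42);
    const borrowed = tracker.get(7).?; // reader takes a reference
    tracker.prune(7); // writer prunes while the reader still holds it
    try std.testing.expectEqual(@as(u64, 42), borrowed.value); // still valid
    borrowed.release(); // last reference: the element is freed here
    try std.testing.expect(tracker.get(7) == null);
}
```

The lock is held only while the map is consulted and the count is bumped; the reader then uses the element without any lock, which is why the element has to carry its own allocator.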

For EpochTracker, the same pattern applies to rooted_epochs. The replay thread writes new EpochInfo entries when epochs are rooted (once every ~2 days on mainnet), while the RPC thread reads from it in getVoteAccounts via getEpochInfo(). There is a race where replay's insert() overwrites a buffer slot (calling deinit + destroy on the old EpochInfo) while the RPC thread holds a pointer into that EpochInfo's stakes data.

This PR wraps rooted_epochs in an RwMux and adds a ReferenceCounter plus a stored Allocator to EpochInfo. get() increments the RC under the read lock, callers use the pointer freely without holding any lock, and call release() when done. If the RC hits zero, the EpochInfo deinits and frees itself. insert() overwriting an old entry calls release() instead of directly destroying it, so the entry stays alive if another thread still holds a reference. getLeaderSchedules() returns a new LeaderSchedulesWithEpochInfos wrapper that bundles the schedule with its RC'd EpochInfo references, requiring callers to call release() when done.
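
The overwrite path can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: Info and Buffer are hypothetical simplifications of EpochInfo and the rooted buffer, the capacity of 4 and the modulo indexing are made up for the example, and std.Thread.RwLock again stands in for RwMux.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical reference-counted record standing in for EpochInfo.
const Info = struct {
    allocator: Allocator,
    refs: std.atomic.Value(usize),
    epoch: u64,

    /// Drops one reference; the last holder frees the record.
    fn release(self: *Info) void {
        if (self.refs.fetchSub(1, .acq_rel) == 1) self.allocator.destroy(self);
    }
};

/// Hypothetical fixed-size buffer; std.Thread.RwLock stands in for sig's RwMux.
const Buffer = struct {
    const capacity = 4; // assumed size, chosen only for this sketch

    lock: std.Thread.RwLock = .{},
    entries: [capacity]?*Info = [_]?*Info{null} ** capacity,

    fn insert(self: *Buffer, allocator: Allocator, epoch: u64) !void {
        const info = try allocator.create(Info);
        info.* = .{ .allocator = allocator, .refs = std.atomic.Value(usize).init(1), .epoch = epoch };

        self.lock.lock();
        defer self.lock.unlock();
        const index: usize = @intCast(epoch % capacity);
        // Drop the buffer's reference instead of destroying the old entry outright.
        if (self.entries[index]) |old| old.release();
        self.entries[index] = info;
    }

    /// Takes a reference under the shared lock; the caller must call release().
    fn get(self: *Buffer, epoch: u64) ?*Info {
        self.lock.lockShared();
        defer self.lock.unlockShared();
        const index: usize = @intCast(epoch % capacity);
        const info = self.entries[index] orelse return null;
        if (info.epoch != epoch) return null;
        _ = info.refs.fetchAdd(1, .monotonic);
        return info;
    }

    fn deinit(self: *Buffer) void {
        for (self.entries) |maybe| if (maybe) |info| info.release();
    }
};

test "overwriting a slot keeps a borrowed entry alive" {
    const allocator = std.testing.allocator;
    var buffer: Buffer = .{};
    defer buffer.deinit();

    try buffer.insert(allocator, 1);
    const held = buffer.get(1).?; // reader borrows epoch 1
    try buffer.insert(allocator, 5); // 5 % 4 == 1, so this overwrites epoch 1
    try std.testing.expectEqual(@as(u64, 1), held.epoch); // still readable
    held.release(); // last reference: the old entry is freed here
    try std.testing.expect(buffer.get(1) == null);
}
```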

Minor potential leak site fix

In SlotTracker.put(), there's a potential leak because clobbered Elements are not deallocated if put() is called with a slot that already has an Element in the slots map. This isn't possible given the current access pattern, but it is still a potential leak site if things change in the future. Fixed in this PR.
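
One way to close this kind of leak site, sketched with a plain std.AutoHashMap rather than the tracker's real types: putAndFreeOld is a hypothetical helper, and the values are bare *u64 pointers only to keep the example short. With reference-counted elements, the destroy call would presumably become a release() instead.

```zig
const std = @import("std");

/// Hypothetical helper: stores `elem` under `slot` and frees any element it replaces.
fn putAndFreeOld(
    allocator: std.mem.Allocator,
    slots: *std.AutoHashMap(u64, *u64),
    slot: u64,
    elem: *u64,
) !void {
    // fetchPut returns the previous key/value pair, if one existed.
    if (try slots.fetchPut(slot, elem)) |old| allocator.destroy(old.value);
}

test "clobbering an entry does not leak the old element" {
    const allocator = std.testing.allocator; // reports leaks when the test ends
    var slots = std.AutoHashMap(u64, *u64).init(allocator);
    defer slots.deinit();

    const first = try allocator.create(u64);
    first.* = 1;
    try putAndFreeOld(allocator, &slots, 5, first);

    const second = try allocator.create(u64);
    second.* = 2;
    try putAndFreeOld(allocator, &slots, 5, second); // frees `first`

    try std.testing.expectEqual(@as(u64, 2), slots.get(5).?.*);
    allocator.destroy(slots.get(5).?);
}
```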

Testing

Ran sig on testnet a few times and re-checked the available RPCs. Did not see sig fall behind the network. Slot replay timings in the logs look about the same as before (likely because accesses through the RwMux are uncontended 99.99% of the time).

@github-project-automation github-project-automation bot moved this to 🏗 In progress in Sig Feb 24, 2026
@prestonsn prestonsn self-assigned this Feb 24, 2026
@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch 2 times, most recently from 14ec9a9 to 2502875 Compare February 24, 2026 19:49
@prestonsn prestonsn changed the title from "fix(replay, rpc): make trackers thread safe" to "fix(replay, rpc): make SlotTracker & EpochTracker thread safe" Feb 24, 2026
codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 97.01493% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../shred_network/transmitter/shred_retransmitter.zig | 0.00% | 4 Missing ⚠️ |
| src/replay/consensus/cluster_sync.zig | 97.19% | 3 Missing ⚠️ |
| src/core/epoch_tracker.zig | 98.37% | 2 Missing ⚠️ |
| src/replay/consensus/core.zig | 98.81% | 2 Missing ⚠️ |
| src/replay/service.zig | 95.34% | 2 Missing ⚠️ |
| ...hred_network/collector/duplicate_shred_handler.zig | 0.00% | 2 Missing ⚠️ |
| src/consensus/progress_map.zig | 94.11% | 1 Missing ⚠️ |
| src/consensus/replay_tower.zig | 66.66% | 1 Missing ⚠️ |
| src/replay/rewards/calculation.zig | 0.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/consensus/fork_choice.zig | 98.49% <100.00%> (+<0.01%) ⬆️ |
| src/consensus/vote_listener.zig | 93.58% <100.00%> (+0.19%) ⬆️ |
| src/replay/consensus/process_result.zig | 92.78% <100.00%> (-0.26%) ⬇️ |
| src/replay/epoch_transitions.zig | 66.29% <100.00%> (+0.14%) ⬆️ |
| src/replay/execution.zig | 91.95% <100.00%> (+0.05%) ⬆️ |
| src/replay/trackers.zig | 98.80% <100.00%> (+0.15%) ⬆️ |
| src/replay/update_sysvar.zig | 97.36% <100.00%> (+<0.01%) ⬆️ |
| src/rpc/methods.zig | 78.12% <ø> (ø) |
| src/shred_network/collector/shred_receiver.zig | 83.78% <100.00%> (+0.17%) ⬆️ |
| src/shred_network/duplicate_shred_listener.zig | 89.74% <100.00%> (+0.07%) ⬆️ |

... and 9 more

... and 16 files with indirect coverage changes


@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch from 5f97a25 to b8b2c68 Compare February 26, 2026 16:34
@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch from b8b2c68 to f4762ca Compare February 26, 2026 16:40
@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch 2 times, most recently from 5f14743 to 35419f7 Compare February 26, 2026 23:01
…ix UAF in getSlotAncestorsPtr

UnrootedEpochBuffer.get() was returning a raw pointer without
incrementing the reference count, unlike RootedEpochBuffer.get().
This meant callers going through getEpochInfoNoOffset() on the
unrooted path would release a ref that was never acquired.

Also fix a use-after-free in vote_listener's getSlotAncestorsPtr
where a SlotTracker.Reference was released (via defer) before the
caller used the returned ancestors pointer. Changed to return the
Reference directly so the caller controls the lifetime.

Renamed get() to getEpochInfoRef() on both buffers to make the RC
contract explicit, and added matching release() calls in tests.
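
The use-after-free shape described in this commit message can be illustrated with a small sketch. Entry, acquire(), and ancestorsRef() are hypothetical names invented for the example and do not mirror the real SlotTracker.Reference API; the buggy variant is shown only as a comment.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical reference-counted entry; the names here are illustrative only.
const Entry = struct {
    allocator: Allocator,
    refs: std.atomic.Value(usize),
    ancestors: [3]u64,

    fn create(allocator: Allocator, ancestors: [3]u64) !*Entry {
        const self = try allocator.create(Entry);
        self.* = .{ .allocator = allocator, .refs = std.atomic.Value(usize).init(1), .ancestors = ancestors };
        return self;
    }

    /// Takes one more reference and returns the entry for convenience.
    fn acquire(self: *Entry) *Entry {
        _ = self.refs.fetchAdd(1, .monotonic);
        return self;
    }

    /// Drops one reference; the last holder frees the entry.
    fn release(self: *Entry) void {
        if (self.refs.fetchSub(1, .acq_rel) == 1) self.allocator.destroy(self);
    }
};

// Buggy shape, shown only as a comment: the deferred release runs when the
// function returns, so the pointer can dangle before the caller reads it.
//
//   fn ancestorsPtr(entry: *Entry) *const [3]u64 {
//       const ref = entry.acquire();
//       defer ref.release();
//       return &ref.ancestors;
//   }

/// Fixed shape: return the reference so the caller decides when to release.
fn ancestorsRef(entry: *Entry) *Entry {
    return entry.acquire();
}

test "caller controls the lifetime of the returned reference" {
    const owner = try Entry.create(std.testing.allocator, .{ 1, 2, 3 });

    const ref = ancestorsRef(owner);
    owner.release(); // e.g. the tracker drops its own reference
    // The data is still alive because `ref` has not been released yet.
    try std.testing.expectEqual(@as(u64, 2), ref.ancestors[1]);
    ref.release(); // last reference frees the entry
}
```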
@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch from 35419f7 to e8ad904 Compare February 26, 2026 23:31
@prestonsn prestonsn marked this pull request as ready for review February 27, 2026 06:23
@prestonsn prestonsn requested a review from kprotty February 27, 2026 14:52
@prestonsn prestonsn force-pushed the prestonsn/fix-make-trackers-thread-safe-v2 branch from d2b4625 to deab244 Compare February 27, 2026 19:15
@yewman yewman left a comment

Lgtm, just one small nit

  root: Atomic(Epoch) = .init(0),

- fn deinit(self: *const RootedEpochBuffer, allocator: Allocator) void {
+ fn deinit(self: *const RootedEpochBuffer, _: Allocator) void {

If no longer using the allocator, remove the arg

  pub const MAX_FORKS = 4;

- fn deinit(self: *const UnrootedEpochBuffer, allocator: Allocator) void {
+ fn deinit(self: *const UnrootedEpochBuffer, _: Allocator) void {

Same here

