gui, topo: allow_crash tile flag#8449
gui, topo: allow_crash tile flag#8449jherrera-jump wants to merge 1 commit intofiredancer-io:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces an allow_crash flag for tiles in the Firedancer topology system, enabling certain tiles to exit without tearing down the entire application. This complements the existing allow_shutdown flag by providing a mechanism for handling tiles that may exit unexpectedly (with exit code 0) while ensuring the system remains stable through proper flow control management.
Changes:
- Added
allow_crashfield to thefd_topo_tilestructure - Implemented validation to prevent reliable consumption from tiles with
allow_crash=1 - Added runtime handling to provision infinite credits to producers when an
allow_crashtile exits
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/disco/topo/fd_topo.h | Added allow_crash field to tile structure with documentation |
| src/disco/topo/fd_topob.c | Initialized allow_crash field and added validation logic during topology construction and finalization |
| src/app/shared/commands/run/run.c | Implemented runtime handling for tiles with allow_crash=1 that exit cleanly, including fseq credit provisioning |
Comments suppressed due to low confidence (1)
src/disco/topo/fd_topob.c:214
- Spelling error: "reliablaly" should be "reliably".
if( FD_UNLIKELY( topo->tiles[ producer_tile_id ].allow_crash && reliable ) ) FD_LOG_ERR(( "cannot reliablaly consume from a tile with allow_crash=1: %s:%lu ", tile_name, tile_kind_id ) );
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d70a817 to
b53c98c
Compare
b53c98c to
fb5e211
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (3)
src/disco/topo/fd_topob.c:214
- There is a spelling error in the error message. "reliablaly" should be "reliably".
if( FD_UNLIKELY( topo->tiles[ producer_tile_id ].allow_crash && reliable ) ) FD_LOG_ERR(( "cannot reliablaly consume from a tile with allow_crash=1: %s:%lu ", tile_name, tile_kind_id ) );
src/discof/replay/fd_replay_tile.c:769
- The log message on line 769 still mentions "gui" but the GUI tile no longer increments the bank refcnt based on the changes in this PR. The comment should be updated to only mention "tower, rpc" to reflect the actual consumers that increment the refcnt.
FD_LOG_DEBUG(( "bank (idx=%lu, slot=%lu) refcnt incremented to %lu for tower, rpc, gui", bank->data->idx, slot, bank->data->refcnt ));
src/disco/gui/fd_gui_peers.c:849
- The change from stable sort to inplace sort on line 849 could introduce subtle behavioral changes. Previously, when deduplicating node accounts by keeping the vote account with the largest stake, the stable sort (line 849 in old code) would preserve the relative order of elements with equal node_account values after the stake sort. With an inplace sort, if multiple vote accounts have the same node_account and stake, the order becomes undefined. While this is likely rare in practice, it could lead to non-deterministic behavior when selecting which vote account to keep for duplicates with identical stakes. Consider whether this non-determinism is acceptable or if stable sort should be retained.
fd_gui_peers_votes_pkey_sort_inplace( votes_sorted, voter_cnt );
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fb5e211 to
287f7bf
Compare
287f7bf to
31da7ad
Compare
31da7ad to
d82fd79
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
src/discof/replay/fd_replay_tile.c:791
- The log message states "refcnt incremented to %lu for tower, rpc, gui" but the GUI tile no longer holds a bank reference based on the changes in this PR (the gui_replay link was removed and the bank refcnt is no longer incremented for GUI on line 789). The log message should be updated to remove the reference to "gui" to accurately reflect which tiles hold the bank reference.
FD_LOG_DEBUG(( "bank (idx=%lu, slot=%lu) refcnt incremented to %lu for tower, rpc, gui", bank->data->idx, slot, bank->data->refcnt ));
src/disco/topo/fd_topob.c:214
- Spelling error: "reliablaly" should be "reliably"
if( FD_UNLIKELY( topo->tiles[ producer_tile_id ].allow_crash && reliable ) ) FD_LOG_ERR(( "cannot reliablaly consume from a tile with allow_crash=1: %s:%lu ", tile_name, tile_kind_id ) );
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d82fd79 to
a93c92a
Compare
a93c92a to
16520ea
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
16520ea to
929fd71
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/disco/gui/fd_gui_peers.c:941
- Memory leak in voters_map: The voters_map is populated in fd_gui_peers_stage_vote_update but never cleared after fd_gui_peers_commit_vote_updates processes the updates. This will cause the map to continuously grow with stale vote accounts that are no longer active, potentially leading to memory exhaustion over time. Consider clearing the voters_map and voters_pool after processing vote updates in fd_gui_peers_commit_vote_updates, or implementing a mechanism to remove stale entries.
fd_gui_peers_commit_vote_updates( fd_gui_peers_ctx_t * peers,
long now,
fd_pubkey_t * identity ) {
(void)now;
fd_gui_peers_voter_t * votes_sorted = peers->voters_scratch.voters1;
ulong voter_cnt = 0UL;
for( fd_gui_peers_voters_map_iter_t iter = fd_gui_peers_voters_map_iter_init( peers->voters_map, peers->voters_pool );
!fd_gui_peers_voters_map_iter_done( iter, peers->voters_map, peers->voters_pool );
iter = fd_gui_peers_voters_map_iter_next( iter, peers->voters_map, peers->voters_pool ) ) {
FD_TEST( voter_cnt<FD_RUNTIME_MAX_VOTE_ACCOUNTS );
fd_gui_peers_voter_t * ele = fd_gui_peers_voters_map_iter_ele( iter, peers->voters_map, peers->voters_pool );
votes_sorted[ voter_cnt++ ] = *ele;
}
/* deduplicate node accounts, keeping the vote accounts with largest stake */
votes_sorted = fd_gui_peers_votes_stake_sort_inplace( votes_sorted, voter_cnt );
votes_sorted = fd_gui_peers_votes_pkey_sort_stable_fast( votes_sorted, voter_cnt, peers->voters_scratch.voters2 );
ulong total_stake = 0UL;
fd_pubkey_t prev_peer = { 0 };
for( ulong i=0UL; i<voter_cnt; i++ ) {
if( FD_UNLIKELY( !memcmp( prev_peer.uc, votes_sorted[ i ].node_account.uc, sizeof(fd_pubkey_t) ) ) ) {
votes_sorted[ i ].stake = ULONG_MAX; /* flag as duplicate */
} else {
total_stake += votes_sorted[ i ].stake;
}
prev_peer = votes_sorted[ i ].node_account;
}
/* get stake-weighted 67th percentile last_vote_slot */
fd_gui_peers_votes_slot_sort_inplace( votes_sorted, voter_cnt );
ulong cumulative_stake = 0UL;
ulong last_vote_slot_p67 = ULONG_MAX;
for( ulong i=0UL; i<voter_cnt; i++ ) {
if( FD_UNLIKELY( votes_sorted[ i ].stake==ULONG_MAX ) ) continue;
cumulative_stake += votes_sorted[ i ].stake;
if( FD_LIKELY( 3*cumulative_stake>2*total_stake ) ) {
last_vote_slot_p67 = votes_sorted[ i ].last_vote_slot;
}
}
int * actions = peers->voters_scratch.actions;
ulong * idxs = peers->voters_scratch.idxs;
ulong count = 0UL;
for( ulong i=0UL; i<voter_cnt; i++ ) {
if( FD_UNLIKELY( votes_sorted[ i ].stake==ULONG_MAX ) ) continue;
/* votes_sorted is a copy of the vote_states bank field that has
been sorted by stake descending and deduplicated. Deduplicated
here means if multiple vote accounts point to the same identity
key, we go with the one with the most stake. TODO: This logic
will need to change once SIMD-0180 hits mainnet.
As long as the vote account exists, it will be in vote_states,
which get initialized at snapshot load and gets updated by the
runtime. So, on any given fork, `last_voted_slot` should reflect
the last landed vote for ALL the vote accounts (including those
referencing identity->uc) from the perspective of that fork's
bank, even if that slot didn't have landed votes for some of
those accounts. */
if( FD_UNLIKELY( !memcmp( &votes_sorted[ i ].node_account, identity->uc, sizeof(fd_pubkey_t) ) && peers->slot_voted!=votes_sorted[ i ].last_vote_slot ) ) {
peers->slot_voted = fd_ulong_if( votes_sorted[ i ].last_vote_slot==0UL, ULONG_MAX, votes_sorted[ i ].last_vote_slot );
fd_gui_peers_printf_vote_slot( peers );
fd_http_server_ws_broadcast( peers->http );
}
ulong peer_idx = fd_gui_peers_node_pubkey_map_idx_query( peers->node_pubkey_map, &votes_sorted[ i ].node_account, ULONG_MAX, peers->contact_info_table );
if( FD_UNLIKELY( peer_idx==ULONG_MAX ) ) continue; /* peer not on gossip */
fd_gui_peers_node_t * peer = peers->contact_info_table + peer_idx;
/* TODO: we only publish updates when stake changes, otherwise we'd
have to republish for every peer every slot, which ends up being
too much bandwidth because we republish all the peer info.
Ideally, we decouple the vote updates from the reset of the peer
info which would let us make updates quickly. */
int is_delinquent = ((long)last_vote_slot_p67 - (long)votes_sorted[ i ].last_vote_slot) > 150L;
int vote_eq = peer->has_vote_info
&& !memcmp( peer->vote_account.uc, votes_sorted[ i ].vote_account.uc, sizeof(fd_pubkey_t) )
&& peer->stake ==votes_sorted[ i ].stake
// && peer->last_vote_slot ==votes_sorted[ i ].last_vote_slot
// && peer->last_vote_timestamp ==votes_sorted[ i ].last_vote_timestamp
&& peer->is_delinquent ==is_delinquent;
if( FD_LIKELY( vote_eq ) ) continue; /* nop */
peer->has_vote_info = 1;
peer->vote_account = votes_sorted[ i ].vote_account;
peer->last_vote_slot = votes_sorted[ i ].last_vote_slot;
peer->last_vote_timestamp = votes_sorted[ i ].last_vote_timestamp;
peer->is_delinquent = is_delinquent;
if( FD_UNLIKELY( peer->stake!=votes_sorted[ i ].stake ) ) {
fd_gui_peers_live_table_idx_remove( peers->live_table, peer_idx, peers->contact_info_table );
peer->stake = votes_sorted[ i ].stake;
fd_gui_peers_live_table_idx_insert( peers->live_table, peer_idx, peers->contact_info_table );
}
actions[ count ] = FD_GUI_PEERS_NODE_UPDATE;
idxs [ count ] = peer_idx;
count++;
}
if( FD_UNLIKELY( count ) ) {
fd_gui_peers_printf_nodes( peers, actions, idxs, count );
fd_http_server_ws_broadcast( peers->http );
}
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
929fd71 to
eb7ea0d
Compare
eb7ea0d to
0e912fb
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6e1681d to
5058014
Compare
5058014 to
d8e68e3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
src/discof/replay/fd_replay_tile.c:812
- Bank refcnt accounting looks inconsistent after removing the gui increment.
publish_slot_completedstill expects "increment by 1 for each consumer that uses bank_idx", but replay later decrements refcnt for IN_KIND_GUI acks unconditionally. With gui enabled (and rpc potentially disabled), this will decrement without a corresponding increment and can underflow/free the bank too early. Add the gui refcnt++ back whenctx->gui_enabledand keep the debug message consistent with the actual consumers.
/* refcnt should be incremented by 1 for each consumer that uses
`bank_idx`. Each consumer should decrement the bank's refcnt once
they are done using the bank. */
bank->data->refcnt++; /* tower_tile */
if( FD_LIKELY( ctx->rpc_enabled ) ) bank->data->refcnt++; /* rpc tile */
if( FD_LIKELY( ctx->gui_enabled ) ) bank->data->refcnt++; /* gui tile */
slot_info->bank_idx = bank->data->idx;
FD_LOG_DEBUG(( "bank (idx=%lu, slot=%lu) refcnt incremented to %lu for tower, rpc, gui", bank->data->idx, slot, bank->data->refcnt ));
src/discof/replay/fd_replay_tile.c:812
slot_info->bank_seqis assigned here, butfd_replay_slot_completed(infd_replay_tile.h) does not currently define abank_seqfield. This will not compile (or will write into the wrong field if a different member name was intended). Add the missing struct member (or remove the assignment) so producer/consumer agree on the message layout.
if( FD_LIKELY( ctx->gui_enabled ) ) bank->data->refcnt++; /* gui tile */
slot_info->bank_idx = bank->data->idx;
FD_LOG_DEBUG(( "bank (idx=%lu, slot=%lu) refcnt incremented to %lu for tower, rpc, gui", bank->data->idx, slot, bank->data->refcnt ));
d8e68e3 to
8bb3961
Compare
8bb3961 to
a10f255
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/discof/replay/fd_replay_tile.c:730
- cost_tracker_snap() now uses ULONG_MAX as the "unknown" sentinel (via memset(..., 0xFF, ...)) when block_cost_limit==0, and the GUI logic checks for ULONG_MAX before updating compute unit fields. However, when the bank has no cost tracker pool element, this branch memset()s slot_info->cost_tracker to 0, which the GUI will treat as a real value and overwrite max/used compute units with 0. Set the no-cost-tracker case to the same ULONG_MAX sentinel (or otherwise keep the sentinel semantics consistent with the GUI).
cost_tracker_snap( fd_bank_t * bank, fd_replay_slot_completed_t * slot_info ) {
if( bank->data->cost_tracker_pool_idx!=fd_bank_cost_tracker_pool_idx_null( fd_bank_get_cost_tracker_pool( bank->data ) ) ) {
fd_cost_tracker_t const * cost_tracker = fd_bank_cost_tracker_locking_query( bank );
if( FD_UNLIKELY( cost_tracker->block_cost_limit==0UL ) ) {
memset( &slot_info->cost_tracker, (int)UINT_MAX, sizeof(slot_info->cost_tracker) );
} else {
slot_info->cost_tracker.block_cost = cost_tracker->block_cost;
slot_info->cost_tracker.vote_cost = cost_tracker->vote_cost;
slot_info->cost_tracker.allocated_accounts_data_size = cost_tracker->allocated_accounts_data_size;
slot_info->cost_tracker.block_cost_limit = cost_tracker->block_cost_limit;
slot_info->cost_tracker.vote_cost_limit = cost_tracker->vote_cost_limit;
slot_info->cost_tracker.account_cost_limit = cost_tracker->account_cost_limit;
}
fd_bank_cost_tracker_end_locking_query( bank );
} else {
memset( &slot_info->cost_tracker, 0, sizeof(slot_info->cost_tracker) );
}
35821c2 to
4bfdc10
Compare
4bfdc10 to
5436d8c
Compare
5436d8c to
b22d12c
Compare
No description provided.