@ppca ppca commented Jan 29, 2026

The Solana indexer's block height can get stuck for either of two reasons:

  1. get_transaction_with_config hangs: no response ever arrives, so the call waits forever.
  2. stream.next() hangs forever.

With this PR, both cases make the function return Err, after which the Solana client resubscribes.

Fixes #648.
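
As a rough illustration of the first failure mode, a blocking call can be given a deadline so that a hang surfaces as Err instead of stalling the indexer. The sketch below is not the PR's code: it uses only the standard library (a helper thread plus a channel with recv_timeout) rather than the async client, and call_with_timeout is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `op` on a helper thread and waits at most `limit` for its result.
/// Returns Err if the deadline passes (or the helper thread dies), so the
/// caller can resubscribe instead of waiting forever.
/// Note: a hung `op` keeps its thread alive; this sketch does not cancel it.
fn call_with_timeout<T, F>(limit: Duration, op: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the receiver is gone if we already timed out.
        let _ = tx.send(op());
    });
    rx.recv_timeout(limit)
        .map_err(|e| format!("call failed or timed out (limit {limit:?}): {e}"))
}
```

A caller would treat the Err like any other RPC failure and resubscribe. In an async codebase a runtime-level timeout is the more natural tool; the thread here is only a stand-in.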

Copilot AI left a comment

Pull request overview

Mitigates Solana indexer “block height stuck” scenarios by adding bounded timeouts/retries around transaction fetches and adding a watchdog to force reconnect when the WebSocket log stream goes silent.

Changes:

  • Added get_tx_with_timeout_retry with configurable timeout, retries, exponential backoff, and jitter for getTransaction calls.
  • Updated CPI event parsing to use the new timeout+retry transaction fetch helper.
  • Reworked Solana WS log subscriptions to use a watchdog via tokio::select! and reconnect on stalled/ended streams.
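
To make the backoff-with-jitter idea concrete, here is a minimal sketch of how a per-attempt delay might be computed. RetryConfig, its fields, and backoff_delay are hypothetical names chosen for illustration; the PR's actual configuration is not shown in this conversation. The jitter value is passed in (0.0 to 1.0) so the function stays deterministic; a real caller would draw it from a random source.

```rust
use std::time::Duration;

/// Hypothetical retry settings; field names are assumptions, not the PR's config.
struct RetryConfig {
    base_delay: Duration,
    max_delay: Duration,
}

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1),
/// capped at `max_delay`, plus up to 50% extra scaled by `jitter` in [0.0, 1.0].
fn backoff_delay(cfg: &RetryConfig, attempt: u32, jitter: f64) -> Duration {
    // Clamp the exponent so the shift below cannot overflow a u32.
    let exp = attempt.saturating_sub(1).min(16);
    let factor = 1u32 << exp;
    let capped = cfg.base_delay.saturating_mul(factor).min(cfg.max_delay);
    capped + capped.mul_f64(0.5 * jitter.clamp(0.0, 1.0))
}
```

Applying the cap before the jitter keeps the worst-case wait bounded at 1.5 times max_delay, which makes the total retry budget easy to reason about.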

@ppca ppca force-pushed the xiangyi/solana_indexer_hang branch from 13ba881 to 1d05d0b Compare January 29, 2026 06:35
@ppca ppca requested a review from Copilot January 29, 2026 06:35
@ppca ppca force-pushed the xiangyi/solana_indexer_hang branch 2 times, most recently from e88685a to b4f343e Compare January 29, 2026 06:38
Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.


Comment on lines +1030 to +1044
// On any RPC error, retry until max_attempts is reached.
let e_anyhow = anyhow::anyhow!(e).context(format!(
"getTransaction failed (attempt {attempt}/{}) for {}",
cfg.max_attempts, signature
));

Copilot AI Jan 29, 2026

The comment says the retry decision is based on error content, but the implementation retries unconditionally for all Err(e) until max_attempts. Either implement error-based retry filtering (e.g., avoid retrying on "not found"/finalized errors) or adjust the comment to match behavior.

Suggested change
- // On any RPC error, retry until max_attempts is reached.
- let e_anyhow = anyhow::anyhow!(e).context(format!(
-     "getTransaction failed (attempt {attempt}/{}) for {}",
-     cfg.max_attempts, signature
- ));
+ // For some RPC errors (e.g., transaction not found or already finalized),
+ // further retries are not useful. Treat those as terminal based on the
+ // error message content; otherwise, retry until max_attempts is reached.
+ let err_msg = e.to_string();
+ let e_anyhow = anyhow::anyhow!(e).context(format!(
+     "getTransaction failed (attempt {attempt}/{}) for {}",
+     cfg.max_attempts, signature
+ ));
+ // Do not retry on clearly terminal conditions.
+ if err_msg.contains("not found") || err_msg.contains("finalized") {
+     return Err(e_anyhow);
+ }
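
The retry-with-early-exit flow discussed above could look roughly like the following standalone sketch. It uses plain String errors instead of anyhow and a closure instead of the RPC client, and the names is_terminal and fetch_with_retry are hypothetical; the substring checks are taken from the suggestion and may not match real RPC error text.

```rust
/// Returns true for errors where another attempt cannot help.
/// The substrings mirror the review suggestion; real RPC error text may differ.
fn is_terminal(err_msg: &str) -> bool {
    err_msg.contains("not found") || err_msg.contains("finalized")
}

/// Calls `fetch(attempt)` up to `max_attempts` times, stopping early on a
/// terminal error. Returns the last error annotated with the attempt number.
fn fetch_with_retry<T>(
    max_attempts: u32,
    mut fetch: impl FnMut(u32) -> Result<T, String>,
) -> Result<T, String> {
    let mut last_err = String::from("no attempts were made");
    for attempt in 1..=max_attempts {
        match fetch(attempt) {
            Ok(value) => return Ok(value),
            Err(e) => {
                last_err = format!("attempt {attempt}/{max_attempts} failed: {e}");
                if is_terminal(&e) {
                    break;
                }
                // A real caller would sleep for a backoff delay here.
            }
        }
    }
    Err(last_err)
}
```

Matching on error strings is fragile; if the client exposes structured error kinds, matching on those would be a sturdier way to decide what counts as terminal.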

@ppca ppca force-pushed the xiangyi/solana_indexer_hang branch 2 times, most recently from 7916c0c to 30633db Compare January 29, 2026 07:03
@ppca ppca requested a review from Copilot January 29, 2026 07:04
@ppca ppca requested a review from volovyks January 29, 2026 07:06
@ppca ppca linked an issue Jan 29, 2026 that may be closed by this pull request
Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


@ppca ppca force-pushed the xiangyi/solana_indexer_hang branch from 30633db to 7ec4b4b Compare January 29, 2026 07:14
@ppca ppca requested a review from Copilot January 29, 2026 07:14
@ppca ppca force-pushed the xiangyi/solana_indexer_hang branch from 7ec4b4b to 66a95cd Compare January 29, 2026 07:16
Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Comment on lines +667 to +670
// stall watchdog
let stall_timeout = Duration::from_secs(60);
let mut last_ws_msg = Instant::now();
let mut watchdog = tokio::time::interval(Duration::from_secs(5));

Copilot AI Jan 29, 2026

The watchdog currently bails out if no logsSubscribe notification is received for 60s. With RpcTransactionLogsFilter::Mentions, it’s normal to receive no messages during periods with no transactions mentioning this program, so this can force reconnect loops even when the WS connection is healthy. Consider basing the stall detection on a true WS keepalive/ping mechanism (or client-level health check), or making the stall timeout configurable and large enough to tolerate expected inactivity.
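
One way to make the stall check tolerant of quiet periods is to make the timeout optional and configurable, and to reset it on any inbound message rather than only on log notifications. The sketch below assumes a hypothetical StallWatchdog type; it is not the PR's implementation and leaves out the tokio::select! loop that would drive it.

```rust
use std::time::{Duration, Instant};

/// Tracks the last time any WebSocket message arrived.
/// `stall_timeout: None` disables the check, e.g. for low-traffic deployments.
struct StallWatchdog {
    stall_timeout: Option<Duration>,
    last_msg: Instant,
}

impl StallWatchdog {
    fn new(stall_timeout: Option<Duration>, now: Instant) -> Self {
        Self { stall_timeout, last_msg: now }
    }

    /// Call on every received message (notification or keepalive).
    fn record_message(&mut self, now: Instant) {
        self.last_msg = now;
    }

    /// True when the silence has lasted longer than the configured timeout.
    fn is_stalled(&self, now: Instant) -> bool {
        match self.stall_timeout {
            Some(limit) => now.saturating_duration_since(self.last_msg) > limit,
            None => false,
        }
    }
}
```

Passing the current time in as an argument keeps the type easy to test; the periodic watchdog tick would call is_stalled(Instant::now()) and return Err to trigger a reconnect when it reports true.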

Comment on lines +893 to +896
// Watchdog: force reconnect if WS goes silent
let stall_timeout = Duration::from_secs(60);
let mut last_ws_msg = Instant::now();
let mut watchdog = tokio::time::interval(Duration::from_secs(5));

Copilot AI Jan 29, 2026

Same issue as the CPI subscription: the 60s watchdog is keyed to receipt of logsSubscribe notifications, but receiving no notifications is normal when the program has no activity. This can cause perpetual reconnect churn. Consider moving to WS-level keepalive detection, or making the stall timeout configurable (or disabled by default) for low-traffic deployments.

@volovyks volovyks left a comment

This is not the first place where we use exponential backoff. Can we create utils and use them across the project?

Development

Successfully merging this pull request may close these issues.

solana block height stuck
