Conversation
```go
for attempt := 0; attempt <= r.maxRetries; attempt++ {
	if attempt > 0 {
		backoffDuration := min(r.backoff*time.Duration(1<<uint(attempt-1)), 5*time.Second)
		time.Sleep(backoffDuration)
```
Instead of sleep, use time.After and req.Context in a select to return as soon as possible if the context is done.
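A minimal sketch of the suggested wait, assuming it replaces the time.Sleep call inside RoundTrip, where req and backoffDuration are in scope:

```go
// Wait for the backoff to elapse, but bail out early if the
// request's context is cancelled or times out.
select {
case <-req.Context().Done():
	return nil, req.Context().Err()
case <-time.After(backoffDuration):
}
```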
```go
// Reset request body for retry if possible
if req.Body != nil && req.GetBody != nil {
	var err error
	req.Body, err = req.GetBody()
```
This may be problematic if the Body has already been read or closed. There is a check below for a non-rewritable body, so it should be ok, I suppose.

In general, only idempotent requests should be retried, but with RPC, identifying idempotent requests is not easy. Getting information is ok, but posting information may cause problems by duplicating it.
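A hypothetical illustration of why this is hard for RPC: a method-based gate works for plain HTTP, but JSON-RPC sends reads and writes alike as POST, so it cannot tell a read call apart from a state-changing one (the helper name is made up):

```go
// isIdempotent reports whether the request is safe to retry based on
// its HTTP method alone. Since JSON-RPC uses POST for everything,
// this heuristic never fires for RPC traffic.
func isIdempotent(req *http.Request) bool {
	switch req.Method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true
	default:
		return false
	}
}
```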
Maybe it is simplest to increase the values for the timeouts on the RPC side?
These align with HAProxy timeouts to prevent errors, but they look low?
```go
func newHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   9 * time.Second, // < HAProxy timeout connect (10s)
				KeepAlive: 5 * time.Second, // Keep connections alive but close before idle timeout
			}).DialContext,
			IdleConnTimeout:       8 * time.Second, // < HAProxy timeout client (10s)
			TLSHandshakeTimeout:   9 * time.Second, // < HAProxy timeout connect (10s)
			ResponseHeaderTimeout: 9 * time.Second, // < HAProxy timeout server (10s)
		},
	}
}
```
I agree, even retries would make the rpc endpoint do more work in this case. Increasing the timeout would be better.
```go
	}
}

// isRetryableError determines if an error should trigger a retry.
func isRetryableError(err error) bool {
```
Are these error code checks compatible with Windows?
AFAIK Windows has different error codes; maybe some unit tests could reveal it?
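A sketch of a more portable check, assuming the point is to avoid matching raw errno values or error strings: errors.Is with syscall constants resolves to the platform's own code (e.g. WSAECONNRESET on Windows), though the exact set of retryable conditions here is an assumption, not the PR's:

```go
import (
	"errors"
	"io"
	"net"
	"syscall"
)

// isRetryableError reports whether err looks like a transient network
// failure worth retrying, using platform-independent checks only.
func isRetryableError(err error) bool {
	if err == nil {
		return false
	}
	// Abruptly closed connections surface as EOF / unexpected EOF.
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
		return true
	}
	// Timeouts implement net.Error on every platform.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	// syscall constants are mapped per platform, so errors.Is matches
	// the right code on Linux, macOS, and Windows alike.
	return errors.Is(err, syscall.ECONNRESET)
}
```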
```go
	}
}

return nil, fmt.Errorf("request failed after %d retries: %w", r.maxRetries, lastErr)
```
The loop above runs for attempt from 0 to maxRetries, so with maxRetries = 3 there are 4 attempts, but the error says 3 retries.
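A minimal fix, assuming the intent is to keep the loop bounds and correct the message instead:

```go
// maxRetries retries means maxRetries+1 total attempts.
return nil, fmt.Errorf("request failed after %d attempts: %w", r.maxRetries+1, lastErr)
```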
Take a look at this package, it may help simplify the implementation:
Checklist
Description
The bee node was experiencing intermittent connection failures when communicating with blockchain RPC endpoints (especially HAProxy-backed services), resulting in EOF errors.
These EOF errors occur because short timeout configurations on HAProxy can prematurely close connections during RPC communication. Without retry logic, such transient network errors (EOF, connection resets) cause operations to fail immediately, impacting node stability and reliability.
Implemented a custom retryRoundTripper that wraps the HTTP transport with retry logic
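A rough sketch of the wrapper, pieced together from the snippets quoted in the review above; the field names and exact structure are assumptions, not the PR's verbatim code:

```go
// retryRoundTripper wraps another http.RoundTripper and retries
// transient network failures with capped exponential backoff.
type retryRoundTripper struct {
	next       http.RoundTripper
	maxRetries int
	backoff    time.Duration
}

func (r *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= r.maxRetries; attempt++ {
		if attempt > 0 {
			// Capped exponential backoff; see the review note above about
			// preferring a context-aware wait over time.Sleep.
			backoffDuration := min(r.backoff*time.Duration(1<<uint(attempt-1)), 5*time.Second)
			time.Sleep(backoffDuration)

			// Reset request body for retry if possible.
			if req.Body != nil && req.GetBody != nil {
				body, err := req.GetBody()
				if err != nil {
					return nil, err
				}
				req.Body = body
			}
		}
		resp, err := r.next.RoundTrip(req)
		if err == nil {
			return resp, nil
		}
		if !isRetryableError(err) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("request failed after %d retries: %w", r.maxRetries, lastErr)
}
```

The wrapper would then replace the bare Transport in newHTTPClient, e.g. `Transport: &retryRoundTripper{next: &http.Transport{...}, maxRetries: 3, backoff: 500 * time.Millisecond}` (values illustrative).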
Key Features:
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):