chore: Improve metrics for failed network requests#18059

Merged
mcmire merged 22 commits into main from extend-metrics-for-rpc-endpoints
Aug 28, 2025

Conversation

@mcmire mcmire commented Aug 6, 2025

Description

Currently, we only track when an Infura RPC endpoint becomes degraded or unavailable. We would now like similar insight into custom RPC endpoints so that we can make better-informed decisions about improving reliability on other chains. We'd also like to improve the tracking for Infura endpoints so that we can understand failures better.

This commit updates the handlers for the NetworkController:rpcEndpointDegraded and NetworkController:rpcEndpointUnavailable messenger events so that they create a Segment event regardless of the type of endpoint. The event now includes the HTTP status code.
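
As a rough illustration of the payload described above, the following TypeScript sketch builds the properties for such an event. The field names are taken from the sample log lines in the manual testing steps; the helper name `buildRpcServiceEventProperties`, its parameters, and the host-extraction logic are assumptions made for illustration and are not the code in this PR.

```typescript
// Properties attached to the Segment event. Field names mirror the sample
// log output in the manual testing steps; the type itself is hypothetical.
interface RpcServiceEventProperties {
  chain_id_caip: string;
  rpc_endpoint_url: string;
  http_status?: number;
}

// Hypothetical helper (not from the PR): builds the same properties for any
// endpoint, whether Infura or custom.
function buildRpcServiceEventProperties(
  chainIdHex: string,
  endpointUrl: string,
  httpStatus?: number,
): RpcServiceEventProperties {
  // Convert a hex chain ID such as "0xe708" into a CAIP-2 style identifier.
  const chainIdCaip = `eip155:${parseInt(chainIdHex, 16)}`;
  // Assumption: only the host is reported, as in the sample log lines.
  const host = endpointUrl
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, '')
    .split(/[\/?#]/)[0];
  const properties: RpcServiceEventProperties = {
    chain_id_caip: chainIdCaip,
    rpc_endpoint_url: host,
  };
  // Only attach the status code when one is available.
  if (httpStatus !== undefined) {
    properties.http_status = httpStatus;
  }
  return properties;
}
```

For example, calling the helper with `"0xe708"`, an Infura Linea URL, and `502` would yield a `chain_id_caip` of `eip155:59144` and an `rpc_endpoint_url` of `linea-mainnet.infura.io`.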

While making these changes, we noticed that the sampling rate for these Segment events was incorrect: it should have been 1%, not 10%. This has also been corrected, so that we don't store more data in Segment and our downstream services than necessary.
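
To make the sampling idea concrete, here is a minimal sketch of a 1% sampling check. It assumes simple random sampling; the actual implementation in the app may sample differently (for instance, deterministically per user). The names `SAMPLE_RATE` and `shouldCreateEvent` are hypothetical.

```typescript
// Assumed sampling rate after this change: 1% of events are kept.
const SAMPLE_RATE = 0.01;

// Hypothetical helper: returns true when an event should be created.
// `random` is injectable so the behavior can be checked deterministically.
function shouldCreateEvent(
  sampleRate: number = SAMPLE_RATE,
  random: () => number = Math.random,
): boolean {
  return random() < sampleRate;
}
```

With a rate of `0.01`, roughly one in a hundred degraded or unavailable notifications would produce a Segment event, compared with one in ten at the previous 10% rate.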

Changelog

CHANGELOG entry: null

Related issues

Closes #17089.

Manual testing steps

  1. Check out this branch, run yarn setup:expo, run yarn watch:clean.
  2. Open node_modules/@metamask/network-controller/dist/rpc-service/rpc-service.cjs, look for async function _RpcService_processRequest and make these changes:
    async function _RpcService_processRequest(fetchOptions) {
        let response;
        try {
            return await __classPrivateFieldGet(this, _RpcService_policy, "f").execute(async () => {
    +           console.log('[REQUEST]', this.endpointUrl.toString(), 'with', fetchOptions);
    +           if (
    +             this.endpointUrl.toString().includes("linea-mainnet.infura.io") ||
    +             this.endpointUrl.toString().includes("mainnet.era.zksync.io")
    +           ) {
    +               console.log('[RESPONSE]', this.endpointUrl.toString(), '=> 502');
    +               throw new controller_utils_1.HttpError(502);
    +           }
                response = await __classPrivateFieldGet(this, _RpcService_fetch, "f").call(this, this.endpointUrl, fetchOptions);
    +           console.log('[RESPONSE]', this.endpointUrl.toString(), '=>', response.status);
                if (!response.ok) {
                    throw new controller_utils_1.HttpError(response.status);
                }
                return await response.json();
            });
        }
  3. Open app/core/Engine/Engine.ts, look for new NetworkController, and make these changes:
            return {
              ...commonOptions,
              policyOptions: {
                maxRetries,
    -           maxConsecutiveFailures: (maxRetries + 1) * 7,
    +           maxConsecutiveFailures: (maxRetries + 1) * 4,
              },
            };
          },
          additionalDefaultNetworks,
        };
        const networkController = new NetworkController(networkControllerOptions);
  4. Open the app, go through onboarding if needed.
  5. Once on the home screen, switch to Linea.
  6. Monitor the messages appearing in your terminal. Almost immediately, you should see a line that says Creating Segment event "RPC Service Degraded" with {"chain_id_caip":"eip155:59144","rpc_endpoint_url":"linea-mainnet.infura.io","http_status":500}. After about 10 seconds, you should see Creating Segment event "RPC Service Unavailable" with {"chain_id_caip":"eip155:59144","rpc_endpoint_url":"linea-mainnet.infura.io","http_status":500}.
  7. Go back to the app, add ZKSync as a network, and switch to it.
  8. Monitor your terminal again. After a few minutes or so, you should see similar "degraded" and "unavailable" lines as above, but with the data {"chain_id_caip":"eip155:324","rpc_endpoint_url":"mainnet.era.zksync.io","http_status":500}.
  9. Go back to the app and add Flare as a network (https://chainid.network/chain/14/). Use https://flare-api.flare.network/ext/C/rpc as the RPC endpoint.
  10. Update node_modules/@metamask/network-controller/dist/rpc-service/rpc-service.cjs again with:
    async function _RpcService_processRequest(fetchOptions) {
        let response;
        try {
            return await __classPrivateFieldGet(this, _RpcService_policy, "f").execute(async () => {
                console.log('[REQUEST]', this.endpointUrl.toString(), 'with', fetchOptions);
                if (
                  this.endpointUrl.toString().includes("linea-mainnet.infura.io") ||
    -             this.endpointUrl.toString().includes("mainnet.era.zksync.io")
    +             this.endpointUrl.toString().includes("mainnet.era.zksync.io") ||
    +             this.endpointUrl.toString().includes("flare-api.flare.network")
                ) {
                    console.log('[RESPONSE]', this.endpointUrl.toString(), '=> 502');
                    throw new controller_utils_1.HttpError(502);
                }
                response = await __classPrivateFieldGet(this, _RpcService_fetch, "f").call(this, this.endpointUrl, fetchOptions);
                console.log('[RESPONSE]', this.endpointUrl.toString(), '=>', response.status);
                if (!response.ok) {
                    throw new controller_utils_1.HttpError(response.status);
                }
                return await response.json();
            });
        }
  11. Reload the app.
  12. Make sure you've switched to the new network (Flare).
  13. Monitor your terminal one last time. After a few minutes or so, you should see similar "degraded" and "unavailable" lines as above, but with the data {"chain_id_caip":"eip155:14","rpc_endpoint_url":"flare-api.flare.network","http_status":500}.
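
Step 3 above lowers `maxConsecutiveFailures` from `(maxRetries + 1) * 7` to `(maxRetries + 1) * 4`, presumably so that an endpoint is marked unavailable sooner during testing. The sketch below works through that arithmetic under two assumptions that are not stated in this PR: that every attempt of a failing request (the initial try plus each retry) counts as one consecutive failure, and that `maxRetries` is 3. The helper name is hypothetical.

```typescript
// Hypothetical helper: number of fully failed requests (each exhausting all
// of its retries) needed to reach the consecutive-failure threshold.
function failedRequestsUntilThreshold(
  maxRetries: number,
  maxConsecutiveFailures: number,
): number {
  const attemptsPerRequest = maxRetries + 1; // initial try plus retries
  return Math.ceil(maxConsecutiveFailures / attemptsPerRequest);
}

const assumedMaxRetries = 3; // assumed value, for illustration only
const requestsBefore = failedRequestsUntilThreshold(
  assumedMaxRetries,
  (assumedMaxRetries + 1) * 7,
);
const requestsAfter = failedRequestsUntilThreshold(
  assumedMaxRetries,
  (assumedMaxRetries + 1) * 4,
);
// Under these assumptions, requestsBefore is 7 and requestsAfter is 4.
```

Under these assumptions the multiplier is simply the number of consecutive fully failed requests tolerated before the threshold is reached, regardless of the value of `maxRetries`.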

Screenshots/Recordings

(N/A)

Before

After

Pre-merge author checklist

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and/or screenshots.


github-actions bot commented Aug 6, 2025

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbot metamaskbot added the team-wallet-framework-deprecated DEPRECATED: please use "team-core-platform" instead label Aug 6, 2025

github-actions bot commented Aug 11, 2025

https://bitrise.io/ Bitrise

✅✅✅ pr_smoke_e2e_pipeline passed on Bitrise! ✅✅✅

Commit hash: f597164
Build link: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/0668f6d1-a142-4bb9-9427-51f57fefd36a

Note

  • You can kick off another pr_smoke_e2e_pipeline on Bitrise by removing and re-applying the Run Smoke E2E label on the pull request

@mcmire mcmire added the QA Passed QA testing has been completed and passed label Aug 11, 2025
@mcmire mcmire marked this pull request as ready for review August 12, 2025 13:55
@mcmire mcmire requested a review from a team as a code owner August 12, 2025 13:55
@mcmire mcmire moved this to Needs dev review in PR review queue Aug 12, 2025
@mcmire mcmire added the needs-dev-review PR needs reviews from other engineers (in order to receive required approvals) label Aug 12, 2025
cursor[bot]

This comment was marked as outdated.

@mcmire mcmire changed the title chore: Improve tracking of RPC endpoint failures chore: Improve metrics for failed network requests Aug 12, 2025
Cal-L
Cal-L previously approved these changes Aug 12, 2025

@Cal-L Cal-L left a comment

LGTM

@github-project-automation github-project-automation bot moved this from Needs dev review to Review finalised - Ready to be merged in PR review queue Aug 12, 2025
@mcmire mcmire marked this pull request as ready for review August 22, 2025 14:33
cursor[bot]

This comment was marked as outdated.

@mcmire mcmire moved this from Review finalised - Ready to be merged to Review in progress in PR review queue Aug 27, 2025

github-actions bot commented Aug 27, 2025

https://bitrise.io/ Bitrise

❌❌❌ pr_smoke_e2e_pipeline failed on Bitrise! ❌❌❌

Commit hash: be30e86
Build link: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/3eda3a0e-c811-4288-bf1f-c161e6bc1cba

Note

  • You can rerun any failed steps by opening the Bitrise build, tapping Rebuild on the upper right then Rebuild unsuccessful Workflows
  • You can kick off another pr_smoke_e2e_pipeline on Bitrise by removing and re-applying the Run Smoke E2E label on the pull request

Tip

  • Check the documentation if you are unsure how to interpret the failure on Bitrise


github-actions bot commented Aug 27, 2025

https://bitrise.io/ Bitrise

✅✅✅ pr_smoke_e2e_pipeline passed on Bitrise! ✅✅✅

Commit hash: 762cc57
Build link: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/d6d6c5c9-a6bd-4678-916c-b49b10cc8196

Note

  • You can kick off another pr_smoke_e2e_pipeline on Bitrise by removing and re-applying the Run Smoke E2E label on the pull request

@mcmire mcmire moved this from Review in progress to Needs dev review in PR review queue Aug 27, 2025
@github-project-automation github-project-automation bot moved this from Needs dev review to Review finalised - Ready to be merged in PR review queue Aug 27, 2025
@mcmire mcmire added this pull request to the merge queue Aug 28, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 28, 2025
@mcmire mcmire added this pull request to the merge queue Aug 28, 2025
Merged via the queue into main with commit 609f6dd Aug 28, 2025
158 of 175 checks passed
@mcmire mcmire deleted the extend-metrics-for-rpc-endpoints branch August 28, 2025 14:13
@github-project-automation github-project-automation bot moved this from Review finalised - Ready to be merged to Merged, Closed or Archived in PR review queue Aug 28, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Aug 28, 2025
@github-actions github-actions bot removed the needs-dev-review PR needs reviews from other engineers (in order to receive required approvals) label Aug 28, 2025
@metamaskbot metamaskbot added the release-7.55.0 Issue or pull request that will be included in release 7.55.0 label Aug 28, 2025

Labels

  • QA Passed: QA testing has been completed and passed
  • release-7.55.0: Issue or pull request that will be included in release 7.55.0
  • size-XL
  • team-wallet-framework-deprecated: DEPRECATED: please use "team-core-platform" instead

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Increase visibility around degraded and unavailable custom RPC endpoints

5 participants