[improve][broker] PIP-340 Optimization of Probe Implementation for Automatic Failover #8

Open
yyj8 wants to merge 1 commit into master from auto_cluster_failover_optimize

Conversation

@yyj8 yyj8 commented Feb 27, 2024

Motivation

The current Java client implementation of automatic failover has a flaw in how it probes cluster availability.

org.apache.pulsar.client.impl.AutoClusterFailover.java
boolean probeAvailable(String url) {
        try {
            resolver.updateServiceUrl(url);
            InetSocketAddress endpoint = resolver.resolveHost();
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(endpoint.getHostName(), endpoint.getPort()), TIMEOUT);
            socket.close();

            return true;
        } catch (Exception e) {
            log.warn("Failed to probe available, url: {}", url, e);
            return false;
        }
    }

The client only opens a TCP connection to the cluster's exposed service address to decide whether the cluster is available. This cannot handle scenarios where the cluster is partially unavailable (half dead), because a TCP connect can still succeed while the cluster is unable to serve traffic properly. For this scenario, we want the client to make its failover decision by sending a cluster health status request to the cluster, and we provide an admin command within the cluster to update the cluster's health status. Without this, every business application that connects to the cluster has to switch its cluster connection address manually and restart, and inconsistent operation steps across business teams result in inconsistent link data.
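
The probing idea described above can be sketched as a two-stage check: TCP reachability first, then the health status reported by the cluster. This is only an illustration, not the PR's implementation. `HealthAwareProbe`, `HealthStatusClient`, and `ClusterStatus` are hypothetical names, and the sketch takes a host and port directly instead of going through the client's `resolver`.

```java
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical two-stage probe: TCP connect, then a cluster-reported health status. */
final class HealthAwareProbe {

    /** Hypothetical status values; the wire representation in the PR may differ. */
    enum ClusterStatus { AVAILABLE, UNAVAILABLE }

    /** Hypothetical abstraction over "send a health check request and read the response". */
    interface HealthStatusClient {
        ClusterStatus query(InetSocketAddress endpoint) throws Exception;
    }

    private final HealthStatusClient client;
    private final int timeoutMs;

    HealthAwareProbe(HealthStatusClient client, int timeoutMs) {
        this.client = client;
        this.timeoutMs = timeoutMs;
    }

    boolean probeAvailable(String host, int port) {
        InetSocketAddress endpoint = new InetSocketAddress(host, port);

        // Stage 1: same TCP check as the existing probe; the socket is
        // closed by try-with-resources even when connect() throws.
        try (Socket socket = new Socket()) {
            socket.connect(endpoint, timeoutMs);
        } catch (Exception e) {
            return false;
        }

        // Stage 2: a reachable but half-dead cluster is reported as unavailable.
        // A failed or missing health response is treated as unavailable too.
        try {
            return client.query(endpoint) == ClusterStatus.AVAILABLE;
        } catch (Exception e) {
            return false;
        }
    }
}
```

With this shape, a cluster whose port still accepts connections but whose status was set to unavailable makes `probeAvailable` return `false`, which is the case the plain TCP probe cannot detect.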

Modifications

  1. Add a new cluster health status request and its response;
case HEALTH_CHECK:
	checkArgument(cmd.hasHealthCheck());
	handleHealthCheck(cmd.getHealthCheck());
	break;

case HEALTH_CHECK_RESPONSE:
	checkArgument(cmd.hasHealthCheckResponse());
	handleHealthCheckResponse(cmd.getHealthCheckResponse());
	break;
  2. Add a new admin management command to manually update the cluster health status;
# Update the cluster health status: available or unavailable (default: available)
bin/pulsar-admin clusters update-health-status --status unavailable
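
How the two modifications might fit together can be sketched as follows: the admin command flips a status held by the cluster, the broker answers a health check with that status, and the client turns the response into a failover decision. This is a simplified model under stated assumptions, not the PR code. The class, the enum, and the method signatures are hypothetical; only the handler names mirror the `case` branches above, and the real handlers operate on protocol command objects rather than plain enums.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the health check round trip; not Pulsar's actual classes. */
final class HealthCheckSketch {

    /** Hypothetical status values. */
    enum ClusterStatus { AVAILABLE, UNAVAILABLE }

    /** Cluster-side state that the admin command would update. */
    static final class ClusterHealth {
        // Defaults to AVAILABLE, matching the default stated for the admin command.
        private final AtomicReference<ClusterStatus> status =
                new AtomicReference<>(ClusterStatus.AVAILABLE);

        /** Stands in for `clusters update-health-status --status ...`. */
        void update(ClusterStatus newStatus) {
            status.set(Objects.requireNonNull(newStatus));
        }

        ClusterStatus current() {
            return status.get();
        }
    }

    /** Broker side: answer a health check request with the current status. */
    static ClusterStatus handleHealthCheck(ClusterHealth health) {
        return health.current();
    }

    /** Client side: true means the client can stay on this cluster. */
    static boolean handleHealthCheckResponse(ClusterStatus reported) {
        return reported == ClusterStatus.AVAILABLE;
    }

    /** One request/response round trip, without any networking. */
    static boolean roundTrip(ClusterHealth health) {
        return handleHealthCheckResponse(handleHealthCheck(health));
    }
}
```

In this model, calling `update(ClusterStatus.UNAVAILABLE)` makes the next `roundTrip` return `false`, so an operator can trigger failover for all clients without each team changing connection addresses by hand.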

For further details, please refer to the PR code.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:
apache#22133
