
[improve][broker]PIP-333 Add monitor metrics for the number of connections to client IPs and roles#21935

Open
yyj8 wants to merge 3 commits into apache:master from yyj8:ip_role_connections_monitor

Conversation

@yyj8 (Contributor) commented Jan 20, 2024

For detailed improvement instructions, see issue #21934.

Motivation

Currently, Pulsar does not monitor the number of connections per client IP or per role. When many business teams access the same cluster, with many topics, producers, and consumers, it is difficult to quickly determine from a global view which IPs or roles are accessing the cluster abnormally and to get them optimized. We therefore propose a monitoring metric that aggregates connection counts by IP and role, making it easier to locate problems quickly later on.

Modifications

Add a new metric to the ServerCnx class:

    private static final Gauge clientIpAndRoleConnections = Gauge.build()
            .name("pulsar_broker_client_ip_role_connections")
            .labelNames("ip", "role")
            .help("The number of client IP and role connections")
            .register();
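For context, a gauge like this would presumably be incremented when a connection is established and decremented when it closes. The following is a minimal, stdlib-only sketch of that bookkeeping; the class and method names are illustrative and are not the actual ServerCnx code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: tracks open connections per (ip, role) pair,
// mirroring what the proposed per-connection gauge would record.
class ConnectionTracker {
    private final Map<String, LongAdder> connections = new ConcurrentHashMap<>();

    private static String key(String ip, String role) {
        return ip + "|" + role;
    }

    // Called when a client connection is established.
    void onConnect(String ip, String role) {
        connections.computeIfAbsent(key(ip, role), k -> new LongAdder()).increment();
    }

    // Called when a client connection is closed.
    void onDisconnect(String ip, String role) {
        LongAdder counter = connections.get(key(ip, role));
        if (counter != null) {
            counter.decrement();
        }
    }

    long count(String ip, String role) {
        LongAdder counter = connections.get(key(ip, role));
        return counter == null ? 0 : counter.sum();
    }
}
```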

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: yyj8#7

@github-actions (bot)

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-required Your PR changes impact docs and you will update later. and removed doc-label-missing labels Jan 20, 2024
@yyj8 (Contributor, Author) commented Jan 20, 2024

/pulsarbot run-failure-checks

@asafm (Contributor) commented Jan 25, 2024

This is a good way to kill Prometheus or any other TSDB.
Maybe we can include it in broker stats of some sort.
Also, you must bound this data structure; the number of label combinations can easily become very high.

@dao-jun (Member) commented Jan 26, 2024

Agree with @asafm: the ip label may lead to too many label values.

@yyj8 (Contributor, Author) commented Jan 28, 2024

@asafm @dao-jun, thank you for your help.

I mainly considered two scenarios:

  1. Kubernetes container deployment: in this scenario, the client IP that Pulsar sees is the IP of the Kubernetes worker node. For most user teams the number of such IPs is small, from a few to a few dozen nodes; a very small number of users may reach a few hundred nodes or more.

  2. Virtual machine or physical machine deployment: in this scenario, the client IP that Pulsar sees is the VM or physical host IP, and it is relatively rare for this to reach hundreds or even thousands of servers.

So, can we add a parameter exposeIpRoleLevelMetricsInPrometheus to the conf/broker.conf file to let users decide whether to enable this metric, just as producer and consumer metrics are exposed?
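For illustration only, such a switch could be read like any other broker.conf property. This is a minimal stdlib sketch under that assumption, not actual Pulsar configuration code; the flag name is the one proposed above and the class name is hypothetical:

```java
import java.util.Properties;

// Illustrative sketch: read the proposed flag from a broker.conf-style
// properties source and use it to guard the metric update path, so the
// per-IP/role metric stays off by default.
class IpRoleMetricsGate {
    private final boolean exposeIpRoleLevelMetrics;

    IpRoleMetricsGate(Properties brokerConf) {
        this.exposeIpRoleLevelMetrics = Boolean.parseBoolean(
                brokerConf.getProperty("exposeIpRoleLevelMetricsInPrometheus", "false"));
    }

    // Callers would check this before registering or updating the gauge.
    boolean enabled() {
        return exposeIpRoleLevelMetrics;
    }
}
```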

@asafm (Contributor) commented Jan 29, 2024

Can a role be enough? It should be low cardinality, I believe.

@yyj8 (Contributor, Author) commented Jan 30, 2024

> Can a role be enough? It should be low cardinality, I believe.

Do you think the following implementation would be better?

    private static final Gauge clientRoleConnections = Gauge.build()
            .name("pulsar_broker_client_role_connections")
            .labelNames("role")
            .help("The number of client role connections")
            .register();

The above metric may not be able to quickly locate the specific server causing a high connection count when the same role has connections on multiple servers. We have encountered this issue in our current production environment.
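One way to address the cardinality concern raised earlier would be to cap the number of distinct label values tracked and fold the overflow into a catch-all bucket. A stdlib-only sketch of that idea, where the cap value, the class name, and the "other" bucket name are all illustrative assumptions rather than anything in the PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: bound the number of distinct roles tracked so the
// metric cannot grow without limit; once the cap is reached, new roles
// are counted under the catch-all "other" bucket. The size check is
// best-effort (not atomic with the insert), which is acceptable for a
// monitoring metric.
class BoundedRoleCounter {
    private final int maxRoles;
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    BoundedRoleCounter(int maxRoles) {
        this.maxRoles = maxRoles;
    }

    void increment(String role) {
        String label = counts.containsKey(role) || counts.size() < maxRoles
                ? role
                : "other";
        counts.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    long count(String role) {
        LongAdder c = counts.get(role);
        return c == null ? 0 : c.sum();
    }
}
```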

@asafm (Contributor) commented Jan 30, 2024

Just thinking out loud: given that you know the role, and your app using the client can expose the role so you can pinpoint the hosts running the clients that way, would measuring the number of connections on the client side suffice?

@yyj8 (Contributor, Author) commented Jan 31, 2024

> Just thinking out loud: Given you know the role and your app using the client can expose the role so you can pinpoint the hosts running the clients this way, will the number of connections from the client side suffice?

Our current pain point is that multiple business teams access the same Pulsar cluster. Cluster traffic is not high (under 200 MB/s), but there are a lot of topics (split into multiple tenants by customer, nearly 200,000 topic partitions), and the number of connections in the cluster has reached nearly 500,000. As a result, the proxy or broker connection limit is frequently exhausted and all client requests are rejected. Because there is nowhere to get a role-level or IP-level summary of connections, we cannot find which business is accessing the cluster abnormally or push them to optimize.

Main scenarios:

  1. Multiple business teams use different roles to access the same Pulsar cluster from different client machines;
  2. The same business team uses the same role to access the same Pulsar cluster from multiple client machines or multiple apps.

Only with both of these pieces of information can we accurately locate which team has abnormal access.

@asafm (Contributor) commented Jan 31, 2024

@yyj8 Just validating your requirements: can you elaborate on why monitoring the connection count from the client side, rather than the broker side, would be impossible?

@dao-jun (Member) commented Jan 31, 2024

It is really hard to make this decision; the metric might be helpful for troubleshooting.

But the Pulsar broker already has too many metrics and too many configuration items, and if we add an ip label to this metric, it could easily kill the time-series DB.

Since we can't convince each other, I suggest sending a discussion to the mailing list and letting the community decide.

@yyj8 (Contributor, Author) commented Feb 1, 2024

> it is totally hard to make the decision, the metric might be helpful to troubleshooting.
>
> but pulsar broker has tooo many metrics and tooo many configuration items, and if we add ip label for the metric, it could easily kill the time series DB.
>
> Since we can't convince each other, I suggest send a discuss to the mail list, let the community decides.

Mailing list: https://lists.apache.org/thread/6oyhv6fmzr72oc1hcxw9llwvzlw09cqp


Labels

doc-required Your PR changes impact docs and you will update later.
