
[improve][broker]PIP-333 Add monitor metrics for the number of connections to client IPs and roles#21935

Open
yyj8 wants to merge 3 commits into apache:master from yyj8:ip_role_connections_monitor

Conversation

@yyj8 (Contributor) commented Jan 20, 2024

For detailed improvement instructions, see issue #21934.

Motivation

Currently, Pulsar does not monitor the number of connections per client IP or per role. When many business teams access the same cluster, with many topics, producers, and consumers, it is difficult to quickly determine from a global view which IPs or roles are accessing the cluster abnormally and to get them optimized. We therefore propose a monitoring metric that aggregates connection counts by IP and role, making it easier to locate problems quickly later on.

Modifications

Add a new metric to the ServerCnx class:

    private static final Gauge clientIpAndRoleConnections = Gauge.build()
            .name("pulsar_broker_client_ip_role_connections")
            .labelNames("ip", "role")
            .help("The number of client IP and role connections")
            .register();
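For context, a gauge like this would presumably be incremented when a connection is established and decremented when it closes. The following is a minimal, stdlib-only sketch of that bookkeeping; the class and method names are illustrative and are not the actual ServerCnx code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: tracks open connections per (ip, role) pair,
// mirroring what the proposed per-connection gauge would record.
class ConnectionTracker {
    private final Map<String, LongAdder> connections = new ConcurrentHashMap<>();

    private static String key(String ip, String role) {
        return ip + "|" + role;
    }

    // Called when a client connection is established.
    void onConnect(String ip, String role) {
        connections.computeIfAbsent(key(ip, role), k -> new LongAdder()).increment();
    }

    // Called when a client connection is closed.
    void onDisconnect(String ip, String role) {
        LongAdder counter = connections.get(key(ip, role));
        if (counter != null) {
            counter.decrement();
        }
    }

    long count(String ip, String role) {
        LongAdder counter = connections.get(key(ip, role));
        return counter == null ? 0 : counter.sum();
    }
}
```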

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: yyj8#7

@github-actions (bot)

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-required Your PR changes impact docs and you will update later. and removed doc-label-missing labels Jan 20, 2024
@yyj8 (Contributor, Author) commented Jan 20, 2024

/pulsarbot run-failure-checks

@asafm (Contributor) commented Jan 25, 2024

This is a good way to kill Prometheus or any other TSDB.
Maybe we can include it in broker stats of some sort.
Also, you must bound this data structure; the number of label combinations can easily become very high.

@dao-jun (Member) commented Jan 26, 2024

Agree with @asafm: the ip label may lead to too many label values.

@yyj8 (Contributor, Author) commented Jan 28, 2024

@asafm @dao-jun, thank you for your help.

I mainly considered two scenarios:

  1. Kubernetes container deployment: in this scenario, the client IP that Pulsar sees is the IP of the Kubernetes worker node. For most user teams the number of such IPs is small, from a few to a few dozen nodes; a very small number of users may reach a few hundred nodes or more.

  2. Virtual machine or physical machine deployment: in this scenario, the client IP that Pulsar sees is the VM or physical host IP, and it is relatively rare for this to reach hundreds or even thousands of servers.

So, can we add a parameter exposeIpRoleLevelMetricsInPrometheus to the conf/broker.conf file to let users decide whether to enable this metric, just as producer and consumer metrics are exposed?
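For illustration only, such a switch could be read like any other broker.conf property. This is a minimal stdlib sketch under that assumption, not actual Pulsar configuration code; the flag name is the one proposed above and the class name is hypothetical:

```java
import java.util.Properties;

// Illustrative sketch: read the proposed flag from a broker.conf-style
// properties source and use it to guard the metric update path, so the
// per-IP/role metric stays off by default.
class IpRoleMetricsGate {
    private final boolean exposeIpRoleLevelMetrics;

    IpRoleMetricsGate(Properties brokerConf) {
        this.exposeIpRoleLevelMetrics = Boolean.parseBoolean(
                brokerConf.getProperty("exposeIpRoleLevelMetricsInPrometheus", "false"));
    }

    // Callers would check this before registering or updating the gauge.
    boolean enabled() {
        return exposeIpRoleLevelMetrics;
    }
}
```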

@asafm (Contributor) commented Jan 29, 2024

Can a role be enough? It should be low cardinality, I believe.

@yyj8 (Contributor, Author) commented Jan 30, 2024

> Can a role be enough? It should be low cardinality, I believe.

Do you think the following implementation would be better?

    private static final Gauge clientRoleConnections = Gauge.build()
            .name("pulsar_broker_client_role_connections")
            .labelNames("role")
            .help("The number of client role connections")
            .register();

The above metric may not be able to quickly locate the specific server causing a high connection count when the same role has connections on multiple servers. We have encountered this issue in our current production environment.
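One way to address the cardinality concern raised earlier would be to cap the number of distinct label values tracked and fold the overflow into a catch-all bucket. A stdlib-only sketch of that idea, where the cap value, the class name, and the "other" bucket name are all illustrative assumptions rather than anything in the PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: bound the number of distinct roles tracked so the
// metric cannot grow without limit; once the cap is reached, new roles
// are counted under the catch-all "other" bucket. The size check is
// best-effort (not atomic with the insert), which is acceptable for a
// monitoring metric.
class BoundedRoleCounter {
    private final int maxRoles;
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    BoundedRoleCounter(int maxRoles) {
        this.maxRoles = maxRoles;
    }

    void increment(String role) {
        String label = counts.containsKey(role) || counts.size() < maxRoles
                ? role
                : "other";
        counts.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    long count(String role) {
        LongAdder c = counts.get(role);
        return c == null ? 0 : c.sum();
    }
}
```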

@asafm (Contributor) commented Jan 30, 2024

Just thinking out loud: given that you know the role, and your app using the client can expose the role so you can pinpoint the hosts running the clients that way, would measuring the number of connections on the client side suffice?

@yyj8 (Contributor, Author) commented Jan 31, 2024

> Just thinking out loud: Given you know the role and your app using the client can expose the role so you can pinpoint the hosts running the clients this way, will the number of connections from the client side suffice?

Our current pain point is that multiple business teams access the same Pulsar cluster. Cluster traffic is not high (under 200 MB/s), but there are a lot of topics (split into multiple tenants by customer, nearly 200,000 topic partitions), and the number of connections in the cluster has reached nearly 500,000. As a result, the proxy or broker connection limit is frequently exhausted and all client requests are rejected. Because there is nowhere to get a role-level or IP-level summary of connections, we cannot find which business is accessing the cluster abnormally or push them to optimize.

Main scenarios:

  1. Multiple business teams use different roles to access the same Pulsar cluster from different client machines;
  2. The same business team uses the same role to access the same Pulsar cluster from multiple client machines or multiple apps.

Only with both of these pieces of information can we accurately locate which team has abnormal access.

@asafm (Contributor) commented Jan 31, 2024

@yyj8 Just validating your requirements: can you elaborate on why monitoring the connection count from the client side, rather than the broker side, would be impossible?

@dao-jun (Member) commented Jan 31, 2024

It is really hard to make this decision; the metric might be helpful for troubleshooting.

But the Pulsar broker already has too many metrics and too many configuration items, and if we add an ip label to this metric, it could easily kill the time-series DB.

Since we can't convince each other, I suggest sending a discussion to the mailing list and letting the community decide.

@yyj8 (Contributor, Author) commented Feb 1, 2024

> it is totally hard to make the decision, the metric might be helpful to troubleshooting.
>
> but pulsar broker has tooo many metrics and tooo many configuration items, and if we add ip label for the metric, it could easily kill the time series DB.
>
> Since we can't convince each other, I suggest send a discuss to the mail list, let the community decides.

Mailing list: https://lists.apache.org/thread/6oyhv6fmzr72oc1hcxw9llwvzlw09cqp


Labels

doc-required Your PR changes impact docs and you will update later.
