[enhancement] Optimization of Metric Calculation in trace-etl-server #586

@sadadw1

We have recently further optimized the metric calculation in trace-etl-server. The motivation is our observation that microservice architectures often contain applications with extremely high call volumes and many upstream and downstream dependencies. We refer to these as "head applications".

Because of their high traffic, instances of these head applications typically rank among the top consumers in the entire cluster. In addition, because of their extensive dependencies, the dispersion (cardinality) of their metric label values, such as the HTTP or Dubbo methodName, can be orders of magnitude higher than for other applications.

Together, these two factors cause trace-etl-server to generate an enormous volume of metrics when processing head applications. This excessive metric load can cause:

- Memory exhaustion in trace-etl-server itself.
- Significant delays when Prometheus scrapes these metrics.
- High pressure on Prometheus during metric aggregation.
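
To make the scale concrete, here is a rough back-of-the-envelope sketch. It relies on the fact that Prometheus stores one time series per unique combination of label values, so the worst case for a single metric is the product of the distinct values of each label. The `instance` label name and all of the numbers are hypothetical, chosen only for illustration; only `methodName` comes from the description above.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    Every unique combination of label values is a separate series,
    so the upper bound is the product of the per-label value counts.
    """
    return prod(label_cardinalities.values())


# Hypothetical figures, for illustration only.
ordinary_app = {"instance": 4, "methodName": 20}
head_app = {"instance": 50, "methodName": 2000}

print(estimated_series(ordinary_app))  # 80
print(estimated_series(head_app))      # 100000
```

Under these assumed figures, one head application yields over a thousand times as many series as an ordinary one for the same metric, which is what drives both the memory use and the scrape time.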

To address these issues, we have implemented the following improvements:

1. Configurable metric merging per application.
2. A recommendation to isolate head applications by consuming their traces in dedicated trace-etl-server instances in production environments.
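
The issue does not describe how merging is configured or implemented, so the following is only a minimal sketch of the general idea, not the actual trace-etl-server code. It assumes a hypothetical per-application setting (`MERGE_LABELS`) that names the labels to collapse into one fixed value (`MERGED_VALUE`) before counting. The application names, label names, and span data are all made up for illustration.

```python
from collections import Counter

# Hypothetical per-application config: labels whose values are collapsed.
MERGE_LABELS = {"order-gateway": {"methodName"}}
MERGED_VALUE = "merged"  # placeholder value for collapsed labels (assumption)


def series_key(app: str, labels: dict[str, str]) -> tuple:
    """Build the series identity, collapsing labels configured for this app."""
    merged = MERGE_LABELS.get(app, set())
    return (app,) + tuple(
        sorted(
            (name, MERGED_VALUE if name in merged else value)
            for name, value in labels.items()
        )
    )


def aggregate(spans: list[tuple[str, dict[str, str]]]) -> Counter:
    """Count calls per series after applying the per-application merge."""
    counts: Counter = Counter()
    for app, labels in spans:
        counts[series_key(app, labels)] += 1
    return counts


spans = [
    ("order-gateway", {"type": "dubbo", "methodName": "queryOrder"}),
    ("order-gateway", {"type": "dubbo", "methodName": "createOrder"}),
    ("user-profile", {"type": "dubbo", "methodName": "getUser"}),
    ("user-profile", {"type": "dubbo", "methodName": "updateUser"}),
]
counts = aggregate(spans)
print(len(counts))  # 3: one merged series for order-gateway, two for user-profile
```

In this sketch the head application loses per-method detail but keeps its total call count, while applications without a merge entry are unaffected. That trade-off is why making the merge configurable per application, rather than global, is useful.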

This optimization ensures better stability and performance for both trace-etl-server and Prometheus when handling high-traffic services.
