Add proposal for OpenTelemetry tracing in operator reconciliation loops #197
This proposal introduces native OpenTelemetry distributed tracing support for Strimzi operators (Cluster, Topic, and User Operators) to provide visibility into reconciliation performance and enable root-cause analysis.

Key aspects of the proposal:
- OTLP export by default (compatible with Jaeger, Grafana Tempo, etc.)
- Configurable via environment variables following OTel conventions
- Opt-in feature gate to minimize impact on existing deployments
- Semantic span naming for reconciliation phases
- Phased implementation: Cluster Operator → Topic/User Operators

This addresses the observability gap where Strimzi supports tracing for data plane components (Connect, Bridge, MirrorMaker) but not for the control plane operators themselves.

Signed-off-by: Said BOUDJELDA <bmscomp@gmail.com>
scholzj
left a comment
Thanks for the proposal. I think this would be a useful addition. Some quick comments:
- It would be good to understand the implementation details. While the idea is good, the proposal does not share much about how it will be done.
- It would be good to better understand how the spans will be managed. The parts you outlined do not match how the operator flow really works.
- While enabling tracing per operator seemingly makes sense, I wonder how useful it really is in real life because, depending on the sampling, you might not get many traces for a given operand to work with. This is not like a messaging system with 1000s of messages per second, but rather a few reconciliations per hour.
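
To put rough numbers on the sampling concern, here is a minimal sketch. It assumes 3 reconciliations per hour for one operand (an illustrative figure, not taken from the proposal), a sampling ratio of 0.1, and that each reconciliation is sampled independently. The class and method names are hypothetical.

```java
/** Hypothetical back-of-the-envelope helper; not part of Strimzi. */
final class SamplingEstimate {
    private SamplingEstimate() { }

    /** Expected number of sampled traces over a time window. */
    static double expectedSampledTraces(double reconciliationsPerHour,
                                        double hours,
                                        double samplingRatio) {
        return reconciliationsPerHour * hours * samplingRatio;
    }

    /**
     * Probability that none of the given reconciliations is sampled,
     * assuming each one is sampled independently with the same ratio.
     */
    static double probabilityOfNoTrace(int reconciliations, double samplingRatio) {
        return Math.pow(1.0 - samplingRatio, reconciliations);
    }
}
```

Under these assumptions, `SamplingEstimate.expectedSampledTraces(3, 24, 0.1)` comes to about 7 traces per day for the operand, and `SamplingEstimate.probabilityOfNoTrace(3, 0.1)` suggests that roughly 73% of hours would produce no trace at all.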
1. **Slow Reconciliations**: Why did a Kafka reconciliation take 15 minutes instead of 2 minutes?
2. **Hidden Latency**: Which phase (rolling update, certificate renewal, Cruise Control call) caused delays?
3. **Cross-Component Visibility**: How do changes to `Kafka` CR propagate through to `KafkaNodePool`, `KafkaTopic`, and `KafkaUser` resources?
That is not really how it works. So I do not think this makes sense.
│ Strimzi Cluster Operator │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
│ │ Kafka Assembler │ │ Connect Assembler│ │ Bridge Assembler│ │
| value: "0.1" | ||
| ``` | ||
#### Feature Gate
We use feature gates to protect features that might undergo significant changes, be removed later, or have a major backwards-compatibility impact. This does not sound like such a feature. Why isn't it enough to enable it through the STRIMZI_TRACING_ENABLED environment variable?
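
As a minimal sketch of what an environment-variable switch could look like: the `TracingConfig` class and `isTracingEnabled` method below are hypothetical and not part of the Strimzi code base. Only the variable name comes from the comment above. The sketch assumes tracing should stay off unless the variable is explicitly set to `true`.

```java
import java.util.Map;

/** Hypothetical helper; not part of the Strimzi code base. */
final class TracingConfig {
    // Variable name taken from the review comment; everything else is assumed.
    static final String ENV_TRACING_ENABLED = "STRIMZI_TRACING_ENABLED";

    private TracingConfig() { }

    /**
     * Returns true only when the variable is present and equals "true"
     * (case-insensitive, surrounding whitespace ignored). Any other value,
     * or a missing variable, leaves tracing disabled.
     */
    static boolean isTracingEnabled(Map<String, String> env) {
        String raw = env.get(ENV_TRACING_ENABLED);
        return raw != null && Boolean.parseBoolean(raw.trim());
    }
}
```

An operator could call `TracingConfig.isTracingEnabled(System.getenv())` once at startup. Taking the environment as a `Map` keeps the check easy to unit test without touching the real process environment.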
├── ReconcileUsers (propagated to User Operator)
└── ReconcileTopics (propagated to Topic Operator)
This is not really how it works. The User and Topic operators are completely independent of the Cluster Operator. There is no synchronization or communication between them. So while tracing the User and Topic operators makes sense, there will never be any kind of trace like this.