Add proposal for OpenTelemetry tracing in operator reconciliation loops #197

Open
bmscomp wants to merge 1 commit into strimzi:main from bmscomp:proposal/opentelemetry-tracing-for-operator-reconciliation

@bmscomp bmscomp commented Feb 4, 2026

This proposal introduces native OpenTelemetry distributed tracing support for Strimzi operators (Cluster, Topic, and User Operators) to provide visibility into reconciliation performance and enable root-cause analysis.

Key aspects of the proposal:

  • OTLP export by default (compatible with Jaeger, Grafana Tempo, etc.)
  • Configurable via environment variables following OTel conventions
  • Opt-in feature gate to minimize impact on existing deployments
  • Semantic span naming for reconciliation phases
  • Phased implementation: Cluster Operator → Topic/User Operators

This addresses the observability gap where Strimzi supports tracing for data plane components (Connect, Bridge, MirrorMaker) but not for the control plane operators themselves.

Signed-off-by: Said BOUDJELDA <bmscomp@gmail.com>
Member

@scholzj scholzj left a comment

Thanks for the proposal. I think this would be a useful addition. Some quick comments:

  • It would be good to understand the implementation details. While the idea is good, the proposal does not share much about how it will be done.
  • It would be good to better understand how the spans will be managed. The parts you outlined do not reflect how the operator flow really works.
  • While enabling the tracing per-operator seemingly makes sense, I wonder how useful it really is in real life, because depending on the sampling, you might not get many traces for a given operand to work with. This is not like a messaging system with thousands of messages per second, but rather a few reconciliations per hour.
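
To put rough numbers on the sampling concern, here is a minimal sketch. It assumes 4 reconciliations per hour for one operand and a 10% ratio-based sampler, and it treats sampling decisions as independent. Both figures and the class name `SamplingEstimate` are hypothetical, chosen only for illustration.

```java
final class SamplingEstimate {

    /** Expected number of sampled traces out of the given number of reconciliations. */
    static double expectedTraces(int reconciliations, double ratio) {
        return reconciliations * ratio;
    }

    /** Probability that none of the reconciliations is sampled, assuming independent decisions. */
    static double probabilityOfNoTrace(int reconciliations, double ratio) {
        return Math.pow(1.0 - ratio, reconciliations);
    }

    public static void main(String[] args) {
        int perHour = 4;     // assumption: "a few reconciliations per hour"
        double ratio = 0.1;  // assumption: 10% ratio-based sampling
        System.out.printf("expected traces per hour: %.2f%n", expectedTraces(perHour, ratio));
        System.out.printf("P(no trace in an hour): %.4f%n", probabilityOfNoTrace(perHour, ratio));
    }
}
```

Under these assumptions an operand would yield about 0.4 traces per hour, and roughly two out of three hours would produce no trace at all, so a much higher ratio (or always-on sampling) may be needed for operator tracing to be useful.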


1. **Slow Reconciliations**: Why did a Kafka reconciliation take 15 minutes instead of 2 minutes?
2. **Hidden Latency**: Which phase (rolling update, certificate renewal, Cruise Control call) caused delays?
3. **Cross-Component Visibility**: How do changes to `Kafka` CR propagate through to `KafkaNodePool`, `KafkaTopic`, and `KafkaUser` resources?
Member

That is not really how it works. So I do not think this makes sense.

│ Strimzi Cluster Operator │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
│ │ Kafka Assembler │ │ Connect Assembler│ │ Bridge Assembler│ │
Member

AssemblyOperators?

value: "0.1"
```

#### Feature Gate
Member

We use feature gates to protect features that might undergo significant changes, be removed later, or have a major backwards compatibility impact. This does not sound like such a feature. Why isn't it enough to enable it through the STRIMZI_TRACING_ENABLED environment variable?
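
For illustration, an environment-variable toggle could be as small as the sketch below. Only the variable name `STRIMZI_TRACING_ENABLED` comes from this discussion. The class `TracingToggle`, the default of `false`, and the parsing rules are assumptions, not existing Strimzi code.

```java
import java.util.Map;

final class TracingToggle {

    // Variable name taken from the review comment; everything else here is hypothetical.
    static final String ENABLED_VAR = "STRIMZI_TRACING_ENABLED";

    /** Returns true only when the variable is set to "true" (case-insensitive); unset means disabled. */
    static boolean isTracingEnabled(Map<String, String> env) {
        return Boolean.parseBoolean(env.getOrDefault(ENABLED_VAR, "false").trim());
    }

    public static void main(String[] args) {
        System.out.println("tracing enabled: " + isTracingEnabled(System.getenv()));
    }
}
```

Defaulting to disabled keeps the feature opt-in for existing deployments without tying it to the feature gate lifecycle.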

Comment on lines +110 to +111
├── ReconcileUsers (propagated to User Operator)
└── ReconcileTopics (propagated to Topic Operator)
Member

This is not really how it works. The User and Topic operators are completely independent of the Cluster Operator. There is no synchronization or communication between them. So while tracing the User and Topic operators makes sense, there will never be any kind of trace like this.
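
A toy model of what that means for the trace structure: since no context is passed between the operators, each one starts its own root span with its own trace ID. This is a plain-Java stand-in, not the OpenTelemetry API. The class `IndependentTraces`, the type `SpanModel`, and the span names `ReconcileKafka` and `RollingUpdate` are hypothetical; `ReconcileUsers` and `ReconcileTopics` are the names from the quoted lines.

```java
import java.util.UUID;

final class IndependentTraces {

    /** Toy stand-in for a span; not the OpenTelemetry API. */
    static final class SpanModel {
        final String traceId;
        final String parentName; // null for a root span
        final String name;

        SpanModel(String traceId, String parentName, String name) {
            this.traceId = traceId;
            this.parentName = parentName;
            this.name = name;
        }

        /** Starts a new trace: fresh trace ID, no parent. */
        static SpanModel root(String name) {
            return new SpanModel(UUID.randomUUID().toString(), null, name);
        }

        /** Continues the same trace as a child of this span. */
        SpanModel child(String childName) {
            return new SpanModel(traceId, name, childName);
        }
    }

    public static void main(String[] args) {
        // Cluster Operator: phases inside one reconciliation share a trace.
        SpanModel kafka = SpanModel.root("ReconcileKafka");
        SpanModel rolling = kafka.child("RollingUpdate");

        // User and Topic Operators: separate processes, so separate root spans.
        SpanModel users = SpanModel.root("ReconcileUsers");
        SpanModel topics = SpanModel.root("ReconcileTopics");

        System.out.println("same trace within the Cluster Operator: " + kafka.traceId.equals(rolling.traceId));
        System.out.println("users shares trace with kafka: " + kafka.traceId.equals(users.traceId));
        System.out.println("topics shares trace with kafka: " + kafka.traceId.equals(topics.traceId));
    }
}
```

In this model the spans within one Cluster Operator reconciliation share a trace, while the User and Topic Operator spans are unrelated roots, which is why the nested tree in the proposal would not appear in a tracing backend.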
