Add proposal for OpenTelemetry tracing in operator reconciliation loops #197
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,300 @@ | ||
| # OpenTelemetry Tracing for Strimzi Operator Reconciliation Loops | ||
|
|
||
| This proposal introduces native OpenTelemetry distributed tracing support for Strimzi operators (Cluster Operator, Topic Operator, and User Operator) to provide visibility into reconciliation performance and enable root-cause analysis of operational issues. | ||
|
|
||
| ## Current situation | ||
|
|
||
| Strimzi currently supports distributed tracing for **Kafka components** (data plane): | ||
| - **Kafka Connect**: Traces messages consumed and produced by connectors | ||
| - **MirrorMaker 2**: Traces messages from source to target clusters | ||
| - **Kafka Bridge**: Traces HTTP-to-Kafka message flows | ||
|
|
||
| This is enabled via `spec.tracing.type: opentelemetry` in the respective custom resources, using the OTLP protocol by default. | ||
|
|
||
| However, **Strimzi operators** (control plane) do **not** support distributed tracing: | ||
| - **Cluster Operator**: No visibility into Kafka/Connect/Bridge reconciliation phases | ||
| - **Topic Operator**: No visibility into topic creation, configuration, or deletion latency | ||
| - **User Operator**: No visibility into user creation, credential generation, or ACL application | ||
|
|
||
| Current debugging relies on: | ||
| 1. Log analysis (verbose but unstructured) | ||
| 2. Prometheus metrics (aggregated, not request-scoped) | ||
| 3. Kubernetes events (coarse-grained) | ||
|
|
||
| ## Motivation | ||
|
|
||
| ### Problem Statement | ||
|
|
||
| Operators often experience reconciliation failures or slowdowns that are difficult to diagnose: | ||
|
|
||
| 1. **Slow Reconciliations**: Why did a Kafka reconciliation take 15 minutes instead of 2 minutes? | ||
| 2. **Hidden Latency**: Which phase (rolling update, certificate renewal, Cruise Control call) caused delays? | ||
| 3. **Cross-Component Visibility**: How do changes to `Kafka` CR propagate through to `KafkaNodePool`, `KafkaTopic`, and `KafkaUser` resources? | ||
| 4. **Production Debugging**: When a reconciliation fails, what was the exact sequence of operations? | ||
|
|
||
| ### Value Proposition | ||
|
|
||
| | Benefit | Description | | ||
| |---------|-------------| | ||
| | **Root Cause Analysis** | Identify exact phase causing reconciliation failures or slowdowns | | ||
| | **Performance Profiling** | Measure duration of rolling updates, CA renewals, topic operations | | ||
| | **Cross-Resource Correlation** | Trace how `Kafka` changes affect dependent resources | | ||
| | **SRE Observability** | Integrate with existing enterprise tracing infrastructure (Jaeger, Grafana Tempo, Datadog) | | ||
| | **Kafka 4.0 Alignment** | Complements KIP-938 (KRaft Performance Metrics) and KIP-1076 (Client Metrics) | | ||
|
|
||
| ### Industry Alignment | ||
|
|
||
| OpenTelemetry is a CNCF project and the de facto vendor-neutral standard for observability. Major Kubernetes operators already support tracing: | ||
| - **ArgoCD**: Built-in OTLP tracing for GitOps sync operations | ||
| - **Flux**: OpenTelemetry instrumentation for reconciliation loops | ||
| - **Crossplane**: Tracing for provider reconciliations | ||
|
|
||
| ## Proposal | ||
|
|
||
| ### Overview | ||
|
|
||
| Add optional OpenTelemetry tracing instrumentation to all Strimzi operators, with: | ||
| 1. **OTLP export** by default (compatible with Jaeger, Grafana Tempo, etc.) | ||
| 2. **Configurable via environment variables** (following OTel conventions) | ||
| 3. **Opt-in feature gate** to minimize impact on existing deployments | ||
| 4. **Semantic span naming** following OpenTelemetry conventions | ||
|
|
||
| ### Architecture | ||
|
|
||
| ``` | ||
| ┌─────────────────────────────────────────────────────────────────────────┐ | ||
| │ Strimzi Cluster Operator │ | ||
| ├─────────────────────────────────────────────────────────────────────────┤ | ||
| │ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ | ||
| │ │ Kafka Assembler │ │ Connect Assembler│ │ Bridge Assembler│ │ | ||
|
Member
AssemblyOperators? |
||
| │ │ [Traced] │ │ [Traced] │ │ [Traced] │ │ | ||
| │ └────────┬────────┘ └────────┬─────────┘ └────────┬────────┘ │ | ||
| │ │ │ │ │ | ||
| │ ┌────────▼─────────────────────▼──────────────────────▼───────┐ │ | ||
| │ │ OpenTelemetry SDK │ │ | ||
| │ │ - Span creation for each reconciliation phase │ │ | ||
| │ │ - Context propagation between operators │ │ | ||
| │ │ - Resource attributes (cluster name, namespace, version) │ │ | ||
| │ └────────────────────────────────┬────────────────────────────┘ │ | ||
| └───────────────────────────────────┼─────────────────────────────────────┘ | ||
| │ OTLP/gRPC or OTLP/HTTP | ||
| ▼ | ||
| ┌──────────────────────────────────┐ | ||
| │ OpenTelemetry Collector │ | ||
| │ (or direct to Jaeger/Tempo) │ | ||
| └──────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| ### Span Hierarchy | ||
|
|
||
| Each reconciliation will create a trace with the following span structure: | ||
|
|
||
| ``` | ||
| ReconcileKafka (root span) | ||
| ├── ValidateResource | ||
| ├── ReconcileNodePools | ||
| │ ├── ReconcileNodePool[pool-a] | ||
| │ └── ReconcileNodePool[pool-b] | ||
| ├── ReconcileBrokers | ||
| │ ├── GenerateConfigs | ||
| │ ├── CreateOrUpdateStrimziPodSets | ||
| │ └── RollingUpdate | ||
| │ ├── RollPod[my-cluster-kafka-0] | ||
| │ ├── RollPod[my-cluster-kafka-1] | ||
| │ └── RollPod[my-cluster-kafka-2] | ||
| ├── ReconcileCruiseControl | ||
| │ └── CallCruiseControlAPI | ||
| ├── ReconcileCertificates | ||
| │ ├── CheckCAExpiry | ||
| │ └── RenewCertificates | ||
| ├── ReconcileUsers (propagated to User Operator) | ||
| └── ReconcileTopics (propagated to Topic Operator) | ||
|
Comment on lines +110 to +111
Member
This is not really how it works. The User and Topic operators are completely independent of the Cluster Operator. There is no synchronization or communication between them. So while tracing the User and Topic operators makes sense, there will never be any kind of trace like this. |
||
| ``` | ||
|
|
||
| ### Configuration | ||
|
|
||
| #### Environment Variables (Cluster Operator Deployment) | ||
|
|
||
| Following OpenTelemetry conventions: | ||
|
|
||
| ```yaml | ||
| apiVersion: apps/v1 | ||
| kind: Deployment | ||
| metadata: | ||
| name: strimzi-cluster-operator | ||
| spec: | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: strimzi-cluster-operator | ||
| env: | ||
| # Enable tracing (opt-in) | ||
| - name: STRIMZI_TRACING_ENABLED | ||
| value: "true" | ||
| # OpenTelemetry service name | ||
| - name: OTEL_SERVICE_NAME | ||
| value: "strimzi-cluster-operator" | ||
| # OTLP endpoint (gRPC by default) | ||
| - name: OTEL_EXPORTER_OTLP_ENDPOINT | ||
| value: "http://jaeger-collector.observability:4317" | ||
| # Optional: Export protocol (grpc or http/protobuf) | ||
| - name: OTEL_EXPORTER_OTLP_PROTOCOL | ||
| value: "grpc" | ||
| # Optional: Sampling ratio (0.0 to 1.0) | ||
| - name: OTEL_TRACES_SAMPLER | ||
| value: "parentbased_traceidratio" | ||
| - name: OTEL_TRACES_SAMPLER_ARG | ||
| value: "0.1" | ||
| ``` | ||
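To make the `parentbased_traceidratio` setting above more concrete, the following is a minimal, stdlib-only sketch of the decision such a sampler makes: a span with a parent inherits the parent's sampling decision, and a root span is sampled when a value derived from its trace ID falls below the configured ratio. The class name `ParentBasedRatioSketch` and the exact way the trace ID bits are used are assumptions made for illustration; this is a simplified model, not the OpenTelemetry SDK's code.

```java
/** Simplified, stdlib-only model of a parent-based trace-ID-ratio sampling decision. */
final class ParentBasedRatioSketch {
    private final double ratio;
    private final long upperBound;

    ParentBasedRatioSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
        }
        this.ratio = ratio;
        // Threshold in the non-negative 63-bit range that trace ID values are compared against
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * @param parentSampled null for a root span, otherwise the parent's sampled flag
     * @param traceIdHex    32 lowercase hex characters
     */
    boolean shouldSample(Boolean parentSampled, String traceIdHex) {
        if (parentSampled != null) {
            // "Parent-based": child spans follow the decision already made for the trace
            return parentSampled;
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Assumption: use the low 64 bits of the trace ID as the source of randomness
        long low64 = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long value = low64 & Long.MAX_VALUE; // clear the sign bit to get a non-negative value
        return value < upperBound;
    }
}
```

With `OTEL_TRACES_SAMPLER_ARG` set to `0.1`, roughly one in ten root reconciliations would be traced under this model, and every child span of a sampled reconciliation would be kept with it.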
|
|
||
| #### Feature Gate | ||
|
Member
We use feature gates to protect features that might undergo significant changes, be removed later, or have a major backwards compatibility impact. This does not sound like such a feature. Why isn't it enough to enable it through the |
||
|
|
||
| Tracing will be controlled by a new feature gate: | ||
|
|
||
| ```yaml | ||
| env: | ||
| - name: STRIMZI_FEATURE_GATES | ||
| value: "+OperatorTracing" | ||
| ``` | ||
|
|
||
| | Gate | Default | Description | | ||
| |------|---------|-------------| | ||
| | `OperatorTracing` | `disabled` | Enable OpenTelemetry tracing for operators | | ||
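As an illustration of how an operator could evaluate the gate at startup, here is a minimal sketch that parses a `STRIMZI_FEATURE_GATES` value and checks whether `OperatorTracing` is switched on. It assumes the usual comma-separated `+Gate` / `-Gate` syntax; the class `FeatureGatesSketch` and its methods are hypothetical and are not Strimzi's actual feature gate parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses a value such as "+OperatorTracing,-SomeOtherGate". */
final class FeatureGatesSketch {

    static Map<String, Boolean> parse(String value) {
        Map<String, Boolean> gates = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return gates; // nothing configured, so every gate keeps its default
        }
        for (String token : value.split(",")) {
            String entry = token.trim();
            boolean hasPrefix = entry.length() >= 2
                    && (entry.charAt(0) == '+' || entry.charAt(0) == '-');
            if (!hasPrefix) {
                throw new IllegalArgumentException("Invalid feature gate entry: '" + entry + "'");
            }
            // '+' enables the gate, '-' disables it
            gates.put(entry.substring(1), entry.charAt(0) == '+');
        }
        return gates;
    }

    /** The gate is disabled by default, matching the table above. */
    static boolean tracingEnabled(String featureGatesValue) {
        return parse(featureGatesValue).getOrDefault("OperatorTracing", false);
    }
}
```

In an operator this would typically be called once with `System.getenv("STRIMZI_FEATURE_GATES")`, so that no tracing code is initialized when the gate is off.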
|
|
||
| ### Span Attributes | ||
|
|
||
| Each span will include semantic attributes following OpenTelemetry conventions: | ||
|
|
||
| | Attribute | Example Value | Description | | ||
| |-----------|---------------|-------------| | ||
| | `strimzi.resource.kind` | `Kafka` | Kind of resource being reconciled | | ||
| | `strimzi.resource.name` | `my-cluster` | Name of the resource | | ||
| | `strimzi.resource.namespace` | `kafka` | Namespace of the resource | | ||
| | `strimzi.resource.generation` | `5` | Generation of the resource | | ||
| | `strimzi.operator.version` | `0.51.0` | Strimzi operator version | | ||
| | `strimzi.kafka.version` | `4.1.1` | Kafka version being managed | | ||
| | `strimzi.reconciliation.trigger` | `periodic`, `watch`, `manual` | What triggered the reconciliation | | ||
|
|
||
| ### Error Handling | ||
|
|
||
| Failed reconciliations will: | ||
| 1. Set span status to `ERROR` | ||
| 2. Record exception details via `span.recordException()` | ||
| 3. Add `error.type` and `error.message` attributes | ||
|
|
||
| ```java | ||
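| try { | ||
|     reconcileKafka(kafka); | ||
| } catch (Exception e) { | ||
|     // Mark the span as failed and record the exception as a span event | ||
|     span.setStatus(StatusCode.ERROR, e.getMessage()); | ||
|     span.recordException(e); | ||
|     throw e; | ||
| } finally { | ||
|     // End the span on both success and failure so that it is exported | ||
|     span.end(); | ||
| } | ||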
| ``` | ||
|
|
||
| ### Implementation Phases | ||
|
|
||
| #### Phase 1: Cluster Operator (Core) | ||
| - Add OpenTelemetry SDK dependency (`io.opentelemetry:opentelemetry-sdk`) | ||
| - Instrument `KafkaAssemblyOperator` reconciliation loop | ||
| - Add spans for major phases: validation, node pools, brokers, rolling updates | ||
| - Feature gate and environment variable configuration | ||
|
|
||
| #### Phase 2: Topic and User Operators | ||
| - Instrument `TopicOperator` batch processing | ||
| - Instrument `UserOperator` credential and ACL operations | ||
| - Context propagation from Cluster Operator via CR annotations | ||
|
|
||
| #### Phase 3: Cross-Operator Correlation | ||
| - Propagate trace context through `strimzi.io/trace-context` annotation | ||
| - Enable end-to-end traces from `Kafka` CR change to `KafkaUser` credential update | ||
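The proposal does not specify what the `strimzi.io/trace-context` annotation value would contain. One possible encoding, sketched here purely as an assumption, is the W3C Trace Context `traceparent` string (`00-<trace-id>-<span-id>-<flags>`). The class `TraceContextAnnotationSketch` and its nested `Context` type are hypothetical, stdlib-only helpers that format and validate such a value.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for carrying a W3C traceparent value in a resource annotation. */
final class TraceContextAnnotationSketch {
    static final String ANNOTATION = "strimzi.io/trace-context";

    // version 00, 32 hex chars trace ID, 16 hex chars span ID, 2 hex chars flags
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static final class Context {
        final String traceId;
        final String spanId;
        final boolean sampled;

        Context(String traceId, String spanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.sampled = sampled;
        }
    }

    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns an empty Optional for missing or malformed values instead of failing. */
    static Optional<Context> parse(String value) {
        if (value == null) {
            return Optional.empty();
        }
        Matcher m = TRACEPARENT.matcher(value);
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid in W3C Trace Context
        if (traceId.matches("0{32}") || spanId.matches("0{16}")) {
            return Optional.empty();
        }
        // The lowest bit of the flags byte is the "sampled" flag
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 1) == 1;
        return Optional.of(new Context(traceId, spanId, sampled));
    }
}
```

Returning an empty result for malformed values means a consumer would simply start a new trace rather than fail a reconciliation over a bad annotation.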
|
|
||
| ### Example Trace Visualization | ||
|
|
||
| In Jaeger/Grafana Tempo, a typical trace would show: | ||
|
|
||
| ``` | ||
| Trace ID: abc123... | ||
| Duration: 4m 32s | ||
|
|
||
| [4m 32s] ReconcileKafka (my-cluster) | ||
| ├── [120ms] ValidateResource | ||
| ├── [1.2s] ReconcileNodePools | ||
| │ ├── [600ms] ReconcileNodePool (pool-a) | ||
| │ └── [580ms] ReconcileNodePool (pool-b) | ||
| ├── [3m 45s] ReconcileBrokers ← Bottleneck identified! | ||
| │ ├── [200ms] GenerateConfigs | ||
| │ └── [3m 44s] RollingUpdate | ||
| │ ├── [45s] RollPod (my-cluster-kafka-0) | ||
| │ ├── [48s] RollPod (my-cluster-kafka-1) ← Slow restart | ||
| │ └── [2m 10s] RollPod (my-cluster-kafka-2) ← Very slow! | ||
| ├── [12s] ReconcileCruiseControl | ||
| └── [8s] ReconcileCertificates | ||
| ``` | ||
|
|
||
| ## Affected/not affected projects | ||
|
|
||
| ### Affected | ||
|
|
||
| | Project | Changes | | ||
| |---------|---------| | ||
| | `strimzi-kafka-operator` | Core implementation in Cluster, Topic, and User Operators | | ||
| | `strimzi-kafka-operator` (docs) | Documentation for enabling and configuring tracing | | ||
| | Helm charts | Add environment variable templates for tracing configuration | | ||
|
|
||
| ### Not Affected | ||
|
|
||
| | Project | Reason | | ||
| |---------|--------| | ||
| | `strimzi-kafka-bridge` | Already has OpenTelemetry support | | ||
| | `strimzi-kafka-oauth` | Out of scope (authentication plugin) | | ||
| | `strimzi-drain-cleaner` | Minimal reconciliation, tracing not beneficial | | ||
| | `metrics-reporter` | Metrics-focused, not tracing | | ||
|
|
||
| ## Compatibility | ||
|
|
||
| ### Backwards Compatibility | ||
|
|
||
| - **Opt-in**: Tracing is disabled by default and must be explicitly enabled via the feature gate | ||
| - **No CRD changes**: Configuration via environment variables only | ||
| - **No behavioral changes**: Operators function identically with tracing disabled | ||
| - **Graceful degradation**: If OTLP endpoint is unavailable, tracing fails silently | ||
|
|
||
| ### Forward Compatibility | ||
|
|
||
| - **OpenTelemetry SDK versioning**: Use BOM for consistent dependency versions | ||
| - **Semantic conventions**: Follow stable OTel semantic conventions to avoid breaking changes | ||
| - **Feature gate graduation**: Plan for Beta → GA promotion in future releases | ||
|
|
||
| ### Upgrade Path | ||
|
|
||
| 1. Upgrade Strimzi as normal | ||
| 2. Optionally enable `+OperatorTracing` feature gate | ||
| 3. Configure `OTEL_EXPORTER_OTLP_ENDPOINT` to point to tracing backend | ||
|
|
||
| ## Rejected alternatives | ||
|
|
||
| ### 1. Jaeger-only implementation | ||
|
|
||
| **Rejected because**: OpenTelemetry provides vendor-neutral support. Users can choose Jaeger, Grafana Tempo, Datadog, or other backends without code changes. | ||
|
|
||
| ### 2. CRD-based configuration (e.g., `Kafka.spec.clusterOperator.tracing`) | ||
|
|
||
| **Rejected because**: | ||
| - Cluster Operator runs independently of any specific `Kafka` CR | ||
| - Environment variables follow OTel conventions and are simpler | ||
| - Avoids chicken-and-egg problem for tracing operator startup | ||
|
|
||
| ### 3. Java agent-based instrumentation | ||
|
|
||
| **Rejected because**: | ||
| - Less control over span structure and attributes | ||
| - Potential conflicts with other Java agents | ||
| - Manual instrumentation allows semantic span naming | ||
|
|
||
| ### 4. Micrometer Tracing instead of OpenTelemetry SDK | ||
|
|
||
| **Rejected because**: | ||
| - Adds abstraction layer that complicates configuration | ||
| - Unlike OpenTelemetry, Micrometer Tracing is not a CNCF standard, and OpenTelemetry has broader ecosystem support | ||
| - Direct SDK usage aligns with existing Strimzi component tracing | ||
That is not really how it works. So I do not think this makes sense.