Revised Stretch Cluster Design with Pluggable Network Provider Interface #187

aswinayyolath wants to merge 9 commits into strimzi:main
Conversation
This proposal describes how to implement stretch clusters in Strimzi. Signed-off-by: Aswin A <aswin6303@gmail.com>
Fixed OwnerReference confusion. Signed-off-by: Aswin A <aswin6303@gmail.com>
Removed network latency testing section from the proposal. Signed-off-by: Aswin A <aswin6303@gmail.com>
Updated proposal to remove redundant sections. Signed-off-by: Aswin A <aswin6303@gmail.com>
Added Architecture diagrams Signed-off-by: Aswin A <aswin6303@gmail.com>
Removed redundant section on garbage collection challenges in stretch clusters. Signed-off-by: Aswin A <55191821+aswinayyolath@users.noreply.github.com>
katheris left a comment:
Thanks @aswinayyolath I've had an initial look. I will need to do a more thorough review but have added some initial questions.
I also had a quick look at the implementation. You seem to have the reconciliation loop duplicated in KafkaReconciler, so you have to list everything twice. I'm not keen on that as the final implementation, as it creates quite a maintenance problem going forwards.
| ### Low-Level Design and Implementation
| #### Garbage Collection for Remote Resources
I'm quite hesitant about the idea of using a ConfigMap in lieu of an owning resource. For one thing, if the ConfigMap is accidentally deleted, it could presumably take up to 5 minutes for the Strimzi operator to notice and put it back, which is quite a long time for the resources to be missing.
The GC ConfigMap serves as a proxy owner: it's a lightweight, standalone resource in each remote cluster that exists solely to anchor ownership chains. All remote resources reference this ConfigMap as their owner, enabling Kubernetes' native garbage collection to cascade-delete everything when the ConfigMap is removed.
The concern about the 5-minute reconciliation delay is valid, but AFAIK this scenario is extremely unlikely in practice. We can double-check.
Ideally, the operator's reconciliation loop checks for the GC ConfigMap's existence as the first step when reconciling any Kafka CR in stretch mode. If it is missing, it is recreated immediately, before any other resources.
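As a sketch of the pattern under discussion (all names here are illustrative, not taken from the proposal), a resource created in a remote cluster would carry an ownerReference pointing at that cluster's GC ConfigMap, so deleting the ConfigMap cascade-deletes everything anchored to it:

```yaml
# Illustrative only: a remote Secret anchored to a hypothetical GC ConfigMap.
# Deleting the ConfigMap lets Kubernetes' garbage collector remove this Secret.
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-broker-certs        # hypothetical resource name
  namespace: kafka
  ownerReferences:
    - apiVersion: v1
      kind: ConfigMap
      name: my-cluster-stretch-gc      # hypothetical GC ConfigMap name
      uid: <uid-of-the-gc-configmap>   # must match the live ConfigMap's UID
      blockOwnerDeletion: true
```

Note that ownerReferences only work within a namespace, so each remote cluster (and namespace) needs its own GC ConfigMap, which matches the per-cluster design described above.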
| The Strimzi project provides reference implementations in separate repositories:
| **NodePort Provider** (`strimzi-stretch-nodeport-plugin`)
From what I can see, the reference implementations really only have a single class. For those that don't require further dependencies, could they be part of the Strimzi cluster-operator? Otherwise we would need to handle release and versioning of those separately.
You're correct that the NodePort and LoadBalancer providers are relatively simple (a single class, no external dependencies). However, keeping them as external plugins, even Strimzi-maintained ones, provides several important benefits:
The core operator should focus on Kafka orchestration (reconciliation, configuration, status management). Networking is an infrastructure-level concern that varies by environment, and making networking implementations external enforces this separation at the architectural level. (This was one of the main concerns raised by maintainers on the older proposal.)
- A bug fix or enhancement in the NodePort provider shouldn't require rebuilding and releasing the entire operator
- Users can update plugins without operator downtime
- Different environments might need different plugin versions (e.g., NodePort v1.0 for dev, v1.1 for production)
Plugins can be tested independently against multiple operator versions; users can validate plugin behavior in their environment before an operator upgrade.
124-stretch-cluster.md (outdated)
| #### Summary: When to Use Stretch Clusters
| **✅ Use Stretch Clusters When:**
| - Deploying across multiple AZs in same datacenter (< 5ms latency)
Presumably this is across multiple Kubernetes clusters in multiple AZs in the same datacenter? Since otherwise you wouldn't need a stretch cluster. It does seem like quite a narrow use case compared to the size of the code needed.
Yes. A stretch cluster is meant to span different Kubernetes clusters in the same data centre (across multiple AZs), provided that latency is minimal.
| }
| ```
| #### Plugin Loading Mechanism
Can the user change the plugin and restart the operator? What would the implications of that be? How would the old resources be cleaned up?
If the plugin is supposed to be the same for the entire lifespan of the operator, why not include it in the image directly?
No, users can't safely change the networking plugin for an existing stretch cluster. The plugin determines the fundamental networking architecture (DNS names, Service types, endpoint formats). Changing plugins would require:
- Deleting the existing Kafka cluster (it generates new certificates with different SANs, and new Services with different types)
- Losing all data (Kafka state is tied to pod identities and network endpoints)
- Reconfiguring all clients (bootstrap addresses change)
The operator should detect plugin changes and reject them.
How will the operator detect the plugin changes?
Currently there is no way for the operator to detect plugin changes. If the user changes the plugin in the deployment, the operator restarts with the new plugin. This is not recommended.
@rohananilkumar if that is the case, it seems to contradict @aswinayyolath's statement that "The operator should detect plugin changes and reject them".
The operator does not detect plugin changes right now in the PoC. I think what Aswin meant here is that the proper behaviour would be for the operator to detect plugin changes and, when it detects such a change, reject it.
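One way to implement the rejection discussed above, sketched under stated assumptions (the annotation key and class are hypothetical and not part of the current PoC or the Strimzi API), is to record the configured provider class on the Kafka CR on first reconciliation and fail fast on any later mismatch:

```java
import java.util.Map;

// Hypothetical helper: records the networking provider class in a Kafka CR
// annotation on first reconciliation and rejects any later change.
public class PluginChangeGuard {

    // Hypothetical annotation key, not defined by Strimzi today
    static final String ANNOTATION = "strimzi.io/stretch-networking-provider";

    /**
     * Returns the annotation value the operator should persist on the CR.
     * Throws if the provider recorded on the CR differs from the one the
     * operator was started with, so reconciliation aborts before touching
     * any networking resources.
     */
    public static String validate(Map<String, String> crAnnotations, String configuredClass) {
        String recorded = crAnnotations.get(ANNOTATION);
        if (recorded == null) {
            return configuredClass;   // first reconciliation: record the provider
        }
        if (!recorded.equals(configuredClass)) {
            throw new IllegalStateException("Networking provider changed from "
                    + recorded + " to " + configuredClass + "; this is not supported");
        }
        return recorded;              // unchanged: carry on
    }
}
```

The CR (rather than operator memory) is used as the source of truth so the check survives operator restarts, which is exactly the scenario being debated here.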
ppatierno left a comment:
@aswinayyolath thanks, I had a first pass and left some comments and questions. I also agree with @katheris about the need to avoid any code duplication in the reconciliation process.
124-stretch-cluster.md (outdated)
| **Alternative:** GitOps tools naturally provide CR storage and replication to new clusters.
| ## Compatibility and Migration
Talking about "migration": did you have any thoughts about the possibility of migrating an Apache Kafka cluster that is already running on a single Kubernetes cluster to being stretched across N Kubernetes clusters? It could be that today some users who needed a stretched cluster were running with just one Kubernetes cluster because of the missing support in Strimzi. I would expect that they would like to move a running Kafka cluster to being stretched once the feature is in.
Great question; we have indeed considered this scenario. Here's our analysis and roadmap. The initial implementation focuses on creating new stretch clusters from scratch; this provides the foundation and validates the architecture. We have a backlog item to find an answer for migration, but there are several challenges: existing brokers have IDs 0-N, so adding nodes in remote clusters requires careful ID allocation; existing certificates have SANs for single-cluster DNS only, and need to be regenerated with cross-cluster SANs without breaking existing connections; and advertised.listeners and controller.quorum.voters must change, requiring a rolling restart of all brokers.
Yeah, I know it's going to be challenging, which is why I asked the question, but at the same time it's important to have thoughts about it. If it's not part of this proposal, you should clarify that it's a "non-goal" for now.
| Map<String, String> listeners);
| // Generate controller.quorum.voters configuration
| Future<String> generateQuorumVoters(
A heads-up here, but it's worth thinking about up front: Strimzi will get support for the dynamic quorum in 2026 (I would expect in Q1), and it will be able to operate already-running clusters with static quorum and new ones with dynamic quorum (also migrating from one to the other). The dynamic quorum doesn't use the controller.quorum.voters parameter anymore but controller.quorum.bootstrap.servers instead (with a different format as well). Maybe this interface should provide both from the beginning.
Thanks for letting us know.
| #### Plugin JAR Deployment
| Users must provide the networking provider plugin JAR file to the cluster operator.
| This is accomplished by creating a ConfigMap containing the plugin JAR and mounting it as a volume in the operator deployment.
I am not sure the usage of a ConfigMap is the right choice for the goal. First, it has a 1 MB limit AFAIK. Not sure how big the JAR for a stretch cluster plugin could be, but it's a risk anyway. Another option could be a PVC, or OCI Artifacts and Image Volumes (but that's still in beta in Kube 1.33). At this point, maybe it would be better to cook the plugin JAR into the operator image? The drawback, of course, would be that in case of a new version of the plugin, the image needs to be rebuilt, while with another external mechanism it would be enough to restart the pod (after putting the new JAR in the right place).
You're right that ConfigMaps have a 1 MB limit; however, this isn't a practical concern for our reference implementations. Even with future enhancements, these plugins are unlikely to exceed 500 KB. The PVC option might be overkill for this, and OCI Artifacts are still in beta. If we bake this into the operator image, we should be willing to accept rebuilding and redeploying the entire operator for bug and CVE fixes, etc.
I think using a ConfigMap here is an anti-pattern. ConfigMaps are used for storing data to be consumed as metadata, configuration and so on, but not for storing JARs (or any binary file). The fact that a ConfigMap can be mounted as a volume is to allow the application running in the pod to get this data and configuration via files. Take into account that everything you put into a ConfigMap goes into etcd, so by putting a JAR there you are actually storing a BLOB within etcd. I think the good way would be baking the JAR into a custom operator image.
Baking the JAR into a custom operator image would be better. We are looking into modifying the proposal in that direction.
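For illustration, the custom-image approach being discussed could look roughly like this (a sketch only; the base image tag, JAR name, and target directory are assumptions, not values fixed by the proposal):

```dockerfile
# Sketch of a custom operator image that bakes in the networking plugin JAR.
# Base image tag and paths are illustrative.
FROM quay.io/strimzi/operator:latest

# Place the plugin JAR where the operator's classpath can pick it up
COPY strimzi-stretch-nodeport-plugin.jar /opt/strimzi/lib/
```

A plugin update then means rebuilding this thin image and rolling the operator Deployment, which trades the ConfigMap's convenience for keeping binaries out of etcd.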
124-stretch-cluster.md (outdated)
| This means all Kafka clusters managed by a single operator instance must use the same networking provider.
| Different networking providers cannot coexist in a single operator instance.
| If networking provider were configurable per Kafka CR, the operator would need to:
I think the proposal could be simplified by moving what wasn't picked as the solution into the rejected alternatives section. The proposal should be just what you propose and how it should work, in your opinion. All the reasoning about other not-picked choices should not be there; reviewers can go through the rejected alternatives to read about them. I think this is happening in more places across the proposal, which could be simplified at its core this way.
124-stretch-cluster.md (outdated)
| managed-resources: "StrimziPodSet,ServiceAccount,Secret,ConfigMap,Service,ServiceExport,PersistentVolumeClaim"
| ```
| **Why no ownerReferences?**
Another example of a "why no ..." that should be in the rejected alternatives section instead. If you really want to mention it here, a short sentence with a link would be better, to keep the proposal core simple.
| 2. Kubernetes deletes the Kafka CR immediately
| 3. Operator's watch detects the deletion and triggers reconciliation
| 4. Since the CR no longer exists, the operator's `delete()` method is invoked
| 5. The `delete()` method explicitly deletes GC ConfigMap from each remote cluster:
What happens if the user deletes the Kafka CR on the central cluster but, when it comes to deleting the GC ConfigMap(s), there is a connectivity issue with one or more remote clusters and the operator can't do it? How does the operator know what to do on the next reconciliation? Or: the Kafka CR is deleted, then the operator crashes. How does it know it has to delete the other resources across remote clusters for a Kafka CR which doesn't exist and for which it doesn't have any details anymore?
This is again an excellent question, and we have thought about this already. I think the answer is a finalizer on the Kafka CR: when finalizers are present, Kubernetes doesn't delete the CR immediately.
The workflow should look something like this. When someone deletes a Kafka CR, the operator:
- Deletes the GC ConfigMaps in all remote clusters
- Kubernetes automatically cascades deletion of all owned resources
- Removes the finalizer from the Kafka CR
- The Kafka CR is fully deleted
Because the CR is retained until the finalizer is removed, the operator keeps its details across connectivity issues or crashes and can retry the remote cleanup on the next reconciliation.
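As a sketch of what this would look like on the CR (the finalizer name is hypothetical; Strimzi does not currently define one):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  # Hypothetical finalizer; its presence makes "kubectl delete" only set
  # deletionTimestamp, so the CR stays readable until cleanup succeeds.
  finalizers:
    - stretch.strimzi.io/remote-cleanup
spec:
  # ...
```

With this in place, `kubectl delete kafka my-cluster` marks the CR for deletion, the operator performs remote cleanup using the still-available CR details, and only then removes the finalizer, letting Kubernetes complete the deletion.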
scholzj left a comment:
I left some comments. I think it is a step in the right direction, but there is a lot of work left. The plugin design seems not to take into account the public versus private APIs (or did I miss it somewhere? It is easy to miss in such a long proposal). It also needs better integration into the reconciliation flow rather than just creating some strange override points. For example, a lot of the stuff needs to be really plugged into the KafkaListenersReconciler rather than into some artificial entry points. I do not think this creates a stable design which encourages writing of the plugins.
| - name: STRIMZI_STRETCH_PLUGIN_CLASS_PATH
|   value: /opt/strimzi/plugins/*
Why is custom classpath needed? We do not use it for other plugins.
That's a good catch. We're looking into baking the plugin JAR into the operator image. With that change, we can also remove the custom classpath and load it like any other generic plugin.
| - name: STRIMZI_STRETCH_PLUGIN_CLASS_NAME
|   value: io.strimzi.plugin.stretch.NodePortNetworkingProvider
This should be based on some cluster configuration, not CO-level configuration.
Stupid question ... are you envisaging different Kafka clusters being stretched across the same Kubernetes clusters but using different networking, operated by the same CO? I.e. Kubernetes clusters A, B and C; Kafka 1 using Cilium and Kafka 2 using Submariner, so configuring the plugin at the Kafka CR level (if that's what you mean by "some cluster configuration"). Is it really possible? I am not an expert in installing these networking layers, so I'm asking.
Yes, I expect that a large number of the ~5 users interested in this will want to mix the plugins. 😜
Now seriously ...
- I do not see any significant benefits to making this configurable only on a per-operator basis
- The operator might not operate only Kafka clusters on Kube clusters A, B, and C. It might operate one Kafka on Kubes A, B, and C ... and another one on A, D, and E.
- The above is especially true given I expect significantly more users for this for migrations between Kubernetes clusters and not for permanently stretched cluster environments.
- As we expect multiple implementations:
- It should be expected that users would be interested in evaluating them, testing them, etc. Which is where there might be multiple Kafka clusters across the same Kube clusters using different plugins.
- While I guess migration from one plugin to another within a single cluster might not be something that can be easily supported without a significant rework of how Strimzi is configured (which would possibly be very useful for other scenarios, but I do not see anyone working on that anytime soon), users might still be interested in migrating through parallel clusters using different plugins.
| ResourceOperatorSupplier centralSupplier,
| RemoteResourceOperatorSupplier remoteSupplier);
- These are private APIs
- These are based on Vert.x which is planned for removal
- These are planned for major refactoring
If you want to use these, you have to explain how they will make it into the public API. And you should also solve the outstanding technical debt first. Maybe as an alternative - which might be better regardless of the points above - passing the Fabric8 client instance should be considered?
| #### Plugin JAR Deployment
| Users must provide the networking provider plugin JAR file to the cluster operator.
| This is accomplished by creating a ConfigMap containing the plugin JAR and mounting it as a volume in the operator deployment.
This goes against any good practice. If the user wants to do this, it's their problem, but the proposal should promote only good solutions, e.g. a custom container image.
I agree; we are looking into baking the plugin JAR into the operator image. We initially thought of the ConfigMap approach to make it easier for the user to load the plugin JAR (without building a custom container image).
| - name: STRIMZI_STRETCH_PLUGIN_CLASS_PATH
|   value: /opt/strimzi/plugins/*
If you are using special class path with a special classloader, you should probably not use a generic plugin path but some specific one?
We are looking into baking the plugin JAR into a custom operator image. We will update the plugin path along with this change.
124-stretch-cluster.md (outdated)
| └── Kubernetes Cluster 3 (AZ-1c): 1 controller, 9 brokers
| Network: Low-latency private network (< 1ms)
| Benefit: Survives entire AZ failure or K8s control plane outage
| Survives entire AZ failure

So it does without all of this: a regular Kafka cluster spread across AZs within a single Kubernetes cluster already survives an entire AZ failure.
124-stretch-cluster.md (outdated)
| - **Latency:** 1-10 seconds average
| - **Stability:** Sensitive to network jitter
| **Testing Data:** At 10ms added latency, throughput drops to 17,000 msg/sec (66% reduction). At 50ms, only 4,300 msg/sec remains (91% reduction).
Why does throughput drop with latency? The general expectation would be stability issues rather than a throughput drop. Were stability issues the cause of the throughput drop?
We are not exactly sure why this throughput drop is happening. Our initial thought was that some timeout occurs in Kafka when two brokers cannot communicate within a specific time. We might be able to fix this by modifying some Kafka timeouts, but the experiments we conducted used a vanilla Kafka CR.
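One possible explanation worth checking before blaming timeouts (an assumption on our side, not a measured conclusion) is simple pipelining arithmetic: if replication with acks=all gates each batch on a cross-cluster round trip and the number of in-flight batches is bounded, then

```latex
% Back-of-envelope model: throughput bounded by round trips, not bandwidth
\[
  \text{throughput} \;\approx\;
  \frac{\text{in-flight batches} \times \text{batch size}}{\mathrm{RTT}}
  \;\propto\; \frac{1}{\mathrm{RTT}}
\]
```

The reported figures are roughly consistent with this inverse relationship: 17,000 msg/s x 10 ms and 4,300 msg/s x 50 ms give products of 170,000 and 215,000, i.e. within about 25% of each other, which would point at latency-bound pipelining rather than instability.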
124-stretch-cluster.md (outdated)
| **Alternative:** Use **MirrorMaker 2** for cross region disaster recovery. MM2 provides asynchronous replication optimized for high-latency scenarios and is the correct tool for geographic distribution.
| #### Summary: When to Use Stretch Clusters
I think we absolutely do not want to issue any recommendations like this.
| **Deployment:**
| - Deployed only in the central cluster
| - Manages KafkaTopic and KafkaUser resources
| - No changes required for stretch cluster support
It likely needs to be properly configured to connect to the whole stretch cluster? Same with Kafka Exporter (KE) and Cruise Control (CC).
Can you please elaborate? I am not sure I understand it completely.
124-stretch-cluster.md (outdated)
| **❌ Use MirrorMaker 2 Instead When:**
| - Deploying across regions (> 50ms latency)
| - Geographic disaster recovery is primary goal
| - Can tolerate asynchronous replication
| - Prefer simpler operational model
MirrorMaker 2 has a completely different purpose. Positioning it as an alternative does not make sense. It also provides completely different guarantees, e.g. asynchronous mirroring but much higher guarantees against cascading failures, which the stretch cluster does not provide. This is not a one-or-the-other decision.
That's true. We've updated the proposal to remove MM2 as an alternative.
Thanks @aswinayyolath for your proposal. Based on the comments so far, here are some concrete changes I think you should make to this proposal:
More generally, the proposal is very long, and I think you can do some things to improve it:
@ppatierno @scholzj feel free to disagree with me here if I've misinterpreted your comments. @aswinayyolath let me know if any of the above is unclear and I will give it another more detailed pass as soon as I find time.
Made changes according to review comments. Signed-off-by: ROHAN ANIL KUMAR <rohananilpnr@gmail.com>
Hey @aswinayyolath, this is a very well-detailed proposal for the Stretch Kafka Cluster — a big thanks to you! After reading the … On our end, we don't rely on the MCS API for exporting services across clusters. Istio's default Service Discovery mechanism uses namespace sameness to group endpoints together and create a single entrypoint for distributed backends. With that in mind, I was wondering: is this something that would be supported by Strimzi stretch clusters out of the box, or would we need to implement a custom plugin? Thanks again for this proposal — can't wait to try it out! 💪
This PR presents a revised Stretch Cluster design with a pluggable network provider interface, enabling flexible support for multiple connectivity technologies such as Submariner, Cilium, and Kubernetes primitives like NodePort and LoadBalancer. The design retains a centralized control plane while allowing data plane resources to be realized in remote clusters via KafkaNodePool-based placement, and it clearly defines the extension points for multi-cluster networking without coupling the architecture to a specific implementation.