
Revised Stretch Cluster Design with Pluggable Network Provider Interface#187

Open
aswinayyolath wants to merge 9 commits into strimzi:main from aswinayyolath:stretch-cluster-revised

Conversation

@aswinayyolath

@aswinayyolath aswinayyolath commented Dec 4, 2025

This PR presents a revised Stretch Cluster design with a pluggable network provider interface, enabling flexible support for multiple connectivity technologies such as Submariner, Cilium, and Kubernetes primitives like NodePort and LoadBalancer. The design retains a centralized control plane while allowing data plane resources to be realized in remote clusters via KafkaNodePool-based placement, and clearly defines the extension points for multi-cluster networking without coupling the architecture to a specific implementation.

  1. Strimzi-Kafka-Operator changes: aswinayyolath/strimzi-kafka-operator@eventstreams-stretch-cluster...aswinayyolath:strimzi-kafka-operator:48-migration
  2. MCS plugin implementation: https://github.com/aswinayyolath/strimzi-stretch-mcs-plugin
  3. NodePort plugin implementation: https://github.com/aswinayyolath/strimzi-stretch-nodeport-plugin
  4. LoadBalancer plugin implementation: https://github.com/aswinayyolath/strimzi-stretch-loadbalancer-plugin
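
For illustration, KafkaNodePool-based placement on a remote cluster might look like the sketch below; the `strimzi.io/remote-cluster` annotation name is hypothetical and only conveys the idea, not a finalized API:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers-east
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    strimzi.io/remote-cluster: cluster-b   # hypothetical: target remote cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
```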


This proposal describes how to implement stretch cluster in strimzi

Signed-off-by: Aswin A <aswin6303@gmail.com>
Fixed Ownerreference confusion

Signed-off-by: Aswin A <aswin6303@gmail.com>
Removed Network latency testing session from the proposal

Signed-off-by: Aswin A <aswin6303@gmail.com>
updated proposal to remove redundant sections

Signed-off-by: Aswin A <aswin6303@gmail.com>
Added Architecture diagrams

Signed-off-by: Aswin A <aswin6303@gmail.com>
Removed redundant section on garbage collection challenges in stretch clusters.

Signed-off-by: Aswin A <55191821+aswinayyolath@users.noreply.github.com>
@aswinayyolath aswinayyolath marked this pull request as ready for review December 11, 2025 17:08
Member

@katheris katheris left a comment


Thanks @aswinayyolath I've had an initial look. I will need to do a more thorough review but have added some initial questions.

I also had a quick look at the implementation. You seem to have the reconciliation loop duplicated in KafkaReconciler, so having to list everything twice. I'm not keen on that as the final implementation as it creates quite a maintenance problem going forwards.


### Low-Level Design and Implementation

#### Garbage Collection for Remote Resources
Member


I'm quite hesitant about the idea of using a ConfigMap in lieu of an owning resource. For one thing, if the ConfigMap is accidentally deleted, presumably it could take up to 5 minutes for the Strimzi operator to notice and put it back, which is quite a long time for the resources to be missing.

Author


The GC ConfigMap serves as a proxy owner: it's a lightweight, standalone resource in each remote cluster that exists solely to anchor ownership chains. All remote resources reference this ConfigMap as their owner, enabling Kubernetes' native garbage collection to cascade-delete everything when the ConfigMap is removed.

The concern about the 5-minute reconciliation delay is valid, but AFAIK this scenario is extremely unlikely in practice. We can double-check, though.
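
A minimal sketch of the ownership chain described above, with hypothetical resource names; each remote resource points at the GC ConfigMap via `ownerReferences` so that deleting the ConfigMap cascades:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-cluster-kafka-brokers        # hypothetical remote resource
  ownerReferences:
    - apiVersion: v1
      kind: ConfigMap
      name: my-cluster-stretch-gc       # the GC ConfigMap acting as proxy owner
      uid: <uid-of-gc-configmap>        # must match the live ConfigMap's UID
      blockOwnerDeletion: true
```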

Author


Ideally, the operator's reconciliation loop checks for the GC ConfigMap's existence as the first step when reconciling any Kafka CR in stretch mode. If it is missing, it is recreated immediately, before any other resources.


The Strimzi project provides reference implementations in separate repositories:

**NodePort Provider** (`strimzi-stretch-nodeport-plugin`)
Member


From what I can see the reference implementations really only have a single class. For those that don't require further dependencies could they be part of the Strimzi cluster-operator? Otherwise we would need to handle release and versioning of those separately

Author


You're correct that the NodePort and LoadBalancer providers are relatively simple (a single class, no external dependencies). However, keeping them as external plugins, even Strimzi-maintained ones, provides several important benefits:

The core operator should focus on Kafka orchestration (reconciliation, configuration, status management). Networking is an infrastructure-level concern that varies by environment. Making networking implementations external enforces this separation at the architectural level. (This was one of the main concerns raised by maintainers on the older proposal.)

  • A bug fix or enhancement in the NodePort provider shouldn't require rebuilding and releasing the entire operator
  • Users can update plugins without operator downtime
  • Different environments might need different plugin versions (e.g., NodePort v1.0 for dev, v1.1 for production)

Author


Plugins can be tested independently against multiple operator versions, and users can validate plugin behavior in their environment before an operator upgrade.

#### Summary: When to Use Stretch Clusters

**✅ Use Stretch Clusters When:**
- Deploying across multiple AZs in same datacenter (< 5ms latency)
Member


Presumably this is across multiple Kubernetes clusters in multiple AZs in the same datacenter? Since otherwise you wouldn't need a stretch cluster. It does seem like quite a narrow use case compared to the size of the code needed


@rohan-anil-kumar rohan-anil-kumar Jan 8, 2026


Yes. A stretch cluster is supposed to work across different k8s clusters in the same data centre (in multiple AZs), given that the latency is minimal.


#### Plugin Loading Mechanism
Member


Can the user change the plugin and restart the operator? What would the implications of that be? How would the old resources be cleaned up?

If the plugin is supposed to be the same for the entire lifespan of the operator, why not include it in the image directly?

Author


No, users can't safely change the networking plugin for an existing stretch cluster. The plugin determines fundamental networking architecture (DNS names, Service types, endpoint formats). Changing plugins would require:

  1. Deleting the existing Kafka cluster (generates new certs with different SANs, new Services with different types)
  2. Losing all data (Kafka state tied to pod identities and network endpoints)
  3. Reconfiguring all clients (bootstrap addresses change)

The operator should detect plugin changes and reject them
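
One way the rejection could work, sketched with a hypothetical annotation key: record the provider class on the Kafka CR during the first reconcile and fail fast if the operator's configured provider ever differs. This is an illustration, not the PoC's behavior:

```java
import java.util.Map;

public class PluginChangeGuard {
    // Hypothetical annotation key used to record the provider on first reconcile.
    static final String ANNOTATION = "strimzi.io/networking-provider";

    // Returns the provider class to use, or throws if the Kafka CR was created
    // with a different provider than the operator is now configured with.
    public static String validateProvider(Map<String, String> kafkaAnnotations, String configuredClass) {
        String recorded = kafkaAnnotations.get(ANNOTATION);
        if (recorded == null) {
            return configuredClass; // first reconcile: record the class and proceed
        }
        if (!recorded.equals(configuredClass)) {
            throw new IllegalStateException("Networking provider changed from "
                + recorded + " to " + configuredClass + "; rejecting reconciliation");
        }
        return configuredClass;
    }
}
```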

Member


How will the operator detect the plugin changes?


Currently there is no way for the operator to detect plugin changes. If the user changes the plugin in the deployment, the operator restarts with the new plugin. This is not recommended.

Member


@rohananilkumar if that is the case, it seems to contradict @aswinayyolath's statement that "The operator should detect plugin changes and reject them".


The operator does not detect plugin changes right now in the PoC. I think what Aswin meant here is that the proper behaviour should be that the operator detects plugin changes; when it detects such a change, it should reject it.

Member

@ppatierno ppatierno left a comment


@aswinayyolath thanks, I had a first pass and left some comments and questions. I also agree with @katheris about the need to avoid any code duplication in the reconciliation process.


**Alternative:** GitOps tools naturally provide CR storage and replication to new clusters.

## Compatibility and Migration
Member


Talking about "migration", did you have any thoughts about the possibility to migrate an Apache Kafka cluster from already running on a single Kubernetes cluster to being stretched across N Kubernetes clusters? It could be that today, even if some users needed a stretched cluster, they were running on just one Kubernetes cluster because of the missing support in Strimzi. I would expect that they would like to move a running Kafka cluster to be stretched once the feature is in.

Author


Great question, and we have indeed considered this scenario. Here's our analysis and roadmap.

The initial implementation focuses on creating new stretch clusters from scratch. This provides the foundation and validates the architecture. We have a backlog item to find an answer for this, but there are several challenges: existing brokers have IDs 0-N, so adding nodes in remote clusters requires careful ID allocation; existing certs have SANs for single-cluster DNS only and need to be regenerated with cross-cluster SANs without breaking existing connections; and advertised.listeners and controller.quorum.voters must change, requiring a rolling restart of all brokers.

Member


Yeah, I know it's going to be challenging, which is why I asked the question, but at the same time it's important to have thoughts about it. If it's not part of this proposal you should clarify it's a "non-goal" for now.

Map<String, String> listeners);

// Generate controller.quorum.voters configuration
Future<String> generateQuorumVoters(
Member


A heads up here but it's worth thinking up front. Strimzi will get the support for the dynamic quorum in 2026 (I would expect in Q1) and it will be able to operate already running clusters with static quorum and new ones with dynamic quorum (also migrating from one to the other). The dynamic quorum doesn't use the controller.quorum.voters parameter anymore but controller.quorum.bootstrap.servers instead (with a different format as well). Maybe this interface should provide both since the beginning.
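
For reference, the two configuration styles differ as follows (hostnames are placeholders); static quorum binds voter IDs to endpoints, while the dynamic quorum only needs bootstrap endpoints:

```properties
# Static quorum (existing): voter IDs bound to endpoints
controller.quorum.voters=0@controller-0.example.svc:9090,1@controller-1.example.svc:9090,2@controller-2.example.svc:9090

# Dynamic quorum: bootstrap endpoints only, voter set managed dynamically
controller.quorum.bootstrap.servers=controller-0.example.svc:9090,controller-1.example.svc:9090
```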

Author


Thanks for letting us know

#### Plugin JAR Deployment

Users must provide the networking provider plugin JAR file to the cluster operator.
This is accomplished by creating a ConfigMap containing the plugin JAR and mounting it as a volume in the operator deployment.
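
A minimal sketch of the mount described here, with hypothetical volume and ConfigMap names:

```yaml
# Operator Deployment fragment; names are hypothetical.
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          volumeMounts:
            - name: stretch-plugin
              mountPath: /opt/strimzi/plugins
      volumes:
        - name: stretch-plugin
          configMap:
            name: stretch-networking-plugin   # ConfigMap with binaryData holding the JAR
```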
Member


I am not sure the usage of a ConfigMap is the right choice for the goal. First, it has a 1 MB limit AFAIK. Not sure how big the JAR for a stretch cluster plugin could be, but it's a risk anyway. Another option could be a PVC, or OCI Artifacts and Image Volumes (but that's still in beta in Kube 1.33). At this point, maybe it would be better to cook the plugin JAR within the operator image? The drawback of course would be that in case of a new version of the plugin, the image needs to be rebuilt, while by using another external mechanism it would be enough to restart the pod (after putting the new JAR in the right place).

Author


You're right that ConfigMaps have a 1MB limit; however, this isn't a practical concern for our reference implementations. Even with future enhancements, these plugins are unlikely to exceed 500KB. The PVC option might be overkill for this, and OCI Artifacts are still in beta. If we bake this into the operator image, we have to be willing to accept rebuilding and redeploying the entire operator for bug and CVE fixes.

Member

@ppatierno ppatierno Dec 17, 2025


I think using a ConfigMap here is an anti-pattern. ConfigMap(s) are used for storing data to be consumed as metadata, configurations and so on but not for storing JARs (or any binary file). The fact that a ConfigMap can be mounted as a volume is for allowing the application running in the pod to get these data and configuration via files. Take into account that everything you put into a ConfigMap is going into ETCD, so by putting a JAR you are actually storing a BLOB within ETCD. I think the good way would be baking the JAR within a custom operator image.


Baking the JAR within a custom operator image would be better. We are looking into modifying the proposal to that direction
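
A sketch of what baking the JAR into a custom image could look like; the base image tag and target path are assumptions, not a confirmed layout:

```dockerfile
# Hypothetical tag and path; illustrates the custom-image approach only.
FROM quay.io/strimzi/operator:0.45.0
COPY strimzi-stretch-nodeport-plugin.jar /opt/strimzi/lib/
```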


This means all Kafka clusters managed by a single operator instance must use the same networking provider.
Different networking providers cannot coexist in a single operator instance.
If networking provider were configurable per Kafka CR, the operator would need to:
Member


I think the proposal could be simplified by moving what wasn't picked as solution in the rejected alternatives section. The proposal should be just what you proposed and how it should work in your opinion. All the reasoning about other not-picked choices should not be there. Reviewers can go through the rejected alternatives to read about them. I think this is happening in more places across the proposal that could be simplified at the core this way.

managed-resources: "StrimziPodSet,ServiceAccount,Secret,ConfigMap,Service,ServiceExport,PersistentVolumeClaim"

**Why no ownerReferences?**
Member


Another example of "why no ... " which should be in the rejected alternatives section instead. If you really want to mention it here, maybe a short sentence with a link would be better to simplify the proposal core.

2. Kubernetes deletes the Kafka CR immediately
3. Operator's watch detects the deletion and triggers reconciliation
4. Since the CR no longer exists, the operator's `delete()` method is invoked
5. The `delete()` method explicitly deletes GC ConfigMap from each remote cluster:
Member


What happens if the user deletes the Kafka CR on the central cluster, but when it comes to deleting the GC ConfigMap(s) there is a connectivity issue with one or more remote clusters and the operator can't do it? How does the operator know what to do on the next reconciliation? Or ... the Kafka CR is deleted, then the operator crashes. How does it know it has to delete the other resources across the remote clusters for a Kafka CR which doesn't exist and for which it doesn't have any details anymore?

Author

@aswinayyolath aswinayyolath Dec 14, 2025


This is again an excellent question, and we have thought about it already. I think the answer is a finalizer on the Kafka CR: when a user deletes the Kafka CR, Kubernetes doesn't delete it immediately if finalizers are present.

The workflow should look something like this:

When someone deletes a Kafka CR, the operator:

  1. Deletes GC ConfigMaps in all remote clusters
  2. Kubernetes automatically cascades deletion of all owned resources
  3. Removes finalizer from Kafka CR
  4. Kafka CR is fully deleted
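
The flow above implies the Kafka CR carries a finalizer, e.g. as in the sketch below (the finalizer name is hypothetical); Kubernetes then keeps the CR in a Terminating state until the operator removes the finalizer after remote cleanup:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  finalizers:
    - strimzi.io/stretch-cleanup   # hypothetical: blocks deletion until remote GC is done
```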

Member

@scholzj scholzj left a comment


I left some comments. I think it is a step in the right direction, but there is a lot of work left. The plugin design seems not to take into account public versus private APIs (or did I miss it somewhere? It is easy to miss in such a long proposal). It also needs better integration into the reconciliation flow rather than just creating some strange override points. For example, a lot of the stuff really needs to be plugged into the KafkaListenersReconciler rather than into some artificial entry points. I do not think this creates a stable design which encourages writing of the plugins.

Comment on lines +162 to +163
- name: STRIMZI_STRETCH_PLUGIN_CLASS_PATH
value: /opt/strimzi/plugins/*
Member


Why is custom classpath needed? We do not use it for other plugins.


That's a good catch. We're looking into baking the plugin JAR into the operator image. With that change, we can also remove the custom classpath and load it like any other generic plugin.

Comment on lines +160 to +161
- name: STRIMZI_STRETCH_PLUGIN_CLASS_NAME
value: io.strimzi.plugin.stretch.NodePortNetworkingProvider
Member


This should be based on some cluster configuration. Not CO level configuration.

Member

@ppatierno ppatierno Dec 17, 2025


Stupid question ... are you envisaging different Kafka clusters to be stretched across the same Kubernetes clusters but using different networking and operated by the same CO? I.e. Kubernetes clusters A, B and C. Kafka 1 using Cilium and Kafka 2 using Submariner so configuring the plugin at Kafka CR level (if it's what you mean by "some cluster configuration"). Is it really possible? I am not an expert of installing these networking layers so asking.

Member


Yes, I expect that a large number of the ~5 users interested in this will want to mix the plugins. 😜

Now seriously ...

  • I do not see any significant benefits to making this configurable only on a per-operator basis
  • The operator might not operate only Kafka clusters on Kube clusters A, B, and C. It might operate one Kafka on Kubes A, B, and C ... and another one on A, D, and E.
  • The above is especially true given I expect significantly more users for this for migrations between Kubernetes clusters and not for permanently stretched cluster environments.
  • As we expect multiple implementations:
    • It should be expected that users would be interested in evaluating them, testing them, etc. Which is where there might be multiple Kafka clusters across the same Kube clusters using different plugins.
    • While I guess migration from one plugin to another within a single cluster might not be something that can be easily supported without a significant rework of how Strimzi is configured (which would possibly be very useful for other scenarios, but I do not see anyone working on that anytime soon), users might still be interested in migrating through parallel clusters using different plugins.

Comment on lines +104 to +105
ResourceOperatorSupplier centralSupplier,
RemoteResourceOperatorSupplier remoteSupplier);
Member


  • These are private APIs
  • These are based on Vert.x which is planned for removal
  • These are planned for major refactoring

If you want to use these, you have to explain how they will make it into the public API. And you should also solve the outstanding technical debt first. Maybe as an alternative - which might be better regardless of the points above - passing the Fabric8 client instance should be considered?


We are looking into it

#### Plugin JAR Deployment

Users must provide the networking provider plugin JAR file to the cluster operator.
This is accomplished by creating a ConfigMap containing the plugin JAR and mounting it as a volume in the operator deployment.
Member


This goes against any good practice. If the user wants to do this, their problem. But the proposal should promote only good solutions. E.g. custom container image.


I agree, we are looking into baking the plugin JAR into the operator image. We initially chose the ConfigMap approach to make it easier for the user to load the plugin JAR (without building a custom container image).

Comment on lines +214 to +215
- name: STRIMZI_STRETCH_PLUGIN_CLASS_PATH
value: /opt/strimzi/plugins/*
Member


If you are using special class path with a special classloader, you should probably not use a generic plugin path but some specific one?


We are looking into baking the plugin JAR into a custom operator image. We will update the plugin path along with this change.

└── Kubernetes Cluster 3 (AZ-1c): 1 controller, 9 brokers

Network: Low-latency private network (< 1ms)
Benefit: Survives entire AZ failure or K8s control plane outage
Member


Survives entire AZ failure

So it does without all of this.

- **Latency:** 1-10 seconds average
- **Stability:** Sensitive to network jitter

**Testing Data:** At 10ms added latency, throughput drops to 17,000 msg/sec (66% reduction). At 50ms, only 4,300 msg/sec remains (91% reduction).
Member


Why does throughput drop with latency? The general expectation would be stability issues rather than throughput drop. Were stability issues cause of the throughput drop?


We are not exactly sure why this throughput drop is happening. Our initial thought was some timeout occurring in Kafka when two brokers could not communicate within a specific time. We might be able to fix this by modifying some Kafka timeouts, but the experiments we conducted used a vanilla Kafka CR.
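
One plausible mechanism, sketched under the assumption that acks=all makes each produce wait for a cross-cluster round trip: with a bounded number of in-flight messages, throughput is capped at roughly in-flight / RTT, so added latency translates directly into lower throughput even before stability problems appear. The numbers below are illustrative, not the measured figures quoted above:

```java
public class LatencyThroughputModel {
    // Upper bound for synchronously replicated writes: each in-flight batch
    // needs at least one network round trip to be acknowledged.
    public static double maxMessagesPerSecond(int inFlightMessages, double rttSeconds) {
        return inFlightMessages / rttSeconds;
    }

    public static void main(String[] args) {
        // Illustrative: 500 in-flight messages at 1 ms vs 10 ms round-trip time.
        System.out.println(maxMessagesPerSecond(500, 0.001)); // 500000.0
        System.out.println(maxMessagesPerSecond(500, 0.010)); // 50000.0
    }
}
```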


**Alternative:** Use **MirrorMaker 2** for cross region disaster recovery. MM2 provides asynchronous replication optimized for high-latency scenarios and is the correct tool for geographic distribution.

#### Summary: When to Use Stretch Clusters
Member


I think we absolutely do not want to issue any recommendations like this.

Comment on lines +1636 to +1639
**Deployment:**
- Deployed only in the central cluster
- Manages KafkaTopic and KafkaUser resources
- No changes required for stretch cluster support
Member


Likely needs to be properly configured to connect to the whole stretch cluster? Same with KE and CC.


Can you please elaborate? As I am not sure if I understand it completely.

Comment on lines +1612 to +1616
**❌ Use MirrorMaker 2 Instead When:**
- Deploying across regions (> 50ms latency)
- Geographic disaster recovery is primary goal
- Can tolerate asynchronous replication
- Prefer simpler operational model
Member


Mirror Maker 2 has a completely different purpose. Positioning it as an alternative does not make sense. It also provides completely different guarantees -> e.g. asynch mirroring, but much higher guarantees against cascading failures, which the stretch cluster does not provide. This is not one or the other decision.


That's true. We've updated the proposal to remove MM2 as an alternative.

@katheris
Member

Thanks @aswinayyolath for your proposal. Based on the comments so far here are some concrete changes I think you should make to this proposal:

  • Update the plugin mechanism so it doesn't use a ConfigMap and instead custom images are used for custom plugins
  • Reduce the number of reference plugin implementations to just one, I am also happy with the NodePort one being the only one

More generally the proposal is very long and I think you can do some things to improve it:

  • Move the rejected alternatives section to the end and move all discussions of rejected alternatives under this one heading. For example the section Why Stretch Configuration is at Operator Level could actually be a rejected alternative called Stretch Configuration at Kafka cluster level.
  • Move summary bullet points directly under the Proposal heading
  • Change diagrams into pictures, currently the diagrams are long to scroll past when trying to find a section of the proposal and often you can't actually see the whole diagram at once
  • Throughout the whole proposal you could save space by reformatting, for example in your API code you probably don't need a new line for every field being passed to the methods, and in the sections where you have YAML you can remove parts that aren't relevant to the proposal, for example you have an entire Kafka CR when really you're only talking about the annotation

@ppatierno @scholzj feel free to disagree with me here if I've misinterpreted your comments. @aswinayyolath let me know if any of the above is unclear and I will give it another more detailed pass as soon as I find time.

rohan-anil-kumar and others added 3 commits January 13, 2026 11:00
Signed-off-by: ROHAN ANIL KUMAR <rohananilpnr@gmail.com>
Signed-off-by: ROHAN ANIL KUMAR <rohananilpnr@gmail.com>
Made changes according to review comments
@em-sav

em-sav commented Mar 11, 2026

Hey @aswinayyolath, this is a very well-detailed proposal for the Stretch Kafka Cluster — a big thanks to you!

After reading the Network Architecture section, a question came to mind. Will the Strimzi Operator support a basic networking setup when a flat network is established between pods across multiple Kubernetes clusters? I noticed you have different plugins to allow Multi-Cluster discovery (MCS, NodePort, LoadBalancer, Custom plugin). So, if we have native pod-to-pod communication and use a technology like Istio or Consul that enable Multi-Cluster Service Discovery, would your proposal and NetworkPlugin be compatible with those technologies?

On our end, we don't rely on the MCS API for exporting services across clusters. Istio's default Service Discovery mechanism uses namespace sameness to group endpoints together and create a single entrypoint for distributed backends. With that in mind, I was wondering: is this something that would be supported by Strimzi stretch clusters out of the box, or would we need to implement a custom plugin?

Thanks again for this proposal — can't wait to try it out! 💪
