
Dynamic controller quorum#203

Open
ppatierno wants to merge 16 commits into strimzi:main from ppatierno:dynamic-controller-quorum

Conversation

@ppatierno
Member

This proposal is about adding the support for dynamic quorum and controllers scaling to the Strimzi Cluster Operator.
It replaces #190.
I have already been working on a POC to validate what is currently written within this proposal.
I also added some scenarios of dynamic quorum and controller scaling usage with both happy paths and failures.
It is also possible to try it by deploying a Strimzi Cluster Operator but using the following images in the Deployment file:

  • operator quay.io/ppatierno/operator:dynamic-quorum
  • Kafka 4.1.1 quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.1.1
  • Kafka 4.2.0 quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.2.0

Signed-off-by: Paolo Patierno <ppatierno@live.com>
Comment on lines +99 to +102
Without the bootstrap snapshot containing the `VotersRecord`, initial controllers face a kind of deadlock: they cannot participate in elections or become candidates unless they know who the voters are, but the voters set information comes from the `VotersRecord` which must be written to the replicated log by a leader.
However, no leader can exist without successful elections, and elections cannot happen without controllers knowing the voters set.
This circular dependency means that if all controllers start without a pre-written `VotersRecord` in their bootstrap snapshot, they would all have an empty voters set, preventing any of them from becoming candidates or holding elections, leaving the cluster unable to bootstrap.
The only way to break this deadlock is to pre-write the `VotersRecord` into the bootstrap snapshot during initial formatting using the `--initial-controllers` parameter, ensuring all initial controllers know the voters set from the beginning before the cluster even starts.
Member

This is a really good explanation of why the bootstrap snapshot is needed. This part is not explained in the original KIP-853. Nice!
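As a rough sketch of what breaking the deadlock looks like in practice, the value passed to `--initial-controllers` is a comma-separated list of `<nodeId>@<host>:<port>:<directoryId>` entries. The hostnames and directory IDs below are invented, and the helper is illustrative only, not the operator's actual code:

```shell
# Build an --initial-controllers value from "nodeId host port directoryId" specs.
build_initial_controllers() {
  entries=""
  for spec in "$@"; do
    set -- $spec                       # split "nodeId host port directoryId"
    entries="${entries:+$entries,}$1@$2:$3:$4"
  done
  printf '%s\n' "$entries"
}

INITIAL_CONTROLLERS=$(build_initial_controllers \
  "3 my-cluster-controller-3.svc 9090 IQf-4YZ6Qx20wzUPuf8Oeg" \
  "4 my-cluster-controller-4.svc 9090 PtjEWItSR6W4q1JtnTPXaw")
echo "$INITIAL_CONTROLLERS"
```

Formatting each initial controller with this same value is what pre-writes the `VotersRecord` into every bootstrap snapshot, so all nodes agree on the voters set before the first election.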

The operator also adds a `cluster.new` field to the ConfigMap, which indicates whether this is a brand new cluster (`true`) or an existing cluster (`false`).
This field is determined based on the presence of the `clusterId` in the `Kafka` custom resource status:

* A cluster is considered "new" if the status is null or the `clusterId` is null/empty.
Member

This is confusing. From the description below, the status determines whether it's a static or dynamic quorum, and the clusterId determines whether this cluster is new or existing. Am I right?

Member Author

Referring to the `Kafka` custom resource status: when it's null, the cluster is new anyway. As for being static-quorum based, it's the `controllers` field in the status that is null instead, while the status itself still exists.


* Reads the node role from the `process.roles` property in `/tmp/strimzi.properties` into a `PROCESS_ROLES` variable.
* If the node is a broker only, it formats with the `-N` option (works for both new cluster creation and broker scale-up).
* If the node is a controller, it reads the `cluster.new` field from the ConfigMap and proceeds based on whether this is a new cluster or an existing one:
Member

question:

  1. so we cannot use the clusterId as an identifier to know if this cluster is new or existing at this point of time?
  2. When will cluster.new be set to false?

Member Author

so we cannot use the clusterId as an identifier to know if this cluster is new or existing at this point of time?

We can't use it because, at this point, even a new cluster will already have been assigned a clusterId by the operator for formatting the starting nodes.

When will cluster.new be set to false?

It is set to false once the status is set, with a clusterId within it.

Contributor

Can you maybe provide the full flow of how the clusterId is generated (or not) and, consequently, how these new fields behave? I found it confusing how these new fields work.

I wonder if it would be helpful to explain the flow something like the following if my understanding is correct:

When a new Kafka CR is created, therefore it's a new cluster:

  • Unique clusterId is generated since Kafka status is null. Kafka CR status is not updated yet at this point.
  • Is this when Kafka CR status is populated with the controllers?

  • Configmap is generated with the fields and is set to be volume mounted.
    • Create cluster.new field for the configmap, as Kafka status is null or has null clusterId.
    • Is this when we populate the controllers string for the configmap based on the CR?

  • Cluster is started by running the script which checks the cluster.new field from the configmap (at this point cluster.id exists but just not in the status yet).
  • Kafka CR status is updated with the clusterId retrieved via Admin API.

When reconciling an existing Kafka cluster:

  • clusterId is retrieved from Kafka CR status, as it already contains a valid clusterId.
  • Configmap is generated with the fields and is set to be volume mounted.
    • Do not create cluster.new field for the configmap, as Kafka status has clusterId.
    • Is this when we populate the controllers string for the configmap based on the CR?

  • Cluster is started by running the script which checks that cluster.new does not exist in the configmap.

Member Author

I think the flow is somehow described around line 200.

Do not create cluster.new field for the configmap, as Kafka status has clusterId.

cluster.new is created anyway but it's false.

Although this comment makes me think that the formatting flow could be simplified.
I can simplify it and not distinguish between new and existing clusters; the main difference would just be "is the current node present in the controllers list?" yes -> use -I, no -> use -N.
The only exception would be if the node is in the controllers list but there was a metadata disk change, in which case it needs -N (which is in any case independent of whether the cluster is new or existing). I will work on my POC to validate this simplification.
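The simplified rule above could be sketched in shell roughly as follows. The function name, argument shapes, and the `disk_changed` flag are assumptions for illustration; the `controllers` string uses the `<nodeId>@<host>:<port>:<directoryId>` format from the ConfigMap:

```shell
# Decide the kafka-storage.sh format option for this node (sketch only).
choose_format_option() {
  node_id=$1; controllers=$2; disk_changed=$3
  case ",$controllers," in
    *",$node_id@"*)
      if [ "$disk_changed" = "true" ]; then
        printf '%s\n' "-N"   # metadata disk change: re-format as observer
      else
        printf '%s\n' "-I"   # listed initial controller: embed the VotersRecord
      fi
      ;;
    *)
      printf '%s\n' "-N"     # not in the list (e.g. scale-up): observer format
      ;;
  esac
}
```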

Member Author

@tinaselenge FYI I updated the proposal based on the above idea. The flow in the run script isn't based on new vs. existing cluster anymore, so the cluster.new field was also removed. You can check it around line 227 and onwards.


* BEGIN RECONCILIATION
* ... (other reconcile operations) ...
* **KRaft quorum reconciliation**: analyze current quorum state, unregister and register controllers as needed (typically unregisters controllers being scaled down).
Member

Does this mean the unregistered controller node will be deleted after this reconciliation? If not, then when?

Member Author

Deletion of unregistered controller nodes is part of the "scale down controllers" step; it happens after the controllers are unregistered, but still within the same reconciliation.

* **KRaft quorum reconciliation**: analyze current quorum state, unregister and register controllers as needed (typically unregisters controllers being scaled down).
* scale down controllers.
* ... (other reconcile operations) ...
* **KRaft quorum reconciliation** (rolling): for each controller pod restart, `KafkaRoller` invokes a "single-controller reconciliation" to handle metadata disk changes immediately (unregister old voter with stale directory ID, register new observer with current directory ID).
Member

So the single-controller reconciliation process is only needed for metadata disk changes, right?

Member Author

Yes, because it reconciles the participation in the KRaft quorum only for that specific node; it's not a full KRaft quorum reconciliation, which takes all controllers into account (the code in my POC is shared between the two anyway, because reconciling the KRaft quorum as a whole needs to check quorum participation for each controller).


Analyze desired vs. actual state, for each "desired" controller node:
- Checks if the controller pod has been rolled with the controller role (verified by the presence of the `strimzi.io/controller-role` label on the pod).
- Compares the controller's current state in the quorum (voter, observer, or absent) with the expected state based on the `controllers` status field.
Member

question: If the controllers are scaling up from 3 -> 4 and the newly added controller node has a network issue preventing it from talking to the active controller to register itself, the Admin API might still not show the 4th node as an observer or voter. What will we do in this situation?

Member Author
@ppatierno Mar 5, 2026

In this case the KRaft quorum reconciler logic won't do anything: nothing to register, nothing to unregister. On the next reconciliation, it will still detect that the 4th node is in the "desired" controllers list and get the quorum metadata to compare; if the node is an observer it will be registered, otherwise it is still skipped.
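The per-node decision described above can be sketched as a small state check. This is illustrative only (not the operator's actual code); the state names and the label-check argument are assumptions based on the description:

```shell
# Decide what the quorum reconciler does for one desired controller node.
decide_action() {
  rolled_as_controller=$1   # strimzi.io/controller-role label present?
  quorum_state=$2           # voter | observer | absent
  if [ "$rolled_as_controller" != "true" ]; then
    printf 'skip\n'         # pod not yet rolled with the controller role
  elif [ "$quorum_state" = "observer" ]; then
    printf 'register\n'     # promote to voter via addRaftVoter
  elif [ "$quorum_state" = "voter" ]; then
    printf 'noop\n'         # already in the desired state
  else
    printf 'skip\n'         # absent (e.g. network issue): retry next reconciliation
  fi
}
```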


* Add support for dynamic controller quorum to the Strimzi Cluster Operator and use it by default for any newly created Apache Kafka cluster.
* Add support for controllers scaling by leveraging the dynamic controller quorum.
* Add migration from static to dynamic controller quorum.
Contributor
@tinaselenge Mar 6, 2026

Are these going to be implemented via separate PRs or all in one go? Or is there a plan to break down the implementation in a way that makes it easier to review?

Member Author

The first two bullets will come together for sure because they are strictly related.
The migration one can be done in a separate PR if we want. Meanwhile, the operator just skips the KRaft quorum reconciliation for any "static" quorum cluster (where the "controllers" field in the status is null).

Member

The "Compatibility" section seems to imply that auto migration will be released at the same time.

Member Author

"will be released at the same time" is different from "will be merged as two different PRs".
We'll release the overall dynamic quorum support feature in one go but the controller scaling and migration could come in two separate PRs anyway (to simplify the review process).

They use the `controller.quorum.bootstrap.servers` configuration to contact an existing controller from which to fetch the metadata log containing the `VotersRecord` for the voters set.
The `VotersRecord` contains critical information including voter IDs, directory IDs (unique UUIDs), endpoints, and supported `kraft.version` ranges.

Without the bootstrap snapshot containing the `VotersRecord`, initial controllers face a kind of deadlock: they cannot participate in elections or become candidates unless they know who the voters are, but the voters set information comes from the `VotersRecord` which must be written to the replicated log by a leader.
Contributor

Suggested change
Without the bootstrap snapshot containing the `VotersRecord`, initial controllers face a kind of deadlock: they cannot participate in elections or become candidates unless they know who the voters are, but the voters set information comes from the `VotersRecord` which must be written to the replicated log by a leader.
Without the bootstrap snapshot containing the `VotersRecord`, initial controllers face a deadlock: they cannot participate in elections unless they know who the voters are, but this information must be written to the replicated log by an elected leader.

Member Author

I will push a slightly different version of your suggestion.

Once this standalone controller is running, additional controllers can be formatted with `--no-initial-controllers`, started as observers, and then dynamically added to the quorum using the standard add controller operations.
While this approach simplifies the initial bootstrap by avoiding the need to coordinate directory IDs for all controllers upfront, it means the cluster starts with no fault tolerance since a single-controller quorum cannot tolerate any failures.
However, once additional controllers are added and registered as voters, the cluster achieves the desired redundancy and fault tolerance.
This approach can't cope with how Strimzi starts up the cluster nodes all together, since the operator has no way to do a rolling, one-by-one start on cluster creation.
Contributor

Are you saying, if a quorum is started with a single controller, the rolling cannot happen?

Member Author

What I am saying is that if you have a KafkaNodePool for controllers with 3 replicas, the operator doesn't start the first one only, then move to the second when it's ready, and finally to the third. The operator just creates the StrimziPodSets, which create all 3 pods in parallel. So multiple controllers always start in parallel; there is no sequential start on cluster creation. The operator provides sequential pod restarts only during rolling (thanks to the KafkaRoller).

* builds the broker and controller configuration by setting the `controller.quorum.bootstrap.servers` field (instead of the `controller.quorum.voters` one).
* generates a random directory ID, as a UUID, for each controller.
* saves the `controllers` field (list of `KafkaControllerStatus` objects) within the `Kafka` custom resource status, containing the controller IDs and their corresponding directory IDs.
* builds the controllers string from the controllers status list, and stores it as `controllers` field within the node ConfigMap, to be loaded by the `kafka_run.sh` script where it's needed for formatting the storage properly.
Contributor

controller.quorum.bootstrap.servers field already exists in the configmap at this point, so can it not be used for the script if the format of the string is the same?

Member Author

The string is not the same. The string for storage formatting (with --initial-controllers) has a different format, including a "nodeId" and a "directoryId" for each controller, which are not part of controller.quorum.bootstrap.servers. For example:

The "controllers" field in the ConfigMap for the format purposes:

controllers: 3@my-cluster-controller-3.my-cluster-kafka-brokers.myproject.svc:9090:IQf-4YZ6Qx20wzUPuf8Oeg,4@my-cluster-controller-4.my-cluster-kafka-brokers.myproject.svc:9090:PtjEWItSR6W4q1JtnTPXaw,5@my-cluster-controller-5.my-cluster-kafka-brokers.myproject.svc:9090:hLPNF3NWSC2laduRehT0Dw

versus the controller.quorum.bootstrap.servers configuration within the controller node:

controller.quorum.bootstrap.servers=my-cluster-controller-3.my-cluster-kafka-brokers.myproject.svc:9090,my-cluster-controller-4.my-cluster-kafka-brokers.myproject.svc:9090,my-cluster-controller-5.my-cluster-kafka-brokers.myproject.svc:9090
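The directory IDs in the example above are 22-character URL-safe base64 strings, which is how Kafka renders its UUIDs. As a rough sketch of generating one (an assumption for illustration; the operator would use Kafka's own Uuid class on the Java side):

```shell
# Generate a random directory ID: 16 random bytes, base64url-encoded, unpadded.
gen_directory_id() {
  head -c 16 /dev/urandom | base64 | tr '+/' '-_' | tr -d '='
}
DIR_ID=$(gen_directory_id)
echo "$DIR_ID"
```

16 bytes encode to 24 base64 characters with `==` padding; stripping the padding leaves the 22-character form seen in the `controllers` ConfigMap field.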



In the controllers scaling scenario, the operator handles the registration as well (not only the unregistration).

In addition to handling the registration and unregistration of controllers, the Strimzi Cluster Operator has to check the health of the KRaft quorum before allowing the scale down of the controllers.
At the beginning of the `KafkaReconciler` reconciliation process, if scaling down controllers could break the quorum (by losing the majority of voters needed for consensus), it should be blocked and reverted by the operator.
Contributor

Is the quorum health checked before unregistering the node that is about to be scaled down?

Member Author

Yes, that's the idea.
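A minimal sketch of the safety condition as described (my reading, with assumed names): before unregistering a voter, the healthy voters that would remain must still be a strict majority of the current voters set, otherwise the scale down is blocked.

```shell
# Quorum-safety gate for removing one voter (illustrative only).
can_remove_voter() {
  healthy_after_removal=$1; total_voters=$2
  if [ "$healthy_after_removal" -gt $((total_voters / 2)) ]; then
    printf 'allow\n'
  else
    printf 'block\n'
  fi
}
```

For example, with 3 voters a majority is 2, so removing one voter is allowed only if 2 healthy voters remain.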

- Find if this node ID is part of voters and/or observers in the quorum
- If multiple incarnations detected (same node ID with different directory IDs):
- Read actual directory ID from pod via Kafka Agent HTTP endpoint
- Add voter with wrong directory ID to `toUnregister` list
Contributor

do you mean the old directory id, rather than wrong?

Member Author

Yeah, it's better to stick with "old" even if it's technically "wrong" as well.

- If multiple incarnations detected (same node ID with different directory IDs):
- Read actual directory ID from pod via Kafka Agent HTTP endpoint
- Add voter with wrong directory ID to `toUnregister` list
- Add observer matching actual directory ID to `toRegister` list
Contributor

do you mean the new directory id? How are we differentiating the 2?

Member Author

"actual" would mean the real one coming from reading the Kafka Agent HTTP endpoint, and it would be technically the "new". I will change to "new" but leaving that it's the one actually in the meta.properties.


*Reconciliation:*
- Status before reconciliation: [{3:A}, {4:B}, {5:C}]
- Analysis: Node 6 in desired but NOT in status, operator generates random UUID "D"
Contributor

How do we know whether we need to generate the directory id?

Member Author

It's a leftover of the rejected approach; I will change it and review the rest for the same issue.
FYI, it will be generated by the Kafka storage format tool when the node starts, using the famous -N option, because it's a scale-up.

* removes/unregisters the controllers from the quorum by using the `removeRaftVoter` method in the Kafka Admin API.
* scales down the controllers by deleting the corresponding pods.

If getting the quorum information or unregistering the controllers fails (the Apache Kafka returns an error), the reconciliation will fail to avoid the controllers, not unregistered correctly, to be shutten down.
Contributor

Suggested change
If getting the quorum information or unregistering the controllers fails (the Apache Kafka returns an error), the reconciliation will fail to avoid the controllers, not unregistered correctly, to be shutten down.
If getting the quorum information or unregistering the controllers fails (the Apache Kafka returns an error), the reconciliation will fail to avoid the controllers, not unregistered correctly, to be shut down.

* registers them as voters using the `addRaftVoter` API, following the same sequential registration process described in the scale-up section.

The `strimzi.io/controller-role` label check is critical for resilience: if the operator crashes after some nodes have been rolled but before registration completes, on restart the operator can detect which nodes have already been rolled as controllers (by checking the label) and only register those, skipping nodes that haven't been rolled yet.
This prevents attempting to register nodes that are desired to be controllers but haven't yet been actualized as controllers.
Contributor

Suggested change
This prevents attempting to register nodes that are desired to be controllers but haven't yet been actualized as controllers.
This prevents attempting to register nodes that are desired to be controllers but haven't yet been restarted as controllers.

At the beginning of the reconciliation cycle, the KRaft quorum reconciliation:

* analyzes the quorum state and identifies controllers that need to be removed (those in the current voters but no longer having the controller role in the desired configuration).
* unregisters them from the quorum using the `removeRaftVoter` API, ensuring they are removed before the rolling update begins.
Contributor

Suggested change
* unregisters them from the quorum using the `removeRaftVoter` API, ensuring they are removed before the rolling update begins.
* unregisters them from the quorum using the `removeRaftVoter` API, ensuring they are removed from the voter list before the rolling update begins.


After unregistration completes:

* the operator builds the new node configuration without controller-specific settings.
Contributor

Suggested change
* the operator builds the new node configuration without controller-specific settings.
* the operator builds configurations of the affected nodes without controller-specific settings.


What is the impact of it on downgrading the operator to a previous release where there is no support for dynamic quorum?

When downgraded to an older release, the operator reconfigures the nodes with the old `controller.quorum.voters` parameter and rolls them.
Contributor

so controller.quorum.voters will be added back to the configuration but not actually used for anything right?

Member Author

Yes, exactly. This is how Kafka works for backward compatibility. @showuon can confirm it.

A new controller formatted with `-I` containing all current controllers can still join the quorum as an observer and be registered as a voter.
However, several issues make this approach unsuitable:

- Unnecessary checkpoint file: Formatting with `-I` creates a bootstrap snapshot/checkpoint file on the scaled-up controller's disk. This file is redundant because new controllers don't use it to discover the quorum but they fetch the `VotersRecord` from the leader's metadata log instead. This can be consider a minimal issue.
Contributor

Suggested change
- Unnecessary checkpoint file: Formatting with `-I` creates a bootstrap snapshot/checkpoint file on the scaled-up controller's disk. This file is redundant because new controllers don't use it to discover the quorum but they fetch the `VotersRecord` from the leader's metadata log instead. This can be consider a minimal issue.
- Unnecessary checkpoint file: Formatting with `-I` creates a bootstrap snapshot/checkpoint file on the scaled-up controller's disk. This file is redundant because new controllers don't use it to discover the quorum but they fetch the `VotersRecord` from the leader's metadata log instead. This can be considered a minor issue since the file is quite small and doesn't have any impact.

However, several issues make this approach unsuitable:

- Unnecessary checkpoint file: Formatting with `-I` creates a bootstrap snapshot/checkpoint file on the scaled-up controller's disk. This file is redundant because new controllers don't use it to discover the quorum but they fetch the `VotersRecord` from the leader's metadata log instead. This can be consider a minimal issue.
- Undocumented behavior: Most critically, this approach is not documented in the official Apache Kafka documentation or KIP-853. Relying on undocumented behavior creates a risk that future Kafka versions could change or break this functionality without notice.
Contributor
@tinaselenge Mar 9, 2026

If this gets documented officially, does using -I all the time significantly simplify the process and code? Maybe we could mention here what exactly gets simplified, e.g. it removes the need for the cluster.new field in the config map.

If it gets documented officially after the proposal is accepted and implemented, would you consider changing it? How difficult would it be to refactor?

Member Author

There could be an update about this; I need to discuss further with @showuon because, after some investigation he made, it might not be usable anymore.


```shell
bin/kafka-metadata-quorum.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 describe --status
```
Contributor

I agree that this should be rejected, but maybe we can add a short sentence or two on why? Do we need to include all these commands users have to run? Maybe we can just say that this requires users to run many commands manually in a specific order, and explain how that is not user friendly, is error prone, etc.?

Member Author

I wanted to leave the commands in to make the process more understandable, because this is actually what you would do with a Kafka cluster running on bare metal/VMs, so I guess it helps to understand the automation better.
I will add a couple of sentences about why it was rejected.

kubectl patch kafka my-cluster -n myproject --type=merge --subresource=status -p '{"status":{"controllers":[{"id":3,"directoryId":"U3fHvCoMVWiCVYa2ri_K5w"},{"id":4,"directoryId":"g3OMYG2gvmLCeE9Nv-Cz5Q"},{"id":5,"directoryId":"2K7pPIanujBKY1Tsxr-gWg"}]}}'
```

Patching the `Kafka` custom resource status will trigger nodes rolling and the operator will reconfigure them with the `controller.quorum.bootstrap.servers` field for using the dynamic quorum.
Contributor

Similar to the above, do we need these commands? Can we add why this was rejected?

Member Author

Ditto as above.

Member

@fvaleri fvaleri left a comment

@ppatierno thanks for the proposal and examples.

I left a few comments, but the approach LGTM.


* Add support for dynamic controller quorum to the Strimzi Cluster Operator and using it by default for any newly created Apache Kafka cluster.
* Add support for controllers scaling by leveraging the dynamic controller quorum.
* Add migration from static to dynamic controller quorum.
Member

The "Compatibility" section seems to imply that auto migration will be released at the same time.

Comment on lines +374 to +376
3. Analyze current voters and identify unwanted ones:
- For each voter not in desired controllers:
- Add to `toUnregister` list (handles scale-down scenarios)
Member

What's the toUnregister list order? In other words, which controller is scaled down first when multiple controllers need to be removed? This is crucial to minimize leadership gaps where no metadata changes can be processed.

I think we scale down from the highest-numbered controller pod, right? IMO the operator should prefer removing non-leaders first to minimize these leadership gaps and make scale-down more efficient.

Example:

  1. The user scales from 5 to 3.
  2. The operator starts removing controller-4 which happens to be the leader.
  3. Controller-4 resigns after commit causing a hopefully brief leadership gap.
  4. New leader is elected among {0, 1, 2, 3}.
  5. Now it's the turn of controller-3, which was just elected as the new leader.
  6. The same leader-removal dance happens again.

Member Author

I can think more about this, but take into account that unregistration is pretty fast and happens before scaling down, so I would expect controllers 4 and 3 to be unregistered one right after the other and then shut down. At this point, if 4 was the leader, the new election will start among the remaining 0, 1, 2 (so skipping the dance with 3 being elected).

Member

unregistration is pretty fast

This is assuming that the active controller is always fast, which may not be the case. We can consider this an optimization and do a follow-up, but I would at least mention it in the proposal.

Member Author

I agree on doing this. I changed the reconciliation algorithms by adding the following:

Phase 1: Unregister all controllers in the toUnregister list (stale/unwanted controllers). If one of the controllers is the leader, it will be unregistered last to avoid useless leader elections in between multiple controller unregistrations.

This means that while all controllers start up in a single reconciliation, registering them all as voters may require multiple reconciliation cycles.
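The leader-last ordering described in Phase 1 can be sketched as follows (a hypothetical illustration, not the actual operator code, which is Java; `order_unregistrations` and its parameters are invented for this sketch):

```python
def order_unregistrations(to_unregister, leader_id):
    """Order controller unregistrations so that, if the current quorum
    leader is among the nodes to remove, it is unregistered last --
    avoiding an extra leader election between successive removals."""
    # Unregister non-leader controllers first, in ascending node-id order.
    followers = sorted(n for n in to_unregister if n != leader_id)
    if leader_id in to_unregister:
        # The leader resigns only once, after all other removals are done.
        return followers + [leader_id]
    return followers

# Scaling from 5 to 3 controllers (removing nodes 3 and 4):
print(order_unregistrations([3, 4], leader_id=4))  # [3, 4] -> leader 4 goes last
print(order_unregistrations([3, 4], leader_id=3))  # [4, 3] -> leader 3 goes last
print(order_unregistrations([3, 4], leader_id=0))  # [3, 4] -> leader is kept, plain order
```

This matches the intent above: at most one leader election happens during the whole scale-down, instead of one per removed leader.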

Furthermore, monitoring that the controller has caught up with the active ones is not necessary.
If the controller hasn't caught up yet and the operator runs the registration, the active controller returns an error and the reconciliation will fail, allowing the registration to be retried in the next reconciliation cycle.
Member

Are we going to intercept this error code and log a useful message?

Member Author

Yeah, it's logged by the operator. I'm not sure we want to have some warning conditions in the status.

Signed-off-by: Paolo Patierno <ppatierno@live.com>
formatting purposes

Signed-off-by: Paolo Patierno <ppatierno@live.com>
Signed-off-by: Paolo Patierno <ppatierno@live.com>
Signed-off-by: Paolo Patierno <ppatierno@live.com>
Member

@fvaleri fvaleri left a comment

LGTM. Thanks for addressing my comments.
