Add support for broker demotion #191

# Broker demotion support via KafkaRebalance resource

This proposal extends the `KafkaRebalance` custom resource to support broker demotion by integrating with Cruise Control's `/demote_broker` endpoint.
This would allow users to demote brokers, removing them from partition leadership eligibility, in preparation for maintenance, decommissioning, or other operational needs.

## Current situation

Currently, Strimzi's Cruise Control integration provides several [rebalancing modes](https://strimzi.io/docs/operators/latest/deploying#con-optimization-proposals-modes-str) through the `KafkaRebalance` custom resource:

* `full`: performs a complete cluster rebalance across all brokers
* `add-brokers`: moves replicas to newly added brokers after scaling up
* `remove-brokers`: moves replicas off brokers before scaling down
* `remove-disks`: performs intra-broker disk rebalancing for JBOD configurations

However, there is no built-in way to demote brokers, which would remove them from partition leadership without moving replicas.
When preparing brokers for removal or maintenance, users must rely on the `remove-brokers` mode, which moves all partition replicas off the target brokers.
This is more disruptive than necessary if the goal is simply to ensure that the brokers are not serving as partition leaders.

Cruise Control provides a dedicated [`/demote_broker`](https://github.com/linkedin/cruise-control/wiki/REST-APIs#demote-a-list-of-brokers-from-the-kafka-cluster) endpoint specifically for this use case, but Strimzi does not currently expose it.

## Motivation

There are a few scenarios where demoting brokers without moving replicas is beneficial:

1. **Broker maintenance**: Before performing maintenance on a broker, such as upgrading, patching, or restarting, operators may want to transfer leadership away to minimize the impact on client traffic, while keeping the broker as a follower to maintain replication factor and availability.

2. **Staged decommissioning**: In a multi-step decommissioning process, operators may first demote brokers to observe the impact on leadership distribution and client performance before committing to fully removing replicas with the `remove-brokers` mode.

3. **Performance isolation**: Operators may want to reduce load on specific brokers experiencing performance issues by removing their leadership responsibilities, while keeping them as followers until the issue is diagnosed and resolved.

The `remove-brokers` mode is too aggressive for these scenarios because it moves all replicas off the target brokers, resulting in:

- Significant network bandwidth consumption from replica movement
- Increased CPU and disk I/O on source and destination brokers
- Extended time to complete the operation
- Unnecessary disruption when the goal is only to remove leadership

Broker demotion addresses these concerns by only transferring leadership, which is a lightweight operation compared to replica movement.

## Proposal

The proposal is to add a new `demote-brokers` mode to the existing `KafkaRebalance` custom resource, following the same pattern established by the `add-brokers` and `remove-brokers` modes introduced in [proposal 035](https://github.com/strimzi/proposals/blob/main/035-rebalance-types-scaling-brokers.md).

When `spec.mode` is set to `demote-brokers` in the `KafkaRebalance` resource, partition leadership is moved off the specified brokers while the replicas remain in place.

Users must provide the list of broker IDs to demote via the `spec.brokers` field.

A `KafkaRebalance` custom resource for broker demotion would look like this:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: demote-brokers-example
  labels:
    strimzi.io/cluster: my-cluster
spec:
  mode: demote-brokers
  brokers:
    - 3
    - 4
```

This example would demote brokers 3 and 4, transferring all partition leadership away from them while keeping the replicas in place.

### Supported fields for `demote-brokers` mode

The supported fields, both new and existing, for the `demote-brokers` mode in the `KafkaRebalance` resource spec are as follows:

| Field | Type | Description | Default |
|-------|------|-------------|---------|
| `brokers` | integer array | List of IDs of the brokers to be demoted. | N/A |
| `concurrentLeaderMovements` | integer | Upper bound on the number of ongoing leadership movements. | 1000 |
| `skipUrpDemotion` | boolean | Whether to skip demoting leader replicas of under-replicated partitions. | true |
| `excludeFollowerDemotion` | boolean | Whether to skip demoting follower replicas on the brokers to be demoted. | true |
| `excludeRecentlyDemotedBrokers` | boolean | Whether to prevent leader replicas from being moved to recently demoted brokers. | false |

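As an illustration, a demotion request that tunes these optional fields might look like the following sketch. The field names come from the table above; the values are arbitrary examples:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: demote-brokers-tuned
  labels:
    strimzi.io/cluster: my-cluster
spec:
  mode: demote-brokers
  brokers:
    - 3
  # Limit the number of concurrent leadership movements (default 1000)
  concurrentLeaderMovements: 250
  # Also demote leaders of under-replicated partitions
  # (the default, true, skips them)
  skipUrpDemotion: false
  # Keep the default behavior of not demoting follower replicas
  excludeFollowerDemotion: true
```
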
**NOTE**: As part of this proposal, we will also add an `excludeRecentlyDemotedBrokers` field for the `full`, `add-brokers`, and `remove-brokers` `KafkaRebalance` modes, to give users the ability to prevent leader replicas from being moved to recently demoted brokers.

When `excludeRecentlyDemotedBrokers` is set to `true`, a broker is considered demoted for the duration specified by the Cruise Control `demotion.history.retention.time.ms` server configuration.
By default, this value is 1209600000 milliseconds (14 days), but it is configurable in the `spec.cruiseControl.config` section of the `Kafka` custom resource.
Note that Cruise Control stores this demotion history in memory, so it does not survive a restart of the Cruise Control pod.

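For example, the retention window could be shortened in the `Kafka` custom resource. This is an illustrative fragment; the value chosen here is arbitrary:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... other Kafka configuration ...
  cruiseControl:
    config:
      # Consider brokers "recently demoted" for 1 day
      # instead of the 14-day default
      demotion.history.retention.time.ms: 86400000
```
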
### User workflow

The workflow for using broker demotion follows the same pattern as the other rebalance modes:

1. The user creates a `KafkaRebalance` custom resource with `spec.mode: demote-brokers` and specifies the target broker IDs in `spec.brokers`.

2. The `KafkaRebalanceAssemblyOperator` requests an optimization proposal from Cruise Control via the `/demote_broker` endpoint with `dryrun=true`.

3. The operator transitions the `KafkaRebalance` resource to the `ProposalReady` state.
   The proposal is stored in `status.optimizationResult` and shows which partition leadership transfers will occur.

4. If [auto-approval](https://strimzi.io/docs/operators/latest/deploying#automatically_approving_an_optimization_proposal) is not enabled, the user reviews the proposal and approves it by annotating the resource with `strimzi.io/rebalance=approve`.

5. The operator executes the broker demotion via the `/demote_broker` endpoint with `dryrun=false`.

6. When complete, the operator transitions the `KafkaRebalance` resource to the `Ready` state.

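For step 4, the approval is applied as an annotation on the `KafkaRebalance` resource, typically with `kubectl annotate`. After approval, the resource would look like this sketch (resource names reuse the earlier example):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: demote-brokers-example
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    # Applied by e.g.:
    # kubectl annotate kafkarebalance demote-brokers-example strimzi.io/rebalance=approve
    strimzi.io/rebalance: approve
spec:
  mode: demote-brokers
  brokers:
    - 3
    - 4
```
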
### Validation and constraints

The implementation includes the following validation:

* When `demote-brokers` mode is specified, the `brokers` field must be provided.
  If the field is missing or empty, the operator will reject the rebalance request and report an error in the `KafkaRebalance` status.
  The exact validation mechanism (operator code and/or CRD-level validation such as CEL) is still under discussion and may be settled in a separate proposal covering the related API changes.

* The specified broker IDs in the `brokers` list must exist in the cluster.
  If any of the broker IDs are invalid, the demotion request will be rejected and the error will be reported in the `KafkaRebalance` status.

* When an impossible demotion operation is requested, for example demoting all brokers, or transferring leadership away from the only in-sync replica while the `KafkaRebalance` `spec.skipUrpDemotion` configuration is set to `false`, the demotion request will be rejected and the error will be reported in the `KafkaRebalance` status.

* If a target broker fails while leadership is being transferred to it, all demotion operations involving that broker are aborted, and the source brokers remain the leaders for the affected partitions.
  In this case, the overall demotion request continues on a best-effort basis with the remaining proposed operations, transferring leadership using the brokers that are available.
  In addition, a warning listing the affected demoted brokers that still host leader partitions will be added to the `KafkaRebalance` status.

* The following `KafkaRebalance` resource configuration fields are incompatible with, or no-ops for, broker demotion.
  If any of these fields are specified, the rebalance request will be rejected, an error will be logged by the Strimzi operator, and an error will be added to the `KafkaRebalance` status:
  * `replicaMovementStrategies`
  * `goals`
  * `skipHardGoalCheck`
  * `rebalanceDisk`
  * `excludedTopics`
  * `concurrentPartitionMovementsPerBroker`
  * `concurrentIntraBrokerPartitionMovements`
  * `moveReplicasOffVolumes`
  * `replicationThrottle`

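If CRD-level validation is eventually adopted, the required-`brokers` check above could be expressed as a CEL rule in the `KafkaRebalance` CRD schema. The following is a hypothetical sketch only; the proposal does not commit to this mechanism:

```yaml
# Hypothetical CEL validation rule on the KafkaRebalance CRD spec
# (illustrative only; validation may instead live in operator code)
x-kubernetes-validations:
  - rule: "self.mode != 'demote-brokers' || (has(self.brokers) && size(self.brokers) > 0)"
    message: "spec.brokers must be a non-empty list when mode is demote-brokers"
```
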
### Interaction with other rebalance modes

Broker demotion is independent of the other rebalance modes, but can be used manually before or after them:

* **add-brokers**: After new brokers are added to the cluster, broker demotion could be used to explicitly transfer partition leadership away from existing brokers to accelerate leadership adoption on the newly added brokers.
  For example, a user who wants to gradually migrate traffic to a new set of brokers can first rebalance the existing load onto them with `add-brokers`, then demote the old brokers to shift leadership away from them, before eventually decommissioning the old brokers with `remove-brokers`.

* **remove-brokers**: Before decommissioning or scaling down brokers, broker demotion could be performed as a preparatory step to minimize disruption.

* **remove-disks**: Since disk-level demotion support is not included in this proposal, this interaction is not applicable.

* **full**: After demoting brokers, users could run a `full` mode rebalance to further redistribute leadership across the remaining leader-eligible brokers.

To reduce the complexity of this proposal and its implementation, broker demotion will remain a manual operation, independent of the other rebalance modes and of the auto-rebalancing on cluster scaling described in [proposal 078](https://github.com/strimzi/proposals/blob/main/078-auto-rebalancing-cluster-scaling.md).

## Affected/not affected projects

This proposal impacts the Strimzi Cluster Operator in the areas related to the `KafkaRebalanceAssemblyOperator` and the `KafkaRebalance` API.

## Compatibility

The proposed changes are fully backward compatible:

* **API compatibility**: Adding a new enum value to `KafkaRebalanceMode` does not break existing resources.
  Existing `KafkaRebalance` resources using the `full`, `add-brokers`, `remove-brokers`, and `remove-disks` modes continue to work unchanged.

* **CRD compatibility**: The `KafkaRebalance` CRD already includes the `mode` and `brokers` fields required for this feature.
  No structural changes to the CRD schema are needed beyond allowing the new enum value and adding the new fields.

* **Behavioral compatibility**: Existing rebalancing workflows are unaffected.
  The new mode is opt-in and requires explicit user action.

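Concretely, the enum change might look like the following fragment of the `KafkaRebalance` CRD schema. This is an illustrative sketch; the exact schema layout in the generated CRD may differ:

```yaml
# Fragment of the KafkaRebalance CRD spec schema (illustrative sketch)
mode:
  type: string
  enum:
    - full
    - add-brokers
    - remove-brokers
    - remove-disks
    - demote-brokers   # new enum value added by this proposal
```
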
## Future improvements

### Add support for disk-level demotion

Since Cruise Control's [`/demote_broker`](https://github.com/linkedin/cruise-control/wiki/REST-APIs#demote-a-list-of-brokers-from-the-kafka-cluster) endpoint includes a parameter for demoting individual disks, `brokerid_and_logdirs`, a logical follow-up feature would be to add support for disk-level demotion.

## Rejected alternatives

### Alternative 1: Make demotion part of `remove-brokers` mode

Instead of adding a separate mode, enhance the `remove-brokers` mode to support a two-phase operation: first demote the brokers to transfer leadership only, and then optionally move the replicas.

This could be controlled via a new field such as `spec.demoteOnly: true`.

**Reasons for rejection:**

* Overloads the semantics of `remove-brokers`, which are intended for replica removal
* Makes the `remove-brokers` mode more complex with conditional behavior
* Reduces clarity for users about what operation is being performed
* Is inconsistent with the design philosophy of having distinct modes for distinct operations
* A separate mode provides better visibility in status, metrics, and logs about which operation is in progress

---

### Review discussion

**Q:** How does this work? I get that it removes the brokers from partition leadership, but how does it ensure they are not eligible to become leaders again?

**A:** Cruise Control marks the demoted broker IDs as ineligible in memory, moves those IDs to the end of the partition replica lists, and then triggers a leader election. Since leader election prioritizes the first broker ID in the replica list (the preferred leader), the demoted brokers are unlikely to be elected as leaders. If a subsequent rebalance or demotion request is made, the broker IDs marked as ineligible in memory are excluded from having partitions moved to them or from being listed first in the replica lists.

**Q:** What happens when the Cruise Control pod is restarted (upgrade, migration to a different node, etc.)? Will the user have to trigger another proposal and approval to refresh the in-memory data?

**A:** At this time, yes: to refresh the in-memory demotion data after a Cruise Control pod restart, a user would have to trigger another `KafkaRebalance` demotion request.