From 5461d4635feb17dc6496e80f7246dddf35975bcc Mon Sep 17 00:00:00 2001
From: Alexander Laye <alaye@microsoft.com>
Date: Mon, 8 Dec 2025 16:43:15 -0500
Subject: [PATCH 1/3] add baseline doc

---
 docs/designs/hub-controller-design.md | 60 +++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 docs/designs/hub-controller-design.md

diff --git a/docs/designs/hub-controller-design.md b/docs/designs/hub-controller-design.md
new file mode 100644
index 00000000..6ac57bc7
--- /dev/null
+++ b/docs/designs/hub-controller-design.md
@@ -0,0 +1,60 @@
+# Multi-Cluster Hub Controller Design
+
+## Problem
+
+* There is no simple way for the promotion token from a demoted cluster to 
+transfer to the newly promoted cluster
+* There needs to be a central location where Azure DNS can be managed
+
+## Implementation
+
+This will be a separate k8s operator running in the KubeFleet hub.
+It will try to remain as minimal as possible.
+
+### Promotion token management
+
+The Controller will be able to query endpoints on the member clusters
+with the promotion token, and then create a configMap and CRP to
+send that token to the new primary cluster. It will have access to the 
+documentdb crp so it will be able to see which member is primary. 
+
+It will clean up the token and crp when the promotion is complete. 
+It can determine this through another documentdb operator endpoint.
+
+### DNS Management
+
+If requested in the documentdb object, the controller should also
+provision and manage an Azure DNS zone for the documentdb cluster.
+This will create an SRV that points to the primary for seamless
+client-side failover, as well as individual DNS entries for each
+cluster individually.
+
+This will need the following information
+* Azure Resource group 
+* Azure Subscription
+* DNS Zone name (optional, could be generated on the fly)
+* Parent DNS Zone (optional)
+    * Parent DNS Zone RG and Subscription
+
+## Other possible additions
+
+### Streamlined Operator and Cluster deployment
+
+This new conrtoller could theoretically handle the installation and 
+distribution of the cert manager and the operator to save the user from
+having to deploy a large and cumbersome CRP. It could also monitor 
+the DocumentDB CRD and automatically create a CRP for that matching
+the provided clusterReplication field.
+
+## Security considerations
+
+This operator will have no more access than the fleet manager already
+does, and the member cluster operator endpoints will be limited to the
+least amount of information provided possible and only grant access 
+to the fleet controller.
+
+## Alternatives
+
+Currently, we perform this promotion token transfer using a nginx pod
+and a multi-cluster service when using KubeFleet. The DNS zone creation
+and management is handled by the creation and failover scripts.

From 061c4eea09eab6380161f46ac0d84dcb466fc4ee Mon Sep 17 00:00:00 2001
From: Alexander Laye <alaye@microsoft.com>
Date: Tue, 9 Dec 2025 11:31:55 -0500
Subject: [PATCH 2/3] Add unmanaged failover section

Signed-off-by: Alexander Laye <alaye@microsoft.com>
---
 docs/designs/hub-controller-design.md | 60 +++++++++++++++++++++------
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/docs/designs/hub-controller-design.md b/docs/designs/hub-controller-design.md
index 6ac57bc7..b74ff4c3 100644
--- a/docs/designs/hub-controller-design.md
+++ b/docs/designs/hub-controller-design.md
@@ -2,9 +2,10 @@
 
 ## Problem
 
-* There is no simple way for the promotion token from a demoted cluster to 
+* There is no simple way for the promotion token from a demoted cluster to
 transfer to the newly promoted cluster
 * There needs to be a central location where Azure DNS can be managed
+* We need some way to initiate failover without manual intervention
 
 ## Implementation
 
@@ -15,42 +16,73 @@ It will try to remain as minimal as possible.
 
 The Controller will be able to query endpoints on the member clusters
 with the promotion token, and then create a configMap and CRP to
-send that token to the new primary cluster. It will have access to the 
-documentdb crp so it will be able to see which member is primary. 
+send that token to the new primary cluster. It will have access to the
+documentdb crp so it will be able to see which member is primary.
 
-It will clean up the token and crp when the promotion is complete. 
+It will clean up the token and crp when the promotion is complete.
 It can determine this through another documentdb operator endpoint.
 
 ### DNS Management
 
 If requested in the documentdb object, the controller should also
 provision and manage an Azure DNS zone for the documentdb cluster.
-This will create an SRV that points to the primary for seamless
-client-side failover, as well as individual DNS entries for each
-cluster individually.
+This will create an SRV DNS entry that points to the primary for
+seamless client-side failover, as well as individual DNS entries
+for each cluster.
 
 This will need the following information
-* Azure Resource group 
+
+* Azure Resource group
 * Azure Subscription
 * DNS Zone name (optional, could be generated on the fly)
+* Azure credentials
 * Parent DNS Zone (optional)
-    * Parent DNS Zone RG and Subscription
+  * Parent DNS Zone RG and Subscription
+
+### Automatic failover
+
+The operator will have a health check endpoint that the controller can
+periodically query to determine liveness for failover. There will be a
+setting for how long a primary cluster will be marked down before failover is
+initiated.
+
+The health check endpoint should provide the controller with a LSN for the
+database so that it can have an up to date list
+
+When that time limit is hit, the operator should use the LSNs that it knows
+to pick a promotion candidate and alter the DocumentDB object so the operators
+know to run the promotion process.
 
 ## Other possible additions
 
 ### Streamlined Operator and Cluster deployment
 
-This new conrtoller could theoretically handle the installation and 
+This new controller could theoretically handle the installation and
 distribution of the cert manager and the operator to save the user from
-having to deploy a large and cumbersome CRP. It could also monitor 
+having to deploy a large and cumbersome CRP. It could also monitor
 the DocumentDB CRD and automatically create a CRP for that matching
 the provided clusterReplication field.
 
+### Pluggable DNS management
+
+The DNS management could be abstracted to allow for other clouds'
+DNS management systems. The current implementation will create an
+API that will be extensible.
+
+## Updates
+
+Updates of the operators will be coordinated through KubeFleet's
+ClusterStagedUpdateStrategy. This will allow the operators to safely
+update with optional rollbacks. The controller itself should be able to
+be updated independently of the operators. Steps will be taken to ensure
+backwards compatibility through the use of things like feature flags and
+deprecating but maintaining old APIs.
+
 ## Security considerations
 
 This operator will have no more access than the fleet manager already
 does, and the member cluster operator endpoints will be limited to the
-least amount of information provided possible and only grant access 
+least amount of information provided possible and only grant access
 to the fleet controller.
 
 ## Alternatives
@@ -58,3 +90,7 @@ to the fleet controller.
 Currently, we perform this promotion token transfer using a nginx pod
 and a multi-cluster service when using KubeFleet. The DNS zone creation
 and management is handled by the creation and failover scripts.
+
+## References
+
+* [KubeFleet Staged Update](https://kubefleet.dev/docs/how-tos/staged-update/)

From e9e439fca126f55db92596f6b13852f538bba109 Mon Sep 17 00:00:00 2001
From: Alexander Laye <alaye@microsoft.com>
Date: Mon, 15 Dec 2025 11:00:02 -0500
Subject: [PATCH 3/3] remove auto-failover language

---
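Note for reviewers: the candidate selection described in the new Regional
Failover section can be sketched as below. This is only an illustration under
stated assumptions, not the controller's implementation. The names `Replica`
and `pick_new_primary` are hypothetical, LSNs are assumed to be already parsed
into comparable integers, and the tie-break rule is an assumption the design
does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of one DocumentDB instance on a member cluster."""
    member_cluster: str  # name of the member cluster hosting the instance
    lsn: int             # last known LSN, assumed pre-parsed to an integer


def pick_new_primary(replicas: Iterable[Replica],
                     not_primary_ready: Set[str]) -> Optional[str]:
    """Return the member cluster that should become primary, or None.

    Instances on member clusters marked as not primary-ready are skipped.
    Among the rest, the highest LSN wins; ties are broken by member cluster
    name (largest name wins) so the result is deterministic.
    """
    candidates = [r for r in replicas
                  if r.member_cluster not in not_primary_ready]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: (r.lsn, r.member_cluster))
    return best.member_cluster
```

The controller would run this once per DocumentDB instance and write the
result into that instance's spec, leaving the promotion itself to the
operators.
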
 docs/designs/hub-controller-design.md | 33 ++++++++++++---------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/docs/designs/hub-controller-design.md b/docs/designs/hub-controller-design.md
index b74ff4c3..02fdc252 100644
--- a/docs/designs/hub-controller-design.md
+++ b/docs/designs/hub-controller-design.md
@@ -5,7 +5,7 @@
 * There is no simple way for the promotion token from a demoted cluster to
 transfer to the newly promoted cluster
 * There needs to be a central location where Azure DNS can be managed
-* We need some way to initiate failover without manual intervention
+* We need some way to manage the failover of many DocumentDB instances at once
 
 ## Implementation
 
@@ -14,13 +14,13 @@ It will try to remain as minimal as possible.
 
 ### Promotion token management
 
-The Controller will be able to query endpoints on the member clusters
-with the promotion token, and then create a configMap and CRP to
-send that token to the new primary cluster. It will have access to the
-documentdb crp so it will be able to see which member is primary.
+The Controller will be able to query the Kube API on the member clusters to
+get the promotion token from the Cluster CRD. Then it will create a ConfigMap
+and CRP to send that token to the new primary cluster. It will use the
+DocumentDB CRD to determine which member is primary.
 
 It will clean up the token and crp when the promotion is complete.
-It can determine this through another documentdb operator endpoint.
+It can determine this through the Cluster CRD status.
 
 ### DNS Management
 
@@ -39,19 +39,16 @@ This will need the following information
 * Parent DNS Zone (optional)
   * Parent DNS Zone RG and Subscription
 
-### Automatic failover
+### Regional Failover
 
-The operator will have a health check endpoint that the controller can
-periodically query to determine liveness for failover. There will be a
-setting for how long a primary cluster will be marked down before failover is
-initiated.
-
-The health check endpoint should provide the controller with a LSN for the
-database so that it can have an up to date list
-
-When that time limit is hit, the operator should use the LSNs that it knows
-to pick a promotion candidate and alter the DocumentDB object so the operators
-know to run the promotion process.
+The user should be able to initiate a regional failover, wherein all clusters in
+a region change their primary. The controller should know the LSNs on each
+instance and, for each cluster, pick the instance with the highest LSN to become
+the new primary. To initiate this failover, the user should create a custom
+resource that marks a particular member cluster as not primary-ready. The
+controller will watch this resource and use that information to update each
+DocumentDB instance. The CRP will automatically push those changes, and the
+Operators will perform the actual promotions and demotions.
 
 ## Other possible additions