Doc: Add design for external controller. #187

alaye-ms · 2025-12-08T21:43:52Z

No description provided.

xgerman · 2025-12-08T22:22:36Z

docs/designs/hub-controller-design.md

+* There is no simple way for the promotion token from a demoted cluster to 
+transfer to the newly promoted cluster
+* There needs to be a central location where Azure DNS can be managed
+


We also want a central controller to manage failovers, both planned and unplanned. Additionally, we need to control backups centrally so that backup schedules are moved to all sites and we still have backups after a site swap.

Also, this controller needs to be covered in our update design. Please research how it can play a role during updates; it should probably coordinate multi-region/cloud updates of operators and individual DocumentDB clusters.

I can add info about a health check. For Updates I think we should use fleet itself to coordinate multi-cloud updates, but I'll add that to the design here.

While the controller can probably help to manage operator updates, I think it would be better to use fleet's staged update process instead of creating our own, and then let the operators themselves manage the updates of the individual documentdb clusters

Assume we have only one replica in each region: during an update we will need to fail over to another region and then update that region. We can of course defer this case to future work and assume the primary region is HA enabled during the update, so the local operator will handle the failovers automatically.

xgerman · 2025-12-08T22:24:17Z

docs/designs/hub-controller-design.md

+It will try to remain as minimal as possible.
+
+### Promotion token management
+


For promotion, it should determine which remote cluster has the highest LSN and pick that one to minimize data transfer time and downtime. Potentially the user can override this and force a specific region, but that must not be the default behavior.
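A minimal sketch of that selection rule, not the actual controller code. It assumes each replica reports its LSN in PostgreSQL's textual `hi/lo` hexadecimal form; the names `Replica`, `parse_lsn`, `pick_promotion_target`, and the region labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    region: str  # hypothetical region label
    lsn: str     # assumed PostgreSQL-style LSN text, e.g. "0/3000060"


def parse_lsn(text: str) -> int:
    """Convert 'hi/lo' hexadecimal LSN text into a comparable 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_promotion_target(
    replicas: Sequence[Replica], forced_region: Optional[str] = None
) -> Replica:
    """Return the replica to promote.

    Default: the replica with the highest LSN (least data left to transfer).
    Override: an explicitly requested region, if it exists among the replicas.
    """
    if not replicas:
        raise ValueError("no replicas available for promotion")
    if forced_region is not None:
        for replica in replicas:
            if replica.region == forced_region:
                return replica
        raise ValueError(f"forced region {forced_region!r} has no replica")
    return max(replicas, key=lambda r: parse_lsn(r.lsn))
```

Comparing the parsed integers rather than the raw strings matters, because `"0/A0"` sorts after `"0/100"` as text even though it is the smaller position.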

xgerman · 2025-12-08T22:26:08Z

docs/designs/hub-controller-design.md

+cluster individually.
+
+This will need the following information
+* Azure Resource group 


Can that be pluggable? What if someone adds non-Azure DNS handling? Be more general.

I don't really see a way this could be pluggable. We'll need to specifically use the Azure API to create these resources and there's no general API for DNS creation as far as I'm aware

E.g., add a field DNSStrategy=Azure, and if someone wants others they can propose them.
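One way the `DNSStrategy` idea could look, as a rough sketch only. Only the `DNSStrategy=Azure` value comes from this thread; the `DNSProvider` interface, the `AzureDNSProvider` class, the `provider_for` helper, and the returned string are hypothetical, and the Azure call itself is left as a placeholder rather than a real SDK invocation.

```python
from typing import Callable, Dict, Protocol


class DNSProvider(Protocol):
    """Hypothetical interface that every DNS strategy would implement."""

    def upsert_record(self, name: str, target: str) -> str: ...


class AzureDNSProvider:
    """Placeholder for the Azure-specific strategy (no real API call here)."""

    def __init__(self, resource_group: str) -> None:
        self.resource_group = resource_group

    def upsert_record(self, name: str, target: str) -> str:
        # A real implementation would call the Azure DNS API at this point.
        return f"azure:{self.resource_group}:{name}->{target}"


# Registry keyed by the DNSStrategy field value; new strategies register here.
_PROVIDERS: Dict[str, Callable[..., DNSProvider]] = {"Azure": AzureDNSProvider}


def provider_for(strategy: str, **settings: str) -> DNSProvider:
    """Look up the provider for a DNSStrategy value, rejecting unknown ones."""
    try:
        factory = _PROVIDERS[strategy]
    except KeyError:
        raise ValueError(f"unsupported DNSStrategy {strategy!r}") from None
    return factory(**settings)
```

With this shape the design can ship with Azure as the only strategy, and a later proposal adds a provider by registering one more entry instead of changing the controller logic.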

Signed-off-by: Alexander Laye <alaye@microsoft.com>

xgerman · 2025-12-09T16:50:54Z

docs/designs/hub-controller-design.md

+
+### Automatic failover
+
+The operator will have a health check endpoint that the controller can


This needs to be per DocumentDB cluster and must not go through the operator.

We don't want to run a site swap if the operator is down but the rest is fine.

There can be instances where only some clusters are down while others are still fine (partial outage).

We need to throttle/prioritize the site swaps; e.g., they can't all happen at once, which might overwhelm the new primaries.
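A small sketch of the throttling and prioritization point, under assumptions: a fixed cap on concurrent site swaps and an integer priority where lower means more urgent. `SwapRequest`, `SwapScheduler`, and the cluster names are hypothetical and not part of the design doc.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(order=True)
class SwapRequest:
    priority: int                      # lower value = more urgent (assumption)
    cluster: str = field(compare=False)


class SwapScheduler:
    """Starts pending site swaps in priority order, capped at max_in_flight."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._pending: List[SwapRequest] = []
        self._in_flight: Set[str] = set()

    def request(self, cluster: str, priority: int) -> None:
        """Queue a site swap for one DocumentDB cluster."""
        heapq.heappush(self._pending, SwapRequest(priority, cluster))

    def start_next(self) -> List[str]:
        """Start as many queued swaps as the cap allows; return their names."""
        started = []
        while self._pending and len(self._in_flight) < self.max_in_flight:
            req = heapq.heappop(self._pending)
            self._in_flight.add(req.cluster)
            started.append(req.cluster)
        return started

    def finish(self, cluster: str) -> None:
        """Mark a swap as complete, freeing a slot for the next request."""
        self._in_flight.discard(cluster)
```

Because requests are tracked per cluster, a partial outage only queues the affected clusters, and the cap keeps the new primaries from receiving every swap at the same time.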

Updated to remove auto-failover language.

xgerman · 2025-12-09T16:57:15Z

docs/designs/hub-controller-design.md

+
+## Updates
+
+Updates of the operators will be coordinated through KubeFleet's


We can do operator updates through the fleet functionality. I'm on the fence about whether we need to put that in the operator so a user can run kubectl documentdb update operators.

In any case, we need to control the update of individual multi-region/cloud DocumentDB clusters through the controller, so users can do that with just one command and it will roll out accordingly (optionally playing into fleet's deployment mechanism).

I think that documentdb updates at that precise a level should probably be handled at the operator level, not by a multi-cloud controller.

add baseline doc

5461d46

alaye-ms requested review from hossain-rayhan and xgerman as code owners December 8, 2025 21:43

xgerman requested changes Dec 8, 2025

Add unmanaged failover section

061c4ee

Signed-off-by: Alexander Laye <alaye@microsoft.com>

xgerman reviewed Dec 9, 2025

remove auto-failover language

e9e439f

alaye-ms linked an issue Dec 15, 2025 that may be closed by this pull request

Design a Hub controller to manage failovers and Member cluster operator status #131

Open
