This bug tracks the items needed for cluster-mgr's HA. It's a big list and will be addressed by one or more future PRs, preferably each introducing a small feature.
Current behavior:
- state management:
- clustermgr (and its subsystems) keeps some state, like global extra-vars and the per-node host-group (once a node is commissioned)
- this state is lost on a process restart
- clustermgr is able to rebuild certain state, like a node's current inventory state (from the collins db) and its current monitoring state (from serf's client interface)
- if collins or serf also dies, then this state can be lost
- clustermgr forgets some state, like the extra-vars specified when a node is commissioned, which are expected to be provided every time a node is commissioned
- dependency on external processes:
- collins is used as the inventory database and for node lifecycle management (primarily state transitions and logging events).
- while collins offers a rich feature set around node management itself (like power cycling etc.), we are not using these features
- serf is used as the node health-monitoring service, which gives a single point for node monitoring.
- number of clustermgr instances that can run:
- clustermgr is able to run behind a VIP, allowing multiple instances to run at a time with only one instance serving requests.
- cluster-mgr has a minimal config that needs to be provided at process start. It is expected that all instances are started with the same config.
- collins is run as a container (with its own local mysql db), which prevents node state from being available everywhere atm, so all clustermgr instances need to access the same collins instance
- if we can extract mysql from the collins container and run it on a distributed filesystem, it might be possible to make the info available everywhere
- serf runs on every node, so node monitoring state is available everywhere. However, care needs to be taken to ensure that only one cluster-mgr instance acts on it. Right now clustermgr doesn't do much on monitoring state changes.
Desired behavior and considerations for HA (++: high priority, +: low priority)
- state availability:
- (++) the state should be available wherever a cluster-mgr instance is running
- state restore on process restarts
- (++) if there are multiple instances, a new (or restarted) instance should be able to restore state from other instances.
- (+) if there is just a single instance, a restarted instance should be able to restore state from the local host.
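For the single-instance case, restoring state from the local host amounts to snapshotting the in-memory state on every change and reloading it at startup. A minimal sketch, assuming the state is roughly "global extra-vars plus a per-node host-group" as described above (the `State` type, field names, and file path here are hypothetical):

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

// State is a hypothetical snapshot of what clustermgr keeps in memory:
// global extra-vars and the host-group of each commissioned node.
type State struct {
	GlobalExtraVars map[string]string
	NodeHostGroup   map[string]string
}

// Save writes the snapshot to disk so a restarted instance can pick it up.
func (s *State) Save(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(s)
}

// Load restores a snapshot written by Save.
func Load(path string) (*State, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var s State
	if err := gob.NewDecoder(f).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	s := &State{
		GlobalExtraVars: map[string]string{"env": "prod"},
		NodeHostGroup:   map[string]string{"node1": "service-master"},
	}
	if err := s.Save("/tmp/clustermgr-state.gob"); err != nil {
		panic(err)
	}
	restored, err := Load("/tmp/clustermgr-state.gob")
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.NodeHostGroup["node1"]) // prints: service-master
}
```

Restoring from other instances (the ++ case) would need a replication mechanism on top of this, which the approaches below explore.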
Possible approaches (this space will change for a bit as I explore and document different approaches):
- use a distributed memcache (like golang/groupcache) that each clustermgr instance can use to keep and share its state.
- pros:
- it's provided as a client/server lib with no need for a separate server
- cons:
- it's not clear how peer additions/deletions are handled. Need to study the lib more.
- it is not clear how the cache is updated (or flushed). Need to study the lib more.
- persistence of state in a single-instance environment will be tricky
- start with an embedded db (like boltdb)
- pros:
- it's provided as an embedded library with no need for a separate server
- persistence of state is built-in
- cons:
- distribution of state will need to be done separately