cluster-mgr HA considerations

This bug tracks the items needed for cluster-mgr's HA. It's a big list and will be addressed by one or more future PRs, preferably each introducing a small feature.

**Current behavior**:
- **state management**: 
  - clustermgr (and its subsystems) keeps some state like **global extra-vars**, per node **host-group** (once a node is commissioned)
    - this state is lost on a process restart
  - clustermgr is able to rebuild certain state like node's current **inventory state** (from collins db); node's current **monitoring state** (from serf's client interface)
    - if collins or serf also dies then this state can be lost
  - clustermgr forgets some state, like the extra-vars specified when a node is commissioned, which is expected to be provided every time a node is commissioned
- **dependency on external processes**:
  - **collins** is used as inventory database and for node lifecycle management (primary state transition and logging events). 
  - while collins offers a rich feature set around node management itself (like power cycle etc), we are not using these features
  - **serf** is used as node health monitoring service which helps with single point of node management.
- **number of clustermgr instances that can run**:
  - clustermgr is able to run behind a VIP, allowing multiple instances to run at a time with only one instance serving requests.
    - cluster-mgr has a minimal config that needs to be provided at process start. It is expected that all instances are started with similar config.
  - collins is run as a container (with its own local mysql db), which **prevents node state from being available everywhere** at the moment, so all clustermgr instances need to access the same collins instance
    - if we can extract the mysql from collins container to be run on a distributed filesystem, it might be possible to make the info available everywhere
  - serf runs on every node, so node monitoring state is available everywhere. However, care needs to be taken **to ensure that only one cluster-mgr instance acts on it**. Right now clustermgr doesn't do much on monitoring state changes.
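
To make the state-management items above concrete, here is a minimal sketch of the in-memory state that is lost on a restart, written as a serializable snapshot. The `State` type, its field names, and the JSON layout are assumptions made for illustration; they are not clustermgr's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is a hypothetical snapshot of what clustermgr keeps only in memory.
type State struct {
	// GlobalExtraVars models the global extra-vars (assumed to be simple key/value pairs).
	GlobalExtraVars map[string]string `json:"global_extra_vars"`
	// HostGroups maps a node name to the host-group it got when commissioned.
	HostGroups map[string]string `json:"host_groups"`
}

// Encode serializes the snapshot so it can be written to disk or sent to a peer.
func (s State) Encode() ([]byte, error) {
	return json.Marshal(s)
}

// DecodeState rebuilds a snapshot from bytes produced by Encode.
func DecodeState(data []byte) (State, error) {
	var s State
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	before := State{
		GlobalExtraVars: map[string]string{"env": "test"},
		HostGroups:      map[string]string{"node1": "service-master"},
	}
	data, err := before.Encode()
	if err != nil {
		panic(err)
	}
	// Simulate a process restart: only the encoded bytes survive.
	after, err := DecodeState(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(after.HostGroups["node1"])
}
```

Inventory and monitoring state are left out of the snapshot on purpose, since they can already be rebuilt from collins and serf.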

**Desired behavior and considerations for HA** (**++**: high priority, **+**: low priority)
- state availability:
  - (**++**) the state should be available wherever a cluster-mgr instance is running
- state restore on process restarts
  - (**++**) if there are multiple instances, a new (or restarted) instance should be able to restore state from other instances.
  - (**+**) if there is just a single instance, a restarted instance should be able to restore state from the local host.

**Possible approaches** (this space will be changing for a bit as I explore and document different approaches):
- use a distributed memcache (like golang/groupcache) that each clustermgr instance can use to keep and share its state.
  - pros:
    - it's provided as a client/server lib with no need for a separate server
  - cons:
    - it's not clear how peer additions/deletions are handled. Need to study the lib more.
    - it is not clear how the cache is updated (or flushed). Need to study the lib more.
    - persistence of state in single instance environment will be tricky
- start with an embedded, file-backed db (like boltdb)
  - pros:
    - it's provided as an embedded lib with no need for a separate server
    - persistence of state is built-in
  - cons:
    - distribution of state will need to be done separately

