cluster-mgr HA considerations #53

@mapuri

This bug tracks the items needed for cluster-mgr's HA. It's a big list and will be addressed by one or more future PRs, preferably each introducing a small feature.

Current behavior:

  • state management:
    • clustermgr (and its subsystems) keeps some state, like global extra-vars and the per-node host-group (once a node is commissioned)
      • this state is lost on a process restart
    • clustermgr is able to rebuild certain state, like a node's current inventory state (from the collins db) and a node's current monitoring state (from serf's client interface)
      • if collins or serf also dies then this state can be lost
    • clustermgr forgets some state, like the extra-vars specified when a node is commissioned, which are expected to be provided every time a node is commissioned
  • dependency on external processes:
    • collins is used as the inventory database and for node lifecycle management (primarily state transitions and event logging).
    • while collins offers a rich feature set around node management itself (like power cycling etc.), we are not using these features
    • serf is used as the node health monitoring service, which helps provide a single point of node management.
  • number of clustermgr instances that can run:
    • clustermgr is able to run behind a VIP, which allows multiple instances to run at the same time with only one instance serving requests.
      • cluster-mgr has a minimal config that needs to be provided at process start. It is expected that all instances are started with a similar config.
    • collins runs as a container (with its own local mysql db), which currently prevents node state from being available everywhere, so all clustermgr instances need to access the same collins instance
      • if we can extract the mysql db from the collins container and run it on a distributed filesystem, it might be possible to make the info available everywhere
    • serf runs on every node, so node monitoring state is available everywhere. However, care needs to be taken to ensure that only one cluster-mgr instance acts on it. Right now clustermgr doesn't do much on monitoring state changes.
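
To make the state in question concrete, here is a minimal sketch in Go of the in-memory state listed above (global extra-vars and per-node host-group) together with a snapshot/restore round trip. The `State` type, its field names and types, and the `group-a` value are all assumptions made for illustration; they are not clustermgr's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is a hypothetical container for the in-memory state described above.
// Field names and types are assumptions, not clustermgr's actual definitions.
type State struct {
	GlobalExtraVars map[string]string `json:"global_extra_vars"`
	HostGroups      map[string]string `json:"host_groups"` // node name -> host-group
}

// NewState returns an empty state, which is all a restarted process has
// if nothing was saved anywhere.
func NewState() *State {
	return &State{
		GlobalExtraVars: map[string]string{},
		HostGroups:      map[string]string{},
	}
}

// Commission records the host-group assigned to a node.
func (s *State) Commission(node, group string) {
	s.HostGroups[node] = group
}

// Snapshot serializes the state so it can be stored or sent elsewhere.
func (s *State) Snapshot() ([]byte, error) {
	return json.Marshal(s)
}

// Restore rebuilds a state from a snapshot produced by Snapshot.
func Restore(data []byte) (*State, error) {
	s := NewState()
	if err := json.Unmarshal(data, s); err != nil {
		return nil, err
	}
	return s, nil
}

func main() {
	s := NewState()
	s.GlobalExtraVars["env"] = "test"
	s.Commission("node1", "group-a")

	data, err := s.Snapshot()
	if err != nil {
		panic(err)
	}

	// Simulate a restart: the new process starts empty and loads the snapshot.
	restored, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.HostGroups["node1"], restored.GlobalExtraVars["env"])
}
```

Where the snapshot lives (a peer instance, a local file, a shared store) is exactly the open question this issue tracks; the sketch only shows what would need to survive a restart.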

Desired behavior and considerations for HA (++: high priority, +: low priority)

  • state availability:
    • (++) the state should be available wherever a cluster-mgr instance is running
  • state restore on process restarts
    • (++) if there are multiple instances, a new (or restarted) instance should be able to restore state from other instances.
    • (+) if there is just a single instance, a restarted instance should be able to restore state from the local host.

Possible approaches (this space will be changing for a bit as I explore and document different approaches):

  • use a distributed memcache (like golang/groupcache) that each clustermgr instance can use to keep and share its state.
    • pros:
      • it's provided as a client/server lib with no need for a separate server
    • cons:
      • it's not clear how peer additions/deletions are handled. Need to study the lib more.
      • it is not clear how the cache is updated (or flushed). Need to study the lib more.
      • persistence of state in a single-instance environment will be tricky
  • start with an embedded db (like boltdb)
    • pros:
      • it's provided as an embedded lib with no need for a separate server
      • persistence of state is built-in
    • cons:
      • distribution of state will need to be done separately
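
Since the choice of backend is still open, one option is to keep clustermgr's state access behind a small interface so either approach could be plugged in later. The sketch below is an assumption about how that could look: `Store`, `memStore`, and `NewMemStore` are hypothetical names, the in-memory map is only a placeholder, and nothing here uses the groupcache or boltdb APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical interface for keeping clustermgr state.
// A cache-backed or embedded-db-backed implementation could satisfy it later.
type Store interface {
	Put(key string, value []byte) error
	Get(key string) (value []byte, found bool, err error)
}

// memStore is a placeholder implementation backed by a map. It offers
// neither persistence nor distribution; it only exercises the interface.
type memStore struct {
	mu sync.RWMutex
	m  map[string][]byte
}

// NewMemStore returns an empty in-memory store.
func NewMemStore() *memStore {
	return &memStore{m: map[string][]byte{}}
}

// Put stores a copy of value under key.
func (s *memStore) Put(key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = append([]byte(nil), value...)
	return nil
}

// Get returns a copy of the value stored under key, if any.
func (s *memStore) Get(key string) ([]byte, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	if !ok {
		return nil, false, nil
	}
	return append([]byte(nil), v...), true, nil
}

func main() {
	var st Store = NewMemStore()
	if err := st.Put("host-group/node1", []byte("group-a")); err != nil {
		panic(err)
	}
	v, found, err := st.Get("host-group/node1")
	if err != nil {
		panic(err)
	}
	fmt.Println(found, string(v))
}
```

With this shape, the pros and cons listed above become properties of a particular `Store` implementation rather than of clustermgr itself.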
