[Discussion] Fault tolerating master node #51

@kc1212

We need to decide whether to tackle this feature and how to tackle it. Some options are:

  • Store the master's state on S3 and keep an idle master that pings the main master. If the main master goes offline, the idle master fetches the state from S3 and takes over as the main master. Downsides: we depend on S3 (does it map well to the state we need to store?), and S3's latency would need to be assessed.
  • Only the main master performs scheduling/autoscaling, and everything is replicated to the backup masters. Downside: the backup masters do nothing other than sync, which wastes resources.
  • Use distributed masters where every master can schedule and autoscale, and users can submit jobs to any of them. It'll be tricky to have them coordinate the autoscaling process (perhaps only let one master do it?).
  • Let clients access the slaves directly; the master is there simply to tell the client which slave to use (a bit like Hadoop?).
  • Forget about fault tolerance on the master and just focus on the slaves. AFAIK the Hadoop master isn't really fault tolerant.
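
To make the first option more concrete, here is a rough sketch of the idle-master takeover logic. Everything in it is an assumption for discussion purposes: `StateStore` is an in-memory stand-in for S3 (not a real S3 client), and `StandbyMaster`, `on_heartbeat`, `check`, and the 5-second timeout are hypothetical names and values, not anything in our codebase.

```python
import json
import time


class StateStore:
    """Hypothetical stand-in for an external store such as S3 (in-memory here)."""

    def __init__(self):
        self._blob = None

    def save(self, state):
        # Serialize so the standby receives a copy, not a shared object.
        self._blob = json.dumps(state)

    def load(self):
        return json.loads(self._blob) if self._blob is not None else {}


class StandbyMaster:
    """Idle master that promotes itself when heartbeats from the main master stop."""

    def __init__(self, store, timeout_s, clock=time.monotonic):
        self.store = store
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_heartbeat = clock()
        self.active = False
        self.state = None

    def on_heartbeat(self):
        # Called whenever a ping to the main master succeeds.
        self.last_heartbeat = self.clock()

    def check(self):
        # Promote once the main master has been silent for longer than the timeout.
        if not self.active and self.clock() - self.last_heartbeat > self.timeout_s:
            self.state = self.store.load()
            self.active = True
        return self.active


def demo():
    now = [0.0]  # fake clock so the example runs instantly
    store = StateStore()
    store.save({"queued_jobs": ["job-1", "job-2"], "slaves": 3})
    standby = StandbyMaster(store, timeout_s=5.0, clock=lambda: now[0])

    now[0] = 4.0
    standby.on_heartbeat()
    now[0] = 8.0
    still_idle = not standby.check()  # only 4 s since the last heartbeat
    now[0] = 10.0
    promoted = standby.check()  # 6 s of silence exceeds the timeout
    return still_idle, promoted, standby.state
```

Note that a heartbeat timeout alone cannot tell a crashed main master from a network partition, so a real version would also need some way to stop two masters from being active at once.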
