[Discussion] Fault tolerating master node #51

We need to decide whether to tackle this feature and how to tackle it. Some options are:
* Store the master's state on S3 and keep an idle standby master that pings the main master. If the main master goes offline, the standby fetches the state from S3 and takes over as the main master. Downsides: we depend on S3 (does it map well to the state we need to store?), and we'd have to assess S3's latency.
* Only the main master performs scheduling/autoscaling, and everything is replicated to the backup masters. Downside: the backup masters do nothing other than syncing, which is a waste of resources.
* Use distributed masters, where every master can schedule and autoscale and users can submit jobs to any of them. It'll be tricky to have them coordinate the autoscaling process (perhaps only let one master do it?).
* Let clients access the slaves directly; the master is only there to tell the client which slave to use (a bit like Hadoop?).
* Forget about fault tolerance on the master and just focus on the slaves. AFAIK the Hadoop master isn't really fault tolerant.
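
To make the first option more concrete, here is a rough sketch of how the standby could behave. Everything in it is an assumption for illustration, not existing code in this project: `InMemoryStore` stands in for S3, and `StandbyMaster`, `checkpoint`, the `master/state.json` key, and the `max_missed` threshold are hypothetical names. The main master would call `checkpoint` whenever its state changes; the standby calls `tick` on a timer and takes over after a few consecutive missed pings.

```python
import json
from typing import Callable, Dict, Optional

STATE_KEY = "master/state.json"  # hypothetical key, not an existing convention


class InMemoryStore:
    """Stand-in for S3 (assumption): maps a key to a serialized blob."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        self._blobs[key] = blob

    def get(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)


def checkpoint(store: InMemoryStore, state: dict) -> None:
    """Called by the main master to persist its current state."""
    store.put(STATE_KEY, json.dumps(state).encode("utf-8"))


class StandbyMaster:
    """Idle master that pings the main master and takes over if it stops answering."""

    def __init__(self, store: InMemoryStore, ping: Callable[[], bool],
                 max_missed: int = 3) -> None:
        self.store = store
        self.ping = ping              # returns True if the main master answered
        self.max_missed = max_missed  # consecutive misses tolerated before takeover
        self.missed = 0
        self.active = False
        self.state: dict = {}

    def tick(self) -> bool:
        """Run one health check; return True once this node is the active master."""
        if self.active:
            return True
        if self.ping():
            self.missed = 0           # main master is alive, reset the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self._take_over()
        return self.active

    def _take_over(self) -> None:
        # Load the last checkpoint; start empty if nothing was ever stored.
        blob = self.store.get(STATE_KEY)
        self.state = json.loads(blob) if blob is not None else {}
        self.active = True


# Example: the main master never answers, so the standby takes over on the 2nd miss.
store = InMemoryStore()
checkpoint(store, {"queued_jobs": ["job-1"], "slaves": ["s1", "s2"]})
standby = StandbyMaster(store, ping=lambda: False, max_missed=2)
standby.tick()
standby.tick()
```

Note that this sketch does not address split-brain: if the ping fails because of a network partition rather than a crash, both masters could end up active, so a real version would probably need some form of lock or lease on top.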
