Machine Learning Design #14

@vsoch

These are old notes from a few weeks ago, about how to integrate ML here.

We want an algorithm that maximizes utilization, which means having nodes ready to go only when the jobs that need them are ready to run. With our current approach, we just take the next job in the queue, whatever it is, and scale to that. In practice this means we are too late (the job is ready but the nodes are not), and the job waits for the scale-up. We would want the scale request to go in exactly N seconds before the job is ready, OR to decide not to scale because it's better to wait for running jobs to finish (if they are finishing soon).
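A minimal sketch of the timing idea, assuming the lead time is roughly how long new nodes take to come up. The function name and parameters are hypothetical and not part of this project:

```python
def scale_request_time(job_ready_at: float, provision_seconds: float, now: float) -> float:
    """Return when to send the scale-up request so nodes arrive as the job becomes ready.

    All values are in seconds on the same clock. `provision_seconds` is an
    assumed estimate of how long new nodes take to join the cluster.
    """
    ideal = job_ready_at - provision_seconds
    # If the ideal moment has already passed, the best we can do is ask now.
    return max(now, ideal)


# Job is ready at t=100 and nodes take 30s to come up: ask at t=70.
print(scale_request_time(job_ready_at=100, provision_seconds=30, now=0))
```

If the returned time equals `now`, we are already in the "too late" case described above, and the job will wait for part of the scale-up.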

What we'd want is an algorithm that can predict when running jobs will finish, and whether it's cheaper to wait for them to finish (and use their resources) or to scale up then and there. This is actually just like what we started to think about with Rajib.

  • Start out submitting a bunch of jobs at random.
  • Start building a model for each ensemble type, and each size within that.
  • When the model has been trained on some number of jobs, stop submitting at random.
  • When we stop submitting at random, set job urgencies to 0 so nothing submits.
  • Then calculate the time/cost for each size and ensemble type in the queue under two conditions:
    • if we wait for nodes to be ready
    • if we ask for them right now and then add nodes to the cluster

Then choose the ensemble member / size and the condition above (wait or scale) that minimizes the cost.
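The selection step could be sketched roughly as below. This is an illustration under stated assumptions, not a design decision: the "model" is just the mean of observed runtimes per (ensemble type, size), cost is measured in seconds until completion, and `node_penalty` is a made-up weight that charges for the extra nodes when we scale. All names here (`Candidate`, `predict_runtime`, `choose`, `free_in_seconds`) are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Candidate:
    ensemble: str  # hypothetical ensemble type name
    size: int      # number of nodes the job needs


def predict_runtime(history, candidate):
    """Toy model: mean of runtimes seen so far for this ensemble type and size."""
    return mean(history[(candidate.ensemble, candidate.size)])


def choose(queue, history, free_in_seconds, provision_seconds, node_penalty=0.0):
    """Return (cost, candidate, action) with the lowest estimated cost.

    free_in_seconds maps a size to the predicted seconds until running jobs
    free that many nodes (assumed to come from the same kind of model).
    """
    options = []
    for c in queue:
        runtime = predict_runtime(history, c)
        # Condition 1: wait for nodes to be ready, then run.
        wait_cost = free_in_seconds[c.size] + runtime
        # Condition 2: ask for nodes right now, wait for them to join, then run.
        scale_cost = provision_seconds + runtime + node_penalty * c.size * runtime
        options.append((wait_cost, c, "wait"))
        options.append((scale_cost, c, "scale"))
    # Pick the cheapest (candidate, action) pair across the whole queue.
    return min(options, key=lambda option: option[0])


history = {("ensemble-a", 2): [100, 120], ("ensemble-a", 4): [60, 60]}
queue = [Candidate("ensemble-a", 2), Candidate("ensemble-a", 4)]
cost, candidate, action = choose(queue, history, {2: 5, 4: 300}, provision_seconds=60)
print(cost, candidate, action)
```

Here the 2-node job is picked with the "wait" action, because nodes free up in 5 seconds. If `free_in_seconds` were much larger, the "scale" option would win instead. A real version would need a better runtime model and an agreed definition of cost.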

Ping @milroy since we recently chatted about the above. I wrote this before our discussion yesterday, anticipating it could be interesting to work on / think about. Please disregard if not interested / don't have time (I understand).
