Add SSHLauncher for multi-node execution #690
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
…o sfc-gh-truwase/ssh_launcher
Hi @sfc-gh-truwase! Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly.
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
@daniellepintz @allenwang28 I would appreciate your feedback.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
allenwang28
left a comment
Thanks for opening this @sfc-gh-truwase!
Can you share more information about your preferred cluster setup?
I see that we currently read the hosts from a text file of host names. I'm not sure whether you create that file yourself or your scheduler creates it for you automatically. Some schedulers, such as SLURM, may give you SLURM_JOB_NODELIST or something similar.
@allenwang28 yes, in our cluster setup the scheduler automatically creates the text file of allocated host names. Do you have suggestions for generalization? Regarding SLURM, I assume that the existing SlurmLauncher would suffice. Does that align with your thinking?
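One way the generalization question could be approached is sketched below, purely as an illustration and not as anything in this PR: prefer a scheduler-written hostfile when one exists, and otherwise fall back to SLURM_JOB_NODELIST expanded with `scontrol show hostnames`. The function name `discover_hosts` and its parameters are hypothetical, and the sketch assumes the hostname is the first whitespace-separated token on each hostfile line.

```python
import os
import subprocess
from pathlib import Path


def discover_hosts(hostfile=None, env=None):
    """Return a list of hostnames (hypothetical helper, not part of Forge).

    Order of preference (an assumption for this sketch):
      1. a hostfile with one host per line, hostname as the first token
      2. SLURM_JOB_NODELIST, expanded via `scontrol show hostnames`
    """
    env = os.environ if env is None else env

    if hostfile and Path(hostfile).is_file():
        hosts = []
        for line in Path(hostfile).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                hosts.append(line.split()[0])
        return hosts

    nodelist = env.get("SLURM_JOB_NODELIST")
    if nodelist:
        # scontrol prints one expanded hostname per line
        result = subprocess.run(
            ["scontrol", "show", "hostnames", nodelist],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout.split()

    raise RuntimeError("no hostfile found and SLURM_JOB_NODELIST is not set")
```

Passing `env` explicitly keeps the function easy to exercise without a real scheduler environment.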
ok, understood. For context, Monarch has recently created a "Job" API which defines how Monarch acquires HostMeshes from underlying schedulers. See SlurmJob for instance. There's also an implementation for SSHJob. I'm planning on migrating Forge over to the Jobs API, and away from the launcher-specific APIs that we have today. I'm wondering if this would work for your cluster? We could probably test a much simpler workload than Forge for this.
@allenwang28 yes, we used SSHJob in our PR. Can you advise if the usage aligns with your plans?
Hi @sfc-gh-truwase, thanks for your work on this launcher! As Allen mentioned, we are working on a refactor of all of our launchers at once, which involves substantial changes to provisioner.py, SlurmLauncher, etc. Would you mind waiting until that refactor is done, and then rebasing your SSHLauncher changes on top? Would greatly appreciate it! The refactor should be done later this week.
Hi @sfc-gh-truwase, thanks for your patience! I have merged #700. Would you like to take a look and see if the API works for you?
@daniellepintz @allenwang28 I have aligned the PR with the new API. I would appreciate your review. Thanks!
Add SSHLauncher for multi-node execution to complement the existing Slurm and MAST launchers. SSHLauncher is applicable when the nodes are already preallocated and accessible via passwordless SSH. This is related to earlier discussions in #587 and #611.
SSHLauncher can be enabled by specifying `ssh` as the launcher in the `provisioner` section of the YAML config.
`/job/hostfile` is assumed to contain a list of hostnames (or SSH aliases) of machines accessible via passwordless SSH, along with slot counts, which specify the number of GPUs available on each machine.
This PR currently includes an option to colocate `trainer` and `ref_model`.
Test plan
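To make the hostfile description above concrete, here is a minimal sketch of how such a file could be read. The exact line format is not shown in this PR description, so the `<hostname> slots=<num_gpus>` layout, the `#` comment handling, the default of one slot, and the function name `parse_hostfile` are all assumptions made for illustration; the actual SSHLauncher may parse the file differently.

```python
from pathlib import Path


def parse_hostfile(path):
    """Return {hostname: slot_count} for a hostfile (hypothetical helper).

    Assumed line format (not confirmed by the PR): "<hostname> slots=<num_gpus>".
    Text after '#' and blank lines are ignored.
    """
    hosts = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        name, *fields = line.split()
        slots = 1  # assumed default when no slot count is given
        for field in fields:
            key, _, value = field.partition("=")
            if key == "slots":
                slots = int(value)
        hosts[name] = slots
    return hosts
```

Under these assumptions, a file with the lines `node1 slots=8` and `node2 slots=8` would describe two hosts with eight GPUs each.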