Fallback to python task if worker is zero for pytorch #1629
Conversation
Signed-off-by: byhsu <byhsu@linkedin.com>
Two questions:

- I think we might need to handle `workers=0` separately in `get_custom`, since in this case we don't want to create a `PytorchJob`. (See how this is done for the elastic task below.)
- Currently, `task_config=Pytorch(workers=0)` is equivalent to no `task_config` at all. However, `torch.distributed.init_process_group()` will not work without the env vars set by the operator. We could solve this by overriding the `execute` method and simply setting the env vars `WORLD_SIZE=1`, `RANK=0`, and potentially the master address (would have to try whether it is required); see the sketch after this list.
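A rough sketch of what that override could look like (purely illustrative: the class name `SingleWorkerPyTorchTask` is made up, and it assumes the fallback task can extend flytekit's `PythonFunctionTask`):

```python
import os

from flytekit.core.python_function_task import PythonFunctionTask


class SingleWorkerPyTorchTask(PythonFunctionTask):
    """Hypothetical fallback task: run as a plain python task, but export the
    env vars torch.distributed expects when no training operator is involved."""

    def execute(self, **kwargs):
        # With workers=0 there is no kubeflow operator to inject these values,
        # so set single-process defaults before user code calls
        # torch.distributed.init_process_group().
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("RANK", "0")
        # Possibly needed as well (untested, per the comment above):
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        return super().execute(**kwargs)
```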
What do y'all think about throwing an error if workers=0 and telling people to use a standard python config if they want to run it on a single machine?
If people really want to set workers to 0, then I understand having a smooth fallback, but otherwise it could confuse people if they make a mistake.
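If the error route is chosen, the check itself is small; a minimal, hedged sketch (using a stand-alone dataclass for illustration, not the plugin's actual PyTorch config):

```python
from dataclasses import dataclass


@dataclass
class PyTorchConfig:
    """Illustrative stand-in for the plugin's PyTorch task_config."""

    num_workers: int = 1

    def __post_init__(self):
        if self.num_workers < 1:
            raise ValueError(
                "num_workers must be >= 1; for a single-machine run, drop the "
                "task_config and use a standard python task instead."
            )
```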
@ByronHsu @wild-endeavor the new pytorch elastic task can run locally and in a single k8s pod, but also with multiple workers using the kubeflow training operator. I'd say its functionality is a superset of the already existing PyTorch task config. What do you think about using this one in order to debug dist training with a single worker, @ByronHsu?
I think falling back to a normal pod (without the kubeflow operator) when doing task_config=PyTorch(num_workers=0) doesn't make much sense: the env vars like MASTER_ADDR, RANK, ... required by torch.distributed.init_process_group() would be set neither by the kubeflow operator nor by the pytorch task logic, so distributed training could not actually be tested.
I would propose to either 1) allow num_workers=0 in the PyTorch task but still use the kubeflow training operator in this case (when users don't want to use the training operator, they can use Elastic), or 2) not allow num_workers=0, as is the case now.
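For illustration, a single-worker debug run with the elastic config might look like the sketch below (assuming the `Elastic` config from flytekitplugins-kfpytorch takes `nnodes` and `nproc_per_node`; the task body is made up):

```python
import torch.distributed as dist
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# Single node, single process: torchrun still exports WORLD_SIZE, RANK,
# MASTER_ADDR, etc., so init_process_group() works without the kubeflow
# training operator.
@task(task_config=Elastic(nnodes=1, nproc_per_node=1))
def debug_train() -> int:
    dist.init_process_group(backend="gloo")
    return dist.get_world_size()  # expected to be 1 here
```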
wild-endeavor left a comment:
maybe @peridotml can chime in on this pr?
should this be closed?
TL;DR
Fallback to python task if worker is zero for pytorch
Type
Are all requirements met?
Tracking Issue
flyteorg/flyteplugins#348