
Conversation

@ByronHsu (Collaborator) commented May 10, 2023

TL;DR

Fall back to a plain python task if the number of workers is zero for the pytorch plugin.
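For reference, a minimal sketch of the configuration this fallback targets. This is illustrative only and assumes the flytekitplugins-kfpytorch PyTorch config exposes a num_workers field, as referenced in the review comments below:

from flytekit import task
from flytekitplugins.kfpytorch import PyTorch  # assumed plugin import path


# With the change described above, num_workers=0 would run as a plain
# python task instead of creating a PyTorchJob via the training operator.
@task(task_config=PyTorch(num_workers=0))
def train() -> int:
    import torch

    model = torch.nn.Linear(2, 1)
    return sum(p.numel() for p in model.parameters())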

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Tracking Issue

flyteorg/flyteplugins#348

Signed-off-by: byhsu <byhsu@linkedin.com>
codecov bot commented May 10, 2023

Codecov Report

Merging #1629 (6e01600) into master (35bb556) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1629   +/-   ##
=======================================
  Coverage   71.22%   71.22%           
=======================================
  Files         334      334           
  Lines       30391    30391           
  Branches     5490     5490           
=======================================
  Hits        21645    21645           
  Misses       8206     8206           
  Partials      540      540           

task_type=task_type,
**kwargs,
)

Member commented:

Two questions:

  • I think we might need to handle workers=0 separately in get_custom, since in this case we don't want to create a PyTorchJob. (See how this is done for the elastic task below.)
  • Currently, task_config=PyTorch(workers=0) is equivalent to no task_config at all. However, torch.distributed.init_process_group() will not work without the env vars set by the operator. We could solve this by overriding the execute method and simply setting the env vars WORLD_SIZE=1, RANK=0, and potentially the master address (would have to try whether it is required); see the sketch below.
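A rough sketch of the second point, purely illustrative: the subclass name, config field name, and default port below are assumptions, not the plugin's actual implementation.

import os

from flytekit.core.python_function_task import PythonFunctionTask


class SingleWorkerPyTorchTask(PythonFunctionTask):  # hypothetical subclass for illustration
    def execute(self, **kwargs):
        # Provide the env vars the training operator would normally inject,
        # so torch.distributed.init_process_group() works with one process.
        if getattr(self.task_config, "num_workers", None) == 0:  # field name assumed
            os.environ.setdefault("WORLD_SIZE", "1")
            os.environ.setdefault("RANK", "0")
            os.environ.setdefault("MASTER_ADDR", "localhost")  # may not be required
            os.environ.setdefault("MASTER_PORT", "29500")  # torch's conventional default
        return super().execute(**kwargs)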

Contributor commented:

What do y'all think about throwing an error if workers=0 and telling people to use a standard python task config if they want to run on a single machine?

If people really want to set workers to 0, then I understand having a smooth fallback; otherwise it could confuse people who make a mistake.
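A minimal sketch of the error-raising alternative suggested here, using a simplified stand-in for the plugin's config class (not its current behavior):

from dataclasses import dataclass


@dataclass
class PyTorch:  # simplified stand-in for flytekitplugins.kfpytorch.PyTorch
    num_workers: int

    def __post_init__(self):
        if self.num_workers < 1:
            raise ValueError(
                "num_workers must be >= 1; for a single machine, "
                "use a standard python task instead of the pytorch plugin."
            )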

Member commented:

@ByronHsu @wild-endeavor The new pytorch elastic task can run locally and in a single k8s pod, but also with multiple workers using the kubeflow training operator. I'd say its functionality is a superset of the existing PyTorch task config. What do you think about using this one to debug distributed training with a single worker, @ByronHsu?

I think falling back to a normal pod (without the kubeflow operator) when doing task_config=PyTorch(num_workers=0) doesn't make much sense: the env vars like MASTER_ADDR, RANK, ... required by torch.distributed.init_process_group() will not be set by either the kubeflow operator or the pytorch task logic, so distributed training cannot be tested.

I would propose to either 1) allow num_workers=0 in the PyTorch task but still use the kubeflow training operator in this case (users who don't want the training operator can use Elastic), or 2) not allow num_workers=0, as is the case now.
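For context, a hedged sketch of the elastic approach mentioned above, assuming the flytekitplugins-kfpytorch Elastic config (parameter names may differ between versions):

from flytekit import task
from flytekitplugins.kfpytorch import Elastic  # assumed plugin import path


# nnodes=1 runs the torch elastic launcher locally / in a single pod,
# so MASTER_ADDR, RANK, WORLD_SIZE, etc. are set without the operator.
@task(task_config=Elastic(nnodes=1, nproc_per_node=2))
def train() -> None:
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()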

@wild-endeavor (Contributor) left a comment:

Maybe @peridotml can chime in on this PR?

@kumare3 (Contributor) commented Oct 5, 2023

should this be closed?

@ByronHsu closed this Nov 10, 2023