No training when using 2 nodes and torchrun #60

@ElizavetaSedova

Description

Thank you for adding the ability to use multiple nodes without Slurm! When I run training on one machine with torchrun, everything works fine. But when I run it on two machines, training freezes: the machines connect to each other and the model loads, but the training process never continues. I am still trying to figure out where it gets stuck. My launch command:

torchrun --master-addr [ip] \
--master-port [port] \
--node-rank 0 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]
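For reference, a common cause of this symptom is a mismatched launch across the two machines: with torchrun, each node must be started separately with its own --node-rank (0 on the first machine, 1 on the second) while every other rendezvous flag stays identical on both. A sketch of the paired launch commands, keeping the [ip], [port], and [ARGS] placeholders from the command above; NCCL_DEBUG=INFO is an assumed diagnostic addition, not part of the original command:

```
# On node 0 (the rendezvous master):
NCCL_DEBUG=INFO torchrun --master-addr [ip] --master-port [port] \
    --node-rank 0 --nnodes 2 --nproc-per-node 2 \
    -m dora run [ARGS]

# On node 1, only --node-rank changes:
NCCL_DEBUG=INFO torchrun --master-addr [ip] --master-port [port] \
    --node-rank 1 --nnodes 2 --nproc-per-node 2 \
    -m dora run [ARGS]
```

If both nodes are launched this way and training still hangs, the NCCL_DEBUG=INFO output usually shows which collective or network interface the processes are stuck on.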

Labels: bug (Something isn't working)