Started jobs hang at "Running calculations ..." #261
Hello, I am currently having the problem that jobs are sent to the workers, but they never seem to start and thus get canceled when they hit the time limit. The code itself should be okay since it runs without problem using e.g. the My first guess is that the workers cannot communicate because they don't find zeromq. I tried to set the Worker log SSH log Thank you very much
This looks less like a library issue, more like a network (SSH) forwarding issue. Can you tell me:
Hey, interesting that this might be an SSH issue.
Hmm... running on the login node doesn't work, and clustermq gets stuck during this step: Which is the same step where it gets stuck when using SSH.
Ok, that makes it easier because now we know the issue is a connection problem from the workers to the login node, and not related to SSH. Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface. You likely need to set options(clustermq.host="<interface that accepts worker connections>"). You can list your network interfaces using the ifconfig command, which will look something like the following: Here, You can use this code to check which interface the node name resolves to: To decide which interface to use instead, either (1) check manually for incoming connections, e.g. using netcat, or (2) try different interfaces until it works.
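The code from the original reply is not preserved here, so the following is only a minimal sketch of the check described above, not the maintainer's snippet. It assumes a Unix-like login node (utils::nsl() is not available on Windows) and that either ifconfig or ip is installed. The variable names are made up for illustration.

```r
# Run in an R session on the login node.

# Node name that workers try to connect to, per the explanation above
node <- unname(Sys.info()["nodename"])

# IP address this name resolves to; nsl() returns NULL if the lookup fails
addr <- utils::nsl(node)

if (is.null(addr)) {
  message("Could not resolve node name: ", node)
} else {
  message(node, " resolves to ", addr)
}

# List the network interfaces so the address above can be matched to one of them.
# Falls back to `ip addr` on systems without ifconfig.
tools <- Sys.which(c("ifconfig", "ip"))
if (nzchar(tools[["ifconfig"]])) {
  writeLines(system2(tools[["ifconfig"]], stdout = TRUE))
} else if (nzchar(tools[["ip"]])) {
  writeLines(system2(tools[["ip"]], "addr", stdout = TRUE))
} else {
  message("Neither ifconfig nor ip was found on the PATH")
}

# If the resolved address belongs to an interface the workers cannot reach,
# set the option from the reply above (placeholder left unfilled on purpose):
# options(clustermq.host = "<interface that accepts worker connections>")
```

If the printed address sits on an interface that compute nodes cannot reach, that would be consistent with workers hanging at "Running calculations ...".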
Hi, I am also encountering this issue when trying to use Following the advice in the thread above, I ran Are there other ways to use clustermq to connect to my EC2 instance? For example, can you provide more information on what a job-job connection is? Thanks,
Yes, the Q calls work manually in R after SSHing in. When I SSH in and load clustermq, R (on EC2) says I know that clustermq won't work from my local machine when that is set to LOCAL, so I have been trying to set it to How can I set the scheduler to be multiprocess (or slurm, or anything besides local) when I leave R? There are two types of logs on EC2. cmq_ssh.log says this There are also logs from Q running correctly, but those are when I run R directly on EC2 after SSHing in.
Ok, that makes it easier because now we know the issue is a connection problem from the workers to the login node, and not related to ssh.
Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface.
You likely need to set options(clustermq.host="<interface that accepts worker connections>").
You can list your network interfaces using the ifconfig command, which will look something like the following: