Started jobs hang at "Running calculations ..." #261

mhesselbarth · 2021-04-16T13:12:11Z

mhesselbarth
Apr 16, 2021

Hello,

I am currently having the problem that jobs are sent to the workers but never seem to actually start, so they get canceled when they hit the time limit. The code itself should be fine, since it runs without problems using e.g. the future package, and all I'm doing is getting the nodename (fx <- function(x) {Sys.sleep(30); Sys.info()["nodename"]}).

My first guess is that the workers cannot communicate because they don't find zeromq. I tried setting LD_LIBRARY_PATH to the zeromq installation, but this didn't help (setenv ('LD_LIBRARY_PATH', 'home/mhessel/zeromq-4.0.3/')).

Worker log

2021-04-16 08:40:25.777142 | Master: tcp://gl-login2.arc-ts.umich.edu:7313
2021-04-16 08:40:25.798204 | WORKER_UP to: tcp://gl-login2.arc-ts.umich.edu:7313
slurmstepd: error: *** JOB 19291379 ON gl3031 CANCELLED AT 2021-04-16T08:42:39 DUE TO TIME LIMIT ***

SSH log

> clustermq:::ssh_proxy(ctl=51896, job=50915)
master ctl listening at: tcp://127.0.0.1:51896
forwarding local network from: tcp://gl-login2.arc-ts.umich.edu:7313
sent PROXY_UP to master ctl
received common data:function (x) {    Sys.sleep(30)    Sys.info()["nodename"]}
setting up qsys: SLURM
sent PROXY_READY to master ctl
received: PROXY_CMDqsys$submit_jobs(job_name = "clustermq", service = "short", mem_cpu = 512, walltime = "00:02:00", log_file = "clustermq.log", n_jobs = 3, log_worker = TRUE, verbose = TRUE)
Submitting 3 worker jobs (ID: clustermq) ...
received: PROXY_STOPTRUE
shutting down and cleaning up
Master: [247.2s 0.0% CPU]; Worker: [avg NA% CPU, max NA Mb]

Thank you very much

Answered by mschubert

Apr 27, 2021

mschubert · 2021-04-20T11:51:28Z

mschubert
Apr 20, 2021
Maintainer

This looks less like a library issue, more like a network (SSH) forwarding issue.

Can you tell me:

Does your code work if you run it on your login node instead of via SSH?
Which version of clustermq are you using?
Did this work before? If yes, what changed? (e.g. package update from version X to version Y)

0 replies

mhesselbarth · 2021-04-26T17:49:08Z

mhesselbarth
Apr 26, 2021
Author

Hey,

Interesting that this might be an SSH issue.

I am using clustermq_0.8.95.1
I have used clustermq before, but on a different HPC. I have never used clustermq on the HPC I am using now, and I am not aware of anybody else having done so.

Mmh... running on the login node doesn't work either, and clustermq gets stuck during this step:

Submitting 3 worker jobs (ID: clustermq) ...
Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...

Which is the same step where it gets stuck when using SSH.

0 replies

mschubert · 2021-04-27T07:42:02Z

mschubert
Apr 27, 2021
Maintainer

Ok, that makes it easier because now we know the issue is a connection problem from the workers to the login node, and not related to ssh.

Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface.

You likely need to set options(clustermq.host="<interface that accepts worker connections>").

You can list your network interfaces using the ifconfig command, which will look something like the following:

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        ...

em3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.23.44.3  netmask 255.255.252.0  broadcast 172.23.47.255
        inet6 fe80::eef4:bbff:fece:2514  prefixlen 64  scopeid 0x20<link>
        ...

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 172.23.52.3  netmask 255.255.252.0  broadcast 172.23.55.255
        inet6 fe80::f652:1403:79:8a11  prefixlen 64  scopeid 0x20<link>
        ...

Here, Sys.info()["nodename"] resolved to the em3 interface, which did not accept incoming connections. Setting options(clustermq.host="ib0") solved the issue.

You can use this code to check which interface the node name resolves to:

R -e 'system(paste("nslookup", Sys.info()["nodename"]))'
> Name:	your.node.name
> Address: 172.23.44.3 # <- this matches the inet resolved from the node name

To decide which interface to use instead, either (1) check manually for incoming connections e.g. using netcat, or (2) try different interfaces until it works.
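
A minimal sketch of applying the option discussed above, assuming ib0 is the interface that accepts worker connections (it was in the example listing here; yours may differ). The option is set in the R session that submits the jobs, and fx is the test function from the original post.

```r
# Assumption: "ib0" accepts incoming worker connections on this cluster.
# Replace it with the interface that works on your system.
options(clustermq.host = "ib0")

library(clustermq)

# Test function from the original post: each worker reports its node name
fx <- function(x) {
    Sys.sleep(30)
    Sys.info()["nodename"]
}

# Submit three calls; worker logs help if the jobs still hang
res <- Q(fx, x = 1:3, n_jobs = 3, log_worker = TRUE)
```

If the jobs still hang at "Running calculations ...", change the interface name and try again, as described in option (2).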

5 replies

c1au6i0 Nov 30, 2021

Yes, I have the same problem just trying to run the user guide example using Slurm. It submits the jobs but hangs with the same error.
I have an extremely naive follow-up question: how do I actually establish the "interface that accepts worker connections"?
Thanks!

mschubert Nov 30, 2021
Maintainer

I added more explanation above.

c1au6i0 Nov 30, 2021

@mschubert thank you for your help, it is very appreciated! In my case, I have lo, ib0, em1-4. I have tried all of them without success. I am working on a clinically graded node, not sure if that adds an extra level of security and/or has any connection with this.

mschubert Nov 30, 2021
Maintainer

There may well be (login) nodes that do not accept incoming connections from jobs at all (in which case that's a limitation imposed by your sys admins that will unfortunately block clustermq entirely). An alternative might be to run your main process in a job as well (if job-job connections are allowed), or use e.g. the batchtools package that transmits data via the file system instead.

c1au6i0 Nov 30, 2021

Yes, it seems that job-job connections work and are a viable workaround. Thanks @mschubert !!

zt8zf · 2025-10-23T17:38:57Z

zt8zf
Oct 23, 2025

Hi,

I am also encountering this issue when trying to use clustermq to connect to an EC2 instance I spun up. Specifically, when I try to run result <- Q(function(x) x^2, x = 1:10, n_jobs = 2, log_worker=TRUE), I get

Connecting to ‘ec2-user@ec2-18-221-243-114.us-east-2.compute.amazonaws.com’ via SSH ...
Running 10 calculations (5 objs/19.7 Kb common; 1 calls/chunk) ...
Error: Interrupted system call
Master: [19.9 secs 1.0% CPU]; Worker: [avg NA% CPU, max 0 bytes]

Following the advice in the thread above, I ran ifconfig which listed two options: ens5 and lo. When I use either of them in options(clustermq.host = 'ens5'), R says Error: Binding port failed (Operation not supported by device).

Are there other ways to use clustermq to connect to my EC2 instance? For example, can you provide more information on what a job-job connection is?

Thanks,
Zach

1 reply

mschubert Oct 23, 2025
Maintainer

This looks like your SSH connection is working (otherwise it would get stuck at "Connecting ..."). If you ssh into your instance and run your Q call in R manually, does that work? What do the ssh or worker logs say?

zt8zf · 2025-10-23T19:15:07Z

zt8zf
Oct 23, 2025

Yes, the Q calls work manually in R after SSHing in. When I SSH in and load clustermq, R (on EC2) says

> library(clustermq)
* Option 'clustermq.scheduler' not set, defaulting to ‘LOCAL’
--- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration

I know that clustermq won't work from my local machine when that is set to LOCAL, so I have been trying to set it to multiprocess or slurm. But whenever I exit R on EC2 and go back in and re-load clustermq, I get the same above message about the scheduler defaulting to LOCAL.

How can I set the scheduler to be multiprocess (or slurm, or anything besides local) when I leave R?

There are two types of logs on EC2. cmq_ssh.log says this

> clustermq:::ssh_proxy(50665)
2025-10-23 19:12:27.342580 | listening for workers at tcp://ip-172-31-30-160.us-east-2.compute.internal:8743
2025-10-23 19:12:27.466122 | submit args: log_worker=TRUE, n_jobs=2
2025-10-23 19:12:27.466748 | setting up qsys: LOCAL
Error in doTryCatch(return(expr), name, parentenv, handler) : 
  Remote SSH QSys ‘LOCAL’ is not allowed
Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

There are also logs from Q running correctly, but those are when I run R directly on EC2 after SSHing in.

2 replies

mschubert Oct 23, 2025
Maintainer

You can set this in your ~/.Rprofile on your EC2 instance (add the options(clustermq.scheduler = "multiprocess") line there); then the original SSH command should work as well.
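
As a sketch, the two sides of the SSH setup might look like the fragments below. The remote line is the one suggested in this reply; the local options follow the clustermq user guide, and the host string and log path are placeholders rather than values from this thread. The two fragments belong in different files on different machines.

```r
# --- On the EC2 instance, in ~/.Rprofile ---
# Scheduler the SSH proxy uses to start workers on the remote side
options(clustermq.scheduler = "multiprocess")

# --- On the local machine, e.g. in its own ~/.Rprofile ---
# Placeholder host and log path: adjust both to your setup
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@host",
    clustermq.ssh.log = "~/cmq_ssh.log"
)
```

Because ~/.Rprofile is read at the start of each R session, the remote setting persists after leaving R, which avoids the fallback to LOCAL shown in the SSH log above.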

zt8zf Oct 24, 2025

That worked beautifully! Thanks! Sorry I missed that part in the user guide.
