Hi @mschubert, thanks for maintaining clustermq, which seems interesting!
I was wondering if it is possible to retrieve partial results from SLURM even if a job times out?
For example, consider this code:
myfun <- function(x){
  Sys.sleep(10)
  x*2
}
minutes.per.job <- 1
result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)
with this SLURM template:
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ megabytes | 200 }}
#SBATCH --time={{ minutes | 30 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
I get output like this:
> result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)
Submitting 2 worker jobs to SLURM as 'cmq8454' ...
Running 20 calculations (5 objs/20.2 Kb common; 1 calls/chunk) ...
[=====================================>----------------] 70% (2/2 wrk) eta: 32s
Error: Socket timed out after 60053 ms
Master: [2.3 mins 0.0% CPU]; Worker: [avg 2.1% CPU, max 227.8 Mb]
!> result.list
Error: object 'result.list' not found
I set a time limit of 1 minute per worker job, with 2 worker jobs, but each call to myfun takes 10 seconds, so there is not enough time to finish all 20 calls. I get an error from Q(), but is there a way to get a list of results for the calls that could be computed within the time limit (and NULL for the ones that timed out)?
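To make the time budget concrete, here is a rough back-of-the-envelope sketch in base R. It is idealized: it assumes zero worker start-up and communication overhead and that SLURM stops each worker exactly at the time limit, neither of which is guaranteed. The variable names are mine, chosen only for illustration.

```r
# Idealized time budget for the example above.
# Assumptions (not measured): no start-up/communication overhead,
# and each worker is stopped exactly at the time limit.
seconds.per.call <- 10   # Sys.sleep(10) in myfun
n.calls <- 20            # x=1:20
n.workers <- 2           # n_jobs=2
limit.seconds <- 1 * 60  # minutes.per.job=1

# Wall time each worker would need to finish its share of the calls
seconds.needed <- n.calls * seconds.per.call / n.workers

# Whole calls that fit into one worker's time limit
calls.per.worker <- limit.seconds %/% seconds.per.call
calls.finished <- min(n.calls, n.workers * calls.per.worker)

cat("seconds needed per worker:", seconds.needed, "\n")
cat("calls that fit in the limit:", calls.finished, "of", n.calls, "\n")
```

Under these assumptions each worker would need 100 seconds but only gets 60, so at most 12 of the 20 calls could finish. The progress bar above reached 70%, so the real cut-off was evidently a little later than this idealized estimate; either way, some results were computed before the error and those are the ones I would like to keep.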
Thanks!!