partial results when SLURM cancels jobs due to time limit #339

@tdhock

Hi @mschubert, thanks for maintaining clustermq, which seems interesting!

I was wondering whether it is possible to retrieve partial results from SLURM even if a job times out.

For example, consider this code:

myfun <- function(x){
  Sys.sleep(10)
  x*2
}
minutes.per.job <- 1
result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)

with this SLURM template:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ megabytes | 200 }}
#SBATCH --time={{ minutes | 30 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

I get output like this:

> result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)
 Submitting 2 worker jobs to SLURM as 'cmq8454' ...
 Running 20 calculations (5 objs/20.2 Kb common; 1 calls/chunk) ...
 [=====================================>----------------]  70% (2/2 wrk) eta: 32s



 Error: Socket timed out after 60053 ms
 Master: [2.3 mins 0.0% CPU]; Worker: [avg 2.1% CPU, max 227.8 Mb]
!> result.list
 Error: object 'result.list' not found

I set a time limit of 1 minute per worker job, with 2 worker jobs, but each of the 20 calls to myfun takes 10 seconds, so there is not enough time to finish them all. I get an error from Q(), but is there a way to get a list of results for the calls that could be computed within the time limit (and NULL for the calls that timed out)?
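To make the arithmetic behind this explicit, here is a minimal base R sketch. The variable names are made up for illustration, and it assumes that worker start-up time, communication overhead, and the way the scheduler enforces the limit can all be ignored, so it is only a rough estimate of how many of the 20 calls to myfun could finish:

```r
# Rough estimate only: ignores worker start-up, communication overhead,
# and how strictly the scheduler enforces the limit (all assumptions).
n_calls <- 20            # length of x in the Q() call
n_workers <- 2           # n_jobs
seconds_per_call <- 10   # Sys.sleep(10) in myfun
limit_seconds <- 1 * 60  # minutes.per.job converted to seconds

# Whole calls that fit within one worker's time limit.
calls_per_worker <- floor(limit_seconds / seconds_per_call)
# Calls that can finish across all workers, capped at the total.
max_finished <- min(n_calls, n_workers * calls_per_worker)
# Wall time needed if the calls are split evenly over the workers.
needed_seconds <- ceiling(n_calls / n_workers) * seconds_per_call

cat(max_finished, "of", n_calls, "calls fit;",
    needed_seconds, "s needed vs", limit_seconds, "s limit\n")
```

By this estimate only part of the work fits under the limit. The progress bar above reached 70% before the error, so the real count can differ from this simple calculation.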

Thanks!!
