Job crashes if JobRunner.update_job() raises ConnectionError #84

@LourensVeen

This can easily happen if you lose your network connection after submitting a job. A ConnectionError here should be grounds for a retry on the next connection attempt, not for moving the job into SystemError. RemoteJobFiles.update_job() should be checked for the same problem as well.

[2019-01-12 19:03:14.060] [CRITICAL] Traceback (most recent call last):
  File "cerise/../cerise/back_end/execution_manager.py", line 290, in _process_jobs
    self._job_runner.update_job(job_id)
  File "cerise/../cerise/back_end/job_runner.py", line 52, in update_job
    status = self._sched.get_status(job.remote_job_id)
  File "/usr/local/lib/python3.5/dist-packages/cerulean/slurm_scheduler.py", line 66, in get_status
    10, command, ['-j', job_id, '-h', '-o', '%T'], None, None)
  File "/usr/local/lib/python3.5/dist-packages/cerulean/ssh_terminal.py", line 137, in run
    raise ConnectionError(str(last_exception))
ConnectionError: Timeout opening channel.
 [cerise.back_end.execution_manager]
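
A minimal sketch of the requested behaviour, for illustration only. This is not Cerise's actual code: `process_jobs`, the `update_job` callable and the `states` dictionary are hypothetical stand-ins, and the sketch assumes the exception in the traceback is Python's built-in `ConnectionError`. A connection failure leaves the job's state untouched so that it is retried on the next pass, while any other exception still moves the job to SystemError.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger(__name__)


def process_jobs(job_ids: Iterable[str],
                 update_job: Callable[[str], None],
                 states: Dict[str, str]) -> List[str]:
    """Update each job once and return the ids to retry on the next pass.

    All names here are hypothetical; this only illustrates the idea of
    treating a lost connection as transient.
    """
    retry_later = []
    for job_id in job_ids:
        try:
            update_job(job_id)
        except ConnectionError as e:
            # Transient network problem: keep the current state and
            # try again on the next connection attempt.
            logger.warning('Connection lost updating job %s: %s', job_id, e)
            retry_later.append(job_id)
        except Exception:
            # Anything else is still treated as an internal error.
            logger.critical('Unexpected error updating job %s', job_id,
                            exc_info=True)
            states[job_id] = 'SystemError'
    return retry_later
```

With something along these lines, a timeout such as the one in the traceback above would only produce a warning and a later retry, and the job would keep its previous state.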
