Job.load() raises KeyError for array parent job when parent has finished but child tasks are still running #397

@dliptai

We are encountering an issue when working with Slurm job arrays via pyslurm.Job.load().

In a job array where:

  • SLURM_JOB_ID == SLURM_ARRAY_JOB_ID (i.e. the array parent job ID)
  • The parent job has finished
  • Some array tasks are still running

Calling:

pyslurm.Job.load(job_id)

where job_id is the array parent ID, results in:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pyslurm/core/job/job.pyx", line 307, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/job.pyx", line 300, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/step.pyx", line 103, in pyslurm.core.job.step.JobSteps._load_single
KeyError: 222

(where 222 is the job ID in this case)

Environment:

  • Slurm version: 24.11.6
  • pyslurm version: 24.11.0

Analysis:
Job.load() attempts to load job steps as part of job construction.

Normally:

  • If the returned dictionary of steps (from JobSteps._load_data()) is empty and the job is not pending, an RPCError is raised.
    data = steps._load_data(job.id, slurm.SHOW_ALL)
    if not data and not slurm.IS_JOB_PENDING(job.ptr):
        msg = f"Failed to load step info for Job {job.id}."
        raise RPCError(msg=msg)
  • That RPC error is handled in Job.load().
    if not slurm.IS_JOB_PENDING(wrap.ptr):
        # Just ignore if the steps couldn't be loaded here.
        try:
            wrap.steps = JobSteps._load_single(wrap)
        except RPCError:
            pass

The problematic case appears to be:

  1. JobSteps._load_single() is called for the array parent job ID.
  2. The RPC, and thus JobSteps._load_data(), returns steps for all array elements that still have running steps.
  3. The parent job itself has no steps (it is already finished).
  4. Therefore, the returned dictionary is non-empty, but does not contain an entry for the parent job ID.
  5. JobSteps._load_single() then attempts to index into the dictionary using the parent job ID.
  6. Since that key does not exist, a KeyError is raised.
  7. This bypasses the normal RPC error handling path in Job.load().
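
The sequence above can be modelled without pyslurm. This is only a sketch based on my reading of the code: RPCError, load_data, load_single and job_load are stand-ins written for illustration, not the real pyslurm functions, and the job IDs 223 and 224 are made up.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (assumption, not the real class)."""


def load_data(job_id, steps_by_job):
    # Hypothetical model of JobSteps._load_data(): a dict of
    # {job_id: steps} for every job the RPC reported steps for.
    data = dict(steps_by_job)
    if not data:
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return data


def load_single(job_id, steps_by_job):
    # Models steps 5 and 6: the result is indexed directly by job ID.
    data = load_data(job_id, steps_by_job)
    return data[job_id]


def job_load(job_id, steps_by_job):
    # Models the try/except in Job.load(): only RPCError is ignored.
    try:
        return load_single(job_id, steps_by_job)
    except RPCError:
        return None


# Finished non-array job: empty dict, RPCError raised and swallowed.
print(job_load(222, {}))

# Finished array parent with running tasks: non-empty dict without key 222.
try:
    job_load(222, {223: ["batch"], 224: ["batch"]})
except KeyError as exc:
    print("KeyError:", exc)
```

In this model the second call escapes the except clause for the same reason as in the traceback: the dictionary is non-empty, so no RPCError is raised, and the lookup by parent ID fails afterwards.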

By contrast:

  • If a single (non-array) job is finished, JobSteps._load_data() returns an empty dict.
  • That empty dict correctly triggers the RPC error path, which is handled.

So the failure only occurs when:

  • The array parent is finished, and
  • Some child tasks are still running.
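
One possible direction, offered as a suggestion rather than a tested patch: treat a missing entry for the requested job ID the same way as an empty result, so that the existing RPCError handling in Job.load() covers it. The sketch below is plain Python with a stand-in RPCError and a hypothetical helper pick_steps; it does not use the actual pyslurm internals.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (assumption, not the real class)."""


def pick_steps(data, job_id):
    # Hypothetical helper: look up the steps for job_id and treat
    # "no entry for this job" like "no step data at all".
    steps = data.get(job_id)
    if steps is None:
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return steps


# Parent 222 is finished; only the (made-up) task 223 still has steps.
try:
    pick_steps({223: ["batch"]}, 222)
except RPCError as exc:
    print("handled:", exc)
```

With a guard like this, the array-parent case would take the same handled path as a finished non-array job. Whether an RPCError or an empty steps collection is the better result here is a design decision for the maintainers.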
