Open
Description
We are encountering an issue when working with Slurm job arrays via pyslurm.Job.load().
In a job array where:
- SLURM_JOB_ID == SLURM_ARRAY_JOB_ID (i.e. the array parent job ID)
- The parent job has finished
- Some array tasks are still running
Calling:
pyslurm.Job.load(job_id)
where job_id is the array parent ID, results in:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "pyslurm/core/job/job.pyx", line 307, in pyslurm.core.job.job.Job.load
File "pyslurm/core/job/job.pyx", line 300, in pyslurm.core.job.job.Job.load
File "pyslurm/core/job/step.pyx", line 103, in pyslurm.core.job.step.JobSteps._load_single
KeyError: 222
(where 222 is the job ID in this case)
Environment:
- Slurm version: 24.11.6
- pyslurm version: 24.11.0
Analysis:
Job.load() attempts to load job steps as part of job construction.
Normally:
- If the returned dictionary of steps (from JobSteps._load_data()) is empty, an RPC error is raised.
  pyslurm/pyslurm/core/job/step.pyx
  Lines 98 to 101 in 2adb2da
      data = steps._load_data(job.id, slurm.SHOW_ALL)
      if not data and not slurm.IS_JOB_PENDING(job.ptr):
          msg = f"Failed to load step info for Job {job.id}."
          raise RPCError(msg=msg)
- That RPC error is handled in Job.load().
  pyslurm/pyslurm/core/job/job.pyx
  Lines 297 to 302 in 2adb2da
      if not slurm.IS_JOB_PENDING(wrap.ptr):
          # Just ignore if the steps couldn't be loaded here.
          try:
              wrap.steps = JobSteps._load_single(wrap)
          except RPCError:
              pass
The problematic case appears to be:
- JobSteps._load_single() is called for the array parent job ID.
- The RPC and thus JobSteps._load_data() returns steps for all array elements that still have running steps.
- The parent job itself has no steps (it is already finished).
- Therefore, the returned dictionary is non-empty, but does not contain an entry for the parent job ID.
- JobSteps._load_single() then attempts to index into the dictionary using the parent job ID.
- Since that key does not exist, a KeyError is raised.
- This bypasses the normal RPC error handling path in Job.load().
By contrast:
- If a single (non-array) job is finished, JobSteps._load_data() returns an empty dict.
- That empty dict correctly triggers the RPC error path, which is handled.
So the failure only occurs when:
- The array parent is finished, and
- Some child tasks are still running.
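One possible direction, offered only as a sketch and not as the maintainers' fix: treat a missing entry for the requested job ID the same way as an empty result, so that it goes through the existing RPC error path. The names below (`RPCError`, `load_single_guarded`, `load_job`) are hypothetical stand-ins in plain Python, not pyslurm code, and the pending-job check is omitted for brevity.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (simplified for this sketch)."""


def load_single_guarded(job_id, steps_by_job):
    """Look up the steps for job_id; a missing key is reported as RPCError."""
    steps = steps_by_job.get(job_id)
    if steps is None:
        # Covers both an empty dict and a dict that only holds other job IDs.
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return steps


def load_job(job_id, steps_by_job):
    """Same handling as in the Job.load() excerpt: ignore RPCError."""
    try:
        return load_single_guarded(job_id, steps_by_job)
    except RPCError:
        return None


running_tasks = {223: ["batch"], 224: ["batch"]}

parent_steps = load_job(222, running_tasks)  # finished parent, no entry
single_steps = load_job(100, {})             # finished non-array job
task_steps = load_job(223, running_tasks)    # running array task

print(parent_steps, single_steps, task_steps)
```

With this guard, the finished array parent and the finished non-array job behave the same: the job loads without steps instead of raising.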