Summary
Following the install instructions, I can start mana_coordinator, but launching a trivial MPI “hello” program under MANA fails on our HPC cluster. I’ve reproduced the failure with both OpenMPI 5.0.4 (mpirun) and MPICH 4.1.2 (srun). In both cases the MPI program runs fine without MANA, but under mana_launch the tasks exit with code 99 and the MANA coordinator subsequently becomes unreachable (see the mana_status output below).
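For reference, hello_mpi.c is essentially the following. This is a minimal sketch reconstructed from the program’s output shown in the transcripts below; the gather logic, buffer size, and sleep interval are assumptions on my part, not the exact source.

```c
/* hello_mpi.c -- minimal sketch of the reproducer (reconstructed, not exact).
 * Each rank contributes rank*(step+1) to an MPI_Gather on rank 0, which
 * prints the gathered values each step until the job is interrupted. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, world, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    MPI_Get_processor_name(host, &len);

    if (rank == 0)
        printf("[start] world=%d host[0]=%s\n", world, host);

    int gathered[64];                 /* assumes world <= 64 in this sketch */
    for (int step = 0; ; step++) {
        int value = rank * (step + 1);
        MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("[step %d] gathered:", step);
            for (int i = 0; i < world; i++)
                printf(" %d", gathered[i]);
            printf("\n");
            fflush(stdout);
        }
        sleep(1);                     /* pace the loop; run until interrupted */
    }

    MPI_Finalize();                   /* unreachable; the loop runs until killed */
    return 0;
}
```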
Environment
- Cluster scheduler: Slurm
- Allocation: 1 exclusive compute node
- OS: Rocky Linux 9.4
- MPI stacks tested: OpenMPI 5.0.4 (module openmpi/5.0.4, loaded as openmpi below) and MPICH 4.1.2 (module mpich/4.1.2)
- Launcher(s): mpirun (OpenMPI), srun (Slurm+MPICH)
- PMIx: Slurm’s PMIx plugin in use; error mentions pmix_v3
- Temporary directory for checkpoints: mpi_ckpt_images/ (user-writable on local filesystem)
What I did
Terminal A (coordinator):
rpc-95-9:$ mana_coordinator
*** Coordinator/job information written to /mnt/REPACSS/home/yongzhao/.mana-slurm-22306.rc
rpc-95-9:$ mana_status
Coordinator:
Host: rpc-95-9
Port: 7779
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE, BARRIER
After a failed run (see below), running mana_status again shows:
rpc-95-9:$ mana_status
Coordinator not found. Please check port and host.
*** Checking for coordinator:
+ /mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_command --status --coord-host rpc-95-9 --coord-port 7779
Coordinator not found. Please check port and host.
+ set +x
No coordinator detected. Try:
/mnt/REPACSS/home/yongzhao/software/mana/bin/mana_coordinator
Or:
/mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_coordinator --exit-on-last -q --daemon
For help, do: /mnt/REPACSS/home/yongzhao/software/mana/bin/mana_command --help
Terminal B (OpenMPI test):
rpc-95-9:$ ml load openmpi
rpc-95-9:$ ls
hello_mpi hello_mpi.c mpi_ckpt_images omp_dmtcp_demo research_projects scripts software
rpc-95-9:$ mpirun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
^Crpc-95-9:$ mpirun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:
Process name: [prterun-rpc-95-9-40239@1,1]
Exit code: 99
--------------------------------------------------------------------------
After that failure, the coordinator becomes unreachable as shown above.
Terminal B (MPICH/Slurm test):
rpc-95-9:$ ml swap openmpi mpich/4.1.2
rpc-95-9:$ srun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
[step 4] gathered: 0 5 10 15
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 5] gathered: 0 6 12 18
^Csrun: sending Ctrl-C to StepId=22306.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.0 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:05 ***
rpc-95-9:$ srun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
srun: error: rpc-95-9: tasks 0,2-3: Exited with exit code 99
slurmstepd: error: mpi/pmix_v3: _errhandler: rpc-95-9 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.22306.1:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.1 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:37 ***
srun: error: rpc-95-9: task 1: Exited with exit code 99