MANA fails to launch simple MPI app on HPC (OpenMPI 5.0.4 & MPICH 4.1.2); processes exit with code 99 and coordinator disconnects #464

@billzyj

Summary
Following the install instructions, I can start mana_coordinator, but launching a trivial MPI “hello” program under MANA fails on our HPC cluster. I’ve reproduced this with both OpenMPI 5.0.4 (mpirun) and MPICH 4.1.2 (srun). In both cases the MPI program runs fine without MANA, but under mana_launch the tasks exit with code 99 and the MANA coordinator then becomes unreachable (mana_status reports “Coordinator not found”).

Environment

  • Cluster scheduler: Slurm
  • Allocation: 1 exclusive compute node
  • OS: Rocky Linux 9.4
  • MPI stacks tested: OpenMPI 5.0.4 (module: openmpi/5.0.4, also shown as openmpi); MPICH 4.1.2 (module: mpich/4.1.2)
  • Launcher(s): mpirun (OpenMPI), srun (Slurm+MPICH)
  • PMIx: Slurm’s PMIx plugin in use; error mentions pmix_v3
  • Temporary directory for checkpoints: mpi_ckpt_images/ (user-writable on local filesystem)
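
For anyone trying to match this environment, most of the values above can be collected with a few commands. The snippet below is only a sketch: report_tool is a hypothetical helper name (not part of MANA, Slurm, or either MPI stack), and the output will differ between sites. srun --mpi=list prints the MPI plugin types the local Slurm offers, which is where a pmix variant would show up.

```shell
#!/bin/sh
# Hypothetical helper: run "<tool> <args...>" if the tool is on PATH,
# otherwise report that it is missing instead of failing.
report_tool() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 || true
    else
        echo "$tool: not found in PATH"
    fi
}

# OS release, if the standard os-release file is present.
if [ -r /etc/os-release ]; then
    grep PRETTY_NAME /etc/os-release || true
fi

report_tool srun --version     # Slurm version
report_tool srun --mpi=list    # MPI plugin types offered by this Slurm
report_tool mpirun --version   # launcher from the currently loaded MPI module
```

Running it once per loaded module (openmpi, then mpich/4.1.2) gives one block of output per MPI stack.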

What I did

Terminal A (coordinator):

rpc-95-9:$ mana_coordinator
*** Coordinator/job information written to /mnt/REPACSS/home/yongzhao/.mana-slurm-22306.rc
rpc-95-9:$ mana_status
Coordinator:
  Host: rpc-95-9
  Port: 7779
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE, BARRIER

After a failed run (see below), running mana_status again shows:

rpc-95-9:$ mana_status
Coordinator not found. Please check port and host.
*** Checking for coordinator:
+ /mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_command --status --coord-host rpc-95-9 --coord-port 7779
Coordinator not found. Please check port and host.
+ set +x
  No coordinator detected.   Try:
    /mnt/REPACSS/home/yongzhao/software/mana/bin/mana_coordinator
  Or:
    /mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_coordinator --exit-on-last -q --daemon
  For help, do:  /mnt/REPACSS/home/yongzhao/software/mana/bin/mana_command --help

Terminal B (OpenMPI test):

rpc-95-9:$ ml load openmpi
rpc-95-9:$ ls
hello_mpi  hello_mpi.c  mpi_ckpt_images  omp_dmtcp_demo  research_projects  scripts  software
rpc-95-9:$ mpirun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
^C
rpc-95-9:$ mpirun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-rpc-95-9-40239@1,1] Exit code:    99
--------------------------------------------------------------------------

After that failure, the coordinator becomes unreachable as shown above.

Terminal B (MPICH/Slurm test):

rpc-95-9:$ ml swap openmpi mpich/4.1.2
rpc-95-9:$ srun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
[step 4] gathered: 0 5 10 15
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 5] gathered: 0 6 12 18
^Csrun: sending Ctrl-C to StepId=22306.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.0 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:05 ***
rpc-95-9:$ srun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
srun: error: rpc-95-9: tasks 0,2-3: Exited with exit code 99
slurmstepd: error:  mpi/pmix_v3: _errhandler: rpc-95-9 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.22306.1:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.1 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:37 ***
srun: error: rpc-95-9: task 1: Exited with exit code 99
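
To make the two failing runs easier to compare, each launch can be wrapped so that its exit code is printed right after it finishes, followed by a coordinator status check. This is a minimal sketch under assumptions: run_and_report is a hypothetical helper name, and the commented-out invocations simply reuse the commands from the transcripts above (they need the MANA install and an active Slurm allocation, so they are not executed here).

```shell
#!/bin/sh
# Hypothetical helper: run a command, print its exit status, and return it.
run_and_report() {
    status=0
    "$@" || status=$?
    echo "command: $* -> exit code $status"
    return "$status"
}

# Intended use with the commands from this report:
#   run_and_report mpirun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
#   run_and_report mana_status
#   run_and_report srun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
#   run_and_report mana_status
```

Running mana_status immediately after each launch shows whether the coordinator is already gone at that point or only disappears later.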
