Summary
Following the install instructions, I can start mana_coordinator, but launching a trivial MPI “hello” program under MANA fails on our HPC cluster. I’ve reproduced the failure with both OpenMPI 5.0.4 (mpirun) and MPICH 4.1.2 (srun). In both cases the MPI program runs fine without MANA, but under mana_launch the tasks exit with code 99 and the MANA coordinator subsequently becomes unreachable (see the mana_status output below).
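For reference, hello_mpi.c is essentially the following. This is a minimal sketch reconstructed from the program’s output shown in the transcripts below; the gather logic, buffer size, and sleep interval are assumptions on my part, not the exact source.

```c
/* hello_mpi.c -- minimal sketch of the reproducer (reconstructed, not exact).
 * Each rank contributes rank*(step+1) to an MPI_Gather on rank 0, which
 * prints the gathered values each step until the job is interrupted. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, world, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    MPI_Get_processor_name(host, &len);

    if (rank == 0)
        printf("[start] world=%d host[0]=%s\n", world, host);

    int gathered[64];                 /* assumes world <= 64 in this sketch */
    for (int step = 0; ; step++) {
        int value = rank * (step + 1);
        MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("[step %d] gathered:", step);
            for (int i = 0; i < world; i++)
                printf(" %d", gathered[i]);
            printf("\n");
            fflush(stdout);
        }
        sleep(1);                     /* pace the loop; run until interrupted */
    }

    MPI_Finalize();                   /* unreachable; the loop runs until killed */
    return 0;
}
```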
Environment
- Cluster scheduler: Slurm
- Allocation: 1 exclusive compute node
- OS: Rocky Linux 9.4
- MPI stacks tested: OpenMPI 5.0.4 (module openmpi/5.0.4, loaded as openmpi below) and MPICH 4.1.2 (module mpich/4.1.2)
- Launcher(s): mpirun (OpenMPI), srun (Slurm+MPICH)
- PMIx: Slurm’s PMIx plugin in use; error mentions pmix_v3
- Temporary directory for checkpoints: mpi_ckpt_images/ (user-writable on local filesystem)
What I did
Terminal A (coordinator):
rpc-95-9:$ mana_coordinator
*** Coordinator/job information written to /mnt/REPACSS/home/yongzhao/.mana-slurm-22306.rc
rpc-95-9:$ mana_status
Coordinator:
Host: rpc-95-9
Port: 7779
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE, BARRIER
After a failed run (see below), running mana_status again shows:
rpc-95-9:$ mana_status
Coordinator not found. Please check port and host.
*** Checking for coordinator:
+ /mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_command --status --coord-host rpc-95-9 --coord-port 7779
Coordinator not found. Please check port and host.
+ set +x
No coordinator detected. Try:
/mnt/REPACSS/home/yongzhao/software/mana/bin/mana_coordinator
Or:
/mnt/REPACSS/home/yongzhao/software/mana/bin/dmtcp_coordinator --exit-on-last -q --daemon
For help, do: /mnt/REPACSS/home/yongzhao/software/mana/bin/mana_command --help
Terminal B (OpenMPI test):
rpc-95-9:$ ml load openmpi
rpc-95-9:$ ls
hello_mpi hello_mpi.c mpi_ckpt_images omp_dmtcp_demo research_projects scripts software
rpc-95-9:$ mpirun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
^Crpc-95-9:$ mpirun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:
Process name: [prterun-rpc-95-9-40239@1,1]
Exit code: 99
--------------------------------------------------------------------------
After that failure, the coordinator becomes unreachable as shown above.
Terminal B (MPICH/Slurm test):
rpc-95-9:$ ml swap openmpi mpich/4.1.2
rpc-95-9:$ srun -n 4 ./hello_mpi
[start] world=4 host[0]=rpc-95-9
[step 0] gathered: 0 1 2 3
[step 1] gathered: 0 2 4 6
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 2] gathered: 0 3 6 9
[step 3] gathered: 0 4 8 12
[step 4] gathered: 0 5 10 15
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=22306.0 tasks 0-3: running
[step 5] gathered: 0 6 12 18
^Csrun: sending Ctrl-C to StepId=22306.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.0 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:05 ***
rpc-95-9:$ srun -n 4 mana_launch --tmpdir mpi_ckpt_images/ ./hello_mpi
srun: error: rpc-95-9: tasks 0,2-3: Exited with exit code 99
slurmstepd: error: mpi/pmix_v3: _errhandler: rpc-95-9 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.22306.1:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 22306.1 ON rpc-95-9 CANCELLED AT 2025-11-12T20:41:37 ***
srun: error: rpc-95-9: task 1: Exited with exit code 99