Hi,
I ran some tests with the latest stat-cl.
It works fine on runs with 45 nodes x 16 MPI,
but it fails with the error below on a run with 880 nodes x 8 MPI:
[atosla@ac1-1001 ~]$ stat-cl `pgrep -o srun` 2>~/stat-cl.err
STAT started at 2022-07-08-13:44:15
Attaching to job launcher (null):690121 and launching tool daemons...
<Jul 08 13:44:15> <Launchmon> (INFO): Just continued the RM process out of the first trap
<Jul 08 13:44:18> <LMON BE API> (ERROR): read_lmonp return a neg value
In stat-cl.err I see the following:
graph is not connected, found 0 potential roots
<Jul 08 13:44:18> <STAT_FrontEnd.C: 642> STAT returned error type STAT_MRNET_ERROR: MRNet reported a Network error with message: MRNet: Topology error: file has incorrect format
<Jul 08 13:44:18> <STAT.C: 181> STAT returned error type STAT_MRNET_ERROR: Failed to launch MRNet tree()
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
...