Skip to content

<LMON BE API> (ERROR): read_lmonp return a neg value #46

@antonl321

Description

@antonl321

Hi,

I did some tests with the latest stat-cl.

It works file on runs with 45 nodes x 16 MPI
but it fails with the error from bellow on a run with 880 nodes x 8 MPI

[atosla@ac1-1001 ~]$ stat-cl  `pgrep -o srun` 2>~/stat-cl.err
STAT started at 2022-07-08-13:44:15
Attaching to job launcher (null):690121 and launching tool daemons...
<Jul 08 13:44:15> <Launchmon> (INFO): Just continued the RM process out of the first trap
<Jul 08 13:44:18> <LMON BE API> (ERROR): read_lmonp return a neg value

In stat-cl.err I see the following

graph is not connected, found 0 potential roots
<Jul 08 13:44:18> <STAT_FrontEnd.C: 642> STAT returned error type STAT_MRNET_ERROR: MRNet reported a Network error with message: MRNet: Topology error: file has incorrect format
<Jul 08 13:44:18> <STAT.C: 181> STAT returned error type STAT_MRNET_ERROR: Failed to launch MRNet tree()
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions