Skip to content

Conversation

@amartyads
Copy link
Contributor

@amartyads amartyads commented Dec 31, 2025

Description

I've created the plugin I was talking about in #348, to show hardware info once a simulation has started. This helps double check whether the distribution of resources was correct, whether pinning was successful, whether a heterogeneous hardware run was launched properly etc. For higher versions of glibc, this also shows NUMA information.

Resolved Issues

How Has This Been Tested?

Tested on

  • Local machine, with higher version of glibc, serial and parallel, clang and gcc, intel
  • HSUper, with 16 nodes, various permutations of ranks and threads, lower glibc, icpx and gcc, intel

Documentation

(Only relevant if this PR introduces new features)

  • all-options.xml documents how to use the feature.
  • The responsible readXML() documents how to use the feature.

Discussion points

A few points I'd like everyone's input on, hence initially making this PR draft.

  1. I didn't add software information (ls1 version, compiler version, mpi version etc) because that would replicate data already present in the first few lines of ls1 output.
  2. I was going to add hardware information (cpu model and speed, max memory available) per node, but then ran into several roadblocks. I want to talk about them here.
    1. Library: The easiest solution would be to use a library like hwinfo, but I don't know how others feel about introducing an external dependency for a plugin.
    2. Header file: The header file cpuid.h only works for linux systems, and does not have RAM info; to get max available RAM, you need to do sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES) from unistd.h. I'm already using unistd (despite only existing on posix systems), so that's not a big deal, but it's not the most elegant solution. This does work on intel, ARM, AMD, with both gcc and clang-based compilers. Adding headers for windows systems would add a lot more clutter.
    3. System counters: On linux only, I can parse /proc/cpuinfo and /proc/meminfo to get the information, but cpuinfo doesn't give clean values on ARM, it gives a code that you need to then look up in the documentation to find the CPU model. So this works only on linux, non ARM (intel and AMD work fine).
    4. Command execution: I can just run lscpu with popen() on linux and then parse the output, but do we want to do that?
      From all these options, 2 seems the best to me, but I want to know what everyone else thinks. 2,3 and 4 are all linux specific.
  3. If any header is unavailable, I've added #ifdef checks. But I've made it so that unavailable values are -1, and are not written if unavailable. I didn't put all the variables and writing code in #ifdefs because I thought it would be too cluttered. Do I do it anyway, or do I stick to -1?

I would also really appreciate it if someone with a different hardware setup tested out this plugin to make sure it's working. I couldn't do it on ARM and AMD yet because the cmake version on the minicluster is too low; I can try and run it there but it would be more convenient if someone has a set up ready to go.

Copy link
Contributor

@rubenhorn rubenhorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far.
I've added a few comments about possible Windows core library features that could be used.
In general, I would be OK with having POSIX/Linux exclusive features, since that's what most HPC systems (all?) use, anyway.
Therefore, /proc/* could be used, as the Linux file system is already required for the EnergyRAPL plugin.

I would advise against parsing command execution output and consistency across linux systems (Intel, AMD, ARM) is mandatory.
You could add almost all functionality for Windows using windows.h and by reading the Registry, but I'm also not sure if it is worth the additional code clutter.
(Memory in bytes is provided by GlobalMemoryStatusEx.ullTotalPhys.)

@amartyads
Copy link
Contributor Author

Looks good so far. I've added a few comments about possible Windows core library features that could be used. In general, I would be OK with having POSIX/Linux exclusive features, since that's what most HPC systems (all?) use, anyway. Therefore, /proc/* could be used, as the Linux file system is already required for the EnergyRAPL plugin.

Okay, I guess I'll use the header implementation for now, test it, and if there are issues, move onto parsing /proc/*.

The windows stuff; you're right, we have no Windows HPC systems to test on, so if I add functionality for windows, we'll have untested code. I'll skip windows for now then, and in the future, if someone needs windows, they can easily implement the code that you've written.

Copy link
Contributor

@cniethammer cniethammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know if the SysMon plugin could handle this but I assume you had a look at it and it was not possible.

Here are some quick comments about the code from my side.

Comment on lines +11 to +13
#ifdef __GLIBC__
#include <sched.h> // sched_getcpu(), getcpu(int*, int*)
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#ifdef __GLIBC__
#include <sched.h> // sched_getcpu(), getcpu(int*, int*)
#endif
#define _GNU_SOURCE
#include <sched.h> // sched_getcpu(), getcpu(int*, int*)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not too well versed in this; is it not possible that we use a compiler without glibc? For example, LLVM is supposed to come out with its own library, and it might have different functionality. On other OSs it's possible to use a different library.

Also, the extensions provided by _GNU_SOURCE aren't used anywhere else, is it necessary to have it?

I genuinely don't know what the standard is, and I can't seem to find any comprehensive documentation.

Comment on lines 172 to 175
rankInfo << "\n\t\t\"" << _rank << "\": {\n";
rankInfo << "\t\t\t\"node_name\": \"" << _nodeName << "\",\n";
rankInfo << "\t\t\t\"total_threads\": \"" << _threadData[0].totalThreads << "\",\n";
rankInfo << "\t\t\t\"thread_data\": {\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON output formating should be done by tools, not manually.
Also number values in JSON should not be quoted, e.g., totalTreads

Suggested change
rankInfo << "\n\t\t\"" << _rank << "\": {\n";
rankInfo << "\t\t\t\"node_name\": \"" << _nodeName << "\",\n";
rankInfo << "\t\t\t\"total_threads\": \"" << _threadData[0].totalThreads << "\",\n";
rankInfo << "\t\t\t\"thread_data\": {\n";
rankInfo << "\"" << _rank << "\": {";
rankInfo << "\"node_name\": \"" << _nodeName << "\",";
rankInfo << "\"total_threads\": " << _threadData[0].totalThreads << ",";
rankInfo << "\"thread_data\": {";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it desirable to have an external tool just for a plugin?
I could write my own formatter inside the plugin, but that's code that's not strictly related to the plugin's purpose.
Or I could pivot back over to CSV or TSV.

@amartyads
Copy link
Contributor Author

amartyads commented Jan 7, 2026

I do not know if the SysMon plugin could handle this but I assume you had a look at it and it was not possible.

Here are some quick comments about the code from my side.

The sysmon plugin does have a lot of hardware data, but the focus of this plugin was more to print the MPI and openMP info, with the node names, NUMA domains, and pinning info, to double check that whatever pinning you've done has persisted, or for node failures, see if you used a failed node. I don't believe sysmon does that. Also, sysmon doesn't seem to print any CPU information. I think the only major overlap is that I plan to print the max available ram per node. I could remove that, remove the CPU info, and rename the plugin to PinningPrint.cpp or something, to make it clearer.

@amartyads
Copy link
Contributor Author

I've added json library support, everything works fine, and the time taken by the plugin with 16 nodes, 4 MPI, 18 openMP is in the order of 10^-5 seconds.

Points that are still open:

  1. Do I keep the library support, or change to CSV/TSV and write directly to file instead of using this library?
  2. Do I rename the plugin to PinningInfo?

@amartyads amartyads marked this pull request as ready for review January 13, 2026 14:11
@SamNewcome
Copy link
Contributor

I've added json library support, everything works fine, and the time taken by the plugin with 16 nodes, 4 MPI, 18 openMP is in the order of 10^-5 seconds.

Points that are still open:

1. Do I keep the library support, or change to CSV/TSV and write directly to file instead of using this library?

IMHO, I think a JSON library is fine but I also have no issues with CSV.

2. Do I rename the plugin to PinningInfo?

Yes. Or something like HardwareAndPinningInfo -> The main intent is really about the pinning it seems, but hardware info is also included.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants