Skip to content

Conversation

@aeblyve
Copy link
Collaborator

@aeblyve aeblyve commented Sep 1, 2023

Intro

This is a branch of the dev/gc00/improve-virtualids code intended for longer-term maintainability, review, and integration into main MANA. It is also intended to be the base for OpenMPI, ExaMPI, etc. development.

What's different from dev/gc00/improve-virtualids?

  1. I integrated a critical bugfix in REMOVE_OLD, which we had applied in our local {Exa,Open}MPI development branches on NEU Discovery into this branch. This bug did not manifest with MPICH, but it made it impossible to free anything with OpenMPI, ExaMPI. (OpenMPI still needs to run with TCP enabled to prevent rdma_frag assertion error)
  2. The commit count is much more reasonable. I sincerely apologize for the inconvenience of wrestling with literally hundreds of commits and will take care not to do this again -- I was taught by others in the past that commits should be used as though they are free and are in fact a reasonable way to save changes when in a development branch. This is not true if coherent history under rebasing is a goal.
  3. ggids are no longer granted unconditionally. This caused problems when we created an internal communicator and it was granted a ggid (PRESUSPEND bug). I had solved the problem with a hack get_vcomm_internal before, this is a better solution because the semantics match the old style. See grant_ggid and its usages.
  4. dev/gc00/improve-virtualids had slightly different ADD_NEW semantics than main, because the equivalent of realIdExists(real_id) would be hacky to write. I've written up how this realIdExists functionality might look like, but without a two-level table, it is very bad for big apps, because we cannot iterate through only the descriptors of a single type. Please look at this part, @xuyao0127. If we cannot develop and perfect a two-level table like this by camera-ready for SC23, we could revert to the old behavior. I don't think it caused problems, but it is different. Alternatively, as a workaround we could maintain several maps for the different types as before (but do away with string comparison and template class still).
  5. I reworked and reformatted some commentary. FIXME annotates points of concern.
  6. I removed some documentation by @rajatpratapbisht , because these should be added in a dedicated PR (I can add this later).

Otherwise, the branches are the same.

Please, read over the FIXME comments I have written in reviewing, they contain points of interest and important questions/concerns.

Contrib

Unfortunately, this is a pretty big PR. I tried to organize the commits so it is clear what commit is in which phase.

This PR does three main things:

1. Replace the old virtualid machinery

(map from ints to ints, written in a C++ template and using dmtcp's virtualidtable.h) to a mapping from int to struct pointer, with each struct pointer containing both the Real ID for a some Virtual ID, and any metadata information required. I tried to write this part in a C-style. The ugliest part has to be the macros, but I thought that this was the cleanest way to write the code which depends on the particular MPI types, because they are not defined the same way even in the same implementation. For instance, MPI_File is a pointer even in MPICH. It also integrates well with the existing macros. Like @xuyao0127 says, the definition and usage of these macros are kind of strange and obtuse, or at least poorly documented (VIRTUAL_TO_REAL(real) == real). However, changing their behavior would mean changing them literally everywhere, making this PR even bigger. It could be a later project.

The struct pointer architecture enables objectives 2 and 3 of the PR. It's also the part that enables portability between MPI implementations (with the exception of one FIXME comment in init_comm_world: @xuyao0127 , @JainTwinkle , I would appreciate your feedback/work here. I think it would be great if the MANA codebase was truly 100% the same between ExaMPI, OpenMPI, MPICH. Therefore, we need to integrate the lh_constants_map into every version.

If we only want to demonstrate the capability of checkpoint-restart with multiple MPI (without also improving performance in points 2, 3 and really flexing struct VIDs) this is all we have to test and enhance, just the first commit here.

2. Integrate the old ggid machinery into the system in point 1.

This means that we don't have to look up the same virtual id multiple times. An optimization.

3. Replace the log-replay architecture with a decode-recode architecture.

This works readily for communicators, groups, and operators, because they can all be constructed in one go, when given the proper arguments. It is more difficult for datatypes, because datatypes cannot be "fully deserialized" in the same way. While operators also cannot be deserialized, it is not a problem because there is only way to create a new operator, and operators are not recursively defined, so it is trivial to just hook into operator creation. Datatypes that are "doubly-derived", i.e., user datatypes built using other user datatypes, do not immediately break down into MPI primitives.

One way to solve this problem would be to maintain a dependency tree of datatypes: every wrapper that creates a datatype can record the virtual id, and mark that it is a dependency. Another way may be to implement our own full-deserialization, but this is probably technically difficult. Yet another is to preserve the log-replay system for datatypes exclusively. Either way, the point of this part is to enable faster runtime (we don't need to record MPI calls), faster restart (we don't have to replay MPI calls) at the cost of making checkpoint time slightly slower (we have to record the information for everything that's alive at checkpoint time).

Future

Where to go from here:

  • Resolve the FIXME (answer the questions and/or change the code)
  • Implement the two-level table
  • Integrate lh_constants_map
  • Rip out all of the LOG_CALL when we think decon-recon is good. (right now, restoreMpiLogState is disabled, but logging itself is not)
  • Whatever else the reviewers think is fit.

@aeblyve aeblyve marked this pull request as ready for review September 1, 2023 20:12
@aeblyve
Copy link
Collaborator Author

aeblyve commented Sep 1, 2023

Could we trigger more involved automated testing?

@gc00
Copy link
Collaborator

gc00 commented Sep 1, 2023

Yes, More involved unit testing would be nice.
Unfortunately, the academic prototype that we inherited already had plenty of design flaws We have to finish fixing the design flaws before moving on to better unit tests.

@aeblyve
Copy link
Collaborator Author

aeblyve commented Sep 1, 2023

Okay, then the other goal is to use this branch as a better base for exampi and openmpi support. @xuyao0127 @JainTwinkle should add lh_constants_map to this branch and make other changes as needed.

@aeblyve aeblyve force-pushed the virtualids-v2 branch 2 times, most recently from b84a6b3 to eeeb8a8 Compare September 19, 2023 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants