
Conversation

@tarunsmalviya (Collaborator)

In applications using MPI, an MPI thread manages asynchronous requests in the background. During checkpointing, MANA must suspend collective and point-to-point (P2P) communication so that it can drain in-flight messages from the network and create a checkpoint image.

Previously, we drained collective messages during the PRESUSPEND event, while all threads were still running. This is because draining collective messages requires all ranks to reach a consistent state, which can only be achieved through trial and error, by letting the user application make progress. P2P messages, however, were drained after suspending all threads, including the MPI thread. This was because each rank explicitly asks the other ranks for any pending messages still floating in the network, and MANA drains them into its internal buffer so that they can be handed back to the user application after checkpoint-restart. For this, we rely on MPI to provide information about pending requests or messages in the network.

While this approach seemed reasonable, it did not account for the case where the MPI thread is suspended partway through creating the metadata for a P2P request. In that scenario, the message is invisible to the other ranks: the MPI APIs that MANA uses to gather information about pending requests report that no messages are pending, even though one is, because its metadata has not yet been published to all ranks.
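For context, here is a minimal sketch of the event ordering being described, assuming DMTCP's standard plugin event hook; drainMpiCollectives() and drainMpiP2pMessages() are hypothetical stand-ins for MANA's actual drain routines:

#include "dmtcp.h"

// Hypothetical stand-ins for MANA's actual drain routines:
extern void drainMpiCollectives();
extern void drainMpiP2pMessages();

static void mpiPluginEventHook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  (void)data;
  switch (event) {
    case DMTCP_EVENT_PRESUSPEND:
      // All threads, including the MPI progress thread, are still running,
      // so ranks can keep making progress until they reach a consistent
      // state and any half-published P2P metadata becomes visible.
      drainMpiCollectives();
      drainMpiP2pMessages();  // moved here by this PR (was in PRECHECKPOINT)
      break;
    case DMTCP_EVENT_PRECHECKPOINT:
      // All threads are suspended here; draining P2P at this point can miss
      // a request whose metadata the suspended MPI thread never published.
      break;
    default:
      break;
  }
}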

@tarunsmalviya (Collaborator, Author)

I intentionally left some print statements in the code so that, if the process hangs, it will be easier to determine the cause. Since the code is preliminary, it is possible that draining P2P messages during the PRESUSPEND event could cause the process to hang. Once we have built confidence in the code over time, we can remove these print statements.

@tarunsmalviya tarunsmalviya force-pushed the bug/correctness/drain_p2p_msg_in_presuspend_event branch from 95e4f2a to 0798db5 Compare April 29, 2023 16:57
@tarunsmalviya (Collaborator, Author)

I have tested this branch with multiple checkpoint-resume and checkpoint-restart cycles, using the configuration Tom uses in his automated testing, and the job finished without any issues.

@tarunsmalviya tarunsmalviya changed the title BugFix - Drain P2P messages in PRESUSPEND event instead of PRECHECKPOINT event. Correctness - BugFix - Drain P2P messages in PRESUSPEND event instead of PRECHECKPOINT event. Apr 29, 2023
@gc00 (Collaborator) left a comment

@tarunsmalviya,
I have requested some changes.
In addition, I find the names internal_p2p_communication and global_p2p_communication confusing. Now that we're going from debugging to production code, we should choose names with a more obvious meaning.
If I understand correctly, global_p2p_communication is initialized to 1, which says that we are not yet in PRESUSPEND. When we reach PRESUSPEND, we set internal_p2p_communication to 1.
But I still don't understand whether the GNI thread is being treated specially, or whether we're simply waiting longer in the PRESUSPEND phase to allow the GNI thread to finish transferring the network messages.
Could you review your current comments and see if you can revise them to make them clearer?
Thanks!

*/

/*
The P2P communication in an application is controlled globally by <global_p2p_communication> variable. When we say "control", we mean that it suspends P2P communication and prevents the user application from being in the lower half. This allows for P2P message draining and checkpointing during the PRESUSPEND checkpoint event. If the P2P message draining is performed after PRESUSPEND and before PRECHECKPOINT, the variable mentioned above is not needed.
Collaborator:

Reformat to 80 char line as above.

by <global_p2p_communication> variable -> by the <global_p2p_communication> variable

the variable mentioned above -> the <global_p2p_communication> variable

The P2P communication in an application is controlled globally by <global_p2p_communication> variable. When we say "control", we mean that it suspends P2P communication and prevents the user application from being in the lower half. This allows for P2P message draining and checkpointing during the PRESUSPEND checkpoint event. If the P2P message draining is performed after PRESUSPEND and before PRECHECKPOINT, the variable mentioned above is not needed.
*/
static int global_p2p_communication = 1;
static int internal_p2p_communication = 0;
Collaborator:

It's almost always wrong to allocate storage for a variable within a .h file. We should declare the variable in the .h file (e.g., extern int global_p2p_communication;) and then allocate storage for it in a .cpp file (e.g., int global_p2p_communication = 1; at global scope, near the top of one .cpp file).

If you don't follow this convention, you can create lots of bugs. Right now, if mpi_plugin.h is included in two distinct .cpp files, then, since you use static, you have created two independent copies of global_p2p_communication.

Also, the convention within MANA is to use the prefix g_ in the name of any global variable. For example, see g_numMaps.

Even worse, you seem to use it in both mpi_plugin.cpp and p2p_drain_send_recv.cpp. So you seem to have fallen into this trap and created two copies of internal_p2p_communication.
Could you verify the logic in this case?
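A minimal sketch of the declaration/definition split being requested; the g_-prefixed name is illustrative, following the MANA convention mentioned above:

/* mpi_plugin.h: declaration only; no storage is allocated here. */
extern int g_p2p_communication_allowed;

/* mpi_plugin.cpp: the single definition, which allocates the storage. */
int g_p2p_communication_allowed = 1;

With this split, every .cpp file that includes the header refers to the same variable, instead of each translation unit silently getting its own static copy.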

(global_p2p_communication == 0 && internal_p2p_communication == 1))
return;

time_t my_time = time(NULL);
Collaborator:

If I understand this code, a call to MPI_Send will always call this routine, which will always print a message. Computing a time and then printing is very expensive compared to MPI_Send itself. So we should compute a time (and do any printing) only in the uncommon case, not in the common case.

Do I understand the code correctly? If this is not a problem, could you add comments somewhere making it clear that this happens only in the uncommon case?

Maybe my confusion comes from not understanding the variable global_p2p_communication. I think it's a global variable initialized to 1, which would mean we do not compute a time in the common case. But then, can you add some comments there about global_p2p_communication and internal_p2p_communication?

Finally, should we be printing only when we want to debug (#ifdef DEBUG)? I'm not sure all end users of the production code need to see this information.
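One possible shape for this, sketched using the variable names from the PR; the extern declarations follow the header convention requested above, and the log text is illustrative:

#include <cstdio>
#include <ctime>

extern int global_p2p_communication;    // assumed extern per the review above
extern int internal_p2p_communication;

void global_p2p_communication_barrier()
{
  // Common case: P2P is allowed; return without computing a time or printing.
  if (global_p2p_communication == 1 ||
      (global_p2p_communication == 0 && internal_p2p_communication == 1))
    return;

#ifdef DEBUG
  // Uncommon case only: we are blocked while messages are being drained.
  time_t my_time = time(NULL);
  fprintf(stderr, "Rank blocked in P2P barrier at: %s", ctime(&my_time));
#endif
  // ... wait here until the checkpoint thread re-enables P2P ...
}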

@dahongli (Collaborator) left a comment

Nice fix. Thanks Tarun!

void
global_p2p_communication_barrier()
{
if (global_p2p_communication == 1 ||
Collaborator:

Is there a race condition between the checkpoint thread and the user application threads? Should we use a lock to protect access to global_p2p_communication?
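One possible way to address this, sketched under the assumption that C++11 atomics are acceptable in MANA; a mutex would also work, especially if the two flags must ever be updated together:

#include <atomic>
#include <sched.h>

static std::atomic<int> global_p2p_communication{1};
static std::atomic<int> internal_p2p_communication{0};

void global_p2p_communication_barrier()
{
  for (;;) {
    // Acquire loads pair with the checkpoint thread's release stores, so
    // application threads observe the flag updates without a data race.
    int global_flag = global_p2p_communication.load(std::memory_order_acquire);
    int internal_flag =
      internal_p2p_communication.load(std::memory_order_acquire);
    if (global_flag == 1 || (global_flag == 0 && internal_flag == 1))
      return;
    sched_yield();  // yield instead of busy-spinning while draining proceeds
  }
}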

@tarunsmalviya tarunsmalviya force-pushed the bug/correctness/drain_p2p_msg_in_presuspend_event branch 2 times, most recently from 0798db5 to 15762dc Compare July 12, 2023 22:23
@tarunsmalviya tarunsmalviya reopened this Jul 12, 2023
@gc00 (Collaborator) commented Aug 28, 2023

@tarunsmalviya, just a friendly reminder that we're still waiting for the final version of this PR. Thanks.
