
Conversation

@jti-lanl
Collaborator

Refactored the task lists in PhoebusDriver::PostInitializationCommunications() to avoid a segfault with our 3D-blast-wave input.

PR Summary

There was a comment in the original:

for (int ib = 0; ib < blocks.size(); ib++) {
   ...
   auto &md = pmesh->mesh_data.GetOrAdd(stage_name[0], ib); // ... gives an empty md

At this point in initialization, all the blocks have been allocated in a single stage. Thus, the second iteration of the loop finds nothing, and the resulting empty md produces a SEGV with the 3D-blast-wave input deck.
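
To make the failure mode concrete, here is an annotated sketch of that loop. The use of Mesh::DefaultNumPartitions() and the partition count are assumptions about the Parthenon API, not part of the original comment:

   // Illustration only: at this point the single "base" stage spans all blocks,
   // so only partition 0 of pmesh->mesh_data is populated.
   for (int ib = 0; ib < blocks.size(); ib++) {
     auto &md = pmesh->mesh_data.GetOrAdd(stage_name[0], ib);
     // md is populated only for ib < pmesh->DefaultNumPartitions() (typically 1 here);
     // for larger ib it comes back empty, and the tasks built on it later SEGV.
   }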

The code appears to intend to create a series of TaskLists in a single Region, each TaskList filled with tasks pertaining to a single block: enabling receives, then performing exchanges, and so on. However, the exchanges are actually performed on an entire MeshData object, apparently covering all the blocks, which can be handled by a single task.

If that analysis is correct, then the present approach resolves the problem by reformulating the initialization into just three TaskLists: the first passes through all the blocks and enables receives, the second performs the MeshData-wide exchange tasks, and the third does per-block boundary maintenance.
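
A minimal sketch of that reformulation, assuming Parthenon's TaskCollection/TaskRegion/TaskList API; MakePostInitTasks, EnableRecvs, ExchangeBoundaries, and FixLocalBoundaries are placeholder names, not the actual Phoebus tasks (includes and task definitions elided):

   // Sketch only: one region containing three task lists.  The placeholder
   // tasks are assumed to have the usual TaskStatus(...) signatures.
   parthenon::TaskStatus EnableRecvs(parthenon::MeshBlock *pmb);
   parthenon::TaskStatus ExchangeBoundaries(parthenon::MeshData<parthenon::Real> *md);
   parthenon::TaskStatus FixLocalBoundaries(parthenon::MeshBlock *pmb);

   parthenon::TaskCollection MakePostInitTasks(parthenon::Mesh *pmesh,
                                               const std::vector<std::string> &stage_name) {
     using namespace parthenon;
     TaskCollection tc;
     TaskID none(0);
     TaskRegion &region = tc.AddRegion(3);

     // List 0: enable receives on every block.
     for (auto &pmb : pmesh->block_list) {
       region[0].AddTask(none, EnableRecvs, pmb.get());
     }
     // List 1: MeshData-wide exchanges; a single stage covers all blocks here,
     // so iterate over partitions rather than blocks.
     for (int ip = 0; ip < pmesh->DefaultNumPartitions(); ip++) {
       auto &md = pmesh->mesh_data.GetOrAdd(stage_name[0], ip);
       region[1].AddTask(none, ExchangeBoundaries, md.get());
     }
     // List 2: per-block boundary maintenance.
     for (auto &pmb : pmesh->block_list) {
       region[2].AddTask(none, FixLocalBoundaries, pmb.get());
     }
     return tc;
   }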

A question to consider is how the existing code was able to function. Perhaps other input decks don't invoke the second iteration? Or do they have initializations that result in per-block stages? If the latter, then the current changes should continue to work for those cases.

Tested with OMP and MPI, but only on A64FX, and only with our 3D_blast_wave input.

Needs a pass through whatever unit tests are available.

PR Checklist

  • [ ] Adds a test for any bugs fixed. Adds tests for new features.
  • [x] Format your changes by calling scripts/bash/format.sh.
  • [x] Explain what you did.

jti-lanl added 2 commits June 14, 2022 13:03
…ns().

   The driver was apparently expecting pmesh->mesh_data.GetOrAdd() would
   find a distinct stage for each block, but there is only one stage at
   this point, having the width of Mesh::DefaultPackSize, which matches the
   number of blocks.  Therefore, we split the inits into three TaskRegions,
   first for the blocks to begin receiving, a second to do MD-wide
   boundary-buffer exchanges, and a third for the blocks to manage local
   boundaries.
@jti-lanl
Collaborator Author

On second thought, the comment about how this "should continue to work" for cases that have allocated per-block stages looks wrong. I don't know if there are such things.

@Yurlungur
Collaborator

This strikes me as a little odd... The three task lists are all executed concurrently; you can't expect them to be ordered. Are you relying on ordering here?

@brryan you wrote this code in the first place, maybe you can take a look?

Yurlungur requested a review from brryan June 14, 2022 22:20
@jti-lanl
Collaborator Author

Yes, I wondered about that, but saw that task lists allow incomplete tasks to be retried, so I thought that would resolve any potential deadlock.

However, you've helped me see that the current implementation doesn't prevent exchanges from starting early. I could fix that with three distinct parallel regions executed in series, or perhaps with the regional-dependency machinery.

@Yurlungur
Collaborator

I think 3 distinct parallel regions is the way to go.
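
For reference, a sketch of the same tasks split across three regions, reusing the placeholder task names from the sketch above. This relies on a TaskCollection completing each region before starting the next, while task lists within one region may still run concurrently:

   // Sketch only: three regions processed in series, so receives finish before
   // any exchange begins, and exchanges finish before boundary fix-ups.
   auto &blocks = pmesh->block_list;
   const int num_partitions = pmesh->DefaultNumPartitions();
   parthenon::TaskCollection tc;
   parthenon::TaskID none(0);

   parthenon::TaskRegion &recv_region = tc.AddRegion(static_cast<int>(blocks.size()));
   for (int ib = 0; ib < static_cast<int>(blocks.size()); ib++) {
     recv_region[ib].AddTask(none, EnableRecvs, blocks[ib].get());
   }

   parthenon::TaskRegion &exch_region = tc.AddRegion(num_partitions);
   for (int ip = 0; ip < num_partitions; ip++) {
     auto &md = pmesh->mesh_data.GetOrAdd(stage_name[0], ip);
     exch_region[ip].AddTask(none, ExchangeBoundaries, md.get());
   }

   parthenon::TaskRegion &fix_region = tc.AddRegion(static_cast<int>(blocks.size()));
   for (int ib = 0; ib < static_cast<int>(blocks.size()); ib++) {
     fix_region[ib].AddTask(none, FixLocalBoundaries, blocks[ib].get());
   }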

@jti-lanl
Collaborator Author

Good, I see that. I'll have a shot at it, unless @brryan wants to take over.

@brryan
Collaborator

brryan commented Jun 15, 2022

@jti-lanl Yes, this is a substantial problem in this code I wrote; thank you for identifying it! I agree with you and @Yurlungur that three distinct TaskRegions, based on how you've already split the tasks between per-meshblock and per-meshdata work, is the way to go. So that we're not stepping on each other's toes, it's probably easiest if you just add that to your branch, but if you want to pass it to me, feel free.

Unfortunately, the regression tests probably don't do a great job of testing this logic, and I'm not immediately sure of the best way to make a unit test out of it. I'll open an issue to think about that, since this fix should probably go in ASAP given that it resolves such a nasty bug.

@jti-lanl
Collaborator Author

I'm puzzled why the new approach (three distinct regions) is deadlocking on nv-devkit nodes but not on Akebono, though at first glance they seem to be using the same versions of OpenMPI. I will continue to look, but it's fine with me if someone else sees and fixes the problem.

NOTE: be aware of parthenon-hpc-lab/parthenon#645. It's easy to avoid by allocating successive regions only when they are used.

@brryan
Collaborator

brryan commented Jun 15, 2022

@jti-lanl if you push your latest commit I can try to reproduce the locking behavior and take a look. Also, is your 3D blast wave input deck just the 2d blast wave deck from the repository but with parthenon/mesh/nx3 = 128 and parthenon/meshblock/nx3=32?

   TBD: Why is there a deadlock, for the 3D blast-wave input?
   (Not in the patched method, and only on certain hardware.)
@jti-lanl
Collaborator Author

jti-lanl commented Jun 15, 2022

I've pushed the patch after exploring a little. There's no deadlock on Akebono; it happens only on the devkits (both machines are A64FX), regardless of which platform the code was configured and built on.

The deadlock doesn't happen in the patched method, but in the simulation proper, just after the cycle 0 output header. It happens reliably and immediately on 4 devkit nodes.

CXXFLAGS="-mcpu=a64fx -msve-vector-bits=scalable -O3 -ggdb" cmake -DPHOEBUS_ENABLE_MPI=ON -DPHOEBUS_ENABLE_OPENMP=ON -DPHOEBUS_ENABLE_HDF5=OFF   -DKokkos_ARCH_A64FX=ON -DKokkos_ENABLE_PROFILING_LOAD_PRINT=ON ..

Our 3d_blast_wave input deck differs from the default 2D deck as follows:

<parthenon/mesh>       128x128x128 -> 32x32x32
<parthenon/meshblock>  32x32x1 -> 8x8x8
<recon>                mp5 -> weno5

(... plus tlim and file_type)

@brryan
Collaborator

brryan commented Jun 16, 2022

@jti-lanl Thanks, yes I can reproduce this error on 3 or 4 GPUs on a Darwin power9 node. I'm not immediately seeing what could be causing this hang (the code is getting stuck trying to receive MPI messages that never arrive) but I will keep looking.

In the meantime, a workaround would be to just comment out driver.PostInitializationCommunication(); in phoebus/src/main.cpp. This routine is used to ensure consistency with the imposed boundary conditions after generating initial data, but that isn't really relevant to the blast wave problem: we are still writing the initial data to the active + ghost cells of each meshblock in the problem generator, so we don't need to call communication before the first cycle. So this problem should work just fine without PostInitializationCommunication.
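
Roughly, that workaround would look like this in phoebus/src/main.cpp (an illustrative excerpt; the surrounding lines and the driver construction arguments are assumptions, not the actual file contents):

   // Illustrative excerpt, not the actual main.cpp:
   PhoebusDriver driver(pin, app_in, pmesh);      // construction arguments assumed
   // driver.PostInitializationCommunication();   // temporarily disabled to work around the hang
   auto driver_status = driver.Execute();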

@Yurlungur
Collaborator

@brryan what's the status of this? Do we have a fix?
