reworked task-lists in PostInitializationCommunications() #118
base: main
Conversation
…ns(). The driver was apparently expecting pmesh->mesh_data.GetOrAdd() to find a distinct stage for each block, but there is only one stage at this point, with a width of Mesh::DefaultPackSize, which matches the number of blocks. Therefore, we split the inits into three TaskRegions: the first for the blocks to begin receiving, the second to do MD-wide boundary-buffer exchanges, and the third for the blocks to manage local boundaries.
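For illustration, a minimal sketch of the mismatch described above (not the literal driver code; the loop shape and the "base" stage label follow the usual Parthenon conventions and are assumptions here):

```cpp
// With a single "base" stage whose pack width is Mesh::DefaultPackSize
// (equal to the number of blocks here), a per-block loop does not find a
// distinct stage per block:
for (int i = 0; i < nblocks; ++i) {
  // i == 0 returns the one stage, which already spans every block;
  // i >= 1 returns a freshly created, empty MeshData.
  auto &md = pmesh->mesh_data.GetOrAdd("base", i);
  // ... per-block boundary tasks were then built against this md ...
}
```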
On second thought, the comment about how this "should continue to work" for cases that have allocated per-block stages looks wrong. I don't know whether such cases exist.
This strikes me as a little odd... The 3 task lists are all executed concurrently; you can't expect them to be ordered. Are you relying on an ordering here? @brryan, you wrote this code in the first place; maybe you can take a look?
Yes, I wondered about that, but I saw that task-lists allow incomplete tasks to be retried, so I thought that would resolve any potential deadlock. However, you've helped me see that the current implementation doesn't prevent exchanges from starting early. I could fix that with three distinct parallel regions executed in series, or perhaps with the regional-dependency machinery.
I think 3 distinct parallel regions is the way to go.
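For context, a minimal sketch of the two shapes under discussion, assuming Parthenon's usual semantics (task lists within one TaskRegion may execute concurrently, while successive TaskRegions in a TaskCollection execute in order); the regions are left unpopulated here:

```cpp
// Problematic shape: one region holding a task list per block. The lists
// are unordered relative to each other, so one block's exchange can begin
// before another block has posted its receives.
parthenon::TaskCollection flat;
parthenon::TaskRegion &all_in_one = flat.AddRegion(nblocks);

// Proposed shape: three regions in sequence. Each region completes before
// the next begins, enforcing receive -> exchange -> local-fixup ordering.
parthenon::TaskCollection staged;
parthenon::TaskRegion &post_recvs = staged.AddRegion(nblocks);
parthenon::TaskRegion &md_exchange = staged.AddRegion(1);
parthenon::TaskRegion &local_fixup = staged.AddRegion(nblocks);
```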
Good, I see that. I'll have a shot at it, unless @brryan wants to take over.
@jti-lanl Yes, this is a substantial problem in this code I wrote; thank you for identifying it! I agree with you and @Yurlungur that three distinct regions executed in series is the way to go. The regression tests probably don't do a great job of testing this logic, unfortunately, and I'm not immediately sure of the best way to make a unit test out of it. I'll open an issue for me to think about that, since this fix should probably go in ASAP, given that it resolves such a nasty bug.
I'm puzzled why the new approach (3 distinct regions) is deadlocking on nv-devkit nodes but not on Akebono, though at first glance they seem to be using the same versions of OpenMPI. I will continue to look; it's fine with me if someone else sees and fixes the problem. NOTE: be aware of parthenon-hpc-lab/parthenon#645. It is easy to avoid by allocating successive regions only when they are used.
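To make that avoidance concrete, a hedged sketch of the "allocate a region only when it is used" pattern; RunSingleRegion is a hypothetical helper, not a Parthenon API, and this is only my reading of the workaround for parthenon-hpc-lab/parthenon#645:

```cpp
// Hypothetical helper: build and execute a single region, so that the next
// region is only allocated after this one has run to completion.
template <typename Filler>
parthenon::TaskListStatus RunSingleRegion(int num_lists, Filler fill) {
  parthenon::TaskCollection tc;
  parthenon::TaskRegion &region = tc.AddRegion(num_lists);
  for (int i = 0; i < num_lists; ++i) {
    fill(region[i], i);  // populate task list i
  }
  return tc.Execute();  // finishes before the caller builds the next region
}
```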
@jti-lanl If you push your latest commit, I can try to reproduce the locking behavior and take a look. Also, is your 3D blast wave input deck just the 2D blast wave deck from the repository but with …
TBD: Why is there a deadlock for the 3D blast-wave input? (Not in the patched method, and only on certain hardware.)
I've pushed the patch after exploring a little. There is no deadlock on Akebono, only on the devkits (both A64FX), whether configured and built on either platform. The deadlock doesn't happen in the patched method, but in the simulation proper, just after the cycle-0 output header. It happens reliably and immediately on 4 devkit nodes. Configured with:
`CXXFLAGS="-mcpu=a64fx -msve-vector-bits=scalable -O3 -ggdb" cmake -DPHOEBUS_ENABLE_MPI=ON -DPHOEBUS_ENABLE_OPENMP=ON -DPHOEBUS_ENABLE_HDF5=OFF -DKokkos_ARCH_A64FX=ON -DKokkos_ENABLE_PROFILING_LOAD_PRINT=ON ..`
Our 3d_blast_wave differs from the default 2D as follows (… plus
@jti-lanl Thanks, yes, I can reproduce this error on 3 or 4 GPUs on a Darwin Power9 node. I'm not immediately seeing what could be causing this hang (the code is getting stuck trying to receive MPI messages that never arrive), but I will keep looking. In the meantime, a workaround would be to just comment out …
@brryan What's the status of this? Do we have a fix?
Refactored task-lists in PhoebusDriver::PostInitializationCommunications() to avoid a segfault with our 3D-blast-wave input.
PR Summary
There was a comment in the original code noting that, at this point in initialization, all the blocks have been allocated in a single stage. Thus, the second iteration finds nothing, and this empty md produces a SEGV on the 3D-blast-wave input deck.
The code appears to intend the creation of a series of TaskLists in a single TaskRegion, each of which is filled with a series of tasks pertaining to a single block, starting with enabling receives, then performing exchanges, etc. However, the exchanges are actually performed on an entire MeshData object, covering all the blocks (?), which can be handled by a single task.
If that analysis is correct, then the present approach resolves the problem by reformulating the inits into just three TaskLists: the first is a pass through all the blocks, enabling recvs; the second does the MD-wide exchange tasks; and the third performs per-block boundary maintenance.
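A minimal sketch of that three-part structure, assuming Parthenon's TaskCollection/TaskRegion/TaskList API; the task functions with a trailing underscore (StartReceiving_, ExchangeBoundaries_, FixupLocalBoundaries_) are hypothetical stand-ins for whatever Phoebus actually calls here, not the real functions:

```cpp
using parthenon::TaskCollection;
using parthenon::TaskID;
using parthenon::TaskListStatus;
using parthenon::TaskRegion;

TaskListStatus PostInitCommsSketch(parthenon::Mesh *pmesh) {
  const TaskID none(0);
  auto &blocks = pmesh->block_list;
  const int nblocks = static_cast<int>(blocks.size());

  TaskCollection tc;

  // Pass 1: every block enables its receives.
  TaskRegion &recv_region = tc.AddRegion(nblocks);
  for (int i = 0; i < nblocks; ++i) {
    recv_region[i].AddTask(none, StartReceiving_, blocks[i].get());  // hypothetical
  }

  // Pass 2: one task list performs the MD-wide exchange, since the single
  // "base" stage already spans all the blocks.
  TaskRegion &exchange_region = tc.AddRegion(1);
  {
    auto &md = pmesh->mesh_data.GetOrAdd("base", 0);
    exchange_region[0].AddTask(none, ExchangeBoundaries_, md);  // hypothetical
  }

  // Pass 3: per-block local boundary maintenance.
  TaskRegion &local_region = tc.AddRegion(nblocks);
  for (int i = 0; i < nblocks; ++i) {
    local_region[i].AddTask(none, FixupLocalBoundaries_, blocks[i].get());  // hypothetical
  }

  // Regions run in order; only the lists inside a region may interleave.
  return tc.Execute();
}
```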
A question to consider is how the existing code was able to function at all. Perhaps other input decks don't invoke the second iteration? Or do their initializations result in per-block stages? If the latter, then the current changes should continue to work for those cases.
Tested with OpenMP and MPI, but only on A64FX, and only with our 3D_blast_wave input.
Needs a pass through whatever unit tests are available.
PR Checklist
Code formatted with scripts/bash/format.sh.