Improve efficiency in memory trace and device->host mem-copy #5
gauravharsha wants to merge 5 commits into main
Conversation
…m device to host; also reduce memory trace by removing intermediates and writing directly to the shared-memory self-energy.
egull left a comment
OK. On Monday we need to go over shared-memory sync and effective locking/unlocking.
For now I propose to split this into three issues: the FLOP count (harmless), the trap/abort (also harmless), and the shared mem.
The first two we can get done right away. The third needs work.
    qpt.compute_Pq();
    qpt.transform_wt();
    // Write to Sigma(k), k belongs to _ink
    MPI_Win_lock_all(MPI_MODE_NOCHECK, sigma_tau_host_shared.win());
This we need to discuss. It's fairly catastrophic in a multi-GPU environment: only ONE MPI process will enter the section below at a time... I believe there's no reason for that.
@egull that's not how MPI_Win_lock_all(MPI_MODE_NOCHECK, win) works. This call only asserts the start of the communication epoch; no synchronization is done here. To do the memory synchronization one has to call MPI_Win_sync, which is done below, after the loop.
However, there are much more dangerous things going on here. Since all processes that have a GPU enter the loop, there would be a guaranteed race condition in this loop, as we run over all k-points and do a summation over all q-points.
Using a synchronization pattern like the one implemented here is safe (and is actually advised, to reduce synchronization) only if we know that there is no overlap between the memory regions accessed by different processes.
You're a bit late to that party. We'll discuss today how to do that in a way that is both performant (which your solution is not) and correct (which his solution is not). The way I think it works is via
MPI_Win_lock_all
MPI_Win_sync
...then do the update/access
MPI_Win_sync
MPI_Win_flush_all
MPI_Win_unlock_all
the MPI_Win_sync at most synchronizes a private with a public version, but may not synchronize the private versions of other threads. Section 11 of the standard has more.
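As a sketch (untested; `win` stands for sigma_tau_host_shared.win(), and the update body is elided), the proposed epoch would look like:

```cpp
#include <mpi.h>

// Hypothetical helper wrapping the epoch proposed above.
void update_shared_sigma(MPI_Win win) {
  MPI_Win_lock_all(MPI_MODE_NOCHECK, win); // open passive-target access epoch
  MPI_Win_sync(win);                       // make others' updates visible locally
  // ...update/access the shared self-energy here...
  MPI_Win_sync(win);                       // publish the private copy
  MPI_Win_flush_all(win);                  // complete any outstanding RMA operations
  MPI_Win_unlock_all(win);                 // close the epoch
}
```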
In what way is my implementation not performant?
Also, MPI_Win_flush is not needed here: we don't do any RMA operations on the shared window.
    qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
    copy_Sigma(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts, _ns);
    qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
    copy_Sigma(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts, _ns);
Why can't the lock be around just this?
    qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
    copy_Sigma_2c(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts);
    qkpt->compute_second_tau_contraction_2C(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
    copy_Sigma_2c(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts);
    } else {
    qkpt->set_up_qkpt_second(nullptr, V_Qim.data(), k_reduced_id, k1_reduced_id, need_minus_k1);
    qkpt->compute_second_tau_contraction(nullptr, qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
    qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
Even here, you probably don't want to lock out everybody; do the lock/unlock just around the memcpy.
    }
    MPI_Win_sync(sigma_tau_host_shared.win());
    MPI_Barrier(utils::context.node_comm);
    MPI_Win_unlock_all(sigma_tau_host_shared.win());
This seems silly here. Why? We always have lock/unlock in pairs.
src/cugw_qpt.cu
Outdated
    template <typename prec>
    void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
    if (cleanup_req_) {
    std::memcpy(Sigma_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex));
We were talking about combining buffers here. Possibly that was done already.
    // status of data transfer / copy from Device to Host.
    // false: not required, stream ready for next calculation
    // true: required
    bool cleanup_req_;
Propose making a destructor that throws an exception if cleanup_req_ is true, and a constructor that sets it to false.
The C++ standard advises against throwing exceptions in a destructor.
That's right. I guess we need to print an error and abort the program. This should never happen unless there's a logic error anyway.
    template <typename prec>
    void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
    if (cleanup_req_) {
I would put the shared window lock just here.
    void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
    if (cleanup_req_) {
    std::memcpy(Sigma_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex));
    cleanup_req_ = false;
I would put the shared window unlock just here.
    template <typename prec>
    void wait_and_clean_qkpts(std::vector<gw_qkpt<prec>*>& qkpts, bool low_memory_mode,
                              typename cu_type_map<std::complex<prec>>::cxx_type* Sigmak_stij_host) {
    static int pos = 0;
The next three lines do not make sense to me.
Closing. Superseded by #9 |
Proposing the following changes:
1. cudaMemcpy blocks the process and further execution of kernels. Replacing this with cudaMemcpyAsync should, in principle, speed up the code.
2. cudaMemcpy to a buffer in host memory, followed by writing the host buffer to the shared-memory self-energy, is introduced.
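A hedged sketch of change 1 (untested; the helper name is hypothetical, and cxx_complex is assumed to be the repo's complex type): the copy is enqueued on the qkpt's stream and completion is tracked with an event, so kernels on other streams keep running while the transfer is in flight.

```cpp
#include <complex>
#include <cuda_runtime.h>

using cxx_complex = std::complex<double>;  // assumed; matches the repo's alias

// Enqueue a device->host copy without blocking the host thread.
// host_buf should be pinned (cudaMallocHost) for the copy to be
// truly asynchronous.
void copy_sigma_async(cxx_complex* host_buf, const cxx_complex* dev_sigma,
                      size_t bytes, cudaStream_t stream, cudaEvent_t done) {
  cudaMemcpyAsync(host_buf, dev_sigma, bytes, cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(done, stream);  // mark the completion point on the stream
}

// Later, just before writing host_buf into the shared-memory self-energy:
//   cudaEventSynchronize(done);  // wait only when the data is actually needed
```

Deferring the cudaEventSynchronize to the point where the host buffer is consumed is what realizes change 2: the blocking wait overlaps with the remaining GPU work instead of stalling it.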