Improve efficiency in memory trace and device->host mem-copy #5
Changes from all commits
fd6c529
2b4b916
6dd12c4
3a29fe5
e8cb1cc
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -209,7 +209,7 @@ namespace green::gpu { | |
| template <typename prec> | ||
| void cugw_utils<prec>::solve(int _nts, int _ns, int _nk, int _ink, int _nao, const std::vector<size_t>& reduced_to_full, | ||
| const std::vector<size_t>& full_to_reduced, std::complex<double>* Vk1k2_Qij, | ||
| ztensor<5>& Sigma_tskij_host, int _devices_rank, int _devices_size, bool low_device_memory, | ||
| St_type& sigma_tau_host_shared, int _devices_rank, int _devices_size, bool low_device_memory, | ||
| int verbose, irre_pos_callback& irre_pos, mom_cons_callback& momentum_conservation, | ||
| gw_reader1_callback<prec>& r1, gw_reader2_callback<prec>& r2) { | ||
| // this is the main GW loop | ||
|
|
@@ -252,6 +252,7 @@ namespace green::gpu { | |
| qpt.compute_Pq(); | ||
| qpt.transform_wt(); | ||
| // Write to Sigma(k), k belongs to _ink | ||
| MPI_Win_lock_all(MPI_MODE_NOCHECK, sigma_tau_host_shared.win()); | ||
| for (size_t k_reduced_id = 0; k_reduced_id < _ink; ++k_reduced_id) { | ||
| size_t k = reduced_to_full[k_reduced_id]; | ||
| for (size_t q_or_qinv = 0; q_or_qinv < _nk; ++q_or_qinv) { | ||
|
|
@@ -264,33 +265,42 @@ namespace green::gpu { | |
| bool need_minus_k1 = reduced_to_full[k1_reduced_id] != k1; | ||
| bool need_minus_q = reduced_to_full[q_reduced_id] != q_or_qinv; | ||
|
|
||
| // read and prepare G(k-q), V(k, k-q) and V(k-q, k) | ||
| r2(k, k1, k1_reduced_id, k_vector, V_Qim, Vk1k2_Qij, Gk1_stij, need_minus_k1); | ||
|
|
||
| gw_qkpt<prec>* qkpt = obtain_idle_qkpt(qkpts); | ||
| gw_qkpt<prec>* qkpt = obtain_idle_qkpt_for_sigma(qkpts, _low_device_memory, Sigmak_stij.data()); | ||
| if (_low_device_memory) { | ||
| if (!_X2C) { | ||
| qkpt->set_up_qkpt_second(Gk1_stij.data(), V_Qim.data(), k_reduced_id, k1_reduced_id, need_minus_k1); | ||
| qkpt->compute_second_tau_contraction(Sigmak_stij.data(), | ||
| qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
| copy_Sigma(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts, _ns); | ||
| qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
| copy_Sigma(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts, _ns); | ||
|
Contributor: why can't the lock be around this?
||
| } else { | ||
| // In 2cGW, G(-k) = G*(k) has already been addressed in r2() | ||
| qkpt->set_up_qkpt_second(Gk1_stij.data(), V_Qim.data(), k_reduced_id, k1_reduced_id, false); | ||
| qkpt->compute_second_tau_contraction_2C(Sigmak_stij.data(), | ||
| qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
| copy_Sigma_2c(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts); | ||
| qkpt->compute_second_tau_contraction_2C(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
| copy_Sigma_2c(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts); | ||
|
Contributor: ...and around this?
||
| } | ||
| } else { | ||
| qkpt->set_up_qkpt_second(nullptr, V_Qim.data(), k_reduced_id, k1_reduced_id, need_minus_k1); | ||
| qkpt->compute_second_tau_contraction(nullptr, qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
| qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q)); | ||
|
Contributor: even here, you probably don't want to lock out everybody, but do the lock/unlock just around the memcpy
||
| } | ||
| } | ||
| } | ||
| } | ||
| MPI_Win_sync(sigma_tau_host_shared.win()); | ||
|
Contributor: why so far outside the write?
Contributor: actually I'm totally confused about the need to sync at all.
||
| MPI_Barrier(utils::context.node_comm); | ||
| MPI_Win_unlock_all(sigma_tau_host_shared.win()); | ||
|
Contributor: this is silly here? Why? We always have lock/unlock in pairs.
||
| } | ||
| cudaDeviceSynchronize(); | ||
| MPI_Win_lock_all(MPI_MODE_NOCHECK, sigma_tau_host_shared.win()); | ||
| wait_and_clean_qkpts(qkpts, _low_device_memory, Sigmak_stij.data()); | ||
| MPI_Win_sync(sigma_tau_host_shared.win()); | ||
| MPI_Win_unlock_all(sigma_tau_host_shared.win()); | ||
| // wait for all qkpts to complete | ||
|
Contributor: propose to put the lock/unlock magic right around the memcpy and keep it very local there.
||
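The reviewer's proposal above — hold the shared-window lock only for the duration of the host-side copy instead of across the whole loop — might look like the following fragment. This is a sketch only, not a drop-in patch; the target rank and buffer names are assumptions read off the surrounding diff:

```cpp
// Sketch: lock the node-shared window only while the staged self-energy
// is copied into it, then release immediately (target rank 0 assumed).
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, sigma_tau_host_shared.win());
std::memcpy(Sigmak_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex));
MPI_Win_unlock(0, sigma_tau_host_shared.win());
```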
| if (!_low_device_memory and !_X2C) { | ||
| copy_Sigma_from_device_to_host(sigma_kstij_device, Sigma_tskij_host.data(), _ink, _nao, _nts, _ns); | ||
| MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, sigma_tau_host_shared.win()); | ||
| copy_Sigma_from_device_to_host(sigma_kstij_device, sigma_tau_host_shared.object().data(), _ink, _nao, _nts, _ns); | ||
| MPI_Win_unlock(0, sigma_tau_host_shared.win()); | ||
| } | ||
|
Contributor: ditto
||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -252,7 +252,7 @@ namespace green::gpu { | |
| naux_, Pivot_, d_info_, nw_b_) != CUBLAS_STATUS_SUCCESS) { | ||
| throw std::runtime_error("CUDA GETRF failed!"); | ||
| } | ||
| validate_info<<<1, 1, 0, stream_>>>(d_info_, nw_b_); | ||
| validate_info<<<1, 1, 0, stream_>>>(d_info_, nw_b_); | ||
| cudaEventRecord(LU_decomposition_ready_event_, stream_); | ||
|
|
||
| if (cudaStreamWaitEvent(stream_, LU_decomposition_ready_event_, 0 /*cudaEventWaitDefault*/)) | ||
|
|
@@ -320,7 +320,7 @@ namespace green::gpu { | |
| g_ktij_(g_ktij), g_kmtij_(g_kmtij), sigma_ktij_(sigma_ktij), sigma_k_locks_(sigma_k_locks), nao_(nao), nao2_(nao * nao), | ||
| nao3_(nao2_ * nao), naux_(naux), naux2_(naux * naux), nauxnao_(naux * nao), nauxnao2_(naux * nao * nao), ns_(ns), nt_(nt), | ||
| nt_batch_(nt_batch), ntnaux_(nt * naux), ntnaux2_(nt * naux * naux), ntnao_(nt * nao), ntnao2_(nt * nao2_), | ||
| handle_(handle) { | ||
| handle_(handle), cleanup_req_(false) { | ||
| _low_memory_requirement = (g_ktij == nullptr) ? true : false; | ||
| if (cudaStreamCreate(&stream_) != cudaSuccess) throw std::runtime_error("main stream creation failed"); | ||
|
|
||
|
|
@@ -349,7 +349,11 @@ namespace green::gpu { | |
| throw std::runtime_error("failure allocating Gk_tsij on host"); | ||
| if (cudaMallocHost(&Gk_smtij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex)) != cudaSuccess) | ||
| throw std::runtime_error("failure allocating Gk_tsij on host"); | ||
| Sigmak_stij_buffer_ = Gk_smtij_buffer_; | ||
| // ! GH: I think this will interfere with our cudaMemcpyAsync. Should we simply allocate a different array for Sigmak_stij_buffer_? | ||
| // ! The more I think, this here is the real reason why we had to use cudaMemcpy and not the asynchronous version. | ||
| // ! <previously> Sigmak_stij_buffer_ = Gk_smtij_buffer_; | ||
| if (cudaMallocHost(&Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex)) != cudaSuccess) | ||
| throw std::runtime_error("failure allocating Gk_tsij on host"); | ||
gauravharsha marked this conversation as resolved.
|
||
| } | ||
|
|
||
| if (cudaMalloc(&Pqk0_tQP_local_, nt_batch_ * naux2_ * sizeof(cuda_complex)) != cudaSuccess) | ||
|
|
@@ -383,6 +387,9 @@ namespace green::gpu { | |
| cudaFreeHost(Gk1_stij_buffer_); | ||
| cudaFreeHost(Gk_smtij_buffer_); | ||
| } | ||
| if (cleanup_req_ == true) { | ||
| throw std::runtime_error("cleanup of self-energy was not done correctly."); | ||
| } | ||
| } | ||
|
|
||
| template <typename prec> | ||
|
|
@@ -526,7 +533,7 @@ namespace green::gpu { | |
| } | ||
|
|
||
| template <typename prec> | ||
| void gw_qkpt<prec>::compute_second_tau_contraction(cxx_complex* Sigmak_stij_host, cuda_complex* Pqk_tQP) { | ||
| void gw_qkpt<prec>::compute_second_tau_contraction(cuda_complex* Pqk_tQP) { | ||
| cuda_complex one = cu_type_map<cxx_complex>::cast(1., 0.); | ||
| cuda_complex zero = cu_type_map<cxx_complex>::cast(0., 0.); | ||
| cuda_complex m1 = cu_type_map<cxx_complex>::cast(-1., 0.); | ||
|
|
@@ -556,12 +563,12 @@ namespace green::gpu { | |
| } | ||
| } | ||
| } | ||
| write_sigma(_low_memory_requirement, Sigmak_stij_host); | ||
| write_sigma(_low_memory_requirement); | ||
| cudaEventRecord(all_done_event_); | ||
| } | ||
|
|
||
| template <typename prec> | ||
| void gw_qkpt<prec>::compute_second_tau_contraction_2C(cxx_complex* Sigmak_stij_host, cuda_complex* Pqk_tQP) { | ||
| void gw_qkpt<prec>::compute_second_tau_contraction_2C(cuda_complex* Pqk_tQP) { | ||
| cuda_complex one = cu_type_map<cxx_complex>::cast(1., 0.); | ||
| cuda_complex zero = cu_type_map<cxx_complex>::cast(0., 0.); | ||
| cuda_complex m1 = cu_type_map<cxx_complex>::cast(-1., 0.); | ||
|
|
@@ -593,13 +600,14 @@ namespace green::gpu { | |
| } | ||
| } | ||
| } | ||
| write_sigma(true, Sigmak_stij_host); | ||
| write_sigma(true); | ||
| cudaEventRecord(all_done_event_); | ||
| } | ||
|
|
||
| template <typename prec> | ||
| void gw_qkpt<prec>::write_sigma(bool low_memory_mode, cxx_complex* Sigmak_stij_host) { | ||
| void gw_qkpt<prec>::write_sigma(bool low_memory_mode) { | ||
| // write results. Make sure we have exclusive write access to sigma, then add array sigmak_tij to sigma_ktij | ||
| // TODO: In my understanding, the lock is only required for RAXPY part now, so we should move them inside the first if condition | ||
gauravharsha marked this conversation as resolved.
|
||
| acquire_lock<<<1, 1, 0, stream_>>>(sigma_k_locks_ + k_); | ||
| scalar_t one = 1.; | ||
| if (!low_memory_mode) { | ||
|
|
@@ -608,15 +616,23 @@ namespace green::gpu { | |
| throw std::runtime_error("RAXPY fails on gw_qkpt.write_sigma()."); | ||
| } | ||
| } else { | ||
| // Copy sigmak_stij_ back to CPU | ||
| if (Sigmak_stij_host == nullptr) | ||
| throw std::runtime_error("gw_qkpt.write_sigma(): Sigmak_stij_host cannot be a null pointer"); | ||
| cudaMemcpy(Sigmak_stij_buffer_, sigmak_stij_, ns_ * ntnao2_ * sizeof(cuda_complex), cudaMemcpyDeviceToHost); | ||
| std::memcpy(Sigmak_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex)); | ||
| // Copy sigmak_stij_ asynchronously back to CPU | ||
| cudaMemcpyAsync(Sigmak_stij_buffer_, sigmak_stij_, ns_ * ntnao2_ * sizeof(cuda_complex), cudaMemcpyDeviceToHost, stream_); | ||
| // cudaMemcpyAsync will require a cleanup at later stage. | ||
| // So, we update the cleanup_req_ status to true | ||
| cleanup_req_ = true; | ||
gauravharsha marked this conversation as resolved.
|
||
| } | ||
| release_lock<<<1, 1, 0, stream_>>>(sigma_k_locks_ + k_); | ||
| } | ||
|
|
||
| template <typename prec> | ||
| void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) { | ||
| if (cleanup_req_) { | ||
|
Contributor: I would put the shared window lock just here
||
| std::memcpy(Sigmak_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex)); | ||
| cleanup_req_ = false; | ||
|
Contributor: I would put the shared window unlock just here.
||
| } | ||
| } | ||
|
|
||
| template <typename prec> | ||
| bool gw_qkpt<prec>::is_busy() { | ||
| cudaError_t stream_status = cudaStreamQuery(stream_); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,7 +30,19 @@ | |
| #include "cublas_routines_prec.h" | ||
| #include "cuda_common.h" | ||
|
|
||
| /** | ||
| * \brief checks success of a LU or Cholesky decomposition | ||
| * | ||
| * \param info output of Cuda equivalent of decomposition function, e.g., POTRF | ||
| */ | ||
| __global__ void validate_info(int* info); | ||
|
|
||
| /** | ||
| * \brief checks success of a LU or Cholesky decomposition | ||
| * | ||
| * \param info (vector) output of Cuda equivalent of decomposition function, e.g., POTRF | ||
| * \param N length of info | ||
| */ | ||
| __global__ void validate_info(int* info, int N); | ||
| __global__ void set_up_one_minus_P(cuDoubleComplex* one_minus_P, cuDoubleComplex* P, int naux); | ||
| __global__ void set_up_one_minus_P(cuComplex* one_minus_P, cuComplex* P, int naux); | ||
|
|
@@ -265,30 +277,46 @@ namespace green::gpu { | |
| /** | ||
| * \brief Using dressed GW polarization compute self-energy at a given momentum point | ||
| * | ||
| * \param Sigmak_stij_host Host stored array for Self-energy at a given momentum point | ||
| * \param Pqk_tQP Dressed polarization bubble | ||
| */ | ||
| void compute_second_tau_contraction(cxx_complex* Sigmak_stij_host = nullptr, cuda_complex* Pqk_tQP = nullptr); | ||
| void compute_second_tau_contraction(cuda_complex* Pqk_tQP = nullptr); | ||
| /** | ||
| * \brief Using dressed GW polarization compute self-energy at a given momentum point (X2C version) | ||
| * \param Sigmak_stij_host Host stored array for Self-energy at a given momentum point | ||
| * | ||
| * \param Pqk_tQP Dressed polarization bubble | ||
| */ | ||
| void compute_second_tau_contraction_2C(cxx_complex* Sigmak_stij_host = nullptr, cuda_complex* Pqk_tQP = nullptr); | ||
| void compute_second_tau_contraction_2C(cuda_complex* Pqk_tQP = nullptr); | ||
|
|
||
| /** | ||
| * \brief For a given k-point copy self-energy back to a host memory | ||
| * \param low_memory_mode - whether the whole self-energy allocated in memory or not | ||
| * \param Sigmak_stij_host - Host stored self-energy object at a given momentum point | ||
| */ | ||
| void write_sigma(bool low_memory_mode = false, cxx_complex* Sigmak_stij_host = nullptr); | ||
| void write_sigma(bool low_memory_mode = false); | ||
|
|
||
| /** | ||
| * \brief Check if cuda devices are budy | ||
| * \return true if asynchronous calculations are still running | ||
| * \brief Check if cuda devices are busy | ||
| * \return true - if asynchronous calculations are still running | ||
| */ | ||
| bool is_busy(); | ||
|
|
||
| /** | ||
| * \brief return the status of copy_selfenergy from device to host | ||
| * | ||
| * \return false - not required, stream ready for next calculation | ||
| * \return true - required | ||
| */ | ||
| bool require_cleanup(){ | ||
| return cleanup_req_; | ||
| } | ||
|
|
||
| /** | ||
| * \brief perform cleanup, i.e. copy data from Sigmak buffer (4-index array for a given momentum point) to Host shared memory Self-energy | ||
| * | ||
| * \param low_memory_mode - whether the whole self-energy allocated in memory or not | ||
| * \param Sigma_stij_host - HHost stored self-energy object at a given momentum point | ||
| */ | ||
| void cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host); | ||
|
|
||
| // | ||
| static std::size_t size(size_t nao, size_t naux, size_t nt, size_t nt_batch, size_t ns) { | ||
| return (2 * naux * nao * nao // V_Qpm+V_pmQ | ||
|
|
@@ -374,8 +402,20 @@ namespace green::gpu { | |
|
|
||
| // pointer to cublas handle | ||
| cublasHandle_t* handle_; | ||
|
|
||
| // status of data transfer / copy from Device to Host. | ||
| // false: not required, stream ready for next calculation | ||
| // true: required | ||
| bool cleanup_req_; | ||
|
Contributor: propose making a destructor that throws an exception if cleanup_req_ is true and a constructor that sets it to false.
Member: the C++ standard advises against throwing exceptions in a destructor.
Contributor: That's right. I guess we need to print an error and abort the program. This should never happen unless there's a logic error anyway.
||
| }; | ||
|
|
||
| /** | ||
| * \brief returns an idle qkpt stream, otherwise waits until a stream is available | ||
| * | ||
| * \tparam prec - precision for calculation | ||
| * \param qkpts - vector of qkpt workers (gw_qkpt<prec> type) | ||
| * \return gw_qkpt<prec>* - pointer to idle qkpt | ||
| */ | ||
| template <typename prec> | ||
| gw_qkpt<prec>* obtain_idle_qkpt(std::vector<gw_qkpt<prec>*>& qkpts) { | ||
| static int pos = 0; | ||
|
|
@@ -387,4 +427,48 @@ namespace green::gpu { | |
| return qkpts[pos]; | ||
| } | ||
|
|
||
| /** | ||
| * \brief returns an idle qkpt stream, otherwise waits until a stream is available | ||
| * | ||
| * \tparam prec - precision for calculation | ||
| * \param qkpts - vector of qkpt workers (gw_qkpt<prec> type) | ||
| * \param low_memory_mode - low memory mode for read/write integrals | ||
| * \param Sigmak_stij_host - cudaMallocHost buffer for transfering Sigma | ||
| * \return gw_qkpt<prec>* - pointer to idle qkpt | ||
| */ | ||
| template <typename prec> | ||
| gw_qkpt<prec>* obtain_idle_qkpt_for_sigma(std::vector<gw_qkpt<prec>*>& qkpts, bool low_memory_mode, | ||
| typename cu_type_map<std::complex<prec>>::cxx_type* Sigmak_stij_host) { | ||
| static int pos = 0; | ||
| pos++; | ||
| if (pos >= qkpts.size()) pos = 0; | ||
| while (qkpts[pos]->is_busy()) { | ||
| pos = (pos + 1) % qkpts.size(); | ||
| } | ||
| qkpts[pos]->cleanup(low_memory_mode, Sigmak_stij_host); | ||
| return qkpts[pos]; | ||
| } | ||
|
|
||
| /** | ||
| * \brief waits for all qkpts to complete and cleans them up | ||
| * | ||
| * \tparam prec - precision for calculation | ||
| * \param qkpts - vector of qkpt workers (gw_qkpt<prec> type) | ||
| * \param low_memory_mode - low memory mode for read/write integrals | ||
| * \param Sigmak_stij_host - cudaMallocHost buffer for transfering Sigma | ||
| */ | ||
| template <typename prec> | ||
| void wait_and_clean_qkpts(std::vector<gw_qkpt<prec>*>& qkpts, bool low_memory_mode, | ||
| typename cu_type_map<std::complex<prec>>::cxx_type* Sigmak_stij_host) { | ||
| static int pos = 0; | ||
|
Member: the next three lines do not make sense to me.
||
| pos++; | ||
| if (pos >= qkpts.size()) pos = 0; | ||
| for (pos = 0; pos < qkpts.size(); pos++) { | ||
| while (qkpts[pos]->is_busy()) { | ||
| continue; | ||
| } | ||
| qkpts[pos]->cleanup(low_memory_mode, Sigmak_stij_host); | ||
| } | ||
| return; | ||
| } | ||
| } // namespace green::gpu | ||
This we need to discuss. It's fairly catastrophic in a multi-GPU environment: only ONE MPI process will enter the section below at a time... I believe there's no reason for that.

@egull that's not how MPI_Win_lock_all(MPI_MODE_NOCHECK, win) works. This call only asserts the start of the communication epoch, and no synchronization is done here. To do the memory synchronization one has to call MPI_Win_sync, which is done below after the loop. However, there are much more dangerous things going on here. Since all processes that have a GPU enter the loop, there would be a guaranteed race condition in this loop, as we run over all k-points and do a summation over all q-points. Doing a similar synchronization pattern as implemented here is safe (and this is actually advised, to reduce synchronizations) if we know that there is no overlap between the memory regions accessed by different processes.

You're a bit late to that party. We'll discuss today how to do that in a way that is both performant (which your solution is not) and correct (which his solution is not). The way I think it works is via:
MPI_Win_lock_all
MPI_Win_sync
...then do the update/access
MPI_Win_sync
MPI_Win_flush_all
MPI_Win_unlock_all
The MPI_Win_sync at most synchronizes a private with a public version, but may not synchronize the private versions of other processes. Section 11 of the MPI standard has more.

In what way is my implementation not performant? Also, MPI_Win_flush is not needed here; we don't do any RMA operations on the shared window here.
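The epoch structure proposed in the thread above, collected into one fragment for reference. This is a sketch, not a compilable unit; `win` stands for the shared self-energy window, and whether the `MPI_Win_flush_all` is actually needed is exactly the point under dispute:

```cpp
MPI_Win_lock_all(MPI_MODE_NOCHECK, win);  // open a passive-target access epoch
MPI_Win_sync(win);                        // synchronize private and public window copies
/* ... update/access the shared self-energy ... */
MPI_Win_sync(win);                        // make local stores visible in the public copy
MPI_Win_flush_all(win);                   // complete outstanding RMA operations (disputed)
MPI_Win_unlock_all(win);                  // close the epoch
```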