
Improve efficiency in memory trace and device->host mem-copy #5

Closed
gauravharsha wants to merge 5 commits into main from efficiency

Conversation

@gauravharsha (Contributor):
Proposing the following changes:

  1. The current copy of the new self-energy from the CUDA device to host memory uses cudaMemcpy, which blocks the process and further kernel execution. Replacing it with cudaMemcpyAsync should, in principle, speed up the code.
  2. To implement (1), function signatures are updated, and a two-step process is introduced: cudaMemcpyAsync into a buffer in host memory, followed by writing the host buffer to the shared-memory self-energy.
  3. Adds a function to measure total FLOPs achieved (including read/write steps).
  4. Updates error handling for Cholesky/LU failures when computing the polarization P from P0.
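A minimal sketch of the two-step copy described in (1) and (2); all names here (`d_sigma`, `h_buffer`, `sigma_shared`, `async_copy_sigma`) are illustrative, not the PR's actual API:

```cpp
// Sketch only: assumes a CUDA device pointer d_sigma, a page-locked host
// buffer h_buffer (allocated with cudaMallocHost), and a shared-memory
// destination sigma_shared of n elements. Names are hypothetical.
#include <cuda_runtime.h>
#include <complex>
#include <cstring>

void async_copy_sigma(const std::complex<double>* d_sigma,
                      std::complex<double>* h_buffer,
                      std::complex<double>* sigma_shared,
                      size_t n, cudaStream_t stream) {
  // Step 1: non-blocking device->host copy; pinned host memory is needed
  // for the copy to actually overlap with kernels on other streams.
  cudaMemcpyAsync(h_buffer, d_sigma, n * sizeof(std::complex<double>),
                  cudaMemcpyDeviceToHost, stream);

  // ...other kernels can be enqueued and executed here...

  // Step 2: wait for the copy to finish, then write the host buffer
  // into the shared-memory self-energy.
  cudaStreamSynchronize(stream);
  std::memcpy(sigma_shared, h_buffer, n * sizeof(std::complex<double>));
}
```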

@egull (Contributor) left a comment:

OK. On Monday we need to go over shared-memory sync and effective locking/unlocking.

For now I propose to split this into three issues: the FLOP count (harmless), the trap/abort (also harmless), and the shared memory.

The first two we can get done right away. The third needs work.

qpt.compute_Pq();
qpt.transform_wt();
// Write to Sigma(k), k belongs to _ink
MPI_Win_lock_all(MPI_MODE_NOCHECK, sigma_tau_host_shared.win());
Contributor:

This we need to discuss. It's fairly catastrophic in a multi-GPU environment: only ONE MPI process will enter the section below at a time... I believe there's no reason for that.

Member:

@egull that's not how MPI_Win_lock_all(MPI_MODE_NOCHECK, win) works. This call only asserts the start of the communication epoch; no synchronization is done here. To do the memory synchronization one has to call MPI_Win_sync, which is done below, after the loop.

However, there are much more dangerous things going on here. Since all processes that have a GPU enter the loop, there would be a guaranteed race condition in this loop, as we run over all k-points and do a summation over all q-points.

Using a similar synchronization pattern to the one implemented here is safe (and is actually advised, to reduce synchronizations) if we know that there is no overlap between the memory regions accessed by different processes.

Contributor:

You're a bit late to that party. We'll discuss today how to do that in a way that is both performant (which your solution is not) and correct (which his solution is not). The way I think it works is via

MPI_Win_lock_all
MPI_Win_sync
...then do the update/access
MPI_Win_sync
MPI_Win_flush_all
MPI_Win_unlock_all

The MPI_Win_sync at most synchronizes a private with a public version, but may not synchronize the private version of other threads. Section 11 of the MPI standard has more.
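A minimal sketch of that epoch structure, assuming `win` is the node-local shared-memory window and each rank writes only its own disjoint slice; the handle names are illustrative, not the PR's actual code:

```cpp
// Hypothetical handles: win (MPI shared-memory window owned by the
// node-local communicator). Each rank updates a non-overlapping region.
MPI_Win_lock_all(MPI_MODE_NOCHECK, win);  // open the access epoch; no sync yet
MPI_Win_sync(win);                        // sync private and public window copies

// ...do the update/access: each rank touches only its own slice...

MPI_Win_sync(win);        // publish this rank's stores
MPI_Win_flush_all(win);   // complete any outstanding RMA operations
MPI_Win_unlock_all(win);  // close the epoch
```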

Member:

In what way is my implementation not performant?

Also, MPI_Win_flush is not needed here; we don't do any RMA operations on the shared window here.

qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
copy_Sigma(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts, _ns);
qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
copy_Sigma(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts, _ns);
Contributor:

Why can't the lock be around this?

qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
copy_Sigma_2c(Sigma_tskij_host, Sigmak_stij, k_reduced_id, _nts);
qkpt->compute_second_tau_contraction_2C(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
copy_Sigma_2c(sigma_tau_host_shared.object(), Sigmak_stij, k_reduced_id, _nts);
Contributor:

...and around this?

} else {
qkpt->set_up_qkpt_second(nullptr, V_Qim.data(), k_reduced_id, k1_reduced_id, need_minus_k1);
qkpt->compute_second_tau_contraction(nullptr, qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
qkpt->compute_second_tau_contraction(qpt.Pqk_tQP(qkpt->all_done_event(), qkpt->stream(), need_minus_q));
Contributor:

Even here, you probably don't want to lock out everybody; do the lock/unlock just around the memcpy.

}
MPI_Win_sync(sigma_tau_host_shared.win());
MPI_Barrier(utils::context.node_comm);
MPI_Win_unlock_all(sigma_tau_host_shared.win());
Contributor:

This is silly here? Why? We always have lock/unlock in pairs.

src/cugw_qpt.cu Outdated
template <typename prec>
void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
if (cleanup_req_) {
std::memcpy(Sigma_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex));
Contributor:

We were talking about combining buffers here. Possibly that was done already.

// status of data transfer / copy from Device to Host.
// false: not required, stream ready for next calculation
// true: required
bool cleanup_req_;
Contributor:

I propose making a destructor that throws an exception if cleanup_req_ is true, and a constructor that sets it to false.

Member:

The C++ standard advises against throwing exceptions in a destructor.

Contributor:

That's right. I guess we need to print an error and abort the program. This should never happen unless there's a logic error anyway.
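A minimal, self-contained sketch of that idea (the class and member names are illustrative stand-ins, not the actual gw_qkpt implementation): the constructor initializes cleanup_req_ to false, and the destructor prints an error and aborts, rather than throwing, if cleanup is still pending.

```cpp
#include <cstdio>
#include <cstdlib>

// Illustrative stand-in for gw_qkpt's cleanup bookkeeping.
class qkpt_cleanup_guard {
 public:
  qkpt_cleanup_guard() : cleanup_req_(false) {}

  // Destructors should not throw; abort instead, since a pending
  // cleanup here can only mean a logic error in the calling code.
  ~qkpt_cleanup_guard() {
    if (cleanup_req_) {
      std::fprintf(stderr,
                   "fatal: qkpt destroyed with cleanup still pending\n");
      std::abort();
    }
  }

  void mark_pending()  { cleanup_req_ = true; }
  bool pending() const { return cleanup_req_; }
  void cleanup()       { /* drain device->host buffers... */ cleanup_req_ = false; }

 private:
  bool cleanup_req_;  // true: device->host copy still needs to be drained
};
```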


template <typename prec>
void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
if (cleanup_req_) {
Contributor:

I would put the shared window lock just here.

void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
if (cleanup_req_) {
std::memcpy(Sigma_stij_host, Sigmak_stij_buffer_, ns_ * ntnao2_ * sizeof(cxx_complex));
cleanup_req_ = false;
Contributor:

I would put the shared window unlock just here.
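Combining the two suggestions above, a hedged sketch of cleanup() with the shared-window lock held only around the memcpy; the window handle `sigma_win` and the epoch details are assumptions, not the PR's actual code:

```cpp
// Sketch: sigma_win is assumed to be the MPI shared-memory window backing
// the destination buffer; names and signatures are illustrative only.
template <typename prec>
void gw_qkpt<prec>::cleanup(bool low_memory_mode, cxx_complex* Sigmak_stij_host) {
  if (!cleanup_req_) return;
  MPI_Win_lock_all(MPI_MODE_NOCHECK, sigma_win);  // lock just before the copy
  MPI_Win_sync(sigma_win);
  std::memcpy(Sigmak_stij_host, Sigmak_stij_buffer_,
              ns_ * ntnao2_ * sizeof(cxx_complex));
  MPI_Win_sync(sigma_win);
  MPI_Win_unlock_all(sigma_win);                  // unlock right after it
  cleanup_req_ = false;
}
```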



template <typename prec>
void wait_and_clean_qkpts(std::vector<gw_qkpt<prec>*>& qkpts, bool low_memory_mode,
typename cu_type_map<std::complex<prec>>::cxx_type* Sigmak_stij_host) {
static int pos = 0;
Member:

The next three lines do not make sense to me.

@gauravharsha (Contributor, Author):

Closing. Superseded by #9

@gauravharsha gauravharsha deleted the efficiency branch March 11, 2026 18:13