Async copy of self-energy from Device to Host #9
gauravharsha merged 36 commits into Green-Phys:main
Pull Request Overview
This PR refactors the self-energy (Sigma) computation workflow in GPU-accelerated GW calculations to improve asynchronous data handling and memory management. The changes introduce a cleanup mechanism for asynchronous device-to-host data transfers and separate cublas handles for each qkpt worker stream.
- Removed `Sigmak_stij_host` parameter from computation functions, introducing a deferred cleanup pattern
- Added `cleanup()` method and `cleanup_req_` flag to manage asynchronous Sigma data transfers
- Introduced separate cublas handles for each qkpt worker to enable concurrent operations
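The deferred cleanup pattern summarized above can be illustrated with a small host-only sketch. It is not the code from this PR: `ToyStream` and `Worker` are hypothetical stand-ins, the queue that only runs on `synchronize()` merely imitates a CUDA stream carrying a `cudaMemcpyAsync`, and passing the host buffer to `cleanup()` is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy stand-in for a CUDA stream (hypothetical): enqueue() returns
// immediately and the queued work only runs when synchronize() is called.
class ToyStream {
public:
  void enqueue(std::function<void()> task) { tasks_.push(std::move(task)); }
  void synchronize() {
    while (!tasks_.empty()) {
      tasks_.front()();
      tasks_.pop();
    }
  }

private:
  std::queue<std::function<void()>> tasks_;
};

// Hypothetical worker: compute() no longer receives the host buffer.
// It only enqueues the copy into a staging buffer and sets a flag.
class Worker {
public:
  explicit Worker(std::size_t n) : staging_(n, 0.0) {}

  void compute(const std::vector<double>& device_result) {
    // Stand-in for an asynchronous device-to-host copy.
    stream_.enqueue([this, device_result] { staging_ = device_result; });
    cleanup_required_ = true;
  }

  // Wait for the pending copy, then accumulate it into the host array.
  void cleanup(std::vector<double>& host_sigma) {
    if (!cleanup_required_) return;
    stream_.synchronize();
    for (std::size_t i = 0; i < staging_.size(); ++i) host_sigma[i] += staging_[i];
    cleanup_required_ = false;
  }

  bool cleanup_required() const { return cleanup_required_; }

private:
  ToyStream stream_;
  std::vector<double> staging_;
  bool cleanup_required_ = false;
};

// Usage: two results are accumulated, each drained by an explicit cleanup.
inline std::vector<double> demo_deferred_cleanup() {
  std::vector<double> sigma(3, 0.0);
  Worker w(3);
  w.compute({1.0, 2.0, 3.0});
  // The copy is still pending, so the host array is untouched.
  assert(sigma[0] == 0.0 && w.cleanup_required());
  w.cleanup(sigma);
  w.compute({10.0, 20.0, 30.0});
  w.cleanup(sigma);
  return sigma;
}
```

The point of the flag is that a caller can ask a worker to finish its outstanding transfer at a convenient moment, for example just before the worker is reused or after the last k-point, instead of blocking inside the computation call.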
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/green/gpu/cugw_qpt.h | Added cleanup infrastructure (methods, member variables) and helper functions for managing qkpt workers; modified function signatures to remove host pointer parameters |
| src/cugw_qpt.cu | Implemented cleanup logic, moved Sigma copy operations into cleanup method, replaced synchronous memcpy with async version |
| src/green/gpu/cu_routines.h | Added _qkpt_handles vector to store separate cublas handles for qkpt streams |
| src/cu_routines.cu | Integrated new cleanup pattern into solve workflow, created separate cublas handles for each qkpt worker |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Build will fail until PR#3 on Green-Utils is merged.
egull left a comment
Cool, thanks. I think this is ready to merge. It would be easier to review if there were more pull requests about smaller changes (e.g. timers separate from async copy and k-index buffering), but this is not a perfect world...
That works. Choose whatever you think is clearest.
On Thu, Nov 6, 2025, 4:36 PM Gaurav Harsha commented on this pull request, in src/green/gpu/gpu_kernel.h:
> @@ -66,7 +66,7 @@ namespace green::gpu {
*/
inline void set_shared_Coulomb() {
if (_coul_int_reading_type == as_a_whole) {
- statistics.start("Read");
+ statistics.start("Allocate shared Coulomb");
I think the event "Read" was a bit of a misnomer. We do not do any read operations in this scope; we only allocate shared memory space to load the integrals into. Alternatively, I can rename it to "read whole integral". That way the event timing will reflect allocating + reading together.
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
This PR makes the copy of self-energy results from device to host (D2H) asynchronous.
In the current version, we use `cudaMemcpy`, which is a blocking call, i.e., all subsequent qkpt workers wait for the data transfer before starting to read integrals from the filesystem. With the asynchronous copy, the self-energy computation blocks become highly efficient and keep the GPU busy most of the time.
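The difference in host-side ordering can be sketched with a toy model. This is only an illustration under simplifying assumptions: there is a single worker, `ToyStream`, `run_blocking`, and `run_async` are hypothetical names, and the queued lambda stands in for a `cudaMemcpyAsync` that would really overlap with the next read rather than run at synchronization time.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Log = std::vector<std::string>;

// Toy stream (hypothetical): queued work runs only on synchronize().
struct ToyStream {
  std::queue<std::function<void()>> tasks;
  void enqueue(std::function<void()> t) { tasks.push(std::move(t)); }
  void synchronize() {
    while (!tasks.empty()) {
      tasks.front()();
      tasks.pop();
    }
  }
};

// Blocking copy: the host finishes the transfer for k-point k before it
// can start reading integrals for k-point k + 1.
inline Log run_blocking(int nk) {
  Log log;
  for (int k = 0; k < nk; ++k) {
    log.push_back("read " + std::to_string(k));
    log.push_back("compute " + std::to_string(k));
    log.push_back("copy " + std::to_string(k));
  }
  return log;
}

// Non-blocking copy: the transfer is only enqueued, so the next read is
// issued first; the pending copy is drained before the next computation.
inline Log run_async(int nk) {
  Log log;
  ToyStream stream;
  for (int k = 0; k < nk; ++k) {
    log.push_back("read " + std::to_string(k));
    stream.synchronize();  // deferred cleanup of the previous copy, if any
    log.push_back("compute " + std::to_string(k));
    stream.enqueue([&log, k] { log.push_back("copy " + std::to_string(k)); });
  }
  stream.synchronize();  // final cleanup after the last k-point
  return log;
}
```

For two k-points, the blocking version records the copy for k-point 0 before the read for k-point 1, while the asynchronous version records the read for k-point 1 first. On real hardware the benefit comes from the copy and the filesystem read proceeding at the same time, which this single-threaded model does not reproduce.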