Optimize nt_batch automatically to improve performance #10
gauravharsha merged 10 commits into Green-Phys:main
Conversation
Pull Request Overview
This PR adds automatic optimization of the nt_batch parameter for the CUDA GW solver to maximize GPU performance. Previously, users had to manually specify this value, which required knowledge of GPU memory constraints and optimal batch sizing.
- Changed default `nt_batch` from 1 to 0 (triggers automatic optimization)
- Added `optimize_ntbatch()` function to calculate the optimal batch size based on available GPU memory
- Extended test coverage to verify both automatic optimization and manual specification modes
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/green/gpu/gpu_factory.h | Updated nt_batch parameter default from 1 to 0 and clarified documentation |
| src/green/gpu/gw_gpu_kernel.h | Added declaration for optimize_ntbatch() function with documentation |
| src/gw_gpu_kernel.cpp | Implemented optimize_ntbatch() logic and integrated it into memory checking; updated warning/error messages |
| test/cu_solver_test.cpp | Extended test coverage to test both automatic optimization (nt_batch="0") and manual setting (nt_batch="1") |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…or very large jobs
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
egull left a comment
Looks good to me. Print nt_batch so we have it in the log (ignore if it's done elsewhere), and then let's try it out!
Merging now. More tests and benchmarks are under way, but the output of unit tests for Hydrogen on an ancient Quadro P1000 already shows improvements with optimization:
The GW kernel uses `cublas::gemm_strided_batched`, which performs best when the batch size is large. This PR proposes automatic optimization of `nt_batch` to achieve optimal performance, with the following logic:

- The default is `nt_batch = 0`.
- When `nt_batch == 0`, the value is optimized for better performance.

Optimization logic:

- Based on available GPU memory, determine the largest feasible `nt_batch`.
- If `nt_batch` is large but `n_tau - nt_batch < n_tau / 4` (i.e., the second batch is small), we opt instead for `nt_batch = n_tau / 2` -- this is debatable, but for both small and large applications it shouldn't make much of a difference.