Optimize nt_batch automatically to improve performance #10

Merged
gauravharsha merged 10 commits into Green-Phys:main from gauravharsha:stream-cleanup
Nov 8, 2025

@gauravharsha
Contributor

The GW kernel uses cublas::gemm_strided_batched, which performs best when the batch size is large.
This PR proposes automatic optimization of nt_batch to improve performance, with the following logic:

  • Set default value: nt_batch = 0
  • If nt_batch == 0, optimize value for better performance.
  • Otherwise, use specified value.

Optimization logic:

  • Keep at least 2 streams, then maximize nt_batch.
  • If the optimized nt_batch is large but n_tau - nt_batch < n_tau / 4 (i.e., the second batch is small), we opt instead for nt_batch = n_tau / 2. This is debatable, but for both small and large applications it shouldn't make much of a difference.
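
The selection logic described above can be sketched roughly as follows. This is only an illustration, not the optimize_ntbatch() implemented in this PR: the function name pick_nt_batch, the inputs free_bytes and bytes_per_tau, and the memory model (every stream holds nt_batch tau slices at once) are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the heuristic; NOT the PR's optimize_ntbatch().
// `free_bytes` and `bytes_per_tau` are assumed inputs standing in for the
// real GPU memory estimate.
inline std::size_t pick_nt_batch(std::size_t requested, std::size_t n_tau,
                                 std::size_t free_bytes,
                                 std::size_t bytes_per_tau,
                                 std::size_t min_streams = 2) {
  // A nonzero value is an explicit user choice and is used unchanged.
  if (requested != 0) return requested;
  // Degenerate inputs: fall back to the smallest batch.
  if (n_tau == 0 || bytes_per_tau == 0 || min_streams == 0) return 1;

  // Assumed memory model: each of the min_streams streams holds nt_batch
  // tau slices, so the slices that fit are shared between the streams.
  const std::size_t slices_that_fit = free_bytes / bytes_per_tau;
  std::size_t nt_batch =
      std::max<std::size_t>(1, slices_that_fit / min_streams);
  nt_batch = std::min(nt_batch, n_tau);

  // If the leftover batch would be under a quarter of n_tau, split evenly.
  // The condition is applied literally, so nt_batch == n_tau also triggers it.
  if (n_tau - nt_batch < n_tau / 4) {
    nt_batch = std::max<std::size_t>(1, n_tau / 2);
  }
  return nt_batch;
}
```

With n_tau = 100 and room for 180 slices, the first pass gives nt_batch = 90, which leaves a second batch of only 10, so the sketch falls back to 50.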

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds automatic optimization of the nt_batch parameter for the CUDA GW solver to maximize GPU performance. Previously, users had to manually specify this value, which required knowledge of GPU memory constraints and optimal batch sizing.

  • Changed default nt_batch from 1 to 0 (triggers automatic optimization)
  • Added optimize_ntbatch() function to calculate optimal batch size based on available GPU memory
  • Extended test coverage to verify both automatic optimization and manual specification modes

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File                            |   Description
-------------------------------------------------------------------------------------------
src/green/gpu/gpu_factory.h     |   Updated nt_batch parameter default from 1 to 0 and clarified documentation
src/green/gpu/gw_gpu_kernel.h   |   Added declaration for optimize_ntbatch() function with documentation
src/gw_gpu_kernel.cpp           |   Implemented optimize_ntbatch() logic and integrated it into memory checking; updated warning/error messages
test/cu_solver_test.cpp         |   Extended test coverage to test both automatic optimization (nt_batch="0") and manual setting (nt_batch="1")

gauravharsha and others added 4 commits November 7, 2025 16:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@gauravharsha gauravharsha requested a review from Copilot November 7, 2025 21:33
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


Contributor

@egull egull left a comment

Looks good to me. Print nt_batch so we have it in the log (ignore if it's done elsewhere), and then let's try it out!

@egull egull self-requested a review November 8, 2025 11:09
@gauravharsha
Contributor Author

Merging now. More tests and benchmarks are under way, but the unit-test output for Hydrogen on an ancient Quadro P1000 already shows improvements with optimization:

Low memory mode   |   Precision   |   GFLOPS w/ optimization  |   GFLOPS w/ nt_batch=1
-------------------------------------------------------------------------------------------
No                |   Double      |   2.81423                 |   0.505108
Yes               |   Double      |   2.88435                 |   0.514456
No                |   Single      |   23.6168                 |   1.44349
Yes               |   Single      |   27.5014                 |   1.44404
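
To read the table as speedups, each optimized GFLOPS figure can be divided by its nt_batch=1 counterpart. The snippet below is a small sketch of that arithmetic; the struct and function names are made up for illustration, and the numbers are copied from the table above.

```cpp
#include <array>

// Hypothetical container for one row of the benchmark table above.
struct BenchRow {
  const char* low_memory;   // "Yes" or "No"
  const char* precision;    // "Double" or "Single"
  double gflops_optimized;  // GFLOPS with automatic nt_batch
  double gflops_nt_batch_1; // GFLOPS with nt_batch=1
};

// Values copied verbatim from the table.
constexpr std::array<BenchRow, 4> kRows{{
    {"No", "Double", 2.81423, 0.505108},
    {"Yes", "Double", 2.88435, 0.514456},
    {"No", "Single", 23.6168, 1.44349},
    {"Yes", "Single", 27.5014, 1.44404},
}};

// Speedup of the optimized run relative to nt_batch=1.
constexpr double speedup(const BenchRow& row) {
  return row.gflops_optimized / row.gflops_nt_batch_1;
}
```

On these four rows the ratio comes out at roughly 5.6x in double precision and between about 16x and 19x in single precision.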

@gauravharsha gauravharsha merged commit 010479c into Green-Phys:main Nov 8, 2025
1 check passed

4 participants