
Make warpspeed scan tunable #8008

Merged
davebayer merged 8 commits into NVIDIA:main from bernhardmgruber:scan_offset_T
Mar 13, 2026

@bernhardmgruber (Contributor) commented Mar 12, 2026

  • Duplicate exclusive/sum benchmark for warpspeed since it is too different from the old implementation
  • Hardcode benchmark OffsetT to uint64
  • Review warpspeed scan tuning parameters and expose the relevant ones

This PR will conflict with #7565, and I am fine with merging this PR after that one. This PR is needed to unblock @gonidelis, who will tune the new scan implementation.

Fixes: #7893
Fixes: #7894

@bernhardmgruber bernhardmgruber requested review from a team as code owners March 12, 2026 12:20
@bernhardmgruber bernhardmgruber requested a review from shwina March 12, 2026 12:20
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 12, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 12, 2026
@bernhardmgruber (Contributor, Author) commented:

With the changes in this PR, I can start a tuning run for the warpspeed scan:

cccl/build_tune$ CUDA_VISIBLE_DEVICES=0 ../benchmarks/scripts/search.py -R 'cub.bench.scan.exclusive.sum.warspeed' -a 'KeyT{ct}=I32' -a 'Elements{io}[pow2]=28'
 ctk:  13.2.51
cccl:  v3.4.0.dev-208-g2f040ac1d4
cub.bench.scan.exclusive.sum.warspeed.wrps_5.lbi_8.ipt_224 0.3516259758697389
cub.bench.scan.exclusive.sum.warspeed.wrps_3.lbi_3.ipt_48 0.9829544835206165
cub.bench.scan.exclusive.sum.warspeed.wrps_7.lbi_7.ipt_64 1.005813892002792
cub.bench.scan.exclusive.sum.warspeed.wrps_2.lbi_1.ipt_144 0.9774011267682982
cub.bench.scan.exclusive.sum.warspeed.wrps_4.lbi_2.ipt_64 0.9943421016735846
cub.bench.scan.exclusive.sum.warspeed.wrps_6.lbi_4.ipt_24 0.988571386808916
cub.bench.scan.exclusive.sum.warspeed.wrps_1.lbi_4.ipt_48 0.5148809219190508
cub.bench.scan.exclusive.sum.warspeed.wrps_5.lbi_7.ipt_216 0.3626834049278357
cub.bench.scan.exclusive.sum.warspeed.wrps_3.lbi_2.ipt_216 0.7330993445471629
cub.bench.scan.exclusive.sum.warspeed.wrps_5.lbi_5.ipt_152 0.7456895936547066
cub.bench.scan.exclusive.sum.warspeed.wrps_1.lbi_4.ipt_128 0.8565018316886749
cub.bench.scan.exclusive.sum.warspeed.wrps_3.lbi_7.ipt_24 0.7393161554878765
cub.bench.scan.exclusive.sum.warspeed.wrps_3.lbi_2.ipt_128 1.0000902018895734
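
To make output like the above easier to compare, here is a minimal Python sketch that parses the variant lines and sorts them by the trailing number. The helper names `parse_variants` and `rank_variants` are hypothetical and not part of search.py; the parameter abbreviations (`wrps`, `lbi`, `ipt`) are kept exactly as they appear in the variant names. Sorting in descending order assumes that a larger number is better, which the thread does not state.

```python
import re

# A few lines copied from the tuning output above.
SAMPLE = """\
cub.bench.scan.exclusive.sum.warspeed.wrps_5.lbi_8.ipt_224 0.3516259758697389
cub.bench.scan.exclusive.sum.warspeed.wrps_7.lbi_7.ipt_64 1.005813892002792
cub.bench.scan.exclusive.sum.warspeed.wrps_3.lbi_2.ipt_128 1.0000902018895734
"""

# Variant lines look like: <benchmark>.wrps_<n>.lbi_<n>.ipt_<n> <number>
VARIANT_RE = re.compile(
    r"^(?P<bench>\S+?)\.wrps_(?P<wrps>\d+)\.lbi_(?P<lbi>\d+)\.ipt_(?P<ipt>\d+)"
    r"\s+(?P<score>[0-9.eE+-]+)$"
)


def parse_variants(text):
    """Return one dict per variant line; other lines (ctk/cccl versions) are skipped."""
    rows = []
    for line in text.splitlines():
        match = VARIANT_RE.match(line.strip())
        if match is None:
            continue
        rows.append(
            {
                "bench": match["bench"],
                "wrps": int(match["wrps"]),
                "lbi": int(match["lbi"]),
                "ipt": int(match["ipt"]),
                "score": float(match["score"]),
            }
        )
    return rows


def rank_variants(rows):
    """Sort variants by the trailing number, largest first (assumed: larger is better)."""
    return sorted(rows, key=lambda row: row["score"], reverse=True)


if __name__ == "__main__":
    for row in rank_variants(parse_variants(SAMPLE)):
        print(row["wrps"], row["lbi"], row["ipt"], row["score"])
```

Piping the full output of a tuning run into `parse_variants` instead of `SAMPLE` should work the same way, as long as the line format matches the one shown in this thread.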

@github-actions (Contributor) commented:

🥳 CI Workflow Results

🟩 Finished in 2h 31m: Pass: 100%/249 | Total: 8d 17h | Max: 2h 30m | Hits: 72%/157696

@davebayer davebayer merged commit 1f93d93 into NVIDIA:main Mar 13, 2026
265 of 268 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Mar 13, 2026
@bernhardmgruber bernhardmgruber deleted the scan_offset_T branch March 13, 2026 14:25
Development

Successfully merging this pull request may close these issues.

Make warpspeed scan tunable
Remove I32 benchmarks for scan