
Implement the new tuning API for DeviceScan #7565

Open
griwes wants to merge 38 commits into NVIDIA:main from griwes:feature/new-tuning-api/scan

Conversation

@griwes (Contributor) commented Feb 8, 2026

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review, still planning to do SASS inspection in some crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners February 8, 2026 05:44
@griwes griwes requested a review from shwina February 8, 2026 05:44
@griwes griwes requested a review from elstehle February 8, 2026 05:44
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 8, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026

@bernhardmgruber (Contributor) left a comment:

This looks really good already! Great work!

@bernhardmgruber (Contributor) commented:

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

@bernhardmgruber (Contributor) commented:

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!)

@griwes (Contributor, Author) commented Feb 18, 2026

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review.

Edit: also seems I messed up some constexprness 😅


@griwes (Contributor, Author) commented Feb 27, 2026

Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately.


@bernhardmgruber (Contributor) left a comment:
I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!

Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!

```cpp
{
  static constexpr int num_squads = 5;

  bool valid = false;
```

Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes needed for now.

Comment on lines +933 to +956
```cpp
return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
  return DispatchScan<InputIteratorT,
                      OutputIteratorT,
                      ScanOpT,
                      InitValueT,
                      OffsetT,
                      AccumT,
                      EnforceInclusive,
                      fake_policy,
                      KernelSource,
                      KernelLauncherFactory>{
    d_temp_storage,
    temp_storage_bytes,
    d_in,
    d_out,
    num_items,
    scan_op,
    init_value,
    stream,
    -1 /* ptx_version, not actually used */,
    kernel_source,
    launcher_factory}
    .__invoke(policy_getter, policy_selector);
});
```

Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.

@bernhardmgruber (Contributor) commented:

I finished another review and I only have minor comments except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes.


@bernhardmgruber (Contributor) commented:

I have been thinking a bit about the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries: if you compile for sm_100 today and run that binary in 10 years on a GPU that really only has 48KiB SMEM, it should still work. We don't need that guarantee for CCCL.C, since we don't keep binaries around.

The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan.

Now I wondered: is the set of types that CCCL.C will use open or closed? Because if we know all the types the warpspeed scan will be used with from CCCL.C, we can just verify in a unit test that they fit into SMEM and omit the compile-time checking for CCCL.C entirely. We would just drop the SMEM check from the scan_use_warpspeed predicate. That would make this PR a lot simpler.

@bernhardmgruber (Contributor) commented:

I just realized we still need the runtime computation to know how much SMEM we must request :S


@griwes (Contributor, Author) commented Mar 16, 2026

There are SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe there's a whole bunch of codegen artifacts here, plus some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths.

There are also two specific cases that now seem to produce LMEM instructions, though as far as I can tell they're not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564.

@github-actions commented:

🥳 CI Workflow Results

🟩 Finished in 2h 13m: Pass: 100%/255 | Total: 8d 19h | Max: 2h 12m | Hits: 59%/159900

See results here.


Labels: none yet

Projects: Status: In Review

Development

Successfully merging this pull request may close these issues:

  • Implement the new tuning API for DeviceScan
  • Refactor cccl.c scan to use tuning API
  • Make warpspeed scan work in CCCL.C

3 participants