
Implement the new tuning API for DeviceScan #7565

Open
griwes wants to merge 38 commits into NVIDIA:main from griwes:feature/new-tuning-api/scan

Conversation

@griwes (Contributor) commented Feb 8, 2026

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review, still planning to do SASS inspection in some crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners February 8, 2026 05:44
@griwes griwes requested a review from shwina February 8, 2026 05:44
@griwes griwes requested a review from elstehle February 8, 2026 05:44
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 8, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026

@bernhardmgruber (Contributor) left a comment:

This looks really good already! Great work!

@bernhardmgruber (Contributor) commented:

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

@bernhardmgruber (Contributor) commented:

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!)

@griwes (Contributor, Author) commented Feb 18, 2026

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review.

Edit: also seems I messed up some constexprness 😅


@griwes (Contributor, Author) commented Feb 27, 2026

Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately.


@bernhardmgruber (Contributor) left a comment:
I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!

Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!

```cpp
{
  static constexpr int num_squads = 5;

  bool valid = false;
```

Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes needed for now.

Comment on lines +933 to +956
```cpp
return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
  return DispatchScan<InputIteratorT,
                      OutputIteratorT,
                      ScanOpT,
                      InitValueT,
                      OffsetT,
                      AccumT,
                      EnforceInclusive,
                      fake_policy,
                      KernelSource,
                      KernelLauncherFactory>{
    d_temp_storage,
    temp_storage_bytes,
    d_in,
    d_out,
    num_items,
    scan_op,
    init_value,
    stream,
    -1 /* ptx_version, not actually used */,
    kernel_source,
    launcher_factory}
    .__invoke(policy_getter, policy_selector);
});
```

Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.

@bernhardmgruber (Contributor) commented:

I finished another review and I only have minor comments except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes.


@bernhardmgruber (Contributor) commented:

I have been thinking a bit about the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries: if you compile for sm_100 today and run that binary in 10 years on a GPU that really only has 48KiB SMEM, it should still work. We don't need that guarantee for CCCL.C, since we don't keep binaries around.

The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan.

Now I wondered: is the set of types that CCCL.C will use open or closed? Because if we know all the types the warpspeed scan will be used with from CCCL.C, we can just verify in a unit test that they fit into SMEM and omit the compile-time checking for CCCL.C entirely. We would just drop the SMEM check from the scan_use_warpspeed predicate. That would make this PR a lot simpler.

@bernhardmgruber (Contributor) commented:

I just realized we still need the runtime computation to know how much SMEM we must request :S


@griwes (Contributor, Author) commented Mar 16, 2026

There are SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe there's a whole bunch of codegen artifacts here, plus some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths.

There are also two specific cases that now seem to produce LMEM instructions, though as far as I can tell they're not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564.

@github-actions commented:

🥳 CI Workflow Results

🟩 Finished in 2h 13m: Pass: 100%/255 | Total: 8d 19h | Max: 2h 12m | Hits: 59%/159900

See results here.


Labels: none yet

Projects: Status: In Review

Development

Successfully merging this pull request may close these issues:

  • Implement the new tuning API for DeviceScan
  • Refactor cccl.c scan to use tuning API
  • Make warpspeed scan work in CCCL.C

3 participants