Skip to content

Fea/6482 enable grid constant for all non-mutable CUB kernel parameters#6642

Open
toxicteddy00077 wants to merge 29 commits intoNVIDIA:mainfrom
toxicteddy00077:fea/6482-enable-grid-constant
Open

Fea/6482 enable grid constant for all non-mutable CUB kernel parameters#6642
toxicteddy00077 wants to merge 29 commits intoNVIDIA:mainfrom
toxicteddy00077:fea/6482-enable-grid-constant

Conversation

@toxicteddy00077
Copy link

Description

closes #6482

Just added _CCCL_GRID_CONSTANT const to the immutable params for the necessary kernels

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 17, 2025
…ea/6482-enable-grid-constant

Merge upstream
@toxicteddy00077 toxicteddy00077 force-pushed the fea/6482-enable-grid-constant branch from f690f5e to bf617f5 Compare November 17, 2025 17:16
@bernhardmgruber
Copy link
Contributor

/ok to test bf617f5

@bernhardmgruber
Copy link
Contributor

Thank you for this contribution! I think we should also do a benchmark and a SASS check somewhere, just because I am curious about the impact of this change. @toxicteddy00077 you can leave this to us.

@github-actions

This comment has been minimized.

Copy link
Contributor

@fbusato fbusato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @toxicteddy00077. It is great to see this feature in CUB.

Initial feedbacl: This work could be extended to more parameters, e.g. cuda::std::array and pointers. Also, we need to double-check the implications for custom operators

CUB_DETAIL_KERNEL_ATTRIBUTES void DeviceHistogramSweepKernel(
SampleIteratorT d_samples,
_CCCL_GRID_CONSTANT const SampleIteratorT d_samples,
::cuda::std::array<int, NumActiveChannels> num_output_bins_wrapper,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why _CCCL_GRID_CONSTANT is skipped on cuda::std::array parameters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it earlier, but when i tested the benches I ran into errors regarding AgentHistogram since the .data() function in it is expected to return type CounterT** but with _CCCL_GRID_CONSTANT const seems to return something else. Please let me know if I've missed something

Copy link
Contributor

@fbusato fbusato Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check the code. It is a bit unexpected that we modify internal cuda::std::array pointers. Maybe @bernhardmgruber knows more about this point.

Excluding pointers, other ::cuda::std::array parameters should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this had historical reasons, because we could not change AgentHistogram to take pointers by const before CCCL 3.0, where we moved them to a detail namespace.

Please add _CCCL_GRID_CONSTANT const to the ::cuda::std::array for histogram and propagatge const throughout the agent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added _CCCL_GRID_CONSTANT const to the ::cuda::std::array, just had to make the necessary AgentHistogram members const(I marked them as read-only) and also make some kernel_histogram.cuh methods in Transform which are utilized as const

ValueT* items_pong,
CompareOpT compare_op,
_CCCL_GRID_CONSTANT const CompareOpT compare_op,
OffsetT* merge_partitions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a write memory location? or _CCCL_GRID_CONSTANT was not applied to pointers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what i understand the _CCCL_GRID_CONSTANT const was applied to the outer pointers, not the inner pointers which actually point to the write memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is the reason I asked. It doesn't matter if the pointer is for read or write

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 18, 2025
CUB_DETAIL_KERNEL_ATTRIBUTES void DeviceHistogramSweepKernel(
SampleIteratorT d_samples,
_CCCL_GRID_CONSTANT const SampleIteratorT d_samples,
::cuda::std::array<int, NumActiveChannels> num_output_bins_wrapper,
Copy link
Contributor

@fbusato fbusato Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check the code. It is a bit unexpected that we modify internal cuda::std::array pointers. Maybe @bernhardmgruber knows more about this point.

Excluding pointers, other ::cuda::std::array parameters should work.

KeyT* tmp_keys_out,
ValueT* tmp_items_out,
CompareOpT compare_op,
_CCCL_GRID_CONSTANT const CompareOpT compare_op,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side effects here are extremely rare, but I would remove _CCCL_GRID_CONSTANT from compare_op

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@toxicteddy00077
Copy link
Author

Apologies for the delay. I will be making changes and adding _CCCL_GRID_CONSTANT const to ::cuda::std::array for histogram and make sure it propogates const

@@ -445,17 +445,17 @@ template <typename ChainedPolicyT,
typename OffsetT>
__launch_bounds__(int(ChainedPolicyT::ActivePolicy::AgentHistogramPolicyT::BLOCK_THREADS))
CUB_DETAIL_KERNEL_ATTRIBUTES void DeviceHistogramSweepKernel(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous kernel DeviceHistogramInitKernel misses _CCCL_GRID_CONSTANT

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made corrections

AtomicOffsetT* d_ctrs,
OffsetT* d_bins_out,
const OffsetT* d_bins_in,
_CCCL_GRID_CONSTANT const OffsetT* const d_bins_in,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeviceRadixSortExclusiveSumKernel (below) is missing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made corrections

@toxicteddy00077 toxicteddy00077 force-pushed the fea/6482-enable-grid-constant branch from 774eb51 to 0e50ba3 Compare November 26, 2025 15:04
@toxicteddy00077 toxicteddy00077 force-pushed the fea/6482-enable-grid-constant branch from bfbcdf4 to 53dbe0b Compare February 22, 2026 13:59
@toxicteddy00077
Copy link
Author

toxicteddy00077 commented Feb 22, 2026

done @bernhardmgruber

@bernhardmgruber
Copy link
Contributor

/ok to test 53dbe0b

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@bernhardmgruber
Copy link
Contributor

Some CI jobs are getting stuck compiling now. I have seen this previously with the warpspeed scan, where adding _CCCL_GRID_CONSTANT to one specific kernel parameter caused nvcc < 12.8 to get stuck. See:

// nvcc 12.0 gets stuck compiling some TUs like `cub.bench.scan.exclusive.sum.base`, so only enable for newer versions
#if _CCCL_CUDACC_AT_LEAST(12, 8)
_CCCL_GRID_CONSTANT
#endif // _CCCL_CUDACC_AT_LEAST(12, 8)

@fbusato what do you want to do? Use _CCCL_GRID_CONSTANT only for nvcc >= 12.8 or start finding out which line(s) of this PR cause an issue?

@fbusato
Copy link
Contributor

fbusato commented Feb 25, 2026

thanks, @bernhardmgruber. This is very helpful to address the problem. Yes, I strongly suggest to enable it only for nvcc >= 12.8. I will track the potential bug offline and chat with the compiler team.

@toxicteddy00077 toxicteddy00077 force-pushed the fea/6482-enable-grid-constant branch from 9ffb226 to e384d92 Compare March 11, 2026 11:18
@toxicteddy00077
Copy link
Author

@fbusato any update on this? thanks

@fbusato
Copy link
Contributor

fbusato commented Mar 11, 2026

@toxicteddy00077 sorry again, this not a known issue. Please go ahead and enable _CCCL_GRID_CONSTANT only for nvcc >= 12.8

@toxicteddy00077 toxicteddy00077 force-pushed the fea/6482-enable-grid-constant branch from e384d92 to fb8b5cb Compare March 11, 2026 16:43
@toxicteddy00077 toxicteddy00077 requested a review from a team as a code owner March 11, 2026 16:43
@toxicteddy00077 toxicteddy00077 requested a review from fbusato March 11, 2026 17:32
@fbusato
Copy link
Contributor

fbusato commented Mar 11, 2026

/ok to test fb8b5cb

@github-actions

This comment has been minimized.

@toxicteddy00077
Copy link
Author

@fbusato there's still one test failing, what could be the reason? Anything i can fix?

@fbusato
Copy link
Contributor

fbusato commented Mar 12, 2026

this is unrelated to your changes. It has been fixed quite recently. Let me try to rebase the PR.

@fbusato
Copy link
Contributor

fbusato commented Mar 12, 2026

/ok to test 4a7004e

@github-actions
Copy link
Contributor

😬 CI Workflow Results

🟥 Finished in 2h 58m: Pass: 99%/298 | Total: 12d 04h | Max: 2h 57m | Hits: 69%/372635

See results here.

@fbusato
Copy link
Contributor

fbusato commented Mar 14, 2026

Python nvcc GCC / SQ / [CTK13.1 GCC13 py3.13] Test cuda.compute(amd64, L4)

AssertionError: LDL instruction found in SASS

@NaderAlAwar could you please check?

@NaderAlAwar
Copy link
Contributor

@fbusato Looked into this. The LDL/STL instructions appear even with NVCC, this is not specific to cuda.compute. I ran some benchmarks and the performance after the changes in this PR are 0-2% better, so this is not an issue. I will open a PR to disable the checks on the python side. I think we should disable all the LDL/STL checks until I spend some more time looking into #7978.

@NaderAlAwar
Copy link
Contributor

Disabled sass checks in #8053. Once it lands, you should merge from main to fix CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants