Fix max block size computation in forall
#744
base: main
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
/ok to test
Greptile Summary

This PR fixes a critical bug where the wrong object was passed to the maximum block size computation used by forall, leading to CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on certain GPUs.

Key Changes:
- Removes the numba-cuda native maximum threads per block computation machinery and routes through cuda-python APIs instead.
Trade-off: The new approach uses the maximum allowable threads per block rather than calculating an optimal block size for occupancy. While this may not achieve optimal occupancy in all cases, it fixes the immediate bug and simplifies the codebase by routing through cuda-python APIs.

Confidence Score: 4/5
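As a rough illustration of the trade-off described above (the function name and structure here are hypothetical, not the actual numba-cuda helpers): with the occupancy calculation gone, a forall-style launch can simply cap its block size at the maximum allowable threads per block and size the grid to cover all tasks.

```python
import math


def forall_launch_config(ntasks: int, max_threads_per_block: int = 1024):
    """Hypothetical sketch of a 1D launch configuration for a forall-style
    elementwise launch.

    Rather than asking the occupancy API for an "optimal" block size,
    cap the block size at the maximum allowable threads per block and
    derive the grid size from the task count.
    """
    if ntasks <= 0:
        raise ValueError("ntasks must be positive")
    blocksize = min(ntasks, max_threads_per_block)
    gridsize = math.ceil(ntasks / blocksize)  # enough blocks to cover every task
    return gridsize, blocksize
```

For example, `forall_launch_config(1_000_000)` yields `(977, 1024)`: 977 blocks of 1024 threads cover one million tasks. An occupancy-tuned choice might pick a smaller block size on some kernels, but this cap is always launchable.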
Important Files Changed
2 files reviewed, 1 comment
/ok to test
cpcloud left a comment
Nice!
It looks like this is still in use in cudf. Perhaps we can just fix it as-is and keep it around until it can be adjusted downstream in cudf?
PR #609 made some changes to the way modules were loaded that result in the wrong object being passed to cuOccupancyMaxPotentialBlockSize (previously a CUFunction and now a CUKernel). This causes the max block size calculation to fail and leads to a CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on certain GPUs. This is observable on a V100 with a resource-hungry kernel:

This PR removes the numba-cuda native maximum threads per block computation machinery and routes through cuda-python APIs to get the same information.
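To see why a resource-hungry kernel hits CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES: a kernel's per-thread register usage caps how many threads can actually fit in one block, so launching the hardware maximum of 1024 threads can fail even though the launch configuration looks valid. The sketch below is illustrative arithmetic only, with V100-like limits hard-coded rather than queried from a device; in practice the correct per-kernel cap is what the driver reports as the kernel's maximum threads per block.

```python
def register_limited_block_size(regs_per_thread: int,
                                regs_per_block: int = 65536,
                                hw_max_threads: int = 1024) -> int:
    """Illustrative only: estimate the largest launchable block size for a
    kernel, given its register usage (V100-like limits, hypothetical values).

    A "resource hungry" kernel using many registers per thread cannot run
    the hardware maximum of threads per block; launching above this limit
    is what surfaces as CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Threads per block are limited by both the hardware cap and the
    # per-block register file budget.
    return min(hw_max_threads, regs_per_block // regs_per_thread)
```

For instance, a kernel using 128 registers per thread fits only 512 threads per block under these limits, so a blanket 1024-thread launch fails; querying the actual per-kernel maximum (as the cuda-python-based path does) avoids this.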