
Conversation

@brandon-b-miller
Contributor

PR #609 changed the way modules are loaded, with the result that the wrong object is now passed to cuOccupancyMaxPotentialBlockSize (previously a CUFunction, now a CUKernel). The max block size calculation therefore fails, the wrong value is eventually used, and kernel launches raise CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on certain GPUs. This is observable on a V100 with a resource-hungry kernel:

```
$ python -m numba.runtests numba.cuda.tests.cudapy.test_gufunc.TestCUDAGufunc.test_gufunc_small
cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES: This indicates that a launch did not occur because it did not have appropriate resources.
```

This PR removes numba-cuda's native maximum-threads-per-block computation machinery and instead routes through cuda-python APIs to obtain the same information.
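
For context, a minimal sketch of the simplified selection logic, assuming the dispatcher's compiled kernel exposes the underlying cuda.core kernel object via a `.kernel` attribute as described in the review summary below; names other than `max_threads_per_block()` are illustrative, not the exact numba-cuda source:

```python
# Hypothetical sketch of ForAll._compute_thread_per_block after this change.
def _compute_thread_per_block(self, dispatcher):
    # A block size explicitly requested by the user wins.
    if self.thread_per_block != 0:
        return self.thread_per_block

    # Otherwise ask cuda-python for the kernel's maximum allowable block
    # size instead of running the old occupancy calculation.
    specialized = next(iter(dispatcher.overloads.values()))  # assumed lookup
    return specialized.kernel.attributes.max_threads_per_block()
```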

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@brandon-b-miller
Contributor Author

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Jan 23, 2026

Greptile Summary

This PR fixes a critical bug where cuOccupancyMaxPotentialBlockSize was being passed a CUKernel object instead of a CUFunction, causing CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on certain GPUs. The fix simplifies the implementation by removing the custom occupancy calculation machinery and instead directly using kernel.attributes.max_threads_per_block() from cuda-python.

Key Changes:

  • Removed get_max_potential_block_size() method and its helper implementations from Context class
  • Updated ForAll._compute_thread_per_block() to use function.kernel.attributes.max_threads_per_block() directly
  • Removed corresponding test code for the deleted method
  • Cleaned up unused imports (c_size_t, cu_occupancy_b2d_size)

Trade-off: The new approach uses the maximum allowable threads per block rather than calculating an optimal block size for occupancy. While this may not achieve optimal occupancy in all cases, it fixes the immediate bug and simplifies the codebase by routing through cuda-python APIs.
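
As a usage-level illustration of where this trade-off applies (a small sketch using the public numba.cuda `forall` API, not taken from this PR's tests): the block size chosen below now comes from the kernel's maximum allowable threads per block rather than an occupancy query.

```python
from numba import cuda
import numpy as np

@cuda.jit
def inc(a):
    i = cuda.grid(1)
    if i < a.size:
        a[i] += 1

a = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))

# forall picks the launch configuration; after this PR the block size is the
# kernel's max_threads_per_block value rather than an occupancy-tuned size.
inc.forall(a.size)(a)
```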

Confidence Score: 4/5

  • This PR is safe to merge - it fixes a critical runtime bug with a pragmatic solution
  • The fix resolves a real production issue (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES). The implementation is straightforward and the test was properly updated. Score is 4/5 rather than 5/5 because the semantic change from optimal occupancy to maximum block size may impact performance in some workloads, though this appears to be an acceptable trade-off given the bug being fixed.
  • No files require special attention - all changes are clean and well-structured

Important Files Changed

| Filename | Overview |
| --- | --- |
| numba_cuda/numba/cuda/cudadrv/driver.py | Removes the broken get_max_potential_block_size method and unused imports |
| numba_cuda/numba/cuda/dispatcher.py | Replaces the occupancy API with a direct max_threads_per_block() call; simpler, but may not optimize for occupancy |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | Removes the test for the deleted get_max_potential_block_size method |

Contributor

@greptile-apps bot left a comment

2 files reviewed, 1 comment


@brandon-b-miller
Contributor Author

/ok to test

Contributor

@cpcloud left a comment

Nice!

@cpcloud
Contributor

cpcloud commented Jan 23, 2026

It looks like this is still in use in cudf. Perhaps we can just fix it as is and keep it around until it can be adjusted downstream in cudf?
