Optimize photon differential splatting CUDA kernels with grid-stride loops and memory access patterns #4
base: main
Conversation
… splatting

- Add CHECK_INPUT validation for all tensors in backward function
- Optimize forward/backward kernels with grid-stride loops and value hoisting
- Add tunable block size (DEFAULT_BLOCK_SIZE = 256)
- Enable --use_fast_math for PhotonDifferentialSplatting in setup.py
- Add comprehensive test suite with numerical and performance tests

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
CompN3rd left a comment
The tests should compare numerical equivalence against the unoptimized kernels
…optimized baseline

- Add CPU-based reference implementations matching original unoptimized CUDA kernel logic
- Update numerical equivalence tests to compare optimized vs reference implementations
- Reference functions: silverman_kernel, d_silverman_kernel, pixel_to_coord, coord_to_pixel, matrix_multiply
- reference_pds_forward: Replicates original forward kernel with offset-based loops
- reference_pds_backward: Replicates original backward kernel with local accumulation
- Tests now validate that optimizations maintain numerical equivalence with tight tolerances
- Updated documentation to explain reference implementation approach

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
Added reference implementations that replicate the original unoptimized kernel logic. The tests now compare the optimized CUDA kernels against CPU-based reference implementations.
CompN3rd left a comment
Don't compare with a reference numpy implementation, but rather with the kernel version before the changes made to the cuda kernels in this PR
…sting

- Add original unoptimized CUDA kernels (_original suffix) to photon_differentials.cu
- Expose pds_forward_original and pds_backward_original through C++ wrapper
- Update tests to compare optimized vs original CUDA kernels (not NumPy)
- Remove NumPy reference implementations from test file
- Both kernels run on GPU for direct comparison
- Update documentation to reflect use of original CUDA kernels as reference

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
Changed to compare with the original CUDA kernels instead of a NumPy reference. Added `pds_forward_original` and `pds_backward_original` so both versions run on the GPU for direct comparison.
Refactored `pds_cuda_forward_kernel` and `pds_cuda_backward_kernel` to reduce memory bandwidth and improve GPU occupancy through grid-stride loops, value hoisting, and local accumulation.

CUDA Kernel Optimizations
Forward kernel:
- Image constants (`w`, `h`, `inv_half_w_sq`, `inv_half_h_sq`) hoisted outside the main loop
- Per-photon values (`radius`, `E_center`, `M_center[6]`, bounding box) cached before the inner loops
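As a rough illustration of this pattern, here is a minimal grid-stride splatting kernel. The kernel name, parameter layout, and the single-pixel splat are assumptions made for the sketch; the actual `pds_cuda_forward_kernel` iterates over each photon's bounding box with a Silverman kernel weight.

```cuda
// Sketch only: splat_forward_sketch and its parameters are illustrative,
// not the repository's pds_cuda_forward_kernel.
__global__ void splat_forward_sketch(const float* __restrict__ xp,    // (n, 2) photon positions, assumed pixel coords
                                     const float* __restrict__ Ep,    // (n,) photon energies
                                     float*       __restrict__ image, // (h, w) output, row-major
                                     int n_photons, int w, int h)
{
    // Image-only constants (such as inv_half_w_sq in the real kernel) would be
    // computed once here, outside the loop.

    // Grid-stride loop: each thread processes photons i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_photons;
         i += blockDim.x * gridDim.x)
    {
        // Hoist per-photon values out of the (elided) inner pixel loops.
        const float px       = xp[2 * i + 0];
        const float py       = xp[2 * i + 1];
        const float E_center = Ep[i];

        // The real kernel loops over the photon's bounding box; here we
        // simply deposit the energy at the nearest pixel.
        const int x = min(max(__float2int_rn(px), 0), w - 1);
        const int y = min(max(__float2int_rn(py), 0), h - 1);
        atomicAdd(&image[y * w + x], E_center);
    }
}
```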
Backward kernel:

- Gradients accumulated in thread-local variables (`g_Ep`, `g_xp[2]`, `g_Mp[6]`), reducing global writes by ~100x
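A sketch of the local-accumulation idea: contributions from every pixel in the bounding box are summed into registers and written to global memory once per photon. The gradient math, tensor layout, and `box_radius` parameter are placeholders, not the real `pds_cuda_backward_kernel`.

```cuda
// Sketch only: splat_backward_sketch is illustrative, not the repository's
// pds_cuda_backward_kernel.
__global__ void splat_backward_sketch(const float* __restrict__ grad_image, // (h, w) upstream gradient
                                      const float* __restrict__ xp,         // (n, 2) photon positions
                                      float*       __restrict__ grad_Ep,    // (n,) output gradient w.r.t. Ep
                                      int n_photons, int w, int h, int box_radius)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_photons;
         i += blockDim.x * gridDim.x)
    {
        const int cx = min(max(__float2int_rn(xp[2 * i + 0]), 0), w - 1);
        const int cy = min(max(__float2int_rn(xp[2 * i + 1]), 0), h - 1);

        // Accumulate in a register over the whole bounding box ...
        float g_Ep = 0.0f;
        for (int y = max(cy - box_radius, 0); y <= min(cy + box_radius, h - 1); ++y)
            for (int x = max(cx - box_radius, 0); x <= min(cx + box_radius, w - 1); ++x)
                g_Ep += grad_image[y * w + x];  // the real kernel also applies the splat weight

        // ... and touch global memory once per photon instead of once per pixel.
        grad_Ep[i] = g_Ep;
    }
}
```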
Original unoptimized kernels:

- `pds_cuda_forward_kernel_original` and `pds_cuda_backward_kernel_original` kept to preserve the pre-optimization implementation
- Exposed as the `pds_forward_original()` and `pds_backward_original()` Python functions

Configuration:
- `DEFAULT_BLOCK_SIZE = 256` (reduced from the hardcoded 512)
- `--use_fast_math` enabled for the PhotonDifferentialSplatting extension only
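For context on how a tunable block size combines with grid-stride loops, a hypothetical host-side launcher might look like the following; `launch_forward` and the grid sizing are illustrative, not the repository's actual launch code.

```cuda
// Forward declaration of the sketch kernel shown above.
__global__ void splat_forward_sketch(const float*, const float*, float*, int, int, int);

constexpr int DEFAULT_BLOCK_SIZE = 256;  // tunable; smaller blocks can improve occupancy

void launch_forward(const float* xp, const float* Ep, float* image,
                    int n_photons, int w, int h)
{
    const int block = DEFAULT_BLOCK_SIZE;
    // With a grid-stride loop the grid only has to be "big enough"; it could
    // also be capped (e.g. at a multiple of the SM count) without losing work.
    const int grid = (n_photons + block - 1) / block;
    splat_forward_sketch<<<grid, block>>>(xp, Ep, image, n_photons, w, h);
}
```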
Input Validation

Added `CHECK_INPUT` for all tensors in `pds_backward`: `Ep`, `xp`, `Mp`, `cp`, `radius`
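For reference, `CHECK_INPUT` in PyTorch CUDA extensions is conventionally defined roughly as below; the exact macro and the full `pds_backward` signature in this repository may differ, so the wrapper function here (`pds_backward_validate`) is an assumed sketch.

```cpp
#include <torch/extension.h>

#define CHECK_CUDA(x)       TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x)      CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// Signature abbreviated and assumed; the real pds_backward takes more arguments
// and launches pds_cuda_backward_kernel after validation.
void pds_backward_validate(const torch::Tensor& Ep, const torch::Tensor& xp,
                           const torch::Tensor& Mp, const torch::Tensor& cp,
                           const torch::Tensor& radius)
{
    CHECK_INPUT(Ep);
    CHECK_INPUT(xp);
    CHECK_INPUT(Mp);
    CHECK_INPUT(cp);
    CHECK_INPUT(radius);
}
```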
Test Suite

Added `tests/test_photon_differentials.py` with numerical equivalence tests (`rtol=1e-5`, `atol=1e-6`) and performance tests.

Example test structure:
Tests compare the optimized kernels directly against the original CUDA implementation (both running on GPU) to validate numerical equivalence.
Files Changed
- `PyOptix/kernel/photon_differentials.cu` - Optimized kernels and original unoptimized kernels for testing
- `PyOptix/PhotonDifferentialSplattig.cpp` - Input validation and exposure of original kernels
- `setup.py` - Fast math compilation flag
- `tests/test_photon_differentials.py` - Test suite (7 tests, 3 classes)
- `OPTIMIZATION_SUMMARY.md` - Technical details
- `tests/README.md` - Testing guide

Original prompt
This pull request was created as a result of the following prompt from Copilot chat.