- Add full_sort_cpu and partial_sort_cpu functions
- Use Kokkos::parallel_for with DefaultHostExecutionSpace for row-level parallelism
- Add benchmark options: -c for CPU sort, -f -c for CPU full sort
- Add test cases for both CPU sort functions
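The row-level CPU sorting described above could look roughly like the sketch below. It is not the PR's code: the names full_sort_rows and partial_sort_rows, the row-major std::vector<float> layout, and the parameter k are assumptions made for illustration, and a plain loop over rows stands in for Kokkos::parallel_for so the example compiles with the standard library alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for full_sort_cpu: sort every row of a row-major
// n_rows x n_cols matrix in ascending order. Rows are independent, so this
// loop is the part a Kokkos::parallel_for over rows would replace.
void full_sort_rows(std::vector<float>& m, std::size_t n_rows, std::size_t n_cols) {
    assert(m.size() == n_rows * n_cols);
    for (std::size_t r = 0; r < n_rows; ++r) {
        auto first = m.begin() + static_cast<std::ptrdiff_t>(r * n_cols);
        std::sort(first, first + static_cast<std::ptrdiff_t>(n_cols));
    }
}

// Hypothetical stand-in for partial_sort_cpu: after the call, the first k
// entries of each row hold that row's k smallest values in ascending order.
// The order of the remaining entries is unspecified.
void partial_sort_rows(std::vector<float>& m, std::size_t n_rows, std::size_t n_cols,
                       std::size_t k) {
    assert(m.size() == n_rows * n_cols);
    assert(k <= n_cols);
    for (std::size_t r = 0; r < n_rows; ++r) {
        auto first = m.begin() + static_cast<std::ptrdiff_t>(r * n_cols);
        std::partial_sort(first, first + static_cast<std::ptrdiff_t>(k),
                          first + static_cast<std::ptrdiff_t>(n_cols));
    }
}
```

When only the k smallest entries per row are needed, std::partial_sort avoids ordering the rest of the row, which is presumably why the PR exposes both variants.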
Implement full_sort_radix using LSD (Least Significant Digit) radix sort with 4 passes (8 bits per pass). This provides ~10x speedup over the existing bitonic sort for large arrays.

- Add full_sort_radix function using global memory double-buffering
- Add -r/--radix-sort option to partial-sort-bench
- Add test case comparing against std::stable_sort
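For readers unfamiliar with the technique, here is a minimal host-side sketch of an LSD radix sort with four 8-bit passes and double buffering. It is an illustration under assumptions, not the GPU implementation from this commit: the function name radix_sort_pairs and the use of 32-bit unsigned keys with a paired value array are hypothetical choices. Floating-point keys would first need a bit transform so that their unsigned bit patterns order the same way as the values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sorts keys ascending and applies the same permutation
// to vals. Each pass is a stable counting sort on one byte of the key, so
// elements with equal keys keep their original relative order.
void radix_sort_pairs(std::vector<std::uint32_t>& keys, std::vector<std::uint32_t>& vals) {
    assert(keys.size() == vals.size());
    const std::size_t n = keys.size();
    // Second buffer pair: each pass reads from one pair and writes to the other.
    std::vector<std::uint32_t> keys_alt(n), vals_alt(n);

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = 8 * pass;  // least significant byte first

        // 1. Histogram of the current byte.
        std::array<std::size_t, 256> offset{};
        for (std::size_t i = 0; i < n; ++i) {
            ++offset[(keys[i] >> shift) & 0xFFu];
        }

        // 2. Exclusive prefix sum turns counts into starting offsets.
        std::size_t running = 0;
        for (std::size_t& c : offset) {
            const std::size_t count = c;
            c = running;
            running += count;
        }

        // 3. Stable scatter into the alternate buffers.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = offset[(keys[i] >> shift) & 0xFFu]++;
            keys_alt[dst] = keys[i];
            vals_alt[dst] = vals[i];
        }

        // 4. Swap roles so the next pass reads the freshly written data.
        keys.swap(keys_alt);
        vals.swap(vals_alt);
    }
}
```

Because every pass is stable, the result can be checked against std::stable_sort on the same key and value pairs, which matches the test strategy named in the commit message.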
- Use cub::DeviceSegmentedRadixSort::SortPairs for efficient GPU sorting
- Throw exception when CUDA is not enabled
- Add full_sort_with_scratch declaration to header
- Add -s/--scratch-sort option to benchmark for comparing implementations
- Rename original full_sort to full_sort_kokkos
- Add full_sort wrapper that dispatches to:
  - full_sort_with_scratch if data fits in scratch memory
  - full_sort_radix (CUB) on CUDA builds otherwise
  - full_sort_kokkos on non-CUDA builds otherwise
- Simplify ccm function to use full_sort wrapper
- Add -K/--kokkos-sort option to benchmark
- Update tests to use full_sort wrapper
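The dispatch rule in the full_sort wrapper can be summarized with the small sketch below. The enum, the function name choose_sort_backend, and its parameters are hypothetical. How the PR actually measures the data size and the available scratch memory, and whether the CUDA check happens at compile time or run time, is not shown here.

```cpp
#include <cstddef>

// Hypothetical labels for the three implementations named in the commit.
enum class SortBackend {
    Scratch,   // full_sort_with_scratch
    RadixCub,  // full_sort_radix (CUB)
    Kokkos     // full_sort_kokkos
};

// Assumed decision order: prefer the scratch-memory sort whenever the data
// fits, otherwise fall back to CUB on CUDA builds and to the Kokkos sort on
// all other builds.
SortBackend choose_sort_backend(std::size_t data_bytes, std::size_t scratch_bytes,
                                bool cuda_enabled) {
    if (data_bytes <= scratch_bytes) {
        return SortBackend::Scratch;
    }
    return cuda_enabled ? SortBackend::RadixCub : SortBackend::Kokkos;
}
```

Keeping the choice in one wrapper lets callers such as ccm call full_sort without knowing which backend is available.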
Summary
- Add CPU sort functions (full_sort_cpu, partial_sort_cpu) using std::sort / std::partial_sort
- Use cub::DeviceSegmentedRadixSort for significantly better performance
- Add full_sort_with_scratch declaration to header and benchmark option
- Select the sort implementation in ccm based on data size and available scratch memory

Performance
Benchmark results on GPU (time in ms, N×N matrix):
- full_sort_with_scratch is fastest
- full_sort_radix (CUB) is ~10x faster than default Kokkos sort

Test plan