Add CPU sort functions and optimize GPU radix sort with CUB #81

Merged
keichi merged 4 commits into master from cpu-sort on Jan 11, 2026

keichi (Owner) commented Jan 11, 2026

Summary

  • Add CPU versions of sort functions (full_sort_cpu, partial_sort_cpu) using std::sort/std::partial_sort
  • Replace custom GPU radix sort implementation with cub::DeviceSegmentedRadixSort for significantly better performance
  • Add full_sort_with_scratch declaration to header and benchmark option
  • Automatically select optimal sort implementation in ccm based on data size and available scratch memory
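As a rough illustration of the CPU path in the first bullet, the sketch below applies std::partial_sort to each row of a row-major matrix and returns the column indices of the k smallest values. The name partial_sort_rows, the flat std::vector layout, and the index-returning signature are assumptions made for this example; the actual partial_sort_cpu operates on Kokkos views and may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not the PR's API): for each row of a row-major
// n_rows x n_cols matrix, return the column indices of the k smallest
// values, ordered by ascending value.
std::vector<std::size_t> partial_sort_rows(const std::vector<float>& dist,
                                           std::size_t n_rows,
                                           std::size_t n_cols,
                                           std::size_t k) {
    k = std::min(k, n_cols);
    const auto k_off = static_cast<std::ptrdiff_t>(k);
    std::vector<std::size_t> out(n_rows * k);
    std::vector<std::size_t> idx(n_cols);

    // Rows are independent, so this loop is the part that could be
    // parallelized on the host.
    for (std::size_t r = 0; r < n_rows; ++r) {
        const float* row = dist.data() + r * n_cols;
        std::iota(idx.begin(), idx.end(), std::size_t{0});
        // Only the first k positions are guaranteed to be sorted.
        std::partial_sort(idx.begin(), idx.begin() + k_off, idx.end(),
                          [row](std::size_t a, std::size_t b) {
                              return row[a] < row[b];
                          });
        std::copy(idx.begin(), idx.begin() + k_off,
                  out.begin() + static_cast<std::ptrdiff_t>(r * k));
    }
    return out;
}
```

Passing k equal to n_cols makes this behave like a full sort of every row, which is the relationship between the full and partial variants that this sketch assumes.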

Performance

Benchmark results on GPU (time in ms, N×N matrix):

| N | full_sort (Kokkos) | full_sort_with_scratch | full_sort_radix (CUB) |
|---|---|---|---|
| 1,000 | 1.35 | 0.36 | 1.63 |
| 2,000 | 11.34 | 1.37 | 3.60 |
| 5,000 | 99.19 | 12.12 | 15.44 |
| 10,000 | 434.18 | ❌ memory limit | 49.38 |
| 20,000 | 1,823.33 | ❌ memory limit | 182.51 |

  • For small sizes (N ≤ 5,000): full_sort_with_scratch is fastest
  • For large sizes (N ≥ 10,000): full_sort_radix (CUB) is roughly 9-10x faster than the default Kokkos sort (8.8x at N = 10,000, 10.0x at N = 20,000)

Test plan

  • All 23 existing tests pass on CUDA build
  • Benchmarks confirm the performance improvements

keichi added 4 commits January 8, 2026 14:54
- Add full_sort_cpu and partial_sort_cpu functions
- Use Kokkos::parallel_for with DefaultHostExecutionSpace for row-level parallelism
- Add benchmark options: -c for CPU sort, -f -c for CPU full sort
- Add test cases for both CPU sort functions
Implement full_sort_radix using LSD (Least Significant Digit) radix sort
with 4 passes (8 bits per pass). This provides ~10x speedup over the
existing bitonic sort for large arrays.

- Add full_sort_radix function using global memory double-buffering
- Add -r/--radix-sort option to partial-sort-bench
- Add test case comparing against std::stable_sort
- Use cub::DeviceSegmentedRadixSort::SortPairs for efficient GPU sorting
- Throw exception when CUDA is not enabled
- Add full_sort_with_scratch declaration to header
- Add -s/--scratch-sort option to benchmark for comparing implementations
- Rename original full_sort to full_sort_kokkos
- Add full_sort wrapper that dispatches to:
  - full_sort_with_scratch if data fits in scratch memory
  - full_sort_radix (CUB) on CUDA builds otherwise
  - full_sort_kokkos on non-CUDA builds otherwise
- Simplify ccm function to use full_sort wrapper
- Add -K/--kokkos-sort option to benchmark
- Update tests to use full_sort wrapper
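The dispatch rule listed in this commit can be summarized with a small selection function. This is only a sketch of the decision logic: the enum SortImpl, the function choose_sort, and the way the scratch requirement is expressed in bytes are all assumptions for illustration, and the real full_sort wrapper may compute the fit differently.

```cpp
#include <cstddef>

// Hypothetical labels for the three implementations named in the PR.
enum class SortImpl { WithScratch, RadixCub, Kokkos };

// Hypothetical selection rule mirroring the dispatch order described above.
// `required_bytes` is the scratch memory the data would need and
// `scratch_bytes` is what is available; both are assumptions of this sketch.
constexpr SortImpl choose_sort(std::size_t required_bytes,
                               std::size_t scratch_bytes,
                               bool cuda_enabled) {
    if (required_bytes <= scratch_bytes) {
        return SortImpl::WithScratch;  // fastest when the data fits
    }
    // Data does not fit in scratch: use CUB on CUDA builds,
    // otherwise fall back to the Kokkos implementation.
    return cuda_enabled ? SortImpl::RadixCub : SortImpl::Kokkos;
}
```

Checking the scratch fit first is consistent with the benchmark table, where full_sort_with_scratch is the fastest option for every size it can handle.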
keichi merged commit 8f64f0b into master on Jan 11, 2026
11 checks passed
keichi deleted the cpu-sort branch on January 11, 2026 14:15