Add CPU sort functions and optimize GPU radix sort with CUB by keichi · Pull Request #81 · keichi/kEDM

keichi · 2026-01-11T13:21:02Z

Summary

Add CPU versions of sort functions (full_sort_cpu, partial_sort_cpu) using std::sort/std::partial_sort
Replace custom GPU radix sort implementation with cub::DeviceSegmentedRadixSort for significantly better performance
Add full_sort_with_scratch declaration to header and a -s/--scratch-sort benchmark option
Automatically select optimal sort implementation in ccm based on data size and available scratch memory

Performance

Benchmark results on GPU (time in ms, N×N matrix):

N	full_sort (Kokkos)	full_sort_with_scratch	full_sort_radix (CUB)
1,000	1.35	0.36	1.63
2,000	11.34	1.37	3.60
5,000	99.19	12.12	15.44
10,000	434.18	❌ memory limit	49.38
20,000	1,823.33	❌ memory limit	182.51

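For small sizes (N ≤ 5,000): full_sort_with_scratch is fastest
For large sizes (N ≥ 10,000): full_sort_with_scratch exceeds the memory limit, and full_sort_radix (CUB) is roughly 9-10x faster than the default Kokkos sort (8.8x at N = 10,000, 10.0x at N = 20,000)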

Test plan

All 23 existing tests pass on CUDA build
Benchmark verified performance improvements

- Add full_sort_cpu and partial_sort_cpu functions
- Use Kokkos::parallel_for with DefaultHostExecutionSpace for row-level parallelism
- Add benchmark options: -c for CPU sort, -f -c for CPU full sort
- Add test cases for both CPU sort functions
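The row-level CPU sort described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the names `RowSortResult`, `full_sort_rows`, and `partial_sort_rows` are hypothetical, the row-major `std::vector<float>` layout is an assumption, and a plain serial loop stands in for the `Kokkos::parallel_for` over rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: sorted values plus the original column index
// of each value, both stored row-major.
struct RowSortResult {
    std::vector<float> values;
    std::vector<int> indices;
};

// Sort the first k entries of every row (k == cols gives a full sort).
// Assumption: `dist` is a dense row-major rows x cols matrix.
static RowSortResult sort_rows_impl(const std::vector<float>& dist,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t k) {
    assert(dist.size() == rows * cols);
    assert(k <= cols);

    RowSortResult out{std::vector<float>(rows * k), std::vector<int>(rows * k)};
    std::vector<int> order(cols);

    // Each row is independent, so this loop is what a parallel_for would split.
    for (std::size_t i = 0; i < rows; ++i) {
        const float* row = dist.data() + i * cols;
        std::iota(order.begin(), order.end(), 0);

        auto by_value = [row](int a, int b) { return row[a] < row[b]; };
        if (k == cols) {
            std::sort(order.begin(), order.end(), by_value);
        } else {
            // Only the k smallest entries end up sorted at the front.
            std::partial_sort(order.begin(), order.begin() + k, order.end(),
                              by_value);
        }

        for (std::size_t j = 0; j < k; ++j) {
            out.values[i * k + j] = row[order[j]];
            out.indices[i * k + j] = order[j];
        }
    }
    return out;
}

RowSortResult full_sort_rows(const std::vector<float>& dist,
                             std::size_t rows, std::size_t cols) {
    return sort_rows_impl(dist, rows, cols, cols);
}

RowSortResult partial_sort_rows(const std::vector<float>& dist,
                                std::size_t rows, std::size_t cols,
                                std::size_t k) {
    return sort_rows_impl(dist, rows, cols, k);
}
```

Sorting an index array rather than the values themselves keeps the original column of each neighbor available, which is the usual reason to sort a distance matrix row by row.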

Implement full_sort_radix using LSD (Least Significant Digit) radix sort with 4 passes (8 bits per pass). This provides ~10x speedup over the existing bitonic sort for large arrays.

- Add full_sort_radix function using global memory double-buffering
- Add -r/--radix-sort option to partial-sort-bench
- Add test case comparing against std::stable_sort
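The 4-pass, 8-bit LSD scheme with double-buffering can be illustrated on the CPU. The sketch below is not the GPU kernel from this commit: `radix_sort_pairs` and `float_to_key` are hypothetical names, the serial loops replace the GPU parallelism, and the float-to-key mapping assumes keys are non-negative (as distances are), which is an assumption about the data rather than something the commit states.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret a float as a 32-bit key. Assumption: x is non-negative and
// not -0.0f, in which case IEEE-754 bit patterns order the same as values.
std::uint32_t float_to_key(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Stable LSD radix sort of (key, value) pairs: 4 passes, 8 bits per pass.
void radix_sort_pairs(std::vector<std::uint32_t>& keys, std::vector<int>& vals) {
    assert(keys.size() == vals.size());
    const std::size_t n = keys.size();

    // Second buffer for double-buffering: each pass reads one buffer and
    // scatters into the other, then the two are swapped.
    std::vector<std::uint32_t> keys_alt(n);
    std::vector<int> vals_alt(n);

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = 8 * pass;

        // 1. Histogram of the current 8-bit digit.
        std::array<std::size_t, 256> count{};
        for (std::size_t i = 0; i < n; ++i) {
            ++count[(keys[i] >> shift) & 0xFFu];
        }

        // 2. Exclusive prefix sum turns counts into starting offsets.
        std::size_t offset = 0;
        for (std::size_t& c : count) {
            const std::size_t bucket_size = c;
            c = offset;
            offset += bucket_size;
        }

        // 3. Stable scatter into the alternate buffer.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = count[(keys[i] >> shift) & 0xFFu]++;
            keys_alt[dst] = keys[i];
            vals_alt[dst] = vals[i];
        }

        keys.swap(keys_alt);
        vals.swap(vals_alt);
    }
    // After the final swap, the sorted data is back in keys/vals.
}
```

Because every pass is stable, equal keys keep their input order, which is why the commit's test can compare against std::stable_sort.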

- Use cub::DeviceSegmentedRadixSort::SortPairs for efficient GPU sorting
- Throw exception when CUDA is not enabled
- Add full_sort_with_scratch declaration to header
- Add -s/--scratch-sort option to benchmark for comparing implementations

- Rename original full_sort to full_sort_kokkos
- Add full_sort wrapper that dispatches to:
  - full_sort_with_scratch if data fits in scratch memory
  - full_sort_radix (CUB) on CUDA builds otherwise
  - full_sort_kokkos on non-CUDA builds otherwise
- Simplify ccm function to use full_sort wrapper
- Add -K/--kokkos-sort option to benchmark
- Update tests to use full_sort wrapper
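The dispatch order listed above can be written as a small decision function. This is only a sketch of the selection logic: `SortImpl` and `choose_full_sort` are hypothetical names, and the footprint estimate (one float value plus one int index per row element) is an assumption about what full_sort_with_scratch keeps in scratch memory, not a detail taken from the commit.

```cpp
#include <cstddef>

// Hypothetical labels for the three implementations named in the commit.
enum class SortImpl { Scratch, RadixCub, Kokkos };

// Pick an implementation for rows of `row_len` elements.
// Assumption: the scratch variant needs one row of (float value, int index)
// pairs to fit within `scratch_bytes`.
SortImpl choose_full_sort(std::size_t row_len, std::size_t scratch_bytes,
                          bool cuda_enabled) {
    const std::size_t bytes_per_row = row_len * (sizeof(float) + sizeof(int));

    if (bytes_per_row <= scratch_bytes) {
        return SortImpl::Scratch;  // fastest when the row fits in scratch
    }
    // Row too large for scratch: CUB radix sort on CUDA builds,
    // otherwise fall back to the Kokkos sort.
    return cuda_enabled ? SortImpl::RadixCub : SortImpl::Kokkos;
}
```

With an illustrative 48 KiB scratch budget (a made-up figure for this example), `choose_full_sort(5000, 48 * 1024, true)` selects the scratch variant, while a row length of 10,000 falls through to the CUB or Kokkos path depending on `cuda_enabled`.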

keichi added 4 commits January 8, 2026 14:54

keichi force-pushed the cpu-sort branch from 6989628 to 1072102 Compare January 11, 2026 14:10

keichi merged commit 8f64f0b into master Jan 11, 2026
11 checks passed

keichi deleted the cpu-sort branch January 11, 2026 14:15
