
Optimize partial_sort in CCM #80

Merged
keichi merged 5 commits into master from opt-partial-sort on Jan 8, 2026

Conversation

keichi (Owner) commented Jan 8, 2026

Summary

  • Simplify pivot computation by using bit manipulation instead of parallel_reduce
  • Use parallel_for to reset histogram bins for better GPU performance
  • Sort top-k elements in scratch memory before copying to global memory
  • Add partial-sort-bench benchmark for performance measurement
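The radix-select idea behind the summary above can be sketched serially. This is a minimal illustration, not the code in this PR: it assumes 32-bit unsigned keys, 8-bit digits, and that the k smallest keys are wanted. The name `radix_select_top_k` is hypothetical, and a plain `std::vector` stands in for the scratch memory used by the real Kokkos kernel.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the k smallest keys in ascending order.
std::vector<std::uint32_t> radix_select_top_k(
    const std::vector<std::uint32_t>& keys, std::size_t k) {
    k = std::min(k, keys.size());
    if (k == 0) return {};

    std::uint32_t prefix = 0;   // bits of the k-th smallest key found so far
    std::uint32_t mask = 0;     // bit positions of `prefix` already determined
    std::size_t remaining = k;  // rank of the target among keys matching `prefix`

    // Process one 8-bit digit per pass, most significant digit first.
    for (int shift = 24; shift >= 0; shift -= 8) {
        // Histogram of the current digit over keys that still match the prefix.
        std::array<std::size_t, 256> hist{};
        for (std::uint32_t key : keys) {
            if ((key & mask) == prefix) ++hist[(key >> shift) & 0xFFu];
        }
        // Walk the bins until the one holding the target rank is reached.
        std::uint32_t digit = 0;
        while (hist[digit] < remaining) {
            remaining -= hist[digit];
            ++digit;
        }
        prefix |= digit << shift;
        mask |= 0xFFu << shift;
    }

    // `prefix` now equals the k-th smallest key. Collect everything strictly
    // smaller, pad with copies of the k-th key to cover ties, then sort the
    // small buffer before handing it back.
    std::vector<std::uint32_t> top;
    top.reserve(k);
    for (std::uint32_t key : keys) {
        if (key < prefix) top.push_back(key);
    }
    top.resize(k, prefix);
    std::sort(top.begin(), top.end());
    return top;
}
```

Only the final k-element buffer is sorted, which is why the cost stays far below that of a full sort when k is much smaller than N.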

Benchmark Results (dango, N=10000, k=21)

GPU (RTX 3090)

| Version | partial_sort |
| --- | --- |
| Before optimization | 10.45 ms |
| After optimization | 9.30 ms |
| Improvement | 11% |

CPU (OpenMP)

| Function | Time |
| --- | --- |
| partial_sort | 56.8 ms |
| full_sort | 1467.8 ms |
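For readers who want to reproduce the shape of this comparison without building CCM, the following is a rough stand-in that uses only the C++ standard library. It is not the partial-sort-bench program added in this PR: `time_ms`, `make_input`, `BenchResult`, and `run_bench` are hypothetical names, and the sketch sorts a single array, so its timings are not comparable to the figures above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Runs fn once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical input generator: n pseudo-random floats in [0, 1), fixed seed.
std::vector<float> make_input(std::size_t n, std::uint32_t seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> values(n);
    for (float& x : values) x = dist(gen);
    return values;
}

struct BenchResult {
    double partial_ms;  // time to order only the k smallest elements
    double full_ms;     // time to sort the whole array
    bool same_top_k;    // both approaches agree on the first k elements
};

// Assumes k <= n.
BenchResult run_bench(std::size_t n, std::size_t k) {
    const std::vector<float> input = make_input(n);
    std::vector<float> a = input;
    std::vector<float> b = input;
    const auto mid = a.begin() + static_cast<std::ptrdiff_t>(k);

    BenchResult result{};
    result.partial_ms = time_ms([&] { std::partial_sort(a.begin(), mid, a.end()); });
    result.full_ms = time_ms([&] { std::sort(b.begin(), b.end()); });
    result.same_top_k = std::equal(a.begin(), mid, b.begin());
    return result;
}
```

Calling `run_bench(10000, 21)` mirrors the N and k used above; checking `same_top_k` before reading the timings guards against comparing two routines that disagree on the result.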

Test plan

  • All C++ tests pass on CPU
  • All C++ tests pass on GPU

keichi added 5 commits January 8, 2026 01:31
Document the radix select algorithm steps for finding top-k elements.
- Replace parallel_reduce with direct bit manipulation to compute pivot
  (set undetermined bits to 1 to get upper bound of k-th element)
- Sort top-k in scratch memory before copying to global memory
- Merge two parallel_for loops into one
- Remove unused find_result struct and reduction_identity specialization
Benchmark for measuring partial_sort and full_sort performance in CCM.
Replace Kokkos::single with parallel_for for better GPU performance.
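The commit note about setting undetermined bits to 1 can be read as follows: once radix select has fixed some leading bits of the k-th element, filling every remaining bit with 1 gives the largest key that shares those leading bits, so no reduction over the data is needed to obtain an upper bound. The sketch below is one interpretation under that assumption; `pivot_upper_bound` and its parameters are hypothetical names, and 32-bit unsigned keys are assumed.

```cpp
#include <cstdint>

// Hypothetical helper. `prefix` holds the bits determined so far and
// `determined_mask` has a 1 in every bit position that is already fixed.
// Any key matching the prefix on those positions is <= the returned value.
constexpr std::uint32_t pivot_upper_bound(std::uint32_t prefix,
                                          std::uint32_t determined_mask) {
    return prefix | ~determined_mask;
}

// With only the top byte fixed to 0x12, the bound is 0x12FFFFFF.
static_assert(pivot_upper_bound(0x12000000u, 0xFF000000u) == 0x12FFFFFFu,
              "undetermined bits are filled with ones");
// When every bit is determined, the bound is the k-th element itself.
static_assert(pivot_upper_bound(0xDEADBEEFu, 0xFFFFFFFFu) == 0xDEADBEEFu,
              "fully determined prefix is returned unchanged");
```

Because the bound is computed from two integers alone, it costs a couple of bitwise operations, whatever the input size.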
keichi merged commit cc27c3c into master on Jan 8, 2026
11 checks passed
keichi deleted the opt-partial-sort branch on January 8, 2026 02:21