Optimizations for index_select_scalar_cumsum_kernel by amd-wsung102 · Pull Request #137 · ROCm/FBGEMM

amd-wsung102 · 2025-12-16T20:36:18Z

Optimization Changes

Vectorized gather/store on ROCm: process 4/2/1 indices per thread, improving memory coalescing and bandwidth
Fast path for single-block workloads, which uses a simple block scan and early return
Multi-block cumsum uses a shared block prefix, reducing global synchronization overhead
Auto-tuned vector width and block entry count at launch time to match indices length
Guarded new ROCm changes using #ifdef USE_ROCM
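
As a rough illustration of the launch-time tuning listed above, the host-side selection might look like the sketch below. This is not the PR's code: `LaunchConfig`, `pick_entries_per_thread`, `make_launch_config`, the 256-thread block size, and the thresholds are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical launch configuration; field names are illustrative only.
struct LaunchConfig {
  int entries_per_thread;  // 4, 2 or 1 indices handled by each thread
  int entries_per_block;   // threads_per_block * entries_per_thread
  int grid_size;           // number of blocks; 0 means nothing to launch
  bool multi_block;        // false selects the single-block fast path
};

// Pick the widest per-thread width that still gives every thread of one
// block something to do (assumed heuristic, not taken from the PR).
inline int pick_entries_per_thread(int64_t num_indices, int threads_per_block) {
  if (num_indices >= 4LL * threads_per_block) return 4;
  if (num_indices >= 2LL * threads_per_block) return 2;
  return 1;
}

inline LaunchConfig make_launch_config(int64_t num_indices,
                                       int threads_per_block = 256) {
  assert(num_indices >= 0 && threads_per_block > 0);
  LaunchConfig cfg{1, threads_per_block, 0, false};
  if (num_indices == 0) {
    return cfg;  // empty input: grid_size stays 0, caller returns early
  }
  cfg.entries_per_thread =
      pick_entries_per_thread(num_indices, threads_per_block);
  cfg.entries_per_block = threads_per_block * cfg.entries_per_thread;
  // Ceiling division: enough blocks to cover every index.
  cfg.grid_size = static_cast<int>(
      (num_indices + cfg.entries_per_block - 1) / cfg.entries_per_block);
  cfg.multi_block = cfg.grid_size > 1;
  return cfg;
}
```

With these assumed thresholds, short index lists fall back to one entry per thread and a single block, while longer lists use the 4-wide path and multiple blocks.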

Test Result

Reduced the duration of index_select_scalar_cumsum_kernel by 1.11 us, yielding a 1.3x speedup.
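
The two figures together imply approximate before and after durations. This is a back-of-the-envelope derivation, assuming the 1.3x figure is the ratio of the old duration to the new one and that both numbers are rounded; the PR does not state the absolute timings.

```latex
\begin{aligned}
\frac{t_{\text{old}}}{t_{\text{old}} - 1.11\,\mu\text{s}} &= 1.3 \\
t_{\text{old}} &= \frac{1.3 \times 1.11\,\mu\text{s}}{0.3} \approx 4.81\,\mu\text{s} \\
t_{\text{new}} &= t_{\text{old}} - 1.11\,\mu\text{s} \approx 3.70\,\mu\text{s}
\end{aligned}
```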

Test Plan

Unit test passed

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

avbokovoy

Minor tweaks are needed, but overall LGTM

fbgemm_gpu/include/fbgemm_gpu/utils/inclusive_sum_scan.cuh

avbokovoy · 2025-12-17T09:14:08Z

fbgemm_gpu/src/jagged_tensor_ops/keyed_jagged_index_select_dim1.cu

-    auto grid_size = cuda_calc_xblock_count(
+    int grid_size = 0;
+#ifdef USE_ROCM
+    constexpr int VEC = 4;


We can pass that to the kernel as a template parameter with a default value for easier tweaking if needed. Also, the variable name VEC is not self-explanatory.

amd-wsung102 · Dec 17, 2025

I renamed VEC to ENTRIES_PER_THREAD so that it is more intuitive. It was originally passed to the kernel as a new template parameter, but Li said to avoid changing the template and kernel API. Should I change it back to a template parameter?

amd-wsung102 · Dec 17, 2025

ENTRIES_PER_THREAD (previously VEC) is now passed to the kernel as a template parameter.

avbokovoy · Dec 18, 2025

@liligwu Could you elaborate on why we should avoid changing the template and kernel API?
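
For readers following this thread, the suggestion of a template parameter with a default value can be sketched as below. This is a CPU stand-in, not the kernel from the PR: `gather_entries` and `gather_all` are hypothetical names, and on the GPU the first function would be a `__global__` function template indexed by the real thread ID.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU stand-in for one GPU thread of a gather kernel (hypothetical).
// ENTRIES_PER_THREAD defaults to 4, so call sites that do not name it
// keep compiling unchanged.
template <typename index_t, int ENTRIES_PER_THREAD = 4>
void gather_entries(const std::vector<index_t>& input,
                    const std::vector<int64_t>& indices,
                    std::vector<index_t>& output,
                    int64_t thread_id) {
  static_assert(ENTRIES_PER_THREAD == 1 || ENTRIES_PER_THREAD == 2 ||
                    ENTRIES_PER_THREAD == 4,
                "only widths 1, 2 and 4 are supported in this sketch");
  const int64_t n = static_cast<int64_t>(indices.size());
  const int64_t base = thread_id * ENTRIES_PER_THREAD;
  // The trip count is a compile-time constant, so the loop can be unrolled.
  for (int i = 0; i < ENTRIES_PER_THREAD; ++i) {
    const int64_t pos = base + i;
    if (pos < n) {  // tail guard for the last, partially filled thread
      assert(indices[pos] >= 0 &&
             indices[pos] < static_cast<int64_t>(input.size()));
      output[pos] = input[indices[pos]];
    }
  }
}

// Hypothetical driver that runs every "thread" sequentially on the CPU.
template <typename index_t, int ENTRIES_PER_THREAD = 4>
std::vector<index_t> gather_all(const std::vector<index_t>& input,
                                const std::vector<int64_t>& indices) {
  std::vector<index_t> output(indices.size());
  const int64_t n = static_cast<int64_t>(indices.size());
  const int64_t num_threads = (n + ENTRIES_PER_THREAD - 1) / ENTRIES_PER_THREAD;
  for (int64_t t = 0; t < num_threads; ++t) {
    gather_entries<index_t, ENTRIES_PER_THREAD>(input, indices, output, t);
  }
  return output;
}
```

A call such as `gather_all<int>(input, indices)` uses the default width, while `gather_all<int, 1>(input, indices)` overrides it for tuning; both produce the same output.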

fbgemm_gpu/src/jagged_tensor_ops/keyed_jagged_index_select_dim1.cu

fbgemm_gpu/include/fbgemm_gpu/utils/inclusive_sum_scan.cuh

aryaman-gupta

I took a glance at the code and left a couple of comments. Looks good to me otherwise

fbgemm_gpu/include/fbgemm_gpu/utils/inclusive_sum_scan.cuh

aryaman-gupta · 2025-12-18T18:06:31Z

fbgemm_gpu/src/jagged_tensor_ops/keyed_jagged_index_select_dim1.cu

+   }
+
+   // Faster path for single block
+   if (!multi_block) {


As in the other file, you may consider passing multi_block as a compile-time parameter or splitting the function and dispatching the appropriate one at runtime.

amd-wsung102 · Dec 18, 2025

Got it. I will try it and measure the results.
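
The reviewer's suggestion, making `multi_block` a compile-time parameter and dispatching at runtime, could look roughly like the sketch below. It is a sequential CPU stand-in under stated assumptions: `inclusive_cumsum` and `cumsum_dispatch` are hypothetical names, and the loop over tiles replaces the cross-block synchronization a real kernel would need.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// CPU stand-in for the cumsum kernel (hypothetical). kMultiBlock is a
// template parameter, so each instantiation branches on a compile-time
// constant and the compiler can drop the unused path.
template <bool kMultiBlock>
void inclusive_cumsum(const std::vector<int64_t>& in,
                      std::vector<int64_t>& out,
                      int64_t entries_per_block) {
  const int64_t n = static_cast<int64_t>(in.size());
  if (!kMultiBlock) {
    // Fast path: a single block scans everything and returns early.
    int64_t running = 0;
    for (int64_t i = 0; i < n; ++i) {
      running += in[i];
      out[i] = running;
    }
    return;
  }
  // Multi-block path: each tile is scanned locally, then offset by the
  // prefix carried over from all earlier tiles.
  int64_t block_prefix = 0;
  for (int64_t start = 0; start < n; start += entries_per_block) {
    const int64_t end = std::min(start + entries_per_block, n);
    int64_t local = 0;
    for (int64_t i = start; i < end; ++i) {
      local += in[i];
      out[i] = block_prefix + local;
    }
    block_prefix += local;
  }
}

// Runtime dispatch: decide once on the host, then call the matching
// compile-time specialization.
inline std::vector<int64_t> cumsum_dispatch(const std::vector<int64_t>& in,
                                            int64_t entries_per_block) {
  assert(entries_per_block > 0);
  std::vector<int64_t> out(in.size());
  const bool multi_block = static_cast<int64_t>(in.size()) > entries_per_block;
  if (multi_block) {
    inclusive_cumsum<true>(in, out, entries_per_block);
  } else {
    inclusive_cumsum<false>(in, out, entries_per_block);
  }
  return out;
}
```

Both specializations must produce the same prefix sums; the split only moves the branch out of the kernel body, which is why measuring it, as the reply says, is the sensible next step.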

root added 4 commits December 16, 2025 20:13

Optimizations for index_select kernel

dbab5d3

Optimizations for index_select kernel

fe9bfbc

Optimizations for index_select kernel

782bf91

Updated cumsum

b9f8625

amd-wsung102 self-assigned this Dec 16, 2025

amd-wsung102 requested review from aryaman-gupta, liligwu and shbiswas834 December 16, 2025 20:39

root added 5 commits December 16, 2025 21:11

Updated template for kernel API

3ef975b

Updated formatting

20d1445

Updated VEC and formatting

27e5537

Updated formatting

2182e0b

Updated formatting

fe929e5

amd-wsung102 requested a review from avbokovoy December 16, 2025 23:07

avbokovoy approved these changes Dec 17, 2025

root added 4 commits December 17, 2025 17:57

Addressing PR comments

489e4b6

Added back VEC templateparameter

6083e78

Fixed formatting

1a75460

Removed volatile and added back spin-lock

2d160b2

aryaman-gupta reviewed Dec 18, 2025

root added 3 commits January 22, 2026 20:47

Fixed grid_size is 0 and added early return

8fe70ea

Cleaner separation between ROCm and CUDA paths

200de28

Changed variable name grid_size to rocm_grid_size for better clarity

af5ccbc

amd-wsung102 wants to merge 16 commits into will/upstream from fbgemm_opt

amd-wsung102 commented Dec 16, 2025 • edited by aryaman-gupta

3 participants
