Skip to content

Optimizations for index_select_scalar_cumsum_kernel#137

Open
amd-wsung102 wants to merge 16 commits intowill/upstreamfrom
fbgemm_opt
Open

Optimizations for index_select_scalar_cumsum_kernel#137
amd-wsung102 wants to merge 16 commits intowill/upstreamfrom
fbgemm_opt

Conversation

@amd-wsung102
Copy link

@amd-wsung102 amd-wsung102 commented Dec 16, 2025

Optimization Changes

  • Vectorized gather/store on ROCm: process 4/2/1 indices per thread, improving memory coalescing and bandwidth
  • Fast path for single-block workloads, which uses a simple block scan and early return
  • Multi-block cumsum uses a shared block prefix, reducing global synchronization overhead
  • Auto-tuned vector width and block entry count at launch time to match indices length
  • Guarded new ROCm changes using #ifdef USE_ROCM

Test Result

Reduced the duration of index_select_scalar_cumsum_kernel by 1.11 us, yielding a 1.3x speedup.

Test Plan

Unit test passed

image

Submission Checklist

Copy link

@avbokovoy avbokovoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweaks are needed, but overall LGTM

auto grid_size = cuda_calc_xblock_count(
int grid_size = 0;
#ifdef USE_ROCM
constexpr int VEC = 4;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass that to the kernel as a template parameter with default value for easier tweaking if needed. Also VEC variable name is not self-explanatory

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name of VEC to ENTRIES_PER_THREAD so it sounds more intuitive. It was originally passed to the kernel as a new template parameter, but Li said to avoid changing the template and kernel API. Should I change it back to a template parameter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENTRIES_PER_THREAD (previously VEC) is now passed to the kernel as a template parameter.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liligwu Could you elaborate why we should avoid changing template and kernel API?

Copy link

@aryaman-gupta aryaman-gupta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a glance at the code and left a couple of comments. Looks good to me otherwise

}

// Faster path for single block
if (!multi_block) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the other file, you may consider passing multi_block as a compile-time parameter or splitting the function and dispatching the appropriate one at runtime.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I will be sure to try it to test its results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants