Optimize group_index_select_or_add_2d_kernel by adding a separate codepath for small embedding dimensions #135

Open
aryaman-gupta wants to merge 7 commits into main_12162025_upstream from aryaman/group-index-subwarp

aryaman-gupta commented Dec 16, 2025

This PR improves the performance of group_index_select_or_add_2d_kernel on tables with small embedding dimensions (i.e., small num_cols).

For tables with small embedding dimensions, the kernel now processes multiple rows within the same warp instead of a single row. Two files are changed:

  1. fbgemm_gpu/src/sparse_ops/sparse_ops_gpu.cpp - The host-side calculation of warp_offsets is changed.
  2. fbgemm_gpu/src/sparse_ops/sparse_group_index.cu - group_index_select_or_add_2d_kernel is modified to process multiple rows within a warp for small embedding dimensions.
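
To illustrate why the host-side warp_offsets calculation has to change along with the kernel, here is a minimal C++ sketch of the idea. It is not the code in this PR: the names `kWarpSize`, `Table`, `rows_per_warp`, and `compute_warp_offsets` are hypothetical, the warp width of 32 is an assumption, and rounding the per-row lane count up to a power of two is just one plausible way to split a warp into sub-warps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 32 lanes per warp. Other hardware may use a different width.
constexpr int kWarpSize = 32;

// Hypothetical description of one table in the group.
struct Table {
  int64_t num_rows;
  int num_cols;  // embedding dimension
};

// Number of rows a single warp handles. For num_cols >= kWarpSize this is 1.
// For smaller num_cols, the warp is split into equal sub-warps whose width is
// num_cols rounded up to a power of two (an assumption made for this sketch).
int rows_per_warp(int num_cols) {
  assert(num_cols > 0);
  if (num_cols >= kWarpSize) {
    return 1;
  }
  int lanes_per_row = 1;
  while (lanes_per_row < num_cols) {
    lanes_per_row *= 2;
  }
  return kWarpSize / lanes_per_row;
}

// Exclusive prefix sum of the number of warps each table needs.
// offsets[i] is the first warp assigned to table i; offsets.back() is the total.
std::vector<int64_t> compute_warp_offsets(const std::vector<Table>& tables) {
  std::vector<int64_t> offsets(tables.size() + 1, 0);
  for (std::size_t i = 0; i < tables.size(); ++i) {
    const Table& t = tables[i];
    int64_t warps = 0;
    if (t.num_cols >= kWarpSize) {
      // Wide rows: one or more warps per row.
      const int64_t warps_per_row = (t.num_cols + kWarpSize - 1) / kWarpSize;
      warps = t.num_rows * warps_per_row;
    } else {
      // Narrow rows: several rows share one warp, so fewer warps are needed.
      const int rpw = rows_per_warp(t.num_cols);
      warps = (t.num_rows + rpw - 1) / rpw;
    }
    offsets[i + 1] = offsets[i] + warps;
  }
  return offsets;
}
```

Under these assumptions, a table with 100 rows and num_cols = 8 needs 25 warps rather than 100, because four rows share each warp. The kernel and the host code must agree on this mapping, which is presumably why both files change together.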

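Benchmark results:

Benchmark 1:

|           | Forward | Backward |
|-----------|---------|----------|
| Baseline  | 10 ms   | 14 ms    |
| Optimized | 7 ms    | 10.5 ms  |

Benchmark 2:

|           | Forward | Backward |
|-----------|---------|----------|
| Baseline  | 3 ms    | 6 ms     |
| Optimized | 2 ms    | 5 ms     |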
