
Split Kernel Optimization for group_index_select_or_add_2d_kernel#142

Open
aryaman-gupta wants to merge 27 commits into main_12162025_upstream from aryaman/group-index-optimizations

Conversation

@aryaman-gupta

Overview

Optimizes group_index_select operations on ROCm by splitting tables into small and large embedding groups and launching specialized kernels for each.

Implementation

Introduces compile-time template parameter USE_SMALL_EMB_DIM that splits kernel execution:

  • Small embeddings (num_cols < cols_per_warp): Pack multiple rows per warp for better occupancy
  • Large embeddings (num_cols >= cols_per_warp): Standard one-or-more warps per row
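
The split criterion above can be sketched as a small host-side helper. This is a hypothetical illustration (the function name and signature are not from the PR; the real code derives `cols_per_warp` from the warp size and `UNROLL_FACTOR`):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the warp-accounting logic described above.
// Small tables pack several rows into one warp; large tables use one
// or more warps per row.
int64_t warps_needed(int64_t num_rows, int64_t num_cols, int64_t cols_per_warp) {
  if (num_cols < cols_per_warp) {
    // Small embedding: each warp covers rows_per_warp rows.
    const int64_t rows_per_warp = cols_per_warp / num_cols;
    return (num_rows + rows_per_warp - 1) / rows_per_warp;
  }
  // Large embedding: each row needs ceil(num_cols / cols_per_warp) warps.
  const int64_t warps_per_row = (num_cols + cols_per_warp - 1) / cols_per_warp;
  return num_rows * warps_per_row;
}
```

With, say, 8-column tables and 32 lanes of work per warp, four rows share a warp instead of each row leaving most lanes idle.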

Tables are separated into two groups during host-side processing, then launched with two separate kernel invocations.

Why It Works

  1. Independent use_var_cols tracking: Each group (small/large) maintains its own use_var_cols flag. If all small embeddings have identical dimensions, the small kernel skips variable column logic entirely, even when large embeddings vary.

  2. Compact kernel generation: Compile-time specialization eliminates runtime branching between small/large paths, producing tighter kernels.
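
The independent `use_var_cols` tracking in point 1 can be sketched as follows (struct and function names are illustrative, not the PR's actual bookkeeping):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: track a use_var_cols flag independently for the
// small and the large group while splitting tables.
struct SplitFlags {
  bool use_var_cols_small = false;
  bool use_var_cols_large = false;
};

SplitFlags split_flags(const std::vector<int64_t>& num_cols_per_table,
                       int64_t cols_per_warp) {
  SplitFlags f;
  int64_t prev_small = -1, prev_large = -1;
  for (int64_t c : num_cols_per_table) {
    if (c < cols_per_warp) {
      if (prev_small != -1 && c != prev_small) f.use_var_cols_small = true;
      prev_small = c;
    } else {
      if (prev_large != -1 && c != prev_large) f.use_var_cols_large = true;
      prev_large = c;
    }
  }
  return f;
}
```

So a group of tables with dims {8, 8, 64, 128} keeps the fast fixed-column path for the small kernel even though the large group varies.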

Files Modified

Kernel Implementation (sparse_group_index.cu)

  • Added USE_SMALL_EMB_DIM template parameter to the group_index_select_or_add_2d_kernel for compile-time kernel specialization

Host GPU Code (sparse_ops_gpu.cpp)

  • Forward pass: Split tables into small/large groups, allocate dual args_tensor_small/large and saved_data_small/large, launch separate kernels
  • Backward pass: Unpack dual saved data, split gradients into small/large groups, launch separate kernels
  • Returns group_size + 4 tensors on ROCm (2 args_tensors + 2 saved_data) vs group_size + 2 on CUDA
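
The backend-dependent return count reduces to a one-liner; this helper is a hypothetical sketch, not code from the PR:

```cpp
#include <cassert>

// Sketch of the saved-tensor count described above: on ROCm the forward
// pass saves two args_tensors plus two saved_data tensors; on CUDA it
// saves one of each.
int num_saved_tensors(int group_size, bool is_rocm) {
  return group_size + (is_rocm ? 4 : 2);
}
```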

CPU Wrapper (sparse_ops_cpu.cpp)

  • Updated tensor count validation to handle variable returns. The CPU implementation of group_index_select remains unmodified and returns 2 elements, while the ROCm GPU implementation now returns 4.

Python Interface (sparse_ops.py)

  • Added runtime detection (torch.version.hip is not None) to allocate appropriate number of tensors

Header (sparse_ops.h)

  • Added use_small_emb_dim parameter to group_index_select_or_add_cuda function signature to control kernel specialization (conditionally compiled for ROCm only)

aryaman-gupta and others added 27 commits December 12, 2025 15:09
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2206

Pull Request resolved: pytorch#5207

- Revert the changes made in D87922263

Reviewed By: cthi, atalman, huydhn

Differential Revision: D88774663

fbshipit-source-id: ecc0486eb82564ebc31eac503c58a35600816548
@aryaman-gupta aryaman-gupta added the enhancement New feature or request label Feb 2, 2026

@avbokovoy left a comment


The overall logic looks correct to me. I've proposed several changes that should significantly reduce code duplication and diff size, and will probably make a potential CUDA integration easier (at least on the GPU and invoke-function sides).


Discussed this offline. I believe code duplication could be reduced by having a unified kernel template for both the Nvidia and ROCm paths, and using std::visit with a platform-specific std::variant as in https://github.com/ROCm/FBGEMM/pull/139/changes#diff-6f509196a8893b5345f5e615251ce85ea5f575b81c1e9136fff764a899d92562R329-R407. This way we won't need to duplicate the invoke macro, and compilation time/binary size will stay in line with expectations for each platform. It will also make adding new template parameters easier.
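
The std::visit pattern being suggested can be sketched like this. All types here are illustrative stubs, not the actual FBGEMM kernel configs; the idea is that platform-specific launch arguments live in a std::variant and a single visitor dispatches at the invoke site:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>

// Stub configs standing in for platform-specific launch parameters.
struct CudaConfig {};
struct RocmConfig { bool use_small_emb_dim; };

using LaunchConfig = std::variant<CudaConfig, RocmConfig>;

// One visitor selects the specialization; no duplicated invoke macro.
std::string launch(const LaunchConfig& cfg) {
  return std::visit(
      [](const auto& c) -> std::string {
        using T = std::decay_t<decltype(c)>;
        if constexpr (std::is_same_v<T, RocmConfig>) {
          return c.use_small_emb_dim ? "rocm/small" : "rocm/large";
        } else {
          return "cuda";
        }
      },
      cfg);
}
```

Only the variant's alternatives differ per platform, so adding a new template parameter means extending one config struct rather than every macro expansion.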

// All columns are the same
member_id = warp_id / (warps_per_row * num_work_rows);
member_warp_id = warp_id - (member_id * warps_per_row * num_work_rows);
#ifdef USE_ROCM


Do we really need to wrap it in USE_ROCM if we already have an if constexpr (USE_SMALL_EMB_DIM)?

#endif
}

#ifdef USE_ROCM


Comment on lines +122 to +131
int64_t start_row = member_warp_id * rows_per_warp;

// Since we are processing multiple rows within the warp, we need to
// map each lane to a specific row, in addition to the column
const auto local_row = (threadIdx.x * UNROLL_FACTOR) /
num_cols; // the row ID within the set of rows handled by this warp
const auto col_offset = (threadIdx.x * UNROLL_FACTOR) % num_cols;
const int64_t current_row = start_row +
local_row; // the actual row within the table processed by this lane

// local_row may be out of bounds for the last few lanes in the warp if
// [COLS_PER_WARP % num_cols != 0] and we also need to confirm that we are
// within num_work_rows
int local_row = (threadIdx.x * UNROLL_FACTOR) / num_cols; // the row ID within the set of rows handled by this warp
int col_offset = (threadIdx.x * UNROLL_FACTOR) % num_cols;
int64_t current_row = start_row + local_row; // the actual row within the table processed by this lane

// local_row may be out of bounds for the last few lanes in the warp if [COLS_PER_WARP % num_cols != 0]
// and we also need to confirm that we are within num_work_rows


Why did we remove the const qualifier?

const bool use_var_cols) {
const bool use_var_cols
#ifdef USE_ROCM
,const bool use_small_emb_dim


We could force use_small_emb_dim=false on the caller side in the CUDA case and keep the API consistent.

TORCH_CHECK(saved_data_t_large.is_contiguous(), "Tensor saved_data_t_large must be contiguous.");
memcpy(saved_data_t_large.mutable_data_ptr<int64_t>(), saved_data_large, sizeof(saved_data_large));

if(small.count > 0) {


We already have an early-exit condition in group_index_select_or_add_cuda for group_size == 0, so this guard and the one below are redundant.

Comment on lines +421 to +453
#ifdef USE_ROCM
if (num_cols_ < cols_per_warp) {
// Optimization for Small Embedding: Pack multiple rows per warp

if(!first_small_table && num_cols_ != prev_num_cols_small) {
use_var_cols_small = true;
}
first_small_table = false;
prev_num_cols_small = num_cols_;
small.input_ptrs[small.count] = reinterpret_cast<int64_t>(input_contigs[i]->const_data_ptr());
small.output_ptrs[small.count] = reinterpret_cast<int64_t>(output.mutable_data_ptr());
small.indices_ptrs[small.count] = reinterpret_cast<int64_t>(index_contigs[i]->const_data_ptr());
small.num_cols_group[small.count] = num_cols_;
small.warp_offsets_group[small.count] = small.total_warps;
small.total_warps += warps_needed;
small.count++;
} else {
// Standard Embedding: One or more warps per row

if(!first_large_table && num_cols_ != prev_num_cols_large) {
use_var_cols_large = true;
}
first_large_table = false;
prev_num_cols_large = num_cols_;

large.input_ptrs[large.count] = reinterpret_cast<int64_t>(input_contigs[i]->const_data_ptr());
large.output_ptrs[large.count] = reinterpret_cast<int64_t>(output.mutable_data_ptr());
large.indices_ptrs[large.count] = reinterpret_cast<int64_t>(index_contigs[i]->const_data_ptr());
large.num_cols_group[large.count] = num_cols_;
large.warp_offsets_group[large.count] = large.total_warps;
large.total_warps += warps_needed;
large.count++;
}


It seems like this block of code does the same thing in both the if and else branches. Maybe it's a good idea to choose the group (large/small) with a ternary condition and then apply the operation to the chosen group. Something like:

auto& dst_groups = (num_cols_ < cols_per_warp) ? small : large;
auto& first_table = (num_cols_ < cols_per_warp) ? first_small_table : first_large_table;
auto& prev_num_cols = (num_cols_ < cols_per_warp) ? prev_num_cols_small : prev_num_cols_large;
// Proceed using defined variables without if/else
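
This refactor relies on both groups having the same type, so a reference can be bound once and the body written a single time. A minimal compilable sketch, with `Group` as a stand-in for the PR's small/large bookkeeping struct:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the PR's per-group bookkeeping (illustrative only).
struct Group {
  int64_t count = 0;
  int64_t total_warps = 0;
};

void add_table(Group& small, Group& large, int64_t num_cols,
               int64_t cols_per_warp, int64_t warps_needed) {
  // One ternary picks the destination; the update logic exists once.
  Group& dst = (num_cols < cols_per_warp) ? small : large;
  dst.total_warps += warps_needed;
  dst.count++;
}
```

The same trick extends to the first_table/prev_num_cols trackers, as long as each pair has matching types for the reference binding.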

Comment on lines +498 to +512
if(large.count > 0) {
args_tensor_large = args_tensor_large.to(
first_input.device(),
/*non_blocking=*/true);

// Offset raw ptrs in GPU memory
offset_args(
&large.input_ptrs,
&large.output_ptrs,
&large.indices_ptrs,
&large.warp_offsets_group,
&large.num_cols_group,
reinterpret_cast<int64_t*>(args_tensor_large.mutable_data_ptr()),
args_ptrs_offsets);
}


I believe that code blocks like this could be extracted into lambdas or utility functions.
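
A stubbed sketch of that extraction (in the real code the lambda body would move the args tensor to the device and call offset_args; here the transfer is replaced by a flag so the shape of the refactor is visible):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the per-group args bookkeeping (illustrative only).
struct GroupArgs {
  int64_t count = 0;
  bool transferred = false;
};

// One lambda replaces the duplicated "if (count > 0) { transfer + offset }"
// blocks; it is then applied to the small and large groups in turn.
auto transfer_if_nonempty = [](GroupArgs& g) {
  if (g.count > 0) {
    g.transferred = true;  // stand-in for args_tensor.to(...) + offset_args(...)
  }
};
```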

Comment on lines +928 to +936
const auto grad_output_reshaped = grad_output_group[i].reshape({grad_output_group[i].size(0), -1});
const auto num_cols_ = grad_output_reshaped.size(1);
if(num_cols_ < cols_per_warp) {
indices_ptrs_small[idx_small] = reinterpret_cast<int64_t>(index_contigs[i]->const_data_ptr());
idx_small++;
} else {
indices_ptrs_large[idx_large] = reinterpret_cast<int64_t>(index_contigs[i]->const_data_ptr());
idx_large++;
}


We do this reshape-then-pointer-assignment several times across the code. Let's extract it into a helper.

Comment on lines +943 to +983
#ifdef USE_ROCM
// Transfer grad output pointers to GPU
args_tensor_small = args_tensor_small.to(first_indices.device(), /*non_blocking=*/true);
args_tensor_large = args_tensor_large.to(first_indices.device(), /*non_blocking=*/true);

if(count_small > 0) {
group_index_select_or_add_cuda(
args_tensor_small.const_data_ptr<int64_t>(),
args_tensor_small.const_data_ptr<int64_t>() + count_small,
args_tensor_small.const_data_ptr<int64_t>() + 2 * count_small,
warp_offsets_group_small,
num_cols_group_small,
fwd_input.scalar_type(),
first_indices.scalar_type(),
fwd_input.device().index(),
num_input_rows,
total_num_warps_small,
count_small,
/*use_index_select=*/false,
use_var_cols_small,
/*use_small_emb_dim=*/true);
}

if(count_large > 0) {
group_index_select_or_add_cuda(
args_tensor_large.data_ptr<int64_t>(),
args_tensor_large.data_ptr<int64_t>() + count_large,
args_tensor_large.data_ptr<int64_t>() + 2 * count_large,
warp_offsets_group_large,
num_cols_group_large,
fwd_input.scalar_type(),
first_indices.scalar_type(),
fwd_input.device().index(),
num_input_rows,
total_num_warps_large,
count_large,
/*use_index_select=*/false,
use_var_cols_large,
/*use_small_emb_dim=*/false);
}
#else

