
Implement pre-sorting, caching and contiguous warp processing in group_index_select #144

Open
avbokovoy wants to merge 1 commit into abokovoi/upstream from abokovoi/group-index-sort-and-cache-opt

Conversation

@avbokovoy

Follow-up of #139

The differences are:

  1. Reduced #ifdef USE_ROCM usage in favor of if constexpr (OPT_BOOL).
  2. Added a compile-time host-side codegen guard for the kernel (CUDA vs ROCm).
  3. Fixed an issue with the trailing row cache flush.

@avbokovoy avbokovoy self-assigned this Mar 3, 2026

@aryaman-gupta aryaman-gupta left a comment


The PR introduces crucial optimizations for the group_index_select_or_add_2d_kernel. The majority of the code is clean, and the separation of the ROCm and CUDA codepaths has been done well.

Most of these changes were already reviewed in #139. I have left a few comments that I think should be addressed before merging. Some of these are design choices, and the PR could proceed even if the code is not modified.

auto sorted_indices = at::empty_like(contiguous_indices);
auto reverse_indices = at::empty(
    contiguous_indices.sizes(),
    contiguous_indices.options().dtype(at::kLong));


This should be at::kInt, as we had previously discussed: reverse_indices tracks the positions of the elements, and the number of elements is limited to INT_MAX.

Same with original_positions below.

auto res = forward_op.call(
    all_indices_input_tensor, static_cast<int64_t>(group_size));
TORCH_CHECK(res.size() == group_size + 2);
TORCH_CHECK(res.size() == group_size + 4);


We discussed previously that we could keep the CPU pass unchanged, and therefore change this condition to TORCH_CHECK(res.size() >= group_size + 2);. Did you change your mind about that?
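The relaxed check can be sketched as a small predicate (names are illustrative, not from the PR): the CPU path keeps returning group_size + 2 tensors while the GPU path now returns group_size + 4, and a single `>=` comparison covers both layouts.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the proposed
// TORCH_CHECK(res.size() >= group_size + 2): accepts both the
// unchanged CPU result layout (+2) and the new GPU layout (+4).
bool result_size_ok(std::size_t res_size, std::size_t group_size) {
  return res_size >= group_size + 2;
}
```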

// to match return format in CUDA implementation
// (group_size outputs, 1 args_tensor, 1 saved_data)
// (group_size outputs, 1 args_tensor, 1 saved_data, 1 sorted tensor, 1 reverse tensor)
output_group.push_back(at::empty({0}, at::TensorOptions().dtype(at::kLong)));


As in the previous comment, if we decide to keep the CPU path unchanged, these additional tensors should be removed.

// all input size = group_size * 2 (from grads, indices)
// + 1 args_tensor + 1 saved_data + 1 first output
const int64_t group_size = static_cast<int64_t>((all_inputs.size() - 3) / 2);
const int64_t group_size = static_cast<int64_t>((all_inputs.size() - 5) / 2);


Following on the above comments, group_size could then be saved between forward and backward passes as:

ctx->saved_data["group_size"] = group_size;
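A minimal mock of that idea, assuming nothing beyond standard C++ (MockAutogradCtx stands in for the real autograd context): the forward pass stashes group_size, and the backward pass reads it back instead of deriving it from all_inputs.size() with a layout-dependent formula like (n - 3) / 2 vs (n - 5) / 2.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative stand-in for torch::autograd::AutogradContext's
// saved_data map; not the real API.
struct MockAutogradCtx {
  std::map<std::string, int64_t> saved_data;
};

// Forward pass: record group_size once.
void forward_save(MockAutogradCtx& ctx, int64_t group_size) {
  ctx.saved_data["group_size"] = group_size;
}

// Backward pass: recover it without inspecting the input layout.
int64_t backward_group_size(const MockAutogradCtx& ctx) {
  return ctx.saved_data.at("group_size");
}
```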

int64_t warp_offset = 0;
bool use_var_cols = false;

Tensor sorted_indices_storage =


Do we need these tensors in the forward pass when the sorting actually takes place in the backward pass?

Comment on lines +532 to +533
int64_t* warp_offsets_group = reinterpret_cast<int64_t*>(saved_data_ptr[4]);
int32_t* num_cols_group = reinterpret_cast<int32_t*>(saved_data_ptr[5]);


Should be const int64_t*

Comment on lines +155 to +171
case at::ScalarType::Byte:
  dispatch(uint8_t{});
  break;
case at::ScalarType::Char:
  dispatch(int8_t{});
  break;
case at::ScalarType::Short:
  dispatch(int16_t{});
  break;
case at::ScalarType::Int:
  dispatch(int32_t{});
  break;
case at::ScalarType::Long:
  dispatch(int64_t{});
  break;
default:
  TORCH_CHECK(


Indentation could be added as:

      case at::ScalarType::Byte: {
          dispatch(uint8_t{});
          break;
      }

false));
};

switch (scalar_type) {


Can a PyTorch macro like AT_DISPATCH_INTEGRAL_TYPES be used here?
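AT_DISPATCH_INTEGRAL_TYPES would indeed collapse the hand-written switch into one call site with a single generic lambda. A self-contained sketch of the pattern (MockScalarType and dispatch_integral are mocks, not the real ATen macro, which is unavailable here):

```cpp
#include <cstddef>
#include <cstdint>

// Mock of at::ScalarType restricted to the integral tags used above.
enum class MockScalarType { Byte, Char, Short, Int, Long };

// One generic dispatcher standing in for AT_DISPATCH_INTEGRAL_TYPES:
// the lambda is instantiated once per integral type, so the body
// (e.g. a kernel launch) is written once instead of once per case.
template <typename F>
auto dispatch_integral(MockScalarType t, F&& f) {
  switch (t) {
    case MockScalarType::Byte:  return f(uint8_t{});
    case MockScalarType::Char:  return f(int8_t{});
    case MockScalarType::Short: return f(int16_t{});
    case MockScalarType::Int:   return f(int32_t{});
    default:                    return f(int64_t{});
  }
}

// Example payload: report the index width the kernel would use.
inline std::size_t index_width(MockScalarType t) {
  return dispatch_integral(t, [](auto tag) { return sizeof(tag); });
}
```

With the real macro the call site would look like `AT_DISPATCH_INTEGRAL_TYPES(scalar_type, "group_index_select", [&] { /* use scalar_t */ });`, with scalar_t bound inside the lambda.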

for (int i = 0; i < UNROLL_FACTOR && col_offset + i < num_cols; i++) {
  // Compile-time conditional
  if constexpr (USE_INDEX_SELECT) {
    if constexpr (USE_CACHE) {


USE_CACHE is always false for the forward pass. Why not simplify the code by removing the condition?
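The compile-time behavior at issue can be sketched with an illustrative template (flag names follow the PR, but the function and values are made up): instantiating with USE_CACHE = false discards the cache branch at compile time, which is exactly why the condition is redundant in the forward-pass variant.

```cpp
// Hypothetical per-element function mirroring the kernel's flag
// structure; return values are arbitrary markers for the paths.
template <bool USE_INDEX_SELECT, bool USE_CACHE>
int process_element(int v) {
  if constexpr (USE_INDEX_SELECT) {
    if constexpr (USE_CACHE) {
      return v + 100;  // cached path (backward pass only)
    }
    return v + 1;      // uncached path (what the forward pass runs)
  }
  return v - 1;        // index_add path
}
```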

