Meta28 optimization upstream by shbiswas834 · Pull Request #113 · ROCm/FBGEMM

shbiswas834 · 2025-08-18T17:56:04Z

No description provided.

liligwu · 2025-08-18T18:16:28Z

fbgemm_gpu/codegen/training/backward/embedding_backward_split_template.cu


                const bool use_deterministic_algorithms = at::globalContext().deterministicAlgorithms();
-                const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 1024;
+                const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 4096;


We need to guard this change with #ifdef USE_ROCM.
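
A minimal sketch of what such a guard could look like, assuming USE_ROCM is defined by the build system for HIP builds. The helper name pick_max_segment_length_per_cta is hypothetical and only wraps the expression from the diff so that it can be compiled on its own; the template itself computes the value inline.

```cpp
#include <climits>

// Hypothetical helper, not part of FBGEMM: selects the per-CTA segment cap.
// Assumption: USE_ROCM is defined only when building for ROCm/HIP.
inline int pick_max_segment_length_per_cta(bool use_deterministic_algorithms) {
#ifdef USE_ROCM
  // ROCm-only tuning proposed in this PR.
  constexpr int kNonDeterministicCap = 4096;
#else
  // Original value, left untouched for CUDA builds.
  constexpr int kNonDeterministicCap = 1024;
#endif
  // Deterministic mode keeps the unbounded cap on both platforms.
  return use_deterministic_algorithms ? INT_MAX : kNonDeterministicCap;
}
```

With this shape the CUDA path keeps the 1024 cap and only ROCm builds pick up 4096.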

liligwu · 2025-08-18T18:19:21Z

fbgemm_gpu/codegen/training/backward/embedding_backward_split_template.cu

                    // Compute shared memory size for cta_per_row
                    constexpr auto kCacheAccBytes = sizeof(at::acc_type<cache_t, true>);
-                    int32_t num_cta_per_row_groups = kMaxThreads / kWarpSize;
+                    int32_t num_cta_per_row_groups = (kMaxThreads/4) / kWarpSize;


We need to guard this change with #ifdef USE_ROCM.
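
The same guard applied to the group count might be sketched as follows. The function name and its parameters are hypothetical: max_threads and warp_size stand in for the kMaxThreads and kWarpSize constants used in the template, and their values are passed in rather than assumed.

```cpp
#include <cstdint>

// Hypothetical helper, not part of FBGEMM. max_threads and warp_size stand in
// for the kMaxThreads and kWarpSize constants referenced in the diff.
constexpr int32_t pick_num_cta_per_row_groups(int32_t max_threads,
                                              int32_t warp_size) {
#ifdef USE_ROCM
  // ROCm-only: use a quarter of the maximum workgroup size.
  return (max_threads / 4) / warp_size;
#else
  // Original computation, unchanged for CUDA builds.
  return max_threads / warp_size;
#endif
}
```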

liligwu · 2025-08-18T18:19:51Z

fbgemm_gpu/codegen/training/backward/embedding_backward_split_template.cu

                                                      || dev_weights.scalar_type() == at::ScalarType::Float;

-                    if (use_hip_kernel && supported_weights_type && !mixed_D && rocm::is_supported_cdna())
+                    if (use_hip_kernel && !mixed_D && supported_weights_type && rocm::is_supported_cdna())


Does the order matter?

The order doesn't matter.
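
As a sanity check on that answer, here is a small sketch that compares both operand orders over every combination of inputs. It assumes the four conditions are plain booleans with no side effects; the function names are hypothetical, and the supported_cdna parameter stands in for the result of rocm::is_supported_cdna().

```cpp
// Hypothetical stand-ins for the condition before and after the diff.
// Assumption: all four inputs are side-effect-free boolean values.
constexpr bool dispatch_old(bool use_hip_kernel, bool supported_weights_type,
                            bool mixed_D, bool supported_cdna) {
  return use_hip_kernel && supported_weights_type && !mixed_D && supported_cdna;
}

constexpr bool dispatch_new(bool use_hip_kernel, bool supported_weights_type,
                            bool mixed_D, bool supported_cdna) {
  return use_hip_kernel && !mixed_D && supported_weights_type && supported_cdna;
}

// Exhaustively compares both orderings over all 16 input combinations.
constexpr bool orders_agree() {
  for (int bits = 0; bits < 16; ++bits) {
    const bool a = (bits & 1) != 0;
    const bool b = (bits & 2) != 0;
    const bool c = (bits & 4) != 0;
    const bool d = (bits & 8) != 0;
    if (dispatch_old(a, b, c, d) != dispatch_new(a, b, c, d)) {
      return false;
    }
  }
  return true;
}

static_assert(orders_agree(), "operand order must not change the result");
```

Reordering operands of && only changes which operands get evaluated before short-circuiting, so it can affect behavior only when an operand has side effects or is expensive to evaluate.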

liligwu · 2025-08-18T18:21:27Z

fbgemm_gpu/src/tbe/eeg/indices_generator.cpp

  // (vector is very large).
  std::sort(
-      std::execution::par,
+  //    std::execution::par,


Uncomment this line.

kudomcho · 2025-08-18T18:24:08Z

fbgemm_gpu/codegen/training/backward/embedding_backward_split_template.cu


                    const int32_t cta_per_row_grid_size = std::min(
-                        div_round_up(total_unique_indices, kMaxThreads),
+                        div_round_up(total_unique_indices, (kMaxThreads/4)),


Add #ifdef USE_ROCM above this line
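
One way the guarded grid-size computation could be sketched, under stated assumptions: div_round_up below is a local stand-in for the helper used in the template, max_threads stands in for kMaxThreads, and grid_cap is a hypothetical placeholder for the second argument of std::min, which is not visible in the diff excerpt.

```cpp
#include <algorithm>
#include <cstdint>

// Local stand-in for the div_round_up helper used in the template:
// integer division rounded up, for positive operands.
constexpr int32_t div_round_up(int32_t a, int32_t b) {
  return (a + b - 1) / b;
}

// Hypothetical helper, not part of FBGEMM. grid_cap is a placeholder for the
// upper bound passed to std::min in the original code.
inline int32_t pick_cta_per_row_grid_size(int32_t total_unique_indices,
                                          int32_t max_threads,
                                          int32_t grid_cap) {
#ifdef USE_ROCM
  // ROCm-only: each workgroup covers a quarter of max_threads indices.
  const int32_t indices_per_workgroup = max_threads / 4;
#else
  // Original divisor, unchanged for CUDA builds.
  const int32_t indices_per_workgroup = max_threads;
#endif
  return std::min(div_round_up(total_unique_indices, indices_per_workgroup),
                  grid_cap);
}
```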

shbiswas834 · 2025-08-18T22:37:34Z

@liligwu The changes are tested and build successfully. A few backward and forward cases were tested, with no regressions.

avbokovoy

Overall LGTM

avbokovoy · 2025-08-19T08:45:20Z

fbgemm_gpu/cmake/Hip.cmake

  list(APPEND HIP_CXX_FLAGS -mf16c)
  list(APPEND HIP_CXX_FLAGS -mfma)
  list(APPEND HIP_CXX_FLAGS -std=c++20)
+  list(APPEND HIP_CXX_FLAGS -g)


I suggest reverting all changes to this file, since in this context they only increase build time and the size of the diff.

avbokovoy · 2025-08-19T08:47:06Z

fbgemm_gpu/src/tbe/eeg/indices_generator.cpp

  // Now sort the indices by their tags. Use parallel sort for some extra speed
  // (vector is very large).
  std::sort(
-      std::execution::par,


Should be reverted

avbokovoy and others added 11 commits August 18, 2025 17:52

Add gfx950 build support + fp16 fix + index type fix

3942d12

Change int64_t to index_t as template parameters in load_raw_per_warp

7a20f4c

Implement llvm fp16 buffer load for gfx950

1236b3a

Fix c-style half to float cast

3bc1ba8

Patch 256 half stores

6dcd104

cta_per_row workgroup optim

dc3d3e0

Added mi350 guards

c5a6b25

Fix index overflow in row load

1057a22

cta_per_row workgroup reduce by 4 optim

e981269

Fix mixed_D frontend to backend connection

3ef8e56

changed max_segment_length_per_cta to 4096

caf2e9e

shbiswas834 requested review from kudomcho and liligwu August 18, 2025 17:56

liligwu requested changes Aug 18, 2025

kudomcho reviewed Aug 18, 2025

added rocm guards and removed comment

4b841bd

avbokovoy approved these changes Aug 19, 2025

liligwu and others added 12 commits August 20, 2025 03:00

clean debug statements in Hip.cmake

63287f6

Merge pull request #121

b6d76f9

warp per row wg change

Guard f16 llvm intrinsics with ROCm >=7.0

76d0914

fix the bug in dimension 160 in ROCm optimization

a56c299

Cleanup optimized warp_per_raw kernel

54f296d

Add 320 embedding dim support for optimized warp_per_row kernel

b3b9868

changed the max length per warp and cta per row WG size

8ae4724

added DPP and changed max length per warp to 16k

a5f48da

guard max segment warp based on emb dim

a4cceb7

added guarding opt of max segment for the case batch size list=1

8b4f25c

added condition to apply DPP warp reduce sum when emb matches backwar…

bf4769b

…d opt size and adjusted WG size for L=1 on hip split embed kernel

reverted pt2 autograd version back by 1 commit

2c44b8d

root added 2 commits September 19, 2025 16:26

enabled DPP if emb dim in range of backward opt

711e565

opt for grad_indice_weights kernel

af7392a

4 participants