
Implement asynchronous LDS loads for MI350#138

Open
avbokovoy wants to merge 3 commits into abokovoi/upstream from abokovoi/async-lds-inference-opt

Conversation

@avbokovoy

This PR implements direct HBM->LDS stores in the TBE inference kernel. There are two major changes:

  1. Row data isn't loaded in place; instead, we store pointers to global memory and write the actual data into LDS according to the predicate. If the predicate is false, we fall back to a small 16B chunk of static device memory, pre-allocated once and filled with zeros.
  2. HBM->LDS 16B loads are implemented for ROCm >= 7.0 and MI350. We could extend support to MI30* through 4B loads, but that brings no performance benefit because it would introduce address-transposition overhead and 4x more load operations. A reference implementation is available here: pytorch@fe52557.

Because pre-7.2 ROCm lacks the necessary features, we are forced to use inline assembly to get 16B loads working, so manual synchronization was added. For ROCm >= 7.2, we use proper intrinsics to handle memory synchronization.
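The zero-fill fallback described above can be modeled on the host. This is a minimal sketch, not the device code from the PR: `cp_async_zfill_model`, `Uint4`, and `zero_tile` are illustrative names, and `memcpy` stands in for the actual 16B HBM->LDS hardware copy.

```cpp
#include <cstring>
#include <cstdint>

// 16-byte vector, mirroring the uint4 granularity of the LDS loads.
struct alignas(16) Uint4 { uint32_t x, y, z, w; };

// Stands in for the small static device chunk that is pre-allocated
// once and filled with zeros.
static const Uint4 zero_tile = {0, 0, 0, 0};

// Host-side model of the predicated copy: when pred is false, the
// source is redirected to the zero-filled chunk instead of the load
// being skipped, so the destination always holds defined data.
inline void cp_async_zfill_model(Uint4* smem_ptr, const Uint4* global_ptr, bool pred) {
  const Uint4* src = pred ? global_ptr : &zero_tile;
  std::memcpy(smem_ptr, src, sizeof(Uint4));
}
```

The point of redirecting rather than skipping is that the subsequent kernel logic can consume every LDS slot unconditionally, without re-checking the predicate.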

This change brings a ~10% performance boost on average for both weighted and unweighted cases. We may try to push it further by also doing async loads for the indices' weights.

cc: @amirakb89 you might be interested

@avbokovoy avbokovoy self-assigned this Dec 19, 2025
@avbokovoy avbokovoy added the enhancement New feature or request label Dec 19, 2025

@aryaman-gupta aryaman-gupta left a comment


I have taken a look at the PR and identified a couple of areas that look tricky to me. It may make sense to double-check these to confirm that the logic is correct.

__builtin_amdgcn_readfirstlane(hip_cvta_to_shared_address(smem_ptr));
const void *src_ptr = (pred_guard) ? global_ptr : &zero_tile;
asm volatile("s_mov_b32 m0, %0\n"
"global_load_lds_dwordx4 %1, off\n" ::"s"(smem),


This enforces that the entire warp loads a contiguous chunk of memory from global to LDS. What happens when the row is not large enough, i.e., kWarpSize > NumUint4LoadsPerRow? As I understand it, this would assign a different row_load_idx to different lanes in the wavefront:

uint32_t row_load_idx = load_idx % NumUint4LoadsPerRow;

You might want to confirm that this case is correctly handled
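The lane-to-index split under discussion can be sketched as follows. This is a host-side model using names from the quoted snippet; `kWarpSize = 64` matches AMD wavefronts, and `NumUint4LoadsPerRow = 4` (e.g. a 64B row covered by 16B loads) is an assumed illustrative value.

```cpp
#include <cstdint>

constexpr uint32_t kWarpSize = 64;
constexpr uint32_t NumUint4LoadsPerRow = 4;  // assumed: 64B row / 16B per load

struct LaneTarget {
  uint32_t row_idx;       // which row this lane services
  uint32_t row_load_idx;  // which 16B chunk within that row
};

// With kWarpSize > NumUint4LoadsPerRow, consecutive lanes wrap onto
// different rows, so one wavefront services several rows per round.
inline LaneTarget lane_target(uint32_t lane, uint32_t load_round) {
  const uint32_t load_idx = load_round * kWarpSize + lane;
  return {load_idx / NumUint4LoadsPerRow, load_idx % NumUint4LoadsPerRow};
}
```

Under this model, lanes 0-3 target row 0, lanes 4-7 target row 1, and so on, which is exactly the per-lane divergence the comment asks about.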


@avbokovoy avbokovoy Jan 12, 2026


It will load 16 bytes per lane (16 x 64 for the whole wave) from the address held in the corresponding vector register (the address differs from lane to lane) into the LDS pointer with the corresponding strides. Global memory doesn't have to be contiguous. The sanity of the loads is checked outside of this function and is handled with pred_guard. Tail or OOB loads are redirected to the zero_tile global memory chunk, which contains zeros; the kernel logic then handles that case properly.
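The wave-wide behaviour described in this reply can be modeled on the host as a loop over lanes. This is a sketch only: `wave_load_model`, `Vec16B`, and the per-lane pointer array are illustrative, and the sequential loop stands in for the 64 lanes executing in lockstep.

```cpp
#include <cstring>
#include <cstdint>
#include <array>

constexpr int kWaveSize = 64;
struct alignas(16) Vec16B { uint8_t b[16]; };

// Models the pre-zeroed static device chunk used as the fallback.
static const Vec16B zero_tile{};

// Each lane holds its own global pointer (so sources need not be
// contiguous); lanes whose predicate is false are redirected to the
// zero tile, so the wave always writes 64 x 16B of defined data.
inline void wave_load_model(Vec16B* lds,
                            const std::array<const Vec16B*, kWaveSize>& src,
                            const std::array<bool, kWaveSize>& pred) {
  for (int lane = 0; lane < kWaveSize; ++lane) {
    const Vec16B* p = pred[lane] ? src[lane] : &zero_tile;
    std::memcpy(&lds[lane], p, sizeof(Vec16B));
  }
}
```

Note that an invalid lane's source pointer is never dereferenced, which is why OOB or tail addresses are harmless in this scheme.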


asm volatile("cp.async.wait_all;\n" ::);
#elif defined(USE_ROCM) && \
(ROCM_VERSION_MAJOR <= 7 && ROCM_VERSION_MINOR < 2) && defined(__gfx950__)


Does this mean that there is no wait instruction for ROCm >= 7.2? Just wanted to confirm that this is intentional.


@avbokovoy avbokovoy Jan 12, 2026


It should be handled by the intrinsic, which the compiler assumes has a side effect.

Comment on lines +203 to +206
cp_async_zfill_cg<sizeof(uint4)>(
&buffers[warp_idx][i][input_row_idx][row_load_idx + uint4_loads_per_row * packed_bag_load_idx],
&row_v[inner_i][row_load_idx],
final_valid);


It seems to me that in PackedMode, the smem_ptr passed by the different lanes to cp_async_zfill_cg is strided. However, cp_async_zfill_cg uses lane 0's smem_ptr and performs a contiguous copy into that location. This seems suspicious to me, so I wanted to point it out. I suppose you have verified that the logic is correct, @avbokovoy?
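The striding the comment refers to is visible in the LDS offset expression from the quoted hunk. A minimal sketch of that index arithmetic, with an assumed `uint4_loads_per_row = 4` (the names come from the snippet; the value is illustrative):

```cpp
#include <cstdint>

constexpr uint32_t uint4_loads_per_row = 4;  // assumed: 64B row / 16B per load

// Destination slot within the LDS buffer, as in the quoted expression
// row_load_idx + uint4_loads_per_row * packed_bag_load_idx.
// In PackedMode, packed_bag_load_idx differs across lanes, so the
// per-lane destinations are strided rather than contiguous.
inline uint32_t lds_slot(uint32_t row_load_idx, uint32_t packed_bag_load_idx) {
  return row_load_idx + uint4_loads_per_row * packed_bag_load_idx;
}
```

For example, two lanes with the same row_load_idx but packed_bag_load_idx 0 and 1 land `uint4_loads_per_row` slots apart, which is the non-contiguous pattern the reviewer flags.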



I guess this comment applies here as well:
#138 (comment)
