Increment dispatch signal before kernel dispatch in ggml-hsa.cpp by aamarnat · Pull Request #153 · ypapadop-amd/ggml

aamarnat · 2025-11-22T01:04:35Z

Increments the dispatch signal before kernel dispatch in ggml-hsa.cpp. Decrements the signal again if the dispatch fails. Handles dispatches that consist of multiple packets.
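The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in ggml-hsa.cpp: `DispatchSignal`, `SubmitFn`, `dispatch_packets`, and `complete_packet` are hypothetical names, a `std::atomic` counter stands in for the HSA signal so the sketch compiles without the HSA runtime, and stopping at the first failed packet is an assumption about how failures are handled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for an HSA completion signal: counts packets in flight.
// A real backend would hold an HSA signal handle; std::atomic keeps this runnable.
struct DispatchSignal {
    std::atomic<std::int64_t> pending{0};
};

// Hypothetical submission hook: returns false if the packet could not be queued.
using SubmitFn = std::function<bool(int packet_id)>;

// Submits each packet in order. The counter is raised *before* the packet is
// handed over, so a completion can never arrive ahead of its increment.
// If a submission fails, that packet's increment is undone and dispatch stops
// (assumed policy). Returns the number of packets actually queued.
inline std::size_t dispatch_packets(DispatchSignal& signal,
                                    const std::vector<int>& packets,
                                    const SubmitFn& submit) {
    std::size_t queued = 0;
    for (int id : packets) {
        signal.pending.fetch_add(1, std::memory_order_release);      // increment first
        if (!submit(id)) {
            signal.pending.fetch_sub(1, std::memory_order_release);  // roll back on failure
            break;
        }
        ++queued;
    }
    return queued;
}

// Marks one packet as finished (done by the device in a real backend).
inline void complete_packet(DispatchSignal& signal) {
    const std::int64_t before =
        signal.pending.fetch_sub(1, std::memory_order_acq_rel);
    assert(before > 0 && "completion without a matching increment");
    (void)before;  // avoid an unused-variable warning when NDEBUG is defined
}
```

With this shape, a caller that queues several packets for one operation can wait until `pending` returns to zero, and a failed submission leaves the counter reflecting only the packets that were really queued.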

* CMake support for HSA backend
* Stubs for initializing HSA
* Stubs for HSA buffers
* Stubs for HSA and host buffers
* Using new backend CMake declaration
* Additional stubs for HSA backend
* Formatting
* Adding function to track unimplemented APIs

* Adding HSA backend to examples/simple-backend
* Adding HSA backend to backend registration
* Adding Eigen as a temporary matrix mulmat implementation
* Support for device type reporting
* Support for free and total memory reporting
* Properly reporting which kernels are supported

* Correcting comment
* Add description from agent name
* Implementation of backend get_device_description
* Comments
* Adding cpu backend as fallback for all ops
* Marking which functions can be improved and correct guid
* Remove needs-implementation marker on more functions
* Hide cpu backend internally
* Remove extra header

* Add HSA backend to GPT-2 example
* Remove CPU backend from HSA
* Returning that it is host buffer for NPU memory
* Adding CPY kernels and factoring out kernel code
* Formatting, comments
* Temporary storage for cpy
* Extracting supports_op conditions
* Renaming function

* Add operation example
* Using tensor count variable
* Count source tensors and copy name
* Detect if execution failed
* Switching test to int32_t
* add kernel using XRT
* Aligning example size with kernel
* Adding dev heap pool
* Using HSA in add kernel
* Using relaxed write to queue
* Remove XRT dependency
* Size independent test
* Correct elements for kernel
* Moving load functions to common.h
* Using simplified AIE packet
* Moving loading to a kernel registry
* Adding constructor
* Add kernel
* Refactoring add script
* Single name for PDI and instr.txt
* Refactoring
* Generalizing add.py
* Adding dims
* Comments and error checking
* Modularize python script
* CMake kernel generation
* Remove magic numbers and use GGML data type naming
* Adding a structure for NPU kernels and free function
* Accepting only contiguous tensors for now
* Stub for keeping loaded kernel in context
* Passing device info as parameter, renaming contexts for easy filtering
* Renaming variables
* Reworking example
* Using registry of kernels
* Using HSA agent name for kernels
* Using dladdr to get the kernel directory

* Vector add for floats
* Handling higher dims upon load
* Move tensor testing in the operation supports function
* mul_mat kernel compilation
* Smaller gemm
* Renaming args
* Smaller gemm
* Copy instead of moving PDI
* Unify cmake kernel generation functions
* Using latest CMake Python integration
* Missing CMake HSA integration
* Install kernels
* Adding missing dependencies
* Updating test to use HSA conditionally. HSA-specific mul_mat test
* Encoding all dims in kernel filename

* Remove conservative asserts
* Removing cpy kernel. Delegating to the CPU device for supports_op
* Extracting types in example
* Create completely independent fallback graph
* Correct source tensor iteration
* Better messages
* Caching emulated tensors

* Renaming device to arch
* Replace device with arch
* Fix headers
* Use arch in binary_ops
* Info logs only during debug
* Refactor binary_op implementation
* Refactoring unary ops
* Unary ops refactor
* Temporary storage for input conversion
* Adding i16 support for ggml_hsa_assign
* Rename device to arch
* Lower alignment requirements to 64bytes
* Making CoreFunction a dataclass
* Fix typos
* Unary ops simplification
* Adding alignment checks and simplifying tensor creation
* Refactor internal nodes

* Raise exception if module not found
* Moving kernels as generic
* Reorganizing IRON kernels
* Avoiding shadowing function name
* Update README
* Update src/ggml-hsa/kernels/build.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Initial plan
* Add GitHub Copilot instructions for GGML repository
  Co-authored-by: ypapadop-amd <102817138+ypapadop-amd@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ypapadop-amd <102817138+ypapadop-amd@users.noreply.github.com>

* Remove CoreFunction from kernel implementation
* Moving parameters out of CoreFunction
* Per arch num of cols
* Hybrid solution with both CoreFunction and external functions helper
* Moving more out of the CoreFunction factory
* Remove CoreFunction
* Renaming kernel files
* Remove unused variable

* Update documentation with supported configurations
* Update compilation checks
* Update src/ggml-hsa/kernels/binary_ops.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update README on supported NPUs and prerequisites
* Update src/ggml-hsa/README.md
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Moving ops and kernel files registration to build script
* Using single op to kernel map
* Freezing dataclass
* Update src/ggml-hsa/kernels/build.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/ggml-hsa/kernels/build.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Making TensorDesc into a dataclass
* Create TensorDesc from ggml_tensor interface
* Update src/ggml-hsa/kernels/tensor_desc.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Formatting
* Correct TensorDesc missing members
* Update src/ggml-hsa/kernels/build.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Adding alternative data type for members
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

ypapadop-amd added 30 commits November 17, 2025 12:13

Adding agent information to internal datastructures (ypapadop-amd#4)

0ead463

Memory pool information and buffer allocation support (ypapadop-amd#5)

44059df

* Identifying memory pools
* Support for buffer type alignment and max size
* Cache memory properties
* Comments
* Using fixed-width integrals
* Buffer allocation support

HSA tensor data functions (ypapadop-amd#6)

06c06ce

Fix for correctly choosing memories and reporting size (ypapadop-amd#8)

716aef5

HSA backend function implementations (ypapadop-amd#9)

d697b5f

* More function implementations, cleanups
* Remove redundant information, catch exception

HSA operations fallback to CPU (ypapadop-amd#11)

be3c292

* HSA backend in test-mul-mat
* HSA backend in gpt-2-backend
* Offloading to CPU backend if operation not implemented

HSA queue support (ypapadop-amd#12)

d7ad305

* Creating HSA queue
* Zero-init all members
* Adding signal support

Use GGML_ABORT instead of abort() (ypapadop-amd#13)

dcd3ce9

CPU backend fallback (ypapadop-amd#15)

b38c505

* Add option for CPU fallback in CMakeLists
* Adding fallback to CPU backend if operation is not supported

Replacing mutex and bool with call_once (ypapadop-amd#17)

4d04b3b

Exception and error handling (ypapadop-amd#25)

1f9cd9f

* Using static instead of anonymous namespace
* Handling exceptions

Clang format support (ypapadop-amd#26)

285e998

* clang-format configuration loosely based on ggml-sycl
* Formatting

Free packet memory after synchronization (ypapadop-amd#28)

6116ed4

* Comments, disabling copy/move when not allowed
* Replacing high / low bits macros
* Factoring out dispatch functionality
* Free all finished packets

ggml_backend_buffer_init_tensor support (ypapadop-amd#29)

a81f34b

* Missing checks in example
* Adding init_tensor support

Using alias for std::filesystem (ypapadop-amd#30)

fde1804

Binary instructions format loading (ypapadop-amd#32)

7e8a195

* Using new compilation process in CMake
* Loading insts as binary

Release memory allocated for packets (ypapadop-amd#33)

cec9da7

* Renaming dispatch function
* Track allocated memory for packets

Directly create PDI instead through xclbin (ypapadop-amd#34)

58684b2

HSA tensor extra metadata (ypapadop-amd#37)

b197a27

* Avoid warnings
* Adding extra data to HSA backend tensors
* Caching kernel in tensor extra metadata

Various fixes (ypapadop-amd#38)

ba2785b

* Renaming pending data functions. Refactor packet dispatch
* Guard all CPU fallback cases
* Internal nodes do not init extra until after graph allocation
* Assert cleanups
* Comments
* Separate CMake support

CMake Refactor (ypapadop-amd#39)

422b7db

* Add expected find_package definitions
* Expose both C and C++ Peano compilers
* Remove unused property
* Relocating kernels
* Output kernels for a device in a directory
* Explicit names for Peano compilation

Stand-alone kernel discovery header (ypapadop-amd#40)

87ae5ec

* Python script fixes
* Separating kernel discovery to its own header

Renaming kernel to the IRON equivalent (ypapadop-amd#41)

8a78d93

ypapadop-amd and others added 23 commits November 17, 2025 12:13

Update IRON environment set-up (ypapadop-amd#108)

1fd1064

* Update IRON environment set-up
* Fixing typo and index url

Adding pytest to Python requirements (ypapadop-amd#109)

264fa31

Aligning tensor sizes for bf16 / int8 / int16 (ypapadop-amd#111)

e3b26d9

* Aligning tensor sizes for bf16 / int8 / int16
* Using constant. Removing extra import.
* Adding comment

Return false is_host for HSA memory and refactor error messages (ypapadop-amd#112)

4c518ec

* Rephrasing error
* Don't return true for is_host on HSA memory. Refactor exception catching.
* Reenabling warning and refactoring registration.
* Remove printf

Adding new ggml_backend_i member (ypapadop-amd#115)

b12417c

Run time logging (ypapadop-amd#116)

3614b39

* Avoid multiple logging
* Abstracting log switch
* Enable/disable logging at run-time

Move tests to test/ggml-hsa (ypapadop-amd#122)

4433ded

* Verbose log when kernel not found
* Fix typo
* Use ggml_op_is_empty when possible. Remove deadcode
* Move ggml-hsa specific tests to separate directory
* Move simple-vector example as a test-vector-hsa

Update latest iron (ypapadop-amd#136)

99c1e02

* Updating requirements
* Remove chaning cwd

Adding new unary ops (ypapadop-amd#141)

885b4be

* Adding new unary ops
* Assert if type is not floating point
* Fix floor implementation

Removing define

9ec0c1a

Generic backend kernel support (ypapadop-amd#146)

2602680

* More generic kernel discovery
* Renaming compilation function to suggest it's for AIE agents
* Renaming AIE kernel compiler files
* Updating references to AIE kernel compiler files
* Use switch-case
* Update comments

Update README on how to compile (ypapadop-amd#148)

670ff77

* Update README on how to compile
* Update src/ggml-hsa/README.md
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update README.md
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Pass op_params argument to kernels in compilation

52d220b

aie2p MAT_MUL (ypapadop-amd#150)

c4bfc13

Increment and dec dispatch signal before kernel dispatch in ggml-hsa.cpp

d68f884

ypapadop-amd force-pushed the hsa-backend branch 3 times, most recently from 092e35d to 6f9f0ea, on December 15, 2025 18:59

ypapadop-amd force-pushed the hsa-backend branch from 6f9f0ea to b32f95e on January 7, 2026 15:57

ypapadop-amd force-pushed the hsa-backend branch from 331c960 to fcd1205 on January 26, 2026 19:43

ypapadop-amd force-pushed the hsa-backend branch from 04d0c66 to ba22186 on February 10, 2026 18:41

ypapadop-amd force-pushed the hsa-backend branch from 07b1565 to 8ba19ce on February 18, 2026 17:16

aamarnat wants to merge 115 commits into ypapadop-amd:hsa-backend from aamarnat:hsa-backend

5 participants