
Conversation

@jammm commented Dec 30, 2025

This allows Triton to run on AMD GPUs on Windows via TheRock wheels - https://github.com/ROCm/TheRock/blob/main/RELEASES.md

It should build as-is with @woct0rdho's existing build process, since it only modifies .py files and a .c file that is compiled at runtime.

Whenever you run a program that requires Triton, make sure to set the following environment variables (a minimal setup sketch follows the list):

  • CC and CXX to clang-cl
  • ROCM_HOME to the output of rocm-sdk path --root, and prepend it to PATH
  • DISTUTILS_USE_SDK=1
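
For illustration, a minimal sketch (not part of the PR) of setting these from Python before anything imports Triton, assuming the rocm-sdk CLI from TheRock wheels is on PATH:

import os
import subprocess

# Compiler and distutils settings used when Triton builds its runtime stubs
os.environ["CC"] = "clang-cl"
os.environ["CXX"] = "clang-cl"
os.environ["DISTUTILS_USE_SDK"] = "1"

# "rocm-sdk path --root" prints the SDK root installed by the rocm wheels
rocm_root = subprocess.check_output(["rocm-sdk", "path", "--root"], text=True).strip()
os.environ["ROCM_HOME"] = rocm_root
os.environ["PATH"] = rocm_root + os.pathsep + os.environ["PATH"]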

Summary of changes (two of the Python-side fixes are sketched after the list):

  • Use LoadLibrary/GetProcAddress on Windows instead of dlopen/dlsym
  • Use rocm_sdk.find_libraries() to locate amdhip64
  • Add platform-specific macros for dynamic library loading
  • Escape Windows paths for C string embedding
  • Treat clang-cl as MSVC-compatible compiler in build.py
  • Fix NamedTemporaryFile handling on Windows in compiler.py

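For illustration, hedged sketches (not the PR's exact code) of two of the fixes above; the helper names are placeholders:

import os
import tempfile

# 1. Escaping a Windows path before embedding it in generated C source:
#    every backslash must be doubled so the C string literal stays valid.
def escape_path_for_c(path: str) -> str:
    return path.replace("\\", "\\\\")

hip_dll = r"C:\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll"
c_snippet = 'static const char *HIP_LIB = "%s";' % escape_path_for_c(hip_dll)

# 2. NamedTemporaryFile on Windows: the file cannot be reopened by the compiler
#    while the handle is still open, so create it with delete=False, close it,
#    pass the path along, and unlink it afterwards.
def build_from_source(src: str) -> None:
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False)
    try:
        tmp.write(src)
        tmp.close()
        # compile_module(tmp.name)  # placeholder for the actual compile step
    finally:
        os.unlink(tmp.name)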
@woct0rdho (Owner) commented Dec 30, 2025

Looks good to me! I haven't followed the modern AMD toolchain for a while, but if this is enough to make it work, then it will not add much maintenance cost.

Maybe you can also tell people at https://github.com/patientx/ComfyUI-Zluda and https://github.com/lshqqytiger/triton about this.

@woct0rdho merged commit 9d87bfc into woct0rdho:release/3.5.x-windows on Dec 30, 2025
@woct0rdho (Owner) commented Dec 30, 2025

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

@woct0rdho (Owner)

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

@jammm (Author) commented Dec 30, 2025

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

Just tried triton_windows-3.5.1.post23-cp312-cp312-win_amd64.whl and ran it on TurboDiffusion, which uses the flash-attn Triton backend via FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE", and it seems to work fine (albeit it didn't use the Triton FA kernels, since the SpargeAttn HIP kernels were used instead; but it did use the layernorm, qk quantize, etc. Triton kernels that are in flash-attn).
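
For reference, a hedged sketch of how that backend gets enabled before the run; only the environment variable name comes from the comment above, the rest is illustrative:

import os

# Must be set before flash_attn is imported so it picks the Triton AMD backend
os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"

import flash_attn  # noqa: E402  (imported after setting the env var on purpose)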

generated_video.mp4

@alexsarmiento

The Python test examples from Triton finally work with my gfx1100. Also, sageattention 1.0.6 works in ComfyUI and I am getting faster generations with some workflows.

But when I try torch.compile via inductor in ComfyUI, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug in LLVM.

@IxMxAMAR commented Jan 2, 2026

The Python test examples from Triton finally work with my gfx1100. Also, sageattention 1.0.6 works in ComfyUI and I am getting faster generations with some workflows.

But when I try torch.compile via inductor in ComfyUI, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug in LLVM.

How did you achieve this? I have an RX 7600; can I do this too? Can you share your ComfyUI-run.bat, or the args and env variables you are using? And is it official ComfyUI or ZLUDA?

@patientx commented Jan 3, 2026

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

First of all, thanks for working on this, much appreciated.
Is installing it like this OK, since it seems like you added it: "pip install triton-windows", which installs triton-windows 3.5.1.post23? I then installed sage-attention with "pip install sageattention==1.0.6" and flash-attention as well, but in the end both gave errors. I set up the parameters like this in the starter batch for ComfyUI:

set CC=clang-cl
set CXX=clang-cl
set DISTUTILS_USE_SDK=1
for /f "delims=" %%i in ('python -c "import rocm; print(rocm.path[0])"') do set ROCM_HOME=%%i

Since rocm is installed as a package, that last one should work, right?

@jammm (Author) commented Jan 3, 2026 via email

@0xDELUXA commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

I’m curious about Sage too...

@patientx commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

@IxMxAMAR commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I also compiled SpargeAttn, but how exactly am I supposed to use it with ComfyUI? Any idea?

@0xDELUXA commented Jan 3, 2026

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

@patientx commented Jan 3, 2026

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.
Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The sage attention "patches" were made by someone in the SD.Next Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; here are curl commands to apply them directly when the venv is activated and you are inside the ComfyUI directory.

  • install sage-attention (v1) with this: "pip install sageattention==1.0.6"
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py

@0xDELUXA commented Jan 3, 2026

On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The sage attention "patches" were made by someone in the SD.Next Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; here are curl commands to apply them directly when the venv is activated and you are inside the ComfyUI directory.

  • install sage-attention (v1) with this: "pip install sageattention==1.0.6"
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py

Int8? I think these are for RDNA2 or 3; I don't think RDNA4 needs them. Will try soon, though.

Edit: Yes, it does. I’m getting a lot of errors coming from venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py.
I think we could consider patching those three files in this repo for AMD only, by default (my suggestion, feel free to ignore).

@sfinktah commented Jan 4, 2026

Yes, that works. It did do a nasty crash the first time, which I am saving here not as a complaint, but as a reference for a possible side project: "Write a program to force an AMD driver crash in order to free up all that VRAM that dwm never gives back."

Sampling 81 frames at 640x640 with 2 steps
  0%| | 0/2 [00:00<?, ?it/s]Generated new RoPE frequencies
Exception Code: 0xC0000005
 #0 0x00007ff8c47c06eb (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x9206eb)
 #1 0x00007ff8c42f4315 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x454315)
 #2 0x00007ff8c432ef47 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48ef47)
 #3 0x00007ff8c432dec6 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48dec6)
 #4 0x00007ff8c432e1b4 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48e1b4)
 #5 0x00007ff8c431b105 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x47b105)
 #6 0x00007ff8c429010f (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f010f)
 #7 0x00007ff8c4290231 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f0231)
 #8 0x00007ff8c42b3a86 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x413a86)
 #9 0x00007ff8c424b0ff (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3ab0ff)
#10 0x00007ff999b5259d (C:\WINDOWS\System32\KERNEL32.DLL+0x1259d)
#11 0x00007ff99bdeaf78 (C:\WINDOWS\SYSTEM32\ntdll.dll+0x5af78)

FYI the ZLUDA sageattn is basically just a patch to change the parameters to

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

Otherwise it uses too much "shared memory" and produces black screens. See also https://raw.githubusercontent.com/sfinktah/amd-torch/refs/heads/main/patches/sageattention-1.0.6+sfinktah+env-py3-none-any.patch which is an environment variable adjustable version of the same thing.
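
For illustration, a hedged sketch (not the actual patch) of what that override amounts to when launching the SageAttention Triton kernels; the kernel name and tensor arguments are placeholders:

# Smaller tiles so the kernel fits in LDS ("shared memory") on RDNA GPUs
RDNA_LAUNCH_OVERRIDES = dict(
    BLOCK_M=32,
    BLOCK_N=16,
    STAGE=1,
    waves_per_eu=3,  # 3 or 4 depending on the GPU
    num_warps=2,
    num_ctas=1,
    num_stages=1,
)

# attn_fwd_kernel[grid](q, k, v, out, **RDNA_LAUNCH_OVERRIDES)  # placeholder launch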

@IxMxAMAR commented Jan 4, 2026

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

I have tried many, many combinations besides these and none of them worked; most of the time I got garbled noise instead of an image, and sometimes I got a black-and-white frame for what was supposed to be the subject. I have an RX 7600.

@rwfsmith commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

Using an SDXL fine-tune with a DMD2 LoRA and an upscaler KSampler step
Base resolution: 1024x1496
Upscale resolution: 1536x2240
8 steps on both
CFG: 1
lcm/exponential

Compared flash attention with PyTorch cross attention across 10 image generations; the average it/s for the base and upscaler samplers is listed at the end.

Flash   Cross
29.11   31.74
28.84   32.48
27.92   30.7
27.86   30.58
28.1    30.67
28.04   30.61
27.98   30.7
28.07   30.67
28.04   30.88
28.11   30.76

Average:            28.207     30.979
Base sampler (avg): 1.83 it/s  1.67 it/s
Upscaler (avg):     1.36 s/it  1.58 s/it

Edit: I also just tested with sage attention 1, but the results seem to be the same as cross attention.

@IxMxAMAR commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

How did you do that? I have also compiled Sparge and Sage but haven't tried Flash yet; Flash 2 seems even better, so how can I?
I have an RX 7600; which one do you have?

@0xDELUXA commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

I was also able to, but based on my results Flash 2 is slower than AOTriton SDPA Flash on RDNA4.
