
Conversation

@jammm commented Dec 30, 2025

This allows Triton to run on AMD GPUs on Windows via TheRock wheels - https://github.com/ROCm/TheRock/blob/main/RELEASES.md

It should build as-is with @woct0rdho's existing build process, since it only modifies .py files and a .c file that is compiled at runtime.

Whenever you run a program that requires Triton, make sure to set the following environment variables (a minimal setup sketch follows the list):

  • CC and CXX to clang-cl
  • ROCM_HOME to the output of rocm-sdk path --root, and prepend it to PATH
  • DISTUTILS_USE_SDK=1
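
For illustration, a minimal sketch (not part of the PR) of setting these from Python before anything imports Triton, assuming the rocm-sdk CLI from TheRock wheels is on PATH:

import os
import subprocess

# Compiler and distutils settings used when Triton builds its runtime stubs
os.environ["CC"] = "clang-cl"
os.environ["CXX"] = "clang-cl"
os.environ["DISTUTILS_USE_SDK"] = "1"

# "rocm-sdk path --root" prints the SDK root installed by the rocm wheels
rocm_root = subprocess.check_output(["rocm-sdk", "path", "--root"], text=True).strip()
os.environ["ROCM_HOME"] = rocm_root
os.environ["PATH"] = rocm_root + os.pathsep + os.environ["PATH"]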

Summary of changes (two of the Python-side fixes are sketched after the list):

  • Use LoadLibrary/GetProcAddress on Windows instead of dlopen/dlsym
  • Use rocm_sdk.find_libraries() to locate amdhip64
  • Add platform-specific macros for dynamic library loading
  • Escape Windows paths for C string embedding
  • Treat clang-cl as MSVC-compatible compiler in build.py
  • Fix NamedTemporaryFile handling on Windows in compiler.py

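For illustration, hedged sketches (not the PR's exact code) of two of the fixes above; the helper names are placeholders:

import os
import tempfile

# 1. Escaping a Windows path before embedding it in generated C source:
#    every backslash must be doubled so the C string literal stays valid.
def escape_path_for_c(path: str) -> str:
    return path.replace("\\", "\\\\")

hip_dll = r"C:\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll"
c_snippet = 'static const char *HIP_LIB = "%s";' % escape_path_for_c(hip_dll)

# 2. NamedTemporaryFile on Windows: the file cannot be reopened by the compiler
#    while the handle is still open, so create it with delete=False, close it,
#    pass the path along, and unlink it afterwards.
def build_from_source(src: str) -> None:
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False)
    try:
        tmp.write(src)
        tmp.close()
        # compile_module(tmp.name)  # placeholder for the actual compile step
    finally:
        os.unlink(tmp.name)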
@woct0rdho (Owner) commented Dec 30, 2025

Looks good to me! I haven't followed the modern AMD toolchain for a while, but if this is enough to make it work, then it will not add much maintenance cost.

Maybe you can also tell people at https://github.com/patientx/ComfyUI-Zluda and https://github.com/lshqqytiger/triton about this.

@woct0rdho merged commit 9d87bfc into woct0rdho:release/3.5.x-windows on Dec 30, 2025
@woct0rdho (Owner) commented Dec 30, 2025

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

@woct0rdho (Owner)

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

@jammm (Author) commented Dec 30, 2025

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

Just tried triton_windows-3.5.1.post23-cp312-cp312-win_amd64.whl and ran it on TurboDiffusion, which uses the flash-attn Triton backend via FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE", and it seems to work fine (albeit it didn't use the Triton FA kernels, since the SpargeAttn HIP kernels were used instead; but it did use the layernorm, qk quantize, etc. Triton kernels that are in flash-attn).
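
For reference, a hedged sketch of how that backend gets enabled before the run; only the environment variable name comes from the comment above, the rest is illustrative:

import os

# Must be set before flash_attn is imported so it picks the Triton AMD backend
os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"

import flash_attn  # noqa: E402  (imported after setting the env var on purpose)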

generated_video.mp4

@alexsarmiento

The Python test examples from Triton finally work with my gfx1100. Also, sageattention 1.0.6 works in ComfyUI and I am getting faster generations with some workflows.

But when I try torch.compile via inductor in ComfyUI, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug in LLVM.

@IxMxAMAR commented Jan 2, 2026

The Python test examples from Triton finally work with my gfx1100. Also, sageattention 1.0.6 works in ComfyUI and I am getting faster generations with some workflows.

But when I try torch.compile via inductor in ComfyUI, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug in LLVM.

How did you achieve this? I have an RX 7600; can I do this too? Can you share your ComfyUI-run.bat, or the args and env variables you are using? And is it official ComfyUI or ZLUDA?

@patientx commented Jan 3, 2026

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

First of all, thanks for working on this, much appreciated.
Is installing it like this OK, since it seems like you added it: "pip install triton-windows", which installs triton-windows 3.5.1.post23? I then installed sage-attention with "pip install sageattention==1.0.6" and flash-attention as well, but in the end both gave errors. I set up the parameters like this in the starter batch for ComfyUI:

set CC=clang-cl
set CXX=clang-cl
set DISTUTILS_USE_SDK=1
for /f "delims=" %%i in ('python -c "import rocm; print(rocm.path[0])"') do set ROCM_HOME=%%i

Since rocm is installed as a package, that last one should work, right?

@jammm (Author) commented Jan 3, 2026 via email

@0xDELUXA commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

I’m curious about Sage too...

@patientx commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

@IxMxAMAR commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I also compiled SpargeAttn, but how exactly am I supposed to use it with ComfyUI? Any idea?

@0xDELUXA commented Jan 3, 2026

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

@patientx commented Jan 3, 2026

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, as deluxa says, sageattention doesn't work.
Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The sage attention "patches" were made by someone in the SD.Next Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; here are curl commands to apply them directly when the venv is activated and you are inside the ComfyUI directory.

  • install sage-attention (v1) with this: "pip install sageattention==1.0.6"
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py

@0xDELUXA commented Jan 3, 2026

On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The sage attention "patches" were made by someone in the SD.Next Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; here are curl commands to apply them directly when the venv is activated and you are inside the ComfyUI directory.

  • install sage-attention (v1) with this: "pip install sageattention==1.0.6"
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py

Int8? I think these are for RDNA2 or 3; I don't think RDNA4 needs them. Will try soon, though.

Edit: Yes, it does. I’m getting a lot of errors coming from venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py.
I think we could consider patching those three files in this repo for AMD only, by default (my suggestion, feel free to ignore).

@sfinktah commented Jan 4, 2026

Yes, that works. It did do a nasty crash the first time, which I am saving here not as a complaint, but as a reference for a possible side project: "Write a program to force an AMD driver crash in order to free up all that VRAM that dwm never gives back."

Sampling 81 frames at 640x640 with 2 steps
  0%| | 0/2 [00:00<?, ?it/s]Generated new RoPE frequencies
Exception Code: 0xC0000005
 #0 0x00007ff8c47c06eb (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x9206eb)
 #1 0x00007ff8c42f4315 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x454315)
 #2 0x00007ff8c432ef47 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48ef47)
 #3 0x00007ff8c432dec6 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48dec6)
 #4 0x00007ff8c432e1b4 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48e1b4)
 #5 0x00007ff8c431b105 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x47b105)
 #6 0x00007ff8c429010f (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f010f)
 #7 0x00007ff8c4290231 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f0231)
 #8 0x00007ff8c42b3a86 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x413a86)
 #9 0x00007ff8c424b0ff (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3ab0ff)
#10 0x00007ff999b5259d (C:\WINDOWS\System32\KERNEL32.DLL+0x1259d)
#11 0x00007ff99bdeaf78 (C:\WINDOWS\SYSTEM32\ntdll.dll+0x5af78)

FYI the ZLUDA sageattn is basically just a patch to change the parameters to

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

Otherwise it uses too much "shared memory" and produces black screens. See also https://raw.githubusercontent.com/sfinktah/amd-torch/refs/heads/main/patches/sageattention-1.0.6+sfinktah+env-py3-none-any.patch which is an environment variable adjustable version of the same thing.
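
For illustration, a hedged sketch (not the actual patch) of what that override amounts to when launching the SageAttention Triton kernels; the kernel name and tensor arguments are placeholders:

# Smaller tiles so the kernel fits in LDS ("shared memory") on RDNA GPUs
RDNA_LAUNCH_OVERRIDES = dict(
    BLOCK_M=32,
    BLOCK_N=16,
    STAGE=1,
    waves_per_eu=3,  # 3 or 4 depending on the GPU
    num_warps=2,
    num_ctas=1,
    num_stages=1,
)

# attn_fwd_kernel[grid](q, k, v, out, **RDNA_LAUNCH_OVERRIDES)  # placeholder launch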

@IxMxAMAR commented Jan 4, 2026

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

I have tried many, many combinations besides these and none of them worked; most of the time I got garbled noise instead of an image, and sometimes I got a black-and-white frame for what was supposed to be the subject. I have an RX 7600.

@rwfsmith commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

Using an SDXL fine-tune with a DMD2 LoRA and an upscaler KSampler step
Base resolution: 1024x1496
Upscale resolution: 1536x2240
8 steps on both
CFG: 1
lcm/exponential

Compared flash attention with PyTorch cross attention across 10 image generations; the average it/s for the base and upscaler samplers is listed at the end.

Flash   Cross
29.11   31.74
28.84   32.48
27.92   30.7
27.86   30.58
28.1    30.67
28.04   30.61
27.98   30.7
28.07   30.67
28.04   30.88
28.11   30.76

Average:            28.207     30.979
Base sampler (avg): 1.83 it/s  1.67 it/s
Upscaler (avg):     1.36 s/it  1.58 s/it

Edit: I also just tested with sage attention 1, but the results seem to be the same as cross attention.

@IxMxAMAR commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

How did you do that? I have also compiled Sparge and Sage but haven't tried Flash yet; Flash 2 seems even better, so how can I?
I have an RX 7600; which one do you have?

@0xDELUXA commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

I was also able to, but based on my results Flash 2 is slower than AOTriton SDPA Flash on RDNA4.
