AMD Port of SpargeAttn - Working on Windows for gfx1151 #108
base: main
Conversation
Force-pushed from 15a0f4c to 8e7f363
Great to see this!
A small nitpick, but for me,
Yeah, IIRC it's slower than aotriton SDPA at larger topk values, but at around 0.25 sparsity or below this SLA one does better. Note that SDPA FA is always dense attention, since it doesn't support SLA.
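For readers less familiar with the terminology: the topk value discussed throughout this thread controls what fraction of key/value blocks each query block actually attends to, which is what makes the attention sparse. Below is a toy sketch of that block-level top-k idea in PyTorch; the block size, the mean-pooling selection criterion, and the tensor shapes are illustrative assumptions on my part, not this PR's kernel.

```python
import torch

def topk_block_mask(q, k, block_size=64, topk=0.25):
    """Toy illustration of block-level top-k selection (NOT SpargeAttn's kernel):
    keep the `topk` fraction of key blocks whose pooled keys are most similar
    to each pooled query block, and only compute attention for those tiles."""
    # q, k: [batch, heads, seq_len, head_dim]; seq_len assumed divisible by block_size
    q_blk = q.unflatten(2, (-1, block_size)).mean(dim=3)   # [B, H, n_q_blocks, D]
    k_blk = k.unflatten(2, (-1, block_size)).mean(dim=3)   # [B, H, n_k_blocks, D]
    sim = q_blk @ k_blk.transpose(-1, -2)                  # block-to-block similarity
    keep = max(1, int(topk * sim.size(-1)))                # e.g. keep 25% of key blocks
    idx = sim.topk(keep, dim=-1).indices
    mask = torch.zeros_like(sim, dtype=torch.bool).scatter_(-1, idx, True)
    return mask  # True => compute this (query block, key block) tile densely
```

With topk=0.25, each query block attends to roughly a quarter of the key blocks, which lines up with the "around 0.25" sweet spot mentioned above.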
Hmm, that's strange. It should work. Can you create an issue in https://github.com/ROCm/TheRock with repro steps?
Ready: ROCm/TheRock#2726
For me, the test script from
If I change the dtype from
Is this behavior expected on the gfx1200, or is there a problem with my build?
Thanks for the catch! I've fixed the dtype issue. However, this code isn't implemented for RDNA4 yet; it's just RDNA3 for now. Some modifications need to be made specifically for RDNA4 to work with rocWMMA, as the matrix fragment layouts for the individual elements are different. But I don't have access to an RDNA4 GPU at the moment.
Force-pushed from 8e7f363 to 1523334
I see. No worries, I can wait until you have an RDNA4 GPU available. It was exactly the same with PyTorch support for gfx1200 XD
Correct me if I'm wrong, but theoretically RDNA4 can use fp8, so we would need
Based on: in
Yup. Not sure about perf vs. SDPA, as aotriton should have fp8 kernels too, I think?
Just checked. My bad, I forgot it was deprecated. I'll modify the READMEs to only specify the torch installation, as that should automatically download the corresponding rocm wheels as a dependency.
Force-pushed from 1523334 to 7a2278c
Don't really understand these things, so I think you're right. All I know is that over at Nvidia, Sparge is better than SDPA Flash in these types of workloads.
rocm-sdk-devel didn't include the tar file for me, but this worked: pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]"
Ran this to build on Linux:
It builds successfully, but my cosine similarity is low: Cosine similarity: 0.998755
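For anyone reproducing this: a number like 0.998755 typically comes from flattening the sparse kernel's output and a dense SDPA reference and taking their cosine similarity. A rough sketch of such a check is below; the import path, call signature, and tensor shapes are assumptions on my part rather than a copy of this repo's test script.

```python
import torch
import torch.nn.functional as F

# Assumed import path and call; the real test script may pass tuning/causal kwargs.
from spas_sage_attn import spas_sage2_attn_meansim_cuda

# Hypothetical shapes: [batch, heads, seq_len, head_dim]
q, k, v = (torch.randn(1, 8, 4096, 128, dtype=torch.float16, device="cuda")
           for _ in range(3))

ref = F.scaled_dot_product_attention(q, k, v)   # dense reference
out = spas_sage2_attn_meansim_cuda(q, k, v)     # sparse kernel under test

cos = F.cosine_similarity(out.float().flatten(), ref.float().flatten(), dim=0)
print(f"Cosine similarity: {cos.item():.6f}")   # ~0.99+ suggests a healthy build
```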
You mean on Linux?
This is fine, I believe. Give it a try on some image/video model to confirm.
What I wanted to say: 0.998... should be fine, it's very close to 1.0. Unlike the value I got: 0.467271 (because of RDNA4). I hope this PR will be optimized for the RX 9000 series in the near future.
Hello, I have an RX 7600 and built this with a similar cosine similarity, around 0.998. I'm currently using this in ComfyUI and it works like a charm. Topk values below 0.2 mess things up, while a 0.25 topk gives results similar to sageattn 1.0.6 in quality and speed, so it's a great alternative to sageattn on AMD, I guess.
That was initially on Linux, but I ran it again when I rebooted into Windows; "rocm-sdk-devel" works there, but rocm[devel] worked on both. Funny enough, though, I was able to build this pretty easily on Linux but am having trouble on Windows :P
How would I enable it for use in ComfyUI?
Hello. I have successfully compiled the SpargeAttn AMD port you provided on Linux. My graphics card is a 9070 XT. After selecting SpargeAttn for the workflow in ComfyUI and selecting sparse_stage as the parameter, K-sampling errors out (ComfyUI error report with stack trace, system info, and workflow attached). The same workflow with the same configuration can also run on Windows. I don't know whether this problem can be solved. Thanks @jammm
We can't really use Sparge on RDNA4 yet; I mean, we can, but it will be bad (based on my cosine result). We need to wait for Jam to optimize this PR for RDNA4.
That's true. However, ComfyUI can run on Windows and is faster than sageattn, although the image quality has decreased.
My RX 7900 XT is also displaying the same error on Windows.
Indeed, the RX 7900 XT doesn't support FP8, but @ouco1986's RX 9070 XT does. The error specifically says:
Also,
So theoretically gfx1100 (RX 7900 XT) should work - in TurboDiffusion, where it was mainly tested. I think that in the future there might be a conditional in the code, so if the GPU isn't RDNA4 (or CDNA) it should use the FP16 path; otherwise, FP8.
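A rough sketch of what such a dispatch could look like in the Python wrapper. `gcnArchName` is a real device property on ROCm builds of PyTorch, but the architecture prefixes treated as FP8-capable here are my guess, not code from this PR.

```python
import torch

def pick_attn_path(device_index: int = 0) -> str:
    """Hypothetical dispatch between FP8 and FP16 kernels; not code from this PR.
    gcnArchName is exposed by ROCm builds of PyTorch."""
    arch = torch.cuda.get_device_properties(device_index).gcnArchName
    # Treat RDNA4 (gfx12xx) and CDNA parts (gfx94x/gfx95x) as FP8 candidates;
    # RDNA3/3.5 (e.g. gfx1100, gfx1151) fall back to the FP16 path.
    if arch.startswith(("gfx12", "gfx94", "gfx95")):
        return "fp8"
    return "fp16"
```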
Unfortunately I have not worked on getting this running in ComfyUI yet, but it seems like the wrapper needs to be refactored a tiny bit to use the fp16-specific code path instead of the fp8 one. And yes, only RDNA3/3.5 is supported, as @0xDELUXA rightly pointed out.
@githust66 If you have sageattention installed (which I think isn't needed either way), just install KJNodes and replace your model_optimization_nodes.py in C:\ComfyUI\custom_nodes\comfyui-kjnodes\nodes with this. You will then have an option to use this in ComfyUI with the Patch Sage Attention node: just select spas_sage2_attn from the list and put in the topk value you want. Edit: I removed the code for the other sage attention patches, but that won't matter since they don't work on AMD either way, I guess.
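For anyone curious what such a patch boils down to: it essentially replaces ComfyUI's attention call with the SpargeAttn kernel. A hand-wavy sketch follows; the [batch, seq, heads*dim] layout, the import path, and the call signature are assumptions, so check the actual KJNodes file rather than relying on this.

```python
import torch
from spas_sage_attn import spas_sage2_attn_meansim_cuda  # assumed import path

def sparge_attention(q, k, v, heads, mask=None):
    """Hypothetical ComfyUI-style attention override (see the real KJNodes patch).
    Assumes q/k/v arrive as [batch, seq_len, heads * head_dim]."""
    b, s, inner = q.shape
    head_dim = inner // heads
    # Reshape to [batch, heads, seq_len, head_dim] for the SpargeAttn kernel.
    q, k, v = (t.view(b, s, heads, head_dim).transpose(1, 2) for t in (q, k, v))
    out = spas_sage2_attn_meansim_cuda(q, k, v)  # sparsity/topk kwargs omitted here
    return out.transpose(1, 2).reshape(b, s, inner)
```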
Theoretically, ROCm/rocm-libraries#3579 also affects SpargeAttn with ROCm 7, which could mean it can be even faster than it is now.
Where can I find this ComfyUI wrapper for SpargeAttn? Is it on GitHub?
It would be good to know, so you could work on getting this running with Comfy, now that AMD support in ComfyUI is official.
https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/main/wanvideo/radial_attention |
Use rocWMMA instead of CUTLASS.
See README_AMD_WINDOWS.md for setup steps.
Heavily inspired by thu-ml/SageAttention#332.
Used Claude Opus 4.5 to assist.
Tested with TurboDiffusion.
Currently only supports RDNA3/3.5.