Conversation

@CC-Yeh CC-Yeh commented Jan 20, 2026

Proposed changes

Add Metal quantized SDPA vector kernels based on #1515

Speedup vs fp16 (Quant SDPA)

Config: H=32, H_k=8, D=128, GQA=4x

| SeqLen | mxfp4 (4-bit) | mxfp8 (8-bit) |
| ------ | ------------- | ------------- |
| 2048   | 0.91x         | 1.21x         |
| 4096   | 1.59x         | 1.35x         |
| 8192   | 1.79x         | 1.52x         |
| 16384  | 2.31x         | 1.94x         |
| 32768  | 2.49x         | 2.04x         |
| 65536  | 2.62x         | 2.17x         |
| 131072 | 2.58x         | 2.05x         |

TODO:

What improved performance:

  • Removed per-thread storage of k, v to reduce register pressure (threads were otherwise stalled waiting on synchronization).
  • Fused the computation with dequantization (see the sketch below).
  • Tuned the read width (`uint16_t`/`uint32_t`) for the quantized k/v.
  • Manual unrolling beats the clang loop optimizer.
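
A minimal, illustrative sketch of the fused-dequantization and manual-unroll points above (not the actual kernel in this PR), written in Metal Shading Language. The function name, the packing layout (8 x 4-bit values per uint32_t), and the per-group scale/bias are assumptions for illustration only:

template <typename T>
inline float qk_dot_fused_4bit(
    const thread T* q,   // 8 query values already held in registers
    uint32_t k_packed,   // 8 x 4-bit quantized k values
    float scale,         // per-group dequantization scale
    float bias) {        // per-group dequantization bias
  float acc = 0.0f;
  // Manual unroll: dequantize each nibble and accumulate in one step,
  // instead of relying on the clang loop optimizer.
  acc += float(q[0]) * (float((k_packed >> 0) & 0xF) * scale + bias);
  acc += float(q[1]) * (float((k_packed >> 4) & 0xF) * scale + bias);
  acc += float(q[2]) * (float((k_packed >> 8) & 0xF) * scale + bias);
  acc += float(q[3]) * (float((k_packed >> 12) & 0xF) * scale + bias);
  acc += float(q[4]) * (float((k_packed >> 16) & 0xF) * scale + bias);
  acc += float(q[5]) * (float((k_packed >> 20) & 0xF) * scale + bias);
  acc += float(q[6]) * (float((k_packed >> 24) & 0xF) * scale + bias);
  acc += float(q[7]) * (float((k_packed >> 28) & 0xF) * scale + bias);
  return acc;
}

Keeping only the packed word and the running accumulator in registers is what avoids the per-thread k/v storage mentioned in the first bullet.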

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

awni commented Jan 21, 2026

The numbers seem quite good.. a little too good to be true 😅

What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

CC-Yeh commented Jan 21, 2026

> The numbers seem quite good.. a little too good to be true 😅

Totally agree, must be missing something 🤔

> What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

Attention is a simple reference implementation built from matmul + softmax + matmul (maybe too naive?).
SDPA uses mx.fast.scaled_dot_product_attention, which hits the sdpa_vector_2pass kernels when Lq ≤ 8 (as in this case).

The query sequence length here is 1 (q.shape = (1, 32, 1, 128)), so this benchmark is measuring the single-token decode case, where one new token attends to a long KV cache (L = 32768).
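
For concreteness, a minimal sketch of that setup, assuming the MLX C++ API from "mlx/mlx.h" (this is not the benchmark script that produced the numbers above; naive_attention and the random inputs are illustrative, and the KV heads are repeated to undo the 4x GQA for the plain-matmul reference):

#include <cmath>
#include "mlx/mlx.h"
namespace mx = mlx::core;

// Naive reference: matmul + softmax + matmul, with the KV heads repeated
// to match the 32 query heads so plain matmuls can be used.
mx::array naive_attention(const mx::array& q, mx::array k, mx::array v, float scale) {
  k = mx::repeat(k, 4, 1);  // (1, 8, L, 128) -> (1, 32, L, 128)
  v = mx::repeat(v, 4, 1);
  auto scores = mx::matmul(q * scale, mx::transpose(k, {0, 1, 3, 2}));
  return mx::matmul(mx::softmax(scores, -1), v);
}

int main() {
  int L = 32768;                                 // KV cache length
  auto q = mx::random::normal({1, 32, 1, 128});  // single decode token (Lq = 1)
  auto k = mx::random::normal({1, 8, L, 128});   // H_k = 8 KV heads
  auto v = mx::random::normal({1, 8, L, 128});
  float scale = 1.0f / std::sqrt(128.0f);
  mx::eval(naive_attention(q, k, v, scale));
  return 0;
}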

CC-Yeh commented Jan 21, 2026

@awni
Fixed some bugs in the 8-bit dequantization and in the benchmark (it had unnecessary dequantization steps).
Now the numbers make more sense 😃

awni commented Jan 21, 2026

So if I’m understanding correctly the fused implementation is slower in the quantized case than the unfused ops-based one?

CC-Yeh commented Jan 21, 2026

Fused SDPA is faster: MXFP4 15.33 ms vs 24.71 ms, and MXFP8 26.09 ms vs 46.48 ms to decode a single query.

awni commented Jan 21, 2026

Very nice!!

mlx/fast.cpp Outdated
Comment on lines 875 to 878
if (qmode == QuantizationMode::Nvfp4) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Mode 'nvfp4' is not supported for fast attention.");
}
Member

Why not nvfp4?

Contributor Author

It’s on the way! I just wanted to make sure the PR structure was okay first.

Contributor Author

Added nvfp4 support.

mlx/fast.cpp Outdated
Comment on lines 871 to 874
if (qmode == QuantizationMode::Affine) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Only fp quantization modes are supported.");
}
Member

Why not affine?

Member

Btw not suggesting we necessarily do it. Maybe it's better to be more limited in the quants we support here. Maybe fp8, fp4 are fine to start?

For example, I don't think it's necessary to support every bit width, because in practice no one will ever use 2 or 3 bits for KV cache quantization.

Contributor Author

Added initial support; there is still room for tuning bits 2/3/5/6.

awni commented Jan 27, 2026

@CC-Yeh I'm interested in this PR moving forward. Let me know if you have questions. Also no need to support everything on a first pass. I think doing one 8-bit (fp8 / int8) quant well for Metal / CUDA is already probably good enough to start.

@CC-Yeh changed the title from "[WIP] Quantized SDPA" to "Quantized SDPA" on Jan 29, 2026
@CC-Yeh marked this pull request as ready for review on January 29, 2026 at 22:04
CC-Yeh commented Jan 29, 2026

@awni

I’ve added the Metal paths for mxfp4/mxfp8, nvfp4, and affine (2/3/4/5/6/8 bits); the affine path is not yet optimized.
Further tuning likely needs validation on other machines.

For the CUDA path (maybe a next PR), Colab's GPUs don't support NVFP4, so I would need help testing that.

[Figure: quant_sdpa_speedup_vs_seqlen — speedup vs. sequence length]

@CC-Yeh requested a review from awni on January 29, 2026 at 22:22
awni commented Jan 29, 2026

> affine(2/3/4/5/6/8)

What group sizes did you do for that? I'm not convinced we need broad support for bit width × group size. I expect bits < 4 to be used rarely, if ever.

CC-Yeh commented Jan 29, 2026

> affine(2/3/4/5/6/8)
>
> What group sizes did you do for that? I'm not convinced we need broad support there. I expect < 4 to be used rarely.

What group sizes do you think we should support for affine? Currently it's templated so it can handle various sizes, but I can limit the instantiations if there's a specific set that's practical.

template <typename T, int D, QuantMode mode, int group_size, int bits>
[[kernel]] void quant_sdpa_vector_2pass_1(

awni commented Jan 29, 2026

Yes totally. I think it's good to keep it generic. But probably better to limit initial support and grow than vice versa.

I would maybe start with bits = {4, 6, 8} and just group_size = 32. I think 32 is most flexible for the head dimension right?

CC-Yeh commented Jan 30, 2026

> Yes totally. I think it's good to keep it generic. But probably better to limit initial support and grow than vice versa.
>
> I would maybe start with bits = {4, 6, 8} and just group_size = 32. I think 32 is most flexible for the head dimension right?

Limited the affine support.

Yeah, 32 is most flexible for head dim.
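
A small host-side sketch of that scope (illustrative only, not the PR's actual dispatch code; quant_sdpa_affine_supported is a hypothetical helper name):

bool quant_sdpa_affine_supported(int group_size, int bits) {
  // Keep the kernel templated over (group_size, bits), but only route the
  // combinations agreed on above: group_size = 32 divides the common head
  // dims, and bits < 4 are rarely used for KV cache quantization.
  return group_size == 32 && (bits == 4 || bits == 6 || bits == 8);
}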
