As I understand it, there's currently no support for any quantization. I think it would be really cool to eventually support some form of quantization so we can run larger models, say Qwen3.5 35B on a 4090. The most practical route seems to be transformers + bitsandbytes, since autokernel is built around PyTorch profiling.
As I see it there are effectively 3 levels:
1. Load quantized models
This is easy.
- Add optional deps like bitsandbytes in pyproject.toml (line 12).
- Extend profile.py (line 267) and verify.py (line 147) to accept quantization args and pass a BitsAndBytesConfig into from_pretrained().
- Add CLI flags like --quantization bnb4|bnb8|none, --compute-dtype bf16|fp16, maybe --device-map.
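As a rough sketch of level 1 (the flag names, helper name, and kwargs mapping below are my own illustrative assumptions, not the repo's actual interface):

```python
import argparse

def build_quant_kwargs(quantization: str, compute_dtype: str) -> dict:
    """Map the proposed CLI flags to extra from_pretrained() kwargs.

    Returns an empty dict for --quantization none, so the existing
    dense path is untouched and bitsandbytes stays an optional dep.
    """
    if quantization == "none":
        return {}
    # Imported lazily so the dense path never requires bitsandbytes.
    import torch
    from transformers import BitsAndBytesConfig

    dtype = {"bf16": torch.bfloat16, "fp16": torch.float16}[compute_dtype]
    if quantization == "bnb4":
        cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=dtype)
    else:  # bnb8
        cfg = BitsAndBytesConfig(load_in_8bit=True)
    return {"quantization_config": cfg, "device_map": "auto"}

parser = argparse.ArgumentParser()
parser.add_argument("--quantization", choices=["bnb4", "bnb8", "none"], default="none")
parser.add_argument("--compute-dtype", choices=["bf16", "fp16"], default="bf16")
```

profile.py and verify.py would then just splat these kwargs into their existing `from_pretrained()` calls.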
2. Profile quantized models
This is medium difficulty.
- The profiler will still run, but kernel names and module types will change.
- Today the repo classifies kernels by CUDA name fragments in profile.py (line 449), which is tuned for dense PyTorch/cuBLAS-style kernels, not quant-specific kernels.
- You’d need to inspect what 4-bit/8-bit runs actually emit and extend the classifier.
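The classifier extension for level 2 might look something like this. The quant fragment strings here are guesses at what bitsandbytes kernel names contain; they'd need to be checked against a real torch.profiler trace of a 4-bit/8-bit run:

```python
# Hypothetical name-fragment tables; the real bnb kernel names must be
# taken from an actual profiler trace, not assumed.
QUANT_FRAGMENTS = {
    "dequantize": "quant_dequant",   # e.g. blockwise dequant kernels
    "4bit": "quant_matmul",
    "int8": "quant_matmul",
}
DENSE_FRAGMENTS = {
    "gemm": "matmul",
    "gemv": "matmul",
    "elementwise": "elementwise",
}

def classify_kernel(name: str) -> str:
    """Label a CUDA kernel by substring, checking quant fragments first
    so quantized kernels aren't misfiled into the dense buckets."""
    lowered = name.lower()
    for frag, label in {**QUANT_FRAGMENTS, **DENSE_FRAGMENTS}.items():
        if frag in lowered:
            return label
    return "other"
```

Checking quant fragments before dense ones matters because a fused dequant+gemm kernel name could plausibly match both tables.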
3. Optimize and reintegrate quantized kernels
This is the real work.
- End-to-end reintegration currently only replaces plain nn.Linear, nn.LayerNorm, and RMSNorm-like modules in verify.py (line 563).
- Quantized models often replace nn.Linear with custom classes, so verify.py (line 575) would miss them.
- More importantly, the existing kernel library assumes dense fp16/bf16 kernels. Quantized inference needs different kernels and references: dequantize+matmul fusion, packed weights, scales/zeros, possibly group-wise quant metadata. That means new starter kernels, new reference paths, and likely new benchmark inputs in bench.py.
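For the module-matching gap in verify.py, one minimal fix is to match by class name rather than only `isinstance(m, nn.Linear)`. The bitsandbytes class names below (`Linear4bit`, `Linear8bitLt`) are its public replacement modules; the stand-in classes exist only to exercise the predicate:

```python
# Class-name set covering both dense and bnb-quantized linears.
LINEAR_LIKE_CLASS_NAMES = {
    "Linear",        # torch.nn.Linear, what verify.py matches today
    "Linear4bit",    # bitsandbytes 4-bit replacement module
    "Linear8bitLt",  # bitsandbytes 8-bit replacement module
}

def is_linear_like(module: object) -> bool:
    """Match by class name so quantized replacements aren't missed.

    A production version should prefer isinstance() against the real
    classes, guarded by an optional bitsandbytes import.
    """
    return type(module).__name__ in LINEAR_LIKE_CLASS_NAMES

# Stand-ins for the real classes, just to exercise the predicate.
class Linear: ...
class Linear4bit: ...
class LayerNorm: ...
```

This only fixes discovery, of course; the replacement kernels themselves still need the dequant-aware reference paths described above.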
I can sketch this work at a high level, but I don't think I'm ready to take on the implementation yet. Any opinions/guidance?