Hello! First, thank you for creating and maintaining this excellent project - it's been incredibly helpful for LLM inference.
I'm currently working on quantizing LLMs to improve inference efficiency. As far as I know, per-channel quantization is an important quantization scheme, but I couldn't find explicit documentation or code references for it in this project.
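To make the question concrete, here is a minimal sketch of what I mean by symmetric per-channel (per-output-channel) int8 quantization. This is only illustrative and not T-MAC code; the `[out_channels, in_features]` layout and the ±127 clamp are my own assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One scale per output channel, chosen so the channel's max |w| maps to 127.
// Dequantization is then w ≈ scales[c] * q, per channel c.
std::vector<float> quantize_per_channel(const std::vector<float>& w,
                                        int out_channels, int in_features,
                                        std::vector<int8_t>& q) {
  std::vector<float> scales(out_channels);
  q.resize(w.size());
  for (int c = 0; c < out_channels; ++c) {
    float max_abs = 0.f;
    for (int i = 0; i < in_features; ++i)
      max_abs = std::max(max_abs, std::fabs(w[c * in_features + i]));
    const float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    scales[c] = scale;
    for (int i = 0; i < in_features; ++i) {
      const float v = std::round(w[c * in_features + i] / scale);
      q[c * in_features + i] = static_cast<int8_t>(std::clamp(v, -127.f, 127.f));
    }
  }
  return scales;
}
```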
Also, I would like to use ONNX Runtime for inference, and I think I could implement a custom ONNX Runtime operator around the C++ kernels generated by T-MAC. I'm not sure whether this is a reasonable approach; a sketch of what I have in mind follows.
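Roughly, I imagine wrapping the generated kernel behind the ONNX Runtime custom-op C++ API (assuming a recent ONNX Runtime that provides the `Ort::KernelContext` wrapper). Everything here is a sketch: `tmac_qgemm`, its signature, the input layout, and the op name `TMacQGemm` are hypothetical placeholders for whatever the generated kernel.cc actually exposes:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <vector>

// Hypothetical entry point I assume the generated kernel.cc could expose;
// the exact contract (packing, layout, types) is exactly what I'm asking about.
extern "C" void tmac_qgemm(const uint8_t* packed_w, const float* scales,
                           const float* x, float* y, int m, int n, int k);

struct TMacKernel {
  void Compute(OrtKernelContext* context) {
    Ort::KernelContext ctx(context);
    auto packed_w = ctx.GetInput(0);  // quantized, packed weights (uint8)
    auto scales = ctx.GetInput(1);    // per-channel scales (float, shape [n])
    auto x = ctx.GetInput(2);         // activations (float, shape [m, k])

    auto x_shape = x.GetTensorTypeAndShapeInfo().GetShape();
    const int64_t m = x_shape[0], k = x_shape[1];
    const int64_t n = scales.GetTensorTypeAndShapeInfo().GetShape()[0];

    std::vector<int64_t> y_shape{m, n};
    auto y = ctx.GetOutput(0, y_shape);

    tmac_qgemm(packed_w.GetTensorData<uint8_t>(),
               scales.GetTensorData<float>(),
               x.GetTensorData<float>(),
               y.GetTensorMutableData<float>(),
               static_cast<int>(m), static_cast<int>(n), static_cast<int>(k));
  }
};

struct TMacOp : Ort::CustomOpBase<TMacOp, TMacKernel> {
  void* CreateKernel(const OrtApi& /*api*/, const OrtKernelInfo* /*info*/) const {
    return new TMacKernel();
  }
  const char* GetName() const { return "TMacQGemm"; }  // hypothetical op name
  size_t GetInputTypeCount() const { return 3; }
  ONNXTensorElementDataType GetInputType(size_t i) const {
    return i == 0 ? ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8
                  : ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
  }
  size_t GetOutputTypeCount() const { return 1; }
  ONNXTensorElementDataType GetOutputType(size_t /*i*/) const {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
  }
};
```

Registration would then presumably follow the standard pattern (`Ort::CustomOpDomain domain{"tmac"}; domain.Add(&op); session_options.Add(domain);`), but I may be missing T-MAC-specific weight packing or initialization steps.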
Could you please clarify:
- Does the current implementation support per-channel quantization (e.g., for convolutional layers or specific operators)?
- Is it feasible to use the generated kernel.cc for ONNX Runtime custom operators?
Thanks in advance for your insights! Looking forward to your guidance. 🙂