[FEATURE] Add FP8 Quantization to FlashAttention3 #373
Labels: enhancement (New feature or request), good first issue (Good for newcomers), hacktoberfest
Description
‼️ If you want to work on this issue: please comment below and wait until a maintainer assigns this issue to you before opening a PR to avoid several contributions on the same issue. Thanks! 😊
✨ What You’ll Do
Pruna already ships an implementation of FlashAttention 3, but it does not yet support quantization. Your task is to add optional FP8 quantization to it.
🤖 Useful Resources
- https://github.com/Dao-AILab/flash-attention
- Add a hyperparameter to FA3 (`src/pruna/algorithms/kernels/flash_attn3.py`) along the lines of `quantize=True/False`
- If the hyperparameter is activated, quantize the attention inputs to the FP8 data format
- Update the documentation
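
For anyone picking this up, the core idea of FP8 quantization of attention inputs is per-tensor dynamic scaling: compute the absolute maximum of the tensor, rescale so values fit the FP8 E4M3 range (max finite value 448), and keep the scale around for dequantization. The sketch below simulates that scaling in NumPy; the function names and the overall shape are hypothetical illustrations, not Pruna's actual API (the real implementation would cast to an FP8 dtype such as `torch.float8_e4m3fn` and pass the scales to the FA3 kernel):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_fp8_sim(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor dynamic scaling to the FP8 E4M3 range (simulated in fp32).

    Hypothetical sketch for illustration only, not Pruna's API. Returns the
    rescaled/clamped tensor and the scale needed to dequantize it.
    """
    amax = float(np.abs(x).max())
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # After dividing by the scale, all values lie within the E4M3 range;
    # the clip guards against edge cases such as inf inputs.
    x_q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).astype(np.float32)
    return x_q, scale


def dequantize_sim(x_q: np.ndarray, scale: float) -> np.ndarray:
    """Undo the per-tensor scaling."""
    return x_q * scale


# Example: quantize a fake query tensor and check the round trip.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8, 16)).astype(np.float32)
q_fp8, q_scale = quantize_fp8_sim(q)
assert np.abs(q_fp8).max() <= FP8_E4M3_MAX
assert np.allclose(dequantize_sim(q_fp8, q_scale), q, atol=1e-3)
```

Note that this only models the scaling step; the actual precision loss comes from the cast to 8 bits, which is why the hyperparameter should stay opt-in.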
✅ Acceptance Criteria
- It follows the style guidelines.
- Tests & Docs: All existing and new unit tests pass, and the documentation is updated.
And don’t forget to give us a ⭐️!
❓ Questions?
Feel free to jump into the #contributing Discord channel if you hit any roadblocks. Can’t wait to see your contribution! 🚀