[FEATURE] Add FP8 Quantization to FlashAttention3 #373

@sdiazlor

Description

‼️ If you want to work on this issue: please comment below and wait until a maintainer assigns this issue to you before opening a PR to avoid several contributions on the same issue. Thanks! 😊

✨ What You’ll Do

Pruna currently includes an implementation of FlashAttention-3, but quantization is not yet supported there. Your task is to add FP8 quantization where applicable.

🤖 Useful Resources

  • https://github.com/Dao-AILab/flash-attention
  • Add a hyperparameter along the lines of `quantize=True/False` to FA3 (src/pruna/algorithms/kernels/flash_attn3.py)
  • When the hyperparameter is enabled, quantize the attention inputs to the FP8 data format
  • Update the documentation
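The quantization step in the bullets above could look roughly like the following. This is a minimal NumPy sketch of the common per-tensor FP8 (e4m3) scaling recipe (scale so the largest magnitude maps to the FP8 max, then clamp); it is not Pruna's or flash-attention's actual API, and `fp8_quantize`, `fp8_dequantize`, and the `quantize` flag are illustrative assumptions. In the real implementation the kernel would perform the actual cast to an FP8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3


def fp8_quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Simulate per-tensor FP8 e4m3 quantization.

    Computes a scale so that max(|x|) maps to FP8_E4M3_MAX, divides by it,
    and clamps to the representable range. A real kernel would additionally
    round to FP8 precision; here we only model the scaling/clamping.
    """
    amax = float(np.abs(x).max())
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    x_scaled = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_scaled, scale


def fp8_dequantize(x_scaled: np.ndarray, scale: float) -> np.ndarray:
    """Recover the original dynamic range by multiplying the scale back in."""
    return x_scaled * scale


# Hypothetical attention inputs (batch, heads, head_dim); a quantize flag
# would gate this path in the algorithm.
quantize = True
q = np.random.randn(2, 8, 16).astype(np.float32)
if quantize:
    q_fp8, s = fp8_quantize(q)
    q_back = fp8_dequantize(q_fp8, s)
```

Per-tensor scaling keeps the bookkeeping minimal (one scale per input); finer-grained (e.g. per-head) scales trade extra bookkeeping for better accuracy.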

✅ Acceptance Criteria

  • It follows the style guidelines.
  • Tests & Docs: All existing and new unit tests pass, and the documentation is updated

And don’t forget to give us a ⭐️!


❓ Questions?

Feel free to jump into the #contributing Discord channel if you hit any roadblocks. Can’t wait to see your contribution! 🚀
