diffusers-blackwell-quants

Easy recipes to reduce the inference latency of Flux, QwenImage, and LTX-2 with NVFP4 and MXFP8 on Blackwell.

Note

We demonstrate reproducible end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 using diffusers and torchao on the Flux.1-Dev, QwenImage, and LTX-2 models on an NVIDIA B200. We also outline how we used selective quantization, CUDA Graphs, and LPIPS to iterate on the accuracy and performance of these models.

For more information (setup, results, discussions, etc.), please refer to our blog post (TODO).
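Selective quantization is typically driven by a per-module filter passed to torchao's quantize_ API. The sketch below is illustrative only, not the repository's actual recipe: the module names and skip list are assumptions, and the real logic lives in benchmark.py.

```python
# Sketch: selective quantization via a module-name filter.
# Hypothetical names -- quality-sensitive projections are skipped,
# linear layers inside transformer blocks get quantized.
def should_quantize(fqn: str) -> bool:
    """Return True if the module at fully-qualified name `fqn` should be quantized."""
    skip = ("x_embedder", "proj_out", "norm_out")  # assumed quality-sensitive layers
    return fqn.startswith("transformer_blocks") and not any(s in fqn for s in skip)

# With torchao on a Blackwell GPU (not executed here; the config object
# for MXFP8/NVFP4 depends on your torchao version):
# from torchao.quantization import quantize_
# quantize_(pipe.transformer, config,
#           filter_fn=lambda module, fqn: should_quantize(fqn))
```

The filter keeps layers that are cheap but accuracy-critical in bfloat16, which is how quality (measured via LPIPS) can be traded against speed.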

Thanks to Claude Code for pairing 🫡

Scripts

├── benchmark.py -- main benchmarking script which can run locally
├── compute_lpips.py -- computes LPIPS between two sets of images
├── run_all_benchmarks_local.sh -- shell script to bulk-launch runs
├── run_benchmark_local.py -- run the benchmark
└── run_drawbench_local.py -- generate images from DrawBench prompts
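A bulk launch (as in run_all_benchmarks_local.sh) boils down to sweeping the quantization modes. The flag name below is a hypothetical placeholder, not the script's confirmed interface -- check benchmark.py for the real one:

```shell
# Hypothetical sweep over quantization modes; --quant_mode is an assumed flag.
for mode in none mxfp8 nvfp4; do
  echo "python benchmark.py --quant_mode ${mode}"
done
```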

Computing LPIPS

We provide scripts to compute LPIPS between the bfloat16 results and the quantized results. First, generate the images separately for each of the quant_modes (including "none") with run_drawbench_local.py. This uses the DrawBench prompts and the Flux.1-Dev model.

Once all the images are generated, run compute_lpips.py.
