UPSTREAM PR #1306: improved flux attention qkv unpacking #71
Conversation
Overview
Analysis of 49,735 functions across Power Consumption:
Function Analysis
Critical Path Improvements:
Secondary Effects:

- STL functions showed minor regressions
- Initialization functions (…)
- Several functions improved from compiler optimizations

Additional Findings
The optimization targets the most performance-critical ML inference operations. With ~19 attention layers per forward pass and ~30 denoising steps per image, the 7.3μs per-layer improvement compounds to approximately 4-8ms total latency reduction per image generation. The zero-copy approach reduces memory bandwidth consumption and improves cache locality, which is particularly beneficial for GPU-accelerated inference, where CPU-GPU data transfer is a bottleneck. Hot-path improvements significantly outweigh cold-path regressions when weighted by call frequency.

🔎 Full breakdown: Loci Inspector
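The latency estimate above is simple arithmetic; here is a short sketch of it, using the per-layer figure from the analysis (layer and step counts are approximate, as stated):

```python
# Back-of-the-envelope estimate of the total saving per generated image.
per_layer_saving_us = 7.3   # measured per-attention-layer improvement (microseconds)
attention_layers = 19       # ~19 attention layers per forward pass
denoise_steps = 30          # ~30 denoising steps per image

total_us = per_layer_saving_us * attention_layers * denoise_steps
total_ms = total_us / 1000.0
print(f"~{total_ms:.1f} ms saved per image")  # ~4.2 ms, the lower end of the 4-8 ms range
```

With more layers or steps (e.g. higher step counts or larger model variants) the same arithmetic lands toward the upper end of the quoted range.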
Note
Source pull request: leejet/stable-diffusion.cpp#1306
This PR slightly improves performance for flux models by getting rid of some ggml_cont ops.
RTX 4090
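To illustrate the idea behind dropping the ggml_cont ops, here is a minimal NumPy sketch (not the actual ggml code; all shapes are invented for illustration): a fused QKV tensor is split into Q, K, and V with strided views over the original buffer instead of materializing contiguous copies.

```python
import numpy as np

batch, seq, n_head, head_dim = 1, 8, 4, 16
hidden = n_head * head_dim

# Fused QKV projection output: [..., 3 * hidden], as produced by a single matmul.
qkv = np.random.randn(batch, seq, 3 * hidden).astype(np.float32)

# Copying approach (analogous to forcing contiguity with ggml_cont):
# each split allocates and fills a fresh buffer.
q_copy = qkv[..., 0:hidden].copy()

# Zero-copy approach: basic slicing yields strided views over the same buffer,
# so no data is moved and cache locality of the fused tensor is preserved.
q, k, v = (qkv[..., i * hidden:(i + 1) * hidden] for i in range(3))

assert not q.flags.owndata            # q is a view, not a copy
assert np.shares_memory(q, qkv)       # it aliases the fused buffer
assert q.shape == k.shape == v.shape == (batch, seq, hidden)
```

The same principle applies in ggml: downstream ops that can consume non-contiguous (strided) tensors make the explicit contiguity copy, and its memory-bandwidth cost, unnecessary.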