
UPSTREAM PR #1306: improved flux attention qkv unpacking#71

Open
loci-dev wants to merge 1 commit into `main` from `loci/pr-1306-improve-flux-attn-qkv`

Conversation


@loci-dev loci-dev commented Mar 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1306

This PR slightly improves performance for Flux models by getting rid of some `ggml_cont` ops.

RTX 4090

| FLUX.2 Klein 4B (CFG 1, 4 steps, bf16) | master | This PR |
| --- | --- | --- |
| 512x512 | 7.8 it/s | 8.2 it/s |
| 1024x1024 | 2.5 it/s | 2.57 it/s |

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 2, 2026 04:41 — with GitHub Actions Inactive

loci-review bot commented Mar 2, 2026

Overview

Analysis of 49,735 functions across build.bin.sd-server and build.bin.sd-cli identified 59 modified functions (0.12%) with 0 new or removed functions. The commit "improved qkv speed by removing cont op" optimized QKV tensor extraction in Flux attention mechanisms by replacing memory-copy operations (ggml_cont) with zero-copy tensor views (ggml_view_4d).

Power Consumption:

  • build.bin.sd-server: 527,270.52 nJ → 527,177.24 nJ (-0.018%)
  • build.bin.sd-cli: 491,394.58 nJ → 491,401.03 nJ (+0.001%)

Function Analysis

Critical Path Improvements:

Flux::SelfAttention::pre_attention (both binaries) achieved a 15% response-time reduction (48,533 ns → 41,193 ns in sd-server, 48,659 ns → 41,320 ns in sd-cli), saving ~7.3 μs per call. The optimization eliminated the ggml_ext_chunk, ggml_cont, and ggml_reshape_4d operations, replacing 9 operations with 3 direct ggml_view_4d calls using calculated byte offsets. This removes memory allocations and copies during Q, K, V tensor extraction.
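To illustrate the idea behind the change, here is a minimal C++ sketch of splitting a packed QKV buffer into three zero-copy slices by computing offsets into the same storage, analogous to replacing `ggml_cont` copies with `ggml_view_4d` views. The `View` struct and `split_qkv` helper are hypothetical illustrations, not the actual ggml API or the PR's code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical zero-copy slice into an existing buffer: just a pointer and a
// length, no allocation and no memcpy (the role ggml_view_4d plays in ggml).
struct View {
    const float* data;  // points into the original buffer
    size_t size;        // number of elements in this slice
};

// Split one packed row laid out as [q | k | v] into three views.
// `dim` stands in for the per-projection width (e.g. n_head * head_dim);
// the layout is an assumption for illustration.
static void split_qkv(const std::vector<float>& qkv, size_t dim,
                      View& q, View& k, View& v) {
    q = { qkv.data() + 0 * dim, dim };  // offset 0:     Q slice
    k = { qkv.data() + 1 * dim, dim };  // offset dim:   K slice
    v = { qkv.data() + 2 * dim, dim };  // offset 2*dim: V slice
}
```

The key property is that all three views alias the original buffer, so extracting Q, K, and V costs only pointer arithmetic instead of three tensor copies.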

Flux::SingleStreamBlock::forward (sd-server) improved 3.88% (350,313ns → 336,727ns, -13.6μs), propagating benefits from the QKV optimization through the attention mechanism call chain.

Secondary Effects:

STL functions showed minor regressions: std::vector::end() (+183ns, +307%), nlohmann::json::items() (+185ns, +317%), and std::_Rb_tree::_M_insert() (+34ns, +45%). These are compiler optimization trade-offs in cold-path code (model initialization, metadata parsing) with negligible practical impact.

Initialization functions (WAN::Head::init_params, PhotoMakerIDEncoder::get_param_tensors) showed 22-27ns increases, acceptable for one-time setup operations.

Several functions improved from compiler optimizations: std::vector<gguf_tensor_info>::back() (-191ns, -57%), gguf_set_val_f64 (-62ns, -31%), and scheduler constructors (-32ns, -18%).

Additional Findings

The optimization targets the most performance-critical ML inference operations. With ~19 attention layers per forward pass and ~30 denoising steps per image, the 7.3μs per-layer improvement compounds to approximately 4-8ms total latency reduction per image generation. The zero-copy approach reduces memory bandwidth consumption and improves cache locality, particularly beneficial for GPU-accelerated inference where CPU-GPU data transfer is a bottleneck. Hot-path improvements significantly outweigh cold-path regressions when weighted by call frequency.
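As a back-of-envelope check of the compounding claim, the per-call saving can be multiplied out over layers and steps. The figures below (~7.3 μs per call, ~19 attention layers, ~30 denoising steps) are the approximate numbers quoted in this review, not measured values.

```cpp
// Back-of-envelope total latency saving per generated image:
// per-call saving (microseconds) * attention layers per pass * denoising steps.
static double total_saving_ms(double per_call_us, int layers, int steps) {
    return per_call_us * layers * steps / 1000.0;  // convert us -> ms
}
```

With 7.34 μs, 19 layers, and 30 steps this gives roughly 4.2 ms, consistent with the lower end of the 4-8 ms range above.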

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
