
UPSTREAM PR #1306: improved flux attention qkv unpacking#71

Open
loci-dev wants to merge 1 commit into `main` from `loci/pr-1306-improve-flux-attn-qkv`

Conversation


@loci-dev loci-dev commented Mar 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1306

This PR slightly improves performance for Flux models by getting rid of some `ggml_cont` ops.

RTX 4090

| FLUX.2 Klein 4B (CFG 1, 4 steps, bf16) | master | This PR |
| --- | --- | --- |
| 512x512 | 7.8 it/s | 8.2 it/s |
| 1024x1024 | 2.5 it/s | 2.57 it/s |

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 2, 2026 04:41 — with GitHub Actions Inactive

loci-review bot commented Mar 2, 2026

Overview

Analysis of 49,735 functions across build.bin.sd-server and build.bin.sd-cli identified 59 modified functions (0.12%) with 0 new or removed functions. The commit "improved qkv speed by removing cont op" optimized QKV tensor extraction in Flux attention mechanisms by replacing memory-copy operations (ggml_cont) with zero-copy tensor views (ggml_view_4d).

Power Consumption:

  • build.bin.sd-server: 527,270.52 nJ → 527,177.24 nJ (-0.018%)
  • build.bin.sd-cli: 491,394.58 nJ → 491,401.03 nJ (+0.001%)

Function Analysis

Critical Path Improvements:

Flux::SelfAttention::pre_attention (both binaries) achieved a 15% response-time reduction (48,533 ns → 41,193 ns in sd-server, 48,659 ns → 41,320 ns in sd-cli), saving ~7.3 μs per call. The optimization eliminated the ggml_ext_chunk, ggml_cont, and ggml_reshape_4d operations, replacing 9 operations with 3 direct ggml_view_4d calls using calculated byte offsets. This removes memory allocations and copies during Q, K, V tensor extraction.
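To illustrate the idea behind the change, here is a minimal C++ sketch of splitting a packed QKV buffer into three zero-copy slices by computing offsets into the same storage, analogous to replacing `ggml_cont` copies with `ggml_view_4d` views. The `View` struct and `split_qkv` helper are hypothetical illustrations, not the actual ggml API or the PR's code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical zero-copy slice into an existing buffer: just a pointer and a
// length, no allocation and no memcpy (the role ggml_view_4d plays in ggml).
struct View {
    const float* data;  // points into the original buffer
    size_t size;        // number of elements in this slice
};

// Split one packed row laid out as [q | k | v] into three views.
// `dim` stands in for the per-projection width (e.g. n_head * head_dim);
// the layout is an assumption for illustration.
static void split_qkv(const std::vector<float>& qkv, size_t dim,
                      View& q, View& k, View& v) {
    q = { qkv.data() + 0 * dim, dim };  // offset 0:     Q slice
    k = { qkv.data() + 1 * dim, dim };  // offset dim:   K slice
    v = { qkv.data() + 2 * dim, dim };  // offset 2*dim: V slice
}
```

The key property is that all three views alias the original buffer, so extracting Q, K, and V costs only pointer arithmetic instead of three tensor copies.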

Flux::SingleStreamBlock::forward (sd-server) improved 3.88% (350,313ns → 336,727ns, -13.6μs), propagating benefits from the QKV optimization through the attention mechanism call chain.

Secondary Effects:

STL functions showed minor regressions: std::vector::end() (+183ns, +307%), nlohmann::json::items() (+185ns, +317%), and std::_Rb_tree::_M_insert() (+34ns, +45%). These are compiler optimization trade-offs in cold-path code (model initialization, metadata parsing) with negligible practical impact.

Initialization functions (WAN::Head::init_params, PhotoMakerIDEncoder::get_param_tensors) showed 22-27ns increases, acceptable for one-time setup operations.

Several functions improved from compiler optimizations: std::vector<gguf_tensor_info>::back() (-191ns, -57%), gguf_set_val_f64 (-62ns, -31%), and scheduler constructors (-32ns, -18%).

Additional Findings

The optimization targets the most performance-critical ML inference operations. With ~19 attention layers per forward pass and ~30 denoising steps per image, the 7.3μs per-layer improvement compounds to approximately 4-8ms total latency reduction per image generation. The zero-copy approach reduces memory bandwidth consumption and improves cache locality, particularly beneficial for GPU-accelerated inference where CPU-GPU data transfer is a bottleneck. Hot-path improvements significantly outweigh cold-path regressions when weighted by call frequency.
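As a back-of-envelope check of the compounding claim, the per-call saving can be multiplied out over layers and steps. The figures below (~7.3 μs per call, ~19 attention layers, ~30 denoising steps) are the approximate numbers quoted in this review, not measured values.

```cpp
// Back-of-envelope total latency saving per generated image:
// per-call saving (microseconds) * attention layers per pass * denoising steps.
static double total_saving_ms(double per_call_us, int layers, int steps) {
    return per_call_us * layers * steps / 1000.0;  // convert us -> ms
}
```

With 7.34 μs, 19 layers, and 30 steps this gives roughly 4.2 ms, consistent with the lower end of the 4-8 ms range above.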

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
