UPSTREAM PR #1184: Feat: Select backend devices via arg by loci-dev · Pull Request #40 · auroralabs-loci/stable-diffusion.cpp

loci-dev · 2026-02-02T10:47:12Z

Note

Source pull request: leejet/stable-diffusion.cpp#1184

The main goal of this PR is to improve user experience in multi-gpu setups, allowing to chose which model part gets sent to which device.

Cli changes:

Adds the --main-backend-device [device_name] argument to set the default backend
remove --clip-on-cpu, --vae-on-cpu and --control-net-cpu arguments
replace them respectively with the new --clip_backend_device [device_name], --vae-backend-device [device_name], --control-net-backend-device [device_name] arguments
add the --diffusion_backend_device (control the device used for the diffusion/flow models) and the --tae-backend-device
add --upscaler-backend-device, --photomaker-backend-device, and --vision-backend-device
add --list-devices argument to print the list of available ggml devices and exit.
add --rpc argument to connect to a compatible GGML rpc server

C API changes (stable-diffusion.h):

Change the content of the sd_ctx_params_t struct.
void list_backends_to_buffer(char* buffer, size_t buffer_size) to write the details of the available buffers to a null-terminated char array. Devices are separated by newline characters (\n), and the name and description of the device are separated by \t character.
size_t backend_list_size() to get the size of the buffer needed for void list_backends_to_buffer
void add_rpc_device(const char* address); connect to a ggml RPC backend (from llama.cpp)

The default device selection should now consistently prioritize discrete GPUs over iGPUs.

For example if you want to run the text encoders on CPU, you'd need to use --clip_backend_device CPU instead of --clip-on-cpu

TODO:

Fix bug with --lora-apply-mode immediately when clip and diffsion models are running on different (non-cpu) backends.
Clean up logs

Important: to use RPC, you need to add -DGGML_RPC=ON to the build. Additionally it requires either sd.cpp to be built with -DSD_USE_SYSTEM_GGML flag (I haven't tested that one), or the RPC server to be built with -DCMAKE_C_FLAGS="-DGGML_MAX_NAME=128" -DCMAKE_CXX_FLAGS="-DGGML_MAX_NAME=128" (default is 64)

Fixes #1116

loci-review · 2026-02-03T11:50:26Z

Overview

Analysis of stable-diffusion.cpp across 18 commits reveals minimal performance impact from multi-backend device management refactoring. Of 48,425 total functions, 124 were modified (0.26%), 331 added, and 109 removed. Power consumption increased negligibly: build.bin.sd-cli (+0.388%, 479,167→481,028 nJ) and build.bin.sd-server (+0.239%, 512,977→514,202 nJ).

Function Analysis

SDContextParams Constructor (both binaries): Response time increased ~40% (+2,816-2,840ns) due to initializing 9 new std::string device placement fields replacing 3 boolean flags. Enables per-component GPU/CPU device selection for heterogeneous computing.

SDContextParams Destructor (both binaries): Response time increased ~42% (+2,497-2,505ns) from destroying 9 additional string members. One-time cleanup cost outside inference paths.

~StableDiffusionGGML (both binaries): Throughput time increased ~95% (+192ns absolute) managing 7 backend types versus 3, including loop-based cleanup for multiple CLIP backends. Response time impact minimal (+5.2%, ~720ns).

ggml_e8m0_to_fp32_half (sd-cli): Response time improved 24% (-36ns), benefiting quantization operations called millions of times during inference.

Standard library functions (std::_Rb_tree::begin, std::vector::_S_max_size, std::swap): Showed 76-289% throughput increases due to template instantiation complexity, but absolute changes remain under 220ns in non-critical initialization paths.

Additional Findings

All performance regressions occur in initialization and cleanup phases, not inference hot paths. The architectural changes enable multi-GPU workload distribution, per-component device placement (diffusion, CLIP, VAE on separate devices), and runtime backend flexibility. Quantization improvements and multi-GPU capabilities provide net performance gains during actual inference, far exceeding the microsecond-level initialization overhead. Changes are well-justified architectural improvements with negligible real-world impact.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-review · 2026-02-17T05:12:00Z

Overview

Analysis of 20 commits implementing multi-backend GPU architecture refactoring across 48,713 functions. Modified 132 functions (0.27%), added 406, removed 63. Power consumption increased minimally: build.bin.sd-server +1.13% (515,491→521,302 nJ), build.bin.sd-cli +1.17% (480,110→485,727 nJ). Performance regressions concentrated in initialization and state-change operations; inference hot path unaffected.

Function Analysis

apply_loras_immediately (both binaries): Response time increased +199% (10.4ms→31.1ms, +20.7ms absolute). Throughput time increased +106-109% (+705-719ns). Changes fix critical bug enabling correct LoRA application across heterogeneous backends (e.g., diffusion on GPU, CLIP on CPU). Function now performs backend similarity checking, creates separate tensor filters per model component, and loads LoRAs independently for diffusion/CLIP/VAE backends. Called only during LoRA state changes, not per-inference, making 20ms overhead acceptable.

~StableDiffusionGGML (both binaries): Throughput time increased +95% (201ns→393ns, +192ns). Response time increased +5.2% (13.9μs→14.6μs, +722ns). Destructor now manages multiple backends via vector iteration (multiple CLIP encoders) plus specialized backends for diffusion, TAE, PhotoMaker, vision models. Enables proper multi-device resource cleanup.

~SDContextParams (both binaries): Response time increased +42% (5.9μs→8.4μs, +2.5μs). Replaced 3 boolean flags with 9 std::string members for granular per-component device specification, requiring heap deallocation overhead. Enables flexible multi-GPU device assignment.

Standard library functions showed mixed performance due to compiler/toolchain differences, not application code changes. Other analyzed functions saw negligible changes or improvements.

Additional Findings

Architectural refactoring enables critical capabilities: multi-GPU workload distribution, heterogeneous CPU/GPU computing, multi-encoder support (Flux models), and RPC distributed inference. Commit e511f77 fixed correctness bug in multi-backend LoRA application. Commit 3f62282 added sequential tensor loading for RPC backends. Performance regressions are isolated to non-critical initialization/cleanup code, with zero impact on per-image inference latency. The 1.1-1.2% power increase reflects initialization overhead amortized over many inferences. Trade-off strongly favors functionality over minimal performance cost.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

fix sdxl conditionner backends fix sd3 backend display

loci-review · 2026-03-02T05:41:12Z

Overview

Analysis of 50,145 functions across two binaries reveals a major architectural refactoring implementing multi-backend device management. 126 functions modified (0.25%), 406 new (0.81%), 63 removed (0.13%), with 49,550 unchanged (98.81%).

Power consumption changes:

build.bin.sd-server: +0.235% (+1,240 nJ)
build.bin.sd-cli: +0.553% (+2,717 nJ)

Performance regressions are concentrated in initialization paths, not inference hot paths, with acceptable trade-offs for significant architectural improvements.

Function Analysis

apply_loras_immediately (both binaries): Response time increased from ~10.4ms to ~31.2ms (+199%, +20.8ms absolute). Throughput time doubled from ~690ns to ~1,370ns (+98-101%). Architectural refactoring now applies LoRAs across three separate backends (diffusion, CLIP, VAE/TAE) instead of one, with tensor filtering and compatibility validation. This one-time initialization cost enables multi-GPU LoRA application and heterogeneous hardware support.

~StableDiffusionGGML (both binaries): Response time increased ~720ns (+5.2%), throughput time increased ~192ns (+95.5%). Destructor now manages six distinct backends with loop-based cleanup for multiple CLIP backends, enabling proper resource deallocation for heterogeneous hardware.

SDContextParams constructor/destructor (both binaries): Response time increased ~2.5-2.8μs (+40-42%). Struct refactored from 3 boolean flags to 9 std::string members for granular device specification, increasing heap allocation overhead but enabling flexible per-component device assignment.

Compiler-optimized functions: std::vector::end(), std::vector::begin(), and std::vector::back() showed 21-75% improvements (180-190ns faster) from toolchain optimizations. Standard library regex functions showed minor regressions (43-122% slower, 78-162ns) from STL implementation differences.

Additional Findings

The 21-commit refactoring enables critical production capabilities: multi-GPU support with parallel CLIP encoding, heterogeneous device allocation (e.g., CLIP on CPU, diffusion on GPU), and distributed inference via RPC backends. All performance regressions occur in cold-path initialization operations, not inference hot paths. The 20.8ms LoRA application overhead is negligible compared to typical inference times (1-10 seconds per image) and is amortized across multiple generations. The architectural changes align with modern ML framework requirements for complex hardware configurations while maintaining backward compatibility.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 19 times, most recently from 052ebb0 to 76ede2c Compare February 3, 2026 10:20

loci-dev force-pushed the loci/pr-1184-select-backend branch from 29e8399 to 2d43513 Compare February 3, 2026 10:46

loci-dev temporarily deployed to stable-diffusion-cpp-prod February 3, 2026 10:46 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 76ede2c to d36519c Compare February 3, 2026 11:18

loci-dev force-pushed the main branch 7 times, most recently from 5bbc590 to 68f62a5 Compare February 8, 2026 04:51

loci-dev force-pushed the loci/pr-1184-select-backend branch from 2d43513 to 3f62282 Compare February 17, 2026 04:18

loci-dev temporarily deployed to stable-diffusion-cpp-prod February 17, 2026 04:18 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 74d69ae to 10ea7dd Compare February 20, 2026 04:16

loci-dev force-pushed the main branch from 10ea7dd to 2f8b672 Compare February 27, 2026 04:17

stduhpf and others added 21 commits March 1, 2026 20:41

Select backend devices via arg

e622901

fix build

1939c38

show backend device description

69db292

CLI: add --list-devices arg

74ac71d

null-terminate even if buffer is too small

5f79501

move stuff to ggml_extend.cpp

08f14a6

--upscaler-backend-device

1c30a15

use diffusion_backend for loading LoRAs

f4bc429

--photomaker-backend-device (+fixes)

df10b7c

--vision-backend-device

8f1a855

check backends at runtime

c6057ba

fix missing includes

17628f2

fix typo

f6d2d5a

multiple clip backend devices

41987ec

fix sdxl conditionner backends fix sd3 backend display

update help message

ae3fc74

Add RPC documentation

af19b7f

update docs

48404da

update RPC docs

e135497

fix apply_loras_immediately when using different non-CPU backends

362af51

Force sequencial tensor loading when using RPC

8e886c1

fix build

84dd892

loci-dev force-pushed the main branch from 2f8b672 to 2cf1d7d Compare March 2, 2026 04:16

loci-dev force-pushed the loci/pr-1184-select-backend branch from 3f62282 to 84dd892 Compare March 2, 2026 04:16

loci-dev deployed to stable-diffusion-cpp-prod March 2, 2026 04:16 — with GitHub Actions Active

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #1184: Feat: Select backend devices via arg#40

UPSTREAM PR #1184: Feat: Select backend devices via arg#40
loci-dev wants to merge 21 commits intomainfrom
loci/pr-1184-select-backend

loci-dev commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 3, 2026

Uh oh!

loci-review bot commented Feb 17, 2026

Uh oh!

loci-review bot commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 3, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

loci-review bot commented Feb 17, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

loci-review bot commented Mar 2, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants