feat: WriteCombined DMA staging for expert weights (+50% decode)#19

Open
3spky5u-oss wants to merge 1 commit into brontoguana:main from 3spky5u-oss:feat/writecombined-dma-staging

Conversation

@3spky5u-oss

Summary

WriteCombined host memory bypasses the CPU cache hierarchy, giving the GPU DMA engine uncontested access to the memory bus during expert weight transfers. This is the operation that dominates decode latency in hybrid MoE inference.

  • +50% decode throughput on PCIe Gen5 (38 → 57 tok/s)
  • +64% PCIe bandwidth (~28 GB/s → ~46 GB/s measured)
  • Opt-in via --wc-alloc flag — zero behavior change without it
  • Incremental per-layer migration keeps peak RAM manageable

Benchmark — Qwen3.5-122B-A10B on single RTX 5090

Hardware: RTX 5090 32GB, Ryzen 9950X3D, 91GB DDR5, PCIe Gen5 x16, Fedora 43, CUDA 13.2

Config: INT4 experts (Marlin), AWQ attention, FP8 KV cache, LGS=2, 38% HCS

| Mode | Decode (tok/s) | ms/tok | PCIe BW |
|---|---|---|---|
| v0.1.66-rc1 (baseline) | 38.6 | 25.9 | ~28 GB/s |
| v0.1.66-rc1 + `--wc-alloc` | 57.2 | 17.5 | ~46 GB/s |
| Improvement | +48% | -32% | +64% |

Decode numbers from 50/100/250 token runs are consistent at 57.2-57.3 tok/s across all lengths. VRAM calibration probes confirm the improvement scales to long context (40K tokens: 32.1 → 45.7 tok/s).

How it works

Expert weights (56 GB for the 122B model at INT4) live in system RAM. During decode, the top-K experts for each layer are DMA'd to the GPU over PCIe. Regular pinned memory tops out at ~28 GB/s because CPU cache-coherency traffic contends with the DMA reads. WriteCombined memory (cuMemHostAlloc(CU_MEMHOSTALLOC_WRITECOMBINED)) marks the allocation as write-combining, which:

  1. Bypasses CPU L1/L2/L3 caches entirely
  2. Coalesces writes into full cache-line bursts
  3. Gives the DMA engine exclusive memory bus access

The CPU never reads this data after initial setup, so bypassing caches has no downside.
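As a rough sketch of the allocation path (Python/ctypes against the CUDA driver API; the flag values come from cuda.h, but the function name `alloc_wc_staging` and the error handling are illustrative, not the actual Rust implementation in `gpu_decode.rs`):

```python
import ctypes

# Flag values from cuda.h (CUDA driver API).
CU_MEMHOSTALLOC_PORTABLE = 0x01       # visible to all CUDA contexts
CU_MEMHOSTALLOC_WRITECOMBINED = 0x04  # bypass CPU caches; fast DMA reads

def alloc_wc_staging(nbytes: int) -> ctypes.c_void_p:
    """Allocate WriteCombined host memory via the CUDA driver API.

    A CUDA context must already be current on the calling thread
    (e.g. via cuDevicePrimaryCtxRetain + cuCtxSetCurrent).
    """
    cuda = ctypes.CDLL("libcuda.so.1")
    ptr = ctypes.c_void_p()
    flags = CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_WRITECOMBINED
    rc = cuda.cuMemHostAlloc(ctypes.byref(ptr),
                             ctypes.c_size_t(nbytes),
                             ctypes.c_uint(flags))
    if rc != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"cuMemHostAlloc failed with CUresult {rc}")
    return ptr
```

The caveat that makes this opt-in: the CPU should only ever *write* into such a buffer (ideally sequentially), because reads from WC memory are uncached and extremely slow. That is exactly the access pattern here — the host populates the weights once, and the GPU does all subsequent reads.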

Peak RAM management

Without incremental freeing: original heap (56 GB) + WC copy (56 GB) = 112 GB on a 91 GB system → OOM. The migration loop allocates WC for one layer, copies data, frees the original heap Vec, and moves to the next layer. Peak RAM stays at ~74 GB.
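The arithmetic behind the incremental strategy can be checked with a toy simulator (layer count and the per-layer granularity are hypothetical; it also ignores the process's other resident memory, which is why the real peak lands near 74 GB rather than ~57 GB):

```python
def peak_ram_gb(total_gb: float, n_layers: int, incremental: bool) -> float:
    """Simulate peak host RAM during heap -> WriteCombined migration."""
    layer_gb = total_gb / n_layers
    heap = total_gb   # original Vec-backed weights
    wc = 0.0          # WriteCombined copies made so far
    peak = heap
    for _ in range(n_layers):
        wc += layer_gb                # allocate WC buffer for this layer
        peak = max(peak, heap + wc)   # both copies of this layer coexist
        if incremental:
            heap -= layer_gb          # free the original Vec immediately

    return peak

# 56 GB of expert weights across e.g. 60 layers:
naive = peak_ram_gb(56.0, 60, incremental=False)           # ~112 GB -> OOM on 91 GB
layer_by_layer = peak_ram_gb(56.0, 60, incremental=True)   # ~total + one layer
```

With incremental freeing the weights' contribution to peak RAM is bounded by total size plus a single layer, independent of layer count.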

Changes

| File | Lines | What |
|---|---|---|
| build.rs | +47 | Dual sm_80 + sm_120 PTX compilation (backward-compatible) |
| src/gpu_decode.rs | +247 | WcBuffer, WcLayerPtrs, setup_wc_expert_memory() with PyO3 bindings |
| src/weights/mod.rs | +32 | free_layer_backing_gpu() for safe incremental Vec freeing |
| python/krasis/server.py | +7 | --wc-alloc CLI flag |
| python/krasis/model.py | +14/-9 | WC setup call + swap-aware RAM watchdog |
| src/gpu_prefill.rs | +1/-1 | PTX filename fix for sm_80/sm_120 split |

RAM watchdog fix (included)

The existing RAM watchdog uses a 5% floor on MemAvailable and doesn't count swap. This kills the process during WC migration on 91 GB systems even with 64 GB NVMe swap available. Fixed to:

  • Count SwapFree as available headroom (NVMe swap is fast enough for transient spikes)
  • Lower default floor from 5.0% to 0.5% (configurable via KRASIS_RAM_FLOOR_PERCENT)
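A minimal sketch of the swap-aware check (field names are the kernel's /proc/meminfo keys in kB; the function name `ram_headroom_ok` is illustrative — the actual watchdog lives in model.py):

```python
import os
from typing import Optional

def ram_headroom_ok(meminfo: dict,
                    floor_percent: Optional[float] = None) -> bool:
    """Return True if available RAM plus free swap stays above the floor.

    meminfo maps /proc/meminfo fields to kB, e.g.
    {"MemTotal": ..., "MemAvailable": ..., "SwapFree": ...}.
    """
    if floor_percent is None:
        floor_percent = float(os.environ.get("KRASIS_RAM_FLOOR_PERCENT", "0.5"))
    # SwapFree counts as headroom: NVMe swap is fast enough to absorb
    # the transient spikes during per-layer WC migration.
    headroom_kb = meminfo["MemAvailable"] + meminfo["SwapFree"]
    floor_kb = meminfo["MemTotal"] * floor_percent / 100.0
    return headroom_kb >= floor_kb

# 91 GB box mid-migration: 2 GB MemAvailable left, 64 GB NVMe swap free.
mid_migration = {"MemTotal": 91 * 1024**2,
                 "MemAvailable": 2 * 1024**2,
                 "SwapFree": 64 * 1024**2}
assert ram_headroom_ok(mid_migration)       # swap keeps the process alive
assert not ram_headroom_ok({**mid_migration,
                            "MemAvailable": 0, "SwapFree": 0})
```

Under the old 5% floor with no swap credit, the first scenario above would have been killed despite 64 GB of usable headroom.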

Test plan

  • Builds cleanly with maturin develop --release (both sm_80 and sm_120 PTX)
  • Without --wc-alloc: identical behavior to baseline (no regression)
  • With --wc-alloc: 57 tok/s decode on 122B, model loads and serves correctly
  • Incremental heap freeing keeps peak RAM at ~74 GB on 91 GB system
  • Multi-GPU WC setup (untested, code path exists)

🤖 Generated with Claude Code

…Ie Gen5)

Migrate expert weights from regular pinned memory to WriteCombined host
memory via cuMemHostAlloc(WRITECOMBINED | PORTABLE). WC memory bypasses
the CPU cache hierarchy, giving the GPU DMA engine uncontested access to
the memory bus.

Results on RTX 5090 (PCIe Gen5) with Qwen3.5-122B-A10B INT4:
  - Regular pinned: ~38 tok/s decode
  - WriteCombined:  ~57 tok/s decode (+50%)
  - PCIe bandwidth: ~28 GB/s → ~46 GB/s (+64%)

Changes:
  - build.rs: Dual sm_80 + sm_120 PTX compilation (backward-compatible,
    falls back to sm_80 JIT if sm_120 nvcc unavailable)
  - gpu_decode.rs: WcBuffer/WcLayerPtrs structs, setup_wc_expert_memory()
    with incremental per-layer heap freeing to control peak RAM
  - weights/mod.rs: free_layer_backing_gpu() for safe Vec replacement
  - server.py: --wc-alloc CLI flag
  - model.py: WC setup call after engine init, swap-aware RAM watchdog
    (counts SwapFree as headroom, lowers floor to 0.5%)
  - gpu_prefill.rs: Fix PTX filename for sm_80/sm_120 split

Usage: krasis-server --wc-alloc (opt-in, no behavior change without flag)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
