feat: WriteCombined DMA staging for expert weights (+50% decode) #19
Open

3spky5u-oss wants to merge 1 commit into brontoguana:main
Conversation
Migrate expert weights from regular pinned memory to WriteCombined host
memory via cuMemHostAlloc(WRITECOMBINED | PORTABLE). WC memory bypasses
the CPU cache hierarchy, giving the GPU DMA engine uncontested access to
the memory bus.
Results on RTX 5090 (PCIe Gen5) with Qwen3.5-122B-A10B INT4:
- Regular pinned: ~38 tok/s decode
- WriteCombined: ~57 tok/s decode (+50%)
- PCIe bandwidth: ~28 GB/s → ~46 GB/s (+64%)
Changes:
- build.rs: Dual sm_80 + sm_120 PTX compilation (backward-compatible,
falls back to sm_80 JIT if sm_120 nvcc unavailable)
- gpu_decode.rs: WcBuffer/WcLayerPtrs structs, setup_wc_expert_memory()
with incremental per-layer heap freeing to control peak RAM
- weights/mod.rs: free_layer_backing_gpu() for safe Vec replacement
- server.py: --wc-alloc CLI flag
- model.py: WC setup call after engine init, swap-aware RAM watchdog
(counts SwapFree as headroom, lowers floor to 0.5%)
- gpu_prefill.rs: Fix PTX filename for sm_80/sm_120 split
Usage: krasis-server --wc-alloc (opt-in, no behavior change without flag)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
WriteCombined host memory bypasses the CPU cache hierarchy, giving the GPU DMA engine uncontested access to the memory bus during expert weight transfers. This is the operation that dominates decode latency in hybrid MoE inference.
Opt-in via the `--wc-alloc` flag — zero behavior change without it.

Benchmark — Qwen3.5-122B-A10B on single RTX 5090
Hardware: RTX 5090 32GB, Ryzen 9950X3D, 91GB DDR5, PCIe Gen5 x16, Fedora 43, CUDA 13.2
Config: INT4 experts (Marlin), AWQ attention, FP8 KV cache, LGS=2, 38% HCS
With `--wc-alloc`, decode numbers from 50/100/250-token runs are consistent at 57.2-57.3 tok/s across all lengths. VRAM calibration probes confirm the improvement scales to long context (40K tokens: 32.1 → 45.7 tok/s).
How it works
Expert weights (56 GB for 122B INT4) live in system RAM. During decode, top-K experts per layer are DMA'd to GPU over PCIe. Regular pinned memory achieves ~28 GB/s because CPU cache coherency traffic contends with DMA reads. WriteCombined memory (
cuMemHostAlloc(CU_MEMHOSTALLOC_WRITECOMBINED)) marks the memory as write-combining: CPU stores go through write-combine buffers instead of the cache hierarchy, so GPU DMA reads see no cache-snoop traffic on the memory bus. The CPU never reads this data after initial setup, so bypassing caches has no downside.
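The allocation call can be sketched directly against the CUDA driver API. A minimal Python/ctypes sketch, assuming `libcuda.so.1` is loadable at runtime; the flag values are the ones defined in `cuda.h`, but `wc_flags` and `alloc_wc` are hypothetical helper names — the project's actual implementation is Rust code in `gpu_decode.rs`:

```python
import ctypes

# Flag values from cuda.h (CUDA driver API)
CU_MEMHOSTALLOC_PORTABLE = 0x01
CU_MEMHOSTALLOC_WRITECOMBINED = 0x04

def wc_flags(portable=True):
    """Flag word passed to cuMemHostAlloc for write-combined staging."""
    flags = CU_MEMHOSTALLOC_WRITECOMBINED
    if portable:
        flags |= CU_MEMHOSTALLOC_PORTABLE
    return flags

def alloc_wc(nbytes, lib=None):
    """Allocate write-combined pinned host memory via the CUDA driver API.

    Real code must call cuInit(0) and hold a CUDA context before this.
    Returns a void pointer; the caller frees it with cuMemFreeHost.
    """
    lib = lib or ctypes.CDLL("libcuda.so.1")
    ptr = ctypes.c_void_p()
    # CUresult cuMemHostAlloc(void **pp, size_t bytesize, unsigned int Flags);
    rc = lib.cuMemHostAlloc(ctypes.byref(ptr), ctypes.c_size_t(nbytes), wc_flags())
    if rc != 0:
        raise RuntimeError(f"cuMemHostAlloc failed: CUresult {rc}")
    return ptr
```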
Peak RAM management
Without incremental freeing: original heap (56 GB) + WC copy (56 GB) = 112 GB on a 91 GB system → OOM. The migration loop allocates WC for one layer, copies data, frees the original heap Vec, and moves to the next layer. Peak RAM stays at ~74 GB.
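The peak-RAM argument can be checked with a toy model. A sketch with hypothetical helper names; it models only the expert heap itself, so it shows "full heap plus one layer" rather than the full ~74 GB process peak, which also includes the rest of the runtime and OS:

```python
def naive_peak_gb(layer_sizes_gb):
    """Allocate all WC memory up front, free the heap afterwards:
    both full copies coexist at the peak."""
    return 2 * sum(layer_sizes_gb)

def incremental_peak_gb(layer_sizes_gb):
    """Per-layer migration: allocate WC for one layer, copy, free the
    original heap Vec, move to the next layer. Peak is the full heap
    plus a single layer's WC copy."""
    heap = sum(layer_sizes_gb)  # original expert heap, fully resident at start
    wc = 0.0
    peak = heap
    for size in layer_sizes_gb:
        wc += size                   # WC copy of this layer allocated
        peak = max(peak, heap + wc)  # both copies of this layer briefly coexist
        heap -= size                 # original layer freed before the next one
    return peak
```

For 56 layers of 1 GB each, the naive path peaks at 112 GB while the incremental path peaks at 57 GB — the shape of the 112 GB-vs-74 GB numbers above.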
Changes
- `build.rs` — dual sm_80 + sm_120 PTX compilation
- `src/gpu_decode.rs` — `WcBuffer`, `WcLayerPtrs`, `setup_wc_expert_memory()` with PyO3 bindings
- `src/weights/mod.rs` — `free_layer_backing_gpu()` for safe incremental Vec freeing
- `python/krasis/server.py` — `--wc-alloc` CLI flag
- `python/krasis/model.py` — WC setup call after engine init, swap-aware RAM watchdog
- `src/gpu_prefill.rs` — PTX filename fix for the sm_80/sm_120 split
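The backward-compatible PTX selection in `build.rs` amounts to "compile sm_80 always, add sm_120 only when the installed nvcc supports it". A sketch of that probe, in Python for brevity (the real logic is Rust in `build.rs`, and `ptx_targets` is a hypothetical name):

```python
import subprocess

def ptx_targets(nvcc="nvcc"):
    """Pick PTX compile targets: sm_80 always; add sm_120 only if this
    nvcc advertises it. Older toolkits fall back to sm_80 JIT."""
    try:
        # --list-gpu-arch prints the compute_XX architectures this toolkit supports
        out = subprocess.run([nvcc, "--list-gpu-arch"],
                             capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return ["sm_80"]  # no usable nvcc: ship sm_80 PTX only
    targets = ["sm_80"]
    if "compute_120" in out:
        targets.append("sm_120")
    return targets
```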
RAM watchdog fix (included)

The existing RAM watchdog uses a 5% floor on MemAvailable and doesn't count swap. This kills the process during WC migration on 91 GB systems even with 64 GB of NVMe swap available. Fixed to:
- count SwapFree as available headroom (NVMe swap is fast enough for transient spikes)
- lower the floor to 0.5%, configurable via KRASIS_RAM_FLOOR_PERCENT
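The swap-aware check can be sketched as follows (hypothetical function names; the real watchdog lives in `python/krasis/model.py` and reads `/proc/meminfo` directly):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB values."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[key.strip()] = int(rest.split()[0])
    return fields

def ram_watchdog_ok(meminfo_text, floor_percent=0.5, count_swap=True):
    """Return True if free-memory headroom is above the floor.

    Swap-aware policy: SwapFree counts toward headroom, on the theory that
    NVMe swap absorbs transient spikes like the WC migration.
    """
    m = parse_meminfo(meminfo_text)
    headroom_kb = m["MemAvailable"] + (m.get("SwapFree", 0) if count_swap else 0)
    floor_kb = m["MemTotal"] * floor_percent / 100.0
    return headroom_kb >= floor_kb
```

With ~1.2 GB of MemAvailable and 64 GB of SwapFree on a 91 GB box, the old policy (5% floor, no swap) kills the process while the new one (0.5% floor, swap counted) lets the migration finish.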
Test plan

- maturin develop --release builds cleanly (both sm_80 and sm_120 PTX)
- Without --wc-alloc: identical behavior to baseline (no regression)
- With --wc-alloc: 57 tok/s decode on 122B, model loads and serves correctly

🤖 Generated with Claude Code