[Feat] Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control #224
funjokerwen wants to merge 4 commits into ovg-project:main
Conversation
Thanks for the contribution! Before diving deeper, I have a quick question: what's the motivation and background for this PR? For example, what are the target scenarios, and how does this PR help them?
Thanks for the question. This PR is motivated by intra-node multi-GPU KV-cache memory pooling.
Different instances on the same node may be compute-bound or KV-cache/memory-bound.
In data-parallel mode, KV cache usage can become uneven due to request skew or sequence length variance.
The abstraction decouples logical KV cache usage from physical GPU placement, making it extensible to future scale-up or extended-memory setups. Overall, this PR improves KV-cache memory efficiency and flexibility under dynamic multi-GPU workloads.
Summary
This PR introduces explicit physical device–aware page allocation and mapping for KV cache, enabling tiered allocation across local and shared GPUs. The change allows KV pages to be created on one GPU and safely mapped for access from another GPU, laying the foundation for scale-up / shared-memory inference scenarios (e.g. NVLink-connected GPUs).
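To make the tiered, device-aware allocation concrete, here is a minimal bookkeeping sketch in Python. The class and method names (`TieredPageAllocator`, `allocate`, `KVPage`) are hypothetical illustrations, not the PR's actual API: pages are handed out from the local GPU's pool first and spill over to a shared GPU only when the local pool is exhausted, with each page carrying an explicit physical device ID.

```python
from dataclasses import dataclass


@dataclass
class KVPage:
    page_id: int
    phys_device: int  # physical GPU that owns the backing memory


class TieredPageAllocator:
    """Hypothetical sketch of tiered (local -> shared GPU) page allocation."""

    def __init__(self, local_device: int, shared_device: int,
                 local_capacity: int, shared_capacity: int):
        self.local_device = local_device
        self.shared_device = shared_device
        # Remaining free pages per physical device.
        self.free = {local_device: local_capacity,
                     shared_device: shared_capacity}
        self.next_id = 0

    def allocate(self) -> KVPage:
        # Prefer the local GPU; fall back to the shared GPU.
        for dev in (self.local_device, self.shared_device):
            if self.free[dev] > 0:
                self.free[dev] -= 1
                page = KVPage(self.next_id, dev)
                self.next_id += 1
                return page
        raise MemoryError("no free KV pages on any device")
```

Because every `KVPage` records its physical device, the mapping layer can later decide per page whether a plain local mapping suffices or a cross-GPU mapping is required.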
Key Changes
- C++ Core
- Python Management Layer
- Engine Integration
- CLI
- Tensor Parallel
Usage
Configure shared memory: kvctl config-shared kvcached 6 8G
Start inference (sglang / vLLM): the shared-memory config is automatically loaded.
KV cache pages are allocated in tiers (local first, then shared GPU) and mapped with explicit physical device IDs, enabling cross-GPU KV sharing without changes to model code.
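The cross-GPU mapping step can be sketched as a small bookkeeping table. This is a hypothetical illustration (the `CrossGPUPageMap` name and its methods are not the PR's actual API): it tracks which physical device owns each page and which peer devices have been granted access, mirroring the create/map/set-access sequence (cuMemCreate, cuMemMap, cuMemSetAccess) that the CUDA driver's virtual memory management API uses for cross-GPU mappings.

```python
class CrossGPUPageMap:
    """Hypothetical sketch of device-aware page access tracking."""

    def __init__(self):
        self.owner = {}    # page_id -> owning physical device
        self.access = {}   # page_id -> devices granted access

    def create(self, page_id: int, phys_device: int) -> None:
        # Analogous to cuMemCreate: physical backing lives on phys_device,
        # which implicitly has access to its own page.
        self.owner[page_id] = phys_device
        self.access[page_id] = {phys_device}

    def grant_access(self, page_id: int, peer_device: int) -> None:
        # Analogous to cuMemSetAccess: allow a peer GPU to access the
        # mapping (e.g. over NVLink) without copying the page.
        if page_id not in self.owner:
            raise KeyError(f"page {page_id} was never created")
        self.access[page_id].add(peer_device)
```

Keeping ownership and access as separate records is what lets a page be created on one GPU and safely consumed from another, which is the property the PR's explicit physical device IDs are meant to guarantee.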
Impact