[Feat] Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control #224

Open
funjokerwen wants to merge 4 commits into ovg-project:main from funjokerwen:main

Conversation

@funjokerwen

Summary

This PR introduces explicit physical device–aware page allocation and mapping for KV cache, enabling tiered allocation across local and shared GPUs. The change allows KV pages to be created on one GPU and safely mapped for access from another GPU, laying the foundation for scale-up / shared-memory inference scenarios (e.g. NVLink-connected GPUs).

Key Changes

C++ Core

  • Page allocation and mapping APIs now accept an explicit phys_dev_id / access_dev_idx, decoupling where memory is allocated from which device accesses it.
  • make_shared_page / make_unique_page support allocating pages on a specified physical GPU.
  • map_to_kv_tensors and new single-page map APIs support per-page physical device selection.
  • Unified Page::map signature across CPU/GPU pages with device-index–based access control.
  • Updated PyTorch bindings to expose phys_device_ids to Python.
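The core idea above, decoupling where a page physically lives (phys_dev_id) from which device maps and accesses it (access_dev_idx), can be sketched in Python. This is a hypothetical simplification mirroring the parameter names in the PR, not the actual C++ implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    """Minimal model of a KV-cache page: the GPU that owns the
    physical memory vs. the device that has it mapped."""
    page_id: int
    phys_dev_id: int                       # GPU owning the physical memory
    mapped_dev_idx: Optional[int] = None   # device that mapped it, if any

    def map(self, access_dev_idx: int) -> bool:
        # A page allocated on phys_dev_id may be mapped for access
        # from a different device (e.g. over NVLink). Returns True
        # when the mapping is cross-device.
        self.mapped_dev_idx = access_dev_idx
        return access_dev_idx != self.phys_dev_id

page = Page(page_id=0, phys_dev_id=6)
print(page.map(access_dev_idx=0))  # True: allocated on GPU 6, accessed from GPU 0
```

In the real API this split is what lets make_shared_page / make_unique_page place memory on a chosen GPU while map_to_kv_tensors maps it for a (possibly different) accessing device.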

Python Management Layer

  • Page tracks its physical device (phys_dev_id).
  • PageAllocator gains:
    • shared_memory_config to define shared GPUs and per-GPU memory limits.
    • Tiered allocation strategy: local GPU first, then shared GPUs.
    • Tracking of remote allocations and page locations.
    • Mapping/free logic that distinguishes local vs. remote pages.
  • KVCacheManager transparently propagates shared-memory configuration.
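The tiered strategy (local GPU first, then shared GPUs subject to per-GPU limits) can be sketched as follows. Names and the per-page accounting are hypothetical simplifications of the PR's PageAllocator:

```python
class PageAllocator:
    """Sketch of tiered page allocation: try the local GPU first,
    then fall back to shared GPUs in configured order."""

    def __init__(self, local_dev, page_size, local_limit, shared_memory_config):
        # shared_memory_config: {gpu_id: byte_limit}, as set via kvctl
        self.local_dev = local_dev
        self.page_size = page_size
        self.limits = {local_dev: local_limit, **shared_memory_config}
        self.used = {dev: 0 for dev in self.limits}

    def alloc_page(self):
        # Tier 1: local GPU; Tier 2: shared GPUs with remaining capacity.
        order = [self.local_dev] + [d for d in self.limits if d != self.local_dev]
        for dev in order:
            if self.used[dev] + self.page_size <= self.limits[dev]:
                self.used[dev] += self.page_size
                return dev  # phys_dev_id of the new page
        raise MemoryError("no capacity on local or shared GPUs")

alloc = PageAllocator(local_dev=0, page_size=2, local_limit=4,
                      shared_memory_config={6: 4})
print([alloc.alloc_page() for _ in range(4)])  # [0, 0, 6, 6]
```

Once a page's phys_dev_id differs from the local device, the free/mapping paths must treat it as a remote page, which is why the allocator also tracks page locations.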

Engine Integration

  • sglang and vLLM integrations load shared-memory configuration from file and pass it through to KVCacheManager.

CLI

  • New command to configure shared GPU memory per IPC:
    kvctl config-shared <ipc> <gpu_id> <size>
    # example
    kvctl config-shared kvcached 6 8G
  • Shared-memory configuration is persisted in JSON and cleaned up with KV cache segments.
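A minimal model of what the command does, parsing a human-readable size like 8G and recording the per-GPU limit under the IPC name in a JSON document. Helper names and the exact JSON layout are assumptions for illustration; the real CLI may differ:

```python
import json
import re

def parse_size(text):
    """Parse sizes like '8G', '512M', or '1024' (bytes) into bytes.
    Hypothetical helper; the real CLI's accepted units may differ."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if not m:
        raise ValueError(f"bad size: {text}")
    num, unit = int(m.group(1)), m.group(2)
    return num * {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}[unit]

def config_shared(store, ipc, gpu_id, size):
    """Model of `kvctl config-shared <ipc> <gpu_id> <size>`:
    record the per-GPU byte limit under the IPC segment name."""
    store.setdefault(ipc, {})[str(gpu_id)] = parse_size(size)
    return json.dumps(store)  # persisted as JSON alongside the segment

print(config_shared({}, "kvcached", 6, "8G"))
# → {"kvcached": {"6": 8589934592}}
```

Keying the config by IPC name is what allows it to be cleaned up together with the corresponding KV cache segment.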

Tensor Parallel

  • TP IPC mapping now supports phys_device_ids and correctly broadcasts cross-device mapping metadata to workers.

Usage

  1. Configure shared GPU memory:

     kvctl config-shared kvcached 6 8G

  2. Start inference (sglang / vLLM): the shared-memory config is loaded automatically.

  3. KV cache pages are allocated in tiers (local → shared GPU) and mapped with explicit physical device IDs, enabling cross-GPU KV sharing without changes to model code.

Impact

  • Enables cross-GPU KV cache sharing with explicit device control.
  • Improves memory utilization and scalability for multi-GPU / scale-up inference.
  • Backward-compatible for single-GPU setups (defaults to local allocation).

@funjokerwen funjokerwen changed the title from "Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control" to "[Feat] Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control" on Dec 18, 2025
@ivanium
Collaborator

ivanium commented Dec 20, 2025

Thanks for the contribution! Before diving deeper, I have a quick question: what's the motivation and background for this PR? For example, what are the target scenarios, and how does this PR help them?

@funjokerwen
Author

> Thanks for the contribution! Before diving deeper, I have a quick question: what's the motivation and background for this PR? For example, what are the target scenarios, and how does this PR help them?

Thanks for the question. This PR is motivated by intra-node multi-GPU KV-cache memory pooling.
It targets the following scenarios:

  • Heterogeneous model instances

Different instances on the same node may be compute-bound or KV-cache/memory-bound.
Sharing KV cache memory across GPUs improves overall memory utilization and reduces stranded capacity.

  • Imbalanced KV-cache growth in DP execution

In data-parallel mode, KV cache usage can become uneven due to request skew or sequence length variance.
Dynamic cross-GPU allocation helps rebalance memory pressure and avoid early OOM on individual GPUs.

  • Scale-up / extended memory (future)

The abstraction decouples logical KV cache usage from physical GPU placement, making it extensible to future scale-up or extended-memory setups.

Overall, this PR improves KV-cache memory efficiency and flexibility under dynamic multi-GPU workloads.
