[Feat] Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control #224

Open
funjokerwen wants to merge 4 commits into ovg-project:main from funjokerwen:main

Conversation

@funjokerwen

Summary

This PR introduces explicit physical device–aware page allocation and mapping for KV cache, enabling tiered allocation across local and shared GPUs. The change allows KV pages to be created on one GPU and safely mapped for access from another GPU, laying the foundation for scale-up / shared-memory inference scenarios (e.g. NVLink-connected GPUs).

Key Changes

C++ Core

  • Page allocation and mapping APIs now accept an explicit phys_dev_id / access_dev_idx, decoupling where memory is allocated from which device accesses it.
  • make_shared_page / make_unique_page support allocating pages on a specified physical GPU.
  • map_to_kv_tensors and new single-page map APIs support per-page physical device selection.
  • Unified Page::map signature across CPU/GPU pages with device-index–based access control.
  • Updated PyTorch bindings to expose phys_device_ids to Python.
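The core idea above, decoupling where a page physically lives (phys_dev_id) from which device maps and accesses it (access_dev_idx), can be sketched in Python. This is a hypothetical simplification mirroring the parameter names in the PR, not the actual C++ implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    """Minimal model of a KV-cache page: the GPU that owns the
    physical memory vs. the device that has it mapped."""
    page_id: int
    phys_dev_id: int                       # GPU owning the physical memory
    mapped_dev_idx: Optional[int] = None   # device that mapped it, if any

    def map(self, access_dev_idx: int) -> bool:
        # A page allocated on phys_dev_id may be mapped for access
        # from a different device (e.g. over NVLink). Returns True
        # when the mapping is cross-device.
        self.mapped_dev_idx = access_dev_idx
        return access_dev_idx != self.phys_dev_id

page = Page(page_id=0, phys_dev_id=6)
print(page.map(access_dev_idx=0))  # True: allocated on GPU 6, accessed from GPU 0
```

In the real API this split is what lets make_shared_page / make_unique_page place memory on a chosen GPU while map_to_kv_tensors maps it for a (possibly different) accessing device.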

Python Management Layer

  • Page tracks its physical device (phys_dev_id).
  • PageAllocator gains:
    • shared_memory_config to define shared GPUs and per-GPU memory limits.
    • Tiered allocation strategy: local GPU first, then shared GPUs.
    • Tracking of remote allocations and page locations.
    • Mapping/free logic that distinguishes local vs. remote pages.
  • KVCacheManager transparently propagates shared-memory configuration.
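The tiered strategy (local GPU first, then shared GPUs subject to per-GPU limits) can be sketched as follows. Names and the per-page accounting are hypothetical simplifications of the PR's PageAllocator:

```python
class PageAllocator:
    """Sketch of tiered page allocation: try the local GPU first,
    then fall back to shared GPUs in configured order."""

    def __init__(self, local_dev, page_size, local_limit, shared_memory_config):
        # shared_memory_config: {gpu_id: byte_limit}, as set via kvctl
        self.local_dev = local_dev
        self.page_size = page_size
        self.limits = {local_dev: local_limit, **shared_memory_config}
        self.used = {dev: 0 for dev in self.limits}

    def alloc_page(self):
        # Tier 1: local GPU; Tier 2: shared GPUs with remaining capacity.
        order = [self.local_dev] + [d for d in self.limits if d != self.local_dev]
        for dev in order:
            if self.used[dev] + self.page_size <= self.limits[dev]:
                self.used[dev] += self.page_size
                return dev  # phys_dev_id of the new page
        raise MemoryError("no capacity on local or shared GPUs")

alloc = PageAllocator(local_dev=0, page_size=2, local_limit=4,
                      shared_memory_config={6: 4})
print([alloc.alloc_page() for _ in range(4)])  # [0, 0, 6, 6]
```

Once a page's phys_dev_id differs from the local device, the free/mapping paths must treat it as a remote page, which is why the allocator also tracks page locations.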

Engine Integration

  • sglang and vLLM integrations load shared-memory configuration from file and pass it through to KVCacheManager.

CLI

  • New command to configure shared GPU memory per IPC:
    kvctl config-shared <ipc> <gpu_id> <size>
    # example
    kvctl config-shared kvcached 6 8G
  • Shared-memory configuration is persisted in JSON and cleaned up with KV cache segments.
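A minimal model of what the command does, parsing a human-readable size like 8G and recording the per-GPU limit under the IPC name in a JSON document. Helper names and the exact JSON layout are assumptions for illustration; the real CLI may differ:

```python
import json
import re

def parse_size(text):
    """Parse sizes like '8G', '512M', or '1024' (bytes) into bytes.
    Hypothetical helper; the real CLI's accepted units may differ."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if not m:
        raise ValueError(f"bad size: {text}")
    num, unit = int(m.group(1)), m.group(2)
    return num * {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}[unit]

def config_shared(store, ipc, gpu_id, size):
    """Model of `kvctl config-shared <ipc> <gpu_id> <size>`:
    record the per-GPU byte limit under the IPC segment name."""
    store.setdefault(ipc, {})[str(gpu_id)] = parse_size(size)
    return json.dumps(store)  # persisted as JSON alongside the segment

print(config_shared({}, "kvcached", 6, "8G"))
# → {"kvcached": {"6": 8589934592}}
```

Keying the config by IPC name is what allows it to be cleaned up together with the corresponding KV cache segment.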

Tensor Parallel

  • TP IPC mapping now supports phys_device_ids and correctly broadcasts cross-device mapping metadata to workers.

Usage

  1. Configure shared GPU memory:

     kvctl config-shared kvcached 6 8G

  2. Start inference (sglang / vLLM): the shared-memory config is loaded automatically.

  3. KV cache pages are allocated in tiers (local → shared GPU) and mapped with explicit physical device IDs, enabling cross-GPU KV sharing without changes to model code.

Impact

  • Enables cross-GPU KV cache sharing with explicit device control.
  • Improves memory utilization and scalability for multi-GPU / scale-up inference.
  • Backward-compatible for single-GPU setups (defaults to local allocation).

@funjokerwen funjokerwen changed the title from "Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control" to "[Feat] Enable Tiered / Cross-GPU KV Cache Memory Mapping with Explicit Physical Device Control" on Dec 18, 2025
@ivanium
Collaborator

ivanium commented Dec 20, 2025

Thanks for the contribution! Before diving deeper, I have a quick question: what's the motivation and background for this PR? For example, what are the target scenarios, and how does this PR help them?

@funjokerwen
Author

> Thanks for the contribution! Before diving deeper, I have a quick question: what's the motivation and background for this PR? For example, what are the target scenarios, and how does this PR help them?

Thanks for the question. This PR is motivated by intra-node multi-GPU KV-cache memory pooling.
It targets the following scenarios:

  • Heterogeneous model instances

Different instances on the same node may be compute-bound or KV-cache/memory-bound.
Sharing KV cache memory across GPUs improves overall memory utilization and reduces stranded capacity.

  • Imbalanced KV-cache growth in DP execution

In data-parallel mode, KV cache usage can become uneven due to request skew or sequence length variance.
Dynamic cross-GPU allocation helps rebalance memory pressure and avoid early OOM on individual GPUs.

  • Scale-up / extended memory (future)

The abstraction decouples logical KV cache usage from physical GPU placement, making it extensible to future scale-up or extended-memory setups.

Overall, this PR improves KV-cache memory efficiency and flexibility under dynamic multi-GPU workloads.
