kvcached

kvcached is a KV cache management system that supports on-demand KV cache allocation. It implements the concept of GPU virtual memory: applications reserve virtual address space without immediately committing physical memory, and physical memory is then allocated and mapped automatically as needed at runtime. This allows multiple LLMs to run concurrently on a single GPU or on a group of GPUs with tensor parallelism (TP) and to share GPU memory flexibly, improving GPU utilization and reducing memory fragmentation.
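
The reserve-then-commit idea described above can be illustrated with a small toy model. This is only a conceptual sketch in plain Python, not kvcached's implementation: the class names PhysicalPool and VirtualRegion, the page granularity, and the method names are all hypothetical, and no GPU memory is involved.

```python
class PhysicalPool:
    """Hypothetical pool of fixed-size physical pages shared by all models."""

    def __init__(self, total_pages):
        self.free_pages = total_pages

    def allocate(self):
        if self.free_pages == 0:
            raise MemoryError("physical pool exhausted")
        self.free_pages -= 1

    def release(self):
        self.free_pages += 1


class VirtualRegion:
    """Reserved address range; a page gets physical backing only when touched."""

    def __init__(self, pool, reserved_pages):
        self.pool = pool
        self.reserved_pages = reserved_pages  # reserving costs no physical pages
        self.mapped = set()

    def touch(self, page):
        if not 0 <= page < self.reserved_pages:
            raise IndexError("page outside reserved range")
        if page not in self.mapped:
            self.pool.allocate()  # commit physical memory on first use
            self.mapped.add(page)

    def unmap(self, page):
        if page in self.mapped:
            self.mapped.remove(page)
            self.pool.release()  # freed page becomes usable by other regions


# Two "models" each reserve the full range but share one physical pool.
pool = PhysicalPool(total_pages=8)
model_a = VirtualRegion(pool, reserved_pages=8)
model_b = VirtualRegion(pool, reserved_pages=8)

for page in range(3):
    model_a.touch(page)
for page in range(2):
    model_b.touch(page)
model_a.unmap(0)

print("free physical pages:", pool.free_pages)
```

In this toy, both regions reserve 8 pages against a pool of only 8 physical pages; that over-reservation is harmless because physical pages are consumed only on first touch and return to the shared pool when unmapped.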

kvcached is compatible with popular LLM serving engines, including SGLang and vLLM.

kvcached Installation

Prerequisites

Python (tested with 3.9 - 3.11)
PyTorch (compatible with SGLang and vLLM)

kvcached can be installed as a plugin with SGLang and vLLM.

cd engine_integration/scripts
# install kvcached with SGLang v0.4.9
./setup.sh --engine sglang --engine-method source --engine-version 0.4.9
# install kvcached with vLLM v0.9.2
./setup.sh --engine vllm --engine-method source --engine-version 0.9.2

This script will download the specified versions of SGLang and vLLM, create separate venv environments (using uv), compile the code, apply the necessary patches, and install kvcached.

Run kvcached with Docker

You can test or develop kvcached with Docker.

To test kvcached with SGLang or vLLM:

docker pull ghcr.io/ovg-project/[kvcached-sglang|kvcached-vllm]:latest

For development:

docker pull ghcr.io/ovg-project/kvcached-dev:latest

More instructions can be found here.

Testing

kvcached can be enabled or disabled by setting the ENABLE_KVCACHED environment variable (export ENABLE_KVCACHED=true or export ENABLE_KVCACHED=false). To verify a successful installation and benchmark the performance of SGLang/vLLM with kvcached, run:

cd benchmarks/simple_bench
export VENV_PATH=../../engine_integration/[sglang|vllm]-kvcached-venv
./start_server.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
# Wait until LLM server is ready
./start_client.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B

The benchmark scripts automatically set ENABLE_KVCACHED=true. Please refer to each script for instructions on how to run SGLang/vLLM with kvcached.
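
The [sglang|vllm] placeholder in the commands above stands for one engine name used consistently in all three places. As an illustration, the following sketch expands the placeholder for a chosen engine and prints the resulting commands. The helper name benchmark_commands is hypothetical and not part of kvcached; the script names, flags, and paths are copied from the commands above, and nothing is executed.

```python
import shlex

ENGINES = ("sglang", "vllm")


def benchmark_commands(engine, model="meta-llama/Llama-3.2-1B"):
    """Return the server and client command lines for one engine (hypothetical helper)."""
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {ENGINES}, got {engine!r}")
    # Path is relative to benchmarks/simple_bench, as in the commands above.
    venv_path = f"../../engine_integration/{engine}-kvcached-venv"
    common = [engine, "--venv-path", venv_path, "--model", model]
    return {
        "cwd": "benchmarks/simple_bench",
        "server": ["./start_server.sh", *common],
        "client": ["./start_client.sh", *common],
    }


cmds = benchmark_commands("vllm")
print("cd", cmds["cwd"])
print(shlex.join(cmds["server"]))
print("# Wait until the LLM server is ready")
print(shlex.join(cmds["client"]))
```

To actually launch the scripts from Python, each argument list could be passed to subprocess.run with cwd set to the benchmark directory; the server must be ready before the client starts.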

Memory monitoring and control via kvcached CLI

kvcached includes a built-in CLI tool that allows you to monitor GPU memory usage and manage memory limits across different applications. The kvctl command is installed along with the kvcached package:

kvctl

Once inside the CLI, type help to view all supported commands:

kvcached> help
Available commands:
  list [ipc ...]               List IPC segments and usage
  limit <ipc> <size>           Set absolute limit (e.g. 512M, 2G)
  limit-percent <ipc> <pct>    Set limit as percentage of total GPU RAM
  watch [-n sec] [ipc ...]     Continuously display usage table
  kvtop [ipc ...] [--refresh r]  Launch curses kvtop UI (q to quit)
  !<shell cmd>                 Run command in system shell
  help                         Show this help message
  delete <ipc>                 Delete IPC segment and its limit entry
  exit | quit                  Exit the shell

kvcached>
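
The limit and limit-percent commands above take either an absolute size such as 512M or 2G, or a percentage of total GPU memory. The following sketch shows how such arguments could translate into byte counts. It is not kvctl's parser: the function names are hypothetical, and the use of binary multiples (1M = 1024 * 1024 bytes) is an assumption that kvcached may or may not share.

```python
import re

# Assumption: binary multiples; kvcached may define these units differently.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*$", re.IGNORECASE)


def parse_size(text):
    """Convert a size string such as '512M' or '2G' into bytes (hypothetical helper)."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def percent_to_bytes(percent, total_bytes):
    """Convert a percentage of total GPU memory into bytes (hypothetical helper)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return int(total_bytes * percent / 100)


total = 80 * 1024**3  # example total only, not read from a real GPU
print(parse_size("512M"), parse_size("2G"), percent_to_bytes(50, total))
```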

Use the kvtop command for real-time visualization of memory usage:

KVCache Memory Usage

IPC: SGLANG
[==##################----------------------------------------]
Prealloc: 792.0 MB | Used: 11.2 GB / 39.9 GB (30.1%) | Free: 27.9 GB

IPC: VLLM
[==#######---------------------------------------------------]
Prealloc: 768.0 MB | Used: 3.6 GB / 37.4 GB (11.7%) | Free: 33.0 GB

GPU Memory Usage
[########################################--------------------]
Used: 52.9 GB / 79.2 GB (66.8%) | Free: 26.3 GB

Press 'q' to quit
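
The per-IPC figures in the sample display above can be cross-checked with a little arithmetic. Within rounding, they are consistent with the percentage counting both used and preallocated memory against the limit, and with Free being the limit minus both. This reading is inferred from the sample numbers only, not from the kvcached source, and the conversion of 1 GB = 1024 MB is likewise an assumption.

```python
def summarize(used_gb, prealloc_mb, total_gb):
    """Recompute percentage and free memory from one kvtop line (inferred formula)."""
    prealloc_gb = prealloc_mb / 1024  # assumption: the display uses 1 GB = 1024 MB
    committed_gb = used_gb + prealloc_gb
    return {
        "percent": 100 * committed_gb / total_gb,
        "free_gb": total_gb - committed_gb,
    }


# Values copied from the sample display above.
sglang = summarize(used_gb=11.2, prealloc_mb=792.0, total_gb=39.9)
vllm = summarize(used_gb=3.6, prealloc_mb=768.0, total_gb=37.4)

for name, stats in (("SGLANG", sglang), ("VLLM", vllm)):
    print(f"{name}: {stats['percent']:.1f}% committed, {stats['free_gb']:.1f} GB free")
```

Because the displayed inputs are already rounded to one decimal place, the recomputed values can differ from the display by about a tenth.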

Contributing

We are grateful for and open to contributions and collaborations of any kind.

We use pre-commit to ensure a consistent coding style. You can set it up with:

pip install pre-commit
pre-commit install

Before pushing your code, please run the following check and make sure your code passes all checks.

pre-commit run --all-files

Contacts

Feel free to contact us for contributions and collaborations.

Jiarong Xing (jxing@rice.edu)
Yifan Qiao (yifanqiao@berkeley.edu)
Shan Yu (shanyu1@g.ucla.edu)
