UnBoxKV-IO

Speeding up each inference request is instrumental in achieving high throughput and low latency at scale. A KV cache is commonly used to avoid redundant recomputation in each decode iteration. The free GPU memory available to the KV cache is a scarce resource that must be managed efficiently to minimize the overhead of redundant recomputation. This work characterizes the impact of KV caching. Specifically, we instrument vLLM to measure and analyze fine-grained KV cache access patterns during the different inference stages (prefill and decode). We also study the recomputation and swap overhead of handling KV cache overflow in several scenarios that involve concurrent inference requests, using several benchmarks. The results yield observations and insights for optimizing KV cache management and batching strategies.
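
To make the notion of KV cache overflow concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not part of this repository's instrumentation. The model dimensions, the 8 GiB memory budget, and the names `ModelShape`, `kv_bytes_per_token`, `max_cached_tokens`, and `overflows` are all hypothetical and chosen only for illustration. The sketch estimates how many tokens fit in the free GPU memory and whether a set of concurrent requests would exceed that capacity, which is the point at which an engine has to fall back on recomputation or swapping.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (illustrative, not taken from this repo)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int  # 2 for fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.dtype_bytes


def max_cached_tokens(free_gpu_bytes: int, m: ModelShape) -> int:
    # Upper bound on tokens whose KV entries fit in the free GPU memory.
    return free_gpu_bytes // kv_bytes_per_token(m)


def overflows(request_lengths: Sequence[int], free_gpu_bytes: int, m: ModelShape) -> bool:
    # True when the concurrent requests need more KV entries than fit on the GPU.
    return sum(request_lengths) > max_cached_tokens(free_gpu_bytes, m)


# Assumed example: 32 layers, 32 KV heads, head dimension 128, fp16, 8 GiB free.
shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)
free_bytes = 8 * 1024**3

per_token = kv_bytes_per_token(shape)            # bytes of KV state per token
capacity = max_cached_tokens(free_bytes, shape)  # tokens that fit in free_bytes

print(per_token, capacity)
print(overflows([4096] * 4, free_bytes, shape))  # four requests of 4096 tokens
print(overflows([4096] * 5, free_bytes, shape))  # five requests of 4096 tokens
```

Under these assumed numbers, four 4096-token requests exactly fill the cache and a fifth one overflows it. The estimate ignores block-granular allocation and any memory reserved for activations, so real capacity will be somewhat lower.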

