Accepted to MLSys 2026
Built on top of vLLM
Large Language Model (LLM) serving is increasingly bottlenecked by the size of the key-value (KV) cache, especially in long-context and long-generation workloads. While prior work has shown that only a small subset of tokens dominates attention at each decode step, exploiting this efficiently without hurting accuracy remains challenging.
FlexiCache is a hierarchical KV-cache management system that leverages a simple but powerful observation:
Attention heads differ in the temporal stability of their important KV pages.
Some heads repeatedly attend to nearly the same top-K pages over time, while others shift focus frequently.
FlexiCache uses this observation to reduce GPU memory usage, attention computation, and host-device transfer overhead while preserving model quality.
FlexiCache classifies KV heads into two groups:
- Stable heads: their top-K KV pages remain similar across nearby decode steps.
- Unstable heads: their top-K KV pages change frequently.
This distinction enables a more efficient hierarchical memory policy:
- For unstable heads, the full KV cache stays on the GPU.
- For stable heads, only the current top-K KV pages stay on the GPU, while the rest are offloaded to host memory.
- Stable heads are periodically reranked, and only newly promoted pages are transferred back to the GPU.
Unlike approaches that permanently discard less-important KV entries, FlexiCache retains the full context in host memory, so pages that become important later can still be recovered.
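As a concrete (toy) illustration of the stability signal, a head's top-K page set can be compared across nearby decode steps. The sketch below uses hypothetical helper names and a simple Jaccard-overlap criterion; it is not FlexiCache's actual profiling code:

```python
# Sketch: classify a KV head as stable or unstable from its per-step
# top-K page sets, using mean Jaccard overlap between consecutive steps.
# All names and the threshold are illustrative, not FlexiCache's API.

def topk_overlap(pages_a, pages_b):
    """Jaccard similarity between two top-K page-index sets."""
    a, b = set(pages_a), set(pages_b)
    return len(a & b) / len(a | b)

def classify_head(topk_history, threshold=0.8):
    """topk_history: list of top-K page-index lists, one per decode step."""
    overlaps = [
        topk_overlap(topk_history[i], topk_history[i + 1])
        for i in range(len(topk_history) - 1)
    ]
    mean_overlap = sum(overlaps) / len(overlaps)
    return "stable" if mean_overlap >= threshold else "unstable"

# A head that keeps attending to nearly the same pages is stable;
# one whose top-K set churns every step is unstable.
print(classify_head([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 4]]))    # stable
print(classify_head([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]))  # unstable
```

A stable head under this criterion only needs its small top-K working set resident on the GPU, which is what makes the hierarchical placement above pay off.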
FlexiCache combines four ideas:
- Stability-aware head classification: KV heads are profiled offline and classified as stable or unstable based on the temporal stability of their top-K pages.
- Hierarchical KV-cache placement:
  - Full KV cache for unstable heads on GPU
  - Top-K KV pages for stable heads on GPU
  - Remaining KV pages for stable heads in host memory
- Periodic reranking for stable heads: stable heads are rescored only every few decode steps rather than at every step, reducing scoring overhead and host-device transfers.
- Sparse decode attention: during decoding, attention is computed only on the selected top-K pages for each head.
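Putting the last two ideas together, the per-head decode path can be sketched in pure Python as below. All names and data are illustrative assumptions; the real system operates on paged GPU/host buffers with Triton kernels:

```python
import math

# Toy sketch of the decode path for ONE stable head: only the current
# top-K pages are "resident" (on GPU); every RERANK_EVERY steps the head
# is rescored and newly promoted pages would be fetched from host memory.

RERANK_EVERY = 4   # cf. --rerank-frequency
TOP_K = 2          # cf. --topK-budget

# Each page holds (key, value) pairs of small vectors.
pages = [
    [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.1], [0.9, 0.1])],  # page 0
    [([0.0, 1.0], [0.0, 1.0]), ([0.1, 1.0], [0.1, 0.9])],  # page 1
    [([1.0, 1.0], [0.5, 0.5]), ([0.9, 1.0], [0.4, 0.6])],  # page 2
]
queries = [[1.0, 0.0]] * 5  # one query vector per decode step

def score_pages(query, pages):
    """Rank pages by the max query-key dot product inside each page."""
    scores = [max(sum(q * k for q, k in zip(query, key)) for key, _ in page)
              for page in pages]
    return sorted(range(len(pages)), key=lambda i: -scores[i])[:TOP_K]

def sparse_attention(query, pages, resident):
    """Softmax attention computed over the resident top-K pages only."""
    kvs = [kv for i in resident for kv in pages[i]]
    logits = [sum(q * k for q, k in zip(query, key)) for key, _ in kvs]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(kvs[0][1])
    return [sum(w * val[d] for w, (_, val) in zip(weights, kvs)) / z
            for d in range(dim)]

resident = score_pages(queries[0], pages)
for step, query in enumerate(queries):
    if step % RERANK_EVERY == 0:
        new_resident = score_pages(query, pages)
        promoted = set(new_resident) - set(resident)  # host -> GPU copies
        resident = new_resident
    out = sparse_attention(query, pages, resident)
```

The key point the sketch shows: scoring and host-device traffic happen only at rerank boundaries, and only for newly promoted pages, while every decode step pays attention cost proportional to the top-K budget rather than the full sequence.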
Across long-context and long-generation workloads, FlexiCache achieves:
- Up to 70% reduction in GPU KV-cache footprint
- 1.38×–1.55× higher offline serving throughput
- 1.6×–2.1× lower online token latency
- Up to 4× decode-kernel speedup
- ~99% of dense-attention accuracy on evaluated benchmarks
FlexiCache consistently outperforms vLLM on both Llama-3.1-8B and Mistral-7B in token throughput, with gains increasing as output length grows.
Accuracy retention on L-Eval
GPU memory savings. Over 70% of the GPU KV-cache footprint is saved at sequence lengths above 20k with a token budget of 1024.
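That figure can be sanity-checked with simple arithmetic, assuming Llama-3.1-8B's 256 KV heads (32 layers × 8 KV heads per layer) and treating the 64-unstable-head setting from the serve example below as a model-wide count (an assumption; the flag could also be interpreted per layer):

```python
# Rough estimate of GPU KV-cache savings under FlexiCache's policy:
# unstable heads keep everything on GPU; stable heads keep only the
# token budget. Head counts are for Llama-3.1-8B; 64 unstable heads
# is the example serve setting, assumed here to be model-wide.
total_heads = 32 * 8          # layers x KV heads per layer
unstable_heads = 64
token_budget = 1024           # tokens kept on GPU per stable head
seq_len = 20_000

unstable_frac = unstable_heads / total_heads
gpu_frac = unstable_frac + (1 - unstable_frac) * (token_budget / seq_len)
print(f"GPU KV-cache fraction: {gpu_frac:.1%}, savings: {1 - gpu_frac:.1%}")
# -> GPU KV-cache fraction: 28.8%, savings: 71.2%
```

Under these assumptions the estimate lands at roughly 71% savings, consistent with the reported >70% figure.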
- GPU: 1× NVIDIA H100 94GB NVL
- Host memory: >= 256 GB DDR5 RAM
- Interconnect: PCIe 5.0
FlexiCache requires a large pinned host-memory pool for KV offloading. By default, the host KV-cache size is set to 180 GB, controlled by:
vllm/v1/flexicache/config.json
- Python: 3.12
- PyTorch: 2.6.0+cu124
- Triton: 3.2.0
- Transformers: 4.50.0
- Datasets: 3.6.0
- CUDA runtime: 12.8
- NVIDIA driver: 570.133.20
FlexiCache is built on top of vLLM 0.8.2.
# Clone the repository
git clone git@github.com:NazmulTakbir/FlexiCache.git
cd FlexiCache
# Create and activate a Conda environment
conda create -n FlexiCache python=3.12 -y
conda activate FlexiCache
# Build the FlexiCache-modified vLLM from source
export MAX_JOBS=20 # Adjust based on your system capacity
pip install -e . # builds vLLM from source (can take ~30 minutes)
# Install additional dependencies
pip install -r flexicache_requirements.txt

FlexiCache uses the V1 Triton attention backend of vLLM 0.8.2:
export VLLM_ATTENTION_BACKEND="TRITON_ATTN_VLLM_V1"
export VLLM_USE_V1="1"
export TORCH_CUDA_ARCH_LIST="9.0;9.0a"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--disable-cascade-attn \
--max-num-batched-tokens 32768 \
--max-num-seqs 32 \
--rerank-frequency 16 \
--topK-budget 64 \
--num-unstable-heads 64 \
--unstable_heads_profile_task gov_report \
  --enable-flexicache

NB: FlexiCache loads custom CUDA kernels via torch.utils.cpp_extension.load. Limiting the build to specific architectures speeds up compilation. "9.0;9.0a" targets NVIDIA Hopper GPUs; change this if you are using a different GPU.
- --rerank-frequency: rerank stable heads every N decode steps (16 in the example above)
- --topK-budget: number of top-K KV pages used for sparse attention
- --num-unstable-heads: number of heads that always keep their full KV cache on GPU
- --unstable_heads_profile_task: selects the precomputed head-classification profile
- --enable-flexicache: enables FlexiCache
Detailed benchmarking instructions are available in the benchmarking guide.
FlexiCache currently includes stability profiling and head classification metadata for the following models:
- meta-llama/Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.2
- mistralai/Mistral-Small-24B-Instruct-2501
- Qwen/Qwen2.5-32B-Instruct
Additional models can be supported by running the same offline stability analysis and generating the corresponding head classification metadata.
FlexiCache is built on top of the vLLM framework.
If you use FlexiCache in your research, please cite our paper:
@misc{takbir2025flexicacheleveragingtemporalstability,
title={FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management},
author={Nazmul Takbir and Hamidreza Alikhani and Nikil Dutt and Sangeetha Abdu Jyothi},
year={2025},
eprint={2511.00868},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.00868},
}
For questions about the paper or implementation, please contact:
Nazmul Takbir ntakbir@uci.edu



