# Blink KV

Cache-local reconstruction for bounded-memory LLM streaming.
StreamingLLM demonstrated that retaining attention sinks plus a recent tail enables bounded-memory streaming, relying on pos-shift—a mechanism that reassigns cache positions by modifying attention internals and requires pre-RoPE key storage. We present a simpler alternative: cache-local reconstruction, which clears the cache and re-decodes retained tokens at bounded positions. Reconstruction requires only clear() and decode()—no backend-specific knowledge of key storage format, no attention modification. Through a 2×2 ablation crossing position semantics (naive eviction vs. reconstruction) with sink presence across five architectures, we find that under naive selective eviction (kvRemove), sink behavior is architecture-dependent and unpredictable—ranging from 103× improvement (Llama) to 42× degradation (Phi). Under cache-local reconstruction, all five tested architectures converge to within 3–16% of baseline quality with sinks contributing <2%.
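The core idea can be sketched without any backend internals. The `KVCache` class below is a toy stand-in (hypothetical, not the repo's API) for a real context exposing `clear()` and `decode()`; it only records which `(token, pos)` pairs are cached, which is enough to contrast naive selective eviction, where surviving entries keep their original, non-contiguous positions, with cache-local reconstruction, where retained tokens are re-decoded at bounded positions `0..n-1`:

```javascript
// Toy stand-in for a KV cache (hypothetical; a real backend would be
// e.g. a llama.cpp context exposing clear() and decode()).
class KVCache {
  constructor() { this.entries = []; }              // [{ token, pos }]
  clear() { this.entries = []; }
  decode(token, pos) { this.entries.push({ token, pos }); }
}

// Naive selective eviction (kvRemove-style): drop the middle of the
// cache; survivors keep their original positions, leaving a gap.
function evictNaive(cache, sinks, recent) {
  cache.entries = [
    ...cache.entries.slice(0, sinks),
    ...cache.entries.slice(-recent),
  ];
}

// Cache-local reconstruction (Blink KV): clear the cache and re-decode
// the retained tokens at contiguous bounded positions 0..n-1.
function reconstruct(cache, sinks, recent) {
  const retained = [
    ...cache.entries.slice(0, sinks),
    ...cache.entries.slice(-recent),
  ].map((e) => e.token);
  cache.clear();
  retained.forEach((token, pos) => cache.decode(token, pos));
}
```

The point of the toy: after feeding 1000 tokens and evicting down to 4 sinks + 252 recent, the naive cache still contains position 999 with a 744-position gap behind it, while the reconstructed cache never stores a position above `sinks + recent - 1`, which is what keeps both memory and position indices bounded.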
```bash
# Install dependencies (requires Node.js 18+)
npm install

# Download a model (Q4_K_M quantization recommended)
mkdir -p models
wget -O models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# Run evaluation on PG19 corpus
node blink_kv.mjs models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  --dataset=dataset/pg19_first_book.txt \
  --run=baseline,kvremove,reseed_per_boundary \
  --start_size=0 --recent_size=256 \
  --num_eval_tokens=2000
```

| Model | Baseline | kvRemove 4+252 | kvRemove 0+256 | Blink KV 4+252 | Blink KV 0+256 |
|---|---|---|---|---|---|
| SmolLM2 1.7B | 8.47 | 56.77 | 280.08 | 9.14 | 9.06 |
| Llama 3.1 8B | 7.44 | 8.58 | 885.80 | 7.95 | 8.06 |
| Qwen 2.5 7B | 6.98 | 8.38 | 24.02 | 7.87 | 7.86 |
| Phi 3.5 3.8B | 7.77 | 1542.83 | 36.90 | 7.99 | 7.98 |
| Gemma 2 9B | 7.72 | 21.20 | 11.14 | 8.85 | 8.94 |
| Method | 2k tokens | 8k tokens | 20k tokens |
|---|---|---|---|
| kvRemove | 854.14 | 681.82 | 571.77 |
| Blink KV | 8.24 | 8.92 | 8.84 |
Blink KV maintains near-baseline perplexity regardless of sequence length (65–104× better than kvRemove at all lengths).
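All numbers above are teacher-forced perplexity: the model is fed the reference text token by token, and perplexity is the exponential of the mean negative log-probability it assigns to each true next token. A minimal sketch, assuming a `logProbs` array of per-token log-probabilities collected from the model's output distribution at each step (how the evaluation script extracts them is not shown here):

```javascript
// Teacher-forced perplexity: exp of the mean negative log-probability
// assigned to each reference token given the true (forced) prefix.
function perplexity(logProbs) {
  const meanNll =
    logProbs.reduce((sum, lp) => sum - lp, 0) / logProbs.length;
  return Math.exp(meanNll);
}
```

A model that assigns probability 0.5 to every token has perplexity 2; the baseline values near 8 in the tables correspond to a mean per-token probability of roughly 1/8.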
```
blink_kv.mjs                  # Main evaluation script (teacher-forced perplexity)
blink_kv.md                   # Paper (markdown)
package.json                  # Dependencies
dataset/
  pg19_first_book.txt         # PG19 evaluation corpus
experiments/
  catalog.md                  # Full experiment catalog with results
  README.md                   # Fly.io parallel execution guide
  Dockerfile                  # Container for Fly.io execution
  fly.toml                    # Fly.io app configuration
  entrypoint.sh               # Container entry point
  run_experiment.sh           # Experiment runner (reads env vars)
  generate_fly_jobs.mjs       # Job generator for 26-way parallel sweep
  run_all.sh                  # Generated Fly.io job spawner
  recreate_machines.sh        # Machine recreation script
  download_models_on_fly.sh   # Model download helper
  results/                    # Raw JSON results from all 28 experiments
```
The full 26-experiment sweep runs in parallel on Fly.io using volume forking. See experiments/README.md for infrastructure setup and execution instructions.
```bibtex
@article{naqvi2026blinkkv,
  title={Blink KV: Cache-Local Reconstruction for Bounded-Memory LLM Streaming},
  author={Naqvi, Zuhair},
  year={2026}
}
```

License: MIT