# Blink KV

Cache-local reconstruction for bounded-memory LLM streaming.
StreamingLLM demonstrated that retaining attention sinks plus a recent tail enables bounded-memory streaming, relying on pos-shift—a mechanism that reassigns cache positions by modifying attention internals and requires pre-RoPE key storage. We present a simpler alternative: cache-local reconstruction, which clears the cache and re-decodes retained tokens at bounded positions. Reconstruction requires only clear() and decode()—no backend-specific knowledge of key storage format, no attention modification. Through a 2×2 ablation crossing position semantics (naive eviction vs. reconstruction) with sink presence across five architectures, we find that under naive selective eviction (kvRemove), sink behavior is architecture-dependent and unpredictable—ranging from 103× improvement (Llama) to 42× degradation (Phi). Under cache-local reconstruction, all five tested architectures converge to within 3–16% of baseline quality with sinks contributing <2%.
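The core idea can be sketched without any backend internals. The `KVCache` class below is a toy stand-in (hypothetical, not the repo's API) for a real context exposing `clear()` and `decode()`; it only records which `(token, pos)` pairs are cached, which is enough to contrast naive selective eviction, where surviving entries keep their original, non-contiguous positions, with cache-local reconstruction, where retained tokens are re-decoded at bounded positions `0..n-1`:

```javascript
// Toy stand-in for a KV cache (hypothetical; a real backend would be
// e.g. a llama.cpp context exposing clear() and decode()).
class KVCache {
  constructor() { this.entries = []; }              // [{ token, pos }]
  clear() { this.entries = []; }
  decode(token, pos) { this.entries.push({ token, pos }); }
}

// Naive selective eviction (kvRemove-style): drop the middle of the
// cache; survivors keep their original positions, leaving a gap.
function evictNaive(cache, sinks, recent) {
  cache.entries = [
    ...cache.entries.slice(0, sinks),
    ...cache.entries.slice(-recent),
  ];
}

// Cache-local reconstruction (Blink KV): clear the cache and re-decode
// the retained tokens at contiguous bounded positions 0..n-1.
function reconstruct(cache, sinks, recent) {
  const retained = [
    ...cache.entries.slice(0, sinks),
    ...cache.entries.slice(-recent),
  ].map((e) => e.token);
  cache.clear();
  retained.forEach((token, pos) => cache.decode(token, pos));
}
```

The point of the toy: after feeding 1000 tokens and evicting down to 4 sinks + 252 recent, the naive cache still contains position 999 with a 744-position gap behind it, while the reconstructed cache never stores a position above `sinks + recent - 1`, which is what keeps both memory and position indices bounded.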
```bash
# Install dependencies (requires Node.js 18+)
npm install

# Download a model (Q4_K_M quantization recommended)
mkdir -p models
wget -O models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# Run evaluation on PG19 corpus
node blink_kv.mjs models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  --dataset=dataset/pg19_first_book.txt \
  --run=baseline,kvremove,reseed_per_boundary \
  --start_size=0 --recent_size=256 \
  --num_eval_tokens=2000
```

| Model | Baseline | kvRemove 4+252 | kvRemove 0+256 | Blink KV 4+252 | Blink KV 0+256 |
|---|---|---|---|---|---|
| SmolLM2 1.7B | 8.47 | 56.77 | 280.08 | 9.14 | 9.06 |
| Llama 3.1 8B | 7.44 | 8.58 | 885.80 | 7.95 | 8.06 |
| Qwen 2.5 7B | 6.98 | 8.38 | 24.02 | 7.87 | 7.86 |
| Phi 3.5 3.8B | 7.77 | 1542.83 | 36.90 | 7.99 | 7.98 |
| Gemma 2 9B | 7.72 | 21.20 | 11.14 | 8.85 | 8.94 |
| Method | 2k tokens | 8k tokens | 20k tokens |
|---|---|---|---|
| kvRemove | 854.14 | 681.82 | 571.77 |
| Blink KV | 8.24 | 8.92 | 8.84 |
Blink KV maintains near-baseline perplexity regardless of sequence length (65–104× better than kvRemove at all lengths).
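All numbers above are teacher-forced perplexity: the model is fed the reference text token by token, and perplexity is the exponential of the mean negative log-probability it assigns to each true next token. A minimal sketch, assuming a `logProbs` array of per-token log-probabilities collected from the model's output distribution at each step (how the evaluation script extracts them is not shown here):

```javascript
// Teacher-forced perplexity: exp of the mean negative log-probability
// assigned to each reference token given the true (forced) prefix.
function perplexity(logProbs) {
  const meanNll =
    logProbs.reduce((sum, lp) => sum - lp, 0) / logProbs.length;
  return Math.exp(meanNll);
}
```

A model that assigns probability 0.5 to every token has perplexity 2; the baseline values near 8 in the tables correspond to a mean per-token probability of roughly 1/8.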
```
blink_kv.mjs                  # Main evaluation script (teacher-forced perplexity)
blink_kv.md                   # Paper (markdown)
package.json                  # Dependencies
dataset/
  pg19_first_book.txt         # PG19 evaluation corpus
experiments/
  catalog.md                  # Full experiment catalog with results
  README.md                   # Fly.io parallel execution guide
  Dockerfile                  # Container for Fly.io execution
  fly.toml                    # Fly.io app configuration
  entrypoint.sh               # Container entry point
  run_experiment.sh           # Experiment runner (reads env vars)
  generate_fly_jobs.mjs       # Job generator for 26-way parallel sweep
  run_all.sh                  # Generated Fly.io job spawner
  recreate_machines.sh        # Machine recreation script
  download_models_on_fly.sh   # Model download helper
  results/                    # Raw JSON results from all 28 experiments
```
The full 26-experiment sweep runs in parallel on Fly.io using volume forking. See experiments/README.md for infrastructure setup and execution instructions.
```bibtex
@article{naqvi2026blinkkv,
  title={Blink KV: Cache-Local Reconstruction for Bounded-Memory LLM Streaming},
  author={Naqvi, Zuhair},
  year={2026}
}
```

License: MIT