# Blink KV

Cache-local reconstruction for bounded-memory LLM streaming.

Paper (PDF)

## Abstract

StreamingLLM demonstrated that retaining attention sinks plus a recent tail enables bounded-memory streaming, relying on pos-shift—a mechanism that reassigns cache positions by modifying attention internals and requires pre-RoPE key storage. We present a simpler alternative: cache-local reconstruction, which clears the cache and re-decodes retained tokens at bounded positions. Reconstruction requires only clear() and decode()—no backend-specific knowledge of key storage format, no attention modification. Through a 2×2 ablation crossing position semantics (naive eviction vs. reconstruction) with sink presence across five architectures, we find that under naive selective eviction (kvRemove), sink behavior is architecture-dependent and unpredictable—ranging from 103× improvement (Llama) to 42× degradation (Phi). Under cache-local reconstruction, all five tested architectures converge to within 3–16% of baseline quality with sinks contributing <2%.
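The mechanism described above can be sketched in a few lines. This is an illustrative mock, not the repository's actual API: `MockCache`, `retain`, and `reconstruct` are hypothetical names, and `decode()` here just appends entries at contiguous positions to show why no attention-internal changes are needed.

```javascript
// Hedged sketch of cache-local reconstruction. On overflow, the retained
// tokens (attention sinks + recent tail) are re-decoded into a cleared
// cache at bounded positions 0..k-1, so no pre-RoPE key storage or
// attention modification is required -- only clear() and decode().

function retain(tokens, startSize, recentSize) {
  // Keep the first `startSize` sink tokens plus the last `recentSize` tokens.
  if (tokens.length <= startSize + recentSize) return tokens.slice();
  return tokens.slice(0, startSize).concat(tokens.slice(-recentSize));
}

class MockCache {
  constructor() { this.entries = []; }   // each entry: { token, pos }
  clear() { this.entries = []; }
  decode(tokens) {
    // Re-decoding assigns fresh, contiguous positions -- the "cache-local"
    // part: positions never exceed the cache budget.
    for (const t of tokens) this.entries.push({ token: t, pos: this.entries.length });
  }
}

function reconstruct(cache, tokens, startSize, recentSize) {
  const kept = retain(tokens, startSize, recentSize);
  cache.clear();       // the only two operations needed:
  cache.decode(kept);  // clear() and decode()
  return kept;
}

// Example: a 4+8 budget (sinks + recent) scaled down from 4+252.
const cache = new MockCache();
const stream = Array.from({ length: 100 }, (_, i) => i);
const kept = reconstruct(cache, stream, 4, 8);
console.log(kept.length);                                  // 12 retained tokens
console.log(cache.entries[cache.entries.length - 1].pos);  // 11: positions stay bounded
```

The cost relative to pos-shift is the re-decode of the retained tokens at each boundary; the benefit is backend independence, since every inference backend already exposes cache clearing and decoding.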

## Quick Start

```shell
# Install dependencies (requires Node.js 18+)
npm install

# Download a model (Q4_K_M quantization recommended)
mkdir -p models
wget -O models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# Run evaluation on the PG19 corpus
node blink_kv.mjs models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  --dataset=dataset/pg19_first_book.txt \
  --run=baseline,kvremove,reseed_per_boundary \
  --start_size=0 --recent_size=256 \
  --num_eval_tokens=2000
```

## Results Summary

### Architecture Sweep (20K tokens, cache size 256, per-boundary reconstruction)

Perplexity on PG19; lower is better. Column labels give the sink + recent-tail split (e.g. 4+252 = 4 sinks, 252 recent tokens).

| Model         | Baseline | kvRemove 4+252 | kvRemove 0+256 | Blink KV 4+252 | Blink KV 0+256 |
|---------------|---------:|---------------:|---------------:|---------------:|---------------:|
| SmolLM2 1.7B  |     8.47 |          56.77 |         280.08 |           9.14 |           9.06 |
| Llama 3.1 8B  |     7.44 |           8.58 |         885.80 |           7.95 |           8.06 |
| Qwen 2.5 7B   |     6.98 |           8.38 |          24.02 |           7.87 |           7.86 |
| Phi 3.5 3.8B  |     7.77 |        1542.83 |          36.90 |           7.99 |           7.98 |
| Gemma 2 9B    |     7.72 |          21.20 |          11.14 |           8.85 |           8.94 |

### Magnitude Sweep (SmolLM2 1.7B, 0+256, no sinks)

| Method   |     2k |     8k |    20k |
|----------|-------:|-------:|-------:|
| kvRemove | 854.14 | 681.82 | 571.77 |
| Blink KV |   8.24 |   8.92 |   8.84 |

Blink KV maintains near-baseline perplexity regardless of sequence length (65–104× better than kvRemove at all lengths).
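The quoted improvement range follows directly from the magnitude-sweep table; a quick sanity check of the arithmetic:

```javascript
// Improvement ratios (kvRemove perplexity / Blink KV perplexity)
// computed from the magnitude-sweep table above.
const kvRemove = { "2k": 854.14, "8k": 681.82, "20k": 571.77 };
const blinkKV  = { "2k": 8.24,   "8k": 8.92,   "20k": 8.84 };

for (const len of Object.keys(kvRemove)) {
  const ratio = kvRemove[len] / blinkKV[len];
  console.log(`${len}: ${ratio.toFixed(1)}x`);
}
// 2k: 103.7x, 8k: 76.4x, 20k: 64.7x -- hence the 65-104x range.
```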

## Repository Structure

```
blink_kv.mjs                # Main evaluation script (teacher-forced perplexity)
blink_kv.md                 # Paper (markdown)
package.json                # Dependencies
dataset/
  pg19_first_book.txt       # PG19 evaluation corpus
experiments/
  catalog.md                # Full experiment catalog with results
  README.md                 # Fly.io parallel execution guide
  Dockerfile                # Container for Fly.io execution
  fly.toml                  # Fly.io app configuration
  entrypoint.sh             # Container entry point
  run_experiment.sh         # Experiment runner (reads env vars)
  generate_fly_jobs.mjs     # Job generator for 26-way parallel sweep
  run_all.sh                # Generated Fly.io job spawner
  recreate_machines.sh      # Machine recreation script
  download_models_on_fly.sh # Model download helper
  results/                  # Raw JSON results from all 28 experiments
```

## Reproducing at Scale

The full 26-experiment sweep runs in parallel on Fly.io using volume forking. See `experiments/README.md` for infrastructure setup and execution instructions.

## Citation

```bibtex
@article{naqvi2026blinkkv,
  title={Blink KV: Cache-Local Reconstruction for Bounded-Memory LLM Streaming},
  author={Naqvi, Zuhair},
  year={2026}
}
```

## License

MIT
