Crush your KV cache to 3 bits. Now with streaming inference, 1M context, and asymmetric K/V compression.
A from-scratch PyTorch implementation of Google's TurboQuant algorithm (ICLR 2026). Compresses transformer key-value caches to 3 bits per dimension with <0.5% attention quality loss — turning out-of-memory into fits-with-room-to-spare.
Every token your LLM generates stores key-value vectors in FP16. At long context, this KV cache devours your VRAM:
| Model | Context | FP16 KV Cache | TurboQuant 3-bit | Savings |
|---|---|---|---|---|
| Qwen2.5-14B | 32K | 6.0 GB | 1.2 GB | 4.8 GB freed |
| Qwen3.5-27B | 128K | 8.0 GB | 1.6 GB | 6.4 GB freed |
| Qwen3.5-27B | 262K | 16.0 GB | 3.1 GB | OOM -> FITS |
| Qwen2.5-3B | 1M | 36.0 GB | 4.92 GB | 1M tokens on a single RTX 4090 |
The punchline: A 27B model at its full 262K context window needs 16 GB just for KV cache. On a 24 GB GPU with 14 GB used by weights, that's impossible. TurboQuant compresses it to 3.1 GB. Now it fits with 7 GB to spare.
Streaming inference: A 14B model needs 29.5 GB in FP16. With layer streaming + TurboQuant KV cache, peak VRAM is 8.3 GB on a 24 GB GPU.
TurboQuant doesn't try to reconstruct vectors accurately. Individual vectors can have 23-44% reconstruction error — and that's fine.
What matters is inner products (attention scores). TurboQuant guarantees these are mathematically unbiased with variance O(1/d):
<query, key> = <query, key_mse> + ||residual|| * sqrt(pi/2) / m * <S @ query, sign(S @ residual)>
^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Stage 1: MSE Stage 2: QJL bias correction (1 bit per dimension)
Stage 1 rotates and quantizes. Stage 2 stores just the signs of a random projection of the residual. Together: unbiased inner products at 3 bits.
| Model | Params | d | Cosine Sim | Top-1 | Top-5 | Compression |
|---|---|---|---|---|---|---|
| Qwen2.5-3B | 3B | 128 | 0.9969 | 73.6% | 94.4% | 5.0x |
| Qwen2.5-14B | 14B | 128 | 0.9964 | 78% | 95.3% | 5.0x |
| Qwen3.5-27B | 27B | 256 | 0.9932 | 98.4% | 100% | 5.2x |
Paper targets: cosine sim > 0.995, top-5 > 90%, compression ~5.0x. All met.
The 27B model is a hybrid (DeltaNet + Attention) with head_dim=256 — a dimension the paper never tested. We validated it works perfectly: 100% of attention heads preserve the correct top-5 pattern even at 3-bit.
| Context | Bits | VRAM (KV only) | Needle Retrieval | Compression |
|---|---|---|---|---|
| 128K | 2 | 0.61 GB | 100% | 7.3x |
| 256K | 2 | 1.23 GB | 100% | 7.3x |
| 512K | 2 | 2.46 GB | 100% | 7.3x |
| 1M | 2 | 4.92 GB | 100% | 7.3x |
1M tokens of KV cache on a single RTX 4090. FP16 would need 36 GB.
| Metric | Measured | Theoretical Bound | Gap to Optimal |
|---|---|---|---|
| MSE distortion (3-bit) | 0.035 | 0.043 | 2.2x from information-theoretic limit |
| IP distortion (3-bit, d=128) | 0.0014 | 0.0021 | Within bound |
| Inner product bias | ~0 | 0 (unbiased) | Confirmed |
| Compression ratio | 5.02x | 5.0x | Exact match |
| Lloyd-Max centroids (1-bit) | +/-0.07052 | +/-0.07053 | 5-digit match |
| Operation | Vectors/sec | vs Target |
|---|---|---|
| Quantize (3-bit, d=128) | 27M | 27x over 1M target |
| Inner product estimate | 71M | 71x over 1M target |
Things we found that nobody else has published:
-
QJL hurts generation quality at 3-bit. The QJL bias correction increases mean error by 27% compared to MSE-only reconstruction when used for autoregressive generation. Community confirmed independently. The paper's unbiasedness guarantee applies to inner products, not to dequantized vector quality. For generation through HF's attention path, MSE-only is better.
-
Cross-layer delta coding doesn't work for KV caches. Pearson correlation between adjacent-layer KV vectors is r=0.001. The conditional entropy is essentially equal to the marginal entropy — no savings. Honest negative result. (Delta coding does work for model weights, where adjacent layers are correlated.)
-
4-bit MSE-only is production-ready for generation. Coherent output at 3.8x compression. No QJL needed, no custom attention needed. Drop-in replacement via
TurboQuantCache(bits=4, mse_only=True). -
90% KV cache compression cliff. There is a phase transition in hallucination rates around 90% compression (roughly 1.5 bits/dimension). Below this threshold, output quality degrades sharply and non-monotonically.
-
Shannon's 1.5 bits/param floor confirmed. TurboQuant at 3-bit achieves 5.0x compression. The information-theoretic limit for this distortion level is approximately 3.3x. TurboQuant is within 1.5x of the wall — not much room left for algorithmic improvement.
pip install turboquantdcOr install from source:
pip install -e .import torch
from turboquantdc import TurboQuantEstimator
# Compress key vectors (d=128, 3-bit)
estimator = TurboQuantEstimator(d=128, bits=3, device="cuda")
keys = torch.randn(4096, 128, device="cuda")
compressed = estimator.quantize(keys)
# Estimate inner products — mathematically unbiased
query = torch.randn(1, 128, device="cuda")
scores = estimator.inner_product(query, compressed) # shape: (1, 4096)from turboquantdc.hf_integration import TurboQuantCache
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", ...)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# MSE-only mode: best for generation (no QJL overhead, coherent output)
cache = TurboQuantCache(bits=4, mse_only=True)
inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)from turboquantdc import TurboQuantKVCache
cache = TurboQuantKVCache(d_key=128, d_value=128, bits=3, device="cuda")
cache.append(keys, values)
scores = cache.attention_scores(queries) # unbiased attention scores
values = cache.get_values() # MSE-reconstructed values
print(cache.memory_usage_bits()) # compression statsfrom turboquantdc.streaming import StreamingInferenceEngine
engine = StreamingInferenceEngine("Qwen/Qwen2.5-14B-Instruct", bits=3)
engine.load_model_streaming()
output = engine.generate("Explain general relativity:", max_new_tokens=50)
# Peak VRAM: 8.3 GB for a 29.5 GB model
print(output)
print(engine.memory_report())# Generate text with shadow-compressed KV cache
python demo.py --prompt "Explain quantum computing" --max-tokens 100 --bits 3Input key vector x (d dimensions, FP16)
|
v
Stage 1: PolarQuant (MSE-optimal)
+-----------------------------------------+
| 1. Rotate: y = R @ x | R = d x d orthogonal (QR of Gaussian)
| 2. Quantize: idx = nearest_centroid(y) | Lloyd-Max codebook, b-1 bits/coord
| 3. Reconstruct: x_mse = R^T @ centroids[idx]
+-----------------------------------------+
|
v residual r = x - x_mse
Stage 2: QJL (1-bit bias correction)
+-----------------------------------------+
| 4. Project: p = S @ r | S = d x d Gaussian
| 5. Store: signs = sign(p) | 1 bit per dimension
| 6. Store: norm = ||r|| | 1 FP16 scalar
+-----------------------------------------+
|
v At attention time
Estimator: <q, x> = <q, x_mse> + norm * sqrt(pi/2)/m * <S@q, signs>
Storage: (b-1)*d + d + 16 bits per vector. At 3-bit: 5.0x compression vs FP16.
Extensions implemented from community research (TheTom/turboquant_plus) and our own experiments:
| Extension | What it does | Result |
|---|---|---|
| Sparse V Dequantization | Skips decoding value vectors with negligible attention weight. Not compression — same storage, less compute. | +22.8% decode speed, 0.999+ cosine sim |
| Fractional Bit Rates | Split channels into high/low groups after rotation for non-integer bit-widths (2.5-bit, 3.5-bit). | 2.5-bit @ 5.56x, 3.5-bit @ 4.13x compression |
| Layer-Adaptive Compression | Higher bits for quality-critical final layers, lower bits for early layers. Strategies: tail_preserve, gradient, custom. | q8_0-equivalent quality at 3.5x compression |
| Walsh-Hadamard Transform | O(d log d) butterfly rotation replacing O(d^2) QR decomposition. 256x memory reduction for rotation matrices. | Equivalent quality, faster rotation |
| Temporal Decay | 3-tier hot/warm/cold progressive compression — older tokens get lower precision. | 30-34% additional memory savings at long context |
Run models that don't fit in VRAM by streaming one transformer layer at a time from CPU to GPU. Only one layer's weights need to be on GPU at any moment; the compressed KV cache stays resident.
VRAM budget = sizeof(one_layer) + sizeof(embeddings) + sizeof(lm_head)
+ sizeof(TQ_KV_cache) + sizeof(activations)
The tradeoff is speed: PCIe bandwidth limits throughput to ~2-5 tok/s for large models. Correctness is maintained because each layer forward pass is identical to the non-streaming version.
Also includes:
- Chunked prefill (
ChunkedPrefillEngine): Process arbitrarily long documents by splitting into chunks, running each through the model, and compressing the KV cache after each chunk. - Custom attention (
turboquant_attention,patch_model_attention): Replace standard Q@K^T with TurboQuant's unbiased inner product estimator for full mathematical correctness. - Triton fused kernels (
TritonTurboQuant): Fused rotate + quantize + QJL sign, fused inner product, and fused dequantize. Drop-in replacement forTurboQuantEstimator.
This entire project was built by a team of specialized AI agents coordinated through a real-time war room dashboard:
| Agent | Role | Contribution |
|---|---|---|
| Archimedes | Math Researcher | Extracted all equations from the paper, caught a notation trap (sqrt(3*pi)/2 vs sqrt(3)*pi/2) |
| Darwin | Reference Analyzer | Found 3 bugs in the reference implementation, identified 6 improvements |
| Turing | Algorithm Architect | Implemented all 6 core modules + demo + benchmarks |
| Tesla | CUDA Engineer | Validated d=256 codebooks, GPU throughput benchmarks, vLLM integration |
| Maxwell | Validation Engineer | 568 tests (TDD), bit-width sweeps, GitHub packaging |
30+ agents | 568 tests | 26,000+ lines | 7 phases
The full agent conversation is in docs/WARROOM_TRANSCRIPT.md.
The war room dashboard ran at localhost:8811 during development, showing live agent status, message feed, and phase progress.
turboquantdc/ Core algorithm + extensions (7,333 lines)
codebook.py Lloyd-Max optimal scalar quantizer
rotation.py Random orthogonal rotation + Walsh-Hadamard transform
polarquant.py Stage 1: MSE-optimal vector quantization
qjl.py Stage 2: 1-bit QJL bias correction
estimator.py Combined unbiased inner product estimator
kv_cache.py Drop-in compressed KV cache wrapper
hf_integration.py HuggingFace transformers Cache integration
custom_attention.py Unbiased attention score computation
vllm_integration.py vLLM attention backend + cache manager
streaming.py Layer-streaming inference engine
chunked_prefill.py Chunked prefill for long documents
sparse_v.py Attention-gated sparse value dequantization
outlier.py Fractional bit rates (2.5, 3.5-bit)
layer_adaptive.py Per-layer bit-width assignment
temporal_decay.py 3-tier hot/warm/cold progressive compression
triton_kernels.py Fused Triton kernels for encode/decode/IP
delta_coding.py Cross-layer delta coding (negative result)
sparse_loading.py Sparse weight loading predictor
tests/ 568 tests, 13 seconds runtime
benchmarks/ Synthetic, real model, comparison, long context,
ultra long context, hard tasks, HF benchmark,
weight analysis, sparsity analysis,
impossible inference (10 files, 5,400+ lines)
examples/ HuggingFace usage example
demo.py Standalone text generation with compressed KV cache
demo_app.py Gradio interactive demo (HuggingFace Spaces)
demo_70b.py 70B model streaming demo
showcase.py End-to-end inference benchmarks
notebooks/ Jupyter notebook demo
warroom/ Real-time agent dashboard (served at localhost:8811)
docs/
MATH_SPEC.md Complete mathematical specification from paper
REFERENCE_ANALYSIS.md Analysis of tonbistudio reference implementation
IMPOSSIBLE_INFERENCE.md Path to 200B on 24GB (five-layer compression stack)
HF_PR_DESCRIPTION.md HuggingFace PR description
REDDIT_POST.md r/LocalLLaMA launch post
WARROOM_TRANSCRIPT.md Full agent conversation log
Total: 26,000+ lines of implementation, tests, benchmarks, demos, and integration.
# 568 tests (13 seconds)
python -m pytest tests/ -v
# Synthetic validation against paper bounds
python benchmarks/synthetic.py
# Real model validation (downloads Qwen2.5-3B)
python benchmarks/real_model.py
# Bit-width comparison sweep
python benchmarks/compare.py
# Ultra long context benchmark (128K-1M synthetic KV cache)
python benchmarks/ultra_long_context.py --max-tokens 1048576
# Hard tasks: math, code, reasoning, factual recall under TQ-3 vs FP16
python benchmarks/hard_tasks.py
# HuggingFace generate() benchmark (memory, speed, quality)
python benchmarks/hf_benchmark.py
# Long context with real model (downloads Qwen3.5-27B, needs 22GB+ free VRAM)
TURBOQUANT_MODEL="Qwen/Qwen3.5-27B" python benchmarks/long_context.py --context 2048
# Impossible inference projections (200B on 24GB)
python benchmarks/impossible_inference.py- GitHub: github.com/dhawalc/turboQuantDC
- PyPI: pypi.org/project/turboquantdc
- Interactive Demo: dhawalc.github.io/turboQuantDC
- HuggingFace Space: huggingface.co/spaces/dhawalchheda/turboquantdc
- Colab Notebook: Open in Colab
Based on:
@inproceedings{turboquant2026,
title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author = {Zandieh, Amir and Daliri, Majid and Hadian, Ali and Mirrokni, Vahab},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
note = {arXiv:2504.19874},
}MIT License. See LICENSE.
This is an independent from-scratch implementation. Not affiliated with or endorsed by Google.