A research project exploring lossless compression of LLM weights beyond 4-bit quantization.
When LLM weights are quantized to 4 bits, the resulting symbol stream has a Shannon entropy well below 4 bits per symbol:
| Model | Parameters | Entropy (bits/weight) | Compression over 4-bit |
|---|---|---|---|
| Qwen2.5-1.5B | 1.5B | 1.12 bits | 3.58x |
| Qwen2.5-0.5B | 494M | 1.15 bits | 3.47x |
| GPT-2 | 137M | 1.17 bits | 3.42x |
| SmolLM-135M | 134M | 1.54 bits | 2.59x |
| Average | — | — | 3.27x |
This means an entropy coder (rANS) can losslessly compress the 4-bit weights by a further 3.27x on average.
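A minimal sketch of how such a figure is measured, using numpy and a synthetic Gaussian tensor as a stand-in for real weights (the toy tensor lands higher than the 1.1-1.5 bits measured on real checkpoints above, but shows the same bell-shaped, non-uniform distribution):

```python
import numpy as np

def shannon_entropy_bits(symbols: np.ndarray, num_levels: int = 16) -> float:
    """Shannon entropy in bits/symbol of a discrete symbol stream."""
    counts = np.bincount(symbols, minlength=num_levels)
    p = counts[counts > 0] / symbols.size
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in for one weight tensor: LLM weights are roughly
# Gaussian, so symmetric 4-bit quantization concentrates mass on the
# central few of the 16 levels.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
scale = np.abs(w).max() / 7                        # int4 levels -8..7
symbols = (np.clip(np.rint(w / scale), -8, 7) + 8).astype(np.int64)

h = shannon_entropy_bits(symbols)
print(f"entropy = {h:.2f} bits/symbol -> {4.0 / h:.2f}x over raw 4-bit")
```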
For a 7B model on Apple Silicon M3 Pro (150 GB/s bandwidth):
| Format | Size | Tokens/sec |
|---|---|---|
| 4-bit quantized | 3.5 GB | 42.9 |
| Entropy-coded 4-bit | 1.07 GB | 140.0 |
A projected 3.27x decoding speedup with zero quality loss, assuming generation is memory-bandwidth-bound.
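The throughput figures follow from a simple bandwidth-bound model: each generated token streams the full weight set once, so tokens/sec ≈ bandwidth / weight bytes. A quick check of the table:

```python
# Bandwidth-bound decoding estimate for a 7B model on M3 Pro.
BANDWIDTH_GB_S = 150.0
for fmt, size_gb in [("4-bit quantized", 3.5),
                     ("entropy-coded 4-bit", 3.5 / 3.27)]:
    print(f"{fmt}: {size_gb:.2f} GB -> {BANDWIDTH_GB_S / size_gb:.1f} tok/s")
```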
Repository layout:

```
PQQ/
├── src/
│   ├── quantizer.py             # Quantization utilities
│   ├── rans.py                  # rANS entropy codec
│   ├── model_analyzer.py        # Model analysis tools
│   └── compressed_format.py     # File format
├── experiments/
│   └── validate_compression.py  # Main validation
├── docs/
│   └── metal_kernel_design.md   # GPU kernel design
├── models/                      # Local model weights
├── entropy_coding_research.md   # Full research doc
└── paper_findings.md            # Original research context
```
Usage:

```bash
# Validate compression on models
python experiments/validate_compression.py
```

How it works:
- Observation: 4-bit quantized weights have a non-uniform, bell-shaped symbol distribution
- Entropy: Shannon entropy ≈ 1.1-1.5 bits, not 4 bits
- Compression: rANS encoding approaches the entropy limit (see the sketch after this list)
- Inference: Fused decode+GEMM kernel loads less data from memory
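The repository's codec lives in src/rans.py; the block below is a minimal, self-contained sketch of rANS in the classic byte-renormalized style with a 12-bit quantized frequency table, included only to make the encode/decode mechanics concrete (it is not the repository's implementation):

```python
import numpy as np

PROB_BITS = 12                 # frequencies are quantized to sum to 2^12
PROB_SCALE = 1 << PROB_BITS
RANS_L = 1 << 16               # lower bound of the normalized state range

def build_tables(symbols: np.ndarray, num_levels: int = 16):
    """Quantize the empirical histogram so frequencies sum to PROB_SCALE."""
    counts = np.bincount(symbols, minlength=num_levels)
    freqs = np.maximum(1, np.rint(counts / counts.sum() * PROB_SCALE)).astype(np.int64)
    freqs[np.argmax(freqs)] += PROB_SCALE - freqs.sum()  # absorb rounding drift
    return freqs, np.concatenate(([0], np.cumsum(freqs)))

def rans_encode(symbols, freqs, cums):
    x, out = RANS_L, bytearray()
    for s in symbols[::-1]:                  # encode in reverse, decode forward
        f, c = int(freqs[s]), int(cums[s])
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:     # renormalize
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * PROB_SCALE + (x % f) + c
    return x, bytes(out)

def rans_decode(x, data, n, freqs, cums):
    pos, out = len(data), np.empty(n, dtype=np.int64)
    for i in range(n):
        slot = x & (PROB_SCALE - 1)
        s = int(np.searchsorted(cums, slot, side="right")) - 1
        x = int(freqs[s]) * (x >> PROB_BITS) + slot - int(cums[s])
        while x < RANS_L:                    # refill from the byte stream
            pos -= 1
            x = (x << 8) | data[pos]
        out[i] = s
    return out

# Round trip on the quantized symbols from the earlier sketch:
# freqs, cums = build_tables(symbols)
# state, blob = rans_encode(symbols, freqs, cums)
# assert (rans_decode(state, blob, symbols.size, freqs, cums) == symbols).all()
```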
Our contribution: rANS decoding fused with GEMM, targeting Apple Silicon's extreme compute-to-bandwidth ratio (47:1).
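The fusion idea, sketched conceptually in numpy (the real kernel is in Metal, per docs/metal_kernel_design.md; `rans_decode` and the frequency tables come from the sketch above, and the tile layout is purely illustrative):

```python
import numpy as np

def fused_decode_matvec(tiles, x, scale, freqs, cums):
    """Conceptual fused decode+GEMV. Each weight tile is stored as an
    independently rANS-coded blob; a GPU threadgroup would decode its
    tile into on-chip memory and multiply immediately, so only the
    compressed bytes (~1.1-1.5 bits/weight) cross the memory bus.
    `tiles` is a hypothetical list of (state, blob, rows) row blocks."""
    out = []
    for state, blob, rows in tiles:
        syms = rans_decode(state, blob, rows * x.size, freqs, cums)
        w = (syms.reshape(rows, x.size) - 8).astype(np.float32) * scale
        out.append(w @ x)            # consume the tile, never store it
    return np.concatenate(out)
```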
Planned next steps:
- Metal kernel implementation (see docs/metal_kernel_design.md)
- MLX integration
- End-to-end benchmarks
- Publication
Research code, released under the MIT License.