A research project exploring lossless compression of LLM weights beyond 4-bit quantization.
When LLM weights are quantized to 4 bits, the resulting symbol stream has a Shannon entropy well below 4 bits per symbol:
| Model | Parameters | Entropy (bits/weight) | Compression over 4-bit |
|---|---|---|---|
| Qwen2.5-1.5B | 1.5B | 1.12 bits | 3.58x |
| Qwen2.5-0.5B | 494M | 1.15 bits | 3.47x |
| GPT-2 | 137M | 1.17 bits | 3.42x |
| SmolLM-135M | 134M | 1.54 bits | 2.59x |
| Average | — | — | 3.27x |
This means an entropy coder (rANS) can losslessly compress the 4-bit weights by a further 3.27x on average.
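A minimal sketch of how such a figure is measured, using numpy and a synthetic Gaussian tensor as a stand-in for real weights (the toy tensor lands higher than the 1.1-1.5 bits measured on real checkpoints above, but shows the same bell-shaped, non-uniform distribution):

```python
import numpy as np

def shannon_entropy_bits(symbols: np.ndarray, num_levels: int = 16) -> float:
    """Shannon entropy in bits/symbol of a discrete symbol stream."""
    counts = np.bincount(symbols, minlength=num_levels)
    p = counts[counts > 0] / symbols.size
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in for one weight tensor: LLM weights are roughly
# Gaussian, so symmetric 4-bit quantization concentrates mass on the
# central few of the 16 levels.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
scale = np.abs(w).max() / 7                        # int4 levels -8..7
symbols = (np.clip(np.rint(w / scale), -8, 7) + 8).astype(np.int64)

h = shannon_entropy_bits(symbols)
print(f"entropy = {h:.2f} bits/symbol -> {4.0 / h:.2f}x over raw 4-bit")
```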
For a 7B model on Apple Silicon M3 Pro (150 GB/s bandwidth):
| Format | Size | Tokens/sec |
|---|---|---|
| 4-bit quantized | 3.5 GB | 42.9 |
| Entropy-coded 4-bit | 1.07 GB | 140.0 |
A projected 3.27x decoding speedup with zero quality loss, assuming generation is memory-bandwidth-bound.
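The throughput figures follow from a simple bandwidth-bound model: each generated token streams the full weight set once, so tokens/sec ≈ bandwidth / weight bytes. A quick check of the table:

```python
# Bandwidth-bound decoding estimate for a 7B model on M3 Pro.
BANDWIDTH_GB_S = 150.0
for fmt, size_gb in [("4-bit quantized", 3.5),
                     ("entropy-coded 4-bit", 3.5 / 3.27)]:
    print(f"{fmt}: {size_gb:.2f} GB -> {BANDWIDTH_GB_S / size_gb:.1f} tok/s")
```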
Repository layout:

```
PQQ/
├── src/
│   ├── quantizer.py             # Quantization utilities
│   ├── rans.py                  # rANS entropy codec
│   ├── model_analyzer.py        # Model analysis tools
│   └── compressed_format.py     # File format
├── experiments/
│   └── validate_compression.py  # Main validation
├── docs/
│   └── metal_kernel_design.md   # GPU kernel design
├── models/                      # Local model weights
├── entropy_coding_research.md   # Full research doc
└── paper_findings.md            # Original research context
```
Usage:

```bash
# Validate compression on models
python experiments/validate_compression.py
```

How it works:
- Observation: 4-bit quantized weights have a non-uniform, bell-shaped symbol distribution
- Entropy: Shannon entropy ≈ 1.1-1.5 bits, not 4 bits
- Compression: rANS encoding approaches the entropy limit (see the sketch after this list)
- Inference: Fused decode+GEMM kernel loads less data from memory
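The repository's codec lives in src/rans.py; the block below is a minimal, self-contained sketch of rANS in the classic byte-renormalized style with a 12-bit quantized frequency table, included only to make the encode/decode mechanics concrete (it is not the repository's implementation):

```python
import numpy as np

PROB_BITS = 12                 # frequencies are quantized to sum to 2^12
PROB_SCALE = 1 << PROB_BITS
RANS_L = 1 << 16               # lower bound of the normalized state range

def build_tables(symbols: np.ndarray, num_levels: int = 16):
    """Quantize the empirical histogram so frequencies sum to PROB_SCALE."""
    counts = np.bincount(symbols, minlength=num_levels)
    freqs = np.maximum(1, np.rint(counts / counts.sum() * PROB_SCALE)).astype(np.int64)
    freqs[np.argmax(freqs)] += PROB_SCALE - freqs.sum()  # absorb rounding drift
    return freqs, np.concatenate(([0], np.cumsum(freqs)))

def rans_encode(symbols, freqs, cums):
    x, out = RANS_L, bytearray()
    for s in symbols[::-1]:                  # encode in reverse, decode forward
        f, c = int(freqs[s]), int(cums[s])
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:     # renormalize
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * PROB_SCALE + (x % f) + c
    return x, bytes(out)

def rans_decode(x, data, n, freqs, cums):
    pos, out = len(data), np.empty(n, dtype=np.int64)
    for i in range(n):
        slot = x & (PROB_SCALE - 1)
        s = int(np.searchsorted(cums, slot, side="right")) - 1
        x = int(freqs[s]) * (x >> PROB_BITS) + slot - int(cums[s])
        while x < RANS_L:                    # refill from the byte stream
            pos -= 1
            x = (x << 8) | data[pos]
        out[i] = s
    return out

# Round trip on the quantized symbols from the earlier sketch:
# freqs, cums = build_tables(symbols)
# state, blob = rans_encode(symbols, freqs, cums)
# assert (rans_decode(state, blob, symbols.size, freqs, cums) == symbols).all()
```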
Our contribution: rANS decoding fused with GEMM, targeting Apple Silicon's extreme compute-to-bandwidth ratio (47:1).
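The fusion idea, sketched conceptually in numpy (the real kernel is in Metal, per docs/metal_kernel_design.md; `rans_decode` and the frequency tables come from the sketch above, and the tile layout is purely illustrative):

```python
import numpy as np

def fused_decode_matvec(tiles, x, scale, freqs, cums):
    """Conceptual fused decode+GEMV. Each weight tile is stored as an
    independently rANS-coded blob; a GPU threadgroup would decode its
    tile into on-chip memory and multiply immediately, so only the
    compressed bytes (~1.1-1.5 bits/weight) cross the memory bus.
    `tiles` is a hypothetical list of (state, blob, rows) row blocks."""
    out = []
    for state, blob, rows in tiles:
        syms = rans_decode(state, blob, rows * x.size, freqs, cums)
        w = (syms.reshape(rows, x.size) - 8).astype(np.float32) * scale
        out.append(w @ x)            # consume the tile, never store it
    return np.concatenate(out)
```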
Planned next steps:
- Metal kernel implementation (see docs/metal_kernel_design.md)
- MLX integration
- End-to-end benchmarks
- Publication
Research code, released under the MIT License.