ZipServ is a lossless compression framework co-designed for efficient Large Language Model (LLM) inference. It addresses the memory and bandwidth bottlenecks in bit-exact LLM serving through novel hardware-aware compression techniques.
Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact LLM serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic.
ZipServ introduces:
- Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE): A novel fixed-length format that enables constant-time, parallel decoding
- Fused Decompression-GEMM (ZipGEMM) Kernel: Decodes weights on-the-fly directly into Tensor Core registers, eliminating intermediate buffers and maximizing compute intensity
- "Load-compressed, Compute-decompressed" Design: Eliminates redundant memory traffic through co-designed system architecture
Key Results:
- Reduces model size by up to 30%
- Achieves up to 2.21× kernel-level speedup over NVIDIA's cuBLAS
- Expedites end-to-end inference by an average of 1.22× over vLLM
ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.
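The key idea behind a fixed-length format is easiest to see by contrast with entropy coding. Below is a toy sketch of a bitmap-indexed, fixed-length scheme (illustrative only — this is **not** the actual triple-bitmap layout used by TCA-TBE):

```python
import numpy as np

# Toy illustration (NOT the real TCA-TBE format): a bitmap marks which
# bytes are nonzero, and only nonzero bytes are stored in the payload.
# Every element's payload position is computable with a prefix sum over
# the bitmap -- constant work per element, with no serial bitstream to
# walk as in Huffman-style entropy coding.

def encode(data: np.ndarray):
    bitmap = data != 0          # 1 bit of metadata per byte
    payload = data[bitmap]      # densely packed nonzero bytes
    return bitmap, payload

def decode(bitmap: np.ndarray, payload: np.ndarray) -> np.ndarray:
    out = np.zeros(bitmap.shape, dtype=payload.dtype)
    # Each output slot locates its payload byte independently via a
    # prefix sum -- this is what makes the decode parallel-friendly.
    idx = np.cumsum(bitmap) - 1
    out[bitmap] = payload[idx[bitmap]]
    return out

data = np.array([0, 7, 0, 0, 3, 9, 0, 1], dtype=np.uint8)
bitmap, payload = encode(data)
assert np.array_equal(decode(bitmap, payload), data)
```

Because each element's location is recoverable from the bitmap alone, every GPU thread can decode its own element in constant time; a variable-length bitstream, by contrast, must be decoded sequentially.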
Requirements:
- OS: Ubuntu 20.04+ (recommended)
- GCC: >= 9.0
- CMake: >= 3.30.3
- CUDA: >= 12.2
- NVIDIA GPU: Compute capability >= 8.0 (Ampere or newer, e.g., A100, A6000, RTX 4090)
- Python: 3.10+
- PyTorch: >= 2.0
- Conda: For environment management
```bash
git clone https://github.com/xxyux/ZipServ.git
cd ZipServ
conda env create -f LInfer_env.yml
conda activate linfer
cd third_party/
bash setup_vllm.sh
cd ..
```

This will download and apply necessary patches to vLLM for ZipServ integration.

```bash
python setup.py install
```

This installs the ZipServ Python package and compiles the CUDA extensions for model compression.

```bash
cd third_party/vllm
MAX_JOBS=16 pip install -e . -v > install_log.log 2>&1
```

This installs vLLM with ZipServ (linfer) quantization support in editable mode.
Before running inference, you need to prepare your model by adding the quantization configuration.
ZipServ now supports direct inference with original model weights. You only need to create a configuration file — no pre-compression is required.
In your model directory (e.g., /path/to/llama-3-8B-Instruct/), create a new file named quant_config.json:
```bash
cd /path/to/llama-3-8B-Instruct/
```

Create `quant_config.json` with the following content:

```json
{
  "quant_method": "linfer",
  "lm_head": false
}
```

Key fields:
- `quant_method`: Must be `"linfer"` (the internal quantization method name)
- `lm_head`: Whether to quantize the language model head (`true` or `false`)
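Equivalently, the file can be written from Python; `model_dir` below is a placeholder for your actual model directory:

```python
import json
from pathlib import Path

# Placeholder -- point this at your model directory,
# e.g. /path/to/llama-3-8B-Instruct/.
model_dir = Path(".")

quant_config = {
    "quant_method": "linfer",  # must be exactly "linfer"
    "lm_head": False,          # set True to also compress the LM head
}
(model_dir / "quant_config.json").write_text(json.dumps(quant_config, indent=2))
```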
Your model directory should look like this:
```
/hy-tmp/llama-3-8B-Instruct/
├── config.json          # Original model config (do not modify)
├── quant_config.json    # <-- Create this file
├── model.safetensors
├── tokenizer.json
└── ...
```
That's it! ZipServ will automatically compress weights on-the-fly during model loading.
Navigate to the vLLM directory:
```bash
cd third_party/vllm
```

Run without ZipServ compression for comparison:

```bash
# Single GPU
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2
```

Run with ZipServ (linfer) compression:
```bash
# Single GPU - Llama-3-8B
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU - Mistral-24B (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2

# Multi-GPU - Llama-3-70B (4 GPUs)
python run_linfer.py \
    -m /hy-tmp/llama-3-70B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 4
```

| Argument | Description | Example |
|---|---|---|
| `-m, --model` | Path to the model directory | `/hy-tmp/llama-3-8B-Instruct` |
| `--quantization` | Quantization method: `dense` or `linfer` | `linfer` |
| `-b, --batch-size` | Batch size for inference | `8` |
| `-i, --input-len` | Input sequence length | `64` |
| `-o, --output-len` | Output sequence length | `256` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | `1`, `2`, `4` |
```
ZipServ/
├── LInfer_py/                 # Python utilities for model compression
│   ├── compress_model.py      # Model compression script
│   ├── verify_compression.py  # Verification tool
│   └── backend/               # Backend implementations
├── csrc/                      # CUDA source code
│   ├── L_API.cu               # Core compression/decompression API (TCA-TBE)
│   ├── L_Kernel.cuh           # CUDA kernels (ZipGEMM)
│   └── ...
├── third_party/
│   └── vllm/                  # Patched vLLM with ZipServ support
│       ├── vllm/model_executor/layers/quantization/linfer.py
│       └── ...
├── setup.py                   # ZipServ installation script
├── LInfer_env.yml             # Conda environment specification
└── README.md                  # This file
```
Solution: Make sure you have installed ZipServ properly:
```bash
cd /path/to/ZipServ
python setup.py install
```

Solution: Ensure your model's `config.json` contains the `quantization_config` field with `quant_method: "linfer"`.
Solution: Reduce batch size (-b) or increase tensor parallelism (--tensor-parallel-size).
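A back-of-the-envelope estimate of per-GPU weight memory can help pick `-b` and the tensor-parallel degree. The sketch below uses the "up to 30%" size-reduction figure from above (illustrative, not measured); activations and the KV cache come on top and are not modeled:

```python
# Rough per-GPU weight memory for Llama-3-8B in FP16 after ZipServ's
# ~30% lossless size reduction. Numbers are illustrative estimates.
params = 8e9            # parameter count
bytes_per_param = 2     # FP16
ratio = 0.70            # compressed/original size (~30% reduction)
tp = 1                  # --tensor-parallel-size

weights_gib = params * bytes_per_param * ratio / tp / 2**30
print(f"~{weights_gib:.1f} GiB of compressed weights per GPU")
```

If this figure plus KV cache exceeds your GPU's memory, lower `-b` or raise `--tensor-parallel-size` as suggested above.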