
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

ZipServ is a lossless compression framework co-designed for efficient Large Language Model (LLM) inference. It addresses the memory and bandwidth bottlenecks in bit-exact LLM serving through novel hardware-aware compression techniques.


Overview

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact LLM serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic.

ZipServ introduces:

  • Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE): A novel fixed-length format that enables constant-time, parallel decoding
  • Fused Decompression-GEMM (ZipGEMM) Kernel: Decodes weights on-the-fly directly into Tensor Core registers, eliminating intermediate buffers and maximizing compute intensity
  • "Load-compressed, Compute-decompressed" Design: Eliminates redundant memory traffic through co-designed system architecture
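To illustrate why a fixed-length bitmap format decodes in constant time, here is a minimal sketch of a single-bitmap zero-byte codec. This is an illustrative stand-in, not the actual TCA-TBE triple-bitmap format: it only shows the key property the design relies on, namely that a fixed-size bitmap lets each element be located and decoded independently of its neighbors, which keeps SIMT lanes in lockstep.

```python
def encode(data: bytes) -> tuple[bytes, bytes]:
    """Store a presence bitmap (1 bit per byte) plus only the nonzero bytes."""
    bitmap = bytearray((len(data) + 7) // 8)
    payload = bytearray()
    for i, b in enumerate(data):
        if b != 0:
            bitmap[i // 8] |= 1 << (i % 8)
            payload.append(b)
    return bytes(bitmap), bytes(payload)

def decode(bitmap: bytes, payload: bytes, n: int) -> bytes:
    """Reconstruct the original n bytes losslessly."""
    out = bytearray(n)
    pos = 0  # a GPU kernel would obtain this offset via a parallel prefix sum
    for i in range(n):
        if bitmap[i // 8] >> (i % 8) & 1:
            out[i] = payload[pos]
            pos += 1
    return bytes(out)

weights = bytes([0, 0, 0x3C, 0, 0x42, 0, 0, 0x01])
bm, pl = encode(weights)
assert decode(bm, pl, len(weights)) == weights  # bit-exact round trip
```

Unlike a variable-length entropy bitstream, the bitmap's position for element i is known a priori, so every thread can decode its own element without scanning its predecessors.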

Key Results:

  • Reduces model size by up to 30%
  • Achieves up to 2.21× kernel-level speedup over NVIDIA's cuBLAS
  • Accelerates end-to-end inference by an average of 1.22× over vLLM

ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.


Requirements

System Requirements

  • OS: Ubuntu 20.04+ (recommended)
  • GCC: >= 9.0
  • CMake: >= 3.30.3
  • CUDA: >= 12.2
  • NVIDIA GPU: Compute capability >= 8.0 (Ampere or newer, e.g., A100, A6000, RTX 4090)

Software Dependencies

  • Python: 3.10+
  • PyTorch: >= 2.0
  • Conda: For environment management
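A quick way to check your machine against these requirements is the sketch below. The helper name `is_ampere_or_newer` is ours, not part of ZipServ; the `torch.cuda` calls are standard PyTorch APIs and only run if a GPU is visible.

```python
# Sanity-check the GPU and PyTorch requirements listed above.
def is_ampere_or_newer(major: int, minor: int) -> bool:
    """ZipServ targets compute capability >= 8.0 (Ampere and newer)."""
    return (major, minor) >= (8, 0)

try:
    import torch
    print("PyTorch:", torch.__version__)
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        ok = is_ampere_or_newer(major, minor)
        print(f"GPU compute capability: {major}.{minor}", "(OK)" if ok else "(too old)")
    else:
        print("No CUDA device visible; install CUDA >= 12.2 and a driver first.")
except ImportError:
    print("PyTorch not installed; create the conda environment first.")
```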

Installation

Step 1: Clone the Repository

git clone https://github.com/xxyux/ZipServ.git
cd ZipServ

Step 2: Create Conda Environment

conda env create -f LInfer_env.yml
conda activate linfer

Step 3: Setup vLLM

cd third_party/
bash setup_vllm.sh
cd ..

This will download and apply necessary patches to vLLM for ZipServ integration.

Step 4: Install ZipServ Compression Tools

python setup.py install

This installs the ZipServ Python package and compiles the CUDA extensions for model compression.

Step 5: Install vLLM (for Inference)

cd third_party/vllm
MAX_JOBS=16 pip install -e . -v > install_log.log 2>&1

This installs vLLM with ZipServ (linfer) quantization support in editable mode.


Model Preparation

Before running inference, you need to prepare your model by adding the quantization configuration.

ZipServ now supports direct inference with original model weights. You only need to create a configuration file — no pre-compression is required.

Configuration

In your model directory (e.g., /path/to/llama-3-8B-Instruct/), create a new file named quant_config.json:

cd /path/to/llama-3-8B-Instruct/

Create quant_config.json with the following content:

{
  "quant_method": "linfer",
  "lm_head": false
}

Key fields:

  • quant_method: Must be "linfer" (the internal quantization method name)
  • lm_head: Whether to quantize the language model head (true or false)
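The two-field config above can be written directly from the shell. A minimal sketch, to be run inside the model directory (the path shown earlier is illustrative):

```shell
# Create quant_config.json with the linfer quantization settings
cat > quant_config.json <<'EOF'
{
  "quant_method": "linfer",
  "lm_head": false
}
EOF

# Optional: confirm the JSON is well-formed
python3 -m json.tool quant_config.json
```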

Your model directory should look like this:

/hy-tmp/llama-3-8B-Instruct/
├── config.json              # Original model config (do not modify)
├── quant_config.json        # <-- Create this file
├── model.safetensors
├── tokenizer.json
└── ...

That's it! ZipServ will automatically compress weights on-the-fly during model loading.


Running Inference

Navigate to the vLLM directory:

cd third_party/vllm

1. Baseline (Dense) Inference

Run without ZipServ compression for comparison:

# Single GPU
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2

2. ZipServ Compressed Inference

Run with ZipServ (linfer) compression:

# Single GPU - Llama-3-8B
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU - Mistral-24B (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2

# Multi-GPU - Llama-3-70B (4 GPUs)
python run_linfer.py \
    -m /hy-tmp/llama-3-70B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 4

Command Line Arguments

| Argument | Description | Example |
|---|---|---|
| `-m, --model` | Path to the model directory | `/hy-tmp/llama-3-8B-Instruct` |
| `--quantization` | Quantization method: `dense` or `linfer` | `linfer` |
| `-b, --batch-size` | Batch size for inference | `8` |
| `-i, --input-len` | Input sequence length | `64` |
| `-o, --output-len` | Output sequence length | `256` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | `1`, `2`, `4` |

Project Structure

ZipServ/
├── LInfer_py/                    # Python utilities for model compression
│   ├── compress_model.py         # Model compression script
│   ├── verify_compression.py     # Verification tool
│   └── backend/                  # Backend implementations
├── csrc/                         # CUDA source code
│   ├── L_API.cu                  # Core compression/decompression API (TCA-TBE)
│   ├── L_Kernel.cuh              # CUDA kernels (ZipGEMM)
│   └── ...
├── third_party/
│   └── vllm/                     # Patched vLLM with ZipServ support
│       ├── vllm/model_executor/layers/quantization/linfer.py
│       └── ...
├── setup.py                      # ZipServ installation script
├── LInfer_env.yml                # Conda environment specification
└── README.md                     # This file

Troubleshooting

Issue: ImportError: Please compile and install the LInfer C++ extension

Solution: Make sure you have installed ZipServ properly:

cd /path/to/ZipServ
python setup.py install

Issue: ValueError: LInferConfig requires 'linfer', got ''

Solution: Ensure your model directory contains quant_config.json (or that config.json has a quantization_config entry) with quant_method set to "linfer".

Issue: CUDA out of memory

Solution: Reduce batch size (-b) or increase tensor parallelism (--tensor-parallel-size).
