
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

ZipServ is a lossless compression framework co-designed for efficient Large Language Model (LLM) inference. It addresses the memory and bandwidth bottlenecks in bit-exact LLM serving through novel hardware-aware compression techniques.


Overview

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact LLM serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic.

ZipServ introduces:

  • Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE): A novel fixed-length format that enables constant-time, parallel decoding
  • Fused Decompression-GEMM (ZipGEMM) Kernel: Decodes weights on-the-fly directly into Tensor Core registers, eliminating intermediate buffers and maximizing compute intensity
  • "Load-compressed, Compute-decompressed" Design: Eliminates redundant memory traffic through co-designed system architecture
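To illustrate why a fixed-length bitmap format decodes in constant time, here is a minimal sketch of a single-bitmap zero-byte codec. This is an illustrative stand-in, not the actual TCA-TBE triple-bitmap format: it only shows the key property the design relies on, namely that a fixed-size bitmap lets each element be located and decoded independently of its neighbors, which keeps SIMT lanes in lockstep.

```python
def encode(data: bytes) -> tuple[bytes, bytes]:
    """Store a presence bitmap (1 bit per byte) plus only the nonzero bytes."""
    bitmap = bytearray((len(data) + 7) // 8)
    payload = bytearray()
    for i, b in enumerate(data):
        if b != 0:
            bitmap[i // 8] |= 1 << (i % 8)
            payload.append(b)
    return bytes(bitmap), bytes(payload)

def decode(bitmap: bytes, payload: bytes, n: int) -> bytes:
    """Reconstruct the original n bytes losslessly."""
    out = bytearray(n)
    pos = 0  # a GPU kernel would obtain this offset via a parallel prefix sum
    for i in range(n):
        if bitmap[i // 8] >> (i % 8) & 1:
            out[i] = payload[pos]
            pos += 1
    return bytes(out)

weights = bytes([0, 0, 0x3C, 0, 0x42, 0, 0, 0x01])
bm, pl = encode(weights)
assert decode(bm, pl, len(weights)) == weights  # bit-exact round trip
```

Unlike a variable-length entropy bitstream, the bitmap's position for element i is known a priori, so every thread can decode its own element without scanning its predecessors.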

Key Results:

  • Reduces model size by up to 30%
  • Achieves up to 2.21× kernel-level speedup over NVIDIA's cuBLAS
  • Accelerates end-to-end inference by an average of 1.22× over vLLM

ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.


Requirements

System Requirements

  • OS: Ubuntu 20.04+ (recommended)
  • GCC: >= 9.0
  • CMake: >= 3.30.3
  • CUDA: >= 12.2
  • NVIDIA GPU: Compute capability >= 8.0 (Ampere or newer, e.g., A100, A6000, RTX 4090)

Software Dependencies

  • Python: 3.10+
  • PyTorch: >= 2.0
  • Conda: For environment management
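A quick way to check your machine against these requirements is the sketch below. The helper name `is_ampere_or_newer` is ours, not part of ZipServ; the `torch.cuda` calls are standard PyTorch APIs and only run if a GPU is visible.

```python
# Sanity-check the GPU and PyTorch requirements listed above.
def is_ampere_or_newer(major: int, minor: int) -> bool:
    """ZipServ targets compute capability >= 8.0 (Ampere and newer)."""
    return (major, minor) >= (8, 0)

try:
    import torch
    print("PyTorch:", torch.__version__)
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        ok = is_ampere_or_newer(major, minor)
        print(f"GPU compute capability: {major}.{minor}", "(OK)" if ok else "(too old)")
    else:
        print("No CUDA device visible; install CUDA >= 12.2 and a driver first.")
except ImportError:
    print("PyTorch not installed; create the conda environment first.")
```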

Installation

Step 1: Clone the Repository

git clone https://github.com/xxyux/ZipServ.git
cd ZipServ

Step 2: Create Conda Environment

conda env create -f LInfer_env.yml
conda activate linfer

Step 3: Setup vLLM

cd third_party/
bash setup_vllm.sh
cd ..

This will download and apply necessary patches to vLLM for ZipServ integration.

Step 4: Install ZipServ Compression Tools

python setup.py install

This installs the ZipServ Python package and compiles the CUDA extensions for model compression.

Step 5: Install vLLM (for Inference)

cd third_party/vllm
MAX_JOBS=16 pip install -e . -v > install_log.log 2>&1

This installs vLLM with ZipServ (linfer) quantization support in editable mode.


Model Preparation

Before running inference, you need to prepare your model by adding the quantization configuration.

ZipServ now supports direct inference with original model weights. You only need to create a configuration file — no pre-compression is required.

Configuration

In your model directory (e.g., /path/to/llama-3-8B-Instruct/), create a new file named quant_config.json:

cd /path/to/llama-3-8B-Instruct/

Create quant_config.json with the following content:

{
  "quant_method": "linfer",
  "lm_head": false
}

Key fields:

  • quant_method: Must be "linfer" (the internal quantization method name)
  • lm_head: Whether to quantize the language model head (true or false)
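The two-field config above can be written directly from the shell. A minimal sketch, to be run inside the model directory (the path shown earlier is illustrative):

```shell
# Create quant_config.json with the linfer quantization settings
cat > quant_config.json <<'EOF'
{
  "quant_method": "linfer",
  "lm_head": false
}
EOF

# Optional: confirm the JSON is well-formed
python3 -m json.tool quant_config.json
```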

Your model directory should look like this:

/hy-tmp/llama-3-8B-Instruct/
├── config.json              # Original model config (do not modify)
├── quant_config.json        # <-- Create this file
├── model.safetensors
├── tokenizer.json
└── ...

That's it! ZipServ will automatically compress weights on-the-fly during model loading.


Running Inference

Navigate to the vLLM directory:

cd third_party/vllm

1. Baseline (Dense) Inference

Run without ZipServ compression for comparison:

# Single GPU
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization dense \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2

2. ZipServ Compressed Inference

Run with ZipServ (linfer) compression:

# Single GPU - Llama-3-8B
python run_linfer.py \
    -m /hy-tmp/llama-3-8B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 1

# Multi-GPU - Mistral-24B (2 GPUs)
python run_linfer.py \
    -m /hy-tmp/Mistral-24B \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 2

# Multi-GPU - Llama-3-70B (4 GPUs)
python run_linfer.py \
    -m /hy-tmp/llama-3-70B-Instruct \
    --quantization linfer \
    -b 8 -i 64 -o 256 \
    --tensor-parallel-size 4

Command Line Arguments

| Argument | Description | Example |
|---|---|---|
| `-m, --model` | Path to the model directory | `/hy-tmp/llama-3-8B-Instruct` |
| `--quantization` | Quantization method: `dense` or `linfer` | `linfer` |
| `-b, --batch-size` | Batch size for inference | `8` |
| `-i, --input-len` | Input sequence length | `64` |
| `-o, --output-len` | Output sequence length | `256` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | `1`, `2`, `4` |

Project Structure

ZipServ/
├── LInfer_py/                    # Python utilities for model compression
│   ├── compress_model.py         # Model compression script
│   ├── verify_compression.py     # Verification tool
│   └── backend/                  # Backend implementations
├── csrc/                         # CUDA source code
│   ├── L_API.cu                  # Core compression/decompression API (TCA-TBE)
│   ├── L_Kernel.cuh              # CUDA kernels (ZipGEMM)
│   └── ...
├── third_party/
│   └── vllm/                     # Patched vLLM with ZipServ support
│       ├── vllm/model_executor/layers/quantization/linfer.py
│       └── ...
├── setup.py                      # ZipServ installation script
├── LInfer_env.yml                # Conda environment specification
└── README.md                     # This file

Troubleshooting

Issue: ImportError: Please compile and install the LInfer C++ extension

Solution: Make sure you have installed ZipServ properly:

cd /path/to/ZipServ
python setup.py install

Issue: ValueError: LInferConfig requires 'linfer', got ''

Solution: Ensure your model directory contains quant_config.json (or that config.json has a quantization_config entry) with quant_method set to "linfer".

Issue: CUDA out of memory

Solution: Reduce batch size (-b) or increase tensor parallelism (--tensor-parallel-size).
