PDS-Lab/Q-Infer


Q-Infer: Towards Efficient GPU-CPU Collaborative LLM Inference via Sparsity-Aware Dynamic Scheduling

Getting Started

Setup and Installation

Pre-requisites

Requires the following dependencies:

  • CMake (3.17+)
  • Python (3.8+) and pip (19.3+), for converting model weights and automatic FFN offloading

cd Q-Infer
pip install -r requirements.txt # install Python helpers' dependencies

Build

Using CMake (3.17+):

  • Build on NVIDIA GPU:

cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

  • Build on NPU:

cmake -S . -B build -DLLAMA_CANN=ON
cmake --build build --config Release

Model Weights

Q-Infer is based on PowerInfer models, which are stored in a special format called PowerInfer GGUF. It extends the GGUF format and contains both the LLM weights and the predictor weights.

Download PowerInfer GGUF via Hugging Face

You can obtain the PowerInfer GGUF weights (*.powerinfer.gguf), along with profiled model activation statistics used for 'hot'-neuron offloading, from each of the Hugging Face repos below.

Base Model PowerInfer GGUF
LLaMA(ReLU)-2-7B PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
LLaMA(ReLU)-2-13B PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF
Falcon(ReLU)-40B PowerInfer/ReluFalcon-40B-PowerInfer-GGUF
LLaMA(ReLU)-2-70B PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF
ProSparse-LLaMA-2-7B PowerInfer/ProSparse-LLaMA-2-7B-GGUF
ProSparse-LLaMA-2-13B PowerInfer/ProSparse-LLaMA-2-13B-GGUF
Bamboo-base-7B 🌟 PowerInfer/Bamboo-base-v0.1-gguf
Bamboo-DPO-7B 🌟 PowerInfer/Bamboo-DPO-v0.1-gguf

We recommend using huggingface-cli to download the whole model repo. For example, the following command will download PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF into the ./ReluLLaMA-7B directory.

huggingface-cli download --resume-download --local-dir ReluLLaMA-7B --local-dir-use-symlinks False PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF

PowerInfer can then automatically make use of the following directory structure for feature-complete model offloading:

.
├── *.powerinfer.gguf (unquantized PowerInfer model)
├── *.q4.powerinfer.gguf (INT4-quantized PowerInfer model, if available)
├── activation (profiled activation statistics for fine-grained FFN offloading)
│   ├── activation_x.pt (profiled activation statistics for layer x)
│   └── ...
└── *.[q4].powerinfer.gguf.generated.gpuidx (GPU index generated at runtime for the corresponding model)

Convert from Original Model Weights + Predictor Weights

Hugging Face limits a single model weight file to 50 GiB. For unquantized models of 40B parameters or more, you can convert to PowerInfer GGUF yourself from the original model weights and the predictor weights obtained from Hugging Face.

Base Model Original Model Predictor
LLaMA(ReLU)-2-7B SparseLLM/ReluLLaMA-7B PowerInfer/ReluLLaMA-7B-Predictor
LLaMA(ReLU)-2-13B SparseLLM/ReluLLaMA-13B PowerInfer/ReluLLaMA-13B-Predictor
Falcon(ReLU)-40B SparseLLM/ReluFalcon-40B PowerInfer/ReluFalcon-40B-Predictor
LLaMA(ReLU)-2-70B SparseLLM/ReluLLaMA-70B PowerInfer/ReluLLaMA-70B-Predictor
ProSparse-LLaMA-2-7B SparseLLM/ProSparse-LLaMA-2-7B PowerInfer/ProSparse-LLaMA-2-7B-Predictor
ProSparse-LLaMA-2-13B SparseLLM/ProSparse-LLaMA-2-13B PowerInfer/ProSparse-LLaMA-2-13B-Predictor
Bamboo-base-7B 🌟 PowerInfer/Bamboo-base-v0.1 PowerInfer/Bamboo-base-v0.1-predictor
Bamboo-DPO-7B 🌟 PowerInfer/Bamboo-DPO-v0.1 PowerInfer/Bamboo-DPO-v0.1-predictor

You can use the following command to convert the original model weights and predictor weights to PowerInfer GGUF:

# make sure that you have done `pip install -r requirements.txt`
python convert.py --outfile /PATH/TO/POWERINFER/GGUF/REPO/MODELNAME.powerinfer.gguf /PATH/TO/ORIGINAL/MODEL /PATH/TO/PREDICTOR
# python convert.py --outfile ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.powerinfer.gguf ./SparseLLM/ReluLLaMA-70B ./PowerInfer/ReluLLaMA-70B-Predictor

For the same reason, we suggest keeping the same directory structure as PowerInfer GGUF repos after conversion.

Convert Original Models into Dense GGUF Models (compatible with llama.cpp)

python convert-dense.py --outfile /PATH/TO/DENSE/GGUF/REPO/MODELNAME.gguf /PATH/TO/ORIGINAL/MODEL
# python convert-dense.py --outfile ./Bamboo-DPO-v0.1-gguf/bamboo-7b-dpo-v0.1.gguf --outtype f16 ./Bamboo-DPO-v0.1

Please note that the generated dense GGUF models might not work properly with llama.cpp, as we have altered the activation functions (for ReluLLaMA and ProSparse models) or the model architecture (for Bamboo models). They can, however, be used with PowerInfer in dense inference mode.

Inference

For CPU-GPU hybrid inference using all available VRAM, run:

./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# e.g.: ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
# For Windows: .\build\bin\Release\main.exe -m .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"

If you want to limit the GPU's VRAM usage:

./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
# e.g.: ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
# For Windows: .\build\bin\Release\main.exe -m .\ReluLLaMA-7B-PowerInfer-GGUF\llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8

Under CPU-GPU hybrid inference, PowerInfer automatically offloads all dense activation blocks to the GPU, then splits the FFN and offloads as much of it as possible to the GPU.

Dense inference mode (limited support)

If you want to run PowerInfer on the dense variants of the PowerInfer model family, you can use it much as you would llama.cpp:

./build/bin/main -m /PATH/TO/DENSE/MODEL -n $output_token_count -t $thread_num -p $prompt -ngl $num_gpu_layers
# e.g.: ./build/bin/main -m ./Bamboo-base-v0.1-gguf/bamboo-7b-v0.1.gguf -n 128 -t 8 -p "Once upon a time" -ngl 12

The same holds for the other examples/, such as server and batched_generation. Please note that dense inference mode is not a compatibility mode for arbitrary models: we have altered the activation functions (for ReluLLaMA and ProSparse models) in this mode to match our model family.

Serving, Perplexity Evaluation, and more applications

PowerInfer supports serving and batched generation with the same instructions as llama.cpp. In general, you can use the same commands as llama.cpp, except that the -ngl argument is replaced by --vram-budget in PowerInfer. Please refer to the detailed instructions in each examples/ directory.
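For instance, the server example can be launched with a VRAM cap instead of a layer count (a sketch assuming the server example keeps llama.cpp's usual interface apart from the --vram-budget flag; the model path, host, and port here are illustrative):

```shell
# Start the HTTP server with an 8 GiB VRAM budget instead of -ngl
./build/bin/server -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
    -t 8 --vram-budget 8 --host 127.0.0.1 --port 8080
```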

Quantization

PowerInfer has optimized quantization support for INT4 (Q4_0) models. You can use the following instructions to quantize a PowerInfer GGUF model:

./build/bin/quantize /PATH/TO/MODEL /PATH/TO/OUTPUT/QUANTIZED/MODEL Q4_0
# e.g.: ./build/bin/quantize ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf Q4_0
# For Windows: .\build\bin\Release\quantize.exe .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.powerinfer.gguf .\ReluFalcon-40B-PowerInfer-GGUF\falcon-40b-relu.q4.powerinfer.gguf Q4_0

Then you can use the quantized model for inference with the same instructions as above.

FAQs

  1. What should I do if I encounter CUDA_ERROR_OUT_OF_MEMORY?

    • You can try running with the --reset-gpu-index argument to rebuild the GPU index for the model and avoid any stale cache.
    • Due to our current implementation, model offloading might not be as accurate as expected. You can try --vram-budget with a slightly lower value.
  2. Why is there a noticeable downgrade in the performance metrics of our current ReLU model, particularly the 70B model?

    • In contrast to the typical requirement of around 2T tokens for LLM training, our model's fine-tuning was conducted with only 5B tokens. This insufficient retraining has resulted in the model's inability to regain its original performance. We are actively working on updating to a more capable model, so please stay tuned.
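The two out-of-memory workarounds from FAQ 1 can be combined in a single invocation (the model path and budget value here are illustrative):

```shell
# Rebuild the GPU index and lower the VRAM budget below the previous setting
./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
    -n 128 -t 8 -p "Once upon a time" --reset-gpu-index --vram-budget 6
```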
