This repository is the official implementation of "FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference".
In this work, we propose FlowSpec, a continuous pipeline-parallel, tree-based speculative decoding framework for distributed inference that reduces inference latency under sparse requests. The framework couples a lightweight draft model for token generation with the base LLM for pipeline-parallel verification, which allows multiple tokens to be emitted in a single forward pass and hence mitigates the sparse-request issue.
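For intuition, the draft-then-verify loop underlying speculative decoding can be sketched as follows. This is a toy illustration only: simple next-token functions stand in for the draft and base models, and FlowSpec's token tree and pipeline-parallel verification are omitted.

```python
def base_next(prefix):
    # Toy "base LLM": deterministically emits last token + 1 (mod 10).
    return (prefix[-1] + 1) % 10

def draft_next(prefix):
    # Toy draft model: agrees with the base except when the last token is 5.
    return 0 if prefix[-1] == 5 else base_next(prefix)

def speculative_step(prefix, k=4):
    """One draft-then-verify round: draft k tokens, verify greedily with the
    base model. At least one token is always accepted, since the base model's
    correction (or a bonus token) is kept."""
    # 1. Draft k tokens autoregressively with the cheap model.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. Verify: the base model checks every drafted position. In a real
    # system this is a single batched/pipelined forward pass, not a loop.
    accepted = []
    ctx = list(prefix)
    for t in drafted:
        expected = base_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # keep the base model's correction
            break
    else:
        accepted.append(base_next(ctx))  # bonus token when all drafts pass
    return accepted

tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens, k=4))
print(tokens)  # identical to plain greedy decoding with the base model
```

When the draft model agrees with the base, several tokens are accepted per round; a mismatch costs nothing beyond falling back to the base model's own token, which is why the output always matches plain greedy decoding.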
jetpack: 5.1.2
cuda: 11.4
python: 3.8
python -m venv ~/venv/flowspec
source ~/venv/flowspec/bin/activate
pip install -r requirements.txt
# Or for Jetson
pip install -r requirements_jetson.txt

Or Conda Environment Setup
conda create -n flowspec python=3.8
conda activate flowspec
pip install -r requirements.txt
# Or for Jetson
pip install -r requirements_jetson.txt

Install Torch And Bitsandbytes (Only for Jetson)
# Download the torch-1.11.0-cp38-cp38-linux_aarch64.whl from online resources
wget https://nvidia.box.com/shared/static/ssf2v7pf5i245fk4i0q926hy4imzs2ph.whl
pip install torch-1.11.0-cp38-cp38-linux_aarch64.whl
# Install bitsandbytes for Jetson (version 0.41.2)
git clone https://github.com/to-aoki/bitsandbytes.git
cd bitsandbytes
# Make sure the paths of nvcc, CUDA, and the CUDA-enabled torch build are configured correctly
CUDA_VERSION=114 make cuda11x
python setup.py install

First, get the split models and configs on the server
Split model: split_and_save_models.py
- set the model path to load the model, and the target path to save the state_dict of each split stage
- set the number of stages and layers
- run:
python split_and_save_models.py
- (Only for distributed tests) send the state_dicts of the split models and the weights of the draft models to the devices
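Conceptually, this splitting step partitions the decoder layers across pipeline stages and saves a per-stage state_dict. A rough sketch under assumed names follows; `split_layers`, `stage_state_dict`, and the `model.layers.` key prefix are illustrative, not the script's actual interface.

```python
def split_layers(num_layers, num_stages):
    """Assign layer indices to stages as evenly as possible
    (earlier stages absorb the remainder)."""
    base, rem = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        n = base + (1 if s < rem else 0)
        stages.append(list(range(start, start + n)))
        start += n
    return stages

def stage_state_dict(full_sd, layer_ids, prefix="model.layers."):
    """Keep only the weights whose layer index belongs to this stage.
    The trailing dot in the prefix avoids matching layer 10 when
    selecting layer 1."""
    wanted = {f"{prefix}{i}." for i in layer_ids}
    return {k: v for k, v in full_sd.items()
            if any(k.startswith(w) for w in wanted)}

# e.g. a 32-layer 7B model over 4 devices: four stages of 8 layers each
print(split_layers(32, 4))
```

Each per-stage dictionary would then be saved (e.g. with `torch.save`) to the target path and shipped to the corresponding device for the distributed test.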
Then set the configurations in config/run_config.py
- model name, model paths, running methods, ...
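As a purely illustrative sketch (the field names below are hypothetical; consult the actual file for the real ones), run_config.py collects settings along these lines:

```python
# Hypothetical example of the kind of settings config/run_config.py holds.
# Field names and values here are placeholders, not the file's real interface.

model_name = "vicuna-7b"                    # which base model to run
base_model_dir = "/path/to/split_stages"    # state_dicts saved by split_and_save_models.py
draft_model_dir = "/path/to/eagle_draft"    # EAGLE draft model weights
num_stages = 4                              # must match the number of split stages
pipeline_type = "pipe"                      # "pipe" for pipeline parallel, "tp" for tensor parallel
quant = None                                # quantization method, if needed
```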
After finishing all the steps above, run run_pipe.sh (local test with multiple processes) or run_jetson.sh (distributed test across multiple machines)
# split models and save
python split_and_save_models.py
# set configurations for running
nano config/run_config.py
# run
bash run_pipe.sh
# or
bash run_jetson.sh

To start a large-scale evaluation, run run_pipe_eval.sh or scripts/run_jetson_eval.sh for the 7B model in the local and distributed scenarios, respectively. Set the model configuration via config/run_config.py.
Set quant in run_config.py to choose the quantization method, if needed.
Model evaluation
# run
bash run_eval.sh
# or
PYTHONPATH=. bash eval/run_jetson_eval.sh

For TP evaluation, refer to the tp directory, whose structure is similar to the main directory. Set the pipeline type to "tp" beforehand.
# run
PYTHONPATH=. bash tp/run_tp.sh
# eval
PYTHONPATH=. bash tp/run_tp_eval.sh

We use the draft model weights provided by EAGLE for evaluation.
Extended results: a performance comparison between FlowSpec and the baselines across 6 datasets under two sampling settings (temperature = 0 or 1, where 0 means greedy sampling). We select 20 samples from each dataset and limit the length of the generated sequences to 128. L2 and V are short for LLaMA2-Chat and Vicuna-v1.3, respectively; 7B and 13B denote the number of parameters of the respective models.
The implementation of FlowSpec reuses the code from EAGLE and refers to OPT-Tree and Jupiter.

