This repository is the official implementation of "FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference".
In this work, we propose FlowSpec, a continuous pipeline-parallel, tree-based speculative decoding framework for distributed inference that reduces inference latency under sparse requests. The framework couples a lightweight draft model for token generation with the base LLM for pipeline-parallel verification, which allows multiple tokens to be emitted in a single forward pass and hence mitigates the sparse-request issue.
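For intuition, the draft-then-verify loop underlying speculative decoding can be sketched as follows. This is a toy illustration only: simple next-token functions stand in for the draft and base models, and FlowSpec's token tree and pipeline-parallel verification are omitted.

```python
def base_next(prefix):
    # Toy "base LLM": deterministically emits last token + 1 (mod 10).
    return (prefix[-1] + 1) % 10

def draft_next(prefix):
    # Toy draft model: agrees with the base except when the last token is 5.
    return 0 if prefix[-1] == 5 else base_next(prefix)

def speculative_step(prefix, k=4):
    """One draft-then-verify round: draft k tokens, verify greedily with the
    base model. At least one token is always accepted, since the base model's
    correction (or a bonus token) is kept."""
    # 1. Draft k tokens autoregressively with the cheap model.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. Verify: the base model checks every drafted position. In a real
    # system this is a single batched/pipelined forward pass, not a loop.
    accepted = []
    ctx = list(prefix)
    for t in drafted:
        expected = base_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # keep the base model's correction
            break
    else:
        accepted.append(base_next(ctx))  # bonus token when all drafts pass
    return accepted

tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens, k=4))
print(tokens)  # identical to plain greedy decoding with the base model
```

When the draft model agrees with the base, several tokens are accepted per round; a mismatch costs nothing beyond falling back to the base model's own token, which is why the output always matches plain greedy decoding.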
jetpack: 5.1.2
cuda: 11.4
python: 3.8
python -m venv ~/venv/flowspec
source ~/venv/flowspec/bin/activate
pip install -r requirements.txt
# Or for Jetson
pip install -r requirements_jetson.txt

Or Conda Environment Setup
conda create -n flowspec python=3.8
conda activate flowspec
pip install -r requirements.txt
# Or for Jetson
pip install -r requirements_jetson.txt

Install Torch And Bitsandbytes (Only for Jetson)
# Download the torch-1.11.0-cp38-cp38-linux_aarch64.whl from online resources
wget https://nvidia.box.com/shared/static/ssf2v7pf5i245fk4i0q926hy4imzs2ph.whl
pip install torch-1.11.0-cp38-cp38-linux_aarch64.whl
# Install bitsandbytes for Jetson (version 0.41.2)
git clone https://github.com/to-aoki/bitsandbytes.git
cd bitsandbytes
# Make sure the paths of nvcc, CUDA, and the CUDA-enabled torch build are configured correctly
CUDA_VERSION=114 make cuda11x
python setup.py install

First, get the split models and configs on the server
Split model: split_and_save_models.py
- set the model path to load the model, and the target path to save the state_dict of each split stage
- set the number of stages and layers
- run:
python split_and_save_models.py
- (Only for distributed tests) send the state_dicts of the split models and the weights of the draft models to the devices
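Conceptually, this splitting step partitions the decoder layers across pipeline stages and saves a per-stage state_dict. A rough sketch under assumed names follows; `split_layers`, `stage_state_dict`, and the `model.layers.` key prefix are illustrative, not the script's actual interface.

```python
def split_layers(num_layers, num_stages):
    """Assign layer indices to stages as evenly as possible
    (earlier stages absorb the remainder)."""
    base, rem = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        n = base + (1 if s < rem else 0)
        stages.append(list(range(start, start + n)))
        start += n
    return stages

def stage_state_dict(full_sd, layer_ids, prefix="model.layers."):
    """Keep only the weights whose layer index belongs to this stage.
    The trailing dot in the prefix avoids matching layer 10 when
    selecting layer 1."""
    wanted = {f"{prefix}{i}." for i in layer_ids}
    return {k: v for k, v in full_sd.items()
            if any(k.startswith(w) for w in wanted)}

# e.g. a 32-layer 7B model over 4 devices: four stages of 8 layers each
print(split_layers(32, 4))
```

Each per-stage dictionary would then be saved (e.g. with `torch.save`) to the target path and shipped to the corresponding device for the distributed test.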
Then set the configurations in config/run_config.py
- model name, model paths, running methods, ...
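As a purely illustrative sketch (the field names below are hypothetical; consult the actual file for the real ones), run_config.py collects settings along these lines:

```python
# Hypothetical example of the kind of settings config/run_config.py holds.
# Field names and values here are placeholders, not the file's real interface.

model_name = "vicuna-7b"                    # which base model to run
base_model_dir = "/path/to/split_stages"    # state_dicts saved by split_and_save_models.py
draft_model_dir = "/path/to/eagle_draft"    # EAGLE draft model weights
num_stages = 4                              # must match the number of split stages
pipeline_type = "pipe"                      # "pipe" for pipeline parallel, "tp" for tensor parallel
quant = None                                # quantization method, if needed
```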
After finishing all the steps above, run run_pipe.sh (local test with multiple processes) or run_jetson.sh (distributed test across multiple machines)
# split models and save
python split_and_save_models.py
# set configurations for running
nano config/run_config.py
# run
bash run_pipe.sh
# or
bash run_jetson.sh

To start a large-scale evaluation, run run_pipe_eval.sh or scripts/run_jetson_eval.sh for the 7B model in the local and distributed scenarios, respectively. Set the model configuration via config/run_config.py.
Set quant in run_config.py to choose the quantization method, if needed.
Model evaluation
# run
bash run_eval.sh
# or
PYTHONPATH=. bash eval/run_jetson_eval.sh

For TP evaluation, refer to the tp directory, whose structure is similar to the main directory. Set the pipeline type to "tp" beforehand.
# run
PYTHONPATH=. bash tp/run_tp.sh
# eval
PYTHONPATH=. bash tp/run_tp_eval.sh

We use the draft model weights provided by EAGLE for evaluation.
Extended results: a performance comparison between FlowSpec and the baselines across 6 datasets under two sampling settings (temperature = 0 or 1, where 0 means greedy sampling). We select 20 samples from each dataset and limit the length of the generated sequences to 128. L2 and V are short for LLaMA2-Chat and Vicuna-v1.3, respectively; 7B and 13B denote the number of parameters of the respective models.
The implementation of FlowSpec reuses the code from EAGLE and refers to OPT-Tree and Jupiter.

