
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

This repository is the official implementation of FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Brief Introduction

In this work, we propose FlowSpec, a continuous pipeline-parallel tree-based speculative decoding framework that reduces distributed inference latency when requests are sparse. The framework pairs a lightweight draft model for token generation with the base LLM for pipeline-parallel verification, so a single forward pass can output multiple tokens, which mitigates the under-utilization caused by sparse requests.
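To make the draft-and-verify idea concrete, here is a toy sketch of the loop that speculative decoding builds on. This is NOT FlowSpec's actual code; both "models" below are invented stand-ins. The cheap draft model proposes several tokens and the base model verifies them, so one base-model step can emit multiple tokens.

```python
# Toy sketch of the draft-then-verify loop behind speculative decoding
# (illustrative only; not the repository's implementation).

def draft_propose(prefix, k):
    """Toy draft model: correct for two tokens, then drifts off."""
    last = prefix[-1]
    return [last + 1, last + 2, last + 5, last + 6][:k]

def base_next_token(prefix):
    """Toy base model: the 'true' next token is always last + 1."""
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Accept the longest draft prefix the base model agrees with,
    then append one base-model token as the correction."""
    cur = list(prefix)
    for tok in draft_propose(prefix, k):
        if tok != base_next_token(cur):   # verification; batched into a
            break                         # single forward pass in practice
        cur.append(tok)
    cur.append(base_next_token(cur))      # correction / bonus token
    return cur

print(speculative_step([0], k=4))  # [0, 1, 2, 3]: three tokens in one step
```

FlowSpec extends this single-device loop with tree-shaped drafts and pipeline-parallel verification across devices, which the toy version deliberately omits.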

Workflow

Requirements

Basic information

JetPack: 5.1.2
CUDA: 11.4
Python: 3.8

Virtual Environment Setup

python -m venv flowspec

source flowspec/bin/activate

pip install -r requirements.txt

# Or for Jetson
pip install -r requirements_jetson.txt

Conda Environment Setup (Alternative)

conda create -n flowspec python=3.8

conda activate flowspec

pip install -r requirements.txt

# Or for Jetson
pip install -r requirements_jetson.txt

Install Torch And Bitsandbytes (Only for Jetson)

# Download the torch-1.11.0-cp38-cp38-linux_aarch64.whl from online resources
wget https://nvidia.box.com/shared/static/ssf2v7pf5i245fk4i0q926hy4imzs2ph.whl -O torch-1.11.0-cp38-cp38-linux_aarch64.whl

pip install torch-1.11.0-cp38-cp38-linux_aarch64.whl

# Install bitsandbytes for Jetson (version 0.41.2)
git clone https://github.com/to-aoki/bitsandbytes.git

cd bitsandbytes

# Make sure the paths for nvcc, CUDA, and the CUDA-enabled torch build are configured correctly

CUDA_VERSION=114 make cuda11x

python setup.py install

Quick Start

First, generate the split models and configs on the server

Split model: split_and_save_models.py

  • set the proper model path to load the model; set the target path to save the state_dict of each split stage model
  • set the number of stages and layers
  • run python split_and_save_models.py
  • (Only for distributed tests) send the state_dicts of the split models and the weights of the draft models to the devices
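The steps above can be sketched as follows. This is an illustrative outline only, NOT the repository's actual split_and_save_models.py: partition the transformer layers into contiguous stages, then keep, per stage, only that stage's slice of the state_dict.

```python
# Hypothetical sketch of the model-splitting step (names invented).

def split_layers(num_layers, num_stages):
    """Partition layer indices 0..num_layers-1 into contiguous stages,
    giving earlier stages one extra layer when the split is uneven."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        n = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + n)))
        start += n
    return stages

def stage_state_dict(full_sd, layer_ids, prefix="model.layers."):
    """Keep only the weights whose layer index belongs to this stage.
    (Embeddings / lm_head would go to the first / last stage in practice.)"""
    keep = {}
    for name, w in full_sd.items():
        if name.startswith(prefix):
            idx = int(name[len(prefix):].split(".")[0])
            if idx in layer_ids:
                keep[name] = w
    return keep

stages = split_layers(32, 4)
print(stages[0])  # layers 0-7 go to the first stage
```

Each per-stage dictionary would then be saved (e.g. with torch.save) and shipped to its device for the distributed test.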

Then set configurations in config/run_config.py

  • model name, model paths, running methods...
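For illustration only, a run configuration conceptually covers settings like the ones below. The field names here are invented and will NOT match the real config/run_config.py; open that file for the actual options it expects.

```python
# Hypothetical example of the kinds of settings a run needs (names invented).
run_config = {
    "model_name": "vicuna-7b",                  # which base model to serve
    "base_model_dir": "/path/to/split/stages",  # split state_dicts from above
    "draft_model_dir": "/path/to/eagle/draft",  # EAGLE draft weights
    "num_stages": 4,                            # pipeline stages (one per device)
    "pipeline_type": "pipeline",                # or "tp" for tensor-parallel runs
    "quant": None,                              # quantization method, if any
    "temperature": 0,                           # 0 = greedy sampling
    "max_new_tokens": 128,                      # generation length cap
}
```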

After finishing all the steps above, run run_pipe.sh (local test with multiple processes) or run_jetson.sh (distributed test across multiple machines)

# split models and save
python split_and_save_models.py

# set configurations for running
sudo nano config/run_config.py

# run
bash run_pipe.sh
# or
bash run_jetson.sh

Evaluation

To start a large-scale evaluation of the 7B model, run run_pipe_eval.sh for the local scenario or scripts/run_jetson_eval.sh for the distributed scenario. Set the model configuration via config/run_config.py.

Set quant in run_config.py to choose the quantization method, if needed.

Model evaluation

# run
bash run_pipe_eval.sh
# or
PYTHONPATH=. bash scripts/run_jetson_eval.sh

TP

For TP evaluation, refer to the tp directory, whose structure is similar to the main directory. Set the pipeline type to "tp" in the configuration beforehand.

# run
PYTHONPATH=. bash tp/run_tp.sh
# eval
PYTHONPATH=. bash tp/run_tp_eval.sh
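As background on what tensor parallelism (TP) means here, the sketch below shows the generic idea, not the tp directory's actual code: a linear layer's weight matrix is split along its output dimension across devices, each device computes its own output shard, and the shards are concatenated (an all-gather in a real multi-device system).

```python
# Generic tensor-parallelism sketch with plain Python lists (illustrative only).

def matvec(W, x):
    """Dense y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sharded_matvec(W, x, num_devices):
    """Each 'device' holds a contiguous slice of W's rows (output features)."""
    shard = len(W) // num_devices            # assume an even split
    out = []
    for d in range(num_devices):
        W_d = W[d * shard:(d + 1) * shard]   # this device's weight shard
        out.extend(matvec(W_d, x))           # concatenation of the shards
    return out

W = [[1, 0], [0, 1], [2, 2], [3, 1]]
x = [1, 2]
assert sharded_matvec(W, x, 2) == matvec(W, x)  # same result as unsharded
```

Unlike pipeline parallelism, which splits the model by layers, TP splits each layer's weights, so every device participates in every forward pass.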

Pre-trained Models

We use the draft model weights provided by EAGLE for evaluation.

Results

The extended results compare FlowSpec with the baselines across 6 datasets under two sampling settings (Temperature = 0 or 1, where 0 means greedy sampling). We select 20 samples per dataset and limit generated sequences to 128 tokens. L2 and V are short for LLaMA2-Chat and Vicuna-v1.3, respectively; 7B and 13B denote the parameter counts of the respective models.

Main Result

Acknowledgement

The implementation of FlowSpec reuses the code from EAGLE and refers to OPT-Tree and Jupiter.
