
PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction

This is the replication repository for the paper PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction (ICSE'26).

In this work, we propose PreServe, a tailored LMaaS (Large Model as a Service) management framework based on hierarchical prediction.

Figure: Overview of the PreServe framework.

Repository Organization

├── LLMServe
│   ├── global_scheduler
│   │   ├── workload_predictor
│   │   │   └── predictor.py
│   │   ├── load_predictor
│   │   │   ├── data_loader.py
│   │   │   ├── model.py
│   │   │   └── predictor.py
│   │   ├── scaler.py
│   │   └── scheduler.py
│   ├── request_generater
│   │   ├── generator.py
│   │   ├── load.py
│   │   └── workload.py
│   ├── serve_instance
│   │   ├── instance.py
│   │   └── lookahead.py
│   ├── benchmark.py
│   ├── config.py
│   ├── logger.py
│   ├── test_bench.py
│   ├── test_request_predictor.py
│   └── util.py
├── data
│   ├── workloads/...
│   ├── datasets/...
│   ├── download_datasets.sh
│   ├── download_workloads.sh
│   ├── preprocess_datasets.py
│   ├── preprocess_workloads.py
│   ├── run.sh
│   └── README.md
├── experiments
│   ├── motivation_study/...
│   ├── RQ1/...
│   ├── RQ2/...
│   ├── RQ3/...
│   ├── RQ4/...
│   ├── download_gdrive.py
│   └── Reproducibility.md
├── results
│   ├── cases/
│   └── result_analysis_metrics.py
├── scripts
│   ├── ...
│   ├── benchmark_RQ2.sh
│   └── benchmark_RQ3_7b.sh
├── assets/...
├── instance_configurations_4.json
├── instance_configurations.json
├── offline_train.py
├── setup.py
├── requirements.txt
├── .gitignore
└── README.md

Installation

First, create a Conda environment with Python>=3.10:

conda create -n LLMServe python=3.10 -y
conda activate LLMServe

Next, install dependencies:

# clone the repository
git clone https://github.com/OpsPAI/PreServe.git
cd PreServe
pip install -r requirements.txt

Finally, install the package locally:

pip install -e .
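
To verify the installation, a quick import check can be run. This assumes the package is importable as LLMServe, matching the top-level LLMServe/ directory in the repository layout:

# sanity check: the package name is assumed to follow the LLMServe/ directory
python -c "import LLMServe; print(LLMServe.__file__)"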

How to Use

1. Prepare Benchmark Datasets

Download the workload and load datasets.

cd ./data/
./run.sh

Example preprocessing:

# Authenticate with Hugging Face (replace {} with your token):
huggingface-cli login --add-to-git-credential --token hf_{} 

# Preprocess the load dataset: ShareGPT
python preprocess_datasets.py \
	--min_input_tokens 16 \
	--min_out_tokens 16 \
	--max_input_tokens 4096 \
	--max_out_tokens 4096 \
	--min_total_tokens 32 \
	--max_total_tokens 4096 \
	--tokenizer_name "meta-llama/Llama-2-7b-hf"

# Preprocess the workload datasets: Azure_code & Azure_conv
python preprocess_workloads.py
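
After preprocessing, it can help to sanity-check the cleaned ShareGPT file before benchmarking. The sketch below assumes the output lands at data/datasets/ShareGPT/cleaned.csv (the path used in the benchmark example later) and that per-request token counts are stored as columns; the column names here are illustrative guesses, not the script's actual schema:

# run from the ./data/ directory
import pandas as pd

df = pd.read_csv("datasets/ShareGPT/cleaned.csv")
print(df.shape, df.columns.tolist())

# If token-count columns exist (names are hypothetical), confirm the filtering bounds above.
for col in ("input_tokens", "output_tokens"):
    if col in df.columns:
        print(col, int(df[col].min()), int(df[col].max()))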

2. Train the Request Load Predictor

CUDA_VISIBLE_DEVICES=0 python offline_train.py --response_type 1 --use_prompt 1 --resample 1
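
offline_train.py fits the request load predictor used by the global scheduler (LLMServe/global_scheduler/load_predictor/). As a rough mental model only: the predictor maps an incoming request's prompt to an estimate of its serving load (e.g., expected output length) before the request is scheduled. The placeholder below is a hypothetical stand-in for illustration, not the interface of model.py or predictor.py:

from transformers import AutoTokenizer

# Hypothetical illustration of what a request load predictor provides:
# a per-request load estimate available at scheduling time. The trained
# model produced by offline_train.py replaces this crude heuristic.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def estimate_load(prompt: str) -> int:
    prompt_tokens = len(tokenizer.encode(prompt))
    # placeholder heuristic, capped at the max_model_len used elsewhere in this README
    return min(4096, 2 * prompt_tokens)

print(estimate_load("Explain hierarchical prediction in one paragraph."))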

3. Local Benchmark

Start the vLLM server

Example command:

CUDA_VISIBLE_DEVICES=0 vllm serve "meta-llama/Llama-2-7b-hf" --port 8000
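
Before running the benchmark, it helps to confirm the server is reachable. vllm serve exposes an OpenAI-compatible HTTP API, so a quick check against the local endpoint (port 8000 as above) can be done from the Python standard library:

import json
import urllib.request

# Query the OpenAI-compatible endpoint of the vLLM server started above.
with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as resp:
    print(json.loads(resp.read()))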

Run the benchmark

Example command:

cd ./LLMServe
python benchmark.py \
	--request_num "2000" \
	--model_name "meta-llama/Llama-2-7b-hf" \
	--result_dir "../results/cases/" \
	--load "ShareGPT" \
	--load_dataset_path "../data/datasets/ShareGPT/cleaned.csv" \
	--workload "poisson" \
	--qps "2" \
	--num_instances "1" \
	--scheduler_policy "preserve" \
	--scaler_policy "none" \
	--req_predictor_policy "load_predictor" \
	--max_model_len 4096 \
	--max_num_seqs 128 \
	--max_num_batched_tokens 8192 
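
Benchmark outputs are written under the directory passed via --result_dir (../results/cases/ above) and can be analyzed with results/result_analysis_metrics.py. As a minimal alternative, per-request results can be inspected directly; the glob pattern and column name below are hypothetical and should be adjusted to the files actually produced:

import glob
import pandas as pd

# Hypothetical sketch: summarize latency percentiles across benchmark output files.
for path in glob.glob("../results/cases/*.csv"):
    df = pd.read_csv(path)
    if "latency" in df.columns:
        print(path, df["latency"].quantile([0.5, 0.9, 0.99]).to_dict())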

Reproducibility

The instructions to reproduce the experimental results in our paper can be found in experiments/Reproducibility.md.

(For ease of use, we adopt fixed LLM instance tables here; these can also be flexibly integrated with frameworks such as Ray.)
