This is the repository of PreServe, accompanying the paper "PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction".
In this work, we propose PreServe, a tailored LMaaS (Large Model as a Service) management framework based on hierarchical prediction.
├── LLMServe
│ ├── global_scheduler
│ │ ├── workload_predictor
│ │ │ └── predictor.py
│ │ ├── load_predictor
│ │ │ ├── data_loader.py
│ │ │ ├── model.py
│ │ │ └── predictor.py
│ │ ├── scaler.py
│ │ └── scheduler.py
│ ├── request_generater
│ │ ├── generator.py
│ │ ├── load.py
│ │ └── workload.py
│ ├── serve_instance
│ │ ├── instance.py
│ │ └── lookahead.py
│ ├── benchmark.py
│ ├── config.py
│ ├── logger.py
│ ├── test_bench.py
│ ├── test_request_predictor.py
│ └── util.py
├── data
│ ├── workloads/...
│ ├── datasets/...
│ ├── download_datasets.sh
│ ├── download_workloads.sh
│ ├── preprocess_datasets.py
│ ├── preprocess_workloads.py
│ ├── run.sh
│ └── README.md
├── experiments
│ ├── motivation_study/...
│ ├── RQ1/...
│ ├── RQ2/...
│ ├── RQ3/...
│ ├── RQ4/...
│ ├── download_gdrive.py
│ └── Reproducibility.md
├── results
│ ├── cases/
│ └── result_analysis_metrics.py
├── scripts
│ ├── ...
│ ├── benchmark_RQ2.sh
│ └── benchmark_RQ3_7b.sh
├── assets/...
├── instance_configurations_4.json
├── instance_configurations.json
├── offline_train.py
├── setup.py
├── requirements.txt
├── .gitignore
└── README.md
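The two predictor packages under LLMServe/global_scheduler reflect the hierarchical design: workload_predictor operates on the aggregate request stream, while load_predictor estimates the cost of individual requests. The sketch below only illustrates how such a two-level scheme can be composed; every class and method name in it is hypothetical and does not mirror the repository's actual API.

```python
# Purely illustrative two-level (hierarchical) prediction loop.
# All class and method names here are hypothetical; see
# LLMServe/global_scheduler/ for the actual implementation.
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str


class WorkloadPredictor:
    """Forecasts the aggregate request arrival rate (workload level)."""

    def forecast_qps(self, qps_history: list[float]) -> float:
        window = qps_history[-5:]
        return sum(window) / max(len(window), 1)  # e.g., a moving average


class LoadPredictor:
    """Estimates the load of a single request (request level)."""

    def predict_output_tokens(self, request: Request) -> int:
        return 256  # stand-in for a learned regressor over the prompt


def step(requests: list[Request], qps_history: list[float]) -> None:
    expected_qps = WorkloadPredictor().forecast_qps(qps_history)  # informs scaling
    per_request = [LoadPredictor().predict_output_tokens(r) for r in requests]  # informs routing
    print(f"expected QPS ~ {expected_qps:.1f}, predicted output tokens: {per_request}")


step([Request("Hello"), Request("Summarize this document ...")], [1.0, 2.0, 2.5])
```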
First, create a Conda environment with Python>=3.10:
conda create -n LLMServe python=3.10 -y
conda activate LLMServe
Next, install dependencies:
# clone the repository
cd PreServe
pip install -r requirements.txt
Finally, install the package locally:
pip install -e .
Download the workload and load datasets:
cd ./data/
./run.sh
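run.sh should populate the data/datasets and data/workloads directories shown in the tree above. A minimal sanity check (run from inside ./data/):

```python
# Check that run.sh populated the dataset and workload directories
# (paths assume you are still inside ./data/, as above).
from pathlib import Path

for name in ("datasets", "workloads"):
    path = Path(name)
    entries = list(path.rglob("*")) if path.is_dir() else []
    status = "OK" if entries else "missing or empty"
    print(f"{name}: {status} ({len(entries)} entries)")
```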
Example preprocessing:
# Authenticate with Hugging Face (replace {} with your token):
huggingface-cli login --add-to-git-credential --token hf_{}
# Preprocess the load dataset: ShareGPT
python preprocess_datasets.py \
--min_input_tokens 16 \
--min_out_tokens 16 \
--max_input_tokens 4096 \
--max_out_tokens 4096 \
--min_total_tokens 32 \
--max_total_tokens 4096 \
--tokenizer_name "meta-llama/Llama-2-7b-hf"
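The token-count flags above determine which ShareGPT records are kept. The snippet below sketches the kind of filtering they imply; the input file path and the "prompt"/"response" column names are assumptions for illustration, not the internals of preprocess_datasets.py:

```python
# Illustrative token-length filtering matching the flags above.
# The raw-file path and the "prompt"/"response" column names are
# assumptions for illustration, not preprocess_datasets.py internals.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
df = pd.read_csv("datasets/ShareGPT/raw.csv")  # hypothetical input file


def num_tokens(text: str) -> int:
    return len(tokenizer(str(text)).input_ids)


in_len = df["prompt"].map(num_tokens)
out_len = df["response"].map(num_tokens)
total_len = in_len + out_len
kept = df[in_len.between(16, 4096) & out_len.between(16, 4096) & total_len.between(32, 4096)]
kept.to_csv("datasets/ShareGPT/cleaned_example.csv", index=False)
```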
# Preprocess the workload datasets: Azure_code & Azure_conv
python preprocess_workloads.py
Run offline training:
CUDA_VISIBLE_DEVICES=0 python offline_train.py --response_type 1 --use_prompt 1 --resample 1
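Conceptually, offline training fits the request-level load predictor on historical prompt/response data (hence the --use_prompt and --response_type flags; --resample presumably rebalances the training data). The toy baseline below only conveys that idea and is not the model implemented in offline_train.py:

```python
# Purely illustrative baseline for request-level load prediction, NOT the
# model trained by offline_train.py: bucket requests by prompt length and
# memorize the mean observed output length per bucket.
from statistics import mean


def fit_buckets(prompt_lens: list[int], output_lens: list[int], bucket: int = 256) -> dict[int, float]:
    buckets: dict[int, list[int]] = {}
    for p, o in zip(prompt_lens, output_lens):
        buckets.setdefault(p // bucket, []).append(o)
    return {k: float(mean(v)) for k, v in buckets.items()}


def predict(model: dict[int, float], prompt_len: int, bucket: int = 256, default: float = 256.0) -> float:
    return model.get(prompt_len // bucket, default)


model = fit_buckets([120, 300, 310, 900], [64, 200, 180, 512])
print(predict(model, 305))  # -> 190.0 (mean of the 256-511 bucket)
```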
Start the vLLM server.
Example command:
CUDA_VISIBLE_DEVICES=0 vllm serve "meta-llama/Llama-2-7b-hf" --port 8000
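Once the server is up, you can verify that it responds before launching the benchmark. A minimal check using only the Python standard library (assumes the default host and the --port 8000 used above):

```python
# Quick check that the vLLM OpenAI-compatible server above is reachable
# (assumes the default host and the --port 8000 used in the command).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)
print("served models:", [m["id"] for m in payload["data"]])
```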
Run the benchmark.
Example command:
cd ./LLMServe
python benchmark.py \
--request_num "2000" \
--model_name "meta-llama/Llama-2-7b-hf" \
--result_dir "../results/cases/" \
--load "ShareGPT" \
--load_dataset_path "../data/datasets/ShareGPT/cleaned.csv" \
--workload "poisson" \
--qps "2" \
--num_instances "1" \
--scheduler_policy "preserve" \
--scaler_policy "none" \
--req_predictor_policy "load_predictor" \
--max_model_len 4096 \
--max_num_seqs 128 \
--max_num_batched_tokens 8192
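When a run finishes, per-request measurements should appear under the directory passed via --result_dir (results/cases/ above) and can be analyzed with results/result_analysis_metrics.py. As a rough illustration of the kind of post-processing involved, the snippet below computes latency percentiles from a per-request CSV; the file name and the "latency" column are hypothetical:

```python
# Rough post-processing example; the file name and the "latency" column
# are hypothetical -- see results/result_analysis_metrics.py for the
# metrics actually reported in the paper.
import pandas as pd

df = pd.read_csv("../results/cases/example_run.csv")  # hypothetical output file
for p in (50, 90, 99):
    print(f"P{p} latency: {df['latency'].quantile(p / 100):.3f} s")
print(f"mean latency: {df['latency'].mean():.3f} s")
```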
The instructions to reproduce the experiment results in our paper can be found in experiments/Reproducibility.md.
(For ease of use, we adopt fixed LLM instance tables here; they can also be flexibly integrated with frameworks such as Ray.)
