This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
To provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.
The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.
Pieces of information identifying a particular cluster. This information includes, but is not limited to, the GPU model, the LLM model, and llm-d parameters (an environment file and, optionally, a values.yaml file for llm-d-deployer)
Load Generator (Python code), written using the software facilities available at https://github.com/fmperf-project/fmperf.
Note
This will be expanded with additional load generators in the future.
FMPerf workload specification, with a load profile (e.g., share-gpt vs. long-input) and load levels (e.g., QPS values). IMPORTANT: these definitions will be expanded with specifications for other load generators.
Important
The triple <scenario>,<harness>,<workload>, combined with the standup/teardown capabilities provided by llm-d-deployer (https://github.com/llm-d/llm-d-deployer), should provide enough information to allow an experiment to be reproduced.
- llm-d-deployer (https://github.com/llm-d/llm-d-deployer)
- fm-perf: https://github.com/fmperf-project/fmperf
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
export LLMDBENCH_CLUSTER_HOST="https://api.fmaas-platform-eval.fmaas.res.ibm.com"
export LLMDBENCH_CLUSTER_TOKEN="..."
Tip
You can simply use your current context. After running kubectl/oc login, just set export LLMDBENCH_CLUSTER_HOST=auto (and leave LLMDBENCH_CLUSTER_TOKEN unconfigured)
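For example (a minimal sketch, assuming you have already logged in with oc or kubectl):
# reuse the already-authenticated context instead of passing an explicit host/token
export LLMDBENCH_CLUSTER_HOST=auto
unset LLMDBENCH_CLUSTER_TOKEN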
Important
No matter which method is used (i.e., fully specifying LLMDBENCH_CLUSTER_HOST and LLMDBENCH_CLUSTER_TOKEN, or simply using the current context), there is an additional variable that will always require definition: LLMDBENCH_HF_TOKEN
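For example (the value shown is a placeholder for your own Hugging Face token):
export LLMDBENCH_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx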
Caution
Please make sure the environment variable LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS points to a storage class specific to your cluster. The default value will most likely fail.
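For example, you can list the storage classes available on your cluster and then point the variable at one of them (the class name below is purely illustrative):
kubectl get storageclass
export LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS=ocs-storagecluster-ceph-rbd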
A complete list of available variables (and their default values) can be found by running:
cat setup/env.sh | grep "^export LLMDBENCH_" | sort
Note
The namespaces specified by the environment variables LLMDBENCH_VLLM_COMMON_NAMESPACE and LLMDBENCH_FMPERF_SERVICE_ACCOUNT will be automatically created.
Tip
If you want all generated yaml files and all data collected to reside on the same directory, set the environment variable LLMDBENCH_CONTROL_WORK_DIR explicitly before starting execution.
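For example (the path is illustrative):
export LLMDBENCH_CONTROL_WORK_DIR=$HOME/llm-d-benchmark-runs/experiment1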
Run the command with the -h option to produce a list of steps:
./setup/standup.sh -h
Note
Each individual "step file" is named in a way that briefly describes each of the multiple steps required for a full deployment.
Tip
Steps 0-5 can be considered "preparation" and can be skipped in most deployments.
./setup/standup.sh -n
vLLM instances can be deployed by one of the following methods:
- "standalone" (a simple deployment with services associated to the deployment)
- "deployer" (invoking "llm-d-deployer").
This is controlled by the environment variable LLMDBENCH_DEPLOY_METHODS (default: "deployer"). The value of this environment variable can be overridden by the parameter -t/--types (applicable to both teardown.sh and standup.sh)
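For example, to stand up (and later tear down) a standalone deployment instead of the default (a sketch using the -t option described above):
./setup/standup.sh -t standalone
./setup/teardown.sh -t standalone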
Warning
At this time, only one simultaneous deployment method is supported
All available models are listed and controlled by the variable LLMDBENCH_DEPLOY_MODEL_LIST. The value of this environment variable can be overridden by the parameter -m/--model (applicable to both teardown.sh and standup.sh).
Warning
At this time, only one simultaneous model is supported
Tip
The following aliases can be used in place of the full model name, for convenience (llama-3b -> meta-llama/Llama-3.2-3B-Instruct, llama-8b -> meta-llama/Llama-3.1-8B-Instruct, llama-70b -> meta-llama/Llama-3.1-70B-Instruct, llama-17b -> RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic)
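For example, to deploy one of the aliased models (a sketch using the -m option described above):
./setup/standup.sh -m llama-8b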
All variables relevant to a particular experiment are stored in a "scenario" (an aptly named folder).
The expectation is that an experiment is run by initially executing:
source scenario/<scenario name>
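A scenario is essentially a collection of exported variables. A minimal sketch, using only variables documented above and with purely illustrative values (the actual scenarios shipped with the repository are more complete), might look like:
export LLMDBENCH_DEPLOY_METHODS=deployer
export LLMDBENCH_DEPLOY_MODEL_LIST=llama-3b
export LLMDBENCH_VLLM_COMMON_NAMESPACE=llm-d-benchmark
export LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS=<your storage class>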
At this point, with all the environment variables set (tip: env | grep ^LLMDBENCH_ | sort), you should be ready to deploy and test:
./setup/standup.sh
Note
The scenario can also be indicated as part of the command-line options for standup.sh (e.g., ./setup/standup.sh -c ocp_H100MIG_deployer_llama-3b)
To re-execute only individual steps (by full name or by number):
./setup/standup.sh --step 08_smoketest.sh
./setup/standup.sh -s 7
./setup/standup.sh -s 3-5
./setup/standup.sh -s 5,7
Once llm-d is fully deployed, an experiment can be run
./run.sh
Important
This command will run an experiment, collect data, and perform an initial analysis (generating statistics and plots). One can go straight to the analysis by adding the option -z/--skip to the above command.
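For example, to skip the experiment and go straight to the analysis of previously collected data:
./run.sh -z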
Note
The scenario can also be indicated as part of the command-line options for run.sh (e.g., ./run.sh -c ocp_L40_standalone_llama-8b)
Finally, clean up everything:
./setup/teardown.sh
Note
The scenario can also be indicated as part of the command-line options for teardown.sh (e.g., ./teardown.sh -c kubernetes_H200_deployer_llama-8b)
All the information collected inside the directory pointed to by the environment variable LLMDBENCH_CONTROL_WORK_DIR should be enough to allow others to reproduce the experiment with the same parameters. In particular, all the parameters - always exposed as environment variables - applied to the llm-d or vLLM stacks can be found at ${LLMDBENCH_CONTROL_WORK_DIR}/environment/variables
A sample of the contents of ${LLMDBENCH_CONTROL_WORK_DIR} for a very simple experiment is shown here:
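For example, to review the exact parameters that were applied:
cat ${LLMDBENCH_CONTROL_WORK_DIR}/environment/variables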
./analysis
./analysis/data
./analysis/data/stats.txt
./analysis/plots
./analysis/plots/latency_analysis.png
./analysis/plots/README.md
./analysis/plots/throughput_analysis.png
./setup
./setup/yamls
./setup/yamls/05_pvc_workload-pvc.yaml
./setup/yamls/pod_benchmark-launcher.yaml
./setup/yamls/05_b_service_access_to_fmperf_data.yaml
./setup/yamls/07_deployer_values.yaml
./setup/yamls/05_namespace_sa_rbac_secret.yaml
./setup/yamls/04_prepare_namespace_llama-3b.yaml
./setup/yamls/05_a_pod_access_to_fmperf_data.yaml
./setup/yamls/03_cluster-monitoring-config_configmap.yaml
./setup/commands
./setup/commands/1748350741979704000_command.log
...
./setup/commands/1748350166902915000_command.log
./setup/sed-commands
./results
./results/LMBench_short_input_qps0.5.csv
./results/pod_log_response.txt
./environment
./environment/context.ctx
./environment/variables
./workload
./workload/harnesses
./workload/profiles
./workload/profiles/sanity_short-input.yaml
For a simplified workflow that includes automated analysis of benchmark results, check out the quickstart-k8s launcher. This workflow provides:
- Easy deployment and execution of benchmarks on Kubernetes
- Support for comparing multiple LLM models
- Generation of comprehensive performance visualizations
- Single Model Benchmark: Run benchmarks for a single model with automated analysis
  - See Single Model Quickstart for details
- Multi-Model Comparison: Compare performance across multiple LLM models
  - See Multi-Model Comparison Quickstart for details
To get started, navigate to the quickstart-k8s directory and follow the instructions in the respective README files.