A reproducible benchmarking framework for evaluating LLM inference across vLLM, SGLang, and TensorRT-LLM backends. It supports configurable workloads, concurrency sweeps, and result visualization.
Setup:

```bash
conda activate bench && cd inference-bench/

# Install dependencies
sh scripts/install_deps.sh

# Set API key if needed
export OPENAI_API_KEY=token-abc123
```

Serve the model with vLLM:

```bash
vllm serve --config configs/llama/vllm/serve/baseline.yaml
```
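Before benchmarking, it helps to confirm the server is actually up. A minimal check, assuming the server listens on `localhost:8000` as in the curl example further down:

```bash
# List the served models; a JSON response means the server is ready
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer token-abc123" | jq
```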
Then run the benchmark against it:

```bash
python src/benchmark_serving.py \
  --config-path ../configs/llama/vllm/benchmark \
  --config-name instruct_latency.yaml
```
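The `--config-path`/`--config-name` flags match Hydra's CLI convention, which suggests individual config values can be overridden as `key=value` arguments without editing the YAML. This is a sketch under that assumption; `max_concurrency` and `request_rate` are hypothetical key names that must match whatever the config actually defines:

```bash
# Hypothetical keys: check the benchmark config for the real names
python src/benchmark_serving.py \
  --config-path ../configs/llama/vllm/benchmark \
  --config-name instruct_latency.yaml \
  max_concurrency=32 request_rate=4
```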
The same flow works for SGLang: launch the server, then benchmark it.

```bash
python scripts/launch_sglang_server.py --config configs/llama/sglang/serve/baseline.yaml

python src/benchmark_serving.py \
  --config-path ../configs/llama/sglang/benchmark \
  --config-name instruct_latency.yaml
```

For TensorRT-LLM, start the NGC container first, then launch the server:

```bash
# --net host exposes the server port directly, so no -p mapping is needed
docker run --rm -it \
  --net host \
  --gpus all \
  nvcr.io/nvidia/tensorrt-llm/release

python scripts/launch_trtllm_server.py --config configs/llama/trtllm/serve/baseline.yaml
```
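In scripted runs it is worth blocking until the server answers before starting the benchmark. A small polling sketch, again assuming the default `localhost:8000` endpoint:

```bash
# Poll until the OpenAI-compatible endpoint responds (give up after ~2 min)
for _ in $(seq 1 24); do
  curl -sf http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123" >/dev/null && break
  sleep 5
done
```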
```bash
python src/benchmark_serving.py \
  --config-path ../configs/llama/trtllm/benchmark \
  --config-name chat_latency.yaml

python scripts/stop_trtllm_server.py
```

To sanity-check any running server, send a completion request to it directly:

```bash
curl http://localhost:8000/v1/completions \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "The new movie that got Oscar this year",
    "max_tokens": 5
  }' | jq
```

The workflow is the same for every backend:

- Pick a backend + optimization config.
- Start a fresh server.
- Run benchmarks (a single config or a concurrency × request-rate sweep; see the sketch after this list).
- Stop server.
- Repeat for next setup.
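A minimal sweep loop, under the same Hydra-override assumption as above (`max_concurrency` and `request_rate` are hypothetical key names):

```bash
# Sweep concurrency x request rate against an already-running server
for conc in 1 8 32; do
  for rate in 1 4 16; do
    python src/benchmark_serving.py \
      --config-path ../configs/llama/vllm/benchmark \
      --config-name instruct_latency.yaml \
      max_concurrency=$conc request_rate=$rate
  done
done
```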
Results are saved to `results/` and can be visualized in `notebooks/visualize_results.ipynb`.
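If the saved results are JSON files (an assumption; check what `benchmark_serving.py` actually writes), summary metrics can also be pulled out on the command line. The `request_throughput` field below is a hypothetical name:

```bash
# Hypothetical field name; run `jq keys` on a results file first to see the schema
jq '.request_throughput' results/*.json
```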
| Metric | Description |
|---|---|
| Request Throughput | Requests completed per second |
| Token Throughput | Input/output/total tokens per second |
| Goodput metrics | Throughput counted only over requests that meet latency SLOs |
| End-to-End Latency | Mean/median/std/p95/p99 request latency (ms) |
| Time to First Token | TTFT (ms) for streaming requests |
| Inter-Token Latency | Delay between consecutive streamed tokens (ms) |
| TPOT | Time per output token after the first (ms) |
| Concurrency | Aggregate request concurrency observed during the run |
```
configs/     # backend configs (serve + benchmark)
scripts/     # launch/stop servers, sweeps, install deps
src/         # benchmarking scripts & dataset loader
dataset/     # benchmark datasets
results/     # saved results
notebooks/   # visualization
tests/       # tests
```