Inference Bench

A reproducible benchmarking framework for evaluating LLM inference across vLLM, SGLang, and TensorRT-LLM backends. Supports configurable workloads, concurrency sweeps, and result visualization.


🔧 Installation & Setup

# Activate your environment and enter the repo
conda activate bench && cd inference-bench/

# Install dependencies
sh scripts/install_deps.sh

# Set API key if needed
export OPENAI_API_KEY=token-abc123

🚀 Running Backends

vLLM

vllm serve --config configs/llama/vllm/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/vllm/benchmark \
    --config-name instruct_latency.yaml

SGLang

python scripts/launch_sglang_server.py --config configs/llama/sglang/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/sglang/benchmark \
    --config-name instruct_latency.yaml

TensorRT-LLM

# Note: with --net host, the -p port mapping is ignored; the server is reachable on the host network directly
docker run --rm -it \
    --net host \
    --gpus all \
    -p 8000:8000 \
    nvcr.io/nvidia/tensorrt-llm/release

python scripts/launch_trtllm_server.py --config configs/llama/trtllm/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/trtllm/benchmark \
    --config-name chat_latency.yaml

python scripts/stop_trtllm_server.py

📡 Sanity Check

curl http://localhost:8000/v1/completions \
-H "Authorization: Bearer token-abc123" \
-H "Content-Type: application/json" \
-d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": "The new movie that got Oscar this year",
        "max_tokens": 5
    }' | jq
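
The curl above targets an OpenAI-compatible /v1/completions endpoint, so the same check can be done from Python with the openai client. A minimal sketch (not part of the repo; model name and token mirror the curl example):

# sanity_check.py - minimal sketch against the local OpenAI-compatible server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="The new movie that got Oscar this year",
    max_tokens=5,
)
print(completion.choices[0].text)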

📊 Benchmark Workflow

  1. Pick backend + optimization config.
  2. Start a fresh server.
  3. Run benchmarks (single config or a concurrency × request-rate sweep; see the sketch below).
  4. Stop server.
  5. Repeat for next setup.

Results are saved to results/ and can be visualized in notebooks/visualize_results.ipynb.
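
A concurrency × request-rate sweep can be scripted by re-running the benchmark with different overrides; the actual sweep helpers live in scripts/. A rough sketch, assuming Hydra-style overrides named concurrency and request_rate (hypothetical keys, check the benchmark configs for the real ones):

# sweep_sketch.py - illustrative only; see scripts/ for the real sweep tooling
import itertools
import subprocess

CONCURRENCIES = [1, 4, 16, 64]
REQUEST_RATES = [1, 4, 16]

for concurrency, rate in itertools.product(CONCURRENCIES, REQUEST_RATES):
    # One benchmark run per (concurrency, request-rate) point of the sweep.
    subprocess.run(
        [
            "python", "src/benchmark_serving.py",
            "--config-path", "../configs/llama/vllm/benchmark",
            "--config-name", "instruct_latency.yaml",
            # Hypothetical override keys for the sweep dimensions:
            f"concurrency={concurrency}",
            f"request_rate={rate}",
        ],
        check=True,
    )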


📐 Metrics

Metric                 Description
Request Throughput     Requests completed per second
Token Throughput       Input/output/total tokens per second
Goodput                Throughput of requests that meet the configured latency SLOs
End-to-End Latency     Mean/median/std/p95/p99 request latency (ms)
Time to First Token    TTFT (ms) for streaming requests
Inter-Token Latency    Delay between consecutive output tokens (ms)
TPOT                   Time per output token after the first token (ms)
Concurrency            Aggregate request concurrency
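
For reference, TPOT and the latency percentiles are typically derived from raw per-request timings as in the sketch below. The field names are illustrative and not the exact schema of the saved result files:

# metrics_sketch.py - how the latency metrics relate (times in ms)
import statistics

# Hypothetical per-request record from one benchmark run.
request = {"e2e_latency_ms": 1850.0, "ttft_ms": 120.0, "output_tokens": 128}

# TPOT: average time per output token after the first token.
tpot_ms = (request["e2e_latency_ms"] - request["ttft_ms"]) / (request["output_tokens"] - 1)

# Aggregates over many requests, e.g. mean and p99 end-to-end latency.
latencies_ms = [1720.0, 1850.0, 1930.0, 2010.0, 2150.0, 2300.0]  # placeholder samples
mean_ms = statistics.mean(latencies_ms)
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"TPOT {tpot_ms:.2f} ms | mean {mean_ms:.1f} ms | p99 {p99_ms:.1f} ms")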

📂 Folder Structure

configs/      # backend configs (serve + benchmark)
scripts/      # launch/stop servers, sweeps, install deps
src/          # benchmarking scripts & dataset loader
dataset/      # benchmark datasets
results/      # saved results
notebooks/    # visualization
tests/        # tests

🔗 References
