Inference Bench

A reproducible benchmarking framework for evaluating LLM inference across vLLM, SGLang, and TensorRT-LLM backends. Supports configurable workloads, concurrency sweeps, and result visualization.


🔧 Installation & Setup

# Activate your environment and enter the repo
conda activate bench && cd inference-bench/

# Install dependencies
sh scripts/install_deps.sh

# Set API key if needed
export OPENAI_API_KEY=token-abc123

🚀 Running Backends

vLLM

vllm serve --config configs/llama/vllm/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/vllm/benchmark \
    --config-name instruct_latency.yaml

SGLang

python scripts/launch_sglang_server.py --config configs/llama/sglang/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/sglang/benchmark \
    --config-name instruct_latency.yaml

TensorRT-LLM

# Note: with --net host, the -p port mapping is ignored; the server is reachable on the host network directly
docker run --rm -it \
    --net host \
    --gpus all \
    -p 8000:8000 \
    nvcr.io/nvidia/tensorrt-llm/release

python scripts/launch_trtllm_server.py --config configs/llama/trtllm/serve/baseline.yaml

python src/benchmark_serving.py \
    --config-path ../configs/llama/trtllm/benchmark \
    --config-name chat_latency.yaml

python scripts/stop_trtllm_server.py

📡 Sanity Check

curl http://localhost:8000/v1/completions \
-H "Authorization: Bearer token-abc123" \
-H "Content-Type: application/json" \
-d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": "The new movie that got Oscar this year",
        "max_tokens": 5
    }' | jq
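
The curl above targets an OpenAI-compatible /v1/completions endpoint, so the same check can be done from Python with the openai client. A minimal sketch (not part of the repo; model name and token mirror the curl example):

# sanity_check.py - minimal sketch against the local OpenAI-compatible server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="The new movie that got Oscar this year",
    max_tokens=5,
)
print(completion.choices[0].text)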

📊 Benchmark Workflow

  1. Pick backend + optimization config.
  2. Start a fresh server.
  3. Run benchmarks (single config or a concurrency × request-rate sweep; see the sketch below).
  4. Stop server.
  5. Repeat for next setup.

Results are saved to results/ and can be visualized in notebooks/visualize_results.ipynb.
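
A concurrency × request-rate sweep can be scripted by re-running the benchmark with different overrides; the actual sweep helpers live in scripts/. A rough sketch, assuming Hydra-style overrides named concurrency and request_rate (hypothetical keys, check the benchmark configs for the real ones):

# sweep_sketch.py - illustrative only; see scripts/ for the real sweep tooling
import itertools
import subprocess

CONCURRENCIES = [1, 4, 16, 64]
REQUEST_RATES = [1, 4, 16]

for concurrency, rate in itertools.product(CONCURRENCIES, REQUEST_RATES):
    # One benchmark run per (concurrency, request-rate) point of the sweep.
    subprocess.run(
        [
            "python", "src/benchmark_serving.py",
            "--config-path", "../configs/llama/vllm/benchmark",
            "--config-name", "instruct_latency.yaml",
            # Hypothetical override keys for the sweep dimensions:
            f"concurrency={concurrency}",
            f"request_rate={rate}",
        ],
        check=True,
    )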


📐 Metrics

Metric                 Description
Request Throughput     Requests completed per second
Token Throughput       Input/output/total tokens per second
Goodput                Throughput of requests that meet the configured latency SLOs
End-to-End Latency     Mean/median/std/p95/p99 request latency (ms)
Time to First Token    TTFT (ms) for streaming requests
Inter-Token Latency    Delay between consecutive output tokens (ms)
TPOT                   Time per output token after the first token (ms)
Concurrency            Aggregate request concurrency
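
For reference, TPOT and the latency percentiles are typically derived from raw per-request timings as in the sketch below. The field names are illustrative and not the exact schema of the saved result files:

# metrics_sketch.py - how the latency metrics relate (times in ms)
import statistics

# Hypothetical per-request record from one benchmark run.
request = {"e2e_latency_ms": 1850.0, "ttft_ms": 120.0, "output_tokens": 128}

# TPOT: average time per output token after the first token.
tpot_ms = (request["e2e_latency_ms"] - request["ttft_ms"]) / (request["output_tokens"] - 1)

# Aggregates over many requests, e.g. mean and p99 end-to-end latency.
latencies_ms = [1720.0, 1850.0, 1930.0, 2010.0, 2150.0, 2300.0]  # placeholder samples
mean_ms = statistics.mean(latencies_ms)
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"TPOT {tpot_ms:.2f} ms | mean {mean_ms:.1f} ms | p99 {p99_ms:.1f} ms")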

📂 Folder Structure

configs/      # backend configs (serve + benchmark)
scripts/      # launch/stop servers, sweeps, install deps
src/          # benchmarking scripts & dataset loader
dataset/      # benchmark datasets
results/      # saved results
notebooks/    # visualization
tests/        # tests

🔗 References
