Realistic LLM inference benchmarking tool for Sequrity.ai.
Fixes the core problems with InferenceX:
- Real text workloads instead of random tokens
- Proper SSE-parsed TTFT (not HTTP chunk approximation)
- Explicit failed request tracking
- Separate input/output token throughput
- Multi-turn conversation simulation (largely absent from existing benchmark suites)
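The SSE-parsed TTFT point deserves a concrete illustration: rather than timestamping the first HTTP chunk (which may be an empty role delta or a partial frame), the client should parse the `data:` events and record the arrival of the first event that carries actual text. The sketch below shows the parsing side only; the function names are illustrative, not the project's actual API:

```python
import json


def parse_sse_events(lines):
    """Yield JSON payloads from OpenAI-style SSE 'data:' lines,
    stopping at the '[DONE]' sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


def first_token_event_index(lines):
    """Index of the first SSE event carrying generated text.

    In a real client you would record time.perf_counter() when this
    event arrives and subtract the request send time to get TTFT.
    Note the first event is often a role-only delta with no content,
    which is exactly why raw chunk timing over-approximates TTFT.
    """
    for i, event in enumerate(parse_sse_events(lines)):
        delta = event["choices"][0].get("delta", {})
        if delta.get("content"):
            return i
    return None
```

For example, a stream whose first event is `{"delta": {"role": "assistant"}}` and whose second carries `{"delta": {"content": "Hello"}}` yields index 1: the role-only event is correctly skipped.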
## Quick start

```bash
# Start vLLM server
CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000 --api-key test

# Run benchmark
python -m src.benchmark.runner --config configs/baseline_vllm_a6000.yaml
```

## Project structure

- `src/benchmark/` — async client, metrics, runner
- `src/workloads/` — profiles, dataset, arrival patterns, multi-turn
- `configs/` — YAML benchmark configs
- `results/` — JSON benchmark outputs
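For orientation, a config along the lines of `configs/baseline_vllm_a6000.yaml` might look roughly like the sketch below. Every field name here is an assumption chosen to mirror the features listed above (workload profiles, arrival patterns, multi-turn), not the tool's actual schema — consult the files in `configs/` for the real format:

```yaml
# Hypothetical config sketch -- all field names are illustrative
server:
  base_url: http://localhost:8000/v1
  api_key: test
  model: meta-llama/Llama-3.1-8B-Instruct

workload:
  profile: chat            # real text prompts, not random tokens
  num_requests: 500
  arrival: poisson         # arrival pattern
  request_rate: 4.0        # mean requests per second
  multi_turn:
    enabled: true
    max_turns: 4

output:
  results_dir: results/    # JSON benchmark outputs land here
```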