A Rust-based tool for running token throughput and latency benchmarks on language models.
Download the latest release from the releases page, or build from source:
```bash
# Note that you will need rust for this
# Depending on your distro you may also need other dependencies
cargo build --release
```

Run the benchmark with the following command:

```bash
llmperf --model <MODEL_NAME>
```

Replace `<MODEL_NAME>` with the model you want to test.
Run `llmperf --help` to see all available options and their defaults:

```bash
# Short help
llmperf -h

# Long help
llmperf --help
```

Basic usage with a specified model:
Note that llmperf reads `OPENAI_BASE_URL` first, falling back to the legacy `OPENAI_API_BASE`.
```bash
export OPENAI_BASE_URL=http://localhost:8000/v1 # vLLM endpoint
# or the legacy
export OPENAI_API_BASE=http://localhost:8000/v1

llmperf --model Qwen/Qwen3-4B-Instruct-2507
```
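As a sketch of that precedence (using `std::env`; this is illustrative, not llmperf's actual code):

```rust
use std::env;

// Assumed lookup order: OPENAI_BASE_URL wins, with the legacy
// OPENAI_API_BASE as a fallback.
fn base_url() -> Option<String> {
    env::var("OPENAI_BASE_URL")
        .or_else(|_| env::var("OPENAI_API_BASE"))
        .ok()
}

fn main() {
    // With both variables exported as above, this prints the OPENAI_BASE_URL value.
    println!("{:?}", base_url());
}
```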
The following environment variables are read:

```bash
# Default is warn
export RUST_LOG=INFO # Set log level: DEBUG, INFO, WARN, ERROR

# Defaults to 600 seconds; this is the timeout per request
export OPENAI_API_TIMEOUT=600

# Base URL, throws an error if unset
export OPENAI_API_BASE=http://localhost:8000/v1

# API key, optional
export OPENAI_API_KEY=sk-secret-key

# HF_TOKEN, optional, for downloading private tokenizers
export HF_TOKEN=hf-abc123

# HF_HOME, optional, path for downloading tokenizers
export HF_HOME=/tmp/hf
```

There are currently no planned features apart from addressing common issues or concerns. The goal is to provide a simple tool that does not require heavy configuration.
Some features that were considered but dropped:
- Non-streaming requests
- JSON inputs
- Warmup requests; the suggestion is to either run the test twice or submit a separate warmup request (a sketch follows below)
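If a warmup is needed, one option is to submit a single throwaway request before the measured run. A minimal sketch, assuming the `reqwest` (with the `json` feature), `tokio`, and `serde_json` crates; the model name is just the example used above:

```rust
use serde_json::json;

// Send one unmeasured request so weights, caches, and connections are warm.
// This is a standalone sketch, not part of llmperf.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let base = std::env::var("OPENAI_BASE_URL")
        .or_else(|_| std::env::var("OPENAI_API_BASE"))
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());

    let body = json!({
        "model": "Qwen/Qwen3-4B-Instruct-2507",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 1
    });

    let mut req = reqwest::Client::new()
        .post(format!("{base}/chat/completions"))
        .json(&body);
    if let Ok(key) = std::env::var("OPENAI_API_KEY") {
        req = req.bearer_auth(key); // optional, as noted above
    }

    let resp = req.send().await?;
    println!("warmup status: {}", resp.status());
    Ok(())
}
```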
Further documentation:

- CLI Reference - Full list of options and output file formats
- Metrics - Detailed explanation of metrics and how to read output files
- Prompt - How prompts are constructed
- Sample Tests - Example benchmark comparisons
- Deployment - How to run this as an image or deploy on k8s.
- A local copy of `sonnet.txt` is no longer needed, as it is baked into the binary with
  `include_str!`. However, compiling the binary from source requires this file to be present in the root directory.
- vLLM: Works well, supports streaming. When running against a reasoning model, it does not send back the
  `</think>` token; instead it sends a different JSON schema (see the parsing sketch after these notes).
  Note: the server was run with `reasoning_parser: deepseek_r1`.
  Example:

  ```
  # Reasoning model response
  {"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}
  # Content
  {"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}
  ```

  As such, there could be a missing token for `</think>`, but I will have to do more tests to confirm.
- llamacpp: Same as vLLM.
  Note: the server was run with `reasoning-format: deepseek`.
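Because the reasoning chunks above carry no `content` field, a client that only looks at `content` would miss those tokens. Below is a minimal deserialization sketch (assuming the `serde` and `serde_json` crates; this is not llmperf's actual parsing code, and the field names simply mirror the chunks shown above):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct StreamChunk {
    choices: Vec<Choice>,
}

#[derive(Deserialize)]
struct Choice {
    delta: Delta,
}

#[derive(Deserialize)]
struct Delta {
    // Regular completion tokens.
    content: Option<String>,
    // Reasoning tokens arrive here instead of a literal </think> in `content`.
    reasoning: Option<String>,
    reasoning_content: Option<String>,
}

fn main() -> serde_json::Result<()> {
    let chunks = [
        r#"{"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}"#,
        r#"{"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}"#,
    ];
    for raw in chunks {
        let chunk: StreamChunk = serde_json::from_str(raw)?;
        let delta = &chunk.choices[0].delta;
        // Fall back through the fields so a reasoning token still counts
        // toward throughput and latency numbers.
        let text = delta
            .content
            .as_deref()
            .or(delta.reasoning_content.as_deref())
            .or(delta.reasoning.as_deref())
            .unwrap_or("");
        println!("token text: {text:?}");
    }
    Ok(())
}
```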