A Rust-based tool for running token throughput and latency benchmarks on language models.
Download the latest release from the releases page, or build from source:
```bash
# Note that you will need rust for this
# Depending on your distro you may also need other dependencies
cargo build --release
```

Run the benchmark with the following command:

```bash
llmperf --model <MODEL_NAME>
```

Replace `<MODEL_NAME>` with the model you want to test.
Run `llmperf --help` to see all available options and their defaults:

```bash
# Short help
llmperf -h

# Long help
llmperf --help
```

Basic usage with a specified model:
Note that llmperf reads `OPENAI_BASE_URL` first, falling back to the legacy `OPENAI_API_BASE`.
```bash
export OPENAI_BASE_URL=http://localhost:8000/v1 # vLLM endpoint
# or the legacy
export OPENAI_API_BASE=http://localhost:8000/v1

llmperf --model Qwen/Qwen3-4B-Instruct-2507
```
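As a sketch of that precedence (using `std::env`; this is illustrative, not llmperf's actual code):

```rust
use std::env;

// Assumed lookup order: OPENAI_BASE_URL wins, with the legacy
// OPENAI_API_BASE as a fallback.
fn base_url() -> Option<String> {
    env::var("OPENAI_BASE_URL")
        .or_else(|_| env::var("OPENAI_API_BASE"))
        .ok()
}

fn main() {
    // With both variables exported as above, this prints the OPENAI_BASE_URL value.
    println!("{:?}", base_url());
}
```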
The following environment variables are read:

```bash
# Default is warn
export RUST_LOG=INFO # Set log level: DEBUG, INFO, WARN, ERROR

# Defaults to 600 seconds; this is the timeout per request
export OPENAI_API_TIMEOUT=600

# Base URL, throws an error if unset
export OPENAI_API_BASE=http://localhost:8000/v1

# API key, optional
export OPENAI_API_KEY=sk-secret-key

# HF_TOKEN, optional, for downloading private tokenizers
export HF_TOKEN=hf-abc123

# HF_HOME, optional, path for downloading tokenizers
export HF_HOME=/tmp/hf
```

There are currently no planned features apart from addressing common issues or concerns. The goal is to provide a simple tool that does not require heavy configuration.
Some features that were considered but dropped:
- Non-streaming requests
- JSON inputs
- Warmup requests; the suggestion is to either run the test twice or submit a separate warmup request (a sketch follows below)
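If a warmup is needed, one option is to submit a single throwaway request before the measured run. A minimal sketch, assuming the `reqwest` (with the `json` feature), `tokio`, and `serde_json` crates; the model name is just the example used above:

```rust
use serde_json::json;

// Send one unmeasured request so weights, caches, and connections are warm.
// This is a standalone sketch, not part of llmperf.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let base = std::env::var("OPENAI_BASE_URL")
        .or_else(|_| std::env::var("OPENAI_API_BASE"))
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());

    let body = json!({
        "model": "Qwen/Qwen3-4B-Instruct-2507",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 1
    });

    let mut req = reqwest::Client::new()
        .post(format!("{base}/chat/completions"))
        .json(&body);
    if let Ok(key) = std::env::var("OPENAI_API_KEY") {
        req = req.bearer_auth(key); // optional, as noted above
    }

    let resp = req.send().await?;
    println!("warmup status: {}", resp.status());
    Ok(())
}
```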
Further documentation:

- CLI Reference - Full list of options and output file formats
- Metrics - Detailed explanation of metrics and how to read output files
- Prompt - How prompts are constructed
- Sample Tests - Example benchmark comparisons
- Deployment - How to run this as an image or deploy on k8s.
- A local copy of `sonnet.txt` is no longer needed, as it is baked into the binary with
  `include_str!`. However, compiling the binary from source requires this file to be present in the root directory.
- vLLM: Works well, supports streaming. When running against a reasoning model, it does not send back the
  `</think>` token; instead it sends a different JSON schema (see the parsing sketch after these notes).
  Note: the server was run with `reasoning_parser: deepseek_r1`.
  Example:

  ```
  # Reasoning model response
  {"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}
  # Content
  {"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}
  ```

  As such, there could be a missing token for `</think>`, but I will have to do more tests to confirm.
- llamacpp: Same as vLLM.
  Note: the server was run with `reasoning-format: deepseek`.
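Because the reasoning chunks above carry no `content` field, a client that only looks at `content` would miss those tokens. Below is a minimal deserialization sketch (assuming the `serde` and `serde_json` crates; this is not llmperf's actual parsing code, and the field names simply mirror the chunks shown above):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct StreamChunk {
    choices: Vec<Choice>,
}

#[derive(Deserialize)]
struct Choice {
    delta: Delta,
}

#[derive(Deserialize)]
struct Delta {
    // Regular completion tokens.
    content: Option<String>,
    // Reasoning tokens arrive here instead of a literal </think> in `content`.
    reasoning: Option<String>,
    reasoning_content: Option<String>,
}

fn main() -> serde_json::Result<()> {
    let chunks = [
        r#"{"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}"#,
        r#"{"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}"#,
    ];
    for raw in chunks {
        let chunk: StreamChunk = serde_json::from_str(raw)?;
        let delta = &chunk.choices[0].delta;
        // Fall back through the fields so a reasoning token still counts
        // toward throughput and latency numbers.
        let text = delta
            .content
            .as_deref()
            .or(delta.reasoning_content.as_deref())
            .or(delta.reasoning.as_deref())
            .unwrap_or("");
        println!("token text: {text:?}");
    }
    Ok(())
}
```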