
LLM Performance Benchmark

A Rust-based tool for running token throughput and latency benchmarks on language models.

Installation

From releases

Download the latest release from the releases page.

From source

# Note that you will need the Rust toolchain (cargo) for this
# Depending on your distro you may also need other build dependencies
cargo build --release
# The resulting binary is in target/release/

Usage

Run the benchmark with the following command:

llmperf --model <MODEL_NAME>

Replace <MODEL_NAME> with the model you want to test.

Options

Run llmperf --help to see all available options and their defaults:

# Short help
llmperf -h
# Long help
llmperf --help

Example

Basic usage with a specified model:

Note that OPENAI_BASE_URL is read first, falling back to the legacy OPENAI_API_BASE.

export OPENAI_BASE_URL=http://localhost:8000/v1 # vLLM endpoint
# or the legacy
export OPENAI_API_BASE=http://localhost:8000/v1
llmperf --model Qwen/Qwen3-4B-Instruct-2507
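
For reference, the precedence note above roughly amounts to the following. This is a minimal standard-library sketch, not the tool's actual code; the function name is illustrative.

use std::env;

/// Illustrative only: OPENAI_BASE_URL is read first, then the legacy
/// OPENAI_API_BASE; if neither is set the program reports an error.
fn resolve_base_url() -> Result<String, String> {
    env::var("OPENAI_BASE_URL")
        .or_else(|_| env::var("OPENAI_API_BASE"))
        .map_err(|_| "set OPENAI_BASE_URL or OPENAI_API_BASE".to_string())
}

fn main() {
    match resolve_base_url() {
        Ok(url) => println!("using base URL {url}"),
        Err(e) => eprintln!("error: {e}"),
    }
}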

Environment variables

# Default is warn
export RUST_LOG=INFO # Set log level: DEBUG, INFO, WARN, ERROR
# Defaults to 600 seconds; this is the per-request timeout
export OPENAI_API_TIMEOUT=600
# Base URL; an error is thrown if no base URL is set (OPENAI_BASE_URL takes precedence)
export OPENAI_API_BASE=http://localhost:8000/v1
# API key, optional
export OPENAI_API_KEY=sk-secret-key
# HF_TOKEN, optional, for downloading private tokenizers
export HF_TOKEN=hf-abc123
# HF_HOME, optional, path for downloading tokenizers
export HF_HOME=/tmp/hf
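
As an illustration of the timeout default above, the lookup roughly amounts to the following standard-library sketch. It is not the tool's actual parsing code, and falling back to the default for non-numeric values is an assumption.

use std::{env, time::Duration};

/// Illustrative only: OPENAI_API_TIMEOUT is interpreted as seconds,
/// falling back to 600 when unset (or, in this sketch, unparsable).
fn request_timeout() -> Duration {
    let secs = env::var("OPENAI_API_TIMEOUT")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(600);
    Duration::from_secs(secs)
}

fn main() {
    println!("per-request timeout: {:?}", request_timeout());
}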

Roadmap

There are currently no planned features; development is driven by common issues and concerns as they come up. The goal is to provide a simple tool without very heavy configuration.

Some features that were considered but dropped:

  • Non streaming requests
  • JSON inputs
  • Warmup requests; the suggestion is to either run the test twice or submit a separate warmup request beforehand (see the sketch after this list)
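
Of the two warmup options, the "separate request" route can be as simple as one small, non-benchmarked completion against the same endpoint before the real run. A rough sketch, not part of llmperf-rs; it assumes the reqwest crate (blocking and json features) plus serde_json, and reuses the environment variables described above.

// Hypothetical standalone warmup: send one tiny chat completion before the
// real benchmark run. Assumed Cargo.toml dependencies:
//   reqwest = { version = "0.12", features = ["blocking", "json"] }
//   serde_json = "1"
use std::env;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same precedence as the benchmark: OPENAI_BASE_URL, then OPENAI_API_BASE.
    let base = env::var("OPENAI_BASE_URL").or_else(|_| env::var("OPENAI_API_BASE"))?;
    let body = serde_json::json!({
        "model": "Qwen/Qwen3-4B-Instruct-2507",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 8
    });
    let client = reqwest::blocking::Client::new();
    let mut req = client.post(format!("{base}/chat/completions")).json(&body);
    if let Ok(key) = env::var("OPENAI_API_KEY") {
        req = req.header("Authorization", format!("Bearer {key}"));
    }
    let resp = req.send()?;
    println!("warmup request returned {}", resp.status());
    Ok(())
}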

Additional details

  • CLI Reference - Full list of options and output file formats
  • Metrics - Detailed explanation of metrics and how to read output files
  • Prompt - How prompts are constructed
  • Sample Tests - Example benchmark comparisons
  • Deployment - How to run this as an image or deploy on k8s.
  • A local copy of sonnet.txt is no longer needed at runtime, as it is baked into the binary with include_str!. However, compiling the binary from source requires this file to be present in the root directory.
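
For context, this is roughly what embedding a file with include_str! looks like; the path and constant name here are illustrative, not the project's actual source.

// Illustrative sketch: include_str! embeds the file's contents into the
// binary as a &'static str at compile time. The path is resolved relative
// to the source file, which is why sonnet.txt must be present when
// building from source even though the installed binary no longer needs it.
const SONNET: &str = include_str!("../sonnet.txt");

fn main() {
    println!("embedded sonnet.txt: {} bytes", SONNET.len());
}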

Tested endpoints and their notes

  • vLLM: Works well, supports streaming. When running against a reasoning model, the server does not send back the </think> token; instead it streams a different JSON schema:

Note: the server was run with reasoning_parser: deepseek_r1

Example:

# Reasoning model response
{"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}
# Content
{"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}

As such, there could be a missing token for </think>, but I will have to run more tests to confirm (a parsing sketch follows this list).

  • llamacpp: Same as vLLM

Note: the server was run with reasoning-format: deepseek
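
To make the schema difference concrete, here is a minimal sketch of how a client might separate reasoning deltas from content deltas in the chunks shown above. The structs are illustrative, not llmperf-rs internals, and it assumes serde (with the derive feature) and serde_json as dependencies.

use serde::Deserialize;

#[derive(Deserialize)]
struct Chunk {
    choices: Vec<Choice>,
}

#[derive(Deserialize)]
struct Choice {
    delta: Delta,
}

// All delta fields are optional; unknown fields such as index and
// finish_reason are ignored by serde's default behaviour.
#[derive(Deserialize, Default)]
#[serde(default)]
struct Delta {
    content: Option<String>,
    reasoning: Option<String>,
    reasoning_content: Option<String>,
}

fn main() -> serde_json::Result<()> {
    // The two example chunks from above: one reasoning delta, one content delta.
    let lines = [
        r#"{"choices":[{"index":0,"delta":{"reasoning":" on","reasoning_content":" on"},"finish_reason":null}]}"#,
        r#"{"choices":[{"index":0,"delta":{"content":" math","reasoning_content":null},"finish_reason":null}]}"#,
    ];
    for line in lines {
        let chunk: Chunk = serde_json::from_str(line)?;
        for choice in chunk.choices {
            // Reasoning models stream reasoning/reasoning_content instead of content.
            if let Some(text) = choice.delta.reasoning_content.or(choice.delta.reasoning) {
                println!("reasoning delta: {text:?}");
            } else if let Some(text) = choice.delta.content {
                println!("content delta: {text:?}");
            }
        }
    }
    Ok(())
}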
