Modernised benchmarking harness for exercising NVIDIA NIM, vLLM, Ollama, and llama.cpp deployments. The project ships with a FastAPI backend that orchestrates load tests and a Vite + React dashboard for monitoring runs.
- Asynchronous FastAPI backend with pluggable clients for NIM, vLLM, Ollama, and llama.cpp
- Benchmark executor with warmups, concurrency control, and automatic statistics (p50/p95 latency, TPS, TTFT)
- SQLite-backed history with REST endpoints for scheduling and tracking runs
- Auto-benchmark sweeps for quickly exploring concurrency, token, and temperature combinations
- React dashboard featuring backend selectors, parameter knobs, and live history updates
- Built-in model management for Ollama, NVIDIA NIM, and Hugging Face checkpoints
- Runtime controls to mark models as running or stopped across all supported providers
- Containerised deployment (Dockerfiles + docker-compose.yml) with a multi-architecture build script
- Start and deploy helper scripts for local development or CI pipelines
To run the backend locally:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r benchmark/requirements.txt
uvicorn app.main:app --reload --app-dir benchmark/backend/app --port 8000
```

To run the dashboard in development mode:

```bash
cd benchmark/frontend
npm install
npm run dev
```

The frontend proxies API traffic to http://localhost:8000 by default.
To build and launch both services locally:
```bash
./scripts/start.sh
```

This command builds the backend and frontend images (compatible with x86_64 and ARM64) and starts them via Docker Compose. The dashboard is available at http://localhost:5173.
Optional API keys can be injected into the backend container by exporting them before running the script. For example, `LLAMACPP_API_KEY=sk-local ./scripts/start.sh` ensures llama.cpp requests from the dashboard include the bearer token.
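Once the stack is running, you can confirm both services respond before kicking off benchmarks. Below is a minimal sketch using the `requests` package; it assumes the backend exposes its default `/openapi.json` route (a FastAPI default rather than something documented here):

```python
import requests  # assumes the requests package is installed

# Poll the backend's OpenAPI schema (a FastAPI default route) and the Vite
# dashboard; both URLs are the defaults used by the start script.
for name, url in {
    "backend": "http://localhost:8000/openapi.json",
    "dashboard": "http://localhost:5173/",
}.items():
    resp = requests.get(url, timeout=5)
    print(f"{name}: HTTP {resp.status_code}")
```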
`./scripts/deploy.sh` produces ready-to-run images for the backend and frontend. Set `REGISTRY` and `TAG` to push the images into your container registry:

```bash
REGISTRY=registry.example.com TAG=prod PUSH=1 ./scripts/deploy.sh
```

Submit a POST request to `/api/benchmarks/auto` with the following payload shape:
```json
{
  "provider": "nim",
  "model_name": "nemotron-3-8b-instruct",
  "prompt": "Write a haiku about GPUs.",
  "sweep_concurrency": [1, 4, 8],
  "sweep_max_tokens": [128, 256],
  "sweep_temperature": [0.1, 0.3],
  "parameters": {
    "request_count": 10,
    "concurrency": 1,
    "warmup_requests": 2,
    "max_tokens": 128,
    "temperature": 0.1,
    "top_p": 0.9,
    "repetition_penalty": 1.0,
    "stream": true,
    "timeout": 120
  }
}
```

The endpoint returns a list of completed runs with metrics for each combination.
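The sweep can also be submitted from a script. Here is a minimal sketch using the `requests` package with the payload shown above; the backend URL assumes the default local port:

```python
import requests  # assumes the requests package is installed

payload = {
    "provider": "nim",
    "model_name": "nemotron-3-8b-instruct",
    "prompt": "Write a haiku about GPUs.",
    "sweep_concurrency": [1, 4, 8],
    "sweep_max_tokens": [128, 256],
    "sweep_temperature": [0.1, 0.3],
    "parameters": {
        "request_count": 10,
        "concurrency": 1,
        "warmup_requests": 2,
        "max_tokens": 128,
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
        "stream": True,
        "timeout": 120,
    },
}

# Submit the sweep and print each completed run; the exact shape of a run
# object is not documented here, so we simply dump the JSON.
resp = requests.post("http://localhost:8000/api/benchmarks/auto", json=payload, timeout=600)
resp.raise_for_status()
for run in resp.json():
    print(run)
```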
The API exposes helper endpoints so you can pull and inspect model assets directly from the dashboard:
| Endpoint | Description |
|---|---|
| `GET /api/models/ollama` | List models currently installed on an Ollama host. Optional `base_url` overrides the default target. |
| `POST /api/models/ollama/pull` | Trigger an Ollama pull operation for the requested model name. |
| `POST /api/models/nim/search` | Query the NVIDIA NGC catalog for NIM deployments (requires an NGC API key). |
| `POST /api/models/nim/pull` | Run `docker pull` against nvcr.io using the supplied NGC API key. |
| `POST /api/models/ngc/cli` | Use the NGC CLI to pull a checkpoint into the model cache and generate configs for llama.cpp, Ollama, sglang, and vLLM (TensorRT-LLM optional). |
| `POST /api/models/huggingface/search` | Search the Hugging Face model hub using an optional HF API token. |
| `POST /api/models/huggingface/download` | Download a gated model snapshot into the backend's model cache directory. |
| `GET /api/models/runtimes` | Inspect which models are currently marked as running across providers. |
| `POST /api/models/runtimes/start` | Mark a model as running for a given provider/base URL combination. |
| `POST /api/models/runtimes/stop` | Stop tracking a running model for a provider. |
Supply API keys in the request body when required. The frontend surfaces these flows under the new "Model management" section.
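These endpoints can be scripted as well. The sketch below uses the `requests` package against the default backend port; the pull request body fields (`name`, `base_url`) are assumptions for illustration rather than the documented schema, so check the backend's OpenAPI docs for the actual field names:

```python
import requests  # assumes the requests package is installed

BACKEND = "http://localhost:8000"

# List models installed on the default Ollama host.
models = requests.get(f"{BACKEND}/api/models/ollama", timeout=30).json()
print(models)

# Trigger a pull. "name" and "base_url" are assumed field names used here
# only to illustrate the flow; consult the backend's OpenAPI docs.
pull = requests.post(
    f"{BACKEND}/api/models/ollama/pull",
    json={"name": "llama3", "base_url": "http://localhost:11434"},
    timeout=600,
)
print(pull.json())
```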
The backend can orchestrate NGC CLI downloads when `ngc` is available on the host. A helper script installs the CLI into `/usr/local/bin/ngc` and picks the correct archive for common CPU architectures (x86_64 or arm64):

```bash
./scripts/install_ngc_cli.sh
```

The script expects `unzip` to be available and may require sudo privileges to write to `/usr/local/bin`. Override the download location by setting `NGC_CLI_URL` if you mirror the archive internally or need a different platform build.
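With the CLI installed, checkpoint pulls can be triggered through the `/api/models/ngc/cli` endpoint listed above. The request body in this sketch is purely illustrative; every field name is an assumption, since the schema is not documented here:

```python
import requests  # assumes the requests package is installed

# Illustrative only: "model" and "api_key" are assumed field names, not the
# documented schema. Consult the backend's OpenAPI docs before use.
resp = requests.post(
    "http://localhost:8000/api/models/ngc/cli",
    json={
        "model": "nvidia/nemotron-3-8b-instruct",  # hypothetical checkpoint id
        "api_key": "<ngc-api-key>",
    },
    timeout=3600,
)
print(resp.json())
```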
The backend unit tests cover the statistics engine and the parameter sweep generator. Run them with:

```bash
pytest
```
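Conceptually, the sweep generator expands the sweep lists into one parameter set per combination via a Cartesian product. The following is a rough sketch of that idea for the auto-benchmark payload shown earlier, not the project's actual implementation:

```python
from itertools import product

# Rough sketch: fan the sweep lists out into one parameter set per
# combination, mirroring how an auto-benchmark request produces one run
# per (concurrency, max_tokens, temperature) tuple.
def expand_sweep(base_params, concurrency, max_tokens, temperature):
    for c, m, t in product(concurrency, max_tokens, temperature):
        yield {**base_params, "concurrency": c, "max_tokens": m, "temperature": t}

base = {"request_count": 10, "warmup_requests": 2, "stream": True}
for combo in expand_sweep(base, [1, 4, 8], [128, 256], [0.1, 0.3]):
    print(combo)  # 3 x 2 x 2 = 12 combinations
```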