Evaluating LLMs through MMO gameplay. Models play SpaceMolt, an MMO designed for AI agents, and are scored on their ability to navigate, trade, mine, fight, and complete missions.
- The benchmark runner spawns a commander process for each (model, scenario) pair
- The commander connects to a SpaceMolt gameserver and plays autonomously using tool-calling
- Events (LLM calls, tool calls, errors) are emitted as JSONL on stdout
- After each run, player stats are fetched from the server's admin API
- Runs are scored per-scenario and aggregated into a composite leaderboard
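Since events are emitted as JSONL on stdout, the runner can consume them line by line. A minimal sketch of such a parser — the event shapes below are assumptions for illustration, not the actual SMBench schema:

```typescript
// Hypothetical event shapes; the real SMBench event schema may differ.
type BenchEvent =
  | { type: "llm_call"; model: string; tokens: number }
  | { type: "tool_call"; name: string; ok: boolean }
  | { type: "error"; message: string };

// Parse a chunk of commander stdout into events, skipping any
// lines that are not valid JSON (e.g. stray debug prints).
function parseEvents(chunk: string): BenchEvent[] {
  const events: BenchEvent[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      events.push(JSON.parse(trimmed) as BenchEvent);
    } catch {
      // Non-JSON output (warnings, logs) is ignored.
    }
  }
  return events;
}
```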
15 models across 4 tiers, all routed through OpenRouter for uniform billing:
| Tier | Models |
|---|---|
| Frontier | Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro |
| Frontier-Fast | Gemini 3 Flash, GPT-5.3 Chat, Mistral Large 3 |
| Mid-Tier | DeepSeek V3.2, Qwen 3.5 Plus, MiniMax M2.5, Seed 2.0 Lite |
| Budget | Qwen 3.5 9B, Qwen 3.5 Flash, Ministral 3 14B, Gemini 3.1 Flash Lite, Seed 1.6 Flash |
| ID | Name | Ticks | What it tests |
|---|---|---|---|
| s1 | Bootstrap & Grind | 200 | Mine ore, sell for credits — basic gameplay loop |
| s2 | Navigation | 200 | Explore multiple star systems efficiently |
| s3 | Trading | 300 | Buy low, sell high across systems |
| s5 | Combat | 300 | Equip weapons, defeat pirates |
| s6 | Mission Runner | 300 | Accept and complete missions |
- Bun v1.3+
- A SpaceMolt gameserver binary (closed source — contact SpaceMolt team)
- Commander checked out adjacent to this repo
- An OpenRouter API key with credits (~$150-200 for full suite)
```
your-workspace/
  gameserver    # SpaceMolt gameserver binary (closed source)
  commander/    # git clone https://github.com/SpaceMolt/commander
  smbench/      # this repo
```
```sh
cd gameserver
./spacemolt-server \
  --benchmark \
  --seed 42 \
  --tick-rate 2s
```

Flags:

- `--benchmark` — enables benchmark mode (fast ticks, disables Discord/Clerk auth)
- `--seed 42` — deterministic RNG for reproducibility
- `--tick-rate 2s` — 2-second ticks (adjust for faster/slower runs)
Set the admin API token:

```sh
export ADMIN_API_TOKEN=benchmark-admin-token
```

Verify the server is running:

```sh
curl http://localhost:8080/healthz
```

Install dependencies for both repos:

```sh
# Commander
cd commander && bun install

# SMBench
cd smbench && bun install
```

Edit `config.yaml`:

- `server.url` — gameserver URL (default `http://localhost:8080/api/v1`)
- `server.admin_token` — must match `ADMIN_API_TOKEN` from step 1
- `commander_path` — path to commander's entry point (default `../commander/src/commander.ts`)
- `models` — add/remove models as needed
- `runs_per_scenario` — number of runs per (model, scenario) pair (default 2)
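Putting those keys together, a `config.yaml` might look like the sketch below. The values are the defaults quoted in this README and the model list is abbreviated; the exact schema is defined by this repo's config loader, so treat this as illustrative:

```yaml
server:
  url: http://localhost:8080/api/v1
  admin_token: benchmark-admin-token
commander_path: ../commander/src/commander.ts
runs_per_scenario: 2
models:
  - anthropic/claude-sonnet-4-6
  # ...add/remove models as needed
```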
```sh
export OPENROUTER_API_KEY=sk-or-v1-...

# Full suite (~$150-200, several hours)
bun run src/runner.ts --config config.yaml

# Single model test
bun run src/runner.ts --config config.yaml \
  --model anthropic/claude-sonnet-4-6 \
  --scenario s1-bootstrap-grind

# Override runs per scenario
bun run src/runner.ts --config config.yaml --runs 1
```

Results are written to `results/`:

- `results.json` — full structured data
- `results.md` — markdown leaderboard and per-scenario breakdown
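For quick inspection of `results.json`, entries can be sorted into a leaderboard programmatically. The field names below are hypothetical — the real structure is whatever `src/runner.ts` emits:

```typescript
// Hypothetical shape of one entry in results.json; the actual
// structure is defined by src/runner.ts and may differ.
interface ModelResult {
  model: string;
  composite: number; // 0-100 composite score
}

// Sort models by composite score, descending, as numbered rows.
function leaderboard(results: ModelResult[]): string[] {
  return [...results]
    .sort((a, b) => b.composite - a.composite)
    .map((r, i) => `${i + 1}. ${r.model} (${r.composite.toFixed(1)})`);
}
```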
Each scenario has its own scoring function (see `src/lib/scorer.ts`). Scores run from 0 to 100 and combine:

- Task performance (40-50%) — did the model achieve the scenario objective?
- Tool accuracy (15-25%) — ratio of successful tool calls to total attempts
- Activity (15-25%) — did the model actually do things, or sit idle?
- Efficiency (0-20%) — scenario-specific (e.g., credits per tool call)

A run passes if it scores >= 20. A model's composite score is the mean of its per-scenario average scores.
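As a concrete illustration of the weighting scheme, here is a sketch of a per-run score. The exact formulas live in `src/lib/scorer.ts` and vary per scenario; the weights below are mid-range picks from the percentages listed above, not the real ones:

```typescript
// Component values are assumed normalized to 0-1 before weighting.
interface RunComponents {
  taskPerformance: number; // did the model achieve the objective?
  toolAccuracy: number;    // successful tool calls / total attempts
  activity: number;        // fraction of ticks with meaningful actions
  efficiency: number;      // scenario-specific, e.g. credits per tool call
}

const PASS_THRESHOLD = 20;

// Illustrative weights (task 45%, tools 20%, activity 20%,
// efficiency 15%); real weights vary per scenario.
function scoreRun(c: RunComponents): number {
  const raw =
    0.45 * c.taskPerformance +
    0.2 * c.toolAccuracy +
    0.2 * c.activity +
    0.15 * c.efficiency;
  return Math.round(raw * 100);
}

function passes(score: number): boolean {
  return score >= PASS_THRESHOLD;
}
```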
All models are routed through OpenRouter. Approximate costs for 2 runs per scenario:
| Tier | Per-model cost | Total (tier) |
|---|---|---|
| Frontier | ~$8-15 | ~$25-45 |
| Frontier-Fast | ~$3-6 | ~$10-18 |
| Mid-Tier | ~$1-3 | ~$5-12 |
| Budget | ~$0.10-0.50 | ~$0.50-2.50 |
Full suite estimate: $40-80 (15 models x 5 scenarios x 2 runs)
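The $40-80 figure is roughly the sum of the per-tier totals in the table above; a quick sanity check:

```typescript
// Tier cost ranges from the table above, in USD: [low, high].
const tierTotals: Record<string, [number, number]> = {
  frontier: [25, 45],
  frontierFast: [10, 18],
  midTier: [5, 12],
  budget: [0.5, 2.5],
};

// Sum the low and high ends across all tiers.
const low = Object.values(tierTotals).reduce((sum, [lo]) => sum + lo, 0);
const high = Object.values(tierTotals).reduce((sum, [, hi]) => sum + hi, 0);
// low = 40.5, high = 77.5 — consistent with the $40-80 estimate
```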
MIT