A benchmark for evaluating LLM spatial reasoning and navigation abilities through maze-solving tasks.

At its core, the benchmark tests spatial reasoning and tool use.
The model receives a maze representation and must use a tool call to output movement commands (up, down, left, right) to navigate from start to goal. Performance is measured by success rate, steps taken, time, and API cost.
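To make the task concrete, here is a minimal sketch of how a sequence of movement commands could be applied to a text-grid maze, assuming `#` marks walls (the function and grid format are illustrative, not the benchmark's actual code):

```typescript
type Pos = { x: number; y: number };

const DELTAS: Record<string, Pos> = {
  up: { x: 0, y: -1 },
  down: { x: 0, y: 1 },
  left: { x: -1, y: 0 },
  right: { x: 1, y: 0 },
};

// Apply a sequence of moves to a maze given as rows of characters.
// '#' cells are walls; moves into walls or off the grid are ignored here.
function applyMoves(maze: string[], start: Pos, moves: string[]): Pos {
  let pos = { ...start };
  for (const m of moves) {
    const d = DELTAS[m];
    if (!d) continue; // skip unrecognized commands
    const next = { x: pos.x + d.x, y: pos.y + d.y };
    if (maze[next.y]?.[next.x] && maze[next.y][next.x] !== "#") pos = next;
  }
  return pos;
}
```

Counting the moves consumed before reaching the goal cell yields a steps metric; a move into a wall could alternatively be treated as an immediate failure rather than ignored.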
Complexity levels:
- Simple: long corridors, few decision points
- Complex: more branches and dead ends
Vision modes:
- Continuous: the model receives an updated view of the maze after every move, showing its new position. This tests its ability to navigate iteratively using feedback.
- Initial: the model sees the maze only once at the start and must output all moves from memory. This tests its ability to plan a complete path upfront.
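In continuous mode, the harness has to re-render the maze with the model's current position after each move. A minimal sketch, assuming a hypothetical `P` marker for the agent's position (the benchmark's actual view format may differ):

```typescript
// Render the maze with the agent's current position marked 'P'.
// The marker overwrites whatever character occupies that cell.
function renderView(maze: string[], pos: { x: number; y: number }): string {
  return maze
    .map((row, y) =>
      y === pos.y ? row.slice(0, pos.x) + "P" + row.slice(pos.x + 1) : row,
    )
    .join("\n");
}
```

In initial mode, this view is produced exactly once; in continuous mode, it is sent back to the model after every accepted move.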
Maze sizes:
- 5x5 (trivial)
- 11x11 (small)
- 21x21 (medium)
- 31x31 (challenging)
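The odd sizes are typical of grid mazes carved by algorithms such as recursive backtracking, which place cells on odd coordinates and walls between them. A sketch of such a generator follows; it is an illustration, not the benchmark's actual implementation:

```typescript
// Hypothetical maze generator: recursive backtracker on an odd-sized grid.
// Cells live at odd (x, y) coordinates; even rows/columns hold walls.
function generateMaze(size: number, seed = 1): string[] {
  // Tiny deterministic LCG so generated mazes are reproducible.
  let s = seed;
  const rand = () => (s = (s * 1103515245 + 12345) % 2147483648) / 2147483648;

  const grid = Array.from({ length: size }, () => Array(size).fill("#"));
  const stack: [number, number][] = [[1, 1]];
  grid[1][1] = ".";
  while (stack.length) {
    const [x, y] = stack[stack.length - 1];
    // Uncarved cells two steps away in each direction.
    const next = (
      [[x + 2, y], [x - 2, y], [x, y + 2], [x, y - 2]] as [number, number][]
    ).filter(
      ([nx, ny]) =>
        nx > 0 && ny > 0 && nx < size && ny < size && grid[ny][nx] === "#",
    );
    if (next.length === 0) {
      stack.pop(); // dead end: backtrack
      continue;
    }
    const [nx, ny] = next[Math.floor(rand() * next.length)];
    grid[ny][nx] = ".";
    grid[(y + ny) / 2][(x + nx) / 2] = "."; // knock down the wall between
    stack.push([nx, ny]);
  }
  return grid.map((row) => row.join(""));
}
```

A backtracker produces "perfect" mazes (exactly one path between any two cells); adding extra branches or loops afterwards is one way to vary complexity.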
```
mazebench/
├── bench/          # Benchmark runner (Bun + AI SDK)
│   ├── src/bench/
│   │   ├── ui/     # Interactive CLI interface (Ink + React)
│   │   └── ...     # Core benchmark logic
│   └── package.json
└── dashboard/      # Results visualization (Next.js)
```
```sh
cd bench
bun install
bun run run
```

This launches an interactive CLI interface where you can:
- Select a benchmark suite
- Enter a version tag for the run
- View real-time progress and results
Results are saved to bench/src/bench/results/.
```sh
cd dashboard
bun install
bun dev
```

Open http://localhost:3000 to see the dashboard.
Edit bench/src/bench/config.ts to customize:
- Maze configurations (size, complexity, vision)
- Number of runs per config
- Max steps allowed
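As an illustration only, such a config might have roughly this shape; every field name below is an assumption, not the real bench/src/bench/config.ts schema:

```typescript
// Hypothetical config shape -- check bench/src/bench/config.ts for the real fields.
export const config = {
  mazes: [
    { size: 11, complexity: "simple", vision: "initial" },
    { size: 21, complexity: "complex", vision: "continuous" },
  ],
  runsPerConfig: 3, // repeat each maze config to average out variance
  maxSteps: 200, // abort a run after this many moves
};
```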
Edit bench/src/bench/models.ts to add or remove the models being benchmarked.
Reported metrics:
- Success rate: percentage of mazes solved
- Average steps: mean steps to reach the goal (successful runs only)
- Average time: mean duration per maze
- Cost: total API cost for the benchmark run
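The metrics above can be aggregated from per-run records along these lines (the record shape here is hypothetical):

```typescript
interface RunResult {
  solved: boolean;
  steps: number;
  durationMs: number;
  costUsd: number;
}

// Aggregate per-run records into the reported metrics.
function summarize(runs: RunResult[]) {
  const ok = runs.filter((r) => r.solved);
  return {
    successRate: ok.length / runs.length,
    // Failed runs hit the step cap, so only successful runs count here.
    avgSteps: ok.length ? ok.reduce((a, r) => a + r.steps, 0) / ok.length : NaN,
    avgTimeMs: runs.reduce((a, r) => a + r.durationMs, 0) / runs.length,
    totalCostUsd: runs.reduce((a, r) => a + r.costUsd, 0),
  };
}
```

Excluding failed runs from the step average avoids rewarding models that give up early, at the cost of making the average optimistic for low-success models.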