A benchmark for evaluating LLM spatial reasoning and navigation abilities through maze-solving tasks.

At its core, the benchmark tests spatial reasoning and tool use.
The model receives a maze representation and must use a tool call to output movement commands (up, down, left, right) to navigate from start to goal. Performance is measured by success rate, steps taken, time, and API cost.
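To make the task concrete, here is a minimal sketch of how a sequence of movement commands could be applied to a text-grid maze, assuming `#` marks walls (the function and grid format are illustrative, not the benchmark's actual code):

```typescript
type Pos = { x: number; y: number };

const DELTAS: Record<string, Pos> = {
  up: { x: 0, y: -1 },
  down: { x: 0, y: 1 },
  left: { x: -1, y: 0 },
  right: { x: 1, y: 0 },
};

// Apply a sequence of moves to a maze given as rows of characters.
// '#' cells are walls; moves into walls or off the grid are ignored here.
function applyMoves(maze: string[], start: Pos, moves: string[]): Pos {
  let pos = { ...start };
  for (const m of moves) {
    const d = DELTAS[m];
    if (!d) continue; // skip unrecognized commands
    const next = { x: pos.x + d.x, y: pos.y + d.y };
    if (maze[next.y]?.[next.x] && maze[next.y][next.x] !== "#") pos = next;
  }
  return pos;
}
```

Counting the moves consumed before reaching the goal cell yields a steps metric; a move into a wall could alternatively be treated as an immediate failure rather than ignored.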
Complexity levels:
- Simple: long corridors, few decision points
- Complex: more branches and dead ends
Vision modes:
- Continuous: the model receives an updated view of the maze after every move, showing its new position. This tests its ability to navigate iteratively using feedback.
- Initial: the model sees the maze only once at the start and must output all moves from memory. This tests its ability to plan a complete path upfront.
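In continuous mode, the harness has to re-render the maze with the model's current position after each move. A minimal sketch, assuming a hypothetical `P` marker for the agent's position (the benchmark's actual view format may differ):

```typescript
// Render the maze with the agent's current position marked 'P'.
// The marker overwrites whatever character occupies that cell.
function renderView(maze: string[], pos: { x: number; y: number }): string {
  return maze
    .map((row, y) =>
      y === pos.y ? row.slice(0, pos.x) + "P" + row.slice(pos.x + 1) : row,
    )
    .join("\n");
}
```

In initial mode, this view is produced exactly once; in continuous mode, it is sent back to the model after every accepted move.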
Maze sizes:
- 5x5 (trivial)
- 11x11 (small)
- 21x21 (medium)
- 31x31 (challenging)
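The odd sizes are typical of grid mazes carved by algorithms such as recursive backtracking, which place cells on odd coordinates and walls between them. A sketch of such a generator follows; it is an illustration, not the benchmark's actual implementation:

```typescript
// Hypothetical maze generator: recursive backtracker on an odd-sized grid.
// Cells live at odd (x, y) coordinates; even rows/columns hold walls.
function generateMaze(size: number, seed = 1): string[] {
  // Tiny deterministic LCG so generated mazes are reproducible.
  let s = seed;
  const rand = () => (s = (s * 1103515245 + 12345) % 2147483648) / 2147483648;

  const grid = Array.from({ length: size }, () => Array(size).fill("#"));
  const stack: [number, number][] = [[1, 1]];
  grid[1][1] = ".";
  while (stack.length) {
    const [x, y] = stack[stack.length - 1];
    // Uncarved cells two steps away in each direction.
    const next = (
      [[x + 2, y], [x - 2, y], [x, y + 2], [x, y - 2]] as [number, number][]
    ).filter(
      ([nx, ny]) =>
        nx > 0 && ny > 0 && nx < size && ny < size && grid[ny][nx] === "#",
    );
    if (next.length === 0) {
      stack.pop(); // dead end: backtrack
      continue;
    }
    const [nx, ny] = next[Math.floor(rand() * next.length)];
    grid[ny][nx] = ".";
    grid[(y + ny) / 2][(x + nx) / 2] = "."; // knock down the wall between
    stack.push([nx, ny]);
  }
  return grid.map((row) => row.join(""));
}
```

A backtracker produces "perfect" mazes (exactly one path between any two cells); adding extra branches or loops afterwards is one way to vary complexity.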
```
mazebench/
├── bench/          # Benchmark runner (Bun + AI SDK)
│   ├── src/bench/
│   │   ├── ui/     # Interactive CLI interface (Ink + React)
│   │   └── ...     # Core benchmark logic
│   └── package.json
└── dashboard/      # Results visualization (Next.js)
```
```sh
cd bench
bun install
bun run run
```

This launches an interactive CLI interface where you can:
- Select a benchmark suite
- Enter a version tag for the run
- View real-time progress and results
Results are saved to bench/src/bench/results/.
```sh
cd dashboard
bun install
bun dev
```

Open http://localhost:3000 to see the dashboard.
Edit bench/src/bench/config.ts to customize:
- Maze configurations (size, complexity, vision)
- Number of runs per config
- Max steps allowed
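As an illustration only, such a config might have roughly this shape; every field name below is an assumption, not the real bench/src/bench/config.ts schema:

```typescript
// Hypothetical config shape -- check bench/src/bench/config.ts for the real fields.
export const config = {
  mazes: [
    { size: 11, complexity: "simple", vision: "initial" },
    { size: 21, complexity: "complex", vision: "continuous" },
  ],
  runsPerConfig: 3, // repeat each maze config to average out variance
  maxSteps: 200, // abort a run after this many moves
};
```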
Edit bench/src/bench/models.ts to add or remove the models being benchmarked.
Reported metrics:
- Success rate: percentage of mazes solved
- Average steps: mean steps to reach the goal (successful runs only)
- Average time: mean duration per maze
- Cost: total API cost for the benchmark run
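The metrics above can be aggregated from per-run records along these lines (the record shape here is hypothetical):

```typescript
interface RunResult {
  solved: boolean;
  steps: number;
  durationMs: number;
  costUsd: number;
}

// Aggregate per-run records into the reported metrics.
function summarize(runs: RunResult[]) {
  const ok = runs.filter((r) => r.solved);
  return {
    successRate: ok.length / runs.length,
    // Failed runs hit the step cap, so only successful runs count here.
    avgSteps: ok.length ? ok.reduce((a, r) => a + r.steps, 0) / ok.length : NaN,
    avgTimeMs: runs.reduce((a, r) => a + r.durationMs, 0) / runs.length,
    totalCostUsd: runs.reduce((a, r) => a + r.costUsd, 0),
  };
}
```

Excluding failed runs from the step average avoids rewarding models that give up early, at the cost of making the average optimistic for low-success models.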