Mergemat/maze-bench


MazeBench

A benchmark for evaluating LLM spatial reasoning and navigation abilities through maze-solving tasks.

What it measures

Honestly, I'm not sure. Most likely spatial reasoning and tool use: the model has to keep track of positions on a grid and issue its moves through tool calls.

How it works

The model receives a maze representation and must use a tool call to output movement commands (up, down, left, right) to navigate from start to goal. Performance is measured by success rate, steps taken, time, and API cost.
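The move semantics above can be sketched as a tiny simulator. The grid encoding (`#` walls, `S`/`G` markers) and the choice to ignore moves into walls are assumptions for illustration, not necessarily the benchmark's actual maze format or rules:

```typescript
// Hypothetical grid encoding: '#' wall, '.' floor, 'S' start, 'G' goal.
type Move = "up" | "down" | "left" | "right";

const DELTAS: Record<Move, [number, number]> = {
  up: [-1, 0],
  down: [1, 0],
  left: [0, -1],
  right: [0, 1],
};

// Apply a sequence of moves from the start cell. Moves into walls are
// ignored here (whether the real benchmark rejects or penalizes them is
// an open assumption). Returns whether the goal was reached and the
// number of steps actually taken.
function applyMoves(grid: string[], moves: Move[]): { solved: boolean; steps: number } {
  let r = 0, c = 0;
  grid.forEach((row, i) => {
    const j = row.indexOf("S");
    if (j >= 0) { r = i; c = j; }
  });
  let steps = 0;
  for (const m of moves) {
    const [dr, dc] = DELTAS[m];
    const nr = r + dr, nc = c + dc;
    if (grid[nr]?.[nc] !== undefined && grid[nr][nc] !== "#") {
      r = nr; c = nc; steps++;
    }
    if (grid[r][c] === "G") return { solved: true, steps };
  }
  return { solved: false, steps };
}
```

For example, `applyMoves(["#####", "#S.G#", "#####"], ["right", "right"])` reaches the goal in two steps.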

Complexity levels

  • Simple: Long corridors, few decision points
  • Complex: More branches and dead ends

Observation modes

  • Continuous: Model receives an updated view of the maze after every move, showing its new position. This tests the model's ability to iteratively navigate using feedback.
  • Initial: Model sees the maze only once at the start and must output all moves from memory. This tests the model's ability to plan a complete path upfront.
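In continuous mode the runner must re-render the maze after every accepted move so the model can observe its new position. A minimal sketch of such a renderer, where `@` marks the agent (an invented convention, not necessarily the project's):

```typescript
// Render the maze with the agent's current position shown as '@'.
// The original start marker 'S' is cleared to floor once the agent
// has moved; '@' is an illustrative convention, not the project's.
function renderView(grid: string[], r: number, c: number): string {
  return grid
    .map((row, i) =>
      row
        .split("")
        .map((ch, j) => (i === r && j === c ? "@" : ch === "S" ? "." : ch))
        .join("")
    )
    .join("\n");
}
```

After one `right` move in the row `#S..#`, the model's next observation would contain `#.@.#`.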

Maze sizes

  • 5x5 (trivial)
  • 11x11 (small)
  • 21x21 (medium)
  • 31x31 (challenging)

Project structure

mazebench/
├── bench/          # Benchmark runner (Bun + AI SDK)
│   ├── src/bench/
│   │   ├── ui/     # Interactive CLI interface (Ink + React)
│   │   └── ...     # Core benchmark logic
│   └── package.json
└── dashboard/      # Results visualization (Next.js)

Quick start

Run benchmarks

cd bench
bun install
bun run run

This launches an interactive CLI interface where you can:

  • Select a benchmark suite
  • Enter a version tag for the run
  • View real-time progress and results

Results are saved to bench/src/bench/results/.

View results

cd dashboard
bun install
bun dev

Open http://localhost:3000 to see the dashboard.

Configuration

Edit bench/src/bench/config.ts to customize:

  • Maze configurations (size, complexity, vision)
  • Number of runs per config
  • Max steps allowed
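A plausible shape for that file, purely as a sketch; the field names here are assumptions, so check `bench/src/bench/config.ts` for the real schema:

```typescript
// Illustrative config shape; the actual field names in
// bench/src/bench/config.ts may differ.
type Observation = "continuous" | "initial";
type Complexity = "simple" | "complex";

interface MazeConfig {
  size: 5 | 11 | 21 | 31; // NxN maze dimensions
  complexity: Complexity;
  observation: Observation;
}

const config = {
  mazes: [
    { size: 11, complexity: "simple", observation: "continuous" },
    { size: 21, complexity: "complex", observation: "initial" },
  ] as MazeConfig[],
  runsPerConfig: 3, // number of runs per maze configuration
  maxSteps: 200,    // give up on a run after this many moves
};
```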

Edit bench/src/bench/models.ts to add/remove models to benchmark.

Metrics

  • Success rate: % of mazes solved
  • Average steps: Mean steps to reach goal (successful runs)
  • Average time: Mean duration per maze
  • Cost: Total API cost for the benchmark run
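How these four numbers might be aggregated from per-run records; the `RunResult` shape is an assumption for illustration:

```typescript
// Hypothetical per-run record; field names are illustrative.
interface RunResult {
  solved: boolean;
  steps: number;
  durationMs: number;
  costUsd: number;
}

// Aggregate per-run records into the reported metrics. Average steps
// is computed over successful runs only, matching the definition above.
function aggregate(runs: RunResult[]) {
  const solved = runs.filter((r) => r.solved);
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  return {
    successRate: runs.length ? solved.length / runs.length : 0,
    avgSteps: solved.length ? mean(solved.map((r) => r.steps)) : null,
    avgTimeMs: runs.length ? mean(runs.map((r) => r.durationMs)) : 0,
    totalCostUsd: runs.reduce((s, r) => s + r.costUsd, 0),
  };
}
```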
