ArenAI

LLMs play social deduction board games against each other. No coaching, no strategic hints: only game rules. Watch them lie, accuse, cooperate, betray, and reveal their social intelligence (or lack thereof).

Games

Werewolf

Classic social deduction. Villagers vs. Werewolves with Seer, Witch, and Mayor roles. Wolves chat privately at night, then must act innocent during day discussions. The Seer investigates players, the Witch can save or kill, and the Mayor breaks tied votes. Win by elimination.

  • 6-20 players, scaled wolf count (2/3/4 wolves)
  • Mayor election on day 1
  • Wolf private chat creates the core tension: coordination context bleeds into public statements
  • ~50-60k tokens per game
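The Mayor's tie-breaking role can be sketched as a small tally function. This is an illustration, not the engine's actual code: the function name, the shape of `votes`, and the rule that the Mayor's own ballot decides among the tied players are all assumptions.

```javascript
// Hypothetical sketch: tally day votes and let the Mayor break ties.
// `votes` maps voter name -> target name (assumed shape).
function resolveVote(votes, mayor) {
  const counts = new Map();
  for (const target of Object.values(votes)) {
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  const top = Math.max(...counts.values());
  const leaders = [...counts.keys()].filter((name) => counts.get(name) === top);
  if (leaders.length === 1) return leaders[0];
  // Tie: assume the Mayor's own vote decides, if it names one of the tied players.
  const mayorChoice = votes[mayor];
  return leaders.includes(mayorChoice) ? mayorChoice : null;
}

const dayVotes = { Alice: 'Bob', Bob: 'Alice', Carol: 'Bob', Dave: 'Alice' };
console.log(resolveVote(dayVotes, 'Carol')); // Bob
```

Returning `null` for an unresolved tie leaves the decision (no elimination, revote) to the caller.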

Two Rooms and a Boom

Two teams, two rooms, one bomb. Blue protects the President, Red positions the Bomber. Over 3 rounds, players discuss, elect room leaders, and exchange hostages. Card sharing is verified by the game; verbal claims are not. This distinction is where the real deception happens.

  • 6-20 players, scaled hostage count
  • 3 rounds, with fewer discussion turns in each successive round
  • ~100k tokens per game

Secret Dictator

Status: under construction. The engine is implemented but has not been tested extensively yet.

Hidden roles, policy cards, legislative deception. Liberals vs. Fascists with a hidden Dictator. Each round: elect a government, draw policy cards, enact legislation. Presidential powers unlock as fascist policies pass. Configurable terminology to avoid LLM bias on loaded terms.

  • 5-20 players
  • Veto power after the 5th fascist policy
  • ~200k tokens per game

What We're Measuring

This is a benchmark, not a tutorial. LLMs receive only the game rules and their role. No strategic directives, no "you should bluff", no "protect your identity". What emerges:

| Capability | Question |
| --- | --- |
| Deception | Can a model lie convincingly when its role requires it? |
| Theory of mind | Can it reason about what others know, believe, and suspect? |
| Context isolation | Can it keep private info out of public statements? |
| Strategic inference | Can it derive optimal play from rules alone? |
| Persuasion | Can it change other players' votes through argumentation? |
| Coalition detection | Can it identify coordinated behavior among opponents? |

Private Thoughts

An optional enableThoughts mode asks each player to write a THOUGHT: section before its MESSAGE: section within a single LLM response. This adds no extra API call, only more output tokens. It serves as a decompression airlock (forcing the model to process private knowledge before speaking publicly) and gives the observer insight into the model's reasoning.
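A minimal sketch of how such a combined reply could be split. The parser name, the regular expression, and the decision to reject malformed replies are assumptions; the README only states that THOUGHT: precedes MESSAGE: in the same response.

```javascript
// Hypothetical parser for a reply of the form "THOUGHT: ... MESSAGE: ...".
// Returns null for malformed replies so a stray thought is never shown publicly.
function parseReply(raw) {
  const match = raw.match(/^\s*THOUGHT:\s*([\s\S]*?)\s*MESSAGE:\s*([\s\S]*)$/);
  if (!match) return null; // caller decides whether to retry or skip the turn
  return { thought: match[1].trim(), message: match[2].trim() };
}

const reply =
  'THOUGHT: Bob is my fellow wolf, I should deflect.\n' +
  'MESSAGE: I think Carol is acting strange.';
const parsedReply = parseReply(reply);
console.log(parsedReply.message); // I think Carol is acting strange.
```

Only `message` would be added to other players' context; `thought` would go to the observer log.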

Information Boundaries

The core challenge for LLMs in social deduction is managing what they know vs. what they should say:

| Information | Scope | Leak risk |
| --- | --- | --- |
| Role assignment | Player only | LLMs sometimes self-reveal |
| Wolf chat | Wolves only | #1 source of "Freudian slips" |
| Seer/Witch results | Role holder only | Should stay private until strategic |
| Card sharing (Two Rooms) | Two players, verified | Cannot be falsified |
| Verbal claims | All in room | Can be lies, never verified |
| Private thoughts | Observer only | Never enters any player's context |
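These scopes can be pictured as a filter applied when building each player's context. The event shape and scope names below are illustrative assumptions, not the project's actual data model.

```javascript
// Hypothetical event log with one visibility scope per event.
const events = [
  { scope: 'public', text: 'Alice: I trust Bob.' },
  { scope: 'wolves', text: 'Bob (wolf chat): target Carol tonight.' },
  { scope: 'player', owner: 'Dana', text: 'Seer result: Bob is a wolf.' },
  { scope: 'observer', text: 'Bob THOUGHT: I must not mention Carol.' },
];

// Return only the events that may enter a given player's context.
function visibleTo(player, log) {
  return log.filter((e) => {
    if (e.scope === 'public') return true;
    if (e.scope === 'wolves') return player.role === 'wolf';
    if (e.scope === 'player') return e.owner === player.name;
    return false; // 'observer' and unknown scopes never reach a player
  });
}

console.log(visibleTo({ name: 'Dana', role: 'seer' }, events).map((e) => e.text));
```

Filtering controls what a model is given, not what it says: a wolf still sees the wolf chat and can leak it in a public statement, which is the failure the table highlights.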

Quick Start

git clone https://github.com/plduhoux/arenai.git
cd arenai && npm install
cd client && npm install && npx vite build && cd ..
node server/index.js

Open http://localhost:8085. Go to Settings, add your API keys, and start playing.

Supported Providers

Anthropic, OpenAI, Google (Gemini), xAI (Grok), Moonshot (Kimi). Each model can be tested from the Settings page before running games.

Game Configuration

  • Player count: 5-20 (varies by game)
  • Model per faction: pit any model against any other
  • Discussion rounds: 1 (fast), 2 (default), 3 (thorough)
  • Battle mode: run multiple games with swapped factions for fair comparison
  • Enable thoughts: private reasoning before public statements
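Battle mode's faction swap can be sketched as a schedule generator. The function name, the faction labels, and the strict alternation are assumptions made for illustration; the actual scheduling may differ.

```javascript
// Hypothetical: build a battle schedule in which two models alternate factions,
// so each model plays each side equally often when `games` is even.
function battleSchedule(modelA, modelB, factions, games) {
  const [first, second] = factions;
  return Array.from({ length: games }, (_, i) =>
    i % 2 === 0
      ? { [first]: modelA, [second]: modelB }
      : { [first]: modelB, [second]: modelA },
  );
}

console.log(battleSchedule('model-a', 'model-b', ['wolves', 'villagers'], 4));
```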

Architecture

core/
  game-runner.js       Orchestrator (pause/resume/stop, periodic saves)
  llm-client.js        Multi-provider LLM client, token tracking, prompt caching

games/
  werewolf/            Engine + prompts + plugin
  two-rooms/           Engine + prompts + plugin
  secret-dictator/     Engine + prompts + plugin

server/
  index.js             Express 5 API + SSE streaming
  db.js                SQLite + migrations
  elo.js               ELO ratings (K=32, base 1500)
  token-stats.js       Token usage analytics

client/                Vue 3 + Vite SPA
  src/views/           About, Dashboard, Game, NewGame, Stats, Settings
  src/components/      LiveFeed, StatusBar, RoundCards, PlayerChip, EloTable

scripts/
  generate-static.js   Export saved games as a static site (no backend needed)
  deploy-static.sh     Generate + deploy via SFTP

Plugin Interface

Each game exports a standard interface: setup(), getCurrentPhase(), isOver(), getDisplayState(). Phases are async functions that emit events via onEvent(); these events stream to the frontend over SSE in real time and are persisted for historical replay.
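A toy plugin showing how the four functions could fit together. Only the function names come from the interface described above; the argument shapes, the state object, and the runner loop are assumptions.

```javascript
// Toy plugin illustrating the four interface functions (signatures assumed).
const toyPlugin = {
  setup(config) {
    return { round: 0, maxRounds: config.rounds };
  },
  // Returns the async phase function to run next.
  getCurrentPhase(state) {
    return async function discussionPhase(s, onEvent) {
      s.round += 1;
      onEvent({ type: 'round', round: s.round });
    };
  },
  isOver(state) {
    return state.round >= state.maxRounds;
  },
  getDisplayState(state) {
    return { round: state.round, finished: this.isOver(state) };
  },
};

// Minimal runner: execute phases until the plugin reports the game is over.
async function runGame(plugin, config, onEvent) {
  const state = plugin.setup(config);
  while (!plugin.isOver(state)) {
    const phase = plugin.getCurrentPhase(state);
    await phase(state, onEvent);
  }
  return plugin.getDisplayState(state);
}

runGame(toyPlugin, { rounds: 3 }, (e) => console.log(e)).then((s) => console.log(s));
```

In a real server, the `onEvent` callback is where events would be written to the SSE stream and to storage.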

Static Site / Showcase Mode

Save your best games (star button on the dashboard), then generate a fully static site:

node scripts/generate-static.js    # exports saved games + pre-computed stats
./scripts/deploy-static.sh         # generate + deploy via SFTP

The static site has the same UI but no backend: all data is pre-generated JSON. Stats, ELO, and token usage are included. Deployable anywhere (OVH, GitHub Pages, Netlify).

Token Optimization

  • Simultaneous voting and actions (Promise.all)
  • Rebuttals only for mentioned players
  • Historical context compression: recent rounds in full, older rounds summarized
  • Prompt caching via provider APIs
  • Thoughts optional and off by default
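Two of these techniques can be sketched briefly. `askVote` is a stand-in for a real LLM request, and the round shape and `summarize` placeholder are hypothetical; only the use of Promise.all and the "recent in full, older summarized" policy come from the list above.

```javascript
// Hypothetical stand-in for an LLM call; a real client would hit a provider API.
async function askVote(player) {
  return { voter: player, target: player === 'Alice' ? 'Bob' : 'Alice' };
}

// Simultaneous voting: start every request at once, then wait for all of them.
async function collectVotes(players) {
  return Promise.all(players.map((p) => askVote(p)));
}

// History compression: keep the last `keep` rounds verbatim, summarize older ones.
// The default `summarize` is a placeholder that keeps only the outcome.
function compressHistory(rounds, keep = 2, summarize = (r) => `Round ${r.n}: ${r.outcome}`) {
  const cutoff = Math.max(0, rounds.length - keep);
  return rounds.map((r, i) => (i < cutoff ? summarize(r) : r.transcript));
}

const rounds = [
  { n: 1, outcome: 'Eve eliminated', transcript: 'Round 1 full transcript' },
  { n: 2, outcome: 'Dan eliminated', transcript: 'Round 2 full transcript' },
  { n: 3, outcome: 'no elimination', transcript: 'Round 3 full transcript' },
];
console.log(compressHistory(rounds));
collectVotes(['Alice', 'Bob', 'Carol']).then((v) => console.log(v));
```

Promise.all preserves input order, so results line up with the player list even though the requests run concurrently.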

ELO System

Per-model ratings (K=32, base 1500) with per-role breakdown (wolf ELO, villager ELO, etc.). Updated after each game based on expected vs. actual outcome.
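The update can be sketched with the standard Elo formula. K and the base rating come from the text above; the 400-point scale and the way a faction's win is attributed to a model are assumptions about the implementation.

```javascript
const K = 32; // K-factor, from the README
const BASE = 1500; // starting rating, from the README

// Expected score of A against B under the standard Elo model (400-point scale assumed).
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// scoreA: 1 if A's side won, 0 if it lost (0.5 would be a draw).
function updateElo(ratingA, ratingB, scoreA) {
  const expA = expectedScore(ratingA, ratingB);
  return {
    a: ratingA + K * (scoreA - expA),
    b: ratingB + K * ((1 - scoreA) - (1 - expA)),
  };
}

console.log(updateElo(BASE, BASE, 1)); // { a: 1516, b: 1484 }
```

A per-role breakdown could reuse the same functions, with one rating tracked per (model, role) pair.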

Example Games

Watch full game replays on the live showcase.

See also the benchmark plan for the full round-robin protocol (5 frontier models, 3 games, 300 matches). For curated highlights of LLM blunders and brilliant plays, check the notable moments.

Data

NEVER delete data/games.db. All games, logs, stats, ELO, and token usage live there. The schema is created automatically on first run; later changes are applied through migrations only.

Stack

Node.js, Express 5, SQLite (better-sqlite3), Vue 3, Vite. Multi-provider LLM support (Anthropic SDK, OpenAI SDK, Google GenAI SDK).

Inspiration

  • Foaster.ai Werewolf Bench: Werewolf benchmark for LLMs. Their setup inspired our Mayor election and wolf private chat mechanics.
  • The real board games: Les Loups-Garous de Thiercelieux, Secret Hitler (Goat Wolf & Cabbage), Two Rooms and a Boom (Tuesday Knight Games).
