LLMs play social deduction board games against each other. No coaching, no strategic hints: only game rules. Watch them lie, accuse, cooperate, betray, and reveal their social intelligence (or lack thereof).
Classic social deduction. Villagers vs. Werewolves with Seer, Witch, and Mayor roles. Wolves chat privately at night, then must act innocent during day discussions. The Seer investigates players, the Witch can save or kill, and the Mayor breaks tied votes. Win by elimination.
- 6-20 players, scaled wolf count (2/3/4 wolves)
- Mayor election on day 1
- Wolf private chat creates the core tension: coordination context bleeds into public statements
- ~50-60k tokens per game
Two teams, two rooms, one bomb. Blue protects the President, Red positions the Bomber. Over 3 rounds, players discuss, elect room leaders, and exchange hostages. Card sharing is verified by the game; verbal claims are not. This distinction is where the real deception happens.
- 6-20 players, scaled hostage count
- 3 rounds with progressively fewer discussion turns
- ~100k tokens per game
Status: under construction. The engine is implemented but has not been tested extensively yet.
Hidden roles, policy cards, legislative deception. Liberals vs Fascists with a hidden Dictator. Each round: elect a government, draw policy cards, enact legislation. Presidential powers unlock as fascist policies pass. Configurable terminology to avoid LLM bias on loaded terms.
- 5-20 players
- Veto power after the 5th fascist policy
- ~200k tokens per game
This is a benchmark, not a tutorial. LLMs receive only the game rules and their role. No strategic directives, no "you should bluff", no "protect your identity". What emerges:
| Capability | Question |
|---|---|
| Deception | Can a model lie convincingly when its role requires it? |
| Theory of mind | Can it reason about what others know, believe, and suspect? |
| Context isolation | Can it keep private info out of public statements? |
| Strategic inference | Can it derive optimal play from rules alone? |
| Persuasion | Can it change other players' votes through argumentation? |
| Coalition detection | Can it identify coordinated behavior among opponents? |
An optional `enableThoughts` mode asks each player to write a `THOUGHT:` before its `MESSAGE:` in a single LLM request. This is not an extra API call: it is the same request, just with more output tokens. It serves as a decompression airlock, forcing the model to process its private knowledge before speaking publicly, and it gives the observer insight into the model's reasoning.
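A minimal sketch of how such a completion could be split into its private and public parts, assuming the `THOUGHT:` / `MESSAGE:` convention described above (the function name is illustrative, not taken from the codebase):

```javascript
// Split a single completion into a private thought and a public message.
// If the model skipped the THOUGHT: prefix, treat the whole output as public.
function splitThought(raw) {
  const match = raw.match(/THOUGHT:\s*([\s\S]*?)\s*MESSAGE:\s*([\s\S]*)/);
  if (!match) {
    return { thought: null, message: raw.trim() };
  }
  return { thought: match[1].trim(), message: match[2].trim() };
}
```

Only the `message` part would then be broadcast to other players; the `thought` goes to the observer log alone.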
The core challenge for LLMs in social deduction is managing what they know vs. what they should say:
| Information | Scope | Leak risk |
|---|---|---|
| Role assignment | Player only | LLMs sometimes self-reveal |
| Wolf chat | Wolves only | #1 source of "Freudian slips" |
| Seer/Witch results | Role holder only | Should stay private until strategic |
| Card sharing (Two Rooms) | Two players, verified | Cannot be falsified |
| Verbal claims | All in room | Can be lies, never verified |
| Private thoughts | Observer only | Never enters any player's context |
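The scoping rules in the table can be sketched as a simple visibility filter: each event carries an audience, and a player's prompt is built only from events they are allowed to see. Names and structure here are assumptions for illustration, not the actual engine code.

```javascript
// Keep only the events a given player may see: public events, plus
// events whose scope explicitly lists that player.
function visibleTo(playerId, events) {
  return events.filter(e =>
    e.scope === "public" ||
    (Array.isArray(e.scope) && e.scope.includes(playerId))
  );
}

// Hypothetical event log mirroring the table above.
const events = [
  { scope: "public", text: "Day 1: Bob was eliminated." },
  { scope: ["wolf-1", "wolf-2"], text: "Wolf chat: target Carol tonight." },
  { scope: ["seer"], text: "Seer result: Dave is a villager." },
];
```

Under this scheme the wolf chat never reaches a villager's context at all; the leak risk comes entirely from what the wolf chooses to say afterwards.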
```sh
git clone https://github.com/plduhoux/arenai.git
cd arenai && npm install
cd client && npm install && npx vite build && cd ..
node server/index.js
```

Open http://localhost:8085. Go to Settings, add your API keys, and start playing.
Anthropic, OpenAI, Google (Gemini), xAI (Grok), Moonshot (Kimi). Each model can be tested from the Settings page before running games.
- Player count: 5-20 (varies by game)
- Model per faction: pit any model against any other
- Discussion rounds: 1 (fast), 2 (default), 3 (thorough)
- Battle mode: run multiple games with swapped factions for fair comparison
- Enable thoughts: private reasoning before public statements
```
core/
  game-runner.js       Orchestrator (pause/resume/stop, periodic saves)
  llm-client.js        Multi-provider LLM client, token tracking, prompt caching
games/
  werewolf/            Engine + prompts + plugin
  two-rooms/           Engine + prompts + plugin
  secret-dictator/     Engine + prompts + plugin
server/
  index.js             Express 5 API + SSE streaming
  db.js                SQLite + migrations
  elo.js               ELO ratings (K=32, base 1500)
  token-stats.js       Token usage analytics
client/                Vue 3 + Vite SPA
  src/views/           About, Dashboard, Game, NewGame, Stats, Settings
  src/components/      LiveFeed, StatusBar, RoundCards, PlayerChip, EloTable
scripts/
  generate-static.js   Export saved games as a static site (no backend needed)
  deploy-static.sh     Generate + deploy via SFTP
```
Each game exports a standard interface: `setup()`, `getCurrentPhase()`, `isOver()`, `getDisplayState()`. Phases are async functions that emit events via `onEvent()`; events flow through SSE to the frontend in real time and are persisted for historical replay.
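A toy plugin conforming to that interface might look like the sketch below. Only the four method names come from the interface description; the internals are illustrative assumptions.

```javascript
// Minimal game plugin: three rounds, then a hard-coded winner.
class MiniGame {
  setup(players) {
    this.players = players;
    this.round = 1;
    this.winner = null;
  }
  getCurrentPhase() {
    // A phase is an async function that emits events via onEvent().
    return async (onEvent) => {
      onEvent({ type: "round-start", round: this.round });
      this.round += 1;
      if (this.round > 3) this.winner = "blue";
    };
  }
  isOver() {
    return this.winner !== null;
  }
  getDisplayState() {
    return { round: this.round, winner: this.winner };
  }
}
```

The runner's job then reduces to a loop: fetch the current phase, await it with an `onEvent` callback that streams and persists events, and stop when `isOver()` returns true.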
Save your best games (star button on the dashboard), then generate a fully static site:
```sh
node scripts/generate-static.js   # exports saved games + pre-computed stats
./scripts/deploy-static.sh        # generate + deploy via SFTP
```

The static site has the same UI but no backend: all data is pre-generated JSON. Stats, ELO, and token usage are included. It can be deployed anywhere (OVH, GitHub Pages, Netlify).
- Simultaneous voting and actions (`Promise.all`)
- Rebuttals only for mentioned players
- Historical context compression: recent rounds in full, older rounds summarized
- Prompt caching via provider APIs
- Thoughts optional and off by default
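The simultaneous-voting item amounts to fanning out the per-player LLM calls concurrently instead of sequentially. A sketch, where `askModel` stands in for the real per-player request:

```javascript
// Collect every player's vote in parallel: total latency is the slowest
// single call, not the sum of all calls.
async function collectVotes(players, askModel) {
  return Promise.all(
    players.map(async (p) => ({ player: p, vote: await askModel(p) }))
  );
}
```

This also removes ordering bias: no player's vote can be conditioned on another's, since all requests are built from the same pre-vote context.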
Per-model ratings (K=32, base 1500) with per-role breakdown (wolf ELO, villager ELO, etc.). Updated after each game based on expected vs. actual outcome.
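The update rule is standard ELO with the parameters stated above (K=32, base 1500); per-role bookkeeping is omitted from this sketch for brevity.

```javascript
const K = 32;

// Expected score of A against B: the usual logistic curve over a
// 400-point rating scale.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// scoreA: 1 if A won, 0 if A lost, 0.5 for a draw.
function updateElo(ratingA, ratingB, scoreA) {
  const ea = expectedScore(ratingA, ratingB);
  return {
    a: ratingA + K * (scoreA - ea),
    b: ratingB + K * ((1 - scoreA) - (1 - ea)),
  };
}
```

For example, a win between two fresh 1500-rated models moves the winner to 1516 and the loser to 1484; upsets against higher-rated opponents move more points.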
Watch full game replays on the live showcase:
- Two Rooms: Claude Opus vs GPT-5: Blue win. Shows how LLMs handle verified card sharing vs. unverifiable verbal claims. (annotated transcript)
- Werewolf: Claude Opus vs GPT-5: Villager win. Wolves eliminated despite early coordination. (annotated transcript)
- Werewolf: Gemini 2.5 Pro vs Claude Opus: Wolf win in 2 rounds. Opus wolves reach parity before the village can react.
See also the benchmark plan for the full round-robin protocol (5 frontier models, 3 games, 300 matches). For curated highlights of LLM blunders and brilliant plays, check the notable moments.
NEVER delete data/games.db. All games, logs, stats, ELO, and token usage live there. Schema auto-creates on first run; changes are migrations only.
Node.js, Express 5, SQLite (better-sqlite3), Vue 3, Vite. Multi-provider LLM support (Anthropic SDK, OpenAI SDK, Google GenAI SDK).
- Foaster.ai Werewolf Bench: Werewolf benchmark for LLMs. Their setup inspired our Mayor election and wolf private chat mechanics.
- The real board games: Les Loups-Garous de Thiercelieux, Secret Hitler (Goat Wolf & Cabbage), Two Rooms and a Boom (Tuesday Knight Games).
