Hobby platform for analyzing ARC puzzles with multi-provider LLMs, reasoning capture, conversation chaining, and performance analytics.
Production: https://arc.markbarney.net
Staging: https://arc-explainer-staging.up.railway.app/ (branch ARC3)
Docs: CLAUDE.md • API Reference • Changelog
```bash
# Clone and install
git clone <repository-url> arc-explainer
cd arc-explainer
git submodule update --init --recursive
npm install
```

```bash
# Minimal .env (root)
OPENAI_API_KEY=your_key_here        # needed for OpenAI + Responses API
OPENROUTER_API_KEY=your_key_if_used # optional; BYOK enforced in prod
DATABASE_URL=postgresql://...       # optional for local DB-backed features
```

```bash
# Run development server
npm run dev        # allow ~10s to warm up, then open localhost:5173

# Or build and run dev server
npm run build-dev
```

More detail: CLAUDE.md and docs/reference/architecture/DEVELOPER_GUIDE.md.
- Production enforces Bring Your Own Key for paid providers (OpenAI, xAI, Anthropic, Google, DeepSeek, OpenRouter). Keys are session-only, never stored.
- Dev/staging: server keys may exist, but tests should work with your own keys too.
- Worm Arena & Poetiq flows accept user-supplied keys via the UI; the backend injects them per session (see docs/reference/api/EXTERNAL_API.md and docs/reference/api/SnakeBench_WormArena_API.md; a hypothetical sketch follows this list).
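For illustration only, a session-scoped key might accompany a solve request like this. The header and body field names below are hypothetical, not the documented contract; check EXTERNAL_API.md for the real BYOK shape (the `/api/poetiq/solve/:taskId` route itself is listed under the API endpoints below):

```ts
// Hypothetical BYOK sketch: header and body field names are assumptions,
// not the documented contract (see docs/reference/api/EXTERNAL_API.md).
await fetch("/api/poetiq/solve/007bbfb7", { // hypothetical taskId
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-provider-api-key": "sk-...", // hypothetical header; keys are session-only, never stored
  },
  body: JSON.stringify({ provider: "openai" }), // hypothetical field
});
```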
- Puzzle Analyst: `/task/:taskId` - high-density grid of analyses.
- RE-ARC Bench: `/re-arc` - generate unique evaluation datasets and validate solver submissions.
- Worm Arena: `/worm-arena` (replays), `/worm-arena/live/:sessionId` (live), `/worm-arena/stats` (leaderboard).
- ARC3 playground: `/arc3/playground` - watch agents solve real ARC-AGI-3 games.
- APIs: start with `/api/health`, then `/api/puzzle/overview`; see EXTERNAL_API.md for the full surface area. A quick probe is sketched below.
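A minimal probe of those two endpoints, assuming the dev server from the quick start is running and serving `/api` at localhost:5173:

```ts
// Quick API probe (sketch): assumes `npm run dev` is serving /api
// on localhost:5173, as in the quick start above.
const base = "http://localhost:5173";

async function probe(): Promise<void> {
  const health = await fetch(`${base}/api/health`);
  console.log("health:", health.status, await health.json());

  const overview = await fetch(`${base}/api/puzzle/overview`);
  console.log("overview:", overview.status, await overview.json());
}

probe().catch(console.error);
```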
- Architecture & patterns: Developer Guide (SRP, repositories, services, streaming).
- Hooks reference: frontend hooks.
- SnakeBench/Worm Arena API: SnakeBench_WormArena_API.md.
- BYOK details: EXTERNAL_API.md.
- Data: ARC puzzles under `data/`; SnakeBench replays under `external/SnakeBench/backend/completed_games`.
- Streaming contract: see the Responses API docs in `docs/reference/api/` (ResponsesAPI.md, OpenAI_Responses_API_Streaming_Implementation.md).
- Staging: Railway at `arc-explainer-staging.up.railway.app`, tracking branch `ARC3`.
- Production: auto-deploys from `main`. Use PRs into `ARC3`; do not push breaking changes directly to `main`.
- Env flags: `ENABLE_SSE_STREAMING` (server), `VITE_ENABLE_SSE_STREAMING` (client).
- Frontend: React 18 + TypeScript + Vite + TailwindCSS + DaisyUI components
- Backend: Express.js + TypeScript + PostgreSQL (Drizzle ORM) + in-memory fallback
- AI Integration: Unified BaseAIService pattern supporting 6+ providers
- Real-time: WebSocket streaming for Saturn solver and batch progress
- Deployment: Railway-ready with Docker support
- Repository pattern - Clean separation between data access and business logic
- Provider abstraction - Unified interface across OpenAI, Anthropic, xAI, etc. (see the sketch after this list)
- Optimistic updates - Instant UI feedback with server reconciliation
- Response preservation - Raw API responses saved before parsing for debugging
- Conversation chaining - Provider-aware context management with 30-day persistence
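As a rough illustration of the provider abstraction and response preservation patterns above, here is a minimal sketch; the method and type names are hypothetical, not the actual `BaseAIService` interface (see the Developer Guide for that):

```ts
// Sketch only: hypothetical shapes illustrating the BaseAIService pattern.
interface AnalysisResult {
  modelKey: string;
  explanation: string;
  rawResponse: unknown; // raw payload kept before parsing ("response preservation")
}

abstract class BaseAIService {
  constructor(protected readonly apiKey: string) {}

  // Each provider (OpenAI, Anthropic, xAI, ...) supplies its own transport.
  protected abstract callModel(modelKey: string, prompt: string): Promise<unknown>;

  // Shared flow: call the provider, preserve the raw response, then parse.
  async analyze(modelKey: string, prompt: string): Promise<AnalysisResult> {
    const raw = await this.callModel(modelKey, prompt);
    return { modelKey, explanation: this.parse(raw), rawResponse: raw };
  }

  protected parse(raw: unknown): string {
    return typeof raw === "string" ? raw : JSON.stringify(raw);
  }
}
```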
- Home / puzzles: `/`, `/browser`, `/task/:taskId` (new default - Puzzle Analyst), `/puzzle/:taskId` (legacy - PuzzleExaminer), `/examine/:taskId`, `/puzzles/database`
- Discussion: `/discussion`, `/discussion/:taskId`
- Analytics / rankings: `/analytics`, `/leaderboards`, `/elo`, `/elo/leaderboard`, `/elo/:taskId`, `/compare`, `/compare/:taskId`
- Feedback / debate: `/feedback`, `/test-solution`, `/test-solution/:taskId`, `/debate`, `/debate/:taskId`
- Models: `/models`, `/model-config`, `/model-comparison`
- Solvers: `/puzzle/saturn/:taskId`, `/puzzle/grover/:taskId`, `/puzzle/beetree/:taskId?`, `/puzzle/poetiq/:taskId`, `/poetiq`
- RE-ARC Bench (new - community testing): `/re-arc` - generate datasets and evaluate submissions
- ARC3: `/arc3`, `/arc3/playground`, `/arc3/games`, `/arc3/games/:gameId`
- Worm Arena / SnakeBench: `/snakebench`, `/snake-arena` (redirect), `/worm-arena`, `/worm-arena/live`, `/worm-arena/live/:sessionId`, `/worm-arena/matches`, `/worm-arena/stats`, `/worm-arena/models` (new - model match history), `/worm-arena/rules` (new - rules & prompt transparency)
- Admin: `/admin`, `/admin/models`, `/admin/ingest-hf`, `/admin/openrouter`
- Other: `/trading-cards`, `/hall-of-fame`, `/human-cards` (redirect), `/kaggle-readiness`, `/scoring`, `/about`, `/llm-reasoning`, `/llm-reasoning/advanced`, plus a catch-all 404
- Health: `GET /api/health`
- Models: `GET /api/models`, `GET /api/models/:modelKey`, `GET /api/models/provider/:provider`
- Model management (GUI): `GET /api/model-management/list`, `GET /api/model-management/stats`, `GET /api/model-management/search`, `POST /api/model-management/validate`, `POST /api/model-management/toggle-active`, `POST /api/model-management/create-alias`, `POST /api/model-management/add`, `PUT /api/model-management/notes`, `DELETE /api/model-management/delete`, `GET /api/model-management/openrouter-models`
- ARC puzzles: `GET /api/puzzle/list`, `GET /api/puzzle/overview`, `GET /api/puzzle/task/:taskId`, `POST /api/puzzle/bulk-status`, `POST /api/puzzle/analyze/:taskId/:model`, `POST /api/puzzle/analyze-list`, `GET /api/puzzle/:puzzleId/has-explanation`, `POST /api/puzzle/reinitialize`, `POST /api/puzzle/validate` (returns 501)
  - Stats: `GET /api/puzzle/accuracy-stats`, `GET /api/puzzle/general-stats`, `GET /api/puzzle/raw-stats`, `GET /api/puzzle/performance-stats`, `GET /api/puzzle/performance-stats-filtered`, `GET /api/puzzle/trustworthiness-stats-filtered`, `GET /api/puzzle/confidence-stats`, `GET /api/puzzle/worst-performing`, `GET /api/puzzles/stats`
- Generic analysis SSE (consumption sketched below): `POST /api/stream/analyze`, `GET /api/stream/analyze/:taskId/:modelKey/:sessionId`, `DELETE /api/stream/analyze/:sessionId`, `POST /api/stream/cancel/:sessionId`
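A sketch of consuming that SSE stream from the browser; the path parameters and event payloads here are assumptions, so see the Responses API docs in `docs/reference/api/` for the actual streaming contract:

```ts
// SSE consumption sketch: IDs and payload shape are assumptions.
const url = "/api/stream/analyze/007bbfb7/gpt-4.1/my-session-id"; // hypothetical taskId/modelKey/sessionId
const source = new EventSource(url);

source.onmessage = (event: MessageEvent<string>) => {
  console.log("chunk:", event.data); // incremental analysis output
};

source.onerror = () => {
  source.close(); // close when the stream ends or errors
};
```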
- Discussion: `GET /api/discussion/eligible`
- Metrics & cost: `GET /api/metrics/reliability`, `GET /api/metrics/comprehensive-dashboard`, `GET /api/metrics/compare`, `GET /api/metrics/costs/models`, `GET /api/metrics/costs/models/map`, `GET /api/metrics/costs/models/:modelName`, `GET /api/metrics/costs/models/:modelName/trends`, `GET /api/metrics/costs/system/stats`
- Model dataset performance: `GET /api/model-dataset/performance/:modelName/:datasetName`, `GET /api/model-dataset/models`, `GET /api/model-dataset/datasets`, `GET /api/model-dataset/metrics/:modelName/:datasetName`
- Prompts: `POST /api/prompt/preview/:provider/:taskId`, `GET /api/prompts`, `POST /api/prompt-preview`
- Explanations: `GET /api/puzzle/:puzzleId/explanations/summary`, `GET /api/puzzle/:puzzleId/explanations`, `GET /api/puzzle/:puzzleId/explanation`, `GET /api/explanations/:id`, `POST /api/puzzle/save-explained/:puzzleId`
  - Rebuttal chain: `GET /api/explanations/:id/chain`, `GET /api/explanations/:id/original`
- Feedback + solutions: `POST /api/feedback`, `GET /api/feedback`, `GET /api/feedback/stats`, `GET /api/feedback/accuracy-stats`, `GET /api/feedback/accuracy-stats-filtered`, `GET /api/feedback/overconfident-models`, `GET /api/feedback/debate-accuracy-stats`, `GET /api/explanation/:explanationId/feedback`, `GET /api/puzzle/:puzzleId/feedback`, `GET /api/puzzles/:puzzleId/solutions`, `POST /api/puzzles/:puzzleId/solutions`, `POST /api/solutions/:solutionId/vote`, `GET /api/solutions/:solutionId/votes`
- ELO: `GET /api/elo/comparison`, `GET /api/elo/comparison/:puzzleId`, `POST /api/elo/vote`, `GET /api/elo/leaderboard`, `GET /api/elo/models`, `GET /api/elo/stats`
- Saturn: `POST /api/saturn/analyze/:taskId`, `GET /api/stream/saturn/:taskId/:modelKey`, `POST /api/saturn/analyze-with-reasoning/:taskId`, `GET /api/saturn/status/:sessionId`
- Grover: `POST /api/puzzle/grover/:taskId/:modelKey`, `GET /api/stream/grover/:taskId/:modelKey`, `GET /api/grover/status/:sessionId`
- Poetiq: `POST /api/poetiq/solve/:taskId`, `POST /api/poetiq/batch`, `GET /api/poetiq/batch/:sessionId`, `GET /api/poetiq/status/:sessionId`, `GET /api/poetiq/models`, `GET /api/poetiq/community-progress`, `GET /api/poetiq/stream/:sessionId`, `POST /api/poetiq/stream/solve/:taskId`, `POST /api/poetiq/stream/start/:sessionId`
- Beetree: `POST /api/beetree/run`, `GET /api/beetree/status/:sessionId`, `POST /api/beetree/estimate`, `GET /api/beetree/history/:taskId`, `GET /api/beetree/cost-breakdown/:explanationId`, `POST /api/beetree/cancel/:sessionId`, `GET /api/stream/analyze/beetree-:sessionId`
- SnakeBench: `GET /api/snakebench/models-with-games` (new), `GET /api/snakebench/model-history-full` (new), `GET /api/snakebench/model-insights` (new), `GET /api/snakebench/llm-player/prompt-template` (new), `POST /api/snakebench/run-match`, `POST /api/snakebench/run-batch`, `GET /api/snakebench/games`, `GET /api/snakebench/games/:gameId`, `GET /api/snakebench/matches`, `GET /api/snakebench/health`, `GET /api/snakebench/recent-activity`, `GET /api/snakebench/leaderboard`, `GET /api/snakebench/stats`, `GET /api/snakebench/model-rating`, `GET /api/snakebench/model-history`, `GET /api/snakebench/greatest-hits`, `GET /api/snakebench/trueskill-leaderboard`
- Worm Arena Live SSE: `POST /api/wormarena/prepare`, `GET /api/wormarena/stream/:sessionId`
- ARC3: `GET /api/arc3/default-prompt`, `GET /api/arc3/system-prompts`, `GET /api/arc3/system-prompts/:id`, `GET /api/arc3/games`, `POST /api/arc3/start-game`, `POST /api/arc3/manual-action`, `POST /api/arc3/real-game/run`, `POST /api/arc3/stream/prepare`, `GET /api/arc3/stream/:sessionId`, `POST /api/arc3/stream/cancel/:sessionId`, `POST /api/arc3/stream/:sessionId/continue`, `GET /api/arc3/stream/:sessionId/continue-stream`
- Batch: `POST /api/batch/start`, `GET /api/batch/status/:sessionId`, `POST /api/batch/pause/:sessionId`, `POST /api/batch/resume/:sessionId`, `GET /api/batch/results/:sessionId`, `GET /api/batch/sessions`
- Admin: `GET /api/admin/quick-stats`, `GET /api/admin/recent-activity`, `POST /api/admin/validate-ingestion`, `POST /api/admin/start-ingestion`, `GET /api/admin/ingestion-history`, `GET /api/admin/hf-folders`
  - OpenRouter admin: `GET /api/admin/openrouter/catalog`, `GET /api/admin/openrouter/discover`, `POST /api/admin/openrouter/import`, `GET /api/admin/openrouter/sync-config`
  - Recovery helpers: `GET /api/admin/recovery-stats`, `POST /api/admin/recover-multiple-predictions`
This platform enables systematic study of AI reasoning capabilities on abstract visual patterns:
- Model comparison - Evaluate reasoning across GPT-5, o-series, Grok-4, Claude, Gemini, DeepSeek
- Cost-performance analysis - Token usage vs. accuracy trade-offs for different providers
- Confidence calibration - Study overconfidence patterns and trustworthiness scoring
- Reasoning depth - Analyze structured thinking from models with reasoning token support
- Conversation dynamics - Track how context affects progressive reasoning refinement
- Batch evaluation - Large-scale systematic testing across 1,000+ puzzles
- Unrestricted API - Full programmatic access to all analyses and metrics
- HuggingFace integration - Import external predictions for comparative analysis
- Raw response storage - Complete API payloads preserved for custom analysis
- Custom prompts - Design specialized evaluation frameworks
API Documentation: docs/reference/api/EXTERNAL_API.md
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark for testing fluid intelligence in AI systems.
- ARC-AGI-1: 400 training + 400 evaluation puzzles
- ARC-AGI-2: 1,000 training + 120 evaluation puzzles (public)
- Private test sets: Semi-private (commercial) and fully-private (competition) sets calibrated to same difficulty
Each puzzle consists of the following (a TypeScript sketch of the JSON shape follows this list):
- Training examples: input/output pairs (typically around three) demonstrating the pattern
- Test cases: 1-2 input grids requiring output prediction
- Grids: rectangular matrices (1x1 to 30x30) with integers 0-9 (visualized as colors or emojis)
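The task files under `data/` follow the standard ARC JSON layout; here is a TypeScript rendering of that shape (interface names are illustrative, not the project's own types):

```ts
// Standard ARC task layout; interface names are illustrative only.
interface GridPair {
  input: number[][];  // rectangular grid, 1x1 up to 30x30, cells 0-9
  output: number[][];
}

interface ArcTask {
  train: GridPair[]; // demonstration pairs (typically around three)
  test: GridPair[];  // 1-2 pairs; solvers must predict each `output`
}
```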
- Predict exact output grid dimensions and all cell values (exact-match scoring is sketched after this list)
- 2 attempts allowed per test input
- Must work on first encounter with the puzzle
- Human performance: ~66% on evaluation set
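Exact-match scoring as a small self-contained sketch: a prediction counts only if the dimensions and every cell agree.

```ts
// A prediction is correct only if grid dimensions and all cells match.
function exactMatch(pred: number[][], expected: number[][]): boolean {
  return (
    pred.length === expected.length &&
    pred.every(
      (row, r) =>
        row.length === expected[r].length &&
        row.every((cell, c) => cell === expected[r][c])
    )
  );
}
```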
```
data/
├── training/     # ARC-AGI-1 training tasks (400)
├── evaluation/   # ARC-AGI-1 evaluation tasks (400)
├── evaluation2/  # ARC-AGI-2 public evaluation tasks (120)
└── training2/    # ARC-AGI-2 training tasks (1,000)
```
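Loading a task from that tree (the filename below is a placeholder, and the type repeats the `ArcTask` shape sketched earlier for self-containment):

```ts
import { readFileSync } from "node:fs";

// Repeats the ArcTask shape from the sketch above.
type ArcTask = {
  train: { input: number[][]; output: number[][] }[];
  test: { input: number[][]; output: number[][] }[];
};

// Placeholder filename; substitute any task id present in the directory.
const task: ArcTask = JSON.parse(
  readFileSync("data/evaluation2/00000000.json", "utf8")
);
console.log(`train pairs: ${task.train.length}, test pairs: ${task.test.length}`);
```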
Read the ARC-AGI-2 paper: arxiv.org/pdf/2505.11831
- ARC puzzle GIF generator: `.claude/skills/slack-gif-creator/create_arc_puzzle_gif.py <puzzle_id>` → `arc_puzzle_<id>.gif` (requires `pillow`, `imageio`, `numpy`).
- Feature flags and toggles: see `shared/utils/featureFlags.ts` and `shared/config/streaming.ts`.
Contributions welcome. Start with CLAUDE.md for coding standards, SRP/DRY expectations, and streaming requirements. Release notes live in CHANGELOG.md; the feature history that used to be in this README now lives there too.