Run ARC-AGI tasks against multiple model adapters (OpenAI, Anthropic, Gemini, Fireworks, Grok, OpenRouter, X.AI, custom, etc.) with built-in rate limiting, retries, and scoring.
- Clone this repo:

```bash
git clone https://github.com/arcprize/arc-agi-benchmarking.git
cd arc-agi-benchmarking
```

- Install (installs all adapters + SDKs):

```bash
pip install .
```

- Single-task dry run (no API keys) with the local `random-baseline` adapter:
```bash
python main.py \
  --data_dir data/sample/tasks \
  --config random-baseline \
  --task_id 66e6c45b \
  --save_submission_dir submissions/random-single \
  --log-level INFO
```

- Run all bundled sample tasks with the random solver:
```bash
python cli/run_all.py \
  --config random-baseline \
  --data_dir data/sample/tasks \
  --save_submission_dir submissions/random-baseline-sample \
  --log-level INFO
```

- Score the outputs you just generated:
```bash
python src/arc_agi_benchmarking/scoring/scoring.py \
  --task_dir data/sample/tasks \
  --submission_dir submissions/random-baseline-sample \
  --results_dir results/random-baseline-sample
```

If using the random solver, expect all attempts to be incorrect.
If you want to run real models, change the config and add the corresponding API keys (see Data and Config sections below).
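For example, a quick stand-alone check (not part of this repo) that the keys for the providers you plan to use are exported:

```python
# Stand-alone sanity check (not part of this repo): confirm the provider API
# keys you plan to use are exported before switching off the random baseline.
import os

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

for provider, var in PROVIDER_KEYS.items():
    print(f"{provider:10s} {var}: {'set' if os.getenv(var) else 'MISSING'}")
```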
Rather than using the sample data in `data/sample/tasks/`, you can use the real ARC-AGI tasks from the following repositories:

- ARC-AGI-1 (2019):

```bash
git clone https://github.com/fchollet/ARC-AGI.git data/arc-agi-1
```

- ARC-AGI-2 (2025):

```bash
git clone https://github.com/arcprize/ARC-AGI-2.git data/arc-agi-2
```
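Each task file is a small JSON document with `train` and `test` lists of input/output grid pairs (grids are lists of lists of integers). A minimal stand-alone loader sketch, assuming that layout:

```python
# Minimal sketch: inspect one ARC task file. Assumes the standard layout of
# {"train": [...], "test": [...]} where each item holds "input"/"output" grids.
import json
from pathlib import Path

task_path = Path("data/sample/tasks/66e6c45b.json")  # sample task from the Quickstart
task = json.loads(task_path.read_text())

for split in ("train", "test"):
    for i, pair in enumerate(task.get(split, [])):
        grid = pair["input"]
        print(f"{split}[{i}]: input grid {len(grid)}x{len(grid[0])}")
```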
- `--data_dir`: Folder containing ARC task `.json` files (e.g., `data/sample/tasks`).
- `--config`: Model config name from `models.yml`. Used by both single-task and batch.
- `--save_submission_dir`: Where to write outputs. Use the same flag for single-task and batch (alias: `--submissions-root` remains for backward compatibility). Recommended structure: `<save_submission_dir>/<config>/<version>/<eval_type>/`, e.g. `submissions/gpt-4o-2024-11-20/v1/public_eval/`.
- `--num_attempts`: How many attempts per test pair (per task).
- `--retry_attempts`: Internal retries within an attempt if the provider call fails.
- `--log-level`: `DEBUG` | `INFO` | `WARNING` | `ERROR` | `CRITICAL` | `NONE`.
- `--enable-metrics`: Toggle metrics collection (saved in `metrics_output/`).
- Scoring-specific:
  - `--submission_dir`: Where your run wrote outputs.
  - `--results_dir`: Where to write aggregated metrics/results.
For runs beyond the Quickstart:

- Batch (recommended): `python cli/run_all.py` with your task list, model config, data dir, submission dir, attempts/retries, and log level. Uses asyncio, provider rate limiting, and tenacity retries; outputs land in `--save_submission_dir` (e.g., `submissions/<config>/<version>/<eval_type>`). `run_all` handles one model config per invocation; run multiple configs by invoking it multiple times (see `run_all_configs_local.sh` for a pattern, and the sketch after this list).
- Single task (debug): `python main.py` with a single `--config`, `--task_id`, and your data dir, save directory, and log level. See the CLI parameters section for flag details.
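For example, one way to run several configs back to back (a stand-alone sketch in the spirit of `run_all_configs_local.sh`; the config list is a placeholder):

```python
# Sketch: invoke cli/run_all.py once per model config. Config names are
# placeholders; use entries from models.yml.
import subprocess

CONFIGS = ["random-baseline"]  # e.g. add "gpt-4o-2024-11-20" once API keys are set

for config in CONFIGS:
    subprocess.run(
        [
            "python", "cli/run_all.py",
            "--config", config,
            "--data_dir", "data/sample/tasks",
            "--save_submission_dir", f"submissions/{config}",
            "--log-level", "INFO",
        ],
        check=True,
    )
```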
Tests are run against model configs. A model config holds the settings for each test (max output tokens, temperature, pricing, etc.).
Model configs live in `src/arc_agi_benchmarking/models.yml`. Example:

```yaml
- name: "gpt-4o-2024-11-20"        # config name you reference on the CLI; typically includes the reasoning level for clarity (e.g., "-basic", "-advanced")
  model_name: "gpt-4o-2024-11-20"  # provider's actual model id
  provider: "openai"               # must match an adapter
  max_output_tokens: 4096          # optional; provider-specific
  temperature: 0.0                 # optional; provider-specific
  pricing:
    date: "2024-11-20"
    input: 5.00                    # USD per 1M input tokens
    output: 15.00                  # USD per 1M output tokens
```

- Standard fields: `name`, `model_name`, `provider`, `pricing` (`input`/`output` per 1M tokens, `date` for traceability).
- Provider kwargs: any extra keys become kwargs and are passed directly to the SDK (e.g., `temperature`, `max_output_tokens`, `stream`, etc.); see the sketch after this list.
- Rate limits live in `provider_config.yml` (`rate`, `period` per provider).
- Environment: set provider keys (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `HUGGING_FACE_API_KEY`). Copy `.env.example` to `.env` and fill in.
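To illustrate how extra keys end up as SDK kwargs, here is a small sketch (not the repo's actual config loader; it assumes `models.yml` is a top-level list of entries, as in the example above):

```python
# Illustration only (not the repo's config loader): split the standard fields
# from a models.yml entry and treat everything else as SDK kwargs.
import yaml

STANDARD_FIELDS = {"name", "model_name", "provider", "pricing"}

with open("src/arc_agi_benchmarking/models.yml") as f:
    entries = yaml.safe_load(f)

entry = entries[0]
sdk_kwargs = {k: v for k, v in entry.items() if k not in STANDARD_FIELDS}
print(entry["name"], "->", sdk_kwargs)  # e.g. {'max_output_tokens': 4096, 'temperature': 0.0}
```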
- Add a new model config: add an entry to `models.yml` with an existing provider, then use `--config <name>` on the CLI.
- If you're adding a new adapter (see the skeleton sketch after this list):
  - Create `src/arc_agi_benchmarking/adapters/<provider>.py` implementing `ProviderAdapter`
  - Export it from `src/arc_agi_benchmarking/adapters/__init__.py`
  - Add a branch in `main.py` (and any factories) so the provider name is recognized
  - Add a config entry in `models.yml` pointing to `provider: "<provider>"`
  - [Optional] Add tests (adapters and parsing) to cover basic flows
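A rough skeleton of what a new adapter might look like; the method and import names below are assumptions, so mirror the real `ProviderAdapter` interface defined under `src/arc_agi_benchmarking/adapters/`:

```python
# Illustrative skeleton only -- method and import names are assumptions;
# follow the real ProviderAdapter interface in src/arc_agi_benchmarking/adapters/.
import os

from arc_agi_benchmarking.adapters import ProviderAdapter  # assumed export


class MyProviderAdapter(ProviderAdapter):
    """Adapter for a hypothetical 'my_provider' backend."""

    def __init__(self, model_config):
        self.config = model_config
        self.api_key = os.environ["MY_PROVIDER_API_KEY"]  # hypothetical env var

    def make_prediction(self, prompt: str) -> str:
        # Call the provider SDK here, forwarding any extra config kwargs
        # (temperature, max_output_tokens, ...), and return the model's text output.
        raise NotImplementedError
```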
To score a run you'll need 1) your run's submission directory and 2) the source task set (which contains the solutions).

Score a run:

```bash
python src/arc_agi_benchmarking/scoring/scoring.py \
  --task_dir <data_dir>/data/evaluation \
  --submission_dir submissions/<config> \
  --results_dir results/<config>
```
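Conceptually, scoring is an exact match between each predicted output grid and the task's solution grid, and a test pair counts as solved if any of its attempts matches. A stand-alone sketch of that idea (not the repo's scorer):

```python
# Conceptual sketch of ARC scoring (not the repo's scorer): a test pair is
# solved only if some attempt reproduces the solution grid exactly.
from typing import List

Grid = List[List[int]]


def pair_is_correct(attempts: List[Grid], solution: Grid) -> bool:
    return any(attempt == solution for attempt in attempts)


# Example: two attempts, the second matches the solution exactly.
solution = [[1, 0], [0, 1]]
attempts = [[[0, 0], [0, 0]], [[1, 0], [0, 1]]]
print(pair_is_correct(attempts, solution))  # True
```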
- Add new providers/models in `src/arc_agi_benchmarking/adapters` and `models.yml`.
- Run tests: `pytest`.
- Use the bundled sample task + submission for quick scoring checks.