Multi-agent system for automated theorem proving using Lean 4, powered by smolagents.
- Install dependencies:
# Recommended: uv (much faster)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
# Alternative: pip
pip install -r requirements.txt

- Configure the system (optional):
Edit configs/config.yaml to customize:
- API keys and endpoints
- Model selection
- Agent parameters and budgets
- Timeouts and paths
See the Configuration section for details.
- Run the multi-agent system:
export OPENROUTER_API_KEY=...
python agents/math_prover_agent.py \
--subset_size 20 \
--json_file valid.json \
--model anthropic/claude-sonnet-4 \
--concurrency 4 \
--stages 2

- List available checkpoints and resume from a previous run:
# List all available checkpoints
python agents/math_prover_agent.py --list_checkpoints
# Resume from a specific run (recommended for stage-based checkpoints)
python agents/math_prover_agent.py --resume_run 20250816-204214
# Resume from a specific stage within a run
python agents/math_prover_agent.py --resume_run 20250816-204214 --resume_stage 1

- Or try the prompt-only baselines:
# OpenRouter baseline
export OPENROUTER_API_KEY=...
python benchmark_openrouter.py --subset_size 20 --json_file valid.json --model anthropic/claude-sonnet-4 --concurrency 4 --stages 1
# OpenAI Responses baseline
export OPENAI_API_KEY=...
python benchmark_openai.py --subset_size 20 --json_file valid.json --model gpt-4.1 --concurrency 4 --effort low --max_output_tokens 4096 --stages 1

Checkpoints and logs are written under `tmp/` and `log/` respectively (details below).
Multi-agent system powered by smolagents:
- Idea generator agent: searches for relevant lemmas via `moogle_semantic_search`, plans a strategy, and delegates code generation.
- Code generator agent: produces Lean 4 code and verifies it with the `verify_lean_proof` tool (calls Lean via Lake). A rough wiring sketch follows.
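For orientation, here is a minimal sketch of how two such agents could be wired with smolagents. It is not the implementation in `agents/`: the tool bodies are placeholders, the model ID uses LiteLLM's `openrouter/` prefix as an assumption, and the exact `managed_agents`/`name`/`description` API may differ between smolagents versions.

```python
# Hypothetical wiring sketch, not the actual agents/ code.
from smolagents import CodeAgent, LiteLLMModel, tool

model = LiteLLMModel(model_id="openrouter/anthropic/claude-sonnet-4")

@tool
def moogle_semantic_search(query: str) -> str:
    """Search Moogle for Lean lemmas relevant to the query.

    Args:
        query: Natural-language or Lean-flavoured search string.
    """
    return "placeholder: formatted lemma candidates"  # stand-in body

@tool
def verify_lean_proof(code: str) -> str:
    """Compile a candidate Lean 4 proof with Lake and return the compiler output.

    Args:
        code: Complete Lean 4 source for the theorem being proved.
    """
    return "placeholder: lake env lean output"  # stand-in body

# Code generator: writes Lean 4 proofs and checks them with the verifier tool.
code_generator = CodeAgent(
    tools=[verify_lean_proof],
    model=model,
    name="code_generator",
    description="Writes Lean 4 proofs and verifies them with verify_lean_proof.",
)

# Idea generator: searches for lemmas, plans, and delegates to the code generator.
idea_generator = CodeAgent(
    tools=[moogle_semantic_search],
    model=model,
    managed_agents=[code_generator],
    planning_interval=3,
)

idea_generator.run("Prove: theorem t1 : 2 + 2 = 4 := by sorry")
```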
Flags:
- `--subset_size int` (default from config): Number of tasks to run. Use `0` or `-1` for the full dataset.
- `--json_file Path` (default `valid.json`): Path to the theorems JSON.
- `--log_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}`: Logging verbosity.
- `--max_steps int` (default from config): Max agent steps per theorem.
- `--planning_interval int` (default from config): How often to run the planning phase.
- `--concurrency int` (default from config): Number of theorems processed in parallel.
- `--model str` (default from config): Model ID (OpenRouter-compatible). Available examples are listed in `config.py`.
- `--checkpoint str`: Name to resume from (legacy single-file checkpoint in `tmp/checkpoints/<name>.json`).
- `--resume_run str`: Resume from a specific run ID (e.g., '20250816-204214'). New stage-based format.
- `--resume_stage int`: Specific stage to resume from (used with `--resume_run`). If not specified, resumes from the latest stage.
- `--save_checkpoint str`: Name to save the final legacy checkpoint as.
- `--checkpoint_interval int`: Save a legacy checkpoint every N tasks.
- `--list_checkpoints`: List available checkpoints (both legacy and stage-based) and exit.
- `--stages int`: Number of passes over the dataset. Stages > 1 re-run unsolved tasks.
Per-stage run checkpoints are always saved to `tmp/checkpoints/run-<timestamp>/stage-<n>.json` and include summary fields such as `processed_count`, `solved_stage`, `solved_cumulative`, `unsolved_remaining`, and `results_*`.
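Because these checkpoints are plain JSON, a run can also be inspected by hand. A minimal sketch, assuming the summary fields listed above sit at the top level of the file (the run ID is the example used elsewhere in this README):

```python
import json
from pathlib import Path

# Substitute a run ID printed by --list_checkpoints.
ckpt = Path("tmp/checkpoints/run-20250816-204214/stage-2.json")
data = json.loads(ckpt.read_text())

print("processed:", data["processed_count"])
print("solved this stage:", data["solved_stage"])
print("solved so far:", data["solved_cumulative"])
print("still unsolved:", data["unsolved_remaining"])
```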
Notes:
- Provider is fixed to OpenRouter in this script; ensure `OPENROUTER_API_KEY` is set.
- Tokens are counted with `tiktoken` when available and reported at the end.
Prompt-only baseline using OpenRouter's Chat Completions. It asks the model to replace sorry with a proof, extracts the Lean proof body, and verifies with Lake.
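The flow can be pictured with a short sketch. The prompt wording, the extraction regex, and the helper name `extract_proof_body` are illustrative rather than the script's exact code; OpenRouter exposes an OpenAI-compatible Chat Completions endpoint, so the standard `openai` client with a custom `base_url` is assumed here.

```python
import os
import re
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def extract_proof_body(reply: str) -> str:
    """Pull Lean code out of a fenced block, falling back to the raw reply."""
    match = re.search(r"```(?:lean4?)?\s*(.*?)```", reply, re.DOTALL)
    return (match.group(1) if match else reply).strip()

statement = "theorem t1 : 2 + 2 = 4 := by sorry"
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": f"Replace `sorry` with a proof:\n{statement}"}],
)
proof = extract_proof_body(resp.choices[0].message.content)
# `proof` is then handed to the Lake-based verifier (see verify_task.py below).
```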
Flags:
- `--subset_size int`: Use 0 or -1 for all tasks.
- `--json_file Path`: Path to tasks (default `valid.json`).
- `--model str`: OpenRouter model ID.
- `--log_level {DEBUG,INFO,...}`: Logging verbosity.
- `--concurrency int`: Number of tasks verified in parallel.
- `--stages int`: Multi-pass runs; later stages retry only unsolved tasks.
Requires OPENROUTER_API_KEY.
Prompt-only baseline using the OpenAI Responses API. Similar flow: prompt → extract proof body → verify with Lake.
Flags:
- `--subset_size int`: Use 0 or -1 for all tasks.
- `--json_file Path`: Path to tasks (default `valid.json`).
- `--model str`: OpenAI model name.
- `--log_level {DEBUG,INFO,...}`: Logging verbosity.
- `--concurrency int`: Number of tasks verified in parallel.
- `--effort {low,medium,high}`: Reasoning effort used in the Responses API.
- `--max_output_tokens int`: Maximum tokens requested for generation.
- `--stages int`: Multi-pass runs; later stages retry only unsolved tasks.
Requires OPENAI_API_KEY.
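For orientation, the flags map onto a Responses API call roughly like this. The prompt text is illustrative, and `reasoning.effort` is only accepted by reasoning-capable models, so `o4-mini` is used as a stand-in below:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

resp = client.responses.create(
    model="o4-mini",  # stand-in; use a reasoning-capable model when passing `reasoning`
    input="Replace `sorry` with a Lean 4 proof:\ntheorem t1 : 2 + 2 = 4 := by sorry",
    reasoning={"effort": "low"},  # corresponds to --effort
    max_output_tokens=4096,       # corresponds to --max_output_tokens
)
print(resp.output_text)
```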
Parses the Lean sources and emits valid.json with normalized theorems where any existing proofs are replaced by sorry so every item is solvable by the LLM/agent. The script:
- Reads `miniF2F-lean4/MiniF2F/Valid.lean` (path defined in `config.py`).
- Extracts declarations (`theorem`, `lemma`, `def`, `example`, `instance`, `abbrev`).
- Detects whether the original declaration had a proof and records `is_solved`.
- Rewrites proof blocks to a placeholder `sorry` (see the sketch after this list).
- Writes a list of objects to `valid.json` with fields: `name`, `statement`, `is_solved`.
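A minimal sketch of the proof-stripping step. The real `process_lean.py` handles more declaration kinds and edge cases; the regex and the `is_solved` heuristic below are only illustrative.

```python
import json
import re
from pathlib import Path

# Illustrative pattern only: match `theorem/lemma <name> ... := <proof>` blocks.
DECL = re.compile(
    r"^(?P<kind>theorem|lemma)\s+(?P<name>\S+)(?P<sig>[\s\S]*?):=(?P<proof>[\s\S]*?)(?=^\S|\Z)",
    re.MULTILINE,
)

def normalize(source: str) -> list[dict]:
    items = []
    for m in DECL.finditer(source):
        items.append({
            "name": m.group("name"),
            # Replace whatever proof was there with the `sorry` placeholder.
            "statement": f"{m.group('kind')} {m.group('name')}{m.group('sig').rstrip()} := sorry",
            # Crude check for whether a real proof was present originally.
            "is_solved": m.group("proof").strip() != "sorry",
        })
    return items

source = Path("miniF2F-lean4/MiniF2F/Valid.lean").read_text()
Path("valid.json").write_text(json.dumps(normalize(source), indent=2))
```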
No CLI flags; just run:
python process_lean.py

Minimal helper to compile ad‑hoc Lean code with Lake inside `miniF2F-lean4`.
Usage (mutually exclusive):
- `--file <path>`: path to a Lean file whose contents are compiled.
- `--code "<lean code>"`: pass Lean code directly as a string.
- If neither is provided, the script reads Lean code from STDIN.
Example:
python verify_task.py --code "theorem t1 : 2 + 2 = 4 := by norm_num"

The script ensures `import MiniF2F.Minif2fImport` is present, writes a temp file under `miniF2F-lean4`, runs `lake env lean`, and forwards compiler output, returning Lean's exit code.
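The same check can be reproduced with a few lines of `subprocess`; the scratch file name and timeout below are arbitrary choices, and the real script also cleans up after itself.

```python
import subprocess
from pathlib import Path

lean_code = 'theorem t1 : 2 + 2 = 4 := by norm_num'

# Make sure the miniF2F import header is present, as verify_task.py does.
if "import MiniF2F.Minif2fImport" not in lean_code:
    lean_code = "import MiniF2F.Minif2fImport\n\n" + lean_code

work_dir = Path("miniF2F-lean4")
tmp_file = work_dir / "_scratch_proof.lean"  # arbitrary temp name
tmp_file.write_text(lean_code)

# `lake env lean <file>` compiles the file inside the project's environment.
result = subprocess.run(
    ["lake", "env", "lean", tmp_file.name],
    cwd=work_dir,
    capture_output=True,
    text=True,
    timeout=300,
)
print(result.stdout or result.stderr)
print("exit code:", result.returncode)
```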
Run multiple models from a single YAML config and get a leaderboard:
python runner.py --config configs/run.yaml --dataset minif2f --max-workers 8

- Config file: `configs/run.yaml`
  - `model_list`: list of model keys
  - Per-model blocks (keys must match `model_list`): `model_name`, `api_type` (openrouter|openai|agent), `parallel`, optional `num_examples`, `system_prompt`, `max_tokens`
- Results are written to:
  - `results/leaderboard.md`: aggregated leaderboard
  - `results/details/<model_key>/summary.json` and `results/details/<model_key>/results.json` (see the snippet below for a quick way to inspect them)
- Use `--no-cache` to force a re-run even if a summary exists.
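The per-model outputs can be scanned directly. This snippet only relies on the paths listed above; the field names inside `summary.json` are determined by `runner.py`, so it simply prints whatever is stored.

```python
import json
from pathlib import Path

# Dump each model's summary as written by runner.py.
for summary in sorted(Path("results/details").glob("*/summary.json")):
    print(summary.parent.name, json.loads(summary.read_text()))
```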
uv is recommended for faster dependency resolution and automatic virtual environment management.
All defaults live in configs/config.yaml and are loaded by configs/config_loader.py:
- `paths` - Project directories: `MINIF2F_DIR`, `LEAN_SOURCE_FILE`, `LEAN_OUTPUT_FILE`, `LOG_DIR`, `TMP_DIR`
- `api` - Provider settings: `AVAILABLE_PROVIDERS`, `DEFAULT_PROVIDER`, API keys and endpoints
- `models` - Model configuration: `DEFAULT_MODEL`
- `lean` - Lean 4 settings: `LEAN_TIMEOUT`
- `llm` - LLM/API timeouts: `LLM_REQUEST_TIMEOUT`
- `agent` - Agent parameters: `DEFAULT_MAX_STEPS`, `DEFAULT_PLANNING_INTERVAL`, `DEFAULT_CONCURRENCY`, `DEFAULT_SUBSET_SIZE`
- `budgets` - Search and compilation budgets by difficulty level (see the sketch after this list):
  - `fast_path` (difficulty ≤ 0.3): quick processing
  - `micro_plan` (0.3 < difficulty ≤ 0.6): balanced approach
  - `full_pipeline` (difficulty > 0.6): comprehensive processing
- `logging` - Logging configuration: `LOG_FORMAT`
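As a sketch of how the difficulty thresholds partition work: the tier names mirror the budget keys above, while the budget contents and the numbers inside them are placeholders, not the values in `config.yaml`.

```python
# Hypothetical illustration of the three budget tiers.
AGENT_BUDGETS = {
    "fast_path": {"max_steps": 4, "search_calls": 1},       # placeholder values
    "micro_plan": {"max_steps": 8, "search_calls": 3},      # placeholder values
    "full_pipeline": {"max_steps": 16, "search_calls": 6},  # placeholder values
}

def pick_budget(difficulty: float) -> dict:
    """Map an estimated difficulty in [0, 1] to a budget tier."""
    if difficulty <= 0.3:
        return AGENT_BUDGETS["fast_path"]
    if difficulty <= 0.6:
        return AGENT_BUDGETS["micro_plan"]
    return AGENT_BUDGETS["full_pipeline"]
```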
The YAML config supports environment variable substitution:
api:
  openai:
    api_key: "${OPENAI_API_KEY}"  # Required variable
  default_provider: "${MATH_AGENT_PROVIDER:-openrouter}"  # With default value

Import configuration as usual (no changes needed):
from configs.config_loader import (
    BASE_DIR, MINIF2F_DIR, DEFAULT_MODEL, LEAN_TIMEOUT,
    AGENT_BUDGETS, DEFAULT_MAX_STEPS
)

Validation helpers: `validate_config()` and `validate_provider_credentials()` (scripts call these and will error early if prerequisites are missing).
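For example, a script can fail fast before doing any work; this assumes both helpers are importable from `configs.config_loader`, as the list above suggests.

```python
from configs.config_loader import validate_config, validate_provider_credentials

# Raise early if config.yaml is malformed or required credentials are missing.
validate_config()
validate_provider_credentials()
```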
- `valid.json`: generated by `process_lean.py`; consumed by benchmarks and agents.
- Stage-based checkpoints (recommended): `tmp/checkpoints/run-<YYYYMMDD-HHMMSS>/stage-<n>.json` - always written per stage; includes cumulative and per-stage results and the list of unsolved items. Use `--resume_run <run_id>` to resume.
- Legacy checkpoint files: `tmp/checkpoints/<name>.json` (if you use `--save_checkpoint`/`--checkpoint` in `agents/math_prover_agent.py`).
- Logs:
  - `log/llm_requests.log`: provider requests (benchmarks).
  - `log/tools.log`: output from Lean verifier and search tools.
  - `log/agent_benchmark.log`: multi-agent run logs.
- `agents/`: Multi-agent system implementation
- `miniF2F-lean4/`: Lean 4 theorem library
- `configs/`: Configuration files
  - `config.yaml`: Main configuration in YAML format
  - `config_loader.py`: Configuration loader (don't edit manually)
- `pyproject.toml`: Project configuration and dependencies (uv)
- `requirements.txt`: Python dependencies (pip fallback)
- `uv.lock`: Lock file for reproducible builds
- `tmp/`: Checkpoints and temporary files
- `log/`: Log files
- Which provider should I use? The multi-agent pipeline currently uses OpenRouter; prompt baselines are provided for both OpenRouter and OpenAI.
- Where do results go? Logs in `log/`, checkpoints in `tmp/checkpoints/...`. The source tasks are in `valid.json`.
- Can I resume from a previous run? Yes. Use `--stages` for automatic retries across stages. For the multi-agent script, you can resume from stage-based checkpoints with `--resume_run <run_id>` (recommended) or use the legacy `--checkpoint`/`--save_checkpoint` options.
- Should I use uv or pip? uv is recommended: it is faster and creates virtual environments automatically.