Multi-agent system for automated theorem proving using Lean 4, powered by smolagents.
- Install dependencies:
# Recommended: uv (much faster)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
# Alternative: pip
pip install -r requirements.txt

- Configure the system (optional):
Edit configs/config.yaml to customize:
- API keys and endpoints
- Model selection
- Agent parameters and budgets
- Timeouts and paths
See the Configuration section for details.
- Run the multi-agent system:
export OPENROUTER_API_KEY=...
python agents/math_prover_agent.py \
--subset_size 20 \
--json_file valid.json \
--model anthropic/claude-sonnet-4 \
--concurrency 4 \
--stages 2

- List available checkpoints and resume from a previous run:
# List all available checkpoints
python agents/math_prover_agent.py --list_checkpoints
# Resume from a specific run (recommended for stage-based checkpoints)
python agents/math_prover_agent.py --resume_run 20250816-204214
# Resume from a specific stage within a run
python agents/math_prover_agent.py --resume_run 20250816-204214 --resume_stage 1

- Or try the prompt-only baselines:
# OpenRouter baseline
export OPENROUTER_API_KEY=...
python benchmark_openrouter.py --subset_size 20 --json_file valid.json --model anthropic/claude-sonnet-4 --concurrency 4 --stages 1
# OpenAI Responses baseline
export OPENAI_API_KEY=...
python benchmark_openai.py --subset_size 20 --json_file valid.json --model gpt-4.1 --concurrency 4 --effort low --max_output_tokens 4096 --stages 1

Checkpoints and logs are written under `tmp/` and `log/` respectively (details below).
Multi-agent system powered by smolagents:
- Idea generator agent: searches for relevant lemmas via `moogle_semantic_search`, plans a strategy, and delegates code generation.
- Code generator agent: produces Lean 4 code and verifies it with the `verify_lean_proof` tool (calls Lean via Lake). A rough wiring sketch follows.
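For orientation, here is a minimal sketch of how two such agents could be wired with smolagents. It is not the implementation in `agents/`: the tool bodies are placeholders, the model ID uses LiteLLM's `openrouter/` prefix as an assumption, and the exact `managed_agents`/`name`/`description` API may differ between smolagents versions.

```python
# Hypothetical wiring sketch, not the actual agents/ code.
from smolagents import CodeAgent, LiteLLMModel, tool

model = LiteLLMModel(model_id="openrouter/anthropic/claude-sonnet-4")

@tool
def moogle_semantic_search(query: str) -> str:
    """Search Moogle for Lean lemmas relevant to the query.

    Args:
        query: Natural-language or Lean-flavoured search string.
    """
    return "placeholder: formatted lemma candidates"  # stand-in body

@tool
def verify_lean_proof(code: str) -> str:
    """Compile a candidate Lean 4 proof with Lake and return the compiler output.

    Args:
        code: Complete Lean 4 source for the theorem being proved.
    """
    return "placeholder: lake env lean output"  # stand-in body

# Code generator: writes Lean 4 proofs and checks them with the verifier tool.
code_generator = CodeAgent(
    tools=[verify_lean_proof],
    model=model,
    name="code_generator",
    description="Writes Lean 4 proofs and verifies them with verify_lean_proof.",
)

# Idea generator: searches for lemmas, plans, and delegates to the code generator.
idea_generator = CodeAgent(
    tools=[moogle_semantic_search],
    model=model,
    managed_agents=[code_generator],
    planning_interval=3,
)

idea_generator.run("Prove: theorem t1 : 2 + 2 = 4 := by sorry")
```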
Flags:
- `--subset_size int` (default from config): Number of tasks to run. Use `0` or `-1` for the full dataset.
- `--json_file Path` (default `valid.json`): Path to the theorems JSON.
- `--log_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}`: Logging verbosity.
- `--max_steps int` (default from config): Max agent steps per theorem.
- `--planning_interval int` (default from config): How often to run the planning phase.
- `--concurrency int` (default from config): Number of theorems processed in parallel.
- `--model str` (default from config): Model ID (OpenRouter-compatible). Available examples are listed in `config.py`.
- `--checkpoint str`: Name to resume from (legacy single-file checkpoint in `tmp/checkpoints/<name>.json`).
- `--resume_run str`: Resume from a specific run ID (e.g., '20250816-204214'). New stage-based format.
- `--resume_stage int`: Specific stage to resume from (used with `--resume_run`). If not specified, resumes from the latest stage.
- `--save_checkpoint str`: Name to save the final legacy checkpoint as.
- `--checkpoint_interval int`: Save a legacy checkpoint every N tasks.
- `--list_checkpoints`: List available checkpoints (both legacy and stage-based) and exit.
- `--stages int`: Number of passes over the dataset. Stages > 1 re-run unsolved tasks.
Per-stage run checkpoints are always saved to `tmp/checkpoints/run-<timestamp>/stage-<n>.json` and include summary fields such as `processed_count`, `solved_stage`, `solved_cumulative`, `unsolved_remaining`, and `results_*`.
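Because these checkpoints are plain JSON, a run can also be inspected by hand. A minimal sketch, assuming the summary fields listed above sit at the top level of the file (the run ID is the example used elsewhere in this README):

```python
import json
from pathlib import Path

# Substitute a run ID printed by --list_checkpoints.
ckpt = Path("tmp/checkpoints/run-20250816-204214/stage-2.json")
data = json.loads(ckpt.read_text())

print("processed:", data["processed_count"])
print("solved this stage:", data["solved_stage"])
print("solved so far:", data["solved_cumulative"])
print("still unsolved:", data["unsolved_remaining"])
```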
Notes:
- Provider is fixed to OpenRouter in this script; ensure `OPENROUTER_API_KEY` is set.
- Tokens are counted with `tiktoken` when available and reported at the end.
Prompt-only baseline using OpenRouter's Chat Completions. It asks the model to replace sorry with a proof, extracts the Lean proof body, and verifies with Lake.
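The flow can be pictured with a short sketch. The prompt wording, the extraction regex, and the helper name `extract_proof_body` are illustrative rather than the script's exact code; OpenRouter exposes an OpenAI-compatible Chat Completions endpoint, so the standard `openai` client with a custom `base_url` is assumed here.

```python
import os
import re
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def extract_proof_body(reply: str) -> str:
    """Pull Lean code out of a fenced block, falling back to the raw reply."""
    match = re.search(r"```(?:lean4?)?\s*(.*?)```", reply, re.DOTALL)
    return (match.group(1) if match else reply).strip()

statement = "theorem t1 : 2 + 2 = 4 := by sorry"
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": f"Replace `sorry` with a proof:\n{statement}"}],
)
proof = extract_proof_body(resp.choices[0].message.content)
# `proof` is then handed to the Lake-based verifier (see verify_task.py below).
```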
Flags:
- `--subset_size int`: Use 0 or -1 for all tasks.
- `--json_file Path`: Path to tasks (default `valid.json`).
- `--model str`: OpenRouter model ID.
- `--log_level {DEBUG,INFO,...}`: Logging verbosity.
- `--concurrency int`: Number of tasks verified in parallel.
- `--stages int`: Multi-pass runs; later stages retry only unsolved tasks.
Requires OPENROUTER_API_KEY.
Prompt-only baseline using the OpenAI Responses API. Similar flow: prompt → extract proof body → verify with Lake.
Flags:
- `--subset_size int`: Use 0 or -1 for all tasks.
- `--json_file Path`: Path to tasks (default `valid.json`).
- `--model str`: OpenAI model name.
- `--log_level {DEBUG,INFO,...}`: Logging verbosity.
- `--concurrency int`: Number of tasks verified in parallel.
- `--effort {low,medium,high}`: Reasoning effort used in the Responses API.
- `--max_output_tokens int`: Maximum tokens requested for generation.
- `--stages int`: Multi-pass runs; later stages retry only unsolved tasks.
Requires OPENAI_API_KEY.
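For orientation, the flags map onto a Responses API call roughly like this. The prompt text is illustrative, and `reasoning.effort` is only accepted by reasoning-capable models, so `o4-mini` is used as a stand-in below:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

resp = client.responses.create(
    model="o4-mini",  # stand-in; use a reasoning-capable model when passing `reasoning`
    input="Replace `sorry` with a Lean 4 proof:\ntheorem t1 : 2 + 2 = 4 := by sorry",
    reasoning={"effort": "low"},  # corresponds to --effort
    max_output_tokens=4096,       # corresponds to --max_output_tokens
)
print(resp.output_text)
```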
Parses the Lean sources and emits valid.json with normalized theorems where any existing proofs are replaced by sorry so every item is solvable by the LLM/agent. The script:
- Reads `miniF2F-lean4/MiniF2F/Valid.lean` (path defined in `config.py`).
- Extracts declarations (`theorem`, `lemma`, `def`, `example`, `instance`, `abbrev`).
- Detects whether the original declaration had a proof and records `is_solved`.
- Rewrites proof blocks to a placeholder `sorry` (see the sketch after this list).
- Writes a list of objects to `valid.json` with fields: `name`, `statement`, `is_solved`.
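A minimal sketch of the proof-stripping step. The real `process_lean.py` handles more declaration kinds and edge cases; the regex and the `is_solved` heuristic below are only illustrative.

```python
import json
import re
from pathlib import Path

# Illustrative pattern only: match `theorem/lemma <name> ... := <proof>` blocks.
DECL = re.compile(
    r"^(?P<kind>theorem|lemma)\s+(?P<name>\S+)(?P<sig>[\s\S]*?):=(?P<proof>[\s\S]*?)(?=^\S|\Z)",
    re.MULTILINE,
)

def normalize(source: str) -> list[dict]:
    items = []
    for m in DECL.finditer(source):
        items.append({
            "name": m.group("name"),
            # Replace whatever proof was there with the `sorry` placeholder.
            "statement": f"{m.group('kind')} {m.group('name')}{m.group('sig').rstrip()} := sorry",
            # Crude check for whether a real proof was present originally.
            "is_solved": m.group("proof").strip() != "sorry",
        })
    return items

source = Path("miniF2F-lean4/MiniF2F/Valid.lean").read_text()
Path("valid.json").write_text(json.dumps(normalize(source), indent=2))
```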
No CLI flags; just run:
python process_lean.py

Minimal helper to compile ad‑hoc Lean code with Lake inside `miniF2F-lean4`.
Usage (mutually exclusive):
- `--file <path>`: path to a Lean file whose contents are compiled.
- `--code "<lean code>"`: pass Lean code directly as a string.
- If neither is provided, the script reads Lean code from STDIN.
Example:
python verify_task.py --code "theorem t1 : 2 + 2 = 4 := by norm_num"

The script ensures `import MiniF2F.Minif2fImport` is present, writes a temp file under `miniF2F-lean4`, runs `lake env lean`, and forwards compiler output, returning Lean's exit code.
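The same check can be reproduced with a few lines of `subprocess`; the scratch file name and timeout below are arbitrary choices, and the real script also cleans up after itself.

```python
import subprocess
from pathlib import Path

lean_code = 'theorem t1 : 2 + 2 = 4 := by norm_num'

# Make sure the miniF2F import header is present, as verify_task.py does.
if "import MiniF2F.Minif2fImport" not in lean_code:
    lean_code = "import MiniF2F.Minif2fImport\n\n" + lean_code

work_dir = Path("miniF2F-lean4")
tmp_file = work_dir / "_scratch_proof.lean"  # arbitrary temp name
tmp_file.write_text(lean_code)

# `lake env lean <file>` compiles the file inside the project's environment.
result = subprocess.run(
    ["lake", "env", "lean", tmp_file.name],
    cwd=work_dir,
    capture_output=True,
    text=True,
    timeout=300,
)
print(result.stdout or result.stderr)
print("exit code:", result.returncode)
```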
Run multiple models from a single YAML config and get a leaderboard:
python runner.py --config configs/run.yaml --dataset minif2f --max-workers 8

- Config file: `configs/run.yaml`
  - `model_list`: list of model keys
  - Per-model blocks (keys must match `model_list`): `model_name`, `api_type` (openrouter|openai|agent), `parallel`, optional `num_examples`, `system_prompt`, `max_tokens`
- Results are written to:
  - `results/leaderboard.md`: aggregated leaderboard
  - `results/details/<model_key>/summary.json` and `results/details/<model_key>/results.json` (see the snippet below for a quick way to inspect them)
- Use `--no-cache` to force a re-run even if a summary exists.
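The per-model outputs can be scanned directly. This snippet only relies on the paths listed above; the field names inside `summary.json` are determined by `runner.py`, so it simply prints whatever is stored.

```python
import json
from pathlib import Path

# Dump each model's summary as written by runner.py.
for summary in sorted(Path("results/details").glob("*/summary.json")):
    print(summary.parent.name, json.loads(summary.read_text()))
```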
uv is recommended for faster dependency resolution and automatic virtual environment management.
All defaults live in configs/config.yaml and are loaded by configs/config_loader.py:
- `paths` - Project directories: `MINIF2F_DIR`, `LEAN_SOURCE_FILE`, `LEAN_OUTPUT_FILE`, `LOG_DIR`, `TMP_DIR`
- `api` - Provider settings: `AVAILABLE_PROVIDERS`, `DEFAULT_PROVIDER`, API keys and endpoints
- `models` - Model configuration: `DEFAULT_MODEL`
- `lean` - Lean 4 settings: `LEAN_TIMEOUT`
- `llm` - LLM/API timeouts: `LLM_REQUEST_TIMEOUT`
- `agent` - Agent parameters: `DEFAULT_MAX_STEPS`, `DEFAULT_PLANNING_INTERVAL`, `DEFAULT_CONCURRENCY`, `DEFAULT_SUBSET_SIZE`
- `budgets` - Search and compilation budgets by difficulty level (see the sketch after this list):
  - `fast_path` (difficulty ≤ 0.3): quick processing
  - `micro_plan` (0.3 < difficulty ≤ 0.6): balanced approach
  - `full_pipeline` (difficulty > 0.6): comprehensive processing
- `logging` - Logging configuration: `LOG_FORMAT`
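As a sketch of how the difficulty thresholds partition work: the tier names mirror the budget keys above, while the budget contents and the numbers inside them are placeholders, not the values in `config.yaml`.

```python
# Hypothetical illustration of the three budget tiers.
AGENT_BUDGETS = {
    "fast_path": {"max_steps": 4, "search_calls": 1},       # placeholder values
    "micro_plan": {"max_steps": 8, "search_calls": 3},      # placeholder values
    "full_pipeline": {"max_steps": 16, "search_calls": 6},  # placeholder values
}

def pick_budget(difficulty: float) -> dict:
    """Map an estimated difficulty in [0, 1] to a budget tier."""
    if difficulty <= 0.3:
        return AGENT_BUDGETS["fast_path"]
    if difficulty <= 0.6:
        return AGENT_BUDGETS["micro_plan"]
    return AGENT_BUDGETS["full_pipeline"]
```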
The YAML config supports environment variable substitution:
api:
  openai:
    api_key: "${OPENAI_API_KEY}"  # Required variable
  default_provider: "${MATH_AGENT_PROVIDER:-openrouter}"  # With default value

Import configuration as usual (no changes needed):
from configs.config_loader import (
    BASE_DIR, MINIF2F_DIR, DEFAULT_MODEL, LEAN_TIMEOUT,
    AGENT_BUDGETS, DEFAULT_MAX_STEPS
)

Validation helpers: `validate_config()` and `validate_provider_credentials()` (scripts call these and will error early if prerequisites are missing).
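For example, a script can fail fast before doing any work; this assumes both helpers are importable from `configs.config_loader`, as the list above suggests.

```python
from configs.config_loader import validate_config, validate_provider_credentials

# Raise early if config.yaml is malformed or required credentials are missing.
validate_config()
validate_provider_credentials()
```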
- `valid.json`: generated by `process_lean.py`; consumed by benchmarks and agents.
- Stage-based checkpoints (recommended): `tmp/checkpoints/run-<YYYYMMDD-HHMMSS>/stage-<n>.json` - always written per stage; includes cumulative and per-stage results and the list of unsolved items. Use `--resume_run <run_id>` to resume.
- Legacy checkpoint files: `tmp/checkpoints/<name>.json` (if you use `--save_checkpoint`/`--checkpoint` in `agents/math_prover_agent.py`).
- Logs:
  - `log/llm_requests.log`: provider requests (benchmarks).
  - `log/tools.log`: output from Lean verifier and search tools.
  - `log/agent_benchmark.log`: multi-agent run logs.
- `agents/`: Multi-agent system implementation
- `miniF2F-lean4/`: Lean 4 theorem library
- `configs/`: Configuration files
  - `config.yaml`: Main configuration in YAML format
  - `config_loader.py`: Configuration loader (don't edit manually)
- `pyproject.toml`: Project configuration and dependencies (uv)
- `requirements.txt`: Python dependencies (pip fallback)
- `uv.lock`: Lock file for reproducible builds
- `tmp/`: Checkpoints and temporary files
- `log/`: Log files
- Which provider should I use? The multi-agent pipeline currently uses OpenRouter; prompt baselines are provided for both OpenRouter and OpenAI.
- Where do results go? Logs in `log/`, checkpoints in `tmp/checkpoints/...`. The source tasks are in `valid.json`.
- Can I resume from a previous run? Yes. Use `--stages` for automatic retries across stages. For the multi-agent script, you can resume from stage-based checkpoints with `--resume_run <run_id>` (recommended) or use the legacy `--checkpoint`/`--save_checkpoint` options.
- Should I use uv or pip? uv is recommended: it is faster and creates virtual environments automatically.