GitHub - frenzymath/FATE-Eval: Repository for evaluation codes for FATE benchmark

FATE-Eval

This project is the official evaluation code for the FATE benchmark. It is an open-source toolkit for generating and verifying Lean 4 solutions to math problems, with support for pass@k metrics and cost tracking.

Features

Unified generation interface across commercial APIs
Lean 4 verification with static precheck and batched REPL verification
pass@k computation and result aggregation
Cost tracking for API calls
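
As an illustration of the pass@k metric listed above, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n is the number of attempts sampled for a problem and c is the number that passed verification. This is an assumption about how such a metric is typically computed, not a copy of this repository's implementation. The function and parameter names (`pass_at_k`, `mean_pass_at_k`, `num_samples`, `num_correct`) are hypothetical.

```python
from math import comb


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    num_samples: attempts generated for the problem (hypothetical name)
    num_correct: attempts that passed verification (hypothetical name)
    k: attempt budget being scored, with 1 <= k <= num_samples
    """
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    # Fraction of size-k subsets of the attempts that contain no correct one.
    # math.comb returns 0 when k exceeds num_samples - num_correct,
    # in which case every subset contains a correct attempt and pass@k is 1.0.
    all_fail = comb(num_samples - num_correct, k) / comb(num_samples, k)
    return 1.0 - all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (num_samples, num_correct) pairs, one per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly k attempts are sampled per problem, the estimate for each problem reduces to 1.0 if any attempt passed and 0.0 otherwise.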

Requirements

Python 3.11+
Lean 4 toolchain and lake (required only if you run verification locally)

Installation

pip install -r requirements.txt

Quickstart

Prepare your model configurations in config/models.yaml and verification configuration in config/verify_config.yaml.
Prepare Lean Dependencies: This repository provides three versions of Lean workspaces under the lean_workspaces directory. Run
```
lake exe cache get
```
in the corresponding directory before running verification or the full pipeline.

Run generation only:

python -m src.generate --model openai_o3 \
  --dataset data/FATE-H.json \
  --n 100 --k 1 --mode lean

Run the full pipeline (generate then verify):

python -m src.main --model openai_o3 \
  --dataset data/FATE-H.json \
  --n 100 --k 1 --mode lean

Generation outputs are saved under output/generate/<model>/..., and verification summaries under output/verify/..., unless your YAML files configure different paths.

Command-Line Arguments: The src/main.py script for running the full generation and verification pipeline accepts the following arguments:

--model (required): The name of the model to evaluate.
--dataset (required): The path to the dataset file.
--n (optional, default: 10): The number of problems to process.
--k (optional, default: 1): The number of attempts per problem.
--api_key (optional): The API key for model calls. If omitted, it falls back to environment variables.
--mode (optional, default: "lean"): The prompt mode to use for generation.
--timeout (optional): The timeout in seconds for a single verification task. Overrides the setting in the config file if provided.
--max_workers (optional): The maximum number of concurrent workers for verification. Overrides the setting in the config file if provided.
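
The argument list above can be sketched with argparse as follows. This is a hedged illustration and not the actual contents of src/main.py. The helper names (`build_parser`, `resolve_verify_settings`), the integer types for `--timeout` and `--max_workers`, and the config keys `"timeout"` and `"max_workers"` are assumptions made for the example. It shows one way the "overrides the setting in the config file if provided" behavior could work: a CLI value replaces the config value only when the flag is actually passed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Generate, then verify, Lean 4 solutions")
    parser.add_argument("--model", required=True, help="name of the model to evaluate")
    parser.add_argument("--dataset", required=True, help="path to the dataset file")
    parser.add_argument("--n", type=int, default=10, help="number of problems to process")
    parser.add_argument("--k", type=int, default=1, help="number of attempts per problem")
    parser.add_argument("--api_key", default=None, help="API key; falls back to environment variables")
    parser.add_argument("--mode", default="lean", help="prompt mode")
    # None means "flag not given", so the config file value is kept.
    parser.add_argument("--timeout", type=int, default=None, help="seconds per verification task")
    parser.add_argument("--max_workers", type=int, default=None, help="concurrent verification workers")
    return parser


def resolve_verify_settings(args: argparse.Namespace, config: dict) -> dict:
    """Return verification settings with CLI overrides applied.

    `config` stands in for values loaded from config/verify_config.yaml;
    its key names here are assumed, not taken from the repository.
    """
    settings = dict(config)  # copy so the loaded config is not mutated
    if args.timeout is not None:
        settings["timeout"] = args.timeout
    if args.max_workers is not None:
        settings["max_workers"] = args.max_workers
    return settings
```

For example, parsing `["--model", "openai_o3", "--dataset", "data/FATE-H.json", "--timeout", "300"]` and resolving against a config of `{"timeout": 120, "max_workers": 8}` keeps `max_workers` from the config and takes `timeout` from the command line.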

Directory Structure (Key Parts)

src/: Generation, verification, model interfaces, and Lean utilities
config/: YAML configuration files for models and verification
output/: Generated and verified results
logs/: Runtime logs
lean_workspaces/: Contains different versions of Lean workspaces

License

MIT License. See the LICENSE file for details.
