`imobench/hf_dataset_cards/answerbench.md` (new file, 44 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
tags:
- mathematics
- reasoning
- math-olympiad
pretty_name: IMO-AnswerBench
size_categories:
- n<1K
---

# IMO-AnswerBench

IMO-AnswerBench is a dataset of 400 challenging short-answer problems drawn from national, regional, and international mathematical olympiads (IMO, IMO Shortlist, USAMO, USAMTS, AIME, etc.). It is designed to evaluate the robust mathematical reasoning capabilities of AI models.

## Dataset Structure

The dataset contains the following columns:

- `Problem ID`: Unique identifier for each problem.
- `Problem`: The problem statement in LaTeX format.
- `Short Answer`: The gold standard final answer.
- `Category`: One of the four main IMO categories: Algebra, Combinatorics, Geometry, or Number theory.
- `Level`: Difficulty classification (pre-IMO, IMO-easy, IMO-medium, IMO-hard).
- `Subcategory`: Specific mathematical sub-topic.
- `Source`: The original competition source of the problem.
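
The columns above can be sketched with a hypothetical record (the field values below are illustrative, not taken from the dataset; the repository ID on the Hub is not stated here, so this sketch works on plain dictionaries):

```python
# A hypothetical IMO-AnswerBench row, matching the column schema above.
record = {
    "Problem ID": "answerbench-001",  # illustrative ID, not a real entry
    "Problem": r"Find all positive integers $n$ such that $n^2 + 1$ divides $n^4 + 1$.",
    "Short Answer": "n = 1",
    "Category": "Number theory",
    "Level": "pre-IMO",
    "Subcategory": "Divisibility",
    "Source": "Illustrative example",
}

def is_hard_algebra(row: dict) -> bool:
    """Filter predicate: keep only IMO-hard problems in the Algebra category."""
    return row["Category"] == "Algebra" and row["Level"] == "IMO-hard"

# The sample row is Number theory / pre-IMO, so the predicate rejects it.
print(is_hard_algebra(record))
```

The same predicate could be passed to `datasets.Dataset.filter` once the dataset is loaded from the Hub.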

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```

`imobench/hf_dataset_cards/gradingbench.md` (new file, 45 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- evaluation
tags:
- mathematics
- reasoning
- auto-grading
- human-evaluation
pretty_name: IMO-GradingBench
size_categories:
- 100K<n<1M
---

# IMO-GradingBench

IMO-GradingBench is a large dataset designed to advance automatic evaluation of mathematical proofs. It contains over 186,000 grading entries (including a subset of 1,000 high-quality human gradings) for model-generated solutions to IMO-Bench problems.

## Dataset Structure

The dataset contains the following columns:

- `Grading ID`: Unique identifier for the grading entry.
- `Problem ID`: Reference to the problem being graded.
- `Problem`: The problem statement.
- `Solution`: Reference ground-truth solution.
- `Grading guidelines`: Criteria used for grading.
- `Response`: The model-generated output being evaluated.
- `Points`: Numeric score assigned (0-10 or 0-7 scale depending on configuration).
- `Reward`: Qualitative category (Correct, Partial, Incorrect, etc.).
- `Problem Source`: Original competition source.
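
As a minimal sketch of how the `Points` and `Reward` columns relate, the snippet below checks a few hypothetical grading entries for consistency under a 0-7 scale. The threshold mapping is an illustrative assumption, not the benchmark's official rubric:

```python
# Hypothetical grading entries following the column schema above.
entries = [
    {"Grading ID": "g1", "Points": 7, "Reward": "Correct"},
    {"Grading ID": "g2", "Points": 3, "Reward": "Partial"},
    {"Grading ID": "g3", "Points": 0, "Reward": "Incorrect"},
]

def reward_for(points: int, max_points: int = 7) -> str:
    """Illustrative mapping from a numeric score to a coarse category
    (assumed thresholds, not the official grading guideline)."""
    if points == max_points:
        return "Correct"
    if points > 0:
        return "Partial"
    return "Incorrect"

# Collect IDs whose qualitative Reward disagrees with the assumed mapping.
mismatches = [e["Grading ID"] for e in entries
              if reward_for(e["Points"]) != e["Reward"]]
print(mismatches)  # all three sample entries are consistent, so this is empty
```

A consistency pass like this is a cheap sanity check before training or evaluating an automatic grader on the entries.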

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```

`imobench/hf_dataset_cards/proofbench.md` (new file, 44 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- theorem-proving
tags:
- mathematics
- reasoning
- math-olympiad
- proof
pretty_name: IMO-ProofBench
size_categories:
- n<1K
---

# IMO-ProofBench

IMO-ProofBench consists of 60 expert-vetted, proof-based mathematical problems. Each problem includes a complete ground-truth solution and problem-specific grading guidelines.

## Dataset Structure

The dataset contains the following columns:

- `Problem ID`: Unique identifier.
- `Problem`: The problem statement in LaTeX format.
- `Solution`: Full reference proof.
- `Grading guidelines`: Step-by-step criteria for scoring.
- `Category`: Main mathematical category.
- `Level`: Difficulty classification (IMO-easy, IMO-medium, IMO-hard).
- `Short Answer`: A brief summary of the final answer/result.
- `Source`: The original competition source.
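
The `Level` and `Category` columns support simple stratified summaries. The rows below are illustrative stand-ins, not real dataset content:

```python
from collections import Counter

# Hypothetical IMO-ProofBench rows, following the column schema above.
rows = [
    {"Problem ID": "pb-01", "Level": "IMO-easy",   "Category": "Algebra"},
    {"Problem ID": "pb-02", "Level": "IMO-medium", "Category": "Geometry"},
    {"Problem ID": "pb-03", "Level": "IMO-easy",   "Category": "Combinatorics"},
]

# Count problems per difficulty level.
level_counts = Counter(r["Level"] for r in rows)
print(dict(level_counts))  # {'IMO-easy': 2, 'IMO-medium': 1}
```

With only 60 problems in the full benchmark, per-level counts like these are useful for reporting results stratified by difficulty.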

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```