`imobench/hf_dataset_cards/answerbench.md` (new file, 44 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
tags:
- mathematics
- reasoning
- math-olympiad
pretty_name: IMO-AnswerBench
size_categories:
- n<1K
---

# IMO-AnswerBench

IMO-AnswerBench is a dataset of 400 challenging short-answer problems drawn from national, regional, and international mathematical olympiads (IMO, IMO Shortlist, USAMO, USAMTS, AIME, etc.). It is designed to evaluate the robust mathematical reasoning capabilities of AI models.

## Dataset Structure

The dataset contains the following columns:

- `Problem ID`: Unique identifier for each problem.
- `Problem`: The problem statement in LaTeX format.
- `Short Answer`: The gold standard final answer.
- `Category`: One of the four main IMO categories: Algebra, Combinatorics, Geometry, or Number theory.
- `Level`: Difficulty classification (pre-IMO, IMO-easy, IMO-medium, IMO-hard).
- `Subcategory`: Specific mathematical sub-topic.
- `Source`: The original competition source of the problem.
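
The columns above can be sketched with a hypothetical record (the field values below are illustrative, not taken from the dataset; the repository ID on the Hub is not stated here, so this sketch works on plain dictionaries):

```python
# A hypothetical IMO-AnswerBench row, matching the column schema above.
record = {
    "Problem ID": "answerbench-001",  # illustrative ID, not a real entry
    "Problem": r"Find all positive integers $n$ such that $n^2 + 1$ divides $n^4 + 1$.",
    "Short Answer": "n = 1",
    "Category": "Number theory",
    "Level": "pre-IMO",
    "Subcategory": "Divisibility",
    "Source": "Illustrative example",
}

def is_hard_algebra(row: dict) -> bool:
    """Filter predicate: keep only IMO-hard problems in the Algebra category."""
    return row["Category"] == "Algebra" and row["Level"] == "IMO-hard"

# The sample row is Number theory / pre-IMO, so the predicate rejects it.
print(is_hard_algebra(record))
```

The same predicate could be passed to `datasets.Dataset.filter` once the dataset is loaded from the Hub.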

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```

`imobench/hf_dataset_cards/gradingbench.md` (new file, 45 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- evaluation
tags:
- mathematics
- reasoning
- auto-grading
- human-evaluation
pretty_name: IMO-GradingBench
size_categories:
- 100K<n<1M
---

# IMO-GradingBench

IMO-GradingBench is a large dataset designed to advance automatic evaluation of mathematical proofs. It contains over 186,000 grading entries (including a subset of 1,000 high-quality human gradings) for model-generated solutions to IMO-Bench problems.

## Dataset Structure

The dataset contains the following columns:

- `Grading ID`: Unique identifier for the grading entry.
- `Problem ID`: Reference to the problem being graded.
- `Problem`: The problem statement.
- `Solution`: Reference ground-truth solution.
- `Grading guidelines`: Criteria used for grading.
- `Response`: The model-generated output being evaluated.
- `Points`: Numeric score assigned (0-10 or 0-7 scale depending on configuration).
- `Reward`: Qualitative category (Correct, Partial, Incorrect, etc.).
- `Problem Source`: Original competition source.
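
As a minimal sketch of how the `Points` and `Reward` columns relate, the snippet below checks a few hypothetical grading entries for consistency under a 0-7 scale. The threshold mapping is an illustrative assumption, not the benchmark's official rubric:

```python
# Hypothetical grading entries following the column schema above.
entries = [
    {"Grading ID": "g1", "Points": 7, "Reward": "Correct"},
    {"Grading ID": "g2", "Points": 3, "Reward": "Partial"},
    {"Grading ID": "g3", "Points": 0, "Reward": "Incorrect"},
]

def reward_for(points: int, max_points: int = 7) -> str:
    """Illustrative mapping from a numeric score to a coarse category
    (assumed thresholds, not the official grading guideline)."""
    if points == max_points:
        return "Correct"
    if points > 0:
        return "Partial"
    return "Incorrect"

# Collect IDs whose qualitative Reward disagrees with the assumed mapping.
mismatches = [e["Grading ID"] for e in entries
              if reward_for(e["Points"]) != e["Reward"]]
print(mismatches)  # all three sample entries are consistent, so this is empty
```

A consistency pass like this is a cheap sanity check before training or evaluating an automatic grader on the entries.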

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```

`imobench/hf_dataset_cards/proofbench.md` (new file, 44 lines added)

---
language:
- en
license: cc-by-4.0
task_categories:
- theorem-proving
tags:
- mathematics
- reasoning
- math-olympiad
- proof
pretty_name: IMO-ProofBench
size_categories:
- n<1K
---

# IMO-ProofBench

IMO-ProofBench consists of 60 expert-vetted, proof-based mathematical problems. Each problem includes a complete ground-truth solution and problem-specific grading guidelines.

## Dataset Structure

The dataset contains the following columns:

- `Problem ID`: Unique identifier.
- `Problem`: The problem statement in LaTeX format.
- `Solution`: Full reference proof.
- `Grading guidelines`: Step-by-step criteria for scoring.
- `Category`: Main mathematical category.
- `Level`: Difficulty classification (IMO-easy, IMO-medium, IMO-hard).
- `Short Answer`: A brief summary of the final answer/result.
- `Source`: The original competition source.
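
The `Level` and `Category` columns support simple stratified summaries. The rows below are illustrative stand-ins, not real dataset content:

```python
from collections import Counter

# Hypothetical IMO-ProofBench rows, following the column schema above.
rows = [
    {"Problem ID": "pb-01", "Level": "IMO-easy",   "Category": "Algebra"},
    {"Problem ID": "pb-02", "Level": "IMO-medium", "Category": "Geometry"},
    {"Problem ID": "pb-03", "Level": "IMO-easy",   "Category": "Combinatorics"},
]

# Count problems per difficulty level.
level_counts = Counter(r["Level"] for r in rows)
print(dict(level_counts))  # {'IMO-easy': 2, 'IMO-medium': 1}
```

With only 60 problems in the full benchmark, per-level counts like these are useful for reporting results stratified by difficulty.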

## Citation

```bibtex
@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = {Thang Luong and Dawsen Hwang and Hoang H. Nguyen and Golnaz Ghiasi and Yuri Chervonyi and Insuk Seo and Junsu Kim and Garrett Bingham and Jonathan Lee and Swaroop Mishra and Alex Zhai and Clara Huiyi Hu and Henryk Michalewski and Jimin Kim and Jeonghyun Ahn and Junhwi Bae and Xingyou Song and Trieu H. Trinh and Quoc V. Le and Junehyuk Jung},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
```