
Add AIME2025 benchmark #10

Open

AlexCuadron wants to merge 133 commits into main from add-aime2025-benchmark

Conversation

@AlexCuadron (Owner) commented Mar 5, 2025

This PR adds a new benchmark for AIME2025 based on the existing AIME2024 benchmark. The benchmark evaluates AI agents on problems from the American Invitational Mathematics Examination (AIME) 2025.

Changes

  • Created a new directory structure for AIME2025 benchmark
  • Copied and adapted the necessary files from AIME2024
  • Updated the dataset loading code to use the AIME2025 dataset from Hugging Face (a loading sketch follows this list)
  • Added test scripts for dataset loading and answer extraction
  • Updated all references from AIME2024 to AIME2025 in the code
  • Added a fallback mechanism for the case where Docker is not available
  • Improved the robustness of the scripts with better error handling
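
For context, here is a minimal sketch of what the dataset-loading path can look like. The Hugging Face dataset id (`your-org/AIME2025`), split, and column names below are placeholders for illustration, not the exact values used in this PR.

```python
# Hypothetical sketch of loading the AIME2025 dataset from Hugging Face.
# The dataset id, split, and column names are placeholders, not the PR's values.
from datasets import load_dataset


def load_aime2025_instances():
    dataset = load_dataset("your-org/AIME2025", split="test")
    instances = []
    for idx, row in enumerate(dataset):
        instances.append(
            {
                "instance_id": f"aime2025_{idx}",
                "problem": row["question"],    # assumed column name
                "answer": str(row["answer"]),  # assumed column name
            }
        )
    return instances
```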

Testing

The benchmark has been tested with:

  • Dataset loading from Hugging Face
  • Answer extraction and normalization (a sketch follows this list)
  • Fallback to local runtime when Docker is not available
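
As a rough illustration of the answer extraction and normalization step (a sketch only; the `<solution>` tag format and helper name are assumptions, not the PR's exact implementation), the idea is to pull the final integer the agent reports and normalize it to the 0-999 range AIME answers use:

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Hypothetical sketch: extract and normalize an AIME answer from agent output.

    Assumes the agent reports its answer inside <solution>...</solution> tags,
    falling back to the last integer in the response. AIME answers are 0-999.
    """
    match = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)
    candidate = match.group(1) if match else response

    numbers = re.findall(r"\d+", candidate)
    if not numbers:
        return None

    answer = int(numbers[-1])  # normalizes away leading zeros and whitespace
    return str(answer) if 0 <= answer <= 999 else None
```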

openhands-agent and others added 30 commits February 25, 2025 04:45
- Added update_llm_config_for_completions_logging to imports
- Modified get_config to accept instance parameter
- Updated llm_config to enable completions logging
- Updated process_instance to pass instance to get_config

This change makes aider_bench save llm_completions in the same way as swe_bench,
with completions being saved in {eval_output_dir}/llm_completions/{instance_id}/
(a sketch of this wiring follows the commit list below).
…tions-fork

feat: Enable llm_completions logging in aider_bench
Add polyglot benchmark implementation
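
The completions-logging commit above names `update_llm_config_for_completions_logging`, `get_config`, and `process_instance`; a rough sketch of that wiring is below. Only those names come from the commit message: the signatures, the `LLMConfig` stand-in, and its fields are assumptions, and the real OpenHands code will differ.

```python
# Hypothetical sketch of the completions-logging wiring described in the commit.
# Only the function names come from the commit message; everything else is assumed.
import os
from dataclasses import dataclass


@dataclass
class LLMConfig:  # stand-in for the real OpenHands LLMConfig
    model: str = "gpt-4o"
    log_completions: bool = False
    log_completions_folder: str = ""


def update_llm_config_for_completions_logging(llm_config, eval_output_dir, instance_id):
    """Point completion logs at {eval_output_dir}/llm_completions/{instance_id}/."""
    llm_config.log_completions = True
    llm_config.log_completions_folder = os.path.join(
        eval_output_dir, "llm_completions", instance_id
    )
    return llm_config


def get_config(llm_config, eval_output_dir, instance):
    # get_config now takes the instance so logging can be scoped per instance.
    return update_llm_config_for_completions_logging(
        llm_config, eval_output_dir, instance["instance_id"]
    )


if __name__ == "__main__":
    cfg = get_config(LLMConfig(), "evaluation/outputs/aime2025", {"instance_id": "aime2025_0"})
    print(cfg.log_completions_folder)  # evaluation/outputs/aime2025/llm_completions/aime2025_0
```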