
Add AIME2025 benchmark #10

Open

AlexCuadron wants to merge 133 commits into main from add-aime2025-benchmark

Conversation

@AlexCuadron (Owner) commented Mar 5, 2025

This PR adds a new benchmark for AIME2025 based on the existing AIME2024 benchmark. The benchmark evaluates AI agents on problems from the American Invitational Mathematics Examination (AIME) 2025.

Changes

  • Created a new directory structure for AIME2025 benchmark
  • Copied and adapted the necessary files from AIME2024
  • Updated the dataset loading code to use the AIME2025 dataset from Hugging Face (a loading sketch follows this list)
  • Added test scripts for dataset loading and answer extraction
  • Updated all references from AIME2024 to AIME2025 in the code
  • Added a fallback mechanism for the case where Docker is not available
  • Improved the robustness of the scripts with better error handling
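
For context, here is a minimal sketch of what the dataset-loading path can look like. The Hugging Face dataset id (`your-org/AIME2025`), split, and column names below are placeholders for illustration, not the exact values used in this PR.

```python
# Hypothetical sketch of loading the AIME2025 dataset from Hugging Face.
# The dataset id, split, and column names are placeholders, not the PR's values.
from datasets import load_dataset


def load_aime2025_instances():
    dataset = load_dataset("your-org/AIME2025", split="test")
    instances = []
    for idx, row in enumerate(dataset):
        instances.append(
            {
                "instance_id": f"aime2025_{idx}",
                "problem": row["question"],    # assumed column name
                "answer": str(row["answer"]),  # assumed column name
            }
        )
    return instances
```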

Testing

The benchmark has been tested with:

  • Dataset loading from Hugging Face
  • Answer extraction and normalization (a sketch follows this list)
  • Fallback to local runtime when Docker is not available
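
As a rough illustration of the answer extraction and normalization step (a sketch only; the `<solution>` tag format and helper name are assumptions, not the PR's exact implementation), the idea is to pull the final integer the agent reports and normalize it to the 0-999 range AIME answers use:

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Hypothetical sketch: extract and normalize an AIME answer from agent output.

    Assumes the agent reports its answer inside <solution>...</solution> tags,
    falling back to the last integer in the response. AIME answers are 0-999.
    """
    match = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)
    candidate = match.group(1) if match else response

    numbers = re.findall(r"\d+", candidate)
    if not numbers:
        return None

    answer = int(numbers[-1])  # normalizes away leading zeros and whitespace
    return str(answer) if 0 <= answer <= 999 else None
```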

openhands-agent and others added 30 commits February 25, 2025 04:45
- Added update_llm_config_for_completions_logging to imports
- Modified get_config to accept instance parameter
- Updated llm_config to enable completions logging
- Updated process_instance to pass instance to get_config

This change makes aider_bench save llm_completions in the same way as swe_bench,
with completions being saved in {eval_output_dir}/llm_completions/{instance_id}/
(a sketch of this wiring follows the commit list below).
…tions-fork

feat: Enable llm_completions logging in aider_bench
Add polyglot benchmark implementation
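
The completions-logging commit above names `update_llm_config_for_completions_logging`, `get_config`, and `process_instance`; a rough sketch of that wiring is below. Only those names come from the commit message: the signatures, the `LLMConfig` stand-in, and its fields are assumptions, and the real OpenHands code will differ.

```python
# Hypothetical sketch of the completions-logging wiring described in the commit.
# Only the function names come from the commit message; everything else is assumed.
import os
from dataclasses import dataclass


@dataclass
class LLMConfig:  # stand-in for the real OpenHands LLMConfig
    model: str = "gpt-4o"
    log_completions: bool = False
    log_completions_folder: str = ""


def update_llm_config_for_completions_logging(llm_config, eval_output_dir, instance_id):
    """Point completion logs at {eval_output_dir}/llm_completions/{instance_id}/."""
    llm_config.log_completions = True
    llm_config.log_completions_folder = os.path.join(
        eval_output_dir, "llm_completions", instance_id
    )
    return llm_config


def get_config(llm_config, eval_output_dir, instance):
    # get_config now takes the instance so logging can be scoped per instance.
    return update_llm_config_for_completions_logging(
        llm_config, eval_output_dir, instance["instance_id"]
    )


if __name__ == "__main__":
    cfg = get_config(LLMConfig(), "evaluation/outputs/aime2025", {"instance_id": "aime2025_0"})
    print(cfg.log_completions_folder)  # evaluation/outputs/aime2025/llm_completions/aime2025_0
```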