Add benchmarks from FC4RLLM repository by AlexCuadron · Pull Request #7 · AlexCuadron/OpenHands

AlexCuadron · 2025-02-26T08:51:45Z

This PR adds implementations for several benchmarks from the FC4RLLM repository:

- APPS: A code generation benchmark with 10,000 programming problems
- MATH: A dataset of 12,500 challenging competition mathematics problems
- HotpotQA: A question answering dataset featuring multi-hop questions
- WikiTableQuestion: (Placeholder created, but not fully implemented)
- AlfWorld: (Placeholder created, but not fully implemented)

Each benchmark follows the same structure as the existing aider_bench, with appropriate modifications for the specific dataset.

- Added update_llm_config_for_completions_logging to imports
- Modified get_config to accept instance parameter
- Updated llm_config to enable completions logging
- Updated process_instance to pass instance to get_config

This change makes aider_bench save llm_completions in the same way as swe_bench, with completions being saved in {eval_output_dir}/llm_completions/{instance_id}/
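As a rough illustration of the change described above, the sketch below shows one way a per-instance completions folder could be derived and attached to an LLM config. It is not the code from this PR: `LLMConfigSketch`, `with_completions_logging`, and `get_config_sketch` are hypothetical stand-ins, and the field names `log_completions` and `log_completions_folder` are assumptions made for the example. Only the directory layout `{eval_output_dir}/llm_completions/{instance_id}/` comes from the commit message.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class LLMConfigSketch:
    """Hypothetical stand-in for an LLM config; only the fields used here."""
    model: str
    log_completions: bool = False
    log_completions_folder: Optional[str] = None


def with_completions_logging(llm_config, eval_output_dir, instance_id):
    """Return a copy of llm_config that logs completions for one instance.

    The folder follows the layout named in the commit message:
    {eval_output_dir}/llm_completions/{instance_id}/
    """
    folder = Path(eval_output_dir) / "llm_completions" / str(instance_id)
    return replace(
        llm_config,
        log_completions=True,
        log_completions_folder=str(folder),
    )


def get_config_sketch(instance, base_llm_config, eval_output_dir):
    """Hypothetical get_config that accepts the instance being evaluated.

    Taking the instance as a parameter is what allows the completions
    folder to be chosen per instance rather than once per run.
    """
    return with_completions_logging(
        base_llm_config, eval_output_dir, instance["instance_id"]
    )
```

Returning a modified copy rather than mutating the shared config keeps one instance's logging folder from leaking into the next when instances are processed in sequence or in parallel.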

github-actions · 2025-03-31T02:47:35Z

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

openhands-agent and others added 25 commits February 25, 2025 04:45

Merge pull request #4 from AlexCuadron/feature/aider-bench-llm-completions-fork

9315619

feat: Enable llm_completions logging in aider_bench

Merge remote-tracking branch 'origin/main'

dc59367

Add polyglot benchmark implementation

bc8f20d

Fix argument parser in polyglot benchmark

37ba696

Improve polyglot benchmark path handling and fix logging error

890377d

Add Docker configuration options and troubleshooting guide

8af6f11

Add local Docker image build support for polyglot benchmark

32335ff

Set Docker image to build automatically by default

5610010

Fix Docker build issues by adding unzip and simplifying Gradle installation

c9e232e

Restrict polyglot benchmark to use only the same tools as SWE-Bench (execute_bash, finish, str_replace_editor)

97e7ca7

Fix runtime completion to use Docker runtime for running tests

44bcb39

Add script to test one instance per language in polyglot benchmark

601da45

Add one-per-language testing mode to polyglot benchmark run_infer.sh

84293fd

Update README with one-per-language testing instructions and command-line options

87d9e15

Enable LLM completions logging in aider_bench run_infer.py

8a5dc59

Include tools information in evaluation output directory names

8ffe33e

Add evaluation parameter to run_infer.sh scripts for aider_bench and polyglot_benchmark

d45b98d

Update README files with documentation for the new evaluation parameter

62d2632

Fix output directory detection in evaluation scripts

c8dab2c

Fix LLM completions logging to ensure it's enabled in all benchmarks

fa9a0f8

Improve output directory detection in evaluation scripts with better path matching and debugging output

8a4ca1e

Fix handling of 'eval' parameter to prevent it from being treated as an instance ID

a2d7e63

Merge pull request #6 from AlexCuadron/polyglot-benchmark-clean

013ff2d

Add polyglot benchmark implementation

Add benchmarks from FC4RLLM repository

4eb5a22

AlexCuadron force-pushed the main branch from 013ff2d to 205a79b Compare February 28, 2025 23:52

github-actions bot added the Stale label Mar 31, 2025

AlexCuadron force-pushed the main branch from e76a772 to 9adfced Compare April 1, 2025 12:20

AlexCuadron wants to merge 25 commits into main from add-fc4rllm-benchmarks
