Draft

Changes from all commits (112 commits)
92e98f6
feat: Enable llm_completions logging in aider_bench
openhands-agent Feb 25, 2025
9315619
Merge pull request #4 from AlexCuadron/feature/aider-bench-llm-comple…
AlexCuadron Feb 25, 2025
dc59367
Merge remote-tracking branch 'origin/main'
AlexCuadron Feb 26, 2025
bc8f20d
Add polyglot benchmark implementation
AlexCuadron Feb 26, 2025
37ba696
Fix argument parser in polyglot benchmark
AlexCuadron Feb 26, 2025
890377d
Improve polyglot benchmark path handling and fix logging error
AlexCuadron Feb 26, 2025
8af6f11
Add Docker configuration options and troubleshooting guide
AlexCuadron Feb 26, 2025
32335ff
Add local Docker image build support for polyglot benchmark
AlexCuadron Feb 26, 2025
5610010
Set Docker image to build automatically by default
AlexCuadron Feb 26, 2025
c9e232e
Fix Docker build issues by adding unzip and simplifying Gradle instal…
AlexCuadron Feb 26, 2025
97e7ca7
Restrict polyglot benchmark to use only the same tools as SWE-Bench (…
AlexCuadron Feb 26, 2025
44bcb39
Fix runtime completion to use Docker runtime for running tests
AlexCuadron Feb 26, 2025
601da45
Add script to test one instance per language in polyglot benchmark
AlexCuadron Feb 26, 2025
84293fd
Add one-per-language testing mode to polyglot benchmark run_infer.sh
AlexCuadron Feb 26, 2025
87d9e15
Update README with one-per-language testing instructions and command-…
AlexCuadron Feb 26, 2025
8a5dc59
Enable LLM completions logging in aider_bench run_infer.py
AlexCuadron Feb 26, 2025
8ffe33e
Include tools information in evaluation output directory names
AlexCuadron Feb 26, 2025
d45b98d
Add evaluation parameter to run_infer.sh scripts for aider_bench and …
AlexCuadron Feb 26, 2025
62d2632
Update README files with documentation for the new evaluation parameter
AlexCuadron Feb 26, 2025
c8dab2c
Fix output directory detection in evaluation scripts
AlexCuadron Feb 26, 2025
fa9a0f8
Fix LLM completions logging to ensure it's enabled in all benchmarks
AlexCuadron Feb 26, 2025
8a4ca1e
Improve output directory detection in evaluation scripts with better …
AlexCuadron Feb 26, 2025
a2d7e63
Fix handling of 'eval' parameter to prevent it from being treated as …
AlexCuadron Feb 26, 2025
013ff2d
Merge pull request #6 from AlexCuadron/polyglot-benchmark-clean
AlexCuadron Feb 26, 2025
ee6026b
feat: Enable llm_completions logging in aider_bench
openhands-agent Feb 25, 2025
96f6c8a
Add polyglot benchmark implementation
AlexCuadron Feb 26, 2025
ccff971
Fix argument parser in polyglot benchmark
AlexCuadron Feb 26, 2025
e63c293
Improve polyglot benchmark path handling and fix logging error
AlexCuadron Feb 26, 2025
3e98953
Add Docker configuration options and troubleshooting guide
AlexCuadron Feb 26, 2025
95e212b
Add local Docker image build support for polyglot benchmark
AlexCuadron Feb 26, 2025
ec56525
Set Docker image to build automatically by default
AlexCuadron Feb 26, 2025
1117f17
Fix Docker build issues by adding unzip and simplifying Gradle instal…
AlexCuadron Feb 26, 2025
68aeb43
Restrict polyglot benchmark to use only the same tools as SWE-Bench (…
AlexCuadron Feb 26, 2025
1f9c157
Fix runtime completion to use Docker runtime for running tests
AlexCuadron Feb 26, 2025
929c475
Add script to test one instance per language in polyglot benchmark
AlexCuadron Feb 26, 2025
98bddf9
Add one-per-language testing mode to polyglot benchmark run_infer.sh
AlexCuadron Feb 26, 2025
d96491e
Update README with one-per-language testing instructions and command-…
AlexCuadron Feb 26, 2025
65b6c6f
Enable LLM completions logging in aider_bench run_infer.py
AlexCuadron Feb 26, 2025
3018e95
Include tools information in evaluation output directory names
AlexCuadron Feb 26, 2025
a3a0876
Add evaluation parameter to run_infer.sh scripts for aider_bench and …
AlexCuadron Feb 26, 2025
5bbb8ab
Update README files with documentation for the new evaluation parameter
AlexCuadron Feb 26, 2025
f6ea8de
Fix output directory detection in evaluation scripts
AlexCuadron Feb 26, 2025
d279418
Fix LLM completions logging to ensure it's enabled in all benchmarks
AlexCuadron Feb 26, 2025
1a9bd9b
Improve output directory detection in evaluation scripts with better …
AlexCuadron Feb 26, 2025
205a79b
Fix handling of 'eval' parameter to prevent it from being treated as …
AlexCuadron Feb 26, 2025
d8bd1e4
Structured logging mode (#7034)
raymyers Mar 1, 2025
4012d34
Add MATH-500 benchmark with custom finish tool
AlexCuadron Mar 1, 2025
33002e4
Add run_infer.sh script for MATH-500 benchmark
AlexCuadron Mar 1, 2025
750e083
Fix error handling in MATH-500 benchmark
AlexCuadron Mar 1, 2025
0b27dc8
Update README with run_infer.sh usage instructions
AlexCuadron Mar 1, 2025
3534be6
Add support for togetherDeepseek model in run_infer.sh
AlexCuadron Mar 1, 2025
2d647e8
Update README with togetherDeepseek model information
AlexCuadron Mar 1, 2025
ead4068
Fix run_infer.sh script to properly handle togetherDeepseek model
AlexCuadron Mar 1, 2025
edd1152
Update README with instructions for setting the Together API key
AlexCuadron Mar 1, 2025
666a7c5
Fix KeyError in fn_call_converter.py by adding proper key existence c…
AlexCuadron Mar 1, 2025
dac2200
Remove temporary config file creation in math500 run_infer.sh
AlexCuadron Mar 1, 2025
1ee8951
Fix LiteLLM cost calculation for unmapped models
AlexCuadron Mar 1, 2025
d9b35cb
Limit CodeActAgent to only use IPython tool for MATH500 benchmark
AlexCuadron Mar 1, 2025
b264ff1
Fix tool configuration for MATH500 benchmark to be compatible with fu…
AlexCuadron Mar 1, 2025
bd93ed4
Suppress all logging for unmapped models in LiteLLM cost calculation
AlexCuadron Mar 1, 2025
ce71ae9
Create custom Math500CodeActAgent that only uses IPython and Finish t…
AlexCuadron Mar 1, 2025
b10994d
Add ability to specify allowed tools for MATH500 benchmark via run_in…
AlexCuadron Mar 1, 2025
25c151c
Merge changes and fix conflicts
AlexCuadron Mar 1, 2025
70cd04d
Fix EvalMetadata usage by storing allowed_tools in details field
AlexCuadron Mar 1, 2025
681fec2
Update in-context learning example to use IPython for math problems
AlexCuadron Mar 1, 2025
9b7e033
Update first example to show model correcting its mistake using Python
AlexCuadron Mar 1, 2025
b37b022
Enhance function call example to demonstrate model self-correction th…
AlexCuadron Mar 1, 2025
48e1494
Enhance MATH500 benchmark to encourage Python verification at each step
AlexCuadron Mar 1, 2025
89b57c5
Add sympy and other math libraries to MATH500 benchmark environment
AlexCuadron Mar 1, 2025
26491b7
Make MATH500 instructions more general about tool verification rather…
AlexCuadron Mar 1, 2025
0a1c5d9
Fix: Use runtime_extra_deps instead of setup_commands for installing …
AlexCuadron Mar 1, 2025
c24ba5a
Fix: Use jupyter/scipy-notebook image with pre-installed scientific l…
AlexCuadron Mar 1, 2025
6cfb166
Fix: Simplify Docker setup by using standard Python image with pip in…
AlexCuadron Mar 1, 2025
28d2a38
Update instructions to have agent install libraries directly with %pip
AlexCuadron Mar 1, 2025
0ffc6de
Merge branch 'main' into test
AlexCuadron Mar 1, 2025
21b4fe5
Merge pull request #9 from AlexCuadron/test
AlexCuadron Mar 1, 2025
3a03ca3
Add AIME2024 benchmark based on AI-MO/aimo-validation-aime dataset
AlexCuadron Mar 1, 2025
0b19a7a
Merge remote changes
AlexCuadron Mar 1, 2025
c62c109
Update AIME2024 scripts to support positional arguments for compatibi…
AlexCuadron Mar 1, 2025
b673ed8
Fix AIME2024 scripts to match MATH500 format exactly for compatibility
AlexCuadron Mar 1, 2025
e930fc7
Improve answer extraction and normalization for AIME2024 benchmark
AlexCuadron Mar 1, 2025
85344a3
Add eval_infer.sh script for running evaluation on existing output files
AlexCuadron Mar 1, 2025
af98361
Significantly improve answer extraction and add debugging tools for A…
AlexCuadron Mar 1, 2025
ec0607a
Enhance AIME2024 prompt to encourage problem decomposition and struct…
AlexCuadron Mar 2, 2025
a92155b
Update fn_call_converter.py with structured problem-solving example
AlexCuadron Mar 2, 2025
ac6b727
Add solution parameter to FinishTool for benchmark problems
AlexCuadron Mar 2, 2025
d5c0ce1
Improve solution parameter description in FinishTool
AlexCuadron Mar 2, 2025
8bc3df4
Enhance solution parameter instructions and examples for benchmark pr…
AlexCuadron Mar 2, 2025
b560a81
Fix contradictory instructions for solution parameter
AlexCuadron Mar 2, 2025
ca91a3c
Add explicit reminders about properly closing function tags and using…
AlexCuadron Mar 2, 2025
64f44d8
Improve answer normalization for AIME benchmark with numerical compar…
AlexCuadron Mar 2, 2025
566d2b2
Enhance AIME benchmark analysis with detailed answer comparison
AlexCuadron Mar 2, 2025
60c855e
Enforce Python usage before allowing finish function
AlexCuadron Mar 2, 2025
42d2366
Fix missing import for IPythonRunCellAction
AlexCuadron Mar 2, 2025
094295c
Update instructions to focus on programmatic approach instead of sub-…
AlexCuadron Mar 2, 2025
1bb396d
Improve answer normalization for mathematical expressions with sqrt
AlexCuadron Mar 2, 2025
2d90cd4
Update instructions to emphasize step-by-step verification with code
AlexCuadron Mar 2, 2025
4feb0da
Fix directory creation and add error handling in analyze_results.py
AlexCuadron Mar 2, 2025
1222571
Add warnings about floating-point calculations and rounding errors
AlexCuadron Mar 2, 2025
8c88a22
Add final verification step before accepting finish action
AlexCuadron Mar 2, 2025
3c82377
Update MATH500 helper.py to match AIME2024 instructions
AlexCuadron Mar 2, 2025
062db5e
Enhance AIME2024 benchmark with boxed answer format and temperature o…
AlexCuadron Mar 3, 2025
bc9789f
Integrate ThinkingAgent to detect and filter overthinking solutions i…
AlexCuadron Mar 3, 2025
7b6053d
Improve ThinkingAgent integration with file generation and analysis
AlexCuadron Mar 3, 2025
b237ddb
Apply temperature settings and boxed answer directive to Math500 benc…
AlexCuadron Mar 3, 2025
7be62fc
Fix answer normalization to handle currency values properly in Math50…
AlexCuadron Mar 3, 2025
164fcba
sth
AlexCuadron Mar 3, 2025
9ad1892
Fix overthinking analysis in AIME2024 benchmark
AlexCuadron Mar 3, 2025
c96c1e1
Fix LLM completion method in overthinking analysis
AlexCuadron Mar 3, 2025
4b52092
Implement retry mechanism for overthinking solutions
AlexCuadron Mar 3, 2025
e76a772
Merge pull request #12 from AlexCuadron/aime2024-benchmark
AlexCuadron Mar 5, 2025
a461b98
Add thinking prefix and tool response for empty assistant messages
AlexCuadron Mar 6, 2025
7 changes: 6 additions & 1 deletion evaluation/benchmarks/aider_bench/README.md
@@ -16,7 +16,7 @@ development environment and LLM.
## Start the evaluation

```bash
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids] [run_evaluation]
```

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for
@@ -31,6 +31,7 @@ development environment and LLM.
- `eval-num-workers`: the number of workers to use for evaluation. Default: `1`.
- `eval_ids`, e.g. `"1,3,10"`, limits the evaluation to instances with the
given IDs (comma separated).
- `run_evaluation`: set to `eval` to automatically run evaluation after the benchmark completes.

There are also the following optional environment variables you can set:

@@ -53,7 +54,11 @@ You can update the arguments in the script
- `--eval-ids`: the IDs of the examples to evaluate (comma separated). For example, `"1,3,10"`.

```bash
# Run benchmark without evaluation
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"

# Run benchmark with automatic evaluation
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10" eval
```

### Run Inference on `RemoteRuntime` (experimental)
22 changes: 20 additions & 2 deletions evaluation/benchmarks/aider_bench/run_infer.py
@@ -21,6 +21,7 @@
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
update_llm_config_for_completions_logging,
)
from openhands.controller.state.state import State
from openhands.core.config import (
@@ -44,6 +45,7 @@


def get_config(
instance: pd.Series,
metadata: EvalMetadata,
) -> AppConfig:
sandbox_config = get_default_sandbox_config_for_eval()
@@ -58,7 +60,13 @@ def get_config(
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
# Update llm_config to enable completions logging
llm_config = update_llm_config_for_completions_logging(
metadata.llm_config,
metadata.eval_output_dir,
str(instance.instance_id)
)
config.set_llm_config(llm_config)
agent_config = config.get_agent_config(metadata.agent_class)
agent_config.enable_prompt_extensions = False

@@ -161,7 +169,7 @@ def process_instance(
metadata: EvalMetadata,
reset_logger: bool = True,
) -> EvalOutput:
config = get_config(metadata)
config = get_config(instance, metadata)

# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
@@ -275,13 +283,23 @@ def process_instance(
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')

# Create details dictionary with agent configuration
agent_details = {
"agent_config": {
"codeact_enable_jupyter": False,
"codeact_enable_browsing": False,
"codeact_enable_llm_editor": False,
}
}

metadata = make_metadata(
llm_config,
'AiderBench',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
details=agent_details,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')

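
For readers skimming the diff, here is a minimal sketch of what the newly imported `update_llm_config_for_completions_logging` helper is presumably doing: cloning the LLM config and pointing completion logs at a per-instance folder under the evaluation output directory. The helper is imported above alongside `prepare_dataset` and `run_evaluation`; the field names below are assumptions, not the confirmed implementation.

```python
# Hypothetical sketch only -- the actual helper is part of the shared
# evaluation utilities and its field names may differ.
import copy
import os


def update_llm_config_for_completions_logging(llm_config, eval_output_dir, instance_id):
    """Return a copy of llm_config with per-instance completion logging enabled."""
    updated = copy.deepcopy(llm_config)
    updated.log_completions = True  # assumed flag on the LLM config
    # One sub-folder per benchmark instance keeps raw completions separable.
    updated.log_completions_folder = os.path.join(
        eval_output_dir, 'llm_completions', str(instance_id)
    )
    return updated
```

This would also explain why `get_config` now takes `instance`: the log folder is keyed by `instance.instance_id`.
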
66 changes: 65 additions & 1 deletion evaluation/benchmarks/aider_bench/scripts/run_infer.sh
@@ -9,6 +9,21 @@ AGENT=$3
EVAL_LIMIT=$4
NUM_WORKERS=$5
EVAL_IDS=$6
RUN_EVALUATION=$7 # New parameter to run evaluation after benchmark

# Special case: if the 7th parameter is "eval", set RUN_EVALUATION to "eval"
if [ "$RUN_EVALUATION" = "eval" ]; then
echo "Evaluation mode enabled"
fi

# Special case: if any parameter is "eval", set RUN_EVALUATION to "eval"
for param in "$@"; do
if [ "$param" = "eval" ]; then
RUN_EVALUATION="eval"
echo "Evaluation mode enabled"
break
fi
done

if [ -z "$NUM_WORKERS" ]; then
NUM_WORKERS=1
@@ -51,10 +66,59 @@ if [ -n "$EVAL_LIMIT" ]; then
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi

if [ -n "$EVAL_IDS" ]; then
# Only pass eval-ids if it's not "eval" (which is a special parameter for evaluation mode)
if [ -n "$EVAL_IDS" ] && [ "$EVAL_IDS" != "eval" ]; then
echo "EVAL_IDS: $EVAL_IDS"
COMMAND="$COMMAND --eval-ids $EVAL_IDS"
fi

# Run the command
eval $COMMAND

# Get the output directory - first try the default location
OUTPUT_DIR=$(find evaluation/evaluation_outputs -path "*/AiderBench/$AGENT/*" -type d -name "*$EVAL_NOTE*" 2>/dev/null | sort -r | head -n 1)

# If not found, try to find it anywhere under evaluation_outputs
if [ -z "$OUTPUT_DIR" ]; then
OUTPUT_DIR=$(find . -path "*/evaluation_outputs/*" -path "*/AiderBench/$AGENT/*" -type d -name "*$EVAL_NOTE*" 2>/dev/null | sort -r | head -n 1)
fi

# If still not found, try to find any output.jsonl file
if [ -z "$OUTPUT_DIR" ]; then
OUTPUT_FILE=$(find . -name "output.jsonl" 2>/dev/null | sort -r | head -n 1)
if [ -n "$OUTPUT_FILE" ]; then
OUTPUT_DIR=$(dirname "$OUTPUT_FILE")
fi
else
OUTPUT_FILE="$OUTPUT_DIR/output.jsonl"
fi

# Print the output directory and file for debugging
echo ""
echo "Output directory: $OUTPUT_DIR"
echo "Output file: $OUTPUT_FILE"

# Run evaluation if requested
if [ "$RUN_EVALUATION" = "eval" ]; then
echo ""
echo "======================================"
echo "Running evaluation on results..."
echo "======================================"
echo ""

if [ -f "$OUTPUT_FILE" ]; then
echo "Evaluating results in: $OUTPUT_FILE"
poetry run python evaluation/benchmarks/aider_bench/scripts/summarize_results.py "$OUTPUT_FILE"

# Save the evaluation results
EVAL_RESULTS_FILE="$OUTPUT_DIR/evaluation_results.txt"
echo "Saving evaluation results to: $EVAL_RESULTS_FILE"
poetry run python evaluation/benchmarks/aider_bench/scripts/summarize_results.py "$OUTPUT_FILE" > "$EVAL_RESULTS_FILE"

echo ""
echo "Evaluation complete. Results saved to: $EVAL_RESULTS_FILE"
else
echo "Error: Output file not found: $OUTPUT_FILE"
echo "Cannot run evaluation."
fi
fi
103 changes: 103 additions & 0 deletions evaluation/benchmarks/aime2024/README.md
@@ -0,0 +1,103 @@
# AIME2024 Benchmark

This benchmark evaluates the performance of AI agents on problems from the American Invitational Mathematics Examination (AIME). The dataset is sourced from [AI-MO/aimo-validation-aime](https://huggingface.co/datasets/AI-MO/aimo-validation-aime) on Hugging Face.

## Dataset

The AIME is a challenging mathematics competition for high school students in the United States. The problems require advanced mathematical reasoning and problem-solving skills. The dataset contains 90 problems from various AIME competitions.

## Running the Benchmark

### Prerequisites

- Python 3.11+
- OpenHands installed
- Required Python packages: `datasets`, `pandas`, `matplotlib`

### Running a Single Example

To run a single example from the AIME2024 benchmark:

```bash
cd OpenHands
bash evaluation/benchmarks/aime2024/scripts/run_example.sh togetherDeepseek HEAD CodeActAgent 1 1 "0" "" ipython_only
```

This format follows: `<llm-config> <commit-hash> <agent-cls> <eval-limit> <num-workers> <eval-ids> <run-evaluation> <allowed-tools>`

This will run the first problem in the dataset.

### Running the Full Benchmark

To run the full AIME2024 benchmark:

```bash
cd OpenHands
bash evaluation/benchmarks/aime2024/scripts/run_infer.sh togetherDeepseek HEAD CodeActAgent 500 20 "" eval ipython_only
```

### Options

#### Positional Arguments:
1. `MODEL_CONFIG`: LLM configuration to use (required)
2. `COMMIT_HASH`: Git commit hash to use (optional)
3. `AGENT`: Agent class to use (default: "CodeActAgent")
4. `EVAL_LIMIT`: Limit the number of examples to evaluate (default: 0 for full benchmark, 1 for example)
5. `NUM_WORKERS`: Number of workers for parallel evaluation (default: 1)
6. `EVAL_IDS`: Comma-separated list of example IDs to evaluate (default: "" for full benchmark, "0" for example)
7. `RUN_EVALUATION`: Set to "eval" to run evaluation after benchmark
8. `ALLOWED_TOOLS`: Tools allowed for the agent (default: "all")

## Analyzing Results

There are three ways to analyze the results of the benchmark:

### 1. Using the eval_infer.sh script (recommended)

If you already have an output.jsonl file from a previous run, you can analyze it directly:

```bash
bash evaluation/benchmarks/aime2024/scripts/eval_infer.sh <path-to-output-jsonl> [output-directory]
```

Example:
```bash
bash evaluation/benchmarks/aime2024/scripts/eval_infer.sh ./evaluation/evaluation_outputs/AIME2024/CodeActAgent/v0.26.0/output.jsonl
```

### 2. Using the analyze_results.py script directly

```bash
poetry run python evaluation/benchmarks/aime2024/scripts/analyze_results.py <path-to-results-jsonl> --output-dir <output-directory>
```

### 3. Including "eval" in your benchmark run

Simply include "eval" in your command to automatically run the analysis after the benchmark:

```bash
bash evaluation/benchmarks/aime2024/scripts/run_infer.sh togetherDeepseek HEAD CodeActAgent 500 20 "" eval ipython_only
```

All methods will generate:
- A summary of the results in JSON format
- Plots of the overall accuracy and accuracy by problem ID
- A detailed CSV file with the results for each problem

## Benchmark Details

The AIME2024 benchmark evaluates the agent's ability to:
1. Understand complex mathematical problems
2. Apply mathematical reasoning and problem-solving skills
3. Use tools (like Python libraries) to verify calculations and reasoning
4. Arrive at the correct numerical answer

AIME problems typically have integer answers, and the agent is evaluated based on whether it produces the exact correct answer.
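
The grading described above boils down to normalizing both the predicted and the reference answer to an integer and comparing them exactly. A minimal sketch of such a check is shown below; the benchmark's actual extraction and normalization code is more elaborate (see the commits on answer normalization) and may differ.

```python
# Illustrative sketch of an exact-match check for AIME-style integer answers;
# the benchmark's real normalization logic is more involved and may differ.
import re


def normalize_aime_answer(raw: str) -> str | None:
    """Pull an integer answer out of model output, preferring \\boxed{...}."""
    boxed = re.search(r'\\boxed\{([^}]*)\}', raw)
    candidate = boxed.group(1) if boxed else raw
    numbers = re.findall(r'-?\d+', candidate.replace(',', ''))
    return str(int(numbers[-1])) if numbers else None


def is_correct(predicted: str, reference: str) -> bool:
    norm_pred = normalize_aime_answer(predicted)
    return norm_pred is not None and norm_pred == normalize_aime_answer(reference)
```

For example, `is_correct("The answer is \\boxed{116}.", "116")` returns `True`.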

## Example Problem

Here's an example problem from the dataset:

> Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$

The correct answer is 116.
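
As a quick illustration of the tool-assisted verification the benchmark encourages, the example problem can be checked with a few lines of sympy. This snippet is editorial, not part of the PR:

```python
# Verify the example problem: P and Q are quadratics with leading
# coefficients 2 and -2 whose graphs pass through (16, 54) and (20, 53).
import sympy as sp

x, b, c, d, e = sp.symbols('x b c d e')
P = 2 * x**2 + b * x + c
Q = -2 * x**2 + d * x + e

sol = sp.solve(
    [P.subs(x, 16) - 54, P.subs(x, 20) - 53,
     Q.subs(x, 16) - 54, Q.subs(x, 20) - 53],
    [b, c, d, e],
)

print(sol[c] + sol[e])  # P(0) + Q(0) = c + e -> 116
```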