@simonrosenberg (Collaborator)

Summary

This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository's (OpenHands/evaluation) eval-job/values.yaml configuration.

Changes

Shared defaults updated in args_parser.py:

| Argument | Old Default | New Default | Reason |
| --- | --- | --- | --- |
| `--workspace` | `docker` | `remote` | Production uses remote workspaces |
| `--max-iterations` | 100 | 500 | Sufficient for complex tasks |
| `--critic` | `pass` | `finish_with_patch` | Ensures agent produces valid patches |
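
For reference, a minimal sketch of how these shared defaults could look in args_parser.py (the `get_shared_parser()` helper and exact argument wiring are illustrative assumptions, not the repository's actual code):

```python
# Minimal sketch, assuming a get_shared_parser() helper in args_parser.py;
# the exact argument wiring in the benchmarks repo may differ.
import argparse


def get_shared_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared benchmark arguments")
    # Production (eval-job/values.yaml) runs against remote workspaces.
    parser.add_argument("--workspace", default="remote", choices=["docker", "remote"])
    # 500 iterations gives the agent enough budget for complex tasks.
    parser.add_argument("--max-iterations", type=int, default=500)
    # finish_with_patch ensures the agent produces a valid patch before finishing.
    parser.add_argument("--critic", default="finish_with_patch")
    # No global dataset default; each benchmark sets its own via set_defaults().
    parser.add_argument("--dataset")
    return parser
```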

Benchmark-specific overrides using parser.set_defaults():

| Benchmark | Override Values |
| --- | --- |
| gaia | `dataset="gaia-benchmark/GAIA"` |
| swtbench | `dataset="eth-sri/SWT-bench_Verified_bm25_27k_zsp"` |
| commit0 | `max_attempts=1`, `max_retries=1` (in addition to existing dataset) |
| swebench | Uses global default (`princeton-nlp/SWE-bench_Verified`) |
| swebenchmultimodal | Already correct (`dataset`, `split="dev"`) |
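
The override pattern, as it might look in a benchmark's run_infer.py (the import path and helper name carry over from the illustrative sketch above and are assumptions, not the repository's exact API):

```python
# Hypothetical excerpt from gaia's run_infer.py using the set_defaults() pattern.
from args_parser import get_shared_parser  # assumed helper, see sketch above

parser = get_shared_parser()
# Benchmark-specific defaults override the shared ones without redefining arguments.
parser.set_defaults(dataset="gaia-benchmark/GAIA")
# commit0 would additionally cap retries on top of its existing dataset default:
# parser.set_defaults(max_attempts=1, max_retries=1)
args = parser.parse_args()
```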

Documentation

  • Updated AGENTS.md to document the default values alignment pattern

Benefits

  • Consistency: Running benchmarks locally now uses the same defaults as production
  • Maintainability: Clear pattern for benchmark-specific overrides via set_defaults()
  • Documentation: AGENTS.md explains the pattern for future contributors

Testing

  • All modified files pass py_compile syntax validation
  • No functional changes to evaluation logic, only default values

Commits

Update args_parser.py and benchmark-specific run_infer.py files to use
default values that match the evaluation repository (OpenHands/evaluation)
eval-job/values.yaml configuration.

Shared defaults updated in args_parser.py:
- workspace: 'docker' -> 'remote'
- max-iterations: 100 -> 500
- critic: 'pass' -> 'finish_with_patch'

Benchmark-specific overrides using parser.set_defaults():
- gaia: dataset='gaia-benchmark/GAIA'
- swtbench: dataset='eth-sri/SWT-bench_Verified_bm25_27k_zsp'
- commit0: max_attempts=1, max_retries=1 (in addition to existing dataset)

Also updated AGENTS.md to document the default values alignment pattern.

Co-authored-by: openhands <openhands@all-hands.dev>

…hmultimodal

- swebench: Add explicit set_defaults(dataset, split) for consistency with
  other benchmarks, even though values match global defaults
- swebenchmultimodal: Update comment to match the pattern used in other benchmarks

Co-authored-by: openhands <openhands@all-hands.dev>

Each benchmark now sets its own dataset default via set_defaults(),
so no global default is needed.

Co-authored-by: openhands <openhands@all-hands.dev>

All benchmarks in the evaluation repository use .llm_config/runtime.json
as the LLM config path, so use this as the default.

Co-authored-by: openhands <openhands@all-hands.dev>
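
As a rough sketch of how that .llm_config/runtime.json default could be expressed in args_parser.py (the `--llm-config-path` flag name is an assumption, not confirmed by this PR):

```python
import argparse

parser = argparse.ArgumentParser()
# Flag name is assumed for illustration; the default mirrors the path every
# benchmark in the evaluation repository reads its LLM config from.
parser.add_argument(
    "--llm-config-path",
    default=".llm_config/runtime.json",
    help="Path to the LLM config used by all benchmarks",
)
```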