@simonrosenberg (Collaborator)

Summary

This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository's (OpenHands/evaluation) eval-job/values.yaml configuration.

Changes

Shared defaults updated in args_parser.py:

| Argument | Old Default | New Default | Reason |
| --- | --- | --- | --- |
| `--workspace` | `docker` | `remote` | Production uses remote workspaces |
| `--max-iterations` | 100 | 500 | Sufficient for complex tasks |
| `--critic` | `pass` | `finish_with_patch` | Ensures agent produces valid patches |
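
For reference, a minimal sketch of how these shared defaults could look in args_parser.py (the `get_shared_parser()` helper and exact argument wiring are illustrative assumptions, not the repository's actual code):

```python
# Minimal sketch, assuming a get_shared_parser() helper in args_parser.py;
# the exact argument wiring in the benchmarks repo may differ.
import argparse


def get_shared_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared benchmark arguments")
    # Production (eval-job/values.yaml) runs against remote workspaces.
    parser.add_argument("--workspace", default="remote", choices=["docker", "remote"])
    # 500 iterations gives the agent enough budget for complex tasks.
    parser.add_argument("--max-iterations", type=int, default=500)
    # finish_with_patch ensures the agent produces a valid patch before finishing.
    parser.add_argument("--critic", default="finish_with_patch")
    # No global dataset default; each benchmark sets its own via set_defaults().
    parser.add_argument("--dataset")
    return parser
```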

Benchmark-specific overrides using parser.set_defaults():

| Benchmark | Override Values |
| --- | --- |
| gaia | `dataset="gaia-benchmark/GAIA"` |
| swtbench | `dataset="eth-sri/SWT-bench_Verified_bm25_27k_zsp"` |
| commit0 | `max_attempts=1`, `max_retries=1` (in addition to existing dataset) |
| swebench | Uses global default (`princeton-nlp/SWE-bench_Verified`) |
| swebenchmultimodal | Already correct (`dataset`, `split="dev"`) |
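
The override pattern, as it might look in a benchmark's run_infer.py (the import path and helper name carry over from the illustrative sketch above and are assumptions, not the repository's exact API):

```python
# Hypothetical excerpt from gaia's run_infer.py using the set_defaults() pattern.
from args_parser import get_shared_parser  # assumed helper, see sketch above

parser = get_shared_parser()
# Benchmark-specific defaults override the shared ones without redefining arguments.
parser.set_defaults(dataset="gaia-benchmark/GAIA")
# commit0 would additionally cap retries on top of its existing dataset default:
# parser.set_defaults(max_attempts=1, max_retries=1)
args = parser.parse_args()
```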

Documentation

  • Updated AGENTS.md to document the default values alignment pattern

Benefits

  • Consistency: Running benchmarks locally now uses the same defaults as production
  • Maintainability: Clear pattern for benchmark-specific overrides via set_defaults()
  • Documentation: AGENTS.md explains the pattern for future contributors

Testing

  • All modified files pass py_compile syntax validation
  • No functional changes to evaluation logic, only default values

Commits

Update args_parser.py and benchmark-specific run_infer.py files to use
default values that match the evaluation repository (OpenHands/evaluation)
eval-job/values.yaml configuration.

Shared defaults updated in args_parser.py:
- workspace: 'docker' -> 'remote'
- max-iterations: 100 -> 500
- critic: 'pass' -> 'finish_with_patch'

Benchmark-specific overrides using parser.set_defaults():
- gaia: dataset='gaia-benchmark/GAIA'
- swtbench: dataset='eth-sri/SWT-bench_Verified_bm25_27k_zsp'
- commit0: max_attempts=1, max_retries=1 (in addition to existing dataset)

Also updated AGENTS.md to document the default values alignment pattern.

Co-authored-by: openhands <openhands@all-hands.dev>

…hmultimodal

- swebench: Add explicit set_defaults(dataset, split) for consistency with
  other benchmarks, even though values match global defaults
- swebenchmultimodal: Update comment to match the pattern used in other benchmarks

Co-authored-by: openhands <openhands@all-hands.dev>

Each benchmark now sets its own dataset default via set_defaults(),
so no global default is needed.

Co-authored-by: openhands <openhands@all-hands.dev>

All benchmarks in the evaluation repository use .llm_config/runtime.json
as the LLM config path, so use this as the default.

Co-authored-by: openhands <openhands@all-hands.dev>
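
As a rough sketch of how that .llm_config/runtime.json default could be expressed in args_parser.py (the `--llm-config-path` flag name is an assumption, not confirmed by this PR):

```python
import argparse

parser = argparse.ArgumentParser()
# Flag name is assumed for illustration; the default mirrors the path every
# benchmark in the evaluation repository reads its LLM config from.
parser.add_argument(
    "--llm-config-path",
    default=".llm_config/runtime.json",
    help="Path to the LLM config used by all benchmarks",
)
```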