@simonrosenberg
Collaborator

Summary

This PR creates a single source of truth for all Multi-SWE-Bench constant values and hyperparameters by introducing benchmarks/multiswebench/constants.py.

Fixes #366

Changes

1. Created benchmarks/multiswebench/constants.py

A new module containing all constant values organized into logical categories:

  • Dataset Configuration: DEFAULT_DATASET, DEFAULT_SPLIT, DEFAULT_LANGUAGE, DEFAULT_MODEL_NAME, DEFAULT_VERSION
  • Docker/Image Configuration: DEFAULT_DOCKER_IMAGE_PREFIX, DEFAULT_BUILD_TARGET, and environment variable names
  • Runtime Configuration: DEFAULT_RUNTIME_API_URL, DEFAULT_STARTUP_TIMEOUT, boolean defaults, and environment variable names
  • Evaluation Configuration: DEFAULT_EVAL_MODE, DEFAULT_MAX_WORKERS, DEFAULT_LOG_LEVEL, FIX_PATCH_RUN_CMD, etc.
  • Path Configuration: DATASET_CACHE_DIR, DEFAULT_WORKING_DIR
  • Workspace Configuration: DEFAULT_ENV_SETUP_COMMANDS
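For reviewers who want a picture of the layout, here is a minimal sketch of how such a module might be organized. The constant names come from the list above; every value is a placeholder chosen for illustration and is not the value used in the PR.

```python
"""Hypothetical sketch of a constants module. All values are placeholders."""
from typing import Final

# --- Dataset configuration (placeholder values) ---
DEFAULT_DATASET: Final[str] = "example-org/example-dataset"
DEFAULT_SPLIT: Final[str] = "train"
DEFAULT_LANGUAGE: Final[str] = "java"

# --- Docker/image configuration (placeholder values) ---
DEFAULT_DOCKER_IMAGE_PREFIX: Final[str] = "registry.example.invalid/images"
DEFAULT_BUILD_TARGET: Final[str] = "example-target"

# --- Runtime configuration (placeholder values) ---
DEFAULT_RUNTIME_API_URL: Final[str] = "https://runtime.example.invalid"
DEFAULT_STARTUP_TIMEOUT: Final[float] = 600.0  # seconds (assumed unit)

# --- Evaluation configuration (placeholder values) ---
DEFAULT_MAX_WORKERS: Final[int] = 4
DEFAULT_LOG_LEVEL: Final[str] = "INFO"

# --- Path configuration (placeholder value) ---
DEFAULT_WORKING_DIR: Final[str] = "/workspace"

# --- Workspace configuration (placeholder commands) ---
DEFAULT_ENV_SETUP_COMMANDS: Final[list[str]] = [
    "export EXAMPLE_FLAG=1",
]
```

Annotating with Final lets a type checker flag accidental reassignment, which fits the single-source-of-truth goal; it has no effect at runtime.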

2. Updated all multiswebench modules to import from constants.py

  • build_images.py
  • download_dataset.py
  • eval_infer.py
  • run_infer.py
  • scripts/data/data_change.py
  • scripts/eval/update_multi_swe_bench_config.py
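As an illustration of how a consuming module might pair a default with its environment variable name, here is a rough sketch. The environment variable name, the placeholder URL, and the helper resolve_runtime_api_url are assumptions made for this example; in the real code the two constants would be imported from constants.py rather than defined inline.

```python
import os
from typing import Mapping, Optional

# Stand-ins for names that would be imported from constants.py.
# Both values are placeholders, and the env var name is hypothetical.
DEFAULT_RUNTIME_API_URL = "https://runtime.example.invalid"
RUNTIME_API_URL_ENV_VAR = "EXAMPLE_RUNTIME_API_URL"


def resolve_runtime_api_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the env override if it is set, otherwise the shared default."""
    # Accepting a mapping keeps the helper testable without touching os.environ.
    env = os.environ if env is None else env
    return env.get(RUNTIME_API_URL_ENV_VAR, DEFAULT_RUNTIME_API_URL)
```

With this shape, changing the default or renaming the variable touches only the constants module, and callers keep working unchanged.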

3. Added comprehensive tests

Created tests/test_multiswebench_constants.py with 28 tests covering:

  • All constant values and their expected types
  • Verification that all modules import from constants.py
  • Confirmation that all constants can be imported from the module
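A minimal sketch of what a type-and-presence check of this kind could look like follows. It is not the test file from this PR: the stand-in module, the EXPECTED_TYPES table, and the check_constant_types helper are hypothetical, and the values are placeholders.

```python
import types

# Stand-in for the real constants module (placeholder values).
constants = types.ModuleType("constants")
constants.DEFAULT_SPLIT = "train"
constants.DEFAULT_MAX_WORKERS = 4

# Hypothetical table mapping each constant name to its expected type.
EXPECTED_TYPES = {
    "DEFAULT_SPLIT": str,
    "DEFAULT_MAX_WORKERS": int,
}


def check_constant_types(module, expected):
    """Return a list of problems: missing names or values of the wrong type."""
    problems = []
    for name, expected_type in expected.items():
        if not hasattr(module, name):
            problems.append(f"{name}: missing")
        elif not isinstance(getattr(module, name), expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


def test_all_constants_have_expected_types():
    # pytest-style test: an empty problem list means every constant checks out.
    assert check_constant_types(constants, EXPECTED_TYPES) == []
```

Collecting all problems before asserting means a single failing run reports every missing or mistyped constant at once.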

Testing

All 28 new tests pass:

tests/test_multiswebench_constants.py::TestDatasetConstants::test_default_dataset PASSED
tests/test_multiswebench_constants.py::TestDatasetConstants::test_default_split PASSED
tests/test_multiswebench_constants.py::TestDatasetConstants::test_default_language PASSED
...
tests/test_multiswebench_constants.py::TestAllConstantsExported::test_all_constants_importable PASSED
======================== 28 passed ========================

Existing tests continue to pass:

tests/test_metrics.py::test_benchmark_metrics_collection[multiswebench-MultiSWEBenchEvaluation] PASSED

Benefits

  • Single source of truth: All hyperparameters are now defined in one place
  • Easy to audit: Reviewers can quickly verify all constant values are correctly set
  • Maintainability: Changes to default values only need to be made in one location
  • Documentation: Constants are organized with clear comments explaining their purpose


This commit creates a single source of truth for all Multi-SWE-Bench
constant values and hyperparameters by:

1. Creating benchmarks/multiswebench/constants.py with all constants:
   - Dataset configuration (DEFAULT_DATASET, DEFAULT_SPLIT, DEFAULT_LANGUAGE, etc.)
   - Docker/Image configuration (DEFAULT_DOCKER_IMAGE_PREFIX, DEFAULT_BUILD_TARGET, etc.)
   - Runtime configuration (DEFAULT_RUNTIME_API_URL, DEFAULT_STARTUP_TIMEOUT, etc.)
   - Evaluation configuration (DEFAULT_EVAL_MODE, DEFAULT_MAX_WORKERS, etc.)
   - Path configuration (DATASET_CACHE_DIR, DEFAULT_WORKING_DIR, etc.)
   - Environment variable names for all configurable values

2. Updating all multiswebench modules to import from constants.py:
   - build_images.py
   - download_dataset.py
   - eval_infer.py
   - run_infer.py
   - scripts/data/data_change.py
   - scripts/eval/update_multi_swe_bench_config.py

3. Adding comprehensive tests in tests/test_multiswebench_constants.py

Fixes #366

Co-authored-by: openhands <openhands@all-hands.dev>
…taset.py

- Removed tests/test_multiswebench_constants.py
- Reverted benchmarks/multiswebench/scripts/data/data_change.py to original
- Reverted benchmarks/multiswebench/download_dataset.py to original
- Removed DEFAULT_VERSION and DATASET_CACHE_DIR from constants.py

Co-authored-by: openhands <openhands@all-hands.dev>
Replace individual constants (DEFAULT_EVAL_MODE, DEFAULT_FORCE_BUILD, etc.)
with a single DEFAULT_EVAL_HARNESS_CONFIG dictionary that serves as a
template for the Multi-SWE-Bench evaluation harness configuration.

This is cleaner because:
- All config values are used together in one place
- The complete config structure is visible at a glance
- Single import instead of 10 individual imports
- Easier to maintain

Co-authored-by: openhands <openhands@all-hands.dev>
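
To make the template idea concrete, here is a small sketch of a dictionary used as a config template. The keys, the values, and the build_eval_config helper are assumptions for illustration; only the name DEFAULT_EVAL_HARNESS_CONFIG comes from the commit message above.

```python
import copy
from typing import Any

# Template dictionary. Keys and values are placeholders for illustration.
DEFAULT_EVAL_HARNESS_CONFIG: dict[str, Any] = {
    "mode": "evaluation",
    "force_build": False,
    "max_workers": 4,
    "log_level": "INFO",
}


def build_eval_config(**overrides: Any) -> dict[str, Any]:
    """Copy the template and apply per-run overrides (hypothetical helper)."""
    # Deep copy so callers can never mutate the shared template.
    config = copy.deepcopy(DEFAULT_EVAL_HARNESS_CONFIG)
    unknown = set(overrides) - set(config)
    if unknown:
        # Rejecting unknown keys catches typos early.
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    config.update(overrides)
    return config
```

Copying before updating matters here: a module-level dictionary is shared state, so editing it in place would silently change the defaults for every later caller.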
Development

Successfully merging this pull request may close these issues.

Regroup all multiswebench hyper parameters in a single source of truth benchmarks/multiswebench/constants.py

3 participants