Conversation


@simonrosenberg simonrosenberg commented Jan 27, 2026

Summary

This PR creates a new constants.py module in benchmarks/swebench that serves as the single source of truth for all constant values used in the SWE-Bench evaluation workflow.

Fixes #348

Changes

New File: benchmarks/swebench/constants.py

Created a centralized constants module containing:

  • Dataset Configuration: DEFAULT_DATASET
  • Docker Image Configuration: DOCKER_IMAGE_PREFIX, DOCKER_IMAGE_TAG, WRAPPED_REPOS
  • Build Configuration: BUILD_TARGET_SOURCE_MINIMAL, BUILD_TARGET_BINARY, DEFAULT_BUILD_TARGET
  • Runtime Configuration: DEFAULT_RUNTIME_API_URL, DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT
  • Evaluation Configuration: DEFAULT_EVAL_WORKERS
  • Model Configuration: DEFAULT_MODEL_NAME, DEFAULT_CLI_MODEL_NAME
  • Git Configuration: GIT_USER_EMAIL, GIT_USER_NAME, GIT_COMMIT_MESSAGE
  • Patch Processing: SETUP_FILES_TO_REMOVE
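Pieced together from the groups listed above and the values quoted later in this PR, the new module plausibly looks like the following sketch. The `WRAPPED_REPOS` contents and the runtime API URL are placeholders, since their actual values are not spelled out in this conversation:

```python
# Sketch of benchmarks/swebench/constants.py; values are taken from this
# PR's description, except where marked as placeholders.
from typing import Final

# Dataset configuration
DEFAULT_DATASET: Final[str] = "princeton-nlp/SWE-bench_Verified"

# Docker image configuration
DOCKER_IMAGE_PREFIX: Final[str] = "docker.io/swebench/"
DOCKER_IMAGE_TAG: Final[str] = "latest"
WRAPPED_REPOS: Final[frozenset[str]] = frozenset()  # actual repo names not listed in the PR

# Runtime configuration
DEFAULT_RUNTIME_API_URL: Final[str] = "https://runtime.example.invalid"  # placeholder
DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT: Final[int] = 600

# Evaluation configuration
DEFAULT_EVAL_WORKERS: Final[int] = 12

# Model configuration
DEFAULT_MODEL_NAME: Final[str] = "OpenHands"      # original function default
DEFAULT_CLI_MODEL_NAME: Final[str] = "openhands"  # original CLI argparse default
```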

Updated Files

  • run_infer.py: Updated to import and use constants from constants.py
  • eval_infer.py: Updated to import and use constants from constants.py
  • build_images.py: Updated to import and use constants from constants.py

Behavior Preservation

All original default values are preserved exactly:

  • DEFAULT_MODEL_NAME = "OpenHands" (matches original function default)
  • DEFAULT_CLI_MODEL_NAME = "openhands" (matches original CLI argparse default)
  • DEFAULT_DATASET = "princeton-nlp/SWE-bench_Verified"
  • DOCKER_IMAGE_PREFIX = "docker.io/swebench/"
  • DOCKER_IMAGE_TAG = "latest"
  • DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT = 600 (int type for proper semantics)
  • DEFAULT_EVAL_WORKERS = 12 (int type, converted to str at argparse usage)
  • All git configuration and patch processing values unchanged
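The int-vs-string detail for `DEFAULT_EVAL_WORKERS` can be illustrated with standard argparse behavior: once the argument declares `type=int` (as this PR adds for `--workers`), an int default is accepted directly and CLI strings are converted automatically.

```python
import argparse

DEFAULT_EVAL_WORKERS = 12  # int in constants.py

parser = argparse.ArgumentParser()
# With type=int, argparse converts CLI strings to int and the int default
# passes through unchanged, so no str() conversion is needed at the call site.
parser.add_argument("--workers", type=int, default=DEFAULT_EVAL_WORKERS)

default_args = parser.parse_args([])               # no --workers given
explicit_args = parser.parse_args(["--workers", "4"])  # "4" converted to int
```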

Code Quality Improvements

  • Type Safety: All constants use typing.Final annotations
  • Immutability: WRAPPED_REPOS uses frozenset, SETUP_FILES_TO_REMOVE uses tuple
  • Proper Types: DEFAULT_EVAL_WORKERS is now int (with type=int in argparse)
  • Clean Headers: Simplified section comments for better readability

Benefits

  • Single Source of Truth: All hyperparameters are now defined in one place, making it easy to review and modify them
  • Improved Maintainability: Changes to constant values only need to be made in one location
  • Better Discoverability: Developers can easily find all configurable parameters in one file
  • Type Safety: Final annotations prevent accidental reassignment
  • Immutable Collections: Using frozenset and tuple prevents accidental mutation

…tants.py

This commit creates a new constants.py module in benchmarks/swebench that
serves as the single source of truth for all constant values used in the
SWE-Bench evaluation workflow.

Changes:
- Create benchmarks/swebench/constants.py with all constant values:
  - Dataset configuration (DEFAULT_DATASET, DEFAULT_SPLIT)
  - Docker image configuration (DOCKER_IMAGE_PREFIX, DOCKER_IMAGE_TAG)
  - Build configuration (BUILD_TARGET_*, DEFAULT_BUILD_TARGET)
  - Runtime configuration (DEFAULT_RUNTIME_API_URL, DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT)
  - Evaluation configuration (DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_WORKERS, etc.)
  - Model configuration (DEFAULT_MODEL_NAME)
  - Git configuration (GIT_USER_EMAIL, GIT_USER_NAME, GIT_COMMIT_MESSAGE)
  - Patch processing (SETUP_FILES_TO_REMOVE)

- Update run_infer.py, eval_infer.py, and build_images.py to import and use
  constants from the constants module

Fixes #348

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg force-pushed the openhands/swebench-constants-348 branch from b0ab705 to 545be81 on January 27, 2026 19:01
@simonrosenberg simonrosenberg changed the title Regroup all swebench hyperparameters in a single source of truth constants.py refactor(swebench): regroup all hyperparameters in constants.py Jan 28, 2026
- Fix DEFAULT_MODEL_NAME case: 'openhands' -> 'OpenHands' to match original function default
- Move WRAPPED_REPOS from build_images.py to constants.py
- Remove unused constants: DEFAULT_SPLIT, DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_WORKERS,
  DEFAULT_MAX_ATTEMPTS, DEFAULT_MAX_RETRIES, DEFAULT_N_LIMIT, DEFAULT_NOTE, DEFAULT_OUTPUT_DIR
- Fix type: DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT is now int (600) instead of string

Co-authored-by: openhands <openhands@all-hands.dev>
@OpenHands OpenHands deleted a comment from openhands-ai bot Jan 28, 2026
@OpenHands OpenHands deleted a comment from openhands-ai bot Jan 28, 2026
@simonrosenberg

@OpenHands
/codereview-roasted
Make sure this PR doesn't alter any constant or behavior
Suggest next step refactoring for cleaner code

openhands-ai bot commented Jan 28, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

openhands-ai bot commented Jan 28, 2026

Code Review Summary for PR #373

I performed a critical code review of the PR "refactor(swebench): regroup all hyperparameters in constants.py" as requested with the /codereview-roasted command.

🔴 Verdict: Needs Rework

Critical Finding: Behavior-Altering Change Detected

The PR claims "Behavior Preservation" but actually changes CLI behavior:

| Location | Original Value | PR Value |
| --- | --- | --- |
| eval_infer.py argparse --model-name default | "openhands" (lowercase) | "OpenHands" (mixed case) |

This silently changes behavior for users running swebench-eval output.jsonl without specifying --model-name.

Additional Issues Identified:

  1. Mutable constants - WRAPPED_REPOS (set) and SETUP_FILES_TO_REMOVE (list) should be immutable (frozenset and tuple)
  2. Type inconsistency - DEFAULT_EVAL_WORKERS = "12" is a string but semantically an integer
  3. Verbose formatting - Section headers with ===== are excessive for 54 lines

Suggested Next Steps for Cleaner Code:

  1. Fix the breaking change - Either add DEFAULT_CLI_MODEL_NAME = "openhands" or change DEFAULT_MODEL_NAME to lowercase
  2. Use immutable types - frozenset for sets, tuple for lists
  3. Consider typing.Final for explicit constant declaration
  4. Group related constants - Consider NamedTuple or @dataclass(frozen=True) for related groups like git config

No code changes were made - this was a review-only request. The feedback has been provided above for the PR author to address.


…tion

- Fix breaking change: add DEFAULT_CLI_MODEL_NAME to preserve original CLI default ('openhands')
- Use typing.Final for all constants to indicate immutability
- Use frozenset for WRAPPED_REPOS (immutable)
- Use tuple for SETUP_FILES_TO_REMOVE (immutable)
- Change DEFAULT_EVAL_WORKERS to int type with proper conversion at usage sites
- Simplify section headers for cleaner code
- Add type=int to --workers argparse argument for proper type handling

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 28, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #373 at branch `openhands/swebench-constants-348`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg

@OpenHands fix the precommit errors

openhands-ai bot commented Jan 28, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

The pyright type checker was failing because DEFAULT_BUILD_TARGET was
typed as Final[str] but build_image() expects a TargetType which is
Literal['binary', 'binary-minimal', 'source', 'source-minimal'].

This fix adds a local TargetType alias and properly types the build
target constants to match the expected type signature.

openhands-ai bot commented Jan 28, 2026

I've fixed the pre-commit errors in PR #373. The issue was a pyright type checking failure.

Problem: The DEFAULT_BUILD_TARGET constant in constants.py was typed as Final[str], but the build_image() function in build_utils.py expects a TargetType parameter which is Literal['binary', 'binary-minimal', 'source', 'source-minimal']. This caused pyright to fail with:

Argument of type "str" cannot be assigned to parameter "target" of type "TargetType"

Fix: Updated constants.py to:

  1. Import Literal from typing
  2. Define a local TargetType alias matching the expected literal type
  3. Type the build target constants (BUILD_TARGET_SOURCE_MINIMAL, BUILD_TARGET_BINARY, DEFAULT_BUILD_TARGET) as Final[TargetType] instead of Final[str]
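The fix described in the three steps above can be sketched as follows. The four literal values come from the pyright error quoted earlier; which target serves as the default is an assumption here:

```python
from typing import Final, Literal

# Local alias matching the TargetType expected by build_utils.build_image()
TargetType = Literal["binary", "binary-minimal", "source", "source-minimal"]

# Typing these as Final[TargetType] (not Final[str]) lets pyright verify
# they are valid arguments for build_image()'s target parameter.
BUILD_TARGET_SOURCE_MINIMAL: Final[TargetType] = "source-minimal"
BUILD_TARGET_BINARY: Final[TargetType] = "binary"
DEFAULT_BUILD_TARGET: Final[TargetType] = BUILD_TARGET_SOURCE_MINIMAL  # assumed default
```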

Changes pushed: The fix has been committed and pushed to the openhands/swebench-constants-348 branch. All pre-commit checks now pass:

  • ✅ Ruff format
  • ✅ Ruff lint
  • ✅ PEP8 style check (pycodestyle)
  • ✅ Type check with Pyright (strict)


@simonrosenberg

@OpenHands
/codereview-roasted

The goal of this PR is to centralize all hyperparameters and constant values that are necessary to run a benchmark in a single, clear constants.py file.

Problem: get_parser is used across benchmarks. All its arguments are shared across benchmarks except dataset, which is benchmark-specific.
Question: how to structure the code so that

  1. the value "princeton-nlp/SWE-bench_Verified" is only written in a single swebench/constants.py (and same for other benchmark dataset values)
  2. Each benchmark should have its own default dataset value so that when running it from CLI there is no need to input the correct dataset value

Perhaps we could not have any defaults on the utils/args_parser.py but then in each specific benchmark run-infer, we input the correct dataset?

openhands-ai bot commented Jan 28, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

openhands-ai bot commented Jan 28, 2026

Code Review for PR #373: "refactor(swebench): regroup all hyperparameters in constants.py"

🟡 Taste Rating: Acceptable - Works but has an architectural gap


[CRITICAL ISSUES] (Must address to achieve stated goal)

[benchmarks/utils/args_parser.py, Line 26-27] Data Duplication: The PR claims to centralize constants, but "princeton-nlp/SWE-bench_Verified" is still hardcoded in the shared args_parser.py:

default="princeton-nlp/SWE-bench_Verified",

This violates the stated goal of having dataset values "only written in a single constants.py file."


[IMPROVEMENT OPPORTUNITIES] (Recommended architecture)

Your intuition is correct. Here's the recommended approach:

1. Remove the default from benchmarks/utils/args_parser.py:

parser.add_argument(
    "--dataset",
    type=str,
    default=None,  # No default - each benchmark sets its own
    help="Dataset name (required unless benchmark provides default)",
)

2. Each benchmark sets its own default in run_infer.py:

# In benchmarks/swebench/run_infer.py
from benchmarks.swebench import constants
parser = get_parser()
parser.set_defaults(dataset=constants.DEFAULT_DATASET)
# In benchmarks/swebenchmultimodal/run_infer.py
from benchmarks.swebenchmultimodal import constants  # create this
parser = get_parser()
parser.set_defaults(dataset=constants.DEFAULT_DATASET)

3. Create constants.py for each benchmark that needs one:

  • benchmarks/swebench/constants.py ✅ (already done)
  • benchmarks/swebenchmultimodal/constants.py: DEFAULT_DATASET = "princeton-nlp/SWE-bench_Multimodal"
  • benchmarks/commit0/constants.py: DEFAULT_DATASET = "wentingzhao/commit0_combined"
  • etc.

This pattern is already partially used in swebenchmultimodal/run_infer.py and commit0/run_infer.py with parser.set_defaults(), but they hardcode the value instead of importing from constants.
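The reason this pattern works is standard argparse behavior, independent of this repo: parser.set_defaults() overrides the default of an already-declared argument, while an explicit CLI value still wins. A minimal self-contained demonstration:

```python
import argparse

# Shared parser (analogue of benchmarks/utils/args_parser.py): no dataset default
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, default=None)

# Benchmark-specific override, as each benchmark's run_infer.py would do
parser.set_defaults(dataset="princeton-nlp/SWE-bench_Verified")

default_args = parser.parse_args([])  # no --dataset on the command line
explicit_args = parser.parse_args(["--dataset", "other/dataset"])
```

Here default_args.dataset picks up the benchmark-specific default, and explicit_args.dataset keeps the user-supplied value.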


[STYLE NOTES]

[benchmarks/swebench/constants.py] Good use of Final type hints and frozenset/tuple for immutability. The TargetType literal definition is clean.


VERDICT:

Worth merging as-is for swebench-specific improvements, but...

Incomplete for stated goal - The PR doesn't fully achieve "dataset value only written once" because args_parser.py still has the hardcoded default.


KEY INSIGHT:

The current PR is a good first step for swebench, but to truly centralize constants across all benchmarks, you need to:

  1. Remove the default from the shared parser
  2. Have each benchmark's run_infer.py call parser.set_defaults(dataset=<benchmark>.constants.DEFAULT_DATASET)

This matches your suggested approach: "Perhaps we could not have any defaults on the utils/args_parser.py but then in each specific benchmark run-infer, we input the correct dataset?"

Recommended Next Steps:

  1. Merge this PR as-is (swebench constants are properly centralized)
  2. Create a follow-up PR that:
    • Removes the default from args_parser.py
    • Updates each benchmark's run_infer.py to set its own default
    • Creates constants.py files for other benchmarks as needed


- Remove hardcoded dataset default from args_parser.py (now None)
- Each benchmark sets its own dataset default via parser.set_defaults()
- Add shared constants to utils/constants.py:
  - DEFAULT_WORKSPACE='remote' (behavior change from 'docker')
  - DEFAULT_SPLIT, DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_EVAL_WORKERS
  - DEFAULT_OUTPUT_DIR, DEFAULT_MAX_ATTEMPTS, DEFAULT_MAX_RETRIES
  - DEFAULT_NOTE, DEFAULT_N_LIMIT, DEFAULT_CRITIC
- Update args_parser.py to use constants for all defaults
- Update critics.py to use DEFAULT_CRITIC constant
- Update swebench/run_infer.py to set its own dataset default

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg merged commit 2e8638d into main Jan 28, 2026
2 checks passed
@simonrosenberg simonrosenberg deleted the openhands/swebench-constants-348 branch January 28, 2026 17:15
Successfully merging this pull request may close these issues.

Regroup all swebench hyper parameters in a single source of truth benchmarks/swebench/constants.py
