-
Notifications
You must be signed in to change notification settings - Fork 31
refactor(swebench): regroup all hyperparameters in constants.py #373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5eecddf to
b0ab705
Compare
…tants.py This commit creates a new constants.py module in benchmarks/swebench that serves as the single source of truth for all constant values used in the SWE-Bench evaluation workflow. Changes: - Create benchmarks/swebench/constants.py with all constant values: - Dataset configuration (DEFAULT_DATASET, DEFAULT_SPLIT) - Docker image configuration (DOCKER_IMAGE_PREFIX, DOCKER_IMAGE_TAG) - Build configuration (BUILD_TARGET_*, DEFAULT_BUILD_TARGET) - Runtime configuration (DEFAULT_RUNTIME_API_URL, DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT) - Evaluation configuration (DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_WORKERS, etc.) - Model configuration (DEFAULT_MODEL_NAME) - Git configuration (GIT_USER_EMAIL, GIT_USER_NAME, GIT_COMMIT_MESSAGE) - Patch processing (SETUP_FILES_TO_REMOVE) - Update run_infer.py, eval_infer.py, and build_images.py to import and use constants from the constants module Fixes #348 Co-authored-by: openhands <openhands@all-hands.dev>
b0ab705 to
545be81
Compare
- Fix DEFAULT_MODEL_NAME case: 'openhands' -> 'OpenHands' to match original function default - Move WRAPPED_REPOS from build_images.py to constants.py - Remove unused constants: DEFAULT_SPLIT, DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_WORKERS, DEFAULT_MAX_ATTEMPTS, DEFAULT_MAX_RETRIES, DEFAULT_N_LIMIT, DEFAULT_NOTE, DEFAULT_OUTPUT_DIR - Fix type: DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT is now int (600) instead of string Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands |
|
I'm on it! simonrosenberg can track my progress at all-hands.dev |
Code Review Summary for PR #373I performed a critical code review of the PR "refactor(swebench): regroup all hyperparameters in constants.py" as requested with the 🔴 Verdict: Needs ReworkCritical Finding: Behavior-Altering Change DetectedThe PR claims "Behavior Preservation" but actually changes CLI behavior:
This silently changes behavior for users running Additional Issues Identified:
Suggested Next Steps for Cleaner Code:
No code changes were made - this was a review-only request. The feedback has been provided above for the PR author to address. |
…tion
- Fix breaking change: add DEFAULT_CLI_MODEL_NAME to preserve original CLI default ('openhands')
- Use typing.Final for all constants to indicate immutability
- Use frozenset for WRAPPED_REPOS (immutable)
- Use tuple for SETUP_FILES_TO_REMOVE (immutable)
- Change DEFAULT_EVAL_WORKERS to int type with proper conversion at usage sites
- Simplify section headers for cleaner code
- Add type=int to --workers argparse argument for proper type handling
Co-authored-by: openhands <openhands@all-hands.dev>
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
|
@OpenHands fix the precommit errors |
|
I'm on it! simonrosenberg can track my progress at all-hands.dev |
The pyright type checker was failing because DEFAULT_BUILD_TARGET was typed as Final[str] but build_image() expects a TargetType which is Literal['binary', 'binary-minimal', 'source', 'source-minimal']. This fix adds a local TargetType alias and properly types the build target constants to match the expected type signature.
|
I've fixed the pre-commit errors in PR #373. The issue was a pyright type checking failure. Problem: The Fix: Updated
Changes pushed: The fix has been committed and pushed to the
|
|
@OpenHands The goal of this PR is to centralize all hyperparameters and constant values that are necessary to run a benchmark in a clear constant.py file. Problem: get_parser is used across benchmarks. All its arguments are shared across benchmarks except dataset that is benchmark specific.
Perhaps we could not have any defaults on the utils/args_parser.py but then in each specific benchmark run-infer, we input the correct dataset? |
|
I'm on it! simonrosenberg can track my progress at all-hands.dev |
Code Review for PR #373: "refactor(swebench): regroup all hyperparameters in constants.py"🟡 Taste Rating: Acceptable - Works but has an architectural gap[CRITICAL ISSUES] (Must address to achieve stated goal)[benchmarks/utils/args_parser.py, Line 26-27] Data Duplication: The PR claims to centralize constants, but default="princeton-nlp/SWE-bench_Verified",This violates the stated goal of having dataset values "only written in a single constants.py file." [IMPROVEMENT OPPORTUNITIES] (Recommended architecture)Your intuition is correct. Here's the recommended approach: 1. Remove the default from parser.add_argument(
"--dataset",
type=str,
default=None, # No default - each benchmark sets its own
help="Dataset name (required unless benchmark provides default)",
)2. Each benchmark sets its own default in # In benchmarks/swebench/run_infer.py
from benchmarks.swebench import constants
parser = get_parser()
parser.set_defaults(dataset=constants.DEFAULT_DATASET)# In benchmarks/swebenchmultimodal/run_infer.py
from benchmarks.swebenchmultimodal import constants # create this
parser = get_parser()
parser.set_defaults(dataset=constants.DEFAULT_DATASET)3. Create
This pattern is already partially used in [STYLE NOTES][benchmarks/swebench/constants.py] Good use of VERDICT:✅ Worth merging as-is for swebench-specific improvements, but... ❌ Incomplete for stated goal - The PR doesn't fully achieve "dataset value only written once" because KEY INSIGHT:The current PR is a good first step for swebench, but to truly centralize constants across all benchmarks, you need to:
This matches your suggested approach: "Perhaps we could not have any defaults on the utils/args_parser.py but then in each specific benchmark run-infer, we input the correct dataset?" Recommended Next Steps:
|
- Remove hardcoded dataset default from args_parser.py (now None) - Each benchmark sets its own dataset default via parser.set_defaults() - Add shared constants to utils/constants.py: - DEFAULT_WORKSPACE='remote' (behavior change from 'docker') - DEFAULT_SPLIT, DEFAULT_MAX_ITERATIONS, DEFAULT_NUM_EVAL_WORKERS - DEFAULT_OUTPUT_DIR, DEFAULT_MAX_ATTEMPTS, DEFAULT_MAX_RETRIES - DEFAULT_NOTE, DEFAULT_N_LIMIT, DEFAULT_CRITIC - Update args_parser.py to use constants for all defaults - Update critics.py to use DEFAULT_CRITIC constant - Update swebench/run_infer.py to set its own dataset default Co-authored-by: openhands <openhands@all-hands.dev>
….py" This reverts commit 214c3ab.
Summary
This PR creates a new
constants.pymodule inbenchmarks/swebenchthat serves as the single source of truth for all constant values used in the SWE-Bench evaluation workflow.Fixes #348
Changes
New File:
benchmarks/swebench/constants.pyCreated a centralized constants module containing:
DEFAULT_DATASETDOCKER_IMAGE_PREFIX,DOCKER_IMAGE_TAG,WRAPPED_REPOSBUILD_TARGET_SOURCE_MINIMAL,BUILD_TARGET_BINARY,DEFAULT_BUILD_TARGETDEFAULT_RUNTIME_API_URL,DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUTDEFAULT_EVAL_WORKERSDEFAULT_MODEL_NAME,DEFAULT_CLI_MODEL_NAMEGIT_USER_EMAIL,GIT_USER_NAME,GIT_COMMIT_MESSAGESETUP_FILES_TO_REMOVEUpdated Files
run_infer.py: Updated to import and use constants fromconstants.pyeval_infer.py: Updated to import and use constants fromconstants.pybuild_images.py: Updated to import and use constants fromconstants.pyBehavior Preservation
All original default values are preserved exactly:
DEFAULT_MODEL_NAME = "OpenHands"(matches original function default)DEFAULT_CLI_MODEL_NAME = "openhands"(matches original CLI argparse default)DEFAULT_DATASET = "princeton-nlp/SWE-bench_Verified"DOCKER_IMAGE_PREFIX = "docker.io/swebench/"DOCKER_IMAGE_TAG = "latest"DEFAULT_REMOTE_RUNTIME_STARTUP_TIMEOUT = 600(int type for proper semantics)DEFAULT_EVAL_WORKERS = 12(int type, converted to str at argparse usage)Code Quality Improvements
typing.FinalannotationsWRAPPED_REPOSusesfrozenset,SETUP_FILES_TO_REMOVEusestupleDEFAULT_EVAL_WORKERSis nowint(withtype=intin argparse)Benefits
Finalannotations prevent accidental reassignmentfrozensetandtupleprevents accidental mutation