Skip to content

Conversation

@simonrosenberg
Copy link
Collaborator

@simonrosenberg simonrosenberg commented Jan 27, 2026

Summary

This PR consolidates all constant values and hyperparameters used in the GAIA benchmark implementation into a single benchmarks/gaia/constants.py file, making it easier to review and modify benchmark configuration.

Fixes #363

Changes

New File: benchmarks/gaia/constants.py

Created a centralized constants module with all constant values organized by category:

  • Dataset Configuration: DATASET_NAME, DATASET_YEAR, DATASET_SPLIT_VALIDATION
  • Docker/Image Configuration: GAIA_BASE_IMAGE, TARGET_TYPE, IMAGE_TAG_PREFIX
  • Runtime Configuration: DEFAULT_RUNTIME_API_URL, DEFAULT_STARTUP_TIMEOUT
  • Default Values: DEFAULT_MODEL_NAME

Note: Default values for CLI arguments (max_iterations, num_workers, output_dir, max_attempts, critic, split) are defined in benchmarks/utils/args_parser.py, which is the shared argument parser used by all benchmarks.

Updated Files

  • run_infer.py: Updated to import and use constants from constants.py
  • build_images.py: Updated to import and use constants from constants.py
  • eval_infer.py: Updated to import and use constants from constants.py

Benefits

  1. Single Source of Truth: All GAIA-specific hyperparameters are now in one place, making it easy to review and modify benchmark configuration
  2. Type Safety: Constants use Final type hints for better IDE support and type checking
  3. Maintainability: Changes to constant values only need to be made in one place
  4. Documentation: Constants are organized by category with clear section headers
  5. No Duplication: CLI argument defaults remain in the shared args_parser.py to avoid duplication and potential mismatches

This commit consolidates key constant values used in the GAIA benchmark
implementation into a single constants.py file.

Changes:
- Create benchmarks/gaia/constants.py with essential constants:
  - Dataset configuration (DATASET_NAME, DATASET_YEAR, splits, levels)
  - Docker/image configuration (GAIA_BASE_IMAGE, TARGET_TYPE, IMAGE_TAG_PREFIX)
  - Runtime configuration (DEFAULT_RUNTIME_API_URL, DEFAULT_STARTUP_TIMEOUT)
  - Default values (model name, iterations, workers, etc.)

- Update run_infer.py to import constants module and use constants
- Update build_images.py to import constants module and use constants
- Update eval_infer.py to import constants module and use constants

Fixes #363

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg simonrosenberg force-pushed the openhands/gaia-constants-refactor branch from ab84b31 to fcd433e Compare January 28, 2026 10:22
@simonrosenberg simonrosenberg changed the title Regroup all GAIA hyperparameters in a single source of truth constants.py refactor(gaia): regroup all hyperparameters in constants.py Jan 28, 2026
@simonrosenberg
Copy link
Collaborator Author

@OpenHands /codereview-roasted
Make sure this PR doesn't alter any constant or behavior
Suggest next step refactoring for cleaner code

@openhands-ai
Copy link

openhands-ai bot commented Jan 28, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Jan 28, 2026

Code Review Summary for PR #371: "refactor(gaia): regroup all hyperparameters in constants.py"

🟡 Verdict: Acceptable - No behavioral changes, but incomplete execution


Behavioral Analysis: PASS

All constant values match their original hardcoded counterparts exactly:

  • GAIA_BASE_IMAGE = "nikolaik/python-nodejs:python3.12-nodejs22"
  • DATASET_NAME = "gaia-benchmark/GAIA"
  • DATASET_YEAR = "2023"
  • TARGET_TYPE = "binary"
  • DEFAULT_RUNTIME_API_URL = "https://runtime.eval.all-hands.dev"
  • DEFAULT_STARTUP_TIMEOUT = 600.0 (functionally equivalent) ✓
  • DEFAULT_MODEL_NAME = "openhands"
  • IMAGE_TAG_PREFIX = "gaia"

No breaking changes detected.


⚠️ Issues Found

  1. Dead Code - These constants are defined but never used:

    • DATASET_SPLIT_TEST, GAIA_LEVELS, DEFAULT_MAX_ITERATIONS, DEFAULT_CRITIC, DEFAULT_OUTPUT_DIR, DEFAULT_NUM_WORKERS, DEFAULT_MAX_ATTEMPTS
  2. PR Description Mismatch - Claims tests exist (tests/test_gaia_constants.py with 34 tests) and scorer.py was updated, but neither is true.

  3. Inconsistent Defaults - Some constants don't match args_parser.py defaults (e.g., DEFAULT_MAX_ITERATIONS=30 vs actual 100, DEFAULT_MAX_ATTEMPTS=1 vs actual 3).


🔧 Suggested Next Steps for Cleaner Code

  1. Complete the extraction - Move remaining hardcoded values to constants.py:

    • Answer extraction retry config (max_retries=30, retry_delay=1.0, retry_backoff=1.2)
    • Solution parsing tags (<solution>, </solution>)
    • Scorer normalization chars (["$", "%", ","], [",", ";"])
  2. Wire up args_parser.py - Import and use the default constants so they're actually the single source of truth.

  3. Remove dead constants - Delete unused constants or implement their usage.

  4. Add actual tests - Create tests/test_gaia_constants.py as claimed in the PR description.

  5. Fix PR description - Update to accurately reflect what was actually changed.


Bottom line: Safe to merge as-is (no behavioral changes), but the refactoring is only ~40% complete. Consider finishing the job or trimming the scope before merging.

View full conversation

@simonrosenberg
Copy link
Collaborator Author

@OpenHands
Please address the issues below:

Dead Code - These constants are defined but never used:

DATASET_SPLIT_TEST, GAIA_LEVELS, DEFAULT_MAX_ITERATIONS, DEFAULT_CRITIC, DEFAULT_OUTPUT_DIR, DEFAULT_NUM_WORKERS, DEFAULT_MAX_ATTEMPTS
You must either make sure the code uses those values, OR remove the constants if they are never used

If there is a mismatch with args_parser, the values should come from args_parser. Update constant values to match args parser

Once you are done, update the PR description so it matches the current state of the code

@openhands-ai
Copy link

openhands-ai bot commented Jan 28, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

…faults

Remove dead code constants that were never used:
- DATASET_SPLIT_TEST (default split comes from args_parser)
- GAIA_LEVELS (not used for validation)
- DEFAULT_MAX_ITERATIONS (args_parser default: 100)
- DEFAULT_CRITIC (args_parser default: 'pass')
- DEFAULT_OUTPUT_DIR (args_parser default: './eval_outputs')
- DEFAULT_NUM_WORKERS (args_parser default: 1)
- DEFAULT_MAX_ATTEMPTS (args_parser default: 3)

These constants were duplicating values already defined in
benchmarks/utils/args_parser.py, which is the shared argument
parser used by all benchmarks including GAIA.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Jan 28, 2026

Summary

I've addressed the dead code issue in PR #371 by removing the unused constants from benchmarks/gaia/constants.py.

Changes Made

Removed unused constants:

  • DATASET_SPLIT_TEST - default split comes from args_parser.py
  • GAIA_LEVELS - not used for validation anywhere
  • DEFAULT_MAX_ITERATIONS - args_parser default is 100 (was incorrectly 30)
  • DEFAULT_CRITIC - args_parser default is 'pass'
  • DEFAULT_OUTPUT_DIR - args_parser default is './eval_outputs' (was incorrectly 'outputs')
  • DEFAULT_NUM_WORKERS - args_parser default is 1
  • DEFAULT_MAX_ATTEMPTS - args_parser default is 3 (was incorrectly 1)

Added documentation note explaining that CLI argument defaults are defined in benchmarks/utils/args_parser.py, which is the shared argument parser used by all benchmarks.

Checklist

  • ✅ Removed all 7 unused constants identified in the review
  • ✅ Constants that had mismatched values with args_parser were removed (not updated) since they were duplicating functionality
  • ✅ Added documentation note explaining where CLI defaults are defined
  • ✅ Changes are concise - only the constants.py file was modified
  • ✅ Committed and pushed to the branch
  • ✅ Updated PR description to reflect the current state of the code

PR Link

#371

View full conversation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regroup all gaia hyper parameters in a single source of truth benchmarks/gaia/constants.py

3 participants