@juanmichelini
Collaborator

Summary

This PR adds a dynamic run-name to the build images workflows for index benchmarks, so that the SDK commit hash is displayed in the workflow run title when provided.

Changes

Added run-name property to the following workflow files:

  • build-swebench-images.yml
  • build-swebenchmultimodal-images.yml
  • build-swtbench-images.yml
  • build-commit0-images.yml
  • build-gaia-image.yml

When triggered via workflow_dispatch with an sdk-commit input, the workflow run title will now include the SDK commit hash. For example:

  • Build SWE-Bench Images (SDK: abc1234)
  • Build GAIA Images (SDK: main)

This makes it easy to see at a glance which SDK version each build used, as requested in #350.
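For reviewers who have not used `run-name` before, a minimal sketch of what such a workflow header could look like follows. This is not copied from the diff: the `'main'` fallback and the input description are assumptions, inferred only from the example titles above.

```yaml
name: Build SWE-Bench Images

# Quoted because the title contains ": ", which an unquoted YAML scalar
# would not accept. The 'main' fallback is an assumption for illustration.
run-name: "Build SWE-Bench Images (SDK: ${{ inputs.sdk-commit || 'main' }})"

on:
  workflow_dispatch:
    inputs:
      sdk-commit:
        description: SDK commit hash or ref to build against  # hypothetical wording
        required: false
        type: string
```

With a header like this, a manual run with `sdk-commit` set to `abc1234` would be titled "Build SWE-Bench Images (SDK: abc1234)" in the Actions run list.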

Related Issue

Fixes #350

This adds a dynamic run-name to the build images workflows for:
- SWE-Bench
- SWE-Bench Multimodal
- SWT-Bench
- Commit0
- GAIA

When triggered via workflow_dispatch with an sdk-commit input, the workflow
run title will now include the SDK commit hash (e.g., 'Build SWE-Bench Images
(SDK: abc1234)'). This makes it easier to identify which SDK version was used
for each build at a glance.

Fixes #350

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini added a commit that referenced this pull request on Jan 23, 2026

The commit0-lite benchmark contains 16 instances total, but only 10 are
used as reference (gold) instances for accuracy calculation on the
official leaderboard.

Issue: PR #351 showed 100.7% accuracy (3652/3628) because we were
including all 16 repos instead of just the 10 reference repos, leading
to incorrect test totals and impossible >100% accuracy.
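
The reported figure can be reproduced from the two numbers quoted above. The variable names below are illustrative only and do not come from the evaluation code.

```python
# Numbers quoted in the issue description; names are hypothetical.
reported_passed = 3652
reported_total = 3628

accuracy_pct = 100 * reported_passed / reported_total
accuracy_label = f"{accuracy_pct:.1f}%"

# A pass count above the total can only happen when the two
# figures were computed over different sets of repos.
is_impossible = reported_passed > reported_total

print(accuracy_label, is_impossible)  # 100.7% True
```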

The 10 reference repos are:
- simpy
- tinydb
- marshmallow
- wcwidth
- imapclient
- voluptuous
- jinja
- deprecated
- cookiecutter
- cachetools

Changes:
- run_infer.py: Filter dataset to only reference repos
- eval_infer.py: Skip non-reference repos when calculating totals
- Updated total_instances from 16 to 10
- Added comments with links to official leaderboard documentation
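
A rough sketch of the filtering idea, not the actual code in run_infer.py or eval_infer.py. The record layout (a `repo` field such as "commit-0/simpy", plus `num_passed` and `num_tests`) and all function names are assumptions made for illustration.

```python
# The 10 reference repos listed above.
REFERENCE_REPOS = frozenset({
    "simpy", "tinydb", "marshmallow", "wcwidth", "imapclient",
    "voluptuous", "jinja", "deprecated", "cookiecutter", "cachetools",
})


def short_name(repo: str) -> str:
    """Strip an optional 'owner/' prefix and normalise case."""
    return repo.rsplit("/", 1)[-1].lower()


def filter_reference(records: list[dict]) -> list[dict]:
    """Keep only records whose repo is one of the reference repos."""
    return [r for r in records if short_name(r["repo"]) in REFERENCE_REPOS]


def accuracy(results: list[dict]) -> float:
    """Passed tests over total tests, counting reference repos only."""
    kept = filter_reference(results)
    total = sum(r["num_tests"] for r in kept)
    passed = sum(r["num_passed"] for r in kept)
    return passed / total if total else 0.0


if __name__ == "__main__":
    demo = [
        {"repo": "commit-0/simpy", "num_passed": 8, "num_tests": 10},
        {"repo": "commit-0/not-a-reference-repo", "num_passed": 5, "num_tests": 5},
    ]
    print(accuracy(demo))  # 0.8, the non-reference record is ignored
```

Applying the same filter to both the numerator and the denominator is the point of the sketch: it keeps the ratio at or below 100%.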

References:
- Leaderboard: https://commit-0.github.io/analysis/
- Breakdown: https://commit-0.github.io/analysis_commit0-lite-plain_fillin/

Development

Successfully merging this pull request may close these issues.

Build images action for index benchmarks should contain the sdk commit hash they used in the title
